1. Introduction
Long-tailed image classification has long been a significant challenge in machine learning: a small number of head classes dominate the training data, while the majority of classes (tail classes) have only a few samples each. When models trained on such data are evaluated on balanced test sets, this class imbalance causes them to perform well on head classes but poorly on tail classes. Addressing the imbalance is crucial in applications such as medical diagnosis, species classification, and object recognition in natural scenes, where tail categories often represent rare yet important instances.
Over the years, researchers have explored several approaches to the long-tailed recognition problem, which can be broadly grouped into the following categories.
Category Rebalancing Methods: These approaches mitigate class imbalance by adjusting the training data or the loss function. For example, Focal Loss [1] modifies the standard cross-entropy loss to focus more on hard-to-classify tail examples, while Class-Balanced Loss [2] introduces weighting schemes that give higher importance to tail classes based on the effective number of samples per class. These methods help alleviate the imbalance but often struggle with performance trade-offs between head and tail classes.
Network Structure Improvements: This class of methods enhances the model architecture to better handle imbalanced data. LDAM-DRW [3] introduces a margin-based loss function specifically designed for long-tailed distributions, while the Bilateral-Branch Network (BBN) [4] creates two separate branches within the network, one for head classes and one for tail classes, to learn more specialized features for each group. Although effective, these methods are typically limited to the visual modality and do not leverage the potential of multimodal information.
Data Augmentation Techniques: Some approaches, such as mixup [5] and its rebalanced version Remix [6], augment the data by generating synthetic samples from the original dataset. While these methods have been shown to improve model robustness, they rely heavily on the assumption that synthetic examples can accurately represent tail classes, which does not always hold in real-world long-tailed distributions.
Two-Stage Approaches: Methods such as cRT and LWS [7] separate training into two stages: first learning a representation that is less biased toward head classes, then training a classifier with re-weighting schemes. This decoupling improves performance by targeting the specific challenges of imbalanced classification; however, these methods do not fully exploit the rich semantic relationships present in multimodal data.
Multi-branch Models: Multi-branch architectures such as RIDE [8] and LFME [9] employ multiple expert models to address class imbalance, with each expert specializing in a different portion of the data distribution. These methods have shown promising results, particularly for tail categories, but they often involve significant computational complexity, making them less practical for large-scale applications.
In addition, few-shot learning methods aim to recognize classes with limited data, a problem closely related to long-tailed classification. For example, Lee et al. proposed a Hellinger Distance-Attention-based Feature Aggregation Network (HELA-VFA) for few-shot classification [10], and Roy et al. introduced Felmi, a few-shot learning method based on hard mixup [11]. These methods offer valuable insights for handling the tail classes of long-tailed data.
In recent years, the problem of long-tailed recognition has garnered increasing attention and research [12,13,14,15,16,17,18]. Since real-world data often exhibit long-tailed distributions (e.g., the Pareto distribution [19]), traditional visual models are prone to class imbalance issues. The abundance of head class data leads models to focus on learning head features, while the scarcity of tail class data results in poor generalization capabilities. Although collecting more data can alleviate this issue to some extent, this approach is often costly and unsustainable. Thus, developing more efficient algorithms to handle long-tailed distributions has become crucial.
Although existing approaches such as category rebalancing and network structure improvements are effective in the visual modality, they often overlook the semantic information available in multimodal data, such as the semantic features embedded in label texts. These features hold significant potential for enhancing model generalization and improving the recognition of tail classes. Therefore, this paper explores multimodal solutions to the long-tailed recognition problem by integrating visual and textual information to optimize and enhance existing methods.
Contrastive visual-semantic models such as CLIP [20] have shown great capabilities in recent years. These models are pre-trained on a large number of image–text pairs collected from the Internet, aligning visual and language representations with a contrastive loss, and they exhibit strong zero-shot classification performance. In the long-tailed setting, effectively exploiting the information contained in text labels is therefore expected to improve recognition accuracy. However, such models have not been extensively applied to long-tailed data, where leveraging both image and text modalities could further enhance classification performance, particularly for tail categories. The motivation behind this work stems from the observation that most existing methods focus solely on visual data, overlooking the valuable information present in textual descriptions. By incorporating textual features through contrastive learning, we aim to improve the recognition of tail classes, which are typically underrepresented in visual data alone. This is particularly important in applications such as natural-language-guided image classification, where textual descriptions can provide critical context for rare categories.
In this article, a multimodal long-tailed data recognition framework is proposed. According to the different distributions of the input data, the method is divided into two stages: the first stage uses long-tailed data for visual-semantic contrastive learning, and the second stage performs classifier learning on class-balanced data. In the contrastive learning stage, long-tailed data are used to extract image features and text features; during learning, feature representations of the same category are pulled as close together as possible, while representations of different categories are pushed apart. When this stage is completed, the model usually classifies head categories well but remains weaker on tail categories. Therefore, in the second stage of the dual-branch recognition framework, class-balanced sampled training data are fed into the vision branch and the language branch to further improve performance on tail categories. The second stage inherits the image feature representations and the filtered text feature representations from the first stage.
Specifically, the main contributions are as follows:
This paper proposes using text labels in long-tailed problems to compensate for the limitations of a single visual modality and introduces a visual-semantic contrastive learning framework for extracting image and text features, in which features of the same category are kept as close together as possible.
This paper proposes a multimodal classifier learning framework that first extracts the most relevant text features and then constructs a dual-branch classifier over text features and image features.
Extensive experiments are conducted, achieving competitive results on four datasets: CIFAR-10-LT, CIFAR-100-LT, ImageNet-LT, and iNaturalist2018.
2. Method
The method is divided into two stages. The first stage is contrastive learning, which utilizes long-tailed data to extract both image and text features. In this phase, the model learns to bring feature representations of the same class closer together while increasing the separation between different classes. After the first stage is completed, the model typically performs well on head classes but exhibits weaker classification performance on tail classes. To enhance performance on tail classes, the dual-branch recognition framework moves into the second stage, where class-balanced sampled data are fed into both the vision and language branches for training. The second stage inherits the image feature representations from the first stage, along with filtered text feature representations, to further improve tail class classification.
As shown in Figure 1, for the image–text branch, the image encoder takes in raw long-tail-distributed image data and extracts image features for the batch (assuming, for example, a large number of samples for the cat and dog classes and fewer samples for the panda class). Simultaneously, the text encoder takes several textual descriptions of each category from the text database and extracts text features. During training, a contrastive loss between image–text pairs pulls instances of the same class closer in the feature space while pushing different classes apart. Additionally, the official pre-trained CLIP model is utilized, and a KL divergence term is employed to prevent the trained model from overfitting. After this initial training phase, in the second phase, we mitigate the imbalance in the inputs through a category resampling strategy. Then, inheriting the feature extractor from the first phase and selecting the most discriminative textual descriptions with the scoring method outlined in Section 2.2, we concatenate them with the output of the image encoder and compute a class-membership score for each sample. In parallel, the second branch, the image-classifier branch, uses an MLP classifier to obtain balanced classifier scores. During training, the cross-entropy losses of the two scores are computed separately, while during testing, their weighted sum yields the final predicted classification results.
2.1. Image–Text Contrastive Model
Image–text contrastive learning models usually adopt a dual-encoder architecture consisting of an image encoder and a text encoder, as in CLIP [20] and CoOp [21]. The main idea is to encode images and text separately through two independent encoders and then align the features of the two modalities.
Figure 2 shows the most relevant text feature extraction process. The learning objective of this kind of model is usually a contrastive loss, whose purpose is to bring the features of an image and its corresponding text description closer while pushing apart the features of mismatched image–text pairs in the feature space. For example, during training, a batch of n images and n sentences is randomly selected from the samples, and their image features and text features are obtained in the multimodal space. These features are then passed through two transformation matrices and normalized, yielding the D-dimensional image and text features of the ith image in the multimodal space. The goal of the original contrastive objective is to bring the image feature and the text feature of the same image closer while pushing that image away from all other text features. In order to make features of the same category as compact as possible in the feature space, features of the same category should be excluded from the negatives that are pushed away. In other words, when training image features, text features of the same category as the image should be pulled closer, while text features of different categories should be pushed away; likewise, when training text features, image features of the same category as the text should be pulled closer, and image features of different categories should be pushed away.
Specifically, two losses are defined: the similarity loss of image features relative to text features and the similarity loss of text features relative to image features. For the ith sample, the former is computed over the indices of text features in the current batch that belong to the ith category, the latter over the indices of image features in the current batch that belong to the ith category, and a temperature parameter is used to scale the logits.
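For concreteness, one plausible instantiation of the normalization step and of these two class-aware contrastive losses (Equations (1)–(3)) is given below; the symbols $f_i^{v}$, $f_i^{t}$, $W_v$, $W_t$, $v_i$, $t_i$, $P(i)$, $Q(i)$, and $\tau$ are our own notation rather than the paper's original symbols:
$$v_i = \frac{W_v f_i^{v}}{\lVert W_v f_i^{v} \rVert}, \qquad t_i = \frac{W_t f_i^{t}}{\lVert W_t f_i^{t} \rVert},$$
$$\mathcal{L}_{v \rightarrow t} = -\frac{1}{n}\sum_{i=1}^{n} \frac{1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp\!\big(v_i^{\top} t_p / \tau\big)}{\sum_{j=1}^{n} \exp\!\big(v_i^{\top} t_j / \tau\big)}, \qquad \mathcal{L}_{t \rightarrow v} = -\frac{1}{n}\sum_{i=1}^{n} \frac{1}{|Q(i)|} \sum_{q \in Q(i)} \log \frac{\exp\!\big(t_i^{\top} v_q / \tau\big)}{\sum_{j=1}^{n} \exp\!\big(t_i^{\top} v_j / \tau\big)},$$
where $f_i^{v}$ and $f_i^{t}$ are the encoder outputs for the ith image and sentence, $W_v$ and $W_t$ are the two transformation matrices, $P(i)$ (resp. $Q(i)$) is the set of in-batch text (resp. image) features sharing the category of sample $i$, and $\tau$ is the temperature.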
After training on a large number of image–text pairs, a better multimodal feature representation can be obtained. To avoid overfitting on a specific dataset, this paper additionally compares the model with the initial (pre-trained) model during contrastive learning, so that the trained model does not deviate too far from the original one; this constraint is imposed with a KL divergence term (Equation (4)), computed between the trained model and the original model from the visual and language features that each model produces for the ith sample. To speed up training in the early stage and avoid overfitting in the later stage, this paper further introduces a parameter that adjusts the focus of the training process. The overall loss function of the contrastive learning phase can then be obtained (Equation (5)).
In this loss, the contrastive learning term is combined with the KL term, and the weighting parameter gradually shifts the focus of training from fitting the specific dataset to reducing the difference from the original model; its schedule is given by Equation (6).
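A hedged sketch of Equations (4)–(6), in our own notation (with $\sigma$ a softmax over feature dimensions, $v_i^{0}$ and $t_i^{0}$ the frozen original model's features, $e$ the current epoch, $E$ the total number of epochs, and a linear schedule assumed for the weighting parameter $\alpha$), could read
$$\mathcal{L}_{\mathrm{KL}} = \frac{1}{n}\sum_{i=1}^{n}\Big[\mathrm{KL}\big(\sigma(v_i)\,\Vert\,\sigma(v_i^{0})\big) + \mathrm{KL}\big(\sigma(t_i)\,\Vert\,\sigma(t_i^{0})\big)\Big], \qquad \mathcal{L}_{\mathrm{stage1}} = (1-\alpha)\big(\mathcal{L}_{v \rightarrow t} + \mathcal{L}_{t \rightarrow v}\big) + \alpha\,\mathcal{L}_{\mathrm{KL}}, \qquad \alpha = \frac{e}{E}.$$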
The model obtained through contrastive learning is already able to perform image classification. For example, let v denote the visual feature representation of an image after it passes through the visual encoder, and let the language feature representation of each category be obtained by passing the text describing that category through the language encoder. The probability of each category for the image can then be computed (Equation (7)): among a total of C categories, the image is predicted to belong to the category with the highest probability, and a temperature parameter is used to scale the logits.
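This is the standard zero-shot classification rule of CLIP-style models; in our own notation, with $t_c$ the language feature of the cth category, $\mathrm{sim}(\cdot,\cdot)$ cosine similarity, and $\tau$ the temperature, Equation (7) plausibly takes the form
$$p(y = i \mid v) = \frac{\exp\!\big(\mathrm{sim}(v, t_i)/\tau\big)}{\sum_{c=1}^{C} \exp\!\big(\mathrm{sim}(v, t_c)/\tau\big)}.$$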
2.2. Most Relevant Text Feature Extraction
For each category, the text information provided by the template texts may be noisy: not all of it is beneficial to classification, and irrelevant text information may even degrade classification performance. It is therefore necessary to filter out text information that is unrelated or only weakly related to the category and to extract the text information that is most relevant to it.
Among the total of T text features, the correlation between the tth text feature and the ith of the I images is first calculated (Equation (8)); the larger this value, the greater the correlation between text t and image i. Assuming that the category corresponding to the tth text feature is c, a correlation score between the text and category c is then computed over the images (Equation (9)), where the number of image samples belonging to category c acts as a normalizer; the higher the correlation between text feature t and category c, the higher this score.
This score takes into account not only the logit values of images belonging to the category but also the logit values of images not belonging to it. The rationale is as follows: if a text feature t of category c is also strongly correlated with images of other categories, it is likely to be a confusing text that can mislead classification. Its relevance score should therefore be reduced, and the amount of the reduction, reflecting how confusing the text is, is determined by the magnitude of those logit values.
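As a hedged reading of Equations (8) and (9), with $u_t$ the tth text feature, $v_i$ the ith image feature, $\mathcal{I}_c$ the index set of images of category $c$, $N_c = |\mathcal{I}_c|$, and $r_{t,i}$ and $S_{t,c}$ our own symbols, the two quantities might take the form
$$r_{t,i} = \frac{\exp\!\big(u_t^{\top} v_i / \tau\big)}{\sum_{t'=1}^{T} \exp\!\big(u_{t'}^{\top} v_i / \tau\big)}, \qquad S_{t,c} = \frac{1}{N_c}\sum_{i \in \mathcal{I}_c} r_{t,i} \;-\; \frac{1}{I - N_c}\sum_{i \notin \mathcal{I}_c} r_{t,i},$$
so that a text that also scores highly on images of other categories has its score reduced in proportion to those logits.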
2.3. Double-Branch Multimodal Recognition Framework
In this section, a simple and effective two-branch multimodal long-tailed data recognition framework is introduced. The two branches are the image–text branch and the image branch: the former uses both image and text information for its prediction, while the latter uses image information only. In the previous section, the S most relevant text features of each category were obtained, along with the image features. The logit of the image–text branch is computed from these features (Equation (10)) using a linear transformation matrix for the image features, a linear transformation matrix for the text features, the softmax function, and a scaling by the dimensionality of the text features. The logit of the image branch (Equation (11)) is computed with the linear transformation matrix of the image branch.
The final logit and the loss function (Equation (12)) are then formulated by combining the two branch logits, with the cross-entropy of each branch computed against the label of the cth category.
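Under assumed notation (none of the symbols below are taken from the paper), one plausible reading of Equations (10)–(12), in which the image–text branch is a dot-product head over the S selected text features $u_{c,1},\dots,u_{c,S}$ of class $c$ and the image branch is an MLP classifier, is
$$z_c^{vt} = \frac{1}{\sqrt{d}} \sum_{s=1}^{S} \alpha_{c,s}\,\big(W_I v\big)^{\top}\big(W_T u_{c,s}\big), \qquad \alpha_{c,\cdot} = \mathrm{softmax}_s\!\Big(\tfrac{1}{\sqrt{d}}\big(W_I v\big)^{\top}\big(W_T u_{c,s}\big)\Big),$$
$$z^{v} = \mathrm{MLP}(v), \qquad \mathcal{L}_{\mathrm{cls}} = \mathrm{CE}\big(\mathrm{softmax}(z^{vt}), y\big) + \mathrm{CE}\big(\mathrm{softmax}(z^{v}), y\big),$$
with the test-time prediction taken from the weighted sum $\lambda z^{vt} + (1-\lambda) z^{v}$, consistent with the training and testing procedure described above.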
Algorithm 1 gives the data processing flow of the model.
Algorithm 1 Long-tailed recognition algorithm.
Require: A randomly selected batch of n images and texts, and the training hyperparameters (including E).
Ensure: The classification result.
1: Initialize the hyperparameters according to Equation (6).
2: for each training iteration up to T do
3:    Extract the image features with the image encoder.
4:    Extract the text features with the language encoder.
5:    Calculate the image–text relevance according to Equation (8).
6:    Calculate the per-category correlation score according to Equation (9).
7:    Select the best text features for each category based on the score.
8:    Normalize the image and text features according to Equation (1).
9:    Calculate the similarity loss of image features relative to text features according to Equation (2).
10:   Calculate the similarity loss of text features relative to image features according to Equation (3).
11:   Calculate the KL divergence term according to Equation (4).
12:   Calculate the overall loss function of the contrastive learning phase according to Equation (5).
13:   Optimize the image encoder and the text encoder with respect to this loss.
14:   Calculate the logit of the image–text branch according to Equation (10).
15:   Calculate the logit of the image branch according to Equation (11).
16:   Calculate the final classification loss according to Equation (12).
17:   Optimize the Dot-Head and the Classifier with respect to the classification loss.
18: end for
19: Calculate the classification results based on the Dot-Head and the Classifier.
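As a concrete illustration of steps 9 and 10 (the class-aware image–text contrastive losses), the following is a minimal PyTorch-style sketch under the notation assumed earlier; the function name and the default temperature value are our own choices, not taken from the paper.

import torch
import torch.nn.functional as F

def class_aware_contrastive_loss(img_feats, txt_feats, labels, tau=0.07):
    """Sketch of the two class-aware contrastive losses (cf. Equations (2)-(3)).

    img_feats, txt_feats: (n, d) L2-normalized batch features.
    labels: (n,) class indices; features sharing a label are treated as positives.
    tau: temperature (assumed value).
    """
    logits = img_feats @ txt_feats.t() / tau                         # (n, n) similarities
    pos_mask = labels.unsqueeze(0).eq(labels.unsqueeze(1)).float()   # same-class pairs

    log_p_v2t = F.log_softmax(logits, dim=1)      # each image against all texts
    log_p_t2v = F.log_softmax(logits.t(), dim=1)  # each text against all images

    # Average log-probability over each anchor's positive set, then over the batch.
    loss_v2t = -(pos_mask * log_p_v2t).sum(1) / pos_mask.sum(1)
    loss_t2v = -(pos_mask * log_p_t2v).sum(1) / pos_mask.sum(1)
    return loss_v2t.mean() + loss_t2v.mean()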
3. Experiment Details
3.1. Datasets
This paper presents extensive experiments conducted on four mainstream long-tailed datasets, which are introduced below.
CIFAR-10-LT and CIFAR-100-LT are long-tailed versions derived from the CIFAR-10 and CIFAR-100 datasets, primarily used to evaluate a model's performance on imbalanced data. The original CIFAR-10 and CIFAR-100 datasets contain 10 and 100 classes, respectively, with a balanced number of images per class. In the long-tailed versions, the distribution is modified to follow a long-tailed distribution by downsampling the minority classes, simulating the common long-tail problem in real-world applications. Specifically, the imbalance factor determines the ratio between the number of samples in the largest (head) class and the smallest (tail) class. For instance, with an imbalance factor of 200, the class with the most samples has 200 times more samples than the class with the fewest. These long-tailed datasets are used to assess the model's performance in imbalanced classification tasks.
ImageNet-LT is a long-tailed dataset sampled from the classic ImageNet-2012 dataset. To construct this dataset, researchers downsampled the original ImageNet data based on a Pareto distribution, retaining 1000 classes, with the number of samples per class ranging from hundreds to thousands. ImageNet-LT is designed to simulate real-world long-tailed distributions, where there is a significant imbalance between head classes (with many samples) and tail classes (with very few samples). This dataset is commonly used to evaluate model performance in large-scale, long-tailed image classification tasks.
iNaturalist2018 is a long-tailed dataset mainly used for studying fine-grained image classification problems. This dataset contains over 8000 species categories and approximately 437,000 images in total, but many categories have very few samples, exhibiting a typical long-tailed distribution. The iNaturalist2018 dataset is particularly challenging because it has not only a large number of categories but also a significant long-tail effect, where many categories have very few samples, while a few head categories have a large number of samples. It is widely used to evaluate model performance in fine-grained classification and long-tailed distribution problems, particularly in real-world applications like species recognition.
3.2. Evaluation Metrics
In our experiments, we trained the model on an imbalanced training set, where the number of samples per class is uneven, and tested it on a balanced test set. We primarily use Top-1 accuracy as the evaluation metric to measure the model’s classification performance. Top-1 accuracy refers to the proportion of input images where the predicted class matches the true class, i.e., the percentage of correctly classified samples out of the total. This is a standard evaluation metric widely used in image classification tasks, especially for multi-class classification problems.
Additionally, to more thoroughly assess the model’s performance on long-tailed data, we calculated Top-1 accuracy for different class groups separately: classes with more than 100 training samples (many-shot), classes with 20 to 100 training samples (medium-shot), and classes with fewer than 20 training samples (few-shot). This grouped evaluation method effectively reveals the performance differences between head and tail classes, providing a comprehensive reflection of the model’s performance on long-tailed datasets.
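For clarity, a minimal sketch of this grouped evaluation is given below; the function name, array arguments, and helper structure are our own illustration, assuming per-class training counts are available.

import numpy as np

def grouped_top1_accuracy(preds, labels, train_counts):
    """Top-1 accuracy overall and for the many-/medium-/few-shot groups.

    preds, labels: integer class indices on the balanced test set.
    train_counts: array where train_counts[c] is the number of training
    samples of class c; the 100/20 thresholds follow the grouping above.
    """
    preds, labels = np.asarray(preds), np.asarray(labels)
    train_counts = np.asarray(train_counts)
    correct = preds == labels
    n_train = train_counts[labels]   # training-set size of each test sample's class

    groups = {
        "many-shot (>100)": n_train > 100,
        "medium-shot (20-100)": (n_train >= 20) & (n_train <= 100),
        "few-shot (<20)": n_train < 20,
        "overall": np.ones_like(correct, dtype=bool),
    }
    return {name: float(correct[mask].mean()) for name, mask in groups.items()}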
3.3. Implementation Details on ImageNet-LT
ImageNet-LT is a subset of ImageNet-2012, sampled from the original dataset according to the Pareto distribution. In order to compare with previous work [
8,
22], the visual encoder in this paper uses the ResNet-50 backbone network. In the language encoder, this paper uses the Transformer network structure. In the comparative learning phase of the first stage, a total of 60 epochs were used for training. The initial learning rate was
, and the learning rate was reduced to the original 0.1 and 0.01 at the 36th epoch and 48th epoch; the language encoder’s text words have a maximum length of 60 (excluding first and last words). In the image–text branch of the second stage, the 50 most relevant text features are extracted for each category. The initial learning rate of image features is
, and the initial learning rate of text features is
. In the second stage, a total of 40 epochs were used for training, and a stepwise learning strategy was used. The learning rate was reduced to 0.1 and 0.01 at the 24th epoch and 32nd epoch, respectively. All models are optimized by a stochastic gradient descent optimizer with a momentum of 0.9 and a decay rate of
. For the training sample, the image was proportionally scaled to a size of 256 pixels on the short side, and then the image or its horizontally flipped version was cropped to 224 × 224, and then autoAugment [
23] was used for data augmentation.
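The stepwise schedule described above can be sketched as follows in PyTorch; the concrete learning-rate and weight-decay values and the stand-in model are placeholders rather than settings taken from the paper.

import torch

# Illustrative placeholder values; the exact initial learning rate and
# weight decay are not reproduced here.
base_lr, weight_decay = 0.1, 5e-4

model = torch.nn.Linear(2048, 1000)  # stand-in for the ResNet-50 backbone + heads

optimizer = torch.optim.SGD(model.parameters(), lr=base_lr,
                            momentum=0.9, weight_decay=weight_decay)
# Stage-1 schedule on ImageNet-LT: multiply the learning rate by 0.1 at epochs
# 36 and 48, giving 0.1x and 0.01x of the initial value over 60 epochs.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[36, 48], gamma=0.1)

for epoch in range(60):
    # ... one epoch of training would go here ...
    optimizer.step()   # stands in for the parameter updates of the epoch
    scheduler.step()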
3.4. iNaturalist 2018 Implementation Details
iNaturalist 2018 is a very challenging long-tailed, fine-grained dataset. For comparability with References [3,4] on iNaturalist 2018, the visual encoder uses ResNet-50; the language encoder uses a Transformer network. In the contrastive learning phase of the first stage, a total of 80 epochs were used for training, with the learning rate reduced to 0.1 and 0.01 times its initial value at the 48th and 64th epochs. In the second stage, the image features and text features have separate initial learning rates; a total of 70 epochs were used with a stepwise schedule, and the learning rate was reduced to 0.1 and 0.01 times its initial value at the 42nd and 56th epochs. The data augmentation for training samples was the same as on ImageNet-LT. All experiments on iNaturalist 2018 were likewise optimized with stochastic gradient descent with a momentum of 0.9 and weight decay.
3.5. Method of Comparison
Rebalancing methods. This paper compares the proposed approach with several commonly used rebalancing methods, including Focal [1], Class-Balanced [2], LDAM-DRW [3], and Equalization [24]. These are standard approaches commonly used in long-tailed classification to address class imbalance issues. They were chosen to validate the basic effectiveness of the new method on long-tailed data and to demonstrate its improvements over traditional methods.
Data augmentation methods. To demonstrate the effectiveness of the proposed method, this paper compares it with several strong augmentation methods, such as mixup [5], Rebalanced mixup [6], and CAM [22]. These methods alleviate the long-tail problem by generating more samples. Comparison with these methods helps demonstrate the robustness and performance improvement of the proposed multimodal approach under various data augmentation strategies.
Two-stage methods. The proposed approach is also compared with two-stage methods such as LWS [7] and cRT [7] to verify its effectiveness. These methods address the imbalance problem by separating representation learning and classification learning. By using these methods for comparison, we show how our proposed multimodal approach achieves higher classification accuracy within the same two-stage framework.
Multi-branch methods. LFME [9], BBN [4], and RIDE [8] are multi-branch models. This paper compares them with our architecture to demonstrate its effectiveness. These methods improve performance for long-tailed categories by designing different optimization paths for head and tail classes. Choosing these methods highlights the better integration and performance improvement of our multimodal approach in similar structures.
State-of-the-art methods. This paper also compares the proposed method with recent state-of-the-art methods: Meta-weight-net [25], Domain Adaptation [26], LADE [27], MiSLAS [28], and DisAlign [29]. They were selected to directly showcase the competitiveness of the new method against cutting-edge techniques and the latest advances in addressing long-tailed problems.
3.6. Main Experimental Results
3.6.1. Results on CIFAR-10-LT and CIFAR-100-LT
Table 1 reports the Top-1 accuracy on CIFAR-10-LT and CIFAR-100-LT using ResNet-32. The multimodal long-tailed data recognition (MMLTR) method proposed in this paper achieves the best results for all imbalance ratios (200, 100, 50, and 20), verifying its effectiveness. The results show that the proposed method outperforms classical rebalancing methods, reflecting the effectiveness of our contrastive learning framework.
The proposed method also outperforms the tested data augmentation methods and shows stronger competitiveness than the two-stage methods, which reflects the effectiveness of the two stages used in this paper, namely, the contrastive learning stage and the classification learning stage. Since the proposed method employs a multi-branch model during the classification learning stage, its performance is also compared with various other multi-branch methods; its accuracy surpasses theirs, demonstrating the effectiveness of the proposed model in classification. We further compare our framework with many recent state-of-the-art methods to demonstrate its effectiveness in multimodal long-tailed data recognition. In Table 2, the accuracy on the majority classes (more than 100 training images), medium classes (20–100 training images), and minority classes (fewer than 20 training images) is reported. The recognition accuracy of the MMLTR method on the tail categories is more than 7% higher than that of the previous method, and the overall recognition accuracy is 1.8% higher; most of the improvement thus comes from the tail categories.
3.6.2. Results on ImageNet-LT and iNaturalist2018
This paper further verifies the effectiveness of the proposed method on the ImageNet-LT and iNaturalist 2018 datasets. Table 3 reports the results on these two large-scale imbalanced datasets. The proposed method is 15.7% higher than RIDE on ImageNet-LT and 2.7% higher than RIDE on iNaturalist 2018 (both with ResNet-50), indicating that it generalizes effectively to large-scale datasets. In Table 4, the accuracy on the majority classes (more than 100 training images), medium classes (20–100 training images), and minority classes (fewer than 20 training images) is reported. Most of the improvement again comes from the tail categories, showing that the proposed method is particularly effective on them.
3.7. Ablation Experiment
Table 5 shows the results of the ablation study, where our primary objective is to verify the contributions of different modules and strategies to the overall performance of the multimodal approach. The details of the experiments are described below.
Experiment 1: Only the image encoder was used, without any textual information, to evaluate the performance of a single visual modality.
Experiment 2: Building on Experiment 1, text feature filtering was added to assess the improvement that the text feature filtering module brings to image classification.
Experiment 3: Both the image encoder and the text encoder were used for contrastive learning, without the subsequent multimodal classification, to evaluate the impact of contrastive learning on the model.
Experiment 4: Only the image branch of the multimodal model was used, meaning that the second-stage classification relied on image features alone, to assess the performance of the image branch.
Experiment 5: The complete multimodal framework, including image–text contrastive learning, text feature filtering, and the dual-branch classification model (image and image–text), to evaluate the contribution of all modules combined to the overall performance.
The results of the ablation experiments on the ImageNet-LT and iNaturalist2018 datasets are reported. Here, "contrastive learning" means that the image–text contrastive model is used for image and text feature extraction in the first stage, and "most relevant text feature extraction" means that the most relevant text features are selected for each category to prepare for the classification model of the second stage. For the second stage of the multimodal recognition framework, results are obtained both for each branch used alone and for the two branches used together. The table shows that the best results are achieved when all the components proposed in this paper are used, demonstrating the effectiveness of each individual part as well as of the whole framework.
4. Conclusions
This paper proposes a new multimodal long-tail recognition method that effectively improves classification performance on tail categories by introducing visual-semantic contrastive learning and a two-stage training framework. Experimental results show that the proposed method outperforms traditional methods on several imbalanced datasets (such as CIFAR-10-LT, CIFAR-100-LT, ImageNet-LT, and iNaturalist2018), with a significant improvement in recognizing tail categories. Compared to existing unimodal methods, our framework takes advantage of pre-trained visual-semantic models, reducing the reliance on complex resampling strategies and improving model generalization. However, despite the many advantages, our approach has certain limitations. First, the method relies on pre-trained models such as CLIP, and the final performance of the model is somewhat dependent on the quality and suitability of the pre-trained model. In addition, our method involves two stages of training, with the first stage, particularly visual-semantic contrastive learning, requiring simultaneous processing of image and text features, which may result in higher computational costs. Although the performance gains are substantial, this computational complexity may pose challenges in certain practical applications. To mitigate these limitations, several directions for future work can be explored. First, introducing model distillation techniques could simplify the inference phase while maintaining performance. Second, for different application scenarios, developing smaller pre-trained models or reducing the dimensionality of text features could help optimize computational resources. Lastly, further research could explore the efficient application of multimodal models in distributed training frameworks to reduce training time and resource consumption.
In conclusion, the proposed method demonstrates competitive performance in long-tail recognition tasks. Although its computational complexity is higher, this can be addressed through reasonable optimization strategies in real-world applications. We look forward to future research further enhancing the practicality and scalability of multimodal approaches.