4.1. Experiment Setup
To validate the effectiveness of the ProFusion model, we evaluated it on 15 benchmark datasets: ImageNet [31], StanfordCars [44], UCF101 [45], Caltech101 [46], Flowers102 [47], SUN397 [48], DTD [49], EuroSAT [50], FGVCAircraft [51], OxfordPets [52], Food101 [53], ImageNetV2 [54], ImageNet-Sketch [55], ImageNet-A [56], and ImageNet-R [57]. In addition, to highlight the superiority of ProFusion, we compared it with several SOTA models, including CoOp [12], CLIP-Adapter [39], Tip-Adapter [11], Proto-CLIP [16], GDA-CLIP [9], and PMPro [40].
In the experiments, we used the BEiT3-base-itc and BEiT3-large-itc pre-trained models as the backbone networks. To increase data diversity, we applied random cropping and horizontal flipping to the support set. The ProFusion model was built with the PyTorch framework and trained on a single NVIDIA GeForce RTX 3090 GPU with a batch size of 256. We used the AdamW optimizer with an initial learning rate of 0.0001 and adjusted the learning rate with the CosineAnnealingLR scheduler.
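A minimal sketch of this training configuration is given below; the parameter container and the number of epochs (`T_max`) are illustrative placeholders, while the optimizer, learning rate, scheduler, and batch size follow the settings reported above.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

# Hypothetical container for the prototype parameters being trained.
model = torch.nn.Linear(1024, 1000)

optimizer = AdamW(model.parameters(), lr=1e-4)      # initial learning rate 0.0001
scheduler = CosineAnnealingLR(optimizer, T_max=20)  # T_max (training epochs) is an assumption

batch_size = 256  # batch size used on a single RTX 3090
```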
Meanwhile, the hyperparameters α, β, and γ, as mentioned in Equation (4), play a crucial role in improving the classification precision. To determine the optimal hyperparameter values for each dataset, we performed a grid search over α, β, and γ in the range [0, 1] with a step size of 0.1, while imposing the constraint α + β + γ = 1 to limit the search space. We evaluated the performance of all candidate weight combinations on the validation set and selected the combination that yielded the best performance for the final testing phase. This approach facilitated a reasonable allocation of weights across modalities. The model was divided into two versions: ProFusion and ProFusion-F. For the ProFusion model, image prototypes, text prototypes, and multimodal feature fusion prototypes were constructed using the encoder, and these prototypes were directly used for classifying the query set. For the ProFusion-F model, the prototype feature parameters were fine-tuned using the support set to further enhance classification performance. Furthermore, for each dataset, we continued to use the text prompt templates selected in previous works [9,40], as shown in Table 2.
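The constrained grid search described above can be sketched as follows; the weight names and the `evaluate` callback (which should return validation accuracy for a given weight combination) are illustrative, while the 0.1 step size and the sum-to-one constraint follow the text.

```python
import itertools

def grid_search(evaluate, step=0.1):
    """Search (alpha, beta, gamma) on a 0.1 grid subject to alpha + beta + gamma = 1."""
    values = [round(i * step, 2) for i in range(int(round(1 / step)) + 1)]  # 0.0 .. 1.0
    best_acc, best_weights = -1.0, None
    for alpha, beta in itertools.product(values, values):
        gamma = round(1.0 - alpha - beta, 2)
        if gamma < 0:  # the constraint alpha + beta + gamma = 1 limits the search space
            continue
        acc = evaluate(alpha, beta, gamma)  # validation-set accuracy for this combination
        if acc > best_acc:
            best_acc, best_weights = acc, (alpha, beta, gamma)
    return best_weights, best_acc
```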
4.2. Comparison with SOTA Models
To validate the superiority of our ProFusion model in different scenarios, we present a performance comparison in Table 3, showcasing ProFusion against the SOTA few-shot learning models on 11 datasets in various shot settings. In the extreme one-shot setting, ProFusion leverages the rich pre-trained knowledge of multimodal models by directly classifying test images using image prototypes, text prototypes, and multimodal feature fusion prototypes. Our model significantly outperforms existing models. For instance, on the DTD dataset, ProFusion achieves a one-shot classification accuracy of 62.41%, outperforming Tip-Adapter and Proto-CLIP, which attain 46.22% and 46.04%, respectively; this corresponds to an absolute improvement of 16.19% over Tip-Adapter and 16.37% over Proto-CLIP. Similarly, on the ImageNet dataset, ProFusion reaches an accuracy of 72.63%, surpassing Tip-Adapter (60.70%) by 11.93% and Proto-CLIP (60.31%) by 12.32%. On the StanfordCars dataset, ProFusion achieves a classification accuracy of 83.08%, outperforming GDA-CLIP (56.77%) by a substantial 26.31%. Furthermore, ProFusion-F, which fine-tunes the three types of prototypes using support sets, further enhances classification performance. In the 16-shot setting, ProFusion-F maintains its advantage, achieving 77.61% accuracy on the ImageNet dataset, an improvement of 12.10% over Tip-Adapter-F (65.51%) and 11.86% over Proto-CLIP-F (65.75%). On the Food101 dataset, ProFusion-F achieves a classification accuracy of 87.62%, outperforming PMPro (79.31%) by 8.31%. In general, our method demonstrates significant improvements in few-shot learning tasks, especially in limited-data and complex scenarios, highlighting its ability to effectively utilize multimodal pre-trained knowledge for improved classification accuracy.
However, under the one-shot setting on the fine-grained EuroSAT dataset, the high similarity between categories leads to insufficient discriminability of the support set samples, affecting the model’s classification capability. To enhance the stability of the results, we conducted multiple experiments and reported the average accuracy as the final result. Additionally, under the one-shot condition on both FGVCAircraft and EuroSAT, the fine-tuned ProFusion-F slightly underperforms the non-fine-tuned ProFusion. For example, on FGVCAircraft, ProFusion-F achieves 23.19%, slightly lower than ProFusion’s 23.82%. This phenomenon arises from the limited inter-class variability in fine-grained tasks, where fine-tuning with only a single support sample is prone to overfitting.
To enhance the generalization capability of the model, we incorporate text information from class names to compensate for the insufficiency of visual information. Inspired by prototypical networks, we construct three types of prototypes as class anchors and classify test images based on their similarity to these anchors, effectively mitigating overfitting. In addition, we adopt the BEiT-3 multimodal pre-trained model as the backbone network to extract features from both image and text information, thereby avoiding the overfitting that may arise from training a feature extractor from scratch. However, due to differences among datasets, the model exhibits varying performance across them. As shown in Table 3, for conventional datasets (such as Flowers102, OxfordPets, and Caltech101), where class differences are clear, the model can easily extract discriminative features, leading to better recognition performance. For example, under the one-shot condition, ProFusion achieves an accuracy of 96.02% on the Caltech101 dataset. In contrast, for fine-grained datasets (such as FGVCAircraft, DTD, and EuroSAT), the task complexity is higher, the class differences are subtle, and the image textures are abstract, making it difficult for the model to learn effective distinguishing features from a very limited number of samples, resulting in poorer performance. For example, under the 1-shot setting, ProFusion achieves an accuracy of only 23.82% on the FGVCAircraft dataset.
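As a rough illustration of this prototype-based decision rule, the sketch below combines the class probabilities from the three prototype branches with the weights α, β, and γ; the cosine-similarity-plus-softmax form, the temperature, and the default weight values are our assumptions rather than the paper’s exact formulation.

```python
import torch
import torch.nn.functional as F

def predict(query_feat, img_protos, txt_protos, fused_protos,
            alpha=0.4, beta=0.2, gamma=0.4, temperature=0.01):
    """Weighted combination of per-branch class probabilities.

    query_feat: (D,) feature of a test image; *_protos: (C, D) class prototypes.
    The weights and temperature here are illustrative, not tuned values.
    """
    q = F.normalize(query_feat, dim=-1)
    branch_probs = []
    for protos in (img_protos, txt_protos, fused_protos):
        p = F.normalize(protos, dim=-1)
        logits = q @ p.t() / temperature          # cosine similarity to each class prototype
        branch_probs.append(logits.softmax(dim=-1))
    return alpha * branch_probs[0] + beta * branch_probs[1] + gamma * branch_probs[2]
```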
As shown in Figure 3, the average accuracy of ProFusion under 1-shot to 16-shot conditions is compared to that of the other SOTA models on the ImageNet dataset. ProFusion demonstrates significant performance advantages, achieving an average accuracy of 74.30%, which is significantly higher than CLIP-Adapter (62.17%), Tip-Adapter-F (62.97%), and Proto-CLIP-F (62.39%). Specifically, compared to GDA-CLIP (61.97%) and PMPro (63.11%), ProFusion’s accuracy improves by 12.33% and 11.19%, respectively. Meanwhile, ProFusion-F, which leverages the support set for fine-tuning, further boosts the average accuracy to 75.00%, a 0.70% improvement over ProFusion.
To comprehensively demonstrate the robustness of our model, we calculate and compare the average classification accuracy on 11 datasets under different shot settings. As shown in Table 4, ProFusion consistently outperforms other methods across all shot settings. Without requiring additional training, ProFusion achieves an accuracy of 73.55% for 1-shot and 80.46% for 16-shot, with an overall average accuracy of 77.46%. In contrast, ProFusion-F, which uses support set fine-tuning, further enhances performance across all shot settings, reaching 73.67% for 1-shot, 83.63% for 16-shot, and an average accuracy of 78.82%. Moreover, the experimental results indicate that both ProFusion and ProFusion-F consistently surpass the SOTA models in different shot settings. For example, in the 1-shot setting, ProFusion outperforms Tip-Adapter-F by 8.95%, while ProFusion-F exceeds PMPro by 8.04%. In the 16-shot setting, ProFusion-F demonstrates a 7.80% improvement over Tip-Adapter-F. Overall, the average accuracy of ProFusion-F exceeds that of CALIP and GDA-CLIP by 8.06% and 9.29%, respectively, reflecting its advantage in few-shot classification tasks.
We also performed experiments to evaluate the performance of our model in out-of-distribution generalization. Specifically, we trained our model using a 16-shot setting on the ImageNet dataset and then directly transferred it to the target datasets ImageNetV2, ImageNet-Sketch, ImageNet-A, and ImageNet-R. As shown in Table 5, we compared our method with CLIP, Tip-Adapter, Tip-Adapter-F, Proto-CLIP, Proto-CLIP-F, MaPLe, and GDA-CLIP. The experimental results demonstrate that both ProFusion and ProFusion-F exhibit significant advantages in out-of-distribution generalization tasks. On the target datasets, our model consistently outperforms the other baselines, especially on ImageNet-R and ImageNet-Sketch, where ProFusion improves by 7.33% and ProFusion-F improves by 11.67%, showing remarkable performance gains over the other models. However, on the ImageNetV2 and ImageNet-A datasets, ProFusion performs relatively poorly, improving by only 1.88% and 0.24%, respectively, over the second-place Tip-Adapter series methods. This is mainly due to the presence of challenging and adversarial samples in these two datasets, which increase the difficulty of classification. Overall, ProFusion-F achieves an average score of 66.04%, surpassing CLIP (59.76%) and GDA-CLIP (60.22%), demonstrating the superiority of our method in out-of-distribution generalization tasks.
4.3. Ablation Study
In the ablation study presented in Table 6, we used BEiT3-large-itc as the backbone network to evaluate the impact of image prototypes, text prototypes, and multimodal feature fusion prototypes on model performance. Experimental results show that when image and text prototypes are used independently, there is no significant difference in classification accuracy. For example, using only image prototypes, the model achieves an accuracy of 97.61% on the Caltech101 dataset, which is comparable to using only text prototypes (97.65%). However, on regular datasets (such as OxfordPets and SUN397), where inter-class differences are more pronounced, the semantic information provided by text descriptions enables relatively accurate classification, resulting in better performance than with unimodal image prototypes. When using both image and text prototypes, the accuracy on ImageNet increases to 77.48%, but the classification accuracy drops on more challenging datasets such as EuroSAT (84.07%) and FGVCAircraft (42.00%). The primary reason for this is that, in fine-grained datasets, the semantic information conveyed by textual descriptions (e.g., “a photo of {class}”) differs from the actual visual information. The alignment module encourages the image prototypes to move closer to the text prototypes, which weakens the discriminative ability of the image prototypes and ultimately leads to lower classification accuracy compared to using unimodal image prototypes alone. To this end, we perform interactive fusion of image and text information through the shared attention mechanism of the fusion module and leverage the vision-language feed-forward network to capture the cross-modal relationships between images and text, thereby constructing the fused prototypes. After introducing the fused prototypes, the model shows improvements across multiple datasets. For example, on the FGVCAircraft dataset, the accuracy increases by 2.10% compared to using only image and text prototypes, reaching 44.10%. On the EuroSAT dataset, the accuracy improves by 3.73%, reaching 87.80%.
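A schematic of how fused prototypes could be constructed from the fusion module’s output is given below; the `fusion_encoder(image, text)` interface is a hypothetical stand-in for BEiT-3’s fusion pathway (shared attention followed by the vision-language feed-forward network), and the class-wise averaging follows the usual prototypical-network recipe.

```python
import torch

def build_fused_prototypes(fusion_encoder, support_images, support_texts, labels, num_classes):
    """Average the fused embedding of each (image, text) support pair per class.

    fusion_encoder(image, text) is assumed to return a 1-D joint embedding.
    labels: LongTensor of class indices aligned with the support pairs.
    """
    feats = torch.stack([fusion_encoder(img, txt)
                         for img, txt in zip(support_images, support_texts)])
    protos = torch.zeros(num_classes, feats.size(-1))
    for c in range(num_classes):
        protos[c] = feats[labels == c].mean(dim=0)  # class-wise mean = fused prototype
    return protos
```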
In Table 7, we compare the performance of the baseline fusion strategy with our proposed fusion strategy on 11 datasets. The baseline strategy performs a simple fusion of image prototypes and text prototypes using element-wise multiplication, while our method utilizes the fusion module of a multimodal pre-trained model to generate more information-rich fused prototypes. The experimental results show that on datasets such as ImageNet, OxfordPets, and Flowers102, the performance of the two methods is comparable. However, on fine-grained datasets such as FGVCAircraft, EuroSAT, DTD, and UCF101, our method shows significant advantages. For example, on the FGVCAircraft dataset, the accuracy increases from 41.88% with the baseline to 44.10% with our model. On the EuroSAT dataset, our model improves by 3.69% over the baseline (84.11%). On the DTD dataset, the accuracy increases from 76.36% with the baseline to 77.54%. These results clearly show that a basic fusion of image and text features is insufficient to fully exploit the complex relationships between image and text, leading to significantly poorer performance on fine-grained tasks. In contrast, our proposed fusion strategy not only performs well on general datasets but also exhibits significant improvements on fine-grained datasets, fully demonstrating its superior ability to capture complex image–text associations and integrate multimodal information.
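For reference, the baseline fusion strategy in Table 7 amounts to something like the following sketch; the L2 normalization before and after the element-wise product is our assumption.

```python
import torch.nn.functional as F

def baseline_fusion(img_protos, txt_protos):
    """Element-wise product of image and text prototypes (baseline fusion strategy)."""
    fused = F.normalize(img_protos, dim=-1) * F.normalize(txt_protos, dim=-1)
    return F.normalize(fused, dim=-1)  # renormalize the fused prototypes
```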
Figure 4 presents the image classification results using two different multimodal pre-trained models, BEiT3-base-itc and BEiT3-large-itc, as backbone networks. In general, the stronger the backbone network, the more discriminative the feature representations it can learn, thereby improving classification accuracy. For example, on the ImageNet dataset, when using the BEiT3-base-itc pre-trained model, the zero-shot accuracy of BEiT3 is 68.96%, whereas with the BEiT3-large-itc pre-trained model, the accuracy improves to 71.89%. Additionally, we compare our method with the existing SOTA approaches GDA-CLIP and Tip-Adapter, using the same pre-trained models as backbone networks. On ImageNet (K = 16), with the BEiT3-base-itc backbone, ProFusion-F achieves an accuracy of 73.28%, surpassing GDA-CLIP (71.84%) by 1.44%. When using the BEiT3-large-itc backbone, ProFusion-F reaches 77.61%, exceeding GDA-CLIP (76.45%) by 1.16% and Tip-Adapter-F (76.72%) by 0.89%. Similarly, on the UCF101 dataset (K = 16), with the BEiT3-base-itc backbone, ProFusion-F achieves an accuracy of 80.78%, improving by 1.98% over Tip-Adapter-F (78.80%). With the BEiT3-large-itc backbone, ProFusion-F reaches an accuracy of 86.12% in the 16-shot setting, outperforming GDA-CLIP (85.23%) by 0.89%.
Data augmentation is an important technique for improving the generalization ability of models. Applying random transformations to the support set increases the diversity of images and mitigates the model’s overfitting to the training data distribution. In Figure 5, we evaluate the impact of data augmentation on model performance on four datasets: UCF101, SUN397, Food101, and StanfordCars. The data augmentation method used includes random cropping and random horizontal flipping (with a probability of 50%), followed by image normalization. The model without data augmentation only applies resizing and normalization to the images. The experimental results show that for the UCF101 dataset, data augmentation improves performance in most shot settings, with the largest improvement (0.87%) observed in the 8-shot setting. However, a slight performance drop (−0.63%) is observed in the 1-shot setting. Further analysis shows that under the 1-shot setting on UCF101, moderately reducing data augmentation increases accuracy to 75.28%, a 0.29% improvement over the setting without augmentation (74.99%) and a 0.92% improvement over the setting with full augmentation (74.36%). This indicates that excessive data augmentation interferes with essential features of the original samples, ultimately reducing model performance. On the SUN397 dataset, the effect of data augmentation is relatively stable, providing noticeable improvements in both the 1-shot and 16-shot settings (0.66% for 1-shot and 0.50% for 16-shot), but almost no difference is observed in the 2-shot setting, with an improvement of only 0.03%. Generally, data augmentation can increase the diversity of samples in few-shot learning tasks, thus improving the generalization ability of the model to some extent; however, the specific effect is closely related to the characteristics of the dataset and the number of samples.
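The two preprocessing pipelines compared in Figure 5 can be expressed with torchvision transforms roughly as follows; the input resolution (224) and the ImageNet normalization statistics are assumptions, and the exact cropping variant used in the paper may differ.

```python
from torchvision import transforms

# Assumed input resolution and ImageNet normalization statistics.
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])

with_augmentation = transforms.Compose([
    transforms.RandomResizedCrop(224),       # random cropping (exact variant assumed)
    transforms.RandomHorizontalFlip(p=0.5),  # horizontal flip with 50% probability
    transforms.ToTensor(),
    normalize,
])

without_augmentation = transforms.Compose([
    transforms.Resize((224, 224)),           # resizing and normalization only
    transforms.ToTensor(),
    normalize,
])
```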
4.4. Visualization
To better illustrate the effectiveness of prototype features, we employ the t-SNE technique to visualize the multimodal feature fusion prototypes on both the validation and test sets of OxfordPets. In addition, we perform a comparative visualization of image and text prototypes on the EuroSAT test set. As shown in Figure 6a,b, the fusion prototypes of different categories are distributed near the cluster centers of their corresponding test samples, demonstrating high consistency and robustness on both the validation and test sets. In the two-dimensional space after dimensionality reduction, image feature points from different categories form distinct and compact clusters, with the fusion prototypes positioned at the cluster centers of their respective categories. These results indicate that multimodal information fusion can effectively integrate information from different modalities and represent each category. Furthermore, as illustrated in Figure 6c,d, on the EuroSAT test set both the image and text prototypes effectively serve as anchors for their respective categories. These findings further validate the effectiveness of the prototype network approach for few-shot image classification tasks.
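The visualization in Figure 6 can be reproduced along these lines; feature extraction is abstracted away, and the t-SNE settings (perplexity, PCA initialization) are illustrative choices rather than the paper’s reported configuration.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(test_feats, test_labels, prototypes):
    """Project test features and class prototypes into 2-D with t-SNE and plot them."""
    joint = np.concatenate([test_feats, prototypes], axis=0)
    emb = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(joint)
    n = len(test_feats)
    plt.scatter(emb[:n, 0], emb[:n, 1], c=test_labels, s=5, cmap="tab10")
    plt.scatter(emb[n:, 0], emb[n:, 1], c="black", marker="*", s=120, label="prototypes")
    plt.legend()
    plt.show()
```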
As shown in Figure 7, we visualize the image and text prototypes for 10 categories in the EuroSAT dataset, comparing the results before and after fine-tuning. During the fine-tuning process, we introduce an image–text alignment module to enhance the consistency of image and text prototypes in the feature space. From the visualization results, it is evident that in Figure 7a, which represents the case without fine-tuning, the image and text prototypes of the same category are noticeably distant in the feature space. In contrast, Figure 7b shows the results after fine-tuning with the alignment module, where the image prototypes and their corresponding text prototypes are much closer. The experimental results demonstrate that incorporating the image–text alignment module during fine-tuning helps to improve the consistency of prototypes within the same category, thereby enhancing the semantic alignment between cross-modal representations.
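One simple way to express such an alignment objective, used here purely as an illustrative stand-in for the paper’s alignment module, is to penalize the cosine distance between each class’s image prototype and its corresponding text prototype.

```python
import torch.nn.functional as F

def alignment_loss(img_protos, txt_protos):
    """Pull each class's image prototype toward its text prototype.

    img_protos, txt_protos: (C, D) tensors; the exact objective in the paper may differ.
    """
    cos = F.cosine_similarity(img_protos, txt_protos, dim=-1)  # (C,) per-class similarity
    return (1.0 - cos).mean()
```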
4.5. Hyperparameter Analysis
We conducted experiments on the Caltech101, UCF101, DTD, EuroSAT, and OxfordPets datasets to investigate the impact of the hyperparameters α, β, and γ on classification performance. The experiments compare the average accuracy across multiple datasets under the same weight configuration. As shown in Figure 8, the impact of the different prototype features on the final classification accuracy varies significantly across weight configurations. As shown in Figure 8a, image prototypes are crucial to the classification results, with the overall accuracy steadily increasing as the image-prototype weight α increases. Additionally, when α is between 0.2 and 0.8, the overall classification accuracy remains high and stable, with the average accuracy reaching its highest value of 85.63% when α is 0.8, indicating that the effective fusion of multimodal information enhances the generalization ability of the model. In contrast, when relying solely on the prediction probability of the text prototype (i.e., β = 1), the classification accuracy decreases, which restricts the ability of the model. The main reason for this is that, in some fine-grained datasets, simple textual information lacks sufficient discriminative power to effectively distinguish between visually similar categories, making it difficult for the model to make accurate predictions. Therefore, only by balancing and optimizing the weight distribution of the different prototype features in the prediction probabilities for test images can we fully leverage the advantages of multimodal information and thereby improve the model’s robustness.
In the classification task, ProFusion uses the grid-searched hyperparameters α, β, and γ to regulate the contributions of the image prototype, text prototype, and multimodal feature fusion prototype to the final prediction probability. We attempted to replace the grid-searched hyperparameters with learnable parameters to reduce tuning time and computational overhead. However, the experimental results indicate that this approach performs poorly, failing to achieve the expected model performance. The primary reason is the limited number of training samples, which leads to overfitting. As shown in Figure 8d, in the 1-shot setting, the performance gap between the grid-searched hyperparameters (α, β, γ) and the learnable parameters reaches 8.93%. As the number of samples increases, this gap gradually narrows, indicating that the number of training samples plays a crucial role in learning these parameters. In few-shot image classification, however, the extremely limited number of training samples and the large number of learnable parameters make overfitting likely. Therefore, the grid-searched hyperparameters α, β, and γ are superior to learnable parameters and serve as the better choice in this setting.
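For completeness, the learnable-parameter variant we compare against can be sketched as below, where the three weights are parameterized through a softmax so that they remain positive and sum to one; the class name and initialization are ours, not the paper’s.

```python
import torch
import torch.nn as nn

class LearnableFusionWeights(nn.Module):
    """Learn alpha, beta, and gamma instead of grid-searching them."""

    def __init__(self):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(3))  # softmax of zeros gives equal initial weights

    def forward(self, p_img, p_txt, p_fused):
        # The softmax keeps the three weights positive and summing to one.
        alpha, beta, gamma = torch.softmax(self.logits, dim=0)
        return alpha * p_img + beta * p_txt + gamma * p_fused
```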