4.1. Experiment Setup
To validate the effectiveness of the ProFusion model, we evaluated it on 15 benchmark datasets: ImageNet [31], StanfordCars [44], UCF101 [45], Caltech101 [46], Flowers102 [47], SUN397 [48], DTD [49], EuroSAT [50], FGVCAircraft [51], OxfordPets [52], Food101 [53], ImageNetV2 [54], ImageNet-Sketch [55], ImageNet-A [56], and ImageNet-R [57]. In addition, to highlight the superiority of ProFusion, we compared it with several SOTA models, including CoOp [12], CLIP-Adapter [39], Tip-Adapter [11], Proto-CLIP [16], GDA-CLIP [9], and PMPro [40].
In the experiments, we used the BEiT3-base-itc and BEiT3-large-itc pre-trained models as the backbone networks. To increase data diversity, we applied random cropping and horizontal flipping to the support set. The ProFusion model was built with the PyTorch framework and trained on a single NVIDIA GeForce RTX 3090 GPU with a batch size of 256. We used the AdamW optimizer with an initial learning rate of 0.0001 and adjusted the learning rate with the CosineAnnealingLR scheduler.
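A minimal sketch of this training configuration is given below; the parameter container and the number of epochs (`T_max`) are illustrative placeholders, while the optimizer, learning rate, scheduler, and batch size follow the settings reported above.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

# Hypothetical container for the prototype parameters being trained.
model = torch.nn.Linear(1024, 1000)

optimizer = AdamW(model.parameters(), lr=1e-4)      # initial learning rate 0.0001
scheduler = CosineAnnealingLR(optimizer, T_max=20)  # T_max (training epochs) is an assumption

batch_size = 256  # batch size used on a single RTX 3090
```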
Meanwhile, the hyperparameters α, β, and γ, as mentioned in Equation (4), play a crucial role in improving the classification precision. To determine the optimal hyperparameter values for each dataset, we performed a grid search over α, β, and γ in the range [0, 1] with a step size of 0.1, while imposing the constraint α + β + γ = 1 to limit the search space. We evaluated the performance of all candidate weight combinations on the validation set and selected the combination that yielded the best performance for the final testing phase. This approach facilitated a reasonable allocation of weights across modalities. The model was divided into two versions: ProFusion and ProFusion-F. For the ProFusion model, image prototypes, text prototypes, and multimodal feature fusion prototypes were constructed using the encoder, and these prototypes were directly used for classifying the query set. For the ProFusion-F model, the prototype feature parameters were fine-tuned using the support set to further enhance classification performance. Furthermore, for each dataset, we continued to use the text prompt templates selected in previous works [9,40], as shown in Table 2.
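The constrained grid search described above can be sketched as follows; the weight names and the `evaluate` callback (which should return validation accuracy for a given weight combination) are illustrative, while the 0.1 step size and the sum-to-one constraint follow the text.

```python
import itertools

def grid_search(evaluate, step=0.1):
    """Search (alpha, beta, gamma) on a 0.1 grid subject to alpha + beta + gamma = 1."""
    values = [round(i * step, 2) for i in range(int(round(1 / step)) + 1)]  # 0.0 .. 1.0
    best_acc, best_weights = -1.0, None
    for alpha, beta in itertools.product(values, values):
        gamma = round(1.0 - alpha - beta, 2)
        if gamma < 0:  # the constraint alpha + beta + gamma = 1 limits the search space
            continue
        acc = evaluate(alpha, beta, gamma)  # validation-set accuracy for this combination
        if acc > best_acc:
            best_acc, best_weights = acc, (alpha, beta, gamma)
    return best_weights, best_acc
```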
4.2. Comparison with SOTA Models
To validate the superiority of our ProFusion model in different scenarios, we present a performance comparison in Table 3, showcasing ProFusion against the SOTA few-shot learning models on 11 datasets in various shot settings. In the extreme one-shot setting, ProFusion leverages the rich pre-trained knowledge of multimodal models by directly classifying test images using image prototypes, text prototypes, and multimodal feature fusion prototypes. Our model significantly outperforms existing models. For instance, on the DTD dataset, ProFusion achieves a one-shot classification accuracy of 62.41%, outperforming Tip-Adapter and Proto-CLIP, which attain 46.22% and 46.04%, respectively; this corresponds to an absolute improvement of 16.19% over Tip-Adapter and 16.37% over Proto-CLIP. Similarly, on the ImageNet dataset, ProFusion reaches an accuracy of 72.63%, surpassing Tip-Adapter (60.70%) by 11.93% and Proto-CLIP (60.31%) by 12.32%. On the StanfordCars dataset, ProFusion achieves a classification accuracy of 83.08%, outperforming GDA-CLIP (56.77%) by a substantial 26.31%. Furthermore, ProFusion-F, which fine-tunes the three types of prototypes using support sets, further enhances classification performance. In the 16-shot setting, ProFusion-F maintains its advantage, achieving 77.61% accuracy on the ImageNet dataset, an improvement of 12.10% over Tip-Adapter-F (65.51%) and 11.86% over Proto-CLIP-F (65.75%). On the Food101 dataset, ProFusion-F achieves a classification accuracy of 87.62%, outperforming PMPro (79.31%) by 8.31%. In general, our method demonstrates significant improvements in few-shot learning tasks, especially in limited-data and complex scenarios, highlighting its ability to effectively utilize multimodal pre-trained knowledge for improved classification accuracy.
However, under the one-shot setting on the fine-grained EuroSAT dataset, the high similarity between categories leads to insufficient discriminability of the support set samples, affecting the model’s classification capability. To enhance the stability of the results, we conducted multiple experiments and reported the average accuracy as the final result. Additionally, under the one-shot condition on both FGVCAircraft and EuroSAT, the fine-tuned ProFusion-F slightly underperforms the non-fine-tuned ProFusion. For example, on FGVCAircraft, ProFusion-F achieves 23.19%, slightly lower than ProFusion’s 23.82%. This phenomenon arises from the limited inter-class variability in fine-grained tasks, where fine-tuning with only a single support sample is prone to overfitting.
To enhance the generalization capability of the model, we incorporate text information from class names to compensate for the insufficiency of visual information. Inspired by prototypical networks, we construct three types of prototypes as class anchors and classify test images based on their similarity to these anchors, effectively mitigating overfitting. In addition, we adopt the BEiT-3 multimodal pre-trained model as the backbone network to extract features from both image and text information, thereby avoiding the overfitting that may arise from training a feature extractor from scratch. However, due to differences among datasets, the model exhibits varying performance across them. As shown in Table 3, for conventional datasets (such as Flowers102, OxfordPets, and Caltech101), where class differences are clear, the model can easily extract discriminative features, leading to better recognition performance. For example, under the one-shot condition, ProFusion achieves an accuracy of 96.02% on the Caltech101 dataset. In contrast, for fine-grained datasets (such as FGVCAircraft, DTD, and EuroSAT), the task complexity is higher, the class differences are subtle, and the image textures are abstract, making it difficult for the model to learn effective distinguishing features from a very limited number of samples, resulting in poorer performance. For example, under the 1-shot setting, ProFusion achieves an accuracy of only 23.82% on the FGVCAircraft dataset.
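As a rough illustration of this prototype-based decision rule, the sketch below combines the class probabilities from the three prototype branches with the weights α, β, and γ; the cosine-similarity-plus-softmax form, the temperature, and the default weight values are our assumptions rather than the paper’s exact formulation.

```python
import torch
import torch.nn.functional as F

def predict(query_feat, img_protos, txt_protos, fused_protos,
            alpha=0.4, beta=0.2, gamma=0.4, temperature=0.01):
    """Weighted combination of per-branch class probabilities.

    query_feat: (D,) feature of a test image; *_protos: (C, D) class prototypes.
    The weights and temperature here are illustrative, not tuned values.
    """
    q = F.normalize(query_feat, dim=-1)
    branch_probs = []
    for protos in (img_protos, txt_protos, fused_protos):
        p = F.normalize(protos, dim=-1)
        logits = q @ p.t() / temperature          # cosine similarity to each class prototype
        branch_probs.append(logits.softmax(dim=-1))
    return alpha * branch_probs[0] + beta * branch_probs[1] + gamma * branch_probs[2]
```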
As shown in Figure 3, the average accuracy of ProFusion under 1-shot to 16-shot conditions is compared to that of the other SOTA models on the ImageNet dataset. ProFusion demonstrates significant performance advantages, achieving an average accuracy of 74.30%, which is significantly higher than CLIP-Adapter (62.17%), Tip-Adapter-F (62.97%), and Proto-CLIP-F (62.39%). Specifically, compared to GDA-CLIP (61.97%) and PMPro (63.11%), ProFusion’s accuracy improves by 12.33% and 11.19%, respectively. Meanwhile, ProFusion-F, which leverages the support set for fine-tuning, further boosts the average accuracy to 75.00%, a 0.70% improvement over ProFusion.
To comprehensively demonstrate the robustness of our model, we calculate and compare the average classification accuracy on 11 datasets under different shot settings. As shown in Table 4, ProFusion consistently outperforms other methods across all shot settings. Without requiring additional training, ProFusion achieves an accuracy of 73.55% for 1-shot and 80.46% for 16-shot, with an overall average accuracy of 77.46%. In contrast, ProFusion-F, which uses support set fine-tuning, further enhances performance across all shot settings, reaching 73.67% for 1-shot, 83.63% for 16-shot, and an average accuracy of 78.82%. Moreover, the experimental results indicate that both ProFusion and ProFusion-F consistently surpass the SOTA models in different shot settings. For example, in the 1-shot setting, ProFusion outperforms Tip-Adapter-F by 8.95%, while ProFusion-F exceeds PMPro by 8.04%. In the 16-shot setting, ProFusion-F demonstrates a 7.80% improvement over Tip-Adapter-F. Overall, the average accuracy of ProFusion-F exceeds that of CALIP and GDA-CLIP by 8.06% and 9.29%, respectively, reflecting its advantage in few-shot classification tasks.
We also performed experiments to evaluate the performance of our model in out-of-distribution generalization. Specifically, we trained our model using a 16-shot setting on the ImageNet dataset and then directly transferred it to the target datasets ImageNetV2, ImageNet-Sketch, ImageNet-A, and ImageNet-R. As shown in Table 5, we compared our method with CLIP, Tip-Adapter, Tip-Adapter-F, Proto-CLIP, Proto-CLIP-F, MaPLe, and GDA-CLIP. The experimental results demonstrate that both ProFusion and ProFusion-F exhibit significant advantages in out-of-distribution generalization tasks. On the target datasets, our model consistently outperforms the other baselines, especially on ImageNet-R and ImageNet-Sketch, where ProFusion improves by 7.33% and ProFusion-F improves by 11.67%, showing remarkable performance gains over the other models. However, on the ImageNetV2 and ImageNet-A datasets, ProFusion performs relatively poorly, improving by only 1.88% and 0.24%, respectively, over the second-place Tip-Adapter series methods. This is mainly due to the presence of challenging and adversarial samples in these two datasets, which increase the difficulty of classification. Overall, ProFusion-F achieves an average score of 66.04%, surpassing CLIP (59.76%) and GDA-CLIP (60.22%), demonstrating the superiority of our method in out-of-distribution generalization tasks.
4.3. Ablation Study
In the ablation study presented in Table 6, we used BEiT3-large-itc as the backbone network to evaluate the impact of image prototypes, text prototypes, and multimodal feature fusion prototypes on model performance. Experimental results show that when image and text prototypes are used independently, there is no significant difference in classification accuracy. For example, using only image prototypes, the model achieves an accuracy of 97.61% on the Caltech101 dataset, which is comparable to using only text prototypes (97.65%). However, on regular datasets (such as OxfordPets and SUN397), where inter-class differences are more pronounced, the semantic information provided by text descriptions enables relatively accurate classification, resulting in better performance than with unimodal image prototypes. When using both image and text prototypes, the accuracy on ImageNet increases to 77.48%, but the classification accuracy drops on more challenging datasets such as EuroSAT (84.07%) and FGVCAircraft (42.00%). The primary reason for this is that, in fine-grained datasets, the semantic information conveyed by textual descriptions (e.g., “a photo of {class}”) differs from the actual visual information. The alignment module encourages the image prototypes to move closer to the text prototypes, which weakens the discriminative ability of the image prototypes and ultimately leads to lower classification accuracy compared to using unimodal image prototypes alone. To this end, we perform interactive fusion of image and text information through the shared attention mechanism of the fusion module and leverage the vision-language feed-forward network to capture the cross-modal relationships between images and text, thereby constructing the fused prototypes. After introducing the fused prototypes, the model shows improvements across multiple datasets. For example, on the FGVCAircraft dataset, the accuracy increases by 2.10% compared to using only image and text prototypes, reaching 44.10%. On the EuroSAT dataset, the accuracy improves by 3.73%, reaching 87.80%.
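A schematic of how fused prototypes could be constructed from the fusion module’s output is given below; the `fusion_encoder(image, text)` interface is a hypothetical stand-in for BEiT-3’s fusion pathway (shared attention followed by the vision-language feed-forward network), and the class-wise averaging follows the usual prototypical-network recipe.

```python
import torch

def build_fused_prototypes(fusion_encoder, support_images, support_texts, labels, num_classes):
    """Average the fused embedding of each (image, text) support pair per class.

    fusion_encoder(image, text) is assumed to return a 1-D joint embedding.
    labels: LongTensor of class indices aligned with the support pairs.
    """
    feats = torch.stack([fusion_encoder(img, txt)
                         for img, txt in zip(support_images, support_texts)])
    protos = torch.zeros(num_classes, feats.size(-1))
    for c in range(num_classes):
        protos[c] = feats[labels == c].mean(dim=0)  # class-wise mean = fused prototype
    return protos
```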
In Table 7, we compare the performance of the baseline fusion strategy with our proposed fusion strategy on 11 datasets. The baseline strategy performs a simple fusion of image prototypes and text prototypes using element-wise multiplication, while our method utilizes the fusion module of a multimodal pre-trained model to generate more information-rich fused prototypes. The experimental results show that on datasets such as ImageNet, OxfordPets, and Flowers102, the performance of the two methods is comparable. However, on fine-grained datasets such as FGVCAircraft, EuroSAT, DTD, and UCF101, our method shows significant advantages. For example, on the FGVCAircraft dataset, the accuracy increases from 41.88% with the baseline to 44.10% with our model. On the EuroSAT dataset, our model improves by 3.69% over the baseline (84.11%). On the DTD dataset, the accuracy increases from 76.36% with the baseline to 77.54%. These results clearly show that a basic fusion of image and text features is insufficient to fully exploit the complex relationships between image and text, leading to significantly poorer performance on fine-grained tasks. In contrast, our proposed fusion strategy not only performs well on general datasets but also exhibits significant improvements on fine-grained datasets, fully demonstrating its superior ability to capture complex image–text associations and integrate multimodal information.
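For reference, the baseline fusion strategy in Table 7 amounts to something like the following sketch; the L2 normalization before and after the element-wise product is our assumption.

```python
import torch.nn.functional as F

def baseline_fusion(img_protos, txt_protos):
    """Element-wise product of image and text prototypes (baseline fusion strategy)."""
    fused = F.normalize(img_protos, dim=-1) * F.normalize(txt_protos, dim=-1)
    return F.normalize(fused, dim=-1)  # renormalize the fused prototypes
```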
Figure 4 presents the image classification results using two different multimodal pre-trained models, BEiT3-base-itc and BEiT3-large-itc, as backbone networks. In general, the stronger the backbone network, the more discriminative the feature representations it can learn, thereby improving classification accuracy. For example, on the ImageNet dataset, when using the BEiT3-base-itc pre-trained model, the zero-shot accuracy of BEiT3 is 68.96%, whereas with the BEiT3-large-itc pre-trained model, the accuracy improves to 71.89%. Additionally, we compare our method with the existing SOTA approaches GDA-CLIP and Tip-Adapter, using the same pre-trained models as backbone networks. On ImageNet (K = 16), with the BEiT3-base-itc backbone, ProFusion-F achieves an accuracy of 73.28%, surpassing GDA-CLIP (71.84%) by 1.44%. When using the BEiT3-large-itc backbone, ProFusion-F reaches 77.61%, exceeding GDA-CLIP (76.45%) by 1.16% and Tip-Adapter-F (76.72%) by 0.89%. Similarly, on the UCF101 dataset (K = 16), with the BEiT3-base-itc backbone, ProFusion-F achieves an accuracy of 80.78%, improving by 1.98% over Tip-Adapter-F (78.80%). With the BEiT3-large-itc backbone, ProFusion-F reaches an accuracy of 86.12% in the 16-shot setting, outperforming GDA-CLIP (85.23%) by 0.89%.
Data augmentation is an important technique for improving the generalization ability of models. Applying random transformations to the support set increases the diversity of images and mitigates the model’s overfitting to the training data distribution. In Figure 5, we evaluate the impact of data augmentation on model performance on four datasets: UCF101, SUN397, Food101, and StanfordCars. The data augmentation method used includes random cropping and random horizontal flipping (with a probability of 50%), followed by image normalization. The model without data augmentation only applies resizing and normalization to the images. The experimental results show that for the UCF101 dataset, data augmentation improves performance in most shot settings, with the largest improvement (0.87%) observed in the 8-shot setting. However, a slight performance drop (−0.63%) is observed in the 1-shot setting. Further analysis shows that under the 1-shot setting on UCF101, moderately reducing data augmentation increases accuracy to 75.28%, a 0.29% improvement over the setting without augmentation (74.99%) and a 0.92% improvement over the setting with full augmentation (74.36%). This indicates that excessive data augmentation interferes with essential features of the original samples, ultimately reducing model performance. On the SUN397 dataset, the effect of data augmentation is relatively stable, providing noticeable improvements in both the 1-shot and 16-shot settings (0.66% for 1-shot and 0.50% for 16-shot), but almost no difference is observed in the 2-shot setting, with an improvement of only 0.03%. Generally, data augmentation can increase the diversity of samples in few-shot learning tasks, thus improving the generalization ability of the model to some extent; however, the specific effect is closely related to the characteristics of the dataset and the number of samples.
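The two preprocessing pipelines compared in Figure 5 can be expressed with torchvision transforms roughly as follows; the input resolution (224) and the ImageNet normalization statistics are assumptions, and the exact cropping variant used in the paper may differ.

```python
from torchvision import transforms

# Assumed input resolution and ImageNet normalization statistics.
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])

with_augmentation = transforms.Compose([
    transforms.RandomResizedCrop(224),       # random cropping (exact variant assumed)
    transforms.RandomHorizontalFlip(p=0.5),  # horizontal flip with 50% probability
    transforms.ToTensor(),
    normalize,
])

without_augmentation = transforms.Compose([
    transforms.Resize((224, 224)),           # resizing and normalization only
    transforms.ToTensor(),
    normalize,
])
```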
4.4. Visualization
To better illustrate the effectiveness of prototype features, we employ the t-SNE technique to visualize the multimodal feature fusion prototypes on both the validation and test sets of OxfordPets. In addition, we perform a comparative visualization of image and text prototypes on the EuroSAT test set. As shown in Figure 6a,b, the fusion prototypes of different categories are distributed near the cluster centers of their corresponding test samples, demonstrating high consistency and robustness on both the validation and test sets. In the two-dimensional space after dimensionality reduction, image feature points from different categories form distinct and compact clusters, with the fusion prototypes positioned at the cluster centers of their respective categories. These results indicate that multimodal information fusion can effectively integrate information from different modalities and represent each category. Furthermore, as illustrated in Figure 6c,d, on the EuroSAT test set both the image and text prototypes effectively serve as anchors for their respective categories. These findings further validate the effectiveness of the prototype network approach for few-shot image classification tasks.
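The visualization in Figure 6 can be reproduced along these lines; feature extraction is abstracted away, and the t-SNE settings (perplexity, PCA initialization) are illustrative choices rather than the paper’s reported configuration.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(test_feats, test_labels, prototypes):
    """Project test features and class prototypes into 2-D with t-SNE and plot them."""
    joint = np.concatenate([test_feats, prototypes], axis=0)
    emb = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(joint)
    n = len(test_feats)
    plt.scatter(emb[:n, 0], emb[:n, 1], c=test_labels, s=5, cmap="tab10")
    plt.scatter(emb[n:, 0], emb[n:, 1], c="black", marker="*", s=120, label="prototypes")
    plt.legend()
    plt.show()
```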
As shown in Figure 7, we visualize the image and text prototypes for 10 categories in the EuroSAT dataset, comparing the results before and after fine-tuning. During the fine-tuning process, we introduce an image–text alignment module to enhance the consistency of image and text prototypes in the feature space. From the visualization results, it is evident that in Figure 7a, which represents the case without fine-tuning, the image and text prototypes of the same category are noticeably distant in the feature space. In contrast, Figure 7b shows the results after fine-tuning with the alignment module, where the image prototypes and their corresponding text prototypes are much closer. The experimental results demonstrate that incorporating the image–text alignment module during fine-tuning helps to improve the consistency of prototypes within the same category, thereby enhancing the semantic alignment between cross-modal representations.
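One simple way to express such an alignment objective, used here purely as an illustrative stand-in for the paper’s alignment module, is to penalize the cosine distance between each class’s image prototype and its corresponding text prototype.

```python
import torch.nn.functional as F

def alignment_loss(img_protos, txt_protos):
    """Pull each class's image prototype toward its text prototype.

    img_protos, txt_protos: (C, D) tensors; the exact objective in the paper may differ.
    """
    cos = F.cosine_similarity(img_protos, txt_protos, dim=-1)  # (C,) per-class similarity
    return (1.0 - cos).mean()
```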
4.5. Hyperparameter Analysis
We conducted experiments on the Caltech101, UCF101, DTD, EuroSAT, and OxfordPets datasets to investigate the impact of the hyperparameters α, β, and γ on classification performance. The experiments compare the average accuracy across multiple datasets under the same weight configuration. As shown in Figure 8, the impact of the different prototype features on the final classification accuracy varies significantly across weight configurations. As shown in Figure 8a, image prototypes are crucial to the classification results, with the overall accuracy steadily increasing as the image-prototype weight α increases. Additionally, when α is between 0.2 and 0.8, the overall classification accuracy remains high and stable, with the average accuracy reaching its highest value of 85.63% when α is 0.8, indicating that the effective fusion of multimodal information enhances the generalization ability of the model. In contrast, when relying solely on the prediction probability of the text prototype (i.e., β = 1), the classification accuracy decreases, which restricts the ability of the model. The main reason for this is that, in some fine-grained datasets, simple textual information lacks sufficient discriminative power to effectively distinguish between visually similar categories, making it difficult for the model to make accurate predictions. Therefore, only by balancing and optimizing the weight distribution of the different prototype features in the prediction probabilities for test images can we fully leverage the advantages of multimodal information and thereby improve the model’s robustness.
In the classification task, ProFusion uses the grid-searched hyperparameters α, β, and γ to regulate the contributions of the image prototype, text prototype, and multimodal feature fusion prototype to the final prediction probability. We attempted to replace the grid-searched hyperparameters with learnable parameters to reduce tuning time and computational overhead. However, the experimental results indicate that this approach performs poorly, failing to achieve the expected model performance. The primary reason is the limited number of training samples, which leads to overfitting. As shown in Figure 8d, in the 1-shot setting, the performance gap between the grid-searched hyperparameters (α, β, γ) and the learnable parameters reaches 8.93%. As the number of samples increases, this gap gradually narrows, indicating that the number of training samples plays a crucial role in learning these parameters. In few-shot image classification, however, the extremely limited number of training samples and the large number of learnable parameters make overfitting likely. Therefore, the grid-searched hyperparameters α, β, and γ are superior to learnable parameters and serve as the better choice in this setting.
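For completeness, the learnable-parameter variant we compare against can be sketched as below, where the three weights are parameterized through a softmax so that they remain positive and sum to one; the class name and initialization are ours, not the paper’s.

```python
import torch
import torch.nn as nn

class LearnableFusionWeights(nn.Module):
    """Learn alpha, beta, and gamma instead of grid-searching them."""

    def __init__(self):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(3))  # softmax of zeros gives equal initial weights

    def forward(self, p_img, p_txt, p_fused):
        # The softmax keeps the three weights positive and summing to one.
        alpha, beta, gamma = torch.softmax(self.logits, dim=0)
        return alpha * p_img + beta * p_txt + gamma * p_fused
```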