ProFusion: Multimodal Prototypical Networks for Few-Shot Learning with Feature Fusion
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
To enhance the clarity and scope of the introduction, please provide illustrative examples or contrasting characteristics to better differentiate between transfer learning, meta-learning, and multimodal learning.
To further contextualise the contribution of ProFusion, consider adding a concise paragraph that summarises the limitations or gaps present in prior work.
For improved replicability, please provide more concrete details regarding the architecture of the fusion module, such as the number of layers, attention heads, and hidden dimensions.
Could you clarify whether the fusion encoder is fine-tuned during the training process or kept entirely frozen?
Please provide the value range, default value, and intuition behind the temperature parameter τ introduced in Equation 5.
Further details on the lightweight adapter used for query image features would be beneficial. Specifically, could you describe its architecture and quantify how "lightweight" it is?
To better evaluate the significance of the reported results, please consider including measures of variance, confidence intervals, or statistical significance tests, especially when performance differences are marginal (e.g., <1%).
Please elaborate on the methodology used to tune the modality weights α, β, and γ for each dataset.
To improve reader understanding, consider incorporating additional visualisations or examples. For instance, a visual representation of the multimodal feature fusion process or an illustration of the alignment module's functionality with sample image and text prototypes would be valuable.
Adding a paragraph that discusses the limitations of the proposed approach to provide a more balanced perspective and suggest potential directions for future research can be considered.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
This manuscript proposes ProFusion, a novel few-shot learning approach that integrates multimodal pre-trained models with prototypical networks. The key innovation lies in constructing three types of prototypes (image, text, and multimodal fusion) and developing: (1) a fusion module to address cross-modal feature disparities, and (2) an alignment module to ensure consistency between modalities. The method demonstrates strong performance across 15 benchmarks. However, the manuscript must be improved to meet the publication standard.
1. The fusion module architecture needs more detailed description (e.g., layer configurations, attention mechanisms);
2. Mathematical notation could be more consistent (e.g., the parameter θ_v is typeset inconsistently across equations);
3. Visualizations (Figures 1 and 6) need better labelling and interpretation;
4. Error analysis of failure cases would strengthen results;
5. Discussion of limitations (e.g., performance variation across datasets) should be added;
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
The authors propose a few-shot learning model combining multimodal (image + text) prototypes using BEiT-3’s encoders and a fusion module to align cross-modal features.
Some comments next:
- The authors propose a "multimodal feature fusion" method, but the fusion module is not described in sufficient technical detail. The paper relies heavily on BEiT-3's existing fusion capabilities without clearly explaining how ProFusion's approach differs or improves upon them. The authors assume familiarity with BEiT-3's architecture, but a standalone description of the fusion module (e.g., equations, layer specifics, or ablation studies comparing fusion strategies) is missing. This obscures the novelty of their contribution.
- While the paper compares ProFusion to SOTA methods (e.g., Tip-Adapter, Proto-CLIP), it omits comparisons to other multimodal fusion techniques (e.g., BLIP, Flamingo) or recent few-shot methods (e.g., MaPLe, CALIP). Thus, the focus on CLIP-based baselines may exaggerate ProFusion's superiority. Broader comparisons would better contextualize its advances.
- The alignment module (Section 3.5) uses InfoNCE loss without theoretical grounding for why this is optimal for prototype alignment. Alternatives (e.g., Wasserstein distance, triplet loss) are not discussed. For this reviewer, the choice appears empirical rather than principled. A deeper analysis of loss functions could strengthen the method's foundation.
- The ablation on prototype types (Table 6) lacks statistical significance tests. The impact of fusion weights (α, β, γ) is analyzed only for UCF101 (Figure 7), raising questions about generalizability. For this reviewer, there is insufficient rigor in the ablation design that undermines claims about the necessity of each component.
- Figure 5 shows mixed results for augmentation (e.g., performance drops in 1-shot UCF101). Still, the authors dismiss this without investigation, so the analysis ignores potential overfitting risks or dataset-specific biases, weakening the robustness argument.
- OOD tests (Table 5) only use ImageNet variants, ignoring domain shifts (e.g., medical or satellite imagery). ProFusion’s gains on ImageNet-R/Sketch are marginal (≤1.68% over Proto-CLIP-F), yet are framed as significant. Thus, narrow OOD benchmarks inflate perceived generalizability. Cross-domain tests (e.g., medical→natural images) would be more persuasive.
- The paper highlights fixed weights (α, β, γ) as optimal but does not address scalability to diverse tasks. Learnable weights (Figure 7d) perform poorly, but no solutions are proposed. For this reviewer, this suggests ProFusion may lack adaptability to tasks where modality importance varies.
- t-SNE plots (Figure 6) are qualitative and lack quantitative metrics (e.g., cluster purity, separation scores), where visual evidence alone is insufficient to justify "strong discriminative capability."
- Training/testing times (Table 9) are reported but not compared to model size or FLOPs. BEiT-3-large’s overhead may dwarf gains over lighter models (e.g., CLIP), then the efficiency claims are unsubstantiated without hardware-agnostic metrics.
- The admitted weaknesses (e.g., sample bias, FGVC-Aircraft performance) lack actionable insights or future directions, so the authors avoid deeper critique (e.g., modality imbalance in fusion, scalability to >2 modalities).
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 4 Report
Comments and Suggestions for Authors
Here are some scientific weaknesses with justifications:
- The reported improvements in accuracy (e.g., 1-3% in several cases) may not be statistically significant given the variability in the datasets used. More rigorous statistical tests, such as significance testing, are necessary to confirm the reported gains.
- The models are evaluated on datasets that may have imbalanced classes (e.g., UCF101 or EuroSAT), which can skew the performance metrics. Precision, recall, and F1 scores should be used more extensively to account for such imbalances.
- The article compares accuracy across multiple methods but does not present other relevant metrics such as recall, precision, or F1-score, which are critical for comprehensive model evaluation.
- The model shows weaker performance on fine-grained classification tasks (e.g., FGVC Aircraft), suggesting that its performance may be overestimated for complex, nuanced tasks.
- The model's generalization ability is questionable, as it shows a significant drop in accuracy across some datasets. More robust testing on a wider range of tasks is needed.
- The comparisons with baseline models are shallow and fail to provide enough context for why the proposed model outperforms others beyond raw accuracy gains.
- While the model performs well, there is no discussion about its interpretability, which is essential for practical deployment, especially in sensitive applications like medical diagnosis.
- The reliance on pre-trained models (e.g., BEiT3 and CLIP) may limit the generalization ability to new, unseen data or domains. This should be addressed with more robust fine-tuning and domain adaptation strategies.
- The results show that text prototypes alone perform poorly, but the exact role of text data in multimodal tasks is not sufficiently clarified. Further analysis on the utility of textual features is needed.
- Performance varies significantly with different numbers of shots (e.g., 1-shot vs. 16-shot settings), which raises concerns about the model's stability and ability to perform well across different few-shot learning scenarios.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Round 2
Reviewer 2 Report
Comments and Suggestions for Authors
This manuscript proposes ProFusion, a novel few-shot learning approach that integrates multimodal pre-trained models with prototypical networks. After the first round of revision, the manuscript has been improved and meets the standard for publication.
Author Response
Comments: None
Response: Thank you very much for your positive comments and support. We sincerely appreciate your recognition of our work and are truly grateful for the valuable time and professional feedback you have provided despite your busy schedule.
Reviewer 3 Report
Comments and Suggestions for Authors
The new version of the proposed paper has been significantly improved over the first version. This second version addresses the concerns and questions raised on the first manuscript. The resubmitted article is well done and incorporates the latest information.
Author Response
Comments: None
Response: Thank you very much for your positive comments and support. We sincerely appreciate your recognition of our work and are truly grateful for the valuable time and professional feedback you have provided despite your busy schedule.
Reviewer 4 Report
Comments and Suggestions for Authors
Based on the responses provided to the reviewers, the authors have indeed made several improvements, including providing additional clarity, refining certain explanations, and addressing some of the concerns raised. Here is a breakdown of how these responses align with the points raised by the reviewer:
Review of Author Responses:
- Statistical Significance (Comment 1):
- The authors acknowledged the lack of statistical significance testing and agreed to include it in future work. However, they emphasize the consistency of results across multiple runs and justify the lack of statistical tests due to small differences in accuracy.
- Assessment: The response is reasonable but could be more convincing with the inclusion of some statistical analysis, even if preliminary.
- Class Imbalance (Comment 2):
- The authors have accepted the suggestion and performed additional experiments to account for class imbalance, particularly in the EuroSAT dataset, and provided a detailed analysis of the results.
- Assessment: The response is satisfactory as they adjusted their methodology and presented results showing improvements.
- Model Evaluation Metrics (Comment 3):
- The authors have incorporated additional evaluation metrics such as training time and inference time. They also conducted t-SNE-based feature visualization to complement the analysis.
- Assessment: This is a good addition, although more detailed analysis using recall, precision, and F1 score could be beneficial for a more comprehensive evaluation.
- Fine-grained Classification (Comment 4):
- The authors acknowledge that their model performs weakly on fine-grained datasets but provide comparative performance results, including a discussion of future improvements.
- Assessment: The response is valid, and the authors provide sufficient explanations for the observed performance drop.
- Generalization Ability (Comment 5):
- The authors performed additional experiments to assess the model's generalization ability and provided an analysis of performance variation across different datasets.
- Assessment: The response is appropriate, and the additional experiments help support their argument. However, more robust testing could further strengthen their case.
- Shallow Comparisons (Comment 6):
- The authors improved their comparison with baseline models, including a more in-depth analysis of the model’s framework and the fusion module.
- Assessment: The response is well thought out, and the authors provide a better context for comparing the proposed model with others.
- Interpretability (Comment 7):
- The authors have added a discussion on interpretability, especially in the context of medical applications.
- Assessment: This is a good addition, but a deeper discussion on how the model could be used in practical, real-world scenarios would be beneficial.
- Pre-trained Models (Comment 8):
- The authors discuss the limitations of using pre-trained models in few-shot learning and mention that they plan to explore more robust fine-tuning strategies.
- Assessment: The response is adequate, though more immediate strategies for addressing this limitation could be explored.
- Text Prototypes (Comment 9):
- The authors clarify the role of text prototypes and provide further analysis, especially regarding their performance on fine-grained datasets.
- Assessment: The authors adequately address the concerns raised and provide additional analysis.
- Shot Variability (Comment 10):
- The authors provide a reasonable explanation for the performance variation between 1-shot and 16-shot settings, emphasizing that this is common in few-shot learning.
- Assessment: The response is reasonable, and the explanation is sufficient to address the concern.
Suggestions for Further Improvements:
To improve the article and increase its likelihood of acceptance, here are four minor suggestions:
- Incorporate Statistical Significance Testing: Although the authors mention plans to include statistical tests in future work, adding preliminary tests or at least acknowledging the importance of this in the current manuscript could strengthen the results' reliability.
- Expand on the Use of Evaluation Metrics: While the authors have incorporated new metrics, they could include recall, precision, and F1-score for a more detailed assessment, especially in datasets with class imbalances.
- Enhance the Discussion on Generalization: The authors mention testing across 15 datasets, but a more detailed discussion on how to improve generalization would be valuable. Discussing specific challenges or insights learned from these datasets would help.
- Increase Focus on Interpretability: Since the model is intended for applications like medical diagnosis, more detailed analysis on how the interpretability can be implemented in practical applications (e.g., decision-making processes in healthcare) would strengthen this aspect of the paper.
Author Response
Comments 1: Incorporate Statistical Significance Testing: Although the authors mention plans to include statistical tests in future work, adding preliminary tests or at least acknowledging the importance of this in the current manuscript could strengthen the results' reliability.
Response 1: Thank you for pointing this out. We have not included statistical significance testing in the current work for three reasons. First, we reviewed representative studies in few-shot image classification, such as Tip-Adapter [1], Proto-CLIP [2], CaFo [3], GDA-CLIP [4], and TCP [5]; these works generally do not employ significance testing and instead compare methods directly on accuracy, which is common practice in multimodal few-shot image classification. To ensure a fair and consistent evaluation against existing methods, we follow the same criteria. Second, our experimental settings and results have been validated through multiple independent runs, and in most cases the performance improvement is substantial. Where the performance difference is relatively small, we believe it would not materially affect the overall conclusions under the current experimental design and parameter settings, so we chose not to introduce significance testing in order to keep the experimental results clear and readable. Lastly, rigorous significance testing requires a solid foundation in mathematics and statistics, an area in which our team currently has limited expertise; to ensure the reliability of the reported comparisons, we have decided not to include such analyses at this stage. Once again, thank you for your valuable comments.
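For illustration, a preliminary analysis of the kind the reviewer suggests could be carried out along the following lines. This is only a sketch, assuming NumPy and SciPy are available; the per-seed accuracies are simulated placeholders, not results from the paper.

```python
# Sketch only: confidence intervals and a paired t-test over per-seed accuracies.
# The values below are simulated placeholders, not results reported in the paper.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
profusion_runs = rng.normal(loc=0.80, scale=0.01, size=5)   # placeholder per-seed accuracies
baseline_runs  = rng.normal(loc=0.79, scale=0.01, size=5)   # placeholder per-seed accuracies

def mean_ci(acc, confidence=0.95):
    """Mean accuracy with a t-distribution confidence interval."""
    m, se = acc.mean(), stats.sem(acc)
    h = se * stats.t.ppf((1 + confidence) / 2, len(acc) - 1)
    return m, m - h, m + h

# Paired test, since both methods are evaluated on the same seeds/splits.
t_stat, p_value = stats.ttest_rel(profusion_runs, baseline_runs)
print("ProFusion:", mean_ci(profusion_runs))
print("Baseline: ", mean_ci(baseline_runs))
print(f"paired t-test: t={t_stat:.3f}, p={p_value:.3f}")
```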
Comments 2: Expand on the Use of Evaluation Metrics: While the authors have incorporated new metrics, they could include recall, precision, and F1-score for a more detailed assessment, especially in datasets with class imbalances.
Response 2: Thank you for pointing this out. We have supplemented the original metrics with Precision, Recall, and F1-score, as shown in Table 1 below. These metrics can be computed with either "macro" or "micro" averaging. In the micro approach, the metrics are calculated globally by summing the True Positives (TP), False Positives (FP), and False Negatives (FN) across all classes; consequently, in a single-label multi-class task, Accuracy is equal to Micro-Precision, Micro-Recall, and Micro-F1-score. To ensure fairness and simplicity of comparison, previous mainstream methods such as Tip-Adapter [1], Proto-CLIP [2], CaFo [3], GDA-CLIP [4], and TCP [5] all report Accuracy, and we follow the same practice so that our results remain directly comparable. In response to the reviewer’s suggestion, we have further added the Macro-Precision, Macro-Recall, and Macro-F1-score of the training-free ProFusion model on 11 datasets, as shown in Table 1 below. Since previous mainstream methods do not report Precision, Recall, or F1-score, we do not include them in this comparison in the paper. In addition to the accuracy-based comparison, we also conducted other forms of evaluation in the paper, including a performance difference analysis (Table 9, Page 19, Paragraph 2, Lines 538–549), visualization of prototype features (Figure 6, Pages 17–18, Paragraph 2, Lines 495–510), and a visual comparison of image and text prototypes before and after fine-tuning (Figure 7, Page 18, Paragraph 2, Lines 511–521). Once again, thank you for your valuable comments.
Table 1. Comparison of Accuracy, Precision, Recall, and F1-Score across 11 datasets (%).

| Metric | IN | FGVC | Pets | Cars | Euro | Cal | SUN | DTD | Flowers | Food | UCF |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1-shot | | | | | | | | | | | |
| Accuracy | 72.63 | 23.82 | 90.22 | 83.08 | 65.76 | 96.02 | 72.13 | 62.41 | 85.55 | 85.42 | 72.03 |
| Micro-Precision | 72.63 | 23.82 | 90.22 | 83.08 | 65.76 | 96.02 | 72.13 | 62.41 | 85.55 | 85.42 | 72.03 |
| Micro-Recall | 72.63 | 23.82 | 90.22 | 83.08 | 65.76 | 96.02 | 72.13 | 62.41 | 85.55 | 85.42 | 72.03 |
| Micro-F1 | 72.63 | 23.82 | 90.22 | 83.08 | 65.76 | 96.02 | 72.13 | 62.41 | 85.55 | 85.42 | 72.03 |
| Macro-Precision | 74.65 | 26.53 | 90.80 | 84.46 | 69.36 | 95.36 | 74.65 | 64.73 | 84.90 | 86.64 | 75.81 |
| Macro-Recall | 72.63 | 23.85 | 90.15 | 83.03 | 66.96 | 94.85 | 72.13 | 62.41 | 83.77 | 85.42 | 72.15 |
| Macro-F1 | 72.00 | 21.20 | 90.04 | 82.46 | 64.46 | 94.18 | 71.77 | 61.34 | 80.82 | 85.47 | 70.44 |
| 2-shot | | | | | | | | | | | |
| Accuracy | 73.32 | 29.85 | 90.19 | 84.28 | 65.22 | 96.80 | 74.59 | 65.78 | 90.37 | 85.70 | 78.22 |
| Micro-Precision | 73.32 | 29.85 | 90.19 | 84.28 | 65.22 | 96.80 | 74.59 | 65.78 | 90.37 | 85.70 | 78.22 |
| Micro-Recall | 73.32 | 29.85 | 90.19 | 84.28 | 65.22 | 96.80 | 74.59 | 65.78 | 90.37 | 85.70 | 78.22 |
| Micro-F1 | 73.32 | 29.85 | 90.19 | 84.28 | 65.22 | 96.80 | 74.59 | 65.78 | 90.37 | 85.70 | 78.22 |
| Macro-Precision | 75.13 | 34.11 | 90.67 | 85.63 | 69.23 | 95.84 | 75.95 | 67.29 | 91.39 | 86.79 | 80.15 |
| Macro-Recall | 73.32 | 29.81 | 90.14 | 84.20 | 66.40 | 95.29 | 74.59 | 65.78 | 90.82 | 85.70 | 77.82 |
| Macro-F1 | 72.71 | 27.16 | 90.03 | 83.53 | 63.00 | 95.01 | 73.92 | 64.16 | 89.48 | 85.78 | 77.26 |
| 4-shot | | | | | | | | | | | |
| Accuracy | 74.13 | 30.87 | 90.87 | 85.07 | 73.09 | 97.28 | 76.61 | 69.50 | 94.07 | 85.97 | 80.70 |
| Micro-Precision | 74.13 | 30.87 | 90.87 | 85.07 | 73.09 | 97.28 | 76.61 | 69.50 | 94.07 | 85.97 | 80.70 |
| Micro-Recall | 74.13 | 30.87 | 90.87 | 85.07 | 73.09 | 97.28 | 76.61 | 69.50 | 94.07 | 85.97 | 80.70 |
| Micro-F1 | 74.13 | 30.87 | 90.87 | 85.07 | 73.09 | 97.28 | 76.61 | 69.50 | 94.07 | 85.97 | 80.70 |
| Macro-Precision | 75.98 | 36.15 | 91.38 | 86.50 | 77.65 | 96.37 | 77.87 | 71.31 | 93.93 | 86.90 | 81.38 |
| Macro-Recall | 74.13 | 30.87 | 90.85 | 85.00 | 73.91 | 96.12 | 76.61 | 69.50 | 93.92 | 85.97 | 80.28 |
| Macro-F1 | 73.62 | 28.58 | 90.69 | 84.45 | 72.44 | 95.88 | 76.27 | 68.33 | 93.05 | 86.04 | 78.88 |
| 8-shot | | | | | | | | | | | |
| Accuracy | 75.24 | 34.23 | 92.04 | 86.71 | 75.16 | 96.91 | 78.33 | 72.93 | 94.72 | 86.59 | 83.61 |
| Micro-Precision | 75.24 | 34.23 | 92.04 | 86.71 | 75.16 | 96.91 | 78.33 | 72.93 | 94.72 | 86.59 | 83.61 |
| Micro-Recall | 75.24 | 34.23 | 92.04 | 86.71 | 75.16 | 96.91 | 78.33 | 72.93 | 94.72 | 86.59 | 83.61 |
| Micro-F1 | 75.24 | 34.23 | 92.04 | 86.71 | 75.16 | 96.91 | 78.33 | 72.93 | 94.72 | 86.59 | 83.61 |
| Macro-Precision | 76.78 | 39.47 | 92.34 | 87.51 | 77.10 | 95.87 | 79.37 | 74.84 | 95.14 | 87.31 | 84.27 |
| Macro-Recall | 75.24 | 34.22 | 91.99 | 86.70 | 76.13 | 95.79 | 78.33 | 72.93 | 94.75 | 86.59 | 83.18 |
| Macro-F1 | 74.68 | 32.65 | 91.97 | 86.35 | 73.81 | 95.51 | 78.04 | 72.01 | 94.18 | 86.60 | 82.79 |
| 16-shot | | | | | | | | | | | |
| Accuracy | 76.18 | 37.78 | 92.04 | 87.68 | 74.46 | 97.16 | 78.82 | 74.05 | 95.45 | 86.90 | 84.56 |
| Micro-Precision | 76.18 | 37.78 | 92.04 | 87.68 | 74.46 | 97.16 | 78.82 | 74.05 | 95.45 | 86.90 | 84.56 |
| Micro-Recall | 76.18 | 37.78 | 92.04 | 87.68 | 74.46 | 97.16 | 78.82 | 74.05 | 95.45 | 86.90 | 84.56 |
| Micro-F1 | 76.18 | 37.78 | 92.04 | 87.68 | 74.46 | 97.16 | 78.82 | 74.05 | 95.45 | 86.90 | 84.56 |
| Macro-Precision | 77.75 | 41.49 | 92.55 | 88.23 | 77.79 | 96.42 | 79.40 | 75.04 | 95.55 | 87.46 | 84.70 |
| Macro-Recall | 76.18 | 37.83 | 92.02 | 87.69 | 75.33 | 96.03 | 78.82 | 74.05 | 95.85 | 86.90 | 84.00 |
| Macro-F1 | 75.71 | 35.69 | 91.87 | 87.27 | 73.23 | 95.71 | 78.52 | 73.44 | 95.27 | 86.95 | 83.41 |
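For reference, the micro/macro relationship noted in Response 2 can be verified with a few lines of scikit-learn. This is an illustrative sketch with toy labels chosen only to show the computation, not the evaluation code or data used in the paper.

```python
# Illustrative check: in single-label multi-class classification, micro-averaged
# precision/recall/F1 coincide with accuracy, while macro averaging treats every
# class equally. Toy labels only; not the evaluation code used in the paper.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [0, 1, 2, 2, 1, 0, 2, 1]   # toy ground-truth class indices
y_pred = [0, 1, 2, 1, 1, 0, 2, 2]   # toy predicted class indices

acc = accuracy_score(y_true, y_pred)
micro = precision_recall_fscore_support(y_true, y_pred, average="micro")[:3]
macro = precision_recall_fscore_support(y_true, y_pred, average="macro")[:3]

assert all(abs(acc - m) < 1e-12 for m in micro)   # micro P/R/F1 all equal accuracy
print(f"accuracy={acc:.4f}, macro-P/R/F1={tuple(round(m, 4) for m in macro)}")
```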
Comments 3: Enhance the Discussion on Generalization: The authors mention testing across 15 datasets, but a more detailed discussion on how to improve generalization would be valuable. Discussing specific challenges or insights learned from these datasets would help.
Response 3: Thank you for pointing this out. We agree with this suggestion. Therefore, we have additionally included an analysis of the model's generalization capability. First, we elaborated on how incorporating the Prototypical Network approach enhances generalization performance (Pages 5–6, Paragraph 3, Lines 192–198). We also provided a detailed explanation of the reasons behind the model's improved generalization, along with an analysis of its performance variations across different datasets (Pages 10–12, Paragraph 5, Lines 359–375). Second, in the ablation study presented in Table 6, we investigated different prototype combinations and further analyzed the impact of these combinations and textual information on the improvement of model generalization (Page 14, Paragraph 2, Lines 415–439). Additionally, in the hyperparameter analysis shown in Figure 8, we examined the effect of different weight configurations on generalization by comparing the average accuracy across multiple datasets under identical weight settings (Pages 19–20, Paragraph 3, Lines 550–568). Once again, thank you for your valuable comments.
Specifically, we analyze the enhancement of model generalization from the following perspectives. First, given the scarcity of image data in few-shot image classification tasks, we incorporate textual information from class names, taking a multimodal perspective, to alleviate the insufficiency of visual information and thereby improve the model's generalization capability. Second, because the number of training samples is limited, traditional image classification methods are not well suited to few-shot scenarios. To address this, we draw inspiration from Prototypical Networks and construct image prototypes, text prototypes, and multimodal fusion prototypes in the feature space as multiple class anchors; classification is then performed based on the similarity between the test images and these prototypes. Finally, considering that each class typically contains only 1 to 16 samples, we adopt the BEiT-3 pre-trained model as the backbone network to extract features from both image and text information, further enhancing the model's generalization performance and avoiding the overfitting that arises from training a feature extractor from scratch.
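As a rough illustration only (not the authors' released implementation), the following PyTorch sketch mirrors the prototype-based decision rule described above: support features are averaged into per-class prototypes, and a query is scored by a weighted combination of its similarities to the image, text, and fusion prototypes. The random tensors stand in for BEiT-3 and fusion-module outputs, and the helper name and the weights α, β, γ are placeholders.

```python
# Minimal sketch of prototype-based classification with three prototype types.
# Random tensors stand in for BEiT-3 image/text features and fusion-module outputs.
import torch
import torch.nn.functional as F

def class_prototypes(features, labels, num_classes):
    """Average the support features of each class into a single normalized prototype."""
    protos = torch.stack([features[labels == c].mean(dim=0) for c in range(num_classes)])
    return F.normalize(protos, dim=-1)

num_classes, dim, shots = 5, 768, 4
support_img  = torch.randn(num_classes * shots, dim)               # image features of support set
support_fuse = torch.randn(num_classes * shots, dim)               # fusion-module outputs (stand-ins)
support_lbl  = torch.arange(num_classes).repeat_interleave(shots)  # class index of each support sample
text_proto   = F.normalize(torch.randn(num_classes, dim), dim=-1)  # one text embedding per class name

img_proto  = class_prototypes(support_img,  support_lbl, num_classes)
fuse_proto = class_prototypes(support_fuse, support_lbl, num_classes)

query = F.normalize(torch.randn(1, dim), dim=-1)   # feature of one test image
alpha, beta, gamma = 1.0, 0.5, 1.0                 # modality weights (placeholder values)

# Weighted cosine similarities to the three prototype sets; predict the closest class.
logits = alpha * query @ img_proto.T + beta * query @ text_proto.T + gamma * query @ fuse_proto.T
print("predicted class:", logits.argmax(dim=-1).item())
```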
Comments 4: Increase Focus on Interpretability: Since the model is intended for applications like medical diagnosis, more detailed analysis on how the interpretability can be implemented in practical applications (e.g., decision-making processes in healthcare) would strengthen this aspect of the paper.
Response 4: Thank you for pointing this out. The mainstream methods in the field of few-shot image classification, such as Tip-Adapter [1], Proto-CLIP [2], CaFo [3], GDA-CLIP [4], and TCP [5], all use general natural image datasets and do not involve medical images or high-risk application scenarios like medical diagnosis. This is mainly because training samples for medical images are often scarce under few-shot conditions, and relying on just 1 to 16 samples per class may lead to poor performance. Therefore, we followed the experimental settings of previous mainstream few-shot learning methods and did not specifically design experiments with medical image data. As a result, our current work does not address interpretability designs directly related to medical scenarios.
Regarding the interpretability of the model, our response is as follows. We primarily rely on multimodal pre-trained models and Prototypical Networks to address the few-shot image classification problem. Specifically, we first extract features from both image and text information using the multimodal pre-trained model and construct image prototypes and text prototypes. We then fuse the image and text information through a fusion module and build multimodal feature fusion prototypes. These three types of prototypes serve as anchors for the categories: during classification, we calculate the distance between the features of the test image and the prototype features of each category and assign the test image to a category according to the minimum-distance principle. Meanwhile, in Figure 6 (Page 17), we employed t-SNE to reduce the dimensionality of prototype features and test image features, and we analyzed the visualization results (Pages 17–18, Paragraph 2, Lines 495–510) to further elucidate the decision-making basis and interpretability of the model when classifying test images. Once again, thank you for your valuable comments.
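For completeness, a small sketch of how such a projection can be produced is given below, assuming scikit-learn and matplotlib are available; the random arrays stand in for the actual prototype and test-image embeddings used in Figure 6.

```python
# Illustrative t-SNE projection of prototype and test-image features.
# Random arrays stand in for the real embeddings; figure styling is arbitrary.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
prototypes = rng.normal(size=(10, 768))    # e.g. 10 class prototypes (stand-ins)
queries    = rng.normal(size=(200, 768))   # e.g. 200 test-image features (stand-ins)

points = TSNE(n_components=2, perplexity=30, init="pca",
              random_state=0).fit_transform(np.vstack([prototypes, queries]))

plt.scatter(points[10:, 0], points[10:, 1], s=8, alpha=0.5, label="test features")
plt.scatter(points[:10, 0], points[:10, 1], marker="*", s=200, label="prototypes")
plt.legend()
plt.savefig("tsne_prototypes.png", dpi=200)
```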
References:
[1] Zhang, R.; Zhang, W.; Fang, R.; et al. Tip-Adapter: Training-Free Adaption of CLIP for Few-Shot Classification.
[2] P, J.; Palanisamy, K.; Chao, Y.-W.; et al. Proto-CLIP: Vision-Language Prototypical Network for Few-Shot Learning. 2023.
[3] Zhang, R.; Hu, X.; Li, B.; et al. Prompt, Generate, Then Cache: Cascade of Foundation Models Makes Strong Few-Shot Learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
[4] Wang, Z.; Liang, J.; Sheng, L.; et al. A Hard-to-Beat Baseline for Training-Free CLIP-Based Adaptation.
[5] Yao, H.; Zhang, R.; Xu, C. TCP: Textual-Based Class-Aware Prompt Tuning for Visual-Language Model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
Author Response File: Author Response.pdf