Background-Enhanced Visual Prompting Transformer for Generalized Few-Shot Semantic Segmentation
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
- Quantify the experimental results in the abstract by supplementing specific improvement values of key metrics to enhance persuasiveness.
- In the experiments, to validate the advancement of the proposed method, please add more comparisons with papers proposed in 2024.
- The design of HCAM is one of the core innovations of this paper, but a more detailed technical explanation is needed. For example, how does HCAM avoid interference from base class prompts on new class prompts?
- Is it necessary to verify the role of HCAM in the ablation experiments?
- The computational cost of the method (such as training time, memory usage) is not discussed in this paper.
Author Response
Comments 1: Quantify the experimental results in the abstract by supplementing specific improvement values of key metrics to enhance persuasiveness.
Response 1: Thank you for your advice! We have quantified the experimental results in the abstract by supplementing specific improvement values of key metrics. We have highlighted the changes in red.
Comments 2: In the experiments, to validate the advancement of the proposed method, please add more comparisons with papers proposed in 2024.
Response 2: Thank you for your advice! We have added more comparisons with papers published in 2024 (all from top venues such as NeurIPS and IJCV). In addition, we have added a GFSS paper from the remote sensing field to further strengthen the comparison. The above changes have been highlighted in Table 1 in red.
Comments 3: The design of HCAM is one of the core innovations of this paper, but a more detailed technical explanation is needed. For example, how does HCAM avoid interference from base class prompts on new class prompts?
Response 3: Thank you for pointing this out. Here is the explanation. In this paper, HCAM plays two main roles: (1) letting base prompts teach novel prompts how to query class information from multi-scale visual features, and (2) passing potential novel-class information from the background prompts into the novel prompts. In our design, base prompts are therefore intended to influence the novel prompts rather than to be isolated from them. We have added a more detailed technical explanation in Section 4.2, and the change is highlighted in red.
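For the reviewer's convenience, the following is a minimal sketch of the one-way information flow described above (illustrative PyTorch code, not the actual HCAM implementation; all names, prompt counts, and dimensions are assumptions):

```python
# Illustrative sketch only (not the authors' HCAM code): novel prompts attend to
# base and background prompts, so information flows one way into the novel prompts.
import torch
import torch.nn as nn

d_model, n_heads = 256, 8
cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

base_prompts  = torch.randn(2, 15, d_model)  # prompts learned during base pre-training
bg_prompts    = torch.randn(2, 1,  d_model)  # background prompt holding latent novel-class cues
novel_prompts = torch.randn(2, 5,  d_model)  # prompts to be learned during fine-tuning

# Base/background prompts serve only as keys/values, so they guide the novel
# prompts without being modified by them.
context = torch.cat([base_prompts, bg_prompts], dim=1)
updated_novel, _ = cross_attn(novel_prompts, context, context)
```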
Comments 4: Is it necessary to verify the role of HCAM in the ablation experiments?
Response 4: Thank you for pointing this out. It is necessary to verify the role of HCAM in the ablation experiments. We have modified Section 5.5.2 to detail the HCAM ablation studies, and the change is highlighted in red.
Comments 5: The computational cost of the method (such as training time, memory usage) is not discussed in this paper.
Response 5: Thank you for your advice! We have added the computational cost of the method in Section 6: Discussion, and the change is highlighted in red.
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
The paper introduces background visual prompts to improve novel class segmentation while preserving base class performance. The key innovation is that the prompts leverage potential novel class information hidden in background regions during base-class training. However, some problems and concerns could be addressed.
- The related work section is well-structured but could be better if it highlights prior works' limitations. Instead of just listing methods, emphasize how existing studies fail to address the challenges Beh-VPT overcomes.
- The reason for freezing certain model parameters while fine-tuning others should be justified more clearly. Why were only specific layers chosen for Singular Value Fine-Tuning (SVF)?
- The qualitative results (Figure 5) demonstrate the effectiveness of Beh-VPT, but some cases where it fails compared to prior methods would provide a more balanced discussion.
Author Response
Comments 1: The related work section is well-structured but could be better if it highlights prior works' limitations. Instead of just listing methods, emphasize how existing studies fail to address the challenges Beh-VPT overcomes.
Response 1: Thank you for your advice! We have highlighted prior works' limitations and emphasized how existing studies fail to address the challenges Beh-VPT overcomes in Sections 2.2 and 2.3. Changes are highlighted in blue.
Comments 2: The reason for freezing certain model parameters while fine-tuning others should be justified more clearly. Why were only specific layers chosen for Singular Value Fine-Tuning (SVF)?
Response 2: Thank you for pointing this out. We chose layers 2, 3, and 4 for SVF through comparative experiments with different combinations of layers. We have added more detailed ablation experiments and explanations in Section 5.5.4, and the change is highlighted in blue.
Comments 3: The qualitative results (Figure 5) demonstrate the effectiveness of Beh-VPT, but some cases where it fails compared to prior methods would provide a more balanced discussion.
Response 3: Thank you for your advice! Actually, the last row of the qualitative results (Figure 5) shows a failure case in which our method performs worse than prior methods. We have expanded the explanation of this failure case in the text preceding Figure 5, and the change is highlighted in blue. Thanks again for your advice.
Author Response File: Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
The manuscript presents a novel approach called Background-enhanced Visual Prompting Transformer (Beh-VPT) for Generalized Few-shot Semantic Segmentation (GFSS). The authors introduce background visual prompts to capture potential novel class information during base class pre-training and transfer this knowledge to novel class fine-tuning via a Hybrid Causal Attention Module (HCAM). The authors also propose a background-enhanced segmentation head and Singular Value Fine-tuning (SVF) to enhance the model's ability to learn novel classes while preventing overfitting. The results presented in the manuscript achieve state-of-the-art performance on the PASCAL-5i and COCO-20i datasets in both 1-shot and 5-shot settings.
The manuscript presents novel and interesting research results, and it is well structured and written. I have a few questions and remarks for potential improvements:
- The manuscript focuses on non-meta-learning paradigms, but it would be interesting to compare the proposed method with meta-learning-based approaches, especially in terms of generalization and robustness.
- The manuscript highlights the performance improvements, but the computational cost or complexity of the proposed method is not discussed. Given the additional modules (HCAM, background-enhanced segmentation head and SVF), it would be beneficial to add an analysis on the model's efficiency, especially in terms of training time and memory usage.
- The ablation study of the manuscript focuses on the background visual prompts and the segmentation head; however, it lacks a detailed analysis of HCAM's impact. A more thorough ablation study on HCAM would provide deeper insights into its contribution to the overall performance.
- Did the authors perform any cross-dataset generalization tests to check if the model transfers well beyond the PASCAL-5i and COCO-20i datasets?
- How does the model handle cases where the background contains highly diverse elements that may not be relevant to novel classes?
- How sensitive is the HCAM attention mechanism to different datasets? Would it work well for open-world segmentation?
Author Response
Comments 1: The manuscript focuses on non meta-learning paradigms, but it would be interesting to compare the proposed method with meta-learning-based approaches, especially in terms of generalization and robustness.
Response 1: Thank you for your advice! We have added comparative results with some classic or up-to-date meta-learning-based FSS approaches. The experimental results further demonstrate the effectiveness of our approach. Changes have been added to Table 1, Table 2 and the text immediately following Table 2. We highlight changes in green.
Comments 2: The manuscript highlights the performance improvements, but the computational cost or complexity of the proposed method is not discussed. Given the additional modules (HCAM, background-enhanced segmentation head and SVF), it would be beneficial to add an analysis on the model's efficiency, especially in terms of training time and memory usage.
Response 2: Thank you for your advice! We have added the computational cost of the method in Section 6: Discussion. Since the first reviewer also made this suggestion, we highlight it in red.
Comments 3: The ablation study of the manuscript focuses on the background visual prompts and the segmentation head, however lacks a detailed analysis of the HCAM's impact. A more thorough ablation study on the HCAM would provide deeper insights into its contribution to the overall performance.
Response 3: Thank you for pointing this out. We have modified Section 5.5.2 to detail the HCAM ablation studies. Since the first reviewer also made this suggestion, we highlight it in red.
Comments 4: Did the authors perform any cross-dataset generalization tests to check if the model transfers well beyond the PASCAL-5i and COCO-20i datasets?
Response 4: Thank you for pointing this out; this is a very good suggestion. However, we have only validated our model on the classic FSS benchmarks at the moment. In the future, we plan to test our model on FSS benchmarks from other fields, such as remote sensing and medical imaging. We have added this future optimization direction in Section 6, and we highlight it in green.
Comments 5: How does the model handle cases where the background contains highly diverse elements that may not be relevant to novel classes?
Response 5: During base class pre-training, we train the background prompts by treating all image regions except the base class foreground as background ground truth. As a result, the background prompts contain information about the novel classes as well as about the highly diverse elements (i.e., the true background). During novel class fine-tuning, the attention mechanism in HCAM selectively conveys information from the background prompts to the novel class prompts, thus avoiding interference from highly diverse elements on the novel classes. In summary, the model handles this situation in a learning-based way, relying on the powerful learning and representational capabilities of the attention mechanism.
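As an illustration of the supervision described above, here is a minimal sketch (a hypothetical helper, not taken from the manuscript; the function name, arguments, and class-ID convention are assumptions):

```python
# Hypothetical helper showing the supervision idea: during base-class pre-training,
# every pixel that is NOT base-class foreground is treated as background ground
# truth, so latent novel-class regions fall into the target that supervises the
# background prompt.
import torch

def background_target(label_map: torch.Tensor, base_class_ids: list) -> torch.Tensor:
    is_base = torch.zeros_like(label_map, dtype=torch.bool)
    for c in base_class_ids:
        is_base |= (label_map == c)
    return (~is_base).float()  # 1 = background (true background + unlabeled novel objects)
```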
Comments 6: How sensitive is the HCAM attention mechanism to different datasets? Would it work well for open-world segmentation?
Response 6: This is a very interesting question. In our experiments, HCAM showed nearly the same ability on the two classic benchmarks. Open-world segmentation (OWS) is designed to address scenarios where models must continuously adapt to unknown categories in dynamic real-world environments. If HCAM were applied to OWS, there would be no explicit supervision signal to guide HCAM in segmenting the unseen classes (our novel class fine-tuning provides this supervision). Therefore, simply transferring HCAM to OWS would probably not work well. Nonetheless, it is still worth investigating how HCAM could be combined with classic OWS techniques (e.g., distillation or adversarial learning). Thank you for pointing this out; it is a valuable insight for us.
Author Response File: Author Response.pdf
Reviewer 4 Report
Comments and Suggestions for Authors
The article presents a novel approach to Background-enhanced Visual Prompting Transformer for Generalized Few-shot Semantic Segmentation. The article is structured well and rich in terms of experiments. However, the following needs to be addressed.
Abstract: Kindly revise the abstract grammatically to increase the clarity of the approach. For instance, the sentence, "Additionally, we further propose..." should be either, "Additionally, we propose... " or "We further propose.."
Also include the figure of merits achieved such as IoU etc. (quantitative results) in contrast to state-of-the-art.
Introduction: Kindly change the title of Figure 1 to "Percentage of novel classes in each base-fold image (Courtesy of [18])."
Related works: Title of section should be “Related work.” Moreover, it is better to provide a summary of related work at the end of the section, highlighting the research gap.
Problem definition: It is good to provide the mathematical description of both FSS and GFSS.
Methods: In the methodology section, kindly provide a diagram/flowchart/sequence diagram of the proposed methodology for the readers.
It is recommended to cite the equations in the text to help readers understand their application.
Experiments and Discussion: Kindly add a subsection “Discussion” and enlist the potential limitations of the proposed study and future approaches to mitigate them.
Conclusion: Kindly add future dimensions of the study.
Comments on the Quality of English Language
Minor revisions are required.
Author Response
Comments 1:
Abstract: Kindly revise the abstract grammatically to increase the clarity of the approach. For instance, the sentence, "Additionally, we further propose..." should be either, "Additionally, we propose... " or "We further propose.."
Also include the figure of merits achieved such as IoU etc. (quantitative results) in contrast to state-of-the-art.
Response 1: Thank you for pointing out this grammatical error! We have corrected it in the Abstract and highlighted the change in purple. In addition, we have added the figures of merit achieved (e.g., IoU) in contrast to the state of the art. Since the first reviewer also made this suggestion, we highlight it in red.
Comments 2:
Introduction: Kindly change the title of Figure 1 to "Percentage of novel classes in each base-fold image (Courtesy of [18])."
Response 2: Thank you for pointing this out. We've changed the title of Figure 1, and we highlight it in purple.
Comments 3:
Related works: Title of section should be “Related work.” Moreover, it is better to provide a summary of related work at the end of the section, highlighting the research gap.
Response 3: Thank you for pointing this out. We have changed the title of Section 2. In addition, we have provided a summary of related work at the end of the section. Since the second reviewer also made this suggestion, we highlight it in blue.
Comments 4:
Problem definition: It is good to provide the mathematical description of both FSS and GFSS.
Response 4: Thank you for your advice. We have added a detailed mathematical description of both FSS and GFSS in Section 3. The changes are highlighted in purple.
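For context, the distinction between the two settings can be sketched as follows (a sketch of the standard definitions, not a quotation from the revised manuscript):

```latex
% Standard FSS: given a K-shot support set for the novel classes, the model
% segments only the novel classes (plus background) on the query image.
% GFSS: after base-class pre-training and K-shot novel-class fine-tuning,
% every pixel x of a test image must be labeled over the joint space
\[
  \hat{y}(x) \in \mathcal{C}_{\mathrm{base}} \cup \mathcal{C}_{\mathrm{novel}} \cup \{c_{\mathrm{bg}}\},
  \qquad \mathcal{C}_{\mathrm{base}} \cap \mathcal{C}_{\mathrm{novel}} = \emptyset .
\]
```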
Comments 5:
Methods: In the methodology section, kindly provide a diagram/flowchart/sequence diagram of the proposed methodology for the readers.
It is recommended to cite the equations in the text to help readers understand their application.
Response 5: Thank you for your advice. The flowchart of the proposed method has been added to the paper. Specifically, Figure 2 shows the flowchart of our model during base class pre-training. After base class pre-training, Figure 3 shows the flowchart of our model during novel class fine-tuning.
We have cited the equations in the text preceding Equations 1, 2, 4, 5 and 6, to help readers understand their application. The changes are highlighted in purple.
Comments 6:
Experiments and Discussion: Kindly add a subsection “Discussion” and enlist the potential limitations of the proposed study and future approaches to mitigate them.
Response 6: Thank you for your advice. We have added a section “Discussion”. The changes are highlighted.
Comments 7:
Conclusion: Kindly add future dimensions of the study.
Response 7: Thank you for your advice. We have added future dimensions of the study. The changes are highlighted in purple.
Author Response File: Author Response.pdf
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
The paper has been revised and has been greatly improved. It is acceptable when the following points are noted.
1. Tables 1-9 do not have units and should ideally be the same as Table 10. It would be better to explain in the tabular interpretation of table 1, so that repetition can be avoided.
2. Reference formats need to be standardized. It is better to add some updated literature such as: “Network and Dataset for Multiscale Remote Sensing Image Change Detection”; “Full-scale Change Detection Network for Remote Sensing Images Based on Deep Feature Fusion”
Author Response
Comments 1: Tables 1-9 do not have units and should ideally be the same as Table 10. It would be better to explain in the tabular interpretation of table 1, so that repetition can be avoided.
Response 1: Thank you for pointing this out. We have added the explanation in the tabular interpretation of Table 1.
Comments 2: Reference formats need to be standardized. It is better to add some updated literature such as: “Network and Dataset for Multiscale Remote Sensing Image Change Detection”; “Full-scale Change Detection Network for Remote Sensing Images Based on Deep Feature Fusion”
Response 2: Thank you for your advice. We have standardized reference formats and have added some updated literature.