Bridging Vision Foundation and Vision–Language Models for Open-Vocabulary Semantic Segmentation of UAV Imagery
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
This study proposes HOSU, a hybrid OVSS framework that integrates distribution-aware fine-tuning, text-guided multi-level regularization, and masked feature consistency to inject DINOv2's fine-grained spatial priors into CLIP while maintaining CLIP-only inference. The approach is then validated across 4 training settings and 6 UAV benchmarks. The dataset is substantial and the experiments are comprehensive. To further enhance the quality of the manuscript, the following points should be addressed.
- The introduction of learnable proxy queries is a key design choice to avoid fine-tuning CLIP's weights directly. However, the motivation and advantage of this mechanism are not sufficiently elaborated.
- Figures 6-9 are valuable for visual comparison. However, the captions only list the datasets and methods. They would be more impactful with brief annotations pointing out specific improvements. For example, it would be beneficial to add annotations directly to the figures, or to provide a more detailed narrative in the captions to guide the reader's attention to the key visual evidence of HOSU's superiority.
- The ablation study in Table 4 effectively shows that each component contributes to the overall performance. However, it would be even more insightful to analyze which specific challenges each component addresses.
- The paper rightly highlights the advantage of a "CLIP-only inference" pipeline. However, there is no quantitative analysis of its efficiency compared to other methods that might use multiple heavy backbones during inference. Please include a comparison of model parameters (M), GFLOPs, and/or FPS against the main baselines.
- Furthermore, while the inference is efficient, the training process leverages CLIP and DINOv2 models. The potential computational overhead and memory footprint during training should be acknowledged and briefly discussed.
- There are a few formatting inconsistencies in the references. (a) With the exception of Reference [11], none of the other cited sources provide DOI numbers. (b) References 12 and 13 contain abbreviated conference names.
Author Response
Please see the attachment.
Author Response File:
Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
This study focuses on the field of open-vocabulary semantic segmentation (OVSS) for UAV imagery, which is both theoretically challenging and practically valuable. It innovatively proposes the HOSU hybrid framework, leveraging the powerful feature extraction capability of DINOv2 and the multimodal capability of CLIP to effectively break through the performance bottlenecks of traditional methods. This work provides a valuable technical solution for open-vocabulary segmentation of UAV imagery.
However, the following issues require further discussion:
The decoder adopts a segmentation head composed of dilated convolutions and an FFN for semantic segmentation. Have other types of segmentation heads been tested to evaluate their impact on model performance? It is recommended to conduct comparative experiments to verify the effectiveness of the current decoder.
Although the proposed method achieves improved performance on small categories, no comparison of mIoU between large and small categories is provided. This makes it impossible to determine whether the model is overfitting to large categories and whether the performance improvement on small categories is statistically significant, resulting in a lack of quantitative support for the method's effectiveness in addressing class imbalance.
The current HOSU framework only supports open-vocabulary semantic segmentation for RGB imagery. Can the data be extended to commonly used UAV imagery types such as multispectral imagery? Additionally, can the task scope be expanded to include object detection and instance segmentation for UAV imagery?
It is hoped that the research team can open-source the code and model weights to facilitate experiment reproduction and further research in this field.
Author Response
Please see the attachment.
Author Response File:
Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
A nice work. Some comments:
1) How do the image content and the type of remote sensing imaging affect the results?
2) What is the computational cost?
3) How well does the semantic segmentation generalize?
4) The advantages of OVSS should be strengthened further.
Author Response
Please see the attachment.
Author Response File:
Author Response.pdf
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
The questions have basically been addressed.
Reviewer 2 Report
Comments and Suggestions for Authors
My technical concerns have been fully clarified. As a supplementary suggestion, optimizing the typographic scale in Figures 8-11 would achieve better visual harmony with the document's layout.