Peer-Review Record

Bridging Vision Foundation and Vision–Language Models for Open-Vocabulary Semantic Segmentation of UAV Imagery

Remote Sens. 2025, 17(22), 3704; https://doi.org/10.3390/rs17223704
by Fan Li, Zhaoxiang Zhang, Xuanbin Wang, Xuan Wang and Yuelei Xu *
Reviewer 1:
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 1 October 2025 / Revised: 6 November 2025 / Accepted: 10 November 2025 / Published: 13 November 2025

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This study proposes HOSU, a hybrid OVSS framework that integrates distribution-aware fine-tuning, text-guided multi-level regularization, and masked feature consistency to inject DINOv2’s fine-grained spatial priors into CLIP, while maintaining CLIP-only inference. The method is then validated under four training settings on six UAV benchmarks. The dataset is substantial and the experiments are comprehensive. To further enhance the quality of the manuscript, the following points should be addressed.

  1. The introduction of learnable proxy queries is a key design choice to avoid fine-tuning CLIP's weights directly. However, the motivation and advantage of this mechanism are not sufficiently elaborated.
  2. Figures 6-9 are valuable for visual comparison. However, the captions only list the datasets and methods. They would be more impactful with brief annotations pointing out specific improvements. For example, it would be beneficial to add annotations directly to the figures or provide a more detailed narrative in the captions to guide the reader's attention to the key visual evidence of HOSU's superiority.
  3. The ablation study in Table 4 effectively shows that each component contributes to the overall performance. However, it would be even more insightful to analyze which specific challenges each component addresses.
  4. The paper rightly highlights the advantage of a "CLIP-only inference" pipeline. However, there is no quantitative analysis of its efficiency relative to methods that may use multiple heavy backbones during inference. Please include a comparison of model parameters (M), GFLOPs, and/or FPS against the main baselines (see the measurement sketch after this list).
  5. Furthermore, while the inference is efficient, the training process leverages both the CLIP and DINOv2 models. The potential computational overhead and memory footprint during training should be acknowledged and briefly discussed.
  6. There are a few inconsistent formats. (a) With the exception of Reference [11], none of the other cited sources provide DOI numbers. (b) References 12 and 13 contain abbreviated conference names.
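
Regarding point 4, a minimal sketch of how such an efficiency comparison might be collected in PyTorch follows. The model handle, input resolution, warmup, and iteration counts are illustrative assumptions, not the paper's benchmarking protocol; GFLOPs would require an external profiler and is noted but not invoked here.

```python
# Sketch only: parameter count (M) and FPS for a segmentation model.
# Input size, warmup, and iteration counts are illustrative assumptions.
import time
import torch

def count_params_m(model: torch.nn.Module) -> float:
    """Total parameter count in millions."""
    return sum(p.numel() for p in model.parameters()) / 1e6

@torch.no_grad()
def measure_fps(model: torch.nn.Module, input_size=(1, 3, 512, 512),
                warmup: int = 10, iters: int = 50) -> float:
    """Average forward-pass throughput in frames per second."""
    device = next(model.parameters()).device
    x = torch.randn(*input_size, device=device)
    model.eval()
    for _ in range(warmup):              # warm up kernels and caches
        model(x)
    if device.type == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    if device.type == "cuda":
        torch.cuda.synchronize()         # wait for queued GPU work to finish
    return iters * input_size[0] / (time.perf_counter() - start)

# GFLOPs would come from an external profiler (e.g., fvcore's
# FlopCountAnalysis); it is omitted here to keep the sketch dependency-free.
```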

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

This study focuses on open-vocabulary semantic segmentation (OVSS) for UAV imagery, a problem that is both theoretically challenging and practically valuable. It innovatively proposes the HOSU hybrid framework, leveraging the powerful feature extraction capability of DINOv2 and the multimodal capability of CLIP to break through the performance bottlenecks of traditional methods. This work provides a valuable technical solution for open-vocabulary segmentation of UAV imagery.
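
For context, the manuscript defines its own training losses; purely as illustration, a generic masked feature consistency objective of the kind summarized above, which aligns CLIP patch features to frozen DINOv2 features at masked positions, might look like the following sketch. The tensor shapes, masking scheme, and cosine objective are assumptions rather than the authors' formulation.

```python
# Illustrative only: a generic masked feature consistency loss aligning
# student (CLIP) patch features to frozen teacher (DINOv2) features at
# randomly masked token positions. All shapes and choices are assumptions.
import torch
import torch.nn.functional as F

def masked_feature_consistency(clip_feats: torch.Tensor,
                               dino_feats: torch.Tensor,
                               mask_ratio: float = 0.5) -> torch.Tensor:
    """clip_feats, dino_feats: (B, N, C) patch tokens on a shared grid."""
    b, n, _ = clip_feats.shape
    # Sample a Bernoulli mask over token positions.
    mask = torch.rand(b, n, device=clip_feats.device) < mask_ratio
    # Cosine distance to the detached teacher, averaged over masked tokens.
    sim = F.cosine_similarity(clip_feats, dino_feats.detach(), dim=-1)
    return ((1.0 - sim) * mask).sum() / mask.sum().clamp(min=1)
```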
However, the following issues require further discussion:
The decoder part of the paper adopts a segmentation head composed of dilated convolutions and FFN for semantic segmentation. Have other types of segmentation heads been tested to evaluate their impact on model performance? It is recommended to conduct comparative experiments to verify the effectiveness of the current decoder.
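
For concreteness, a head of the kind described, dilated convolutions followed by an FFN, might be organized as in the sketch below; the channel widths, dilation rates, and activation are placeholder assumptions, not the paper's configuration.

```python
# Plausible sketch of a segmentation head built from dilated convolutions
# followed by a pointwise FFN; widths and dilation rates are placeholders.
import torch
import torch.nn as nn

class DilatedFFNHead(nn.Module):
    def __init__(self, in_ch: int, num_classes: int, hidden: int = 256):
        super().__init__()
        # Parallel 3x3 convolutions with growing dilation enlarge the
        # receptive field without losing spatial resolution.
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, hidden, 3, padding=d, dilation=d)
            for d in (1, 3, 6)
        )
        # The FFN fuses the branches and predicts per-pixel class logits.
        self.ffn = nn.Sequential(
            nn.Conv2d(hidden * 3, hidden, 1),
            nn.GELU(),
            nn.Conv2d(hidden, num_classes, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = torch.cat([branch(x) for branch in self.branches], dim=1)
        return self.ffn(x)
```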
Although the proposed method achieves improved performance on small categories, no comparison of mIoU between large and small categories is provided. This makes it impossible to determine whether the model is overfitting to large categories and whether the performance improvement on small categories is statistically significant, resulting in a lack of quantitative support for the method's effectiveness in addressing class imbalance.
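
One way to provide that quantitative support is to compute per-class IoU from the confusion matrix and report the mean separately over frequent and rare classes, as in the sketch below; the pixel-share threshold used for the split is an arbitrary illustrative choice.

```python
# Sketch: per-class IoU from a (K, K) confusion matrix, with the mean
# reported separately for frequent ("large") and rare ("small") classes.
# The 1% pixel-share threshold for the split is an illustrative choice.
import numpy as np

def grouped_miou(conf: np.ndarray, small_frac: float = 0.01) -> dict:
    """conf[i, j] counts pixels with ground truth i predicted as j."""
    tp = np.diag(conf).astype(float)
    gt = conf.sum(axis=1).astype(float)        # pixels per ground-truth class
    union = gt + conf.sum(axis=0) - tp
    iou = np.where(union > 0, tp / np.maximum(union, 1), np.nan)
    small = gt < small_frac * gt.sum()         # rare-class membership
    return {
        "mIoU_all": np.nanmean(iou),
        "mIoU_large": np.nanmean(iou[~small]),
        "mIoU_small": np.nanmean(iou[small]),
    }
```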
The current HOSU framework only supports open-vocabulary semantic segmentation for RGB imagery. Can the data be extended to commonly used UAV imagery types such as multispectral imagery? Additionally, can the task scope be expanded to include object detection and instance segmentation for UAV imagery?
It is hoped that the research team can open-source the code and model weights to facilitate experiment reproduction and further research in this field.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

Nice work. Some comments:

1) How can the image content and the type of remote sensing imaging affect the results?

2) What is the computational cost?

3) How well does the semantic segmentation generalise?

4) The advantages of OVSS should be emphasised more strongly.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

The questions have largely been addressed.

Reviewer 2 Report

Comments and Suggestions for Authors

My technical concerns have been fully clarified. As a supplementary suggestion, optimizing the typographic scale in Figures 8-11 would achieve better visual harmony with the document's layout.
