Bridging Vision Foundation and Vision–Language Models for Open-Vocabulary Semantic Segmentation of UAV Imagery
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
This study proposes HOSU, a hybrid OVSS framework that integrates distribution-aware fine-tuning, text-guided multi-level regularization, and masked feature consistency to inject DINOv2's fine-grained spatial priors into CLIP while maintaining CLIP-only inference. The approach is then validated across 4 training settings and 6 UAV benchmarks. The dataset is substantial and the experiments are comprehensive. To further enhance the quality of the manuscript, the following points should be addressed.
- The introduction of learnable proxy queries is a key design choice to avoid fine-tuning CLIP's weights directly. However, the motivation and advantage of this mechanism are not sufficiently elaborated.
- Figures 6-9 are valuable for visual comparison. However, the captions only list the datasets and methods. They would be more impactful with brief annotations pointing out specific improvements. For example, it would be beneficial to add annotations directly to the figures, or to provide a more detailed narrative in the captions to guide the reader's attention to the key visual evidence of HOSU's superiority.
- The ablation study in Table 4 effectively shows that each component contributes to the overall performance. However, it would be even more insightful to analyze which specific challenges each component addresses.
- The paper rightly highlights the advantage of a "CLIP-only inference" pipeline. However, there is no quantitative analysis of its efficiency compared to other methods that might use multiple heavy backbones during inference. Please include a comparison of model parameters (M), GFLOPs, and/or FPS against the main baselines.
- Furthermore, while the inference is efficient, the training process leverages CLIP and DINOv2 models. The potential computational overhead and memory footprint during training should be acknowledged and briefly discussed.
- There are a few formatting inconsistencies in the references. (a) With the exception of Reference [11], none of the other cited sources provide DOI numbers. (b) References 12 and 13 contain abbreviated conference names.
Author Response
Please see the attachment.
Author Response File:
Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
This study focuses on the field of open-vocabulary semantic segmentation (OVSS) for UAV imagery, which is both theoretically challenging and practically valuable. It innovatively proposes the HOSU hybrid framework, leveraging the powerful feature extraction capability of DINOv2 and the multimodal capability of CLIP to effectively break through the performance bottlenecks of traditional methods. This work provides a valuable technical solution for open-vocabulary segmentation of UAV imagery.
However, the following issues require further discussion:
The decoder adopts a segmentation head composed of dilated convolutions and an FFN for semantic segmentation. Have other types of segmentation heads been tested to evaluate their impact on model performance? It is recommended to conduct comparative experiments to verify the effectiveness of the current decoder.
Although the proposed method achieves improved performance on small categories, no comparison of mIoU between large and small categories is provided. This makes it impossible to determine whether the model is overfitting to large categories and whether the performance improvement on small categories is statistically significant, resulting in a lack of quantitative support for the method's effectiveness in addressing class imbalance.
The current HOSU framework only supports open-vocabulary semantic segmentation for RGB imagery. Can the data be extended to commonly used UAV imagery types such as multispectral imagery? Additionally, can the task scope be expanded to include object detection and instance segmentation for UAV imagery?
It is hoped that the research team can open-source the code and model weights to facilitate experiment reproduction and further research in this field.
Author Response
Please see the attachment.
Author Response File:
Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
A nice work. Some comments:
1) How do the image content and the type of remote sensing imaging affect the results?
2) What is the computational cost?
3) How well does the semantic segmentation generalize?
4) The advantages of OVSS should be strengthened further.
Author Response
Please see the attachment.
Author Response File:
Author Response.pdf
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
The questions have basically been addressed.
Reviewer 2 Report
Comments and Suggestions for Authors
My technical concerns have been fully clarified. As a supplementary suggestion, optimizing the typographic scale in Figures 8-11 would achieve better visual harmony with the document's layout.