Abstract
Open-vocabulary semantic segmentation (OVSS) is critically important for unmanned aerial vehicle (UAV) imagery, as UAV scenes are highly dynamic and characterized by diverse, unpredictable object categories. Current OVSS approaches rely mainly on the zero-shot capabilities of vision–language models (VLMs), but their image-level pretraining objectives yield ambiguous spatial relationships and coarse-grained feature representations, resulting in suboptimal performance in UAV scenes. In this work, we propose HOSU, a novel hybrid framework for OVSS in UAV imagery that leverages the priors of vision foundation models to unleash the potential of VLMs in representing complex spatial distributions and capturing fine-grained small-object details in UAV scenes. Specifically, we propose a distribution-aware fine-tuning method that aligns CLIP with DINOv2 across intra- and inter-region feature distributions, enhancing the capacity of CLIP to model complex scene semantics and to capture the fine-grained details critical for UAV imagery. Meanwhile, we propose a text-guided multi-level regularization mechanism that leverages the text embeddings of CLIP to impose semantic constraints on the visual features, preventing them from drifting away from the original semantic space during fine-tuning and ensuring stable vision–language correspondence. Finally, to address the pervasive occlusions in UAV imagery, we propose a mask-based feature consistency strategy that enables the model to learn stable representations that remain robust against viewpoint-induced occlusions. Extensive experiments across four training settings on six UAV datasets demonstrate that our approach consistently achieves state-of-the-art performance compared with previous methods, and comprehensive ablation studies and analyses further validate its effectiveness.