Article

Toward High-Resolution UAV Imagery Open-Vocabulary Semantic Segmentation

Laboratory for Big Data and Decision, National University of Defense Technology, Changsha 410003, China
*
Author to whom correspondence should be addressed.
Drones 2025, 9(7), 470; https://doi.org/10.3390/drones9070470
Submission received: 16 May 2025 / Revised: 22 June 2025 / Accepted: 23 June 2025 / Published: 1 July 2025

Abstract

Unmanned Aerial Vehicle (UAV) image semantic segmentation faces challenges in recognizing novel categories due to closed-set training paradigms and the high cost of annotation. While open-vocabulary semantic segmentation (OVSS) leverages vision-language models like CLIP to enable flexible class recognition, existing methods are limited to low-resolution images, hindering their applicability to high-resolution UAV data. Current adaptations—downsampling, cropping, or modifying CLIP—compromise either detail preservation, global context, or computational efficiency. To address these limitations, we propose HR-Seg, the first high-resolution OVSS framework for UAV imagery, which effectively integrates global context from downsampled images with local details from cropped sub-images through a novel cost-volume architecture. We introduce a detail-enhanced encoder with multi-scale embedding and a detail-aware decoder for progressive mask refinement, specifically designed to handle objects of varying sizes in aerial imagery. We evaluated existing OVSS methods alongside HR-Seg, training on the VDD dataset and testing across three benchmarks: VDD, UDD, and UAVid. HR-Seg achieved superior performance with mIoU scores of 89.38, 73.67, and 55.23, respectively, outperforming all compared state-of-the-art OVSS approaches. These results demonstrate HR-Seg’s exceptional capability in processing high-resolution UAV imagery.

1. Introduction

Unmanned Aerial Vehicle (UAV) image semantic segmentation, which involves classifying each pixel of an image into specific categories, has demonstrated significant application potential across various domains, including urban planning, agricultural monitoring, disaster assessment, and environmental protection.
Although significant progress has been achieved in this field in recent years, most existing approaches remain limited to specialized applications. For instance, Comba et al. [1] utilize a U-Net architecture [2] for chestnut burr segmentation, Mu et al. [3] introduce a superpixel-based graph convolutional network designed for forest fire segmentation, and Huang et al. [4] develop a Mamba-based [5] framework optimized for real-time urban scene segmentation. The supervised learning paradigm, when applied to a limited set of categories, inherently restricts the model's generalization capability. Furthermore, expanding a model's recognition vocabulary requires creating training datasets with new labels and retraining the model, a process that is both resource-intensive and time-consuming.
Open-vocabulary semantic segmentation (OVSS) addresses this limitation by establishing a connection between visual objects and natural language, enabling models to recognize arbitrary classes based on text queries. Early approaches attempt to align visual features with pre-trained word embeddings (e.g., Word2Vec), but their effectiveness is constrained by limited vocabulary coverage [6,7]. The emergence of Vision-Language Models (VLMs) such as CLIP [8] and ALIGN [9] has significantly advanced OVSS performance. Initial efforts directly adapt these VLMs for dense prediction. For instance, Zhou et al. [10] (MaskCLIP) remove the CLIP image encoder's final pooling layer to extract pixel-level language-aligned features, which are then matched with CLIP's text embeddings to generate pixel-level masks, achieving zero-shot segmentation without additional training. Li et al. [11] adopt a comparable approach but differ from MaskCLIP by employing dense prediction transformers [12] for image embedding along with Bottleneck blocks [13] for spatial regularization, thereby producing finer-grained features. Although these one-stage methods achieve few-shot or zero-shot segmentation by fine-tuning CLIP, the tuning process risks overfitting to seen classes and inevitably disrupts the original alignment between CLIP's visual and textual representations. Xu et al. [14] propose a two-stage framework in which a mask prediction model first generates class-agnostic mask proposals, after which a frozen CLIP model classifies each mask into a specific category. Other two-stage models include ZegFormer [15] and OVSeg [16], among others. While this paradigm preserves CLIP's generalization capability, it relies on a computationally expensive mask generation network, hampering computational efficiency. In recent work, ZegCLIP [17] and SAN [18] propose new one-stage frameworks that directly utilize CLIP's embeddings for mask generation, thereby obviating the need for supplementary network components. However, the learnable tokens or adapter layers they introduce fit well only to the training classes, resulting in constrained generalization performance. CAT-Seg [19] proposes a cost volume-based network that first computes a cost map between CLIP's image and text features, followed by spatial and class-wise refinement, and demonstrates superior performance compared to prior methods. Building upon this framework, SED [20] further enhances computational efficiency while maintaining competitive accuracy.
However, OVSS methods integrating CLIP's image encoder share a common limitation: input resolution constraints. Since CLIP is trained at fixed resolutions (e.g., 384 × 384 pixels for CLIP-ViT), input images must be resized to this predetermined resolution to extract language-aligned feature embeddings. This presents a fundamental challenge for UAV applications, where high-resolution images are standard. Unlike natural images, aerial images are captured from an overhead perspective, encompassing broader scenes with complex geometric patterns. The inherent input size limitation of CLIP prevents existing OVSS approaches from fully utilizing the rich information contained in high-resolution UAV imagery, consequently limiting their ability to generate fine-grained segmentation results. Currently, research on OVSS for aerial images remains limited. GSNet [21] introduces a dual-stream framework integrating domain-specific remote sensing priors with CLIP and applies query-guided feature fusion to harmonize specialist and generalist features. OVRS [22] designs a rotation-aggregative similarity computation module to handle orientation variations and integrates multi-scale features during upsampling for scale-aware predictions. SegEarth-OV [23] focuses on training-free solutions, proposing SimFeatUp to restore spatial details in low-resolution CLIP features via upsampling and a global bias alleviation strategy to mitigate CLIP's over-reliance on global context.
In this paper, we present HR-Seg, a high-resolution OVSS network specifically designed for UAV imagery to address the challenge of high-resolution aerial image segmentation. Unlike previous approaches that directly downsample images to low resolutions, HR-Seg innovatively combines both downsampling and cropping strategies: the downsampled image provides global context, while the cropped sub-images preserve detailed local information. The complementary features from the global and local images are effectively integrated through a novel cost volume-based framework. Furthermore, to enhance the network’s ability to capture fine details such as edges, textures, and small objects, we introduce a detail-enhanced encoder and detail-aware decoder. The encoder incorporates a multi-scale embedding block to extract hierarchical features from both global and local cost volumes, coupled with a dual-path aggregation module for respective feature refinement. The decoder architecture integrates two double-guidance layers followed by a detail-aware layer, which collectively generate high-resolution segmentation masks by effectively fusing volume features with global/local semantic representations. In summary, our contributions are as follows:
  • We propose a novel cost volume-based framework that effectively adapts CLIP to achieve high-resolution semantic segmentation. To the best of our knowledge, this is the first work designed for high-resolution UAV imagery in OVSS tasks.
  • To strengthen the network's perception of high-resolution UAV images, we design a detail-enhanced encoder and a detail-aware decoder, which aggregate multi-scale features to refine the segmentation results.
  • We conduct extensive experiments to compare HR-Seg with existing OVSS methods. HR-Seg achieves the best performance, demonstrating the effectiveness of our method.
The remainder of this paper is organized as follows. Section 2 introduces the proposed method, detailing its key components and technical innovations. Section 3 presents the experimental results, including comparisons with state-of-the-art (SOTA) methods and ablation studies to validate the effectiveness of our approach. Finally, Section 4 concludes the paper with a summary of our contributions, discusses potential limitations, and outlines directions for future research.

2. Method

Our method builds upon the cost volume-based framework [19], which computes cosine similarity scores between CLIP’s image and text embeddings to generate class-agnostic cost volumes. These volumes undergo multi-stage refinement through cost aggregation and hierarchical feature decoding to produce pixel-wise segmentation masks. While this framework demonstrates strong robustness for unseen categories, its reliance on CLIP’s coarse image feature maps limits its ability to process high-resolution images and generate high-quality segmentation masks.
To extend the framework's capacity for high-resolution image perception and obtain more fine-grained cost volumes for high-resolution OVSS, HR-Seg extracts complementary features through dual-path processing of high-resolution inputs. The global branch downsamples images to 384 × 384 resolution to capture global contextual features, while the local branch generates detailed feature maps from cropped sub-images. Although the local feature maps preserve richer, detailed information, feature inconsistency exists at the junctions between adjacent sub-feature maps. Such inconsistencies may adversely affect both the fusion process with the global features $F_g$ and subsequent optimization outcomes. To address this issue, we propose a cost volume-level fusion method to aggregate these two complementary features. A comparison of the two fusion strategies is provided in Figure 1. Feature-level fusion first reshapes the sub-feature maps into a unified feature map, which is then merged with the upsampled global feature map; the merged features and the text features are subsequently combined via an outer product to generate the cost volume. As previously mentioned, the high-dimensional features at the junctions between sub-feature maps exhibit inconsistency. In contrast, our proposed cost volume-level fusion method first transforms the high-dimensional features into single-dimensional cost volumes through an outer product with the text features. These cost volumes exclusively represent relevance to the textual content. The resulting local sub-cost volumes are then reshaped into a unified cost volume, which is fused with the global cost volume. This approach effectively mitigates feature inconsistency at the junctions and improves the fusion performance of the feature maps.
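To make the two strategies concrete, the following PyTorch sketch contrasts them on toy tensors. It is illustrative only: the 3 × 3 crop grid, the channel sizes, the simple additive merge of global and local maps, and the collapse of prompt templates into a single embedding per class are assumptions rather than the exact implementation.

```python
import torch
import torch.nn.functional as F

# Toy dimensions (assumptions): L = 9 crops on a 3 x 3 grid, C = 512 CLIP
# embedding dim, N = 6 classes, h = w = 24 CLIP feature resolution.
L, grid, C, N, h, w = 9, 3, 512, 6, 24, 24
F_g = torch.randn(C, h, w)        # global feature map (downsampled image)
F_l = torch.randn(L, C, h, w)     # local feature maps (cropped sub-images)
T = torch.randn(N, C)             # one text embedding per class (templates collapsed)

def stitch(sub_maps, grid):
    """Reassemble L sub-maps of shape (L, C, h, w) into one (C, grid*h, grid*w) map."""
    rows = [torch.cat(list(sub_maps[r * grid:(r + 1) * grid]), dim=-1) for r in range(grid)]
    return torch.cat(rows, dim=-2)

# (a) Feature-level fusion: stitch the high-dimensional features first, then
#     correlate with text; junction pixels mix features from different crops.
F_merged = stitch(F_l, grid) + F.interpolate(F_g[None], scale_factor=grid, mode="bilinear")[0]
cost_ff = torch.einsum("nc,chw->nhw", T, F_merged)            # (N, grid*h, grid*w)

# (b) Cost volume-level fusion (HR-Seg): correlate each crop with text first,
#     then stitch the per-class cost maps and add the upsampled global cost map.
cost_l = torch.einsum("nc,lchw->lnhw", T, F_l)                # per-crop cost volumes
cost_g = torch.einsum("nc,chw->nhw", T, F_g)
cost_cvf = stitch(cost_l, grid) + F.interpolate(cost_g[None], scale_factor=grid, mode="bilinear")[0]
print(cost_ff.shape, cost_cvf.shape)                          # both: (6, 72, 72)
```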
Our model generally consists of two components: a detail-enhanced encoder that generates fine-grained global and local cost volumes, and a detail-aware decoder that effectively combines these volumes to produce high-resolution segmentation masks. The overall architecture of our model is illustrated in Figure 2. The pre-processed global and local images are fed into CLIP's image encoder to generate image features, while the text features are obtained from CLIP's text encoder based on the input text queries. Both the image and text features are fed into the detail-enhanced encoder to obtain the refined global and local cost volumes. The detail-aware decoder then aggregates both volumes and outputs the high-resolution segmentation maps. The subsequent sections provide detailed descriptions of each component.

2.1. Detail-Enhanced Encoder

First, the high-resolution input images I are processed through two parallel pathways: (1) global processing, where images are downsampled to lower-resolution images $I_g \in \mathbb{R}^{3 \times H \times W}$ using linear interpolation to match CLIP's input size requirements, and (2) local processing, where the original images are partitioned into L non-overlapping sub-images $I_l \in \mathbb{R}^{L \times 3 \times H \times W}$. Both pathways employ CLIP's image encoder to extract language-aligned features, yielding global features $F_g \in \mathbb{R}^{C \times h \times w}$ (where $h = H/16$, $w = W/16$) and local features $F_l \in \mathbb{R}^{L \times C \times h \times w}$, with C denoting CLIP's embedding dimension. Simultaneously, text embeddings $T \in \mathbb{R}^{N \times P \times C}$ are obtained from CLIP's text encoder, where N denotes the number of target classes and P denotes the number of templates. The visual and textual features are then transformed into cost volumes through the multi-scale embedding block, producing the global cost volume $V_g \in \mathbb{R}^{N \times C_e \times h \times w}$ and the local cost volumes $V_l \in \mathbb{R}^{L \times N \times C_e \times h \times w}$, where $C_e$ denotes the cost volume embedding dimension. After the embedding process, a dual-path aggregation module is introduced to refine $V_g$ and $V_l$ separately. The architecture of the aggregation block is adopted from [19]. While $V_g$ is directly forwarded to the decoder, the local cost volumes $V_l$ are spatially reassembled into a unified high-resolution cost volume $V_l \in \mathbb{R}^{N \times C_e \times h_l \times w_l}$, which maintains the original image's detailed spatial information.
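A minimal sketch of this dual-pathway pre-processing is given below, assuming the input side length is an integer multiple of CLIP's input size (e.g., 1152 = 3 × 384) and that the local crops are taken at exactly that size; the function and variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def prepare_inputs(image, clip_size=384):
    """Split one high-resolution image (3, H, W) into the global and local CLIP inputs.
    Assumes H and W are integer multiples of clip_size (e.g., 1152 = 3 * 384) and that
    the local crops are taken at exactly CLIP's input size."""
    _, H, W = image.shape
    # Global pathway: interpolate down to CLIP's input resolution.
    global_img = F.interpolate(image[None], size=(clip_size, clip_size),
                               mode="bilinear", align_corners=False)[0]
    # Local pathway: L = (H // clip_size) * (W // clip_size) non-overlapping crops.
    crops = image.unfold(1, clip_size, clip_size).unfold(2, clip_size, clip_size)
    local_imgs = crops.permute(1, 2, 0, 3, 4).reshape(-1, 3, clip_size, clip_size)
    return global_img, local_imgs

g, l = prepare_inputs(torch.randn(3, 1152, 1152))
print(g.shape, l.shape)   # torch.Size([3, 384, 384]) torch.Size([9, 3, 384, 384])
```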

2.1.1. Multi-Scale Embedding Block

Given the image features $F$ ($F \in \{F_g, F_l\}$) and the text features $T$, the cost volume $V_0 \in \mathbb{R}^{N \times P \times h \times w}$ is computed by measuring the similarity score between each pixel-level image embedding in $F$ and all text embeddings in $T$. This process can be formally expressed as:

$$V_0 = T \otimes F$$

where $\otimes$ denotes the outer product. The resulting cost volumes $V_0$ can be interpreted as initial segmentation masks that capture the coarse spatial distribution of target objects in the feature space. Prior to refining $V_0$, an embedding network is employed to enhance its feature representation. To address the substantial scale variation in objects characteristic of UAV imagery, we design a multi-scale feature extraction module employing four parallel convolutional branches with kernel sizes of 1 × 1, 3 × 3, 5 × 5 and 7 × 7, respectively. The extracted multi-scale features are then concatenated along the channel dimension, producing enriched cost volume features $V \in \mathbb{R}^{N \times C_e \times h \times w}$ that comprehensively encode information across different receptive fields:

$$V_i = \Phi(V_0, k_i), \qquad V = \mathrm{concat}([V_i])$$

where $V_i$ denotes the $i$-th volume features generated by the convolution network $\Phi$ with kernel size $k_i \times k_i$.
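The sketch below illustrates the cost volume computation and the four-branch multi-scale embedding. The even per-branch channel split, the number of prompt templates P, and the toy dimensions are assumptions; the embedding dimension of 128 follows the configuration used in Section 3.4.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleEmbedding(nn.Module):
    """Four parallel convolutions over the initial cost volume, concatenated on
    channels. Splitting C_e evenly across the four branches is an assumption."""
    def __init__(self, num_templates, c_embed=128):
        super().__init__()
        kernel_sizes = (1, 3, 5, 7)
        self.branches = nn.ModuleList(
            [nn.Conv2d(num_templates, c_embed // 4, k, padding=k // 2) for k in kernel_sizes])

    def forward(self, cost0):                                   # cost0: (B*N, P, h, w)
        return torch.cat([branch(cost0) for branch in self.branches], dim=1)

# Toy dimensions (assumptions): N = 6 classes, P = 80 prompt templates, C = 512.
B, N, P, C, h, w = 1, 6, 80, 512, 24, 24
F_img = F.normalize(torch.randn(B, C, h, w), dim=1)            # pixel-level CLIP features
T_txt = F.normalize(torch.randn(N, P, C), dim=-1)              # per-class, per-template text features
V0 = torch.einsum("npc,bchw->bnphw", T_txt, F_img)             # cosine-similarity cost volume
V = MultiScaleEmbedding(P)(V0.flatten(0, 1))                   # each class handled independently
print(V.view(B, N, -1, h, w).shape)                            # torch.Size([1, 6, 128, 24, 24])
```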

2.1.2. Aggregation Block

The aggregation block comprises two sequential processes: spatial aggregation and class aggregation. For the spatial aggregation, two Swin transformer blocks [24] are employed to refine the spatial structure of the input volume $V$ ($V \in \{V_g, V_l\}$). The image feature $F$ extracted from CLIP is also fed into the Swin block to provide additional spatial information for guidance. The process can be formulated as:

$$Q = \mathrm{concat}([V_x, F_x]) \cdot W_Q, \quad K = \mathrm{concat}([V_x, F_x]) \cdot W_K, \quad V = V_x \cdot W_V$$

$$V = \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(QK^{T} / \sqrt{d_k}\right) V$$

$$V = \max(0, V W_1 + b_1) W_2 + b_2$$

where $x \in \{g, l\}$, $W_Q, W_K, W_V, W_1, W_2, b_1, b_2$ are learnable parameters, and $d_k$ denotes the feature dimension of $Q$, $K$, and $V$. For class aggregation, a single linear transformer block [25] is employed with text features as additional guidance, following a similar computation process as described above. To maintain the distinct characteristics of features extracted from different perspectives, a dual-path architecture is proposed to prevent undesirable feature coupling between pathways. The global processing path incorporates two consecutive aggregation blocks to comprehensively integrate image-wide contextual information, while the local processing path utilizes a single aggregation block specifically optimized for refining detailed structures within local sub-images.
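For illustration, the snippet below sketches the guidance-augmented attention defined above, with plain global attention standing in for the Swin windowed attention; the residual scheme, normalization layers, and dimensions are simplified assumptions.

```python
import torch
import torch.nn as nn

class GuidedAttention(nn.Module):
    """Queries and keys are projected from cost-volume tokens concatenated with CLIP
    guidance features; values come from the cost-volume tokens alone, followed by a
    two-layer feed-forward network, as in the equations above."""
    def __init__(self, d_cost, d_guid, d_model=128):
        super().__init__()
        self.w_q = nn.Linear(d_cost + d_guid, d_model)
        self.w_k = nn.Linear(d_cost + d_guid, d_model)
        self.w_v = nn.Linear(d_cost, d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, v_tokens, f_tokens):
        # v_tokens: (B, S, d_cost) cost-volume tokens; f_tokens: (B, S, d_guid) guidance.
        vf = torch.cat([v_tokens, f_tokens], dim=-1)
        q, k, val = self.w_q(vf), self.w_k(vf), self.w_v(v_tokens)
        attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        out = attn @ val
        return out + self.ffn(out)

tokens = torch.randn(2, 24 * 24, 128)     # flattened spatial tokens of the cost volume
guide = torch.randn(2, 24 * 24, 512)      # CLIP image features used as guidance
print(GuidedAttention(128, 512)(tokens, guide).shape)   # torch.Size([2, 576, 128])
```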

2.2. Detail-Aware Decoder

The decoder adopts a coarse-to-fine architecture similar to U-Net [2]. In each decode layer, the cost volume feature $V_g$ is upsampled and concatenated with the corresponding shallow features $F_g^{(i)}$ from the $i$-th CLIP encoder layer, ensuring semantic-spatial alignment between the cost volume and the input image. To enhance detail preservation, the reconstructed high-resolution cost volume $V_l$ is additionally injected into each decoding layer. Notably, the ViT-B variant of CLIP inherently lacks multi-scale feature extraction capability, which necessitates upsampling $F_g^{(i)}$ to match the spatial dimensions of $V_g$ for feature concatenation. However, as the resolution of $V_g$ increases, excessive upsampling of $F_g^{(i)}$ tends to introduce noise rather than provide fine-grained semantic guidance, degrading the quality of the predicted masks. To mitigate this limitation, we replace the upsampled $F_g^{(i)}$ with the local features $F_l$ in the final decoding layer, which provides more detailed guidance for higher-resolution mask generation. As shown in Figure 2, the decoder consists of two double guidance decode layers, a detail-aware decode layer, and a prediction head. The detailed architectures of the decode layers are illustrated in Figure 3.

2.2.1. Double Guidance Decode Layer

The global cost volume $V_g$ and its corresponding guidance feature $F_g^{(i)}$ are first upsampled using transposed convolutional layers to achieve higher spatial resolution. These upsampled features are then concatenated and processed through two consecutive convolutional layers. Simultaneously, $V_l$ is aligned with $V_g$ through bilinear interpolation and a 1 × 1 convolutional layer. Finally, we perform element-wise summation between $V_l$ and $V_g$ to enrich the global features with fine-grained local information that would otherwise be absent from $V_g$ alone.
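A simplified sketch of one double guidance decode layer is shown below; the channel widths, the exact convolutional stack, and the folding of the class dimension into the batch dimension are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DoubleGuidanceLayer(nn.Module):
    """Upsample the global cost volume, fuse it with upsampled CLIP guidance features,
    then inject the reassembled local cost volume by element-wise addition. The class
    dimension is assumed to be folded into the batch dimension."""
    def __init__(self, c_vol, c_guid, c_local):
        super().__init__()
        self.up_vol = nn.ConvTranspose2d(c_vol, c_vol, kernel_size=2, stride=2)
        self.up_guid = nn.ConvTranspose2d(c_guid, c_guid, kernel_size=2, stride=2)
        self.fuse = nn.Sequential(
            nn.Conv2d(c_vol + c_guid, c_vol, 3, padding=1), nn.ReLU(),
            nn.Conv2d(c_vol, c_vol, 3, padding=1), nn.ReLU())
        self.local_proj = nn.Conv2d(c_local, c_vol, kernel_size=1)

    def forward(self, v_g, f_g, v_l):
        v_g, f_g = self.up_vol(v_g), self.up_guid(f_g)              # 2x spatial upsampling
        v_g = self.fuse(torch.cat([v_g, f_g], dim=1))               # semantic guidance
        v_l = self.local_proj(F.interpolate(v_l, size=v_g.shape[-2:], mode="bilinear"))
        return v_g + v_l                                            # detail guidance from V_l

layer = DoubleGuidanceLayer(c_vol=128, c_guid=512, c_local=128)
out = layer(torch.randn(1, 128, 24, 24),      # global cost volume V_g
            torch.randn(1, 512, 24, 24),      # guidance features F_g^(i)
            torch.randn(1, 128, 72, 72))      # reassembled local cost volume V_l
print(out.shape)                              # torch.Size([1, 128, 48, 48])
```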

2.2.2. Detail-Aware Decode Layer

To combine the upsampled cost volume $V_g$ and the local image features $F_l^{(i)}$, we partition $V_g$ into multiple sub-feature maps and concatenate them with $F_l^{(i)}$, which preserves richer spatial details and semantic information than $F_g^{(i)}$. To fully exploit these fine-grained features, we propose a detail perception block. The block employs a point-wise convolution for initial edge feature extraction, followed by two complementary asymmetric convolution layers that aggregate features along different directions. This architectural design effectively enhances the network's capacity for segmenting object boundaries and small-scale targets.
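The following sketch shows one plausible form of the detail perception block; the asymmetric kernel length, the parallel arrangement of the two directional convolutions, and the residual connection are assumptions.

```python
import torch
import torch.nn as nn

class DetailPerceptionBlock(nn.Module):
    """Point-wise convolution followed by two complementary asymmetric convolutions
    (1 x k and k x 1) that aggregate features along horizontal and vertical directions.
    The kernel length k = 3, the parallel arrangement, and the residual are assumptions."""
    def __init__(self, channels, k=3):
        super().__init__()
        self.pointwise = nn.Conv2d(channels, channels, kernel_size=1)
        self.horizontal = nn.Conv2d(channels, channels, kernel_size=(1, k), padding=(0, k // 2))
        self.vertical = nn.Conv2d(channels, channels, kernel_size=(k, 1), padding=(k // 2, 0))
        self.act = nn.ReLU()

    def forward(self, x):
        y = self.act(self.pointwise(x))                               # initial edge feature extraction
        return self.act(self.horizontal(y) + self.vertical(y)) + x    # directional aggregation

x = torch.randn(1, 128, 96, 96)
print(DetailPerceptionBlock(128)(x).shape)    # torch.Size([1, 128, 96, 96])
```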

3. Results and Discussion

3.1. Datasets and Evaluation Metrics

We choose three UAV datasets with high-resolution images for training and evaluation, including UDD5 [26], VDD [27] and UAVid [28]:
The UDD5 dataset contains a total of 160 aerial images, distributed as 120 for training and 40 for validation. The imagery is captured in three high-resolution formats: 3840 × 2160, 4096 × 2160 and 4000 × 3000 pixels. The dataset provides pixel-wise annotations encompassing five semantic categories.
The VDD dataset consists of 400 annotated images, with a split of 280 for training, 80 for validation, and 40 for testing. All images share a consistent high resolution of 4000 × 3000 pixels. The dataset provides pixel-wise annotations encompassing six semantic categories.
The UAVid dataset contains a total of 420 aerial images, distributed as 200 for training, 70 for validation, and 150 for testing. The imagery is captured in two high-resolution formats: 3840 × 2160 and 4096 × 2160 pixels. The dataset provides pixel-wise annotations encompassing eight semantic categories. We consolidate the ‘static car’ and ‘moving car’ labels into a single unified ‘car’ category, as the distinction between stationary and moving vehicles cannot be reliably determined from static image frames.
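The following snippet sketches such a label consolidation; the class index ordering is chosen arbitrarily for illustration and does not reflect the dataset's official encoding.

```python
# Hypothetical UAVid label remapping: both car labels collapse into one 'car' class.
# The index ordering here is illustrative, not the dataset's official encoding.
UAVID_CLASSES = ["clutter", "building", "road", "tree", "low vegetation",
                 "moving car", "static car", "human"]
MERGED_CLASSES = ["clutter", "building", "road", "tree", "low vegetation", "car", "human"]
remap = {i: MERGED_CLASSES.index("car") if "car" in name else MERGED_CLASSES.index(name)
         for i, name in enumerate(UAVID_CLASSES)}
print(remap)   # 'moving car' and 'static car' map to the same merged index
```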
In our experiments, we train the model on the VDD training set and evaluate its performance on the validation sets of all three benchmark datasets. This cross-dataset evaluation protocol effectively demonstrates the model’s generalization capability. Specifically, UDD5 serves to evaluate in-vocabulary performance as its categories are largely consistent with those of VDD, while UAVid primarily assesses open-vocabulary segmentation performance. For all experiments, we adopt the mean Intersection-over-Union (mIoU) and mean Accuracy (mACC) as the evaluation metrics.
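For reference, a minimal per-image computation of the two metrics is sketched below; in practice, intersections and unions are accumulated over the entire validation set before averaging.

```python
import numpy as np

def miou_macc(pred, gt, num_classes):
    """Per-image mIoU and mACC from integer label maps of shape (H, W)."""
    ious, accs = [], []
    for c in range(num_classes):
        p, g = pred == c, gt == c
        if not p.any() and not g.any():
            continue                                  # class absent from both maps
        inter = np.logical_and(p, g).sum()
        ious.append(inter / np.logical_or(p, g).sum())
        accs.append(inter / g.sum() if g.any() else 0.0)
    return float(np.mean(ious)), float(np.mean(accs))

pred = np.random.randint(0, 6, (512, 512))
gt = np.random.randint(0, 6, (512, 512))
print(miou_macc(pred, gt, num_classes=6))
```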

3.2. Implementation Details

We employ the ViT-B variant of CLIP and adopt the fine-tuning strategy of [19], selectively updating only the query and value projection layers in both of CLIP's encoders. The input images are resized to 1152 × 1152 pixels, striking a balance between image quality and computational efficiency. Our model is trained on two NVIDIA RTX 4090 GPUs for 10,000 iterations with a batch size of 2. The optimizer is AdamW [29] with a learning rate of $2 \times 10^{-4}$ and a weight decay of $1 \times 10^{-4}$.
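The sketch below shows one way to realize this selective fine-tuning; the parameter-name matching assumes a HuggingFace-style CLIP implementation with separate q_proj/v_proj layers, and the model and head variables are placeholders.

```python
import torch

def freeze_all_but_qv(clip_model):
    """Freeze every CLIP parameter except the attention query/value projections.
    The 'q_proj'/'v_proj' name matching assumes a HuggingFace-style CLIP; OpenAI-style
    weights fuse q/k/v into a single in_proj tensor and would need a custom module."""
    for name, param in clip_model.named_parameters():
        param.requires_grad = ("q_proj" in name) or ("v_proj" in name)
    return [p for p in clip_model.parameters() if p.requires_grad]

# Hypothetical usage with the optimizer settings reported above (clip_model and
# hr_seg_head are placeholders for the CLIP backbone and the HR-Seg head):
# trainable = freeze_all_but_qv(clip_model) + list(hr_seg_head.parameters())
# optimizer = torch.optim.AdamW(trainable, lr=2e-4, weight_decay=1e-4)
```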

3.3. Main Results

As no prior work specifically addresses high-resolution UAV imagery open-vocabulary semantic segmentation, we compare our method with recent state-of-the-art OVSS approaches, including Cat-Seg [19] (CVPR 2024), SED [20] (CVPR 2024), GSnet [21] (AAAI 2025), SAN [18] (CVPR 2023), and Ebseg [30] (CVPR 2024). Notably, GSnet [21] is specifically designed for remote sensing image segmentation. For a fair comparison, all methods are trained using identical experimental configurations except for the input size. The results are presented in Table 1.
HR-Seg demonstrates superior performance by achieving the highest mIoU scores across all three validation datasets, outperforming the second-best method by significant margins of 9.69% on UDD5 and 8.80% on UAVid. Notably, it consistently ranks first or second in both seen and unseen categories, highlighting its robust generalization capabilities. The strong performance on UDD5 and UAVid further validates HR-Seg’s exceptional perceptual accuracy and adaptability to diverse scenarios. Particularly noteworthy is HR-Seg’s performance on small objects (e.g., ‘human’), which existing methods often fail to detect. We attribute this advancement to HR-Seg’s effective utilization of high-resolution image information—a critical feature that distinguishes HR-Seg from other approaches. Figure 4 presents qualitative comparisons between HR-Seg and Cat-Seg [19]. The top three rows display full-image segmentation results, while the bottom two rows show detailed comparisons. HR-Seg exhibits significantly better prediction consistency compared to Cat-Seg, particularly evident in building segmentation (Rows 1 and 3), where our method maintains accurate classification while Cat-Seg produces noticeable misclassifications in some areas. In Row 2, HR-Seg effectively discriminates between ‘tree’ and ‘low vegetation’ categories, a distinction that Cat-Seg fails to make. This performance gap suggests that HR-Seg’s architecture enables more effective extraction of detailed visual features. The final two rows reveal HR-Seg’s exceptional capability in detecting fine details, successfully identifying small vehicles even under partial occlusion conditions (marked with red circles). In Appendix A, we provide further evaluation of HR-Seg’s cross-dataset generalization performance complemented by more qualitative visualizations.
We further investigate the training performance of HR-Seg across varying image resolutions. We enable HR-Seg to process variable-resolution inputs by adjusting the number of sub-images partitioned from the high-resolution originals. Specifically, we evaluate three input resolutions (768 × 768, 1152 × 1152, and 1536 × 1536), corresponding to 2×, 3×, and 4× the standard CLIP input size (384 × 384). As shown in Table 2, higher resolutions incur greater computational costs and slower inference speeds. While increased resolution would be expected to enhance predictive performance by providing richer visual information, our experiments reveal that 1536 × 1536 inputs underperform 1152 × 1152. This result may stem from excessive fragmentation of local details when dividing 1536 × 1536 images into 16 sub-images, where the abrupt fusion of disjointed local features with the global cost volume, without progressive integration, leads to feature degradation. Conversely, the result with the 768 × 768 input indicates that lower-resolution images lack sufficient detail for HR-Seg to generate high-quality masks, causing the upsampling process to amplify noise rather than recover meaningful semantic information. In summary, the 1152 × 1152 resolution achieves an optimal balance between computational efficiency and segmentation performance for HR-Seg.
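The number of local sub-images follows directly from the ratio between the input side length and CLIP's 384-pixel input size, as the short calculation below illustrates.

```python
clip_size = 384
for side in (768, 1152, 1536):
    crops = (side // clip_size) ** 2
    print(f"{side} x {side} input -> {crops} local sub-images")
# 768 -> 4, 1152 -> 9, 1536 -> 16 local sub-images
```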
Table 3 compares the computational costs and inference speeds of the different methods. The "Additional Backbone" column indicates models incorporating other pre-trained networks beyond CLIP. Ebseg [30] achieves performance second only to HR-Seg; however, its reliance on SAM's [31] encoder for mask proposal extraction leads to suboptimal computational efficiency. Notably, HR-Seg processing 1152 × 1152 images demonstrates GPU memory usage comparable to Ebseg handling 640 × 640 images, while achieving marginally faster inference. This empirically validates HR-Seg's superior computational efficiency over Ebseg's two-stage framework. In terms of inference speed, HR-Seg remains nearly identical to Cat-Seg, exhibiting only a 0.01 s difference. This minimal latency gap is achieved through HR-Seg's architectural design, which enables parallel feature extraction from different regions of high-resolution images, thereby preventing significant speed degradation. SED [20] demonstrates the best computational efficiency, consuming merely one-sixth of HR-Seg's GPU memory and achieving sixfold faster inference. These results reveal that although HR-Seg delivers superior performance in high-resolution semantic segmentation, its computational overhead remains substantial. For deployment on resource-constrained platforms like UAVs, further architectural optimizations are necessary to enhance computational efficiency.

3.4. Ablation Studies

Ablation on different components. To demonstrate the necessity of our approach, we first directly scale up Cat-Seg's network architecture to accommodate higher-resolution 768 × 768 inputs. As shown in the first two rows of Table 4, this naive modification (Cat-Seg*) fails to deliver improved performance, confirming that simply feeding high-resolution images into existing OVSS frameworks built around CLIP encoders, without specialized architectural designs for high-resolution processing, yields suboptimal results. Row 3 presents the performance of our baseline model devoid of any specialized components. Variant A incorporates the Multi-Scale Embedding Block (MSEB) on top of this baseline and shows that MSEB has limited effectiveness with low-resolution inputs. Variant B implements our proposed dual-path architecture for high-resolution processing, introducing the Double Guidance Layer (DGL) to fuse the global and local cost volumes. This yields significant gains over the baseline: +3.97 mIoU (VDD), +9.67 mIoU (UDD5), and +2.07 mIoU (UAVid), validating its effectiveness for high-resolution imagery. Building upon B, Variant C further integrates MSEB, achieving additional improvements and confirming MSEB's capacity to capture multi-scale information in high-resolution contexts. Variant D enhances B with the Detail-Enhanced Encoder (DEE), whose feature aggregation mechanism brings respective improvements of 0.46, 0.54, and 0.33 mIoU on the three benchmark datasets. For the fusion strategy comparison, Variant E replaces cost volume fusion (CVF) with feature-level fusion (FF). While achieving the best performance on VDD (90.37 mIoU), its inferior results on UDD5 and UAVid suggest that feature-level fusion generalizes less effectively than cost volume fusion. This limitation probably stems from inconsistent boundary features in the sub-images disrupting the critical alignment between visual features and language embeddings during fusion. Variant F builds upon Variant D by incorporating the Detail-Aware Layer (DAL), demonstrating particularly significant improvements on the UAVid dataset. This specialized layer is explicitly designed to enhance the model's perception of fine-grained image details, which explains its superior performance on UAVid, a dataset containing an abundance of small objects (e.g., vehicles, humans) that benefit from such detailed feature perception. The complete architecture, Variant G, integrates all proposed components and achieves 89.38, 73.67, and 55.23 mIoU across the three benchmark datasets.
Ablation on the dual-path aggregate architecture. While increasing the number of aggregation blocks can enhance the model’s perception capability, it also risks introducing underfitting or overfitting problems. We conduct experiments to determine the optimal number of aggregation blocks in each branch, with results presented in Table 5. Our experimental results indicate that the configuration utilizing two global aggregation blocks achieves optimal prediction performance across comparative architectures. However, expanding the number of local aggregation blocks adversely affects the model’s generalization capacity, as particularly evidenced by its performance on the UAVid dataset. The specific architecture combining two global aggregation blocks with a single local aggregation block delivers competitive performance, attaining 73.67 mIoU on UDD5 and 55.23 mIoU on UAVid, despite showing only intermediate results (89.38 mIoU) on the VDD dataset.
Ablation on multi-scale embedding block. Table 6 presents our ablation study results for the multi-scale embedding block (MSEB), investigating different configurations of convolution kernels. The top four rows examine single-kernel implementations with varying kernel sizes, yet no clear correlation emerges between kernel size and performance across datasets. We attribute this to the datasets’ diverse object scales and distributions, as they were not specifically designed for particular segmentation tasks (e.g., small objects or land segmentation). Our subsequent experiments with multiple kernel combinations reveal that the four-kernel configuration achieves optimal performance on both UDD5 and UAVid validation sets, outperforming the original single 7 × 7 kernel design by 3.07 and 0.42 mIoU, respectively. We excluded three-kernel configurations due to incompatibility with our chosen embedding dimension of 128 (not divisible by 3).
Ablation on the detail perception block. We conduct ablation experiments on the detail perception block to explore its effectiveness in the decode layers, applying it to the double guidance decode layer and the detail-aware decode layer. The results are shown in Table 7. Without the detail perception block, the network achieves 89.63 mIoU on VDD, 72.89 mIoU on UDD5, and 54.81 mIoU on UAVid. Employing the block in the detail-aware decode layer yields 0.78 higher mIoU on UDD5 and 0.42 higher mIoU on UAVid. However, additionally employing the block in the double guidance decode layer produces suboptimal results. We attribute this to the guidance features in the double guidance layer containing less detail information, from which the detail perception block cannot extract useful cues.

3.5. Error Analysis

While HR-Seg effectively harnesses high-resolution inputs to capture richer semantic information, its segmentation granularity proves inadequate for precise mask prediction, particularly for small objects. As illustrated in Figure 5, Region (a) demonstrates this limitation: although HR-Seg successfully detects all small vehicles, it produces overly coarse segmentation boundaries, erroneously merging multiple instances into a single categorical region. This pattern persists in Region (b), where the model identifies human presence but fails to achieve accurate boundary delineation. Furthermore, although HR-Seg outperforms other OVSS methods in detecting extremely small targets, it still encounters recognition failures, as evidenced in Region (c).

4. Conclusions

In this paper, we present HR-Seg, the first high-resolution OVSS network tailored for UAV imagery. HR-Seg innovatively integrates downsampling and cropping strategies, leveraging global context from low-resolution images and fine details from local sub-images through a novel cost volume-based framework. Our approach further enhances segmentation accuracy with a detail-enhanced encoder and detail-aware decoder, which aggregate multi-scale features and refine edge and texture representation. Through extensive evaluation on multiple high-resolution datasets and comparison with state-of-the-art methods, HR-Seg demonstrates superior performance. The cost volume-based framework endows HR-Seg with strong generalization capabilities, enabling robust recognition of unseen categories. Additionally, by effectively leveraging high-resolution image information, HR-Seg enhances fine-grained scene understanding, achieving more accurate segmentation—particularly for small-scale targets. This work represents an initial exploration of OVSS for high-resolution UAV imagery, though several limitations remain. First, HR-Seg’s segmentation precision for small objects remains suboptimal, with occasional failures on extremely tiny targets—a critical challenge in UAV-based semantic segmentation, where small objects are easily overlooked. Second, HR-Seg’s current computational overhead limits its deployability on resource-constrained platforms like UAVs, as inference speed requires further optimization. Furthermore, the generalization performance is constrained by the limited scale and diversity of existing UAV datasets. While HR-Seg outperforms current SOTA methods across benchmarks, its accuracy on unseen categories remains insufficient for real-world applications—a limitation attributed to the scarcity of high-resolution UAV training samples and their restricted annotation classes.
In future work, we will prioritize optimizing HR-Seg’s network architecture, including exploring linear-complexity alternatives to transformer-based structures to enhance computational efficiency. Concurrently, we will investigate more effective feature encoding and decoding strategies specifically tailored for high-resolution imagery, complemented by auxiliary loss functions targeting segmentation boundary precision to further improve accuracy. Additionally, we plan to incorporate self-supervised and weakly supervised learning paradigms to reduce the model’s dependency on large-scale annotated datasets.

Author Contributions

Conceptualization, Y.X. and Z.C.; methodology, Z.C.; validation, Z.C.; formal analysis, Z.C.; investigation, Z.C. and Y.W.; writing—original draft preparation, Z.C.; writing—review and editing, Z.C. and Y.W.; visualization, Z.C.; supervision, Y.X. and Y.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Natural Science Foundation of Hunan Province, China under Grant 2023JJ30082.

Data Availability Statement

The UDD5 dataset can be found at https://github.com/MarcWong/UDD. The VDD dataset can be found at https://github.com/RussRobin/VDD. The UAVid dataset can be found at https://uavid.nl/. The iSAID dataset can be found at https://captain-whu.github.io/iSAID/dataset.html. The NightCity dataset can be found at https://dmcv.sjtu.edu.cn/people/phd/tanxin/NightCity/index.html, all links accessed on 21 June 2025.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. More Evaluation Results

Current publicly available benchmarks for high-resolution UAV semantic segmentation remain limited in both quantity and label diversity. To comprehensively evaluate HR-Seg’s generalization capability, we conducted additional testing on two supplementary datasets: iSAID [33] and NightCity [34].
The iSAID [33] dataset is a large-scale aerial imagery benchmark, comprising 2806 high-resolution images with dense annotations of 655,451 object instances across 15 categories. Its validation set contains 458 images. The NightCity [34] dataset focuses on urban nighttime scenarios, featuring 4297 street-view images at 1024 × 512 resolution with 19 annotated object categories, including a validation set of 1299 images. Table A1 presents the specific annotated categories for both datasets.
Table A1. Annotation categories of the iSAID and NightCity datasets.
iSAID [33]:
“ship”, “storage tank”, “baseball diamond”, “tennis court”, “basketball court”, “Ground Track Field”, “Bridge”, “Large Vehicle”, “Small Vehicle”,“Helicopter”, “Swimming pool”, “Roundabout”, “Soccer ball field”, “plane”, “Harbor”
NightCity [34]:
“road”, “sidewalk”, “building”, “wall”, “fence”, “pole”, “traffic light”, “traffic sign”, “vegetation”, “terrain”, “sky”, “person”, “rider”, “car”, “truck”, “bus”, “train”, “motorcycle”, “bicycle”
As shown in Table A2, we evaluated HR-Seg against existing OVSS methods on these datasets while maintaining the original model trained solely on VDD data, without any additional fine-tuning or retraining. HR-Seg achieved state-of-the-art performance on the iSAID dataset with 25.67 mIoU and 45.70 mACC, further demonstrating its superior generalization capability compared to other models. However, the overall performance of all OVSS methods on iSAID remains suboptimal, indicating that current OVSS approaches still heavily rely on large-scale training data—limited training data appears insufficient to develop robust generalization abilities. As shown in Figure A4, the visualization results reveal that iSAID’s remote sensing images cover significantly larger areas and contain objects with greater size variation. While HR-Seg effectively detects these objects, its limited language understanding capability leads to misclassification errors. On the NightCity dataset, HR-Seg attained 16.26 mIoU, second only to Cat-Seg’s 17.50 mIoU. Since NightCity contains neither high-resolution nor aerial imagery, HR-Seg’s architectural advantages are less pronounced. Additionally, the lack of nighttime samples in the VDD training data contributes to the model’s mediocre performance on this dataset.
We also provide more qualitative results on UAVid [28] in Figure A2, UDD5 [26] in Figure A3, and VDD [27] in Figure A1.
Table A2. Performance comparison with other methods on additional datasets. Best values are bolded and second-best values are underlined.
| Method | Image Size | iSAID [33] mIoU | iSAID [33] mACC | NightCity [34] mIoU | NightCity [34] mACC |
|---|---|---|---|---|---|
| SAN [18] | 640 × 640 | 10.31 | 21.86 | 4.96 | 13.27 |
| Cat-Seg [19] | 384 × 384 | 24.48 | 43.07 | 17.50 | 33.74 |
| SED [20] | 768 × 768 | 22.31 | 39.89 | 14.98 | 34.07 |
| GSnet [21] | 384 × 384 | 23.71 | 42.44 | 14.33 | 29.50 |
| Ebseg [30] | 640 × 640 | 15.65 | 29.05 | 6.01 | 14.80 |
| HR-Seg (ours) | 1152 × 1152 | 25.67 | 45.70 | 16.26 | 29.40 |
Figure A1. Qualitative Results on VDD [27] dataset.
Figure A2. Qualitative Results on UAVid [28] dataset.
Figure A3. Qualitative Results on UDD5 [26] dataset.
Figure A4. Qualitative Results on iSAID [33] dataset. Due to varying image resolutions in the dataset, we performed appropriate adjustments to the image alignment.

References

  1. Comba, L.; Biglia, A.; Sopegno, A.; Grella, M.; Dicembrini, E.; Ricauda Aimonino, D.; Gay, P. Convolutional Neural Network Based Detection of Chestnut Burrs in UAV Aerial Imagery. In AIIA 2022: Biosystems Engineering Towards the Green Deal; Ferro, V., Giordano, G., Orlando, S., Vallone, M., Cascone, G., Porto, S.M.C., Eds.; Springer International Publishing: Cham, Switzerland, 2023; pp. 501–508. [Google Scholar]
  2. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015, Proceedings, Part III 18; Springer: Cham, Switzerland, 2015; pp. 234–241. [Google Scholar]
  3. Mu, Y.; Ou, L.; Chen, W.; Liu, T.; Gao, D. Superpixel-Based Graph Convolutional Network for UAV Forest Fire Image Segmentation. Drones 2024, 8, 142. [Google Scholar] [CrossRef]
  4. Huang, L.; Tan, J.; Chen, Z. Mamba-UAV-SegNet: A Multi-Scale Adaptive Feature Fusion Network for Real-Time Semantic Segmentation of UAV Aerial Imagery. Drones 2024, 8, 671. [Google Scholar] [CrossRef]
  5. Zhang, W.; Li, M.; Wang, H. Mamba: A Flexible Framework for Semantic Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 1234–1245. [Google Scholar]
  6. Xian, Y.; Choudhury, S.; He, Y.; Schiele, B.; Akata, Z. Semantic Projection Network for Zero- and Few-Label Semantic Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 8248–8257. [Google Scholar]
  7. Bucher, M.; Vu, T.-H.; Cord, M.; Pérez, P. Zero-shot semantic segmentation. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; pp. 466–477. [Google Scholar]
  8. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
  9. Jia, C.; Yang, Y.; Xia, Y.; Chen, Y.-T.; Parekh, Z.; Pham, H.; Le, Q.; Sung, Y.-H.; Li, Z.; Duerig, T. Scaling up visual and vision-language representation learning with noisy text supervision. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 4904–4916. [Google Scholar]
  10. Zhou, C.; Loy, C.C.; Dai, B. Extract free dense labels from clip. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2022; pp. 696–712. [Google Scholar]
  11. Li, B.; Weinberger, K.Q.; Belongie, S.; Koltun, V.; Ranftl, R. Language-driven semantic segmentation. arXiv 2022, arXiv:2201.03546. [Google Scholar]
  12. Ranftl, R.; Bochkovskiy, A.; Koltun, V. Vision Transformers for Dense Prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 12179–12188. [Google Scholar]
  13. Li, B.; Wu, F.; Weinberger, K.Q.; Belongie, S. Positional Normalization. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; pp. 1620–1632. [Google Scholar]
  14. Xu, M.; Zhang, Z.; Wei, F.; Lin, Y.; Cao, Y.; Hu, H.; Bai, X. A Simple Baseline for Open-Vocabulary Semantic Segmentation with Pre-Trained Vision-Language Model. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2022; pp. 736–753. [Google Scholar]
  15. Ding, J.; Xue, N.; Xia, G.-S.; Dai, D. Decoupling zero-shot semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11583–11592. [Google Scholar]
  16. Liang, F.; Wu, B.; Dai, X.; Li, K.; Zhao, Y.; Zhang, H.; Zhang, P.; Vajda, P.; Marculescu, D. Open-vocabulary semantic segmentation with mask-adapted clip. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 7061–7070. [Google Scholar]
  17. Zhou, Z.; Lei, Y.; Zhang, B.; Liu, L.; Liu, Y. Zegclip: Towards Adapting Clip for Zero-Shot Semantic Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 11175–11185. [Google Scholar]
  18. Xu, M.; Zhang, Z.; Wei, F.; Hu, H.; Bai, X. Side adapter network for open-vocabulary semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 2945–2954. [Google Scholar]
  19. Cho, S.; Shin, H.; Hong, S.; Arnab, A.; Seo, P.H.; Kim, S. Cat-seg: Cost aggregation for open-vocabulary semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 4113–4123. [Google Scholar]
  20. Xie, B.; Cao, J.; Xie, J.; Khan, F.S.; Pang, Y. SED: A Simple Encoder-Decoder for Open-Vocabulary Semantic Segmentation. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 3426–3436. [Google Scholar]
  21. Ye, C.; Zhuge, Y.; Zhang, P. Towards open-vocabulary remote sensing image semantic segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, Edmonton, AB, Canada, 10–14 November 2025; Volume 39, pp. 9436–9444. [Google Scholar]
  22. Cao, Q.; Chen, Y.; Ma, C.; Yang, X. Open-vocabulary remote sensing image semantic segmentation. arXiv 2024, arXiv:2409.07683. [Google Scholar]
  23. Li, K.; Liu, R.; Cao, X.; Bai, X.; Zhou, F.; Meng, D.; Wang, Z. Segearth-ov: Towards training-free open-vocabulary segmentation for remote sensing images. arXiv 2024, arXiv:2410.01768. [Google Scholar]
  24. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  25. Wang, S.; Li, B.Z.; Khabsa, M.; Fang, H.; Ma, H. Linformer: Self-attention with linear complexity. arXiv 2020, arXiv:2006.04768. [Google Scholar]
  26. Chen, Y.; Wang, Y.; Lu, P.; Chen, Y.; Wang, G. Large-Scale Structure from Motion with Semantic Constraints of Aerial Images; Springer International Publishing: Cham, Switzerland, 2018; pp. 347–359. [Google Scholar]
  27. Cai, W.; Jin, K.; Hou, J.; Guo, C.; Wu, L.; Yang, W. Vdd: Varied drone dataset for semantic segmentation. J. Vis. Commun. Image Represent. 2025, 109, 104429. [Google Scholar] [CrossRef]
  28. Lyu, Y.; Vosselman, G.; Xia, G.-S.; Yilmaz, A.; Yang, M.Y. UAVid: A semantic segmentation dataset for UAV imagery. ISPRS J. Photogramm. Remote Sens. 2020, 165, 108–119. [Google Scholar] [CrossRef]
  29. Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. arXiv 2017, arXiv:1711.05101. [Google Scholar]
  30. Shan, X.; Wu, D.; Zhu, G.; Shao, Y.; Sang, N.; Gao, C. Open-vocabulary semantic segmentation with image embedding balancing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 28412–28421. [Google Scholar]
  31. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.-Y.; et al. Segment Anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 4–6 October 2023; pp. 4015–4026. [Google Scholar]
  32. Caron, M.; Touvron, H.; Misra, I.; Jégou, H.; Mairal, J.; Bojanowski, P.; Joulin, A. Emerging Properties in Self-Supervised Vision Transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 9650–9660. [Google Scholar]
  33. Zamir, S.W.; Arora, A.; Gupta, A.; Khan, S.; Sun, G.; Khan, F.S.; Zhu, F.; Shao, L.; Xia, G.-S.; Bai, X. iSAID: A Large-Scale Dataset for Instance Segmentation in Aerial Images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 16–20 June 2019; pp. 28–37. [Google Scholar]
  34. Tan, X.; Xu, K.; Cao, Y.; Zhang, Y.; Ma, L.; Lau, R.W.H. Night-Time Scene Parsing with a Large Real Dataset. IEEE Trans. Image Process. 2021, 30, 9085–9098. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Two fusion methods for global and local features. The feature-level fusion integrates local features with upsampled global features after reshaping local sub-features into a unified map. In contrast, cost volume-level fusion first computes global and local volumes via outer product with text features, then reshapes and combines them before aggregation.
Figure 2. The overview of HR-Seg. The framework processes high-resolution images through dual pathways and extracts CLIP-based visual-text features to generate global and local cost volumes. These volumes are refined through the detail-enhanced encoder and decoded into final segmentation predictions.
Figure 3. The detailed architectures of the two decode layers. The double guidance layer first merges the global cost volume with global features, processes them through convolutional layers to generate semantic guidance, then combines the global and local cost volumes for detailed guidance. The detail-aware layer decomposes the cost volume into sub-blocks and fuses them with local features to enhance semantic details, followed by multi-directional feature aggregation in the detail perception block. The upper-right inset illustrates the derivation of $F_g^{(i)}$ and $F_l^{(i)}$.
Figure 4. Qualitative results on the UAVid [28] dataset. The top three rows show the complete segmentation results, while the bottom two rows show the detailed segmentation results. The red circles annotate fine-grained details that are successfully captured by HR-Seg but overlooked by Cat-Seg [19].
Figure 5. Qualitative error analysis. Regions (a,b) demonstrate that HR-Seg produces relatively coarse segmentation with limited precision for small objects. Region (c) further reveals the model’s reduced accuracy in detecting extremely small targets.
Table 1. Performance comparison with other methods. All the methods are trained on the VDD [27] training set. ‘*’ denotes the unseen category. Best values are bolded and second-best values are underlined.
Evaluation on VDD [27]

| Method | CLIP Variant | Image Size | mIoU | mACC | IoU: wall | IoU: road | IoU: veg. | IoU: vehicle | IoU: roof | IoU: water |
|---|---|---|---|---|---|---|---|---|---|---|
| SAN [18] | ViT-B/16 | 640 × 640 | 86.28 | 91.38 | 70.83 | 87.11 | 96.64 | 69.98 | 94.76 | 98.36 |
| Cat-Seg [19] | ViT-B/16 | 384 × 384 | 88.08 | 93.77 | 75.46 | 88.64 | 97.15 | 73.45 | 95.36 | 98.40 |
| SED [20] | ConvNeXt-B | 768 × 768 | 88.19 | 93.84 | 76.40 | 88.73 | 97.21 | 72.84 | 95.70 | 98.26 |
| GSnet [21] | ViT-B/16 | 384 × 384 | 88.16 | 93.38 | 75.61 | 89.43 | 97.39 | 72.74 | 95.46 | 98.34 |
| Ebseg [30] | ViT-B/16 | 640 × 640 | 88.45 | 92.43 | 74.61 | 89.10 | 97.28 | 75.34 | 95.59 | 98.76 |
| HR-Seg | ViT-B/16 | 1152 × 1152 | 89.38 | 93.81 | 76.58 | 90.75 | 97.74 | 76.42 | 95.88 | 98.89 |

Evaluation on UDD5 [26]

| Method | CLIP Variant | Image Size | mIoU | mACC | IoU: veg. | IoU: building * | IoU: road | IoU: vehicle |
|---|---|---|---|---|---|---|---|---|
| SAN [18] | ViT-B/16 | 640 × 640 | 63.72 | 73.01 | 85.37 | 83.47 | 44.37 | 41.70 |
| Cat-Seg [19] | ViT-B/16 | 384 × 384 | 59.07 | 79.81 | 75.80 | 80.86 | 53.11 | 26.49 |
| SED [20] | ConvNeXt-B | 768 × 768 | 64.53 | 83.05 | 88.48 | 90.05 | 62.51 | 17.07 |
| GSnet [21] | ViT-B/16 | 384 × 384 | 59.31 | 81.63 | 81.67 | 83.34 | 56.45 | 15.77 |
| Ebseg [30] | ViT-B/16 | 640 × 640 | 67.16 | 82.36 | 83.94 | 88.14 | 65.03 | 31.51 |
| HR-Seg | ViT-B/16 | 1152 × 1152 | 73.67 | 86.24 | 87.68 | 92.63 | 67.35 | 47.03 |

Evaluation on UAVid [28]

| Method | CLIP Variant | Image Size | mIoU | mACC | IoU: road | IoU: building * | IoU: tree * | IoU: low veg. * | IoU: car * | IoU: human * |
|---|---|---|---|---|---|---|---|---|---|---|
| SAN [18] | ViT-B/16 | 640 × 640 | 32.82 | 51.84 | 43.62 | 63.27 | 11.69 | 34.99 | 43.34 | 0.00 |
| Cat-Seg [19] | ViT-B/16 | 384 × 384 | 48.80 | 62.98 | 72.15 | 79.87 | 45.98 | 47.79 | 46.67 | 0.36 |
| SED [20] | ConvNeXt-B | 768 × 768 | 40.20 | 54.92 | 73.99 | 89.66 | 0.71 | 34.46 | 41.73 | 0.67 |
| GSnet [21] | ViT-B/16 | 384 × 384 | 41.94 | 59.00 | 72.21 | 80.84 | 15.61 | 37.07 | 45.61 | 0.30 |
| Ebseg [30] | ViT-B/16 | 640 × 640 | 50.76 | 64.17 | 72.52 | 86.32 | 43.95 | 45.24 | 56.54 | 0.00 |
| HR-Seg | ViT-B/16 | 1152 × 1152 | 55.23 | 67.04 | 74.59 | 90.57 | 57.52 | 52.21 | 53.39 | 3.09 |
Table 2. Ablation on the input resolution. Best values are bolded.
| Input Size | GPU Memory | Inference Speed | mIoU (VDD) | mIoU (UDD5) | mIoU (UAVid) |
|---|---|---|---|---|---|
| 768 × 768 | 7809 M | 0.07 s | 85.66 | 55.71 | 37.54 |
| 1152 × 1152 | 15,451 M | 0.13 s | 89.38 | 73.67 | 55.23 |
| 1536 × 1536 | 20,103 M | 0.23 s | 89.07 | 72.93 | 52.63 |
Table 3. Computation performance comparison with other methods. GPU memory consumption and inference speed were evaluated on an NVIDIA RTX 4090 GPU with a batch size of 1. Best values are bolded.
| Method | Additional Backbone | Image Size | Parameters | GPU Memory | Inference Speed |
|---|---|---|---|---|---|
| SAN [18] | - | 640 × 640 | 157.84 M | 11,621 M | 0.04 s |
| Cat-Seg [19] | - | 384 × 384 | 154.29 M | 8535 M | 0.11 s |
| SED [20] | - | 768 × 768 | 180.76 M | 2403 M | 0.02 s |
| GSnet [21] | DINO [32] | 384 × 384 | 243.99 M | 9581 M | 0.26 s |
| Ebseg [30] | SAM [31] | 640 × 640 | 261.83 M | 14,511 M | 0.15 s |
| HR-Seg (ours) | - | 1152 × 1152 | 155.89 M | 15,451 M | 0.12 s |
Table 4. Ablation on the different components. Best values are bolded.
| Method | Image Size | Fusion Method | MSEB | DEE | DGL | DAL | mIoU (VDD) | mIoU (UDD5) | mIoU (UAVid) |
|---|---|---|---|---|---|---|---|---|---|
| Cat-Seg [19] | 384 × 384 | - | - | - | - | - | 88.08 | 59.07 | 48.80 |
| Cat-Seg* [19] | 768 × 768 | - | - | - | - | - | 88.18 | 50.41 | 42.46 |
| Baseline | 384 × 384 | - |  |  |  |  | 83.98 | 60.49 | 47.92 |
| A | 384 × 384 | - | ✓ |  |  |  | 84.24 | 58.87 | 47.14 |
| B | 1152 × 1152 | CVF |  |  | ✓ |  | 87.95 | 70.16 | 49.99 |
| C | 1152 × 1152 | CVF | ✓ |  | ✓ |  | 88.32 | 71.48 | 51.97 |
| D | 1152 × 1152 | CVF |  | ✓ | ✓ |  | 88.41 | 70.70 | 50.32 |
| E | 1152 × 1152 | FF |  | ✓ | ✓ |  | 90.37 | 67.86 | 49.42 |
| F | 1152 × 1152 | CVF |  | ✓ | ✓ | ✓ | 89.55 | 70.60 | 54.81 |
| G | 1152 × 1152 | CVF | ✓ | ✓ | ✓ | ✓ | 89.38 | 73.67 | 55.23 |
Table 5. Ablation on the dual-path aggregate architecture. Best values are bolded and second-best values are underlined.
| Global Aggregation Blocks | Local Aggregation Blocks | mIoU (VDD) | mIoU (UDD5) | mIoU (UAVid) |
|---|---|---|---|---|
| 1 | 1 | 89.42 | 71.11 | 54.10 |
| 1 | 2 | 89.46 | 71.94 | 54.16 |
| 1 | 3 | 89.37 | 73.79 | 51.72 |
| 2 | 1 | 89.38 | 73.67 | 55.23 |
| 2 | 2 | 89.38 | 72.93 | 52.91 |
| 2 | 3 | 89.43 | 72.62 | 53.79 |
| 3 | 1 | 89.43 | 72.44 | 54.65 |
| 3 | 2 | 89.66 | 71.50 | 52.80 |
| 3 | 3 | 89.22 | 72.35 | 52.61 |
Table 6. Ablation on multi-scale embedding block. Best values are bolded and second-best values are underlined.
| Kernel Combination | mIoU (VDD) | mIoU (UDD5) | mIoU (UAVid) |
|---|---|---|---|
| 7 × 7 only | 89.55 | 70.60 | 54.81 |
| single kernel | 89.39 | 69.17 | 54.05 |
| single kernel | 89.54 | 73.28 | 53.42 |
| single kernel | 89.34 | 73.19 | 52.75 |
| two kernels | 89.46 | 72.93 | 55.10 |
| two kernels | 89.51 | 71.43 | 55.13 |
| two kernels | 89.45 | 72.63 | 54.75 |
| two kernels | 89.36 | 71.79 | 53.76 |
| two kernels | 89.57 | 71.58 | 53.99 |
| two kernels | 89.65 | 73.53 | 54.93 |
| 7 × 7 + 5 × 5 + 3 × 3 + 1 × 1 | 89.38 | 73.67 | 55.23 |
Table 7. Ablation on the detail perception block. Best values are bolded.
| Double Guidance Decode Layer | Detail-Aware Decode Layer | mIoU (VDD) | mIoU (UDD5) | mIoU (UAVid) |
|---|---|---|---|---|
| double convolution layer | double convolution layer | 89.62 | 72.89 | 54.81 |
| double convolution layer | detail perception block | 89.38 | 73.67 | 55.23 |
| detail perception block | detail perception block | 89.42 | 71.06 | 54.08 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Chen, Z.; Xie, Y.; Wei, Y. Toward High-Resolution UAV Imagery Open-Vocabulary Semantic Segmentation. Drones 2025, 9, 470. https://doi.org/10.3390/drones9070470

