Article

UAV-OVD: Open-Vocabulary Object Detection in UAV Imagery via Multi-Level Text-Guided Decoding

1 School of Cybersecurity, Northwestern Polytechnical University, Xi’an 710072, China
2 School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
3 School of Computer Science, Northwestern Polytechnical University, Xi’an 710129, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Drones 2025, 9(7), 495; https://doi.org/10.3390/drones9070495
Submission received: 9 June 2025 / Revised: 9 July 2025 / Accepted: 11 July 2025 / Published: 14 July 2025
(This article belongs to the Special Issue Applications of UVs in Digital Photogrammetry and Image Processing)

Abstract

Object detection in drone-captured imagery has attracted significant attention due to its wide range of real-world applications, including surveillance, disaster response, and environmental monitoring. The majority of existing methods are developed under closed-set assumptions, and although some recent studies have begun to explore open-vocabulary or open-world detection, their application to UAV imagery remains limited and underexplored. In this paper, we address this limitation by exploring the relationship between images and textual semantics to extend object detection in UAV imagery to an open-vocabulary setting. We propose a novel and efficient detector named Unmanned Aerial Vehicle Open-Vocabulary Detector (UAV-OVD), specifically designed for drone-captured scenes. To facilitate open-vocabulary object detection, we propose improvements from three complementary perspectives. First, at the training level, we design a region–text contrastive loss to replace the conventional classification loss, allowing the model to align visual regions with textual descriptions beyond fixed category sets. Second, at the structural level, we introduce a multi-level text-guided fusion decoder that integrates visual features across multiple spatial scales under language guidance, thereby improving overall detection performance and enhancing the representation and perception of small objects. Finally, from the data perspective, we enrich the original dataset with synonym-augmented category labels, enabling more flexible and semantically expressive supervision. Experiments conducted on two widely used benchmark datasets demonstrate that our approach achieves significant improvements in both mAP and Recall. For instance, for Zero-Shot Detection on xView, UAV-OVD achieves 9.9 mAP and 67.3 Recall, 1.1 and 25.6 points higher than those of YOLO-World. In terms of speed, UAV-OVD achieves 53.8 FPS, nearly twice as fast as YOLO-World and five times faster than DescReg, demonstrating its strong potential for real-time open-vocabulary detection in UAV imagery.

1. Introduction

Object detection in UAV (Unmanned Aerial Vehicle) imagery, which involves locating and classifying objects within images captured by drones from high altitudes, plays a crucial role in a wide range of real-world applications such as environmental monitoring [1,2], urban planning [3], traffic surveillance [4], disaster response [5], and military reconnaissance [6]. Compared to ground-based images, UAV imagery poses unique challenges due to its complex backgrounds, multi-scale object distributions, high object density, and diverse viewing perspectives. Traditional object detectors for aerial images are typically built to recognize a fixed set of predefined categories, often prioritizing detection accuracy while neglecting adaptability and scalability. This closed-set assumption restricts their performance in open-world scenarios where unknown or unseen objects may emerge. A core limitation lies in the reliance on static class indices, which hinders the model’s ability to generalize beyond the training vocabulary. With the recent progress in vision-language models that align visual inputs with free-form textual descriptions, new opportunities have emerged to build open-vocabulary detection systems for UAV imagery, enabling flexible recognition of novel objects without the need for exhaustive manual labeling.
Recent advances in natural image understanding have demonstrated that aligning visual features with textual information can effectively endow object detectors with open-vocabulary capabilities [7,8,9,10]. By leveraging vision-language alignment, these methods have shown promising results in recognizing categories beyond those seen during training. Broadly speaking, such approaches can be categorized into two main types. The first type [7,9,11,12] relies on feature pyramid networks (FPNs) or similar architectures to generate object proposals, which are subsequently classified using pretrained vision-language models (VLMs), such as VILD [7] and VL-PLM [9]. However, these two-stage detection pipelines often impose significant computational burdens, making them less suitable for real-time applications on resource-constrained platforms like UAVs. The second type of approach either utilizes pretrained VLMs to generate region–text pairs [13,14] or jointly models detection and grounding tasks [8,10,15,16,17], with the goal of collecting large-scale data to train task-specific VLMs from scratch. For example, GLIP [15] formulates object detection as a phrase grounding task, effectively unifying these two objectives. Nonetheless, existing methods typically limit class names to a fixed set of predefined English terms. Unlike natural images, UAV imagery presents unique challenges such as complex backgrounds, significant variations in object scales, and diverse object orientations. Among these, the detection of small objects has drawn particular attention due to its wide range of practical applications. However, general-purpose and traditional object detection methods often yield suboptimal performance under such conditions. To address these issues, we design a text-guided decoding module specifically tailored to enhance detection in aerial scenarios by leveraging semantic guidance from language. In an open-world setting, however, a single object category may be referred to in diverse linguistic forms. To address this challenge, we propose a novel approach that incorporates synonymous expressions of class names into the training process. This strategy demonstrates significant improvements in scenarios requiring diverse language understanding and offers a promising direction for future research on open-vocabulary object detection.
In this paper, we propose UAV-OVD, an efficient open-vocabulary object detector tailored for aerial imagery. UAV-OVD builds upon the RT-DETR framework [18], with modifications specifically designed to meet the efficiency and deployment constraints of UAV-based detection tasks. Leveraging the principle of image–text alignment, our model integrates a CLIP-based encoder [19,20] and introduces a region–text contrastive loss to replace the conventional category classification loss, thereby removing the reliance on a fixed set of predefined categories. To address the challenges of detecting small objects and dealing with cluttered backgrounds in aerial images, we design a text-guided decoder that enhances class-relevant feature extraction through cross-modal fusion. Specifically, we introduce a multi-level text-guided fusion decoder (MTFD), which effectively boosts detection performance for small-scale targets in complex aerial environments. Furthermore, to further enhance the model’s adaptability in open-world scenarios, we manually curate a set of common synonyms for each class name and incorporate them into the training process. During optimization, both the original and extended label representations are fused into the loss computation, guiding the model to learn more diverse semantic associations. This strategy significantly improves the model’s generalization ability in handling diverse linguistic expressions. Our main contributions are summarized as:
  • We propose UAV-OVD, an efficient open-vocabulary detection framework tailored for aerial scenarios. By leveraging image–text alignment, UAV-OVD incorporates class-level semantic information and employs a region–text contrastive loss, enabling the model to recognize categories beyond predefined label sets with greater flexibility.
  • We design a multi-level text-guided fusion decoder (MTFD) to address the challenges of small object sizes and complex backgrounds in aerial imagery. This module fuses multi-scale visual features with textual cues, enhancing the extraction of class-relevant representations and significantly improving the detection of small and visually ambiguous objects.
  • We introduce a class extension strategy to improve the model’s linguistic generalization capabilities. By integrating manually curated synonyms into the training process, the model learns to associate diverse natural language expressions with target categories, enhancing its robustness in open-world scenarios.
The remainder of this paper is structured as follows. Section 2 provides an overview of related work in the field of UAV-based target detection. Section 3 presents the proposed UAV-OVD framework, detailing its overall architecture and the key modules that drive its performance. Section 4 outlines the experimental setup, including datasets, evaluation metrics, and result analysis. Section 5 concludes the paper with a summary of findings. Finally, Section 6 discusses possible future directions and applications of open-vocabulary detection for drones.

2. Related Work

2.1. Object Detection

Object detection plays an important role in the field of computer vision, which deals with detecting instances of visual objects of a certain class (such as humans, animals, or cars) in digital images. The goal of object detection is to design computational models and methods that determine the presence and locations of objects within an image. Object detection serves as a basis for many other computer vision tasks, such as instance segmentation, image captioning, and object tracking.
In recent years, the rapid development of deep learning techniques has greatly promoted the progress of object detection. Object detection has undergone three major stages of evolution: the region-based two-stage detectors, the single-stage detectors, and the Transformer-based end-to-end detectors. The region-based two-stage detectors marked the beginning of deep learning’s major impact on object detection. In 2014, Girshick et al. proposed R-CNN [21], which generates region proposals using selective search and classifies each region with CNN-extracted features. Although it achieved significant performance gains, its inference speed was slow due to redundant computations. This issue was addressed by Faster R-CNN [22], which introduced a Region Proposal Network (RPN) [22] to generate proposals directly within the network, forming an end-to-end trainable framework with improved efficiency. Later, feature pyramid networks (FPNs) [23] were developed to enhance detection across scales by combining semantic-rich features from different layers, further improving accuracy on objects of varying sizes. Following the two-stage detectors, one-stage detectors emerged to improve detection speed by performing classification and localization in a single step. YOLO (You Only Look Once) [24], proposed in 2015, applies a unified network to the entire image for fast detection, though it initially had lower accuracy on small objects. Later versions like YOLOv7 [25] improved both speed and precision with optimized designs. RetinaNet [26], introduced in 2017, tackled the accuracy gap by using focal loss to address class imbalance during training, achieving accuracy comparable to two-stage methods while maintaining high speed.
Recently, Transformers have deeply influenced the field of computer vision. By discarding the traditional convolution operator in favor of self-attention, Transformers overcome the limitations of CNNs, offering superior global modeling, end-to-end detection, and enhanced multi-scale feature fusion, which leads to improved accuracy and generalization in object detection. In 2020, Carion et al. proposed DETR [27], the first end-to-end object detector based on Transformers, which formulates object detection as a direct set prediction problem without using anchor boxes or NMS. To address DETR’s slow convergence and poor performance on small objects, Zhu et al. introduced Deformable DETR [28], which uses sparse attention around reference points for faster training and better multi-scale feature handling. Later, Zhao et al. presented RT-DETR [18], a real-time detection Transformer that balances speed and accuracy by optimizing the Transformer structure for efficient multi-scale feature interaction and fusion.

2.2. UAV Imagery Detection

UAV imagery object detection is a critical technology widely used in fields such as military reconnaissance, urban surveillance, agricultural inspection, and disaster emergency response. Since UAVs capture images from an overhead perspective during flight, the resulting images often feature complex backgrounds, small targets, multi-scale variations, and target occlusions, posing significant challenges for object detection. Traditional UAV object detection methods primarily rely on hand-crafted feature extraction and shallow machine learning algorithms. These include background modeling, frame differencing, edge detection, sliding window search, and support vector machines (SVMs). For example, Shehata et al. [29] proposed a block-based background modeling method for vehicle detection using pixel-level differences and various techniques, with discrete cosine transform (DCT) achieving the best results. Chen et al. [30] combined HOG features with an SVM classifier using a sliding window to detect small vehicles, but this method has high computational cost and is less suitable for real-time UAV tasks.
With the rapid development of deep learning technology, CNN- and Transformer-based object detection methods have demonstrated strong feature learning and generalization capabilities in UAV imagery. Deep neural networks can automatically extract robust multi-level features, achieving significant progress in complex scenes and small-object detection. For example, PHSI-RTDETR [31] combines the Hilo attention mechanism with cross-scale feature fusion and optimizes the loss function to achieve high accuracy and real-time performance in infrared small-object detection. LSKNet [32] is a lightweight backbone network that dynamically adjusts its large spatial receptive field to better handle scale variations and complex backgrounds in remote sensing images. Multi-scale feature fusion has emerged as a critical technique to enhance detection accuracy in UAV imagery, particularly in addressing challenges posed by significant scale variations and the presence of small targets. By aggregating features from multiple layers or resolutions, these methods effectively improve the model’s ability to capture both coarse and fine-grained information. Common approaches include feature pyramid networks (FPNs) [23], bidirectional feature pyramid networks (BiFPNs) [33], and attention mechanisms such as Coordinate Attention (CA) and Global Attention Mechanism (GAM), which facilitate adaptive weighting and integration of multi-scale features. For instance, the Dogfight algorithm [34] incorporates spatiotemporal information fusion with attention mechanisms, leading to improved detection of small UAVs in complex scenarios. TGC-YOLOv5 [35] integrates Transformer modules and multiple attention mechanisms to enhance detection performance under low-visibility conditions. Moreover, weighted BiFPN architectures have been proposed to reduce computational complexity while maintaining or improving accuracy through efficient multi-scale feature fusion.

2.3. Open Vocabulary Object Detection

The past decade has witnessed a steady and remarkable progress in object detection tasks driven by CNN-based and Transformer-based models. However, existing detectors are limited to localizing predefined semantic categories within a specific dataset. This closed-set constraint significantly hinders their applicability in real-world scenarios. Zero-Shot Detection (ZSD) [36] was initially proposed to address this issue by enabling object detectors to recognize unseen categories without requiring annotated examples during training. Typically, ZSD replaces the conventional learnable classifier with fixed semantic embeddings, such as GloVe [37], which captures word co-occurrence statistics to represent semantics, or BERT [38], a deep contextual language model that encodes rich semantic and syntactic information. It then applies visual–semantic space mapping or novel visual feature synthesis to transfer knowledge from seen to unseen classes. While ZSD represents a significant conceptual advancement, its performance is often limited due to the lack of supervision and the domain gap between visual and semantic modalities.
Recently, open-vocabulary object detection (OVD) has gained traction as a promising direction in modern object detection, which aims to detect objects beyond the predefined categories. Inspired by vision-language models (VLMs), recent works have incorporated class semantic information into detectors and leveraged techniques such as image–text alignment and cross-modal fusion, enabling open-vocabulary object detection. For instance, CastDet [39] employs a CLIP-guided student-teacher framework to generate high-quality pseudo-labels, significantly improving novel class detection. YOLO-World [10] integrates vision-language modeling into the YOLO framework with a reparameterized path aggregation network, achieving strong performance and real-time inference on benchmarks like LVIS [40]. UAV-OVD in this paper aims to achieve open-vocabulary detection for UAV images by leveraging image–text alignment and cross-modal fusion, enhancing feature extraction under limited aerial data conditions.

3. Methods

The overall architecture of UAV-OVD is illustrated in Figure 1. To address the challenges of open-vocabulary object detection in aerial scenarios, we propose UAV-OVD, a high-efficiency detection framework developed based on RT-DETR [18] to meet the real-time demands of UAV-based applications. Our method introduces coordinated improvements across the training strategy, network architecture, and data design:
  • Training level: A region–text contrastive loss is introduced to replace conventional classification loss.
  • Architectural level: A multi-level text-guided fusion decoder (MTFD) is designed to enhance feature fusion across scales.
  • Data level: A synonym-based class extension strategy is applied to enrich textual supervision.
Concretely, the region–text contrastive loss guides the model to align image regions with their corresponding textual descriptions, enabling flexible category recognition beyond fixed label sets. To address the challenges of small object sizes and cluttered backgrounds in aerial imagery, the proposed MTFD module integrates multi-scale visual features under the guidance of language embeddings, enhancing the model’s ability to perceive fine-grained and class-relevant details. Furthermore, at the data level, we incorporate manually curated synonyms for each category label into the training process. This encourages the model to generalize across diverse textual expressions, improving robustness in open-world settings where object descriptions may vary significantly.

3.1. Image–Text Alignment

To effectively bridge the gap between visual and textual modalities, we formulate the image–text alignment task as a region–text matching problem. As illustrated in Figure 2, we design a carefully tailored loss composition to enhance the precision of bounding box predictions. Specifically, we utilize a CLIP-based image encoder [19] to extract visual features and a text encoder to transform category labels into dense semantic embeddings. These features are then projected into a shared embedding space via a contrastive head. We achieve image–text alignment through the formulation of a region–text contrastive loss, which serves as the core component of our alignment strategy. This loss encourages matched region–text pairs to stay close in the embedding space while pushing apart mismatched pairs, thereby replacing traditional categorical classification loss and promoting open-vocabulary recognition. In addition to the contrastive objective, we incorporate an IoU loss and an L1 loss, inspired by previous works [18,41], to supervise the localization accuracy. Furthermore, a Wasserstein loss is introduced to enhance the model’s capability in capturing fine-grained spatial distributions of object boxes.
  • Region–text contrastive loss. In conventional object detectors such as RT-DETR [18], classification supervision is typically imposed through a categorical regression loss applied to the outputs of a classification head. Notably, RT-DETR [18] adopts the Varifocal Loss [42], which computes a soft classification loss that emphasizes high-quality predictions by weighting the classification confidence with the corresponding IoU scores. This design allows the detector to better capture both the correctness and the localization quality of object predictions.
    Building upon this foundation, we depart from the use of traditional category logits and instead adopt a more generalizable strategy based on semantic embeddings. Specifically, we leverage a CLIP-based architecture [19,20] to extract both visual and textual embeddings in a unified semantic space: category labels are encoded using the CLIP text encoder to obtain dense semantic representations. For the visual stream, we adopt the ResNet-50 variant of the CLIP image encoder for its balance between performance and efficiency. To better adapt it to the detection setting, we remove the final AttnPool layer and extract multi-scale feature maps from the last three stages of the backbone. The resulting region-level visual features are then projected into the same embedding space as the text features through a lightweight linear projection, enabling effective region–text alignment for open-vocabulary object detection. To establish a shared visual-linguistic space, we introduce a contrastive head that projects both image regions and textual embeddings into comparable dimensions.
    As illustrated in Figure 2, given an aerial image and a list of category labels, we first encode each class label into a dense text embedding, denoted as $T_j$, using the CLIP text encoder. The aerial image is passed through the CLIP-based visual encoder to extract multi-scale feature maps, which are subsequently flattened and processed to generate region-level object queries. From these, we select the top-K object queries based on confidence scores, denoted as $Q_i$. To measure the semantic similarity between visual and textual representations, we compute a similarity score between each object query $Q_i$ and class embedding $T_j$ as follows:
$$ S(Q_i, T_j) = \mu \cdot \frac{Q_i \cdot T_j^{\top}}{\|Q_i\|_2 \, \|T_j\|_2} + \beta $$
    where $S(Q_i, T_j)$ denotes the cosine similarity between the visual query and the textual embedding. Following the default setting in [42], the region–text contrastive loss adopts the Varifocal Loss formulation as follows:
$$ \mathcal{L}_{\mathrm{RTC}}(S, q) = \begin{cases} -q \left[\, q \log(S) + (1 - q) \log(1 - S) \,\right] & q > 0 \\ -\alpha S^{\gamma} \log(1 - S) & q = 0 \end{cases} $$
    where $S \in [0, 1]$ is the predicted similarity score between a region query and a text embedding and $q \in [0, 1]$ is defined as the IoU between the predicted box $B_j$ and its matched ground-truth box $B_k$ via Hungarian assignment.
  • IoU loss. To accurately supervise the spatial alignment between predicted and ground-truth bounding boxes, we follow prior works [18,41] and adopt the IoU loss [43] as one of our localization objectives. Unlike the smooth L1 loss [22], which treats box coordinates independently, IoU-based losses directly measure the overlap quality between two boxes, providing a more geometry-aware and task-relevant supervision signal. This loss encourages predicted boxes to match the ground truth not just in location but also in shape and scale.
    Given a predicted box $B_{\mathrm{pred}}$ and its assigned ground-truth box $B_{\mathrm{gt}}$, the IoU loss is defined as:
$$ \mathcal{L}_{\mathrm{IoU}} = 1 - \frac{|B_{\mathrm{pred}} \cap B_{\mathrm{gt}}|}{|B_{\mathrm{pred}} \cup B_{\mathrm{gt}}|} $$
    where $|\cdot|$ denotes the area of a box and $\cap$, $\cup$ represent the intersection and union between the predicted and ground-truth boxes, respectively. A lower IoU loss corresponds to a higher degree of spatial alignment.
  • L1 loss. In addition to the IoU loss, we also incorporate the L1 loss to supervise the regression of bounding box coordinates. L1 loss is a commonly used objective in object detection tasks, offering a stable and straightforward optimization target. Similar to the IoU loss, we follow prior works [18,41] and apply it to directly minimize the difference between predicted and ground-truth box parameters.
    Given a predicted bounding box $B_{\mathrm{pred}} = (x_p, y_p, w_p, h_p)$ and a ground-truth box $B_{\mathrm{gt}} = (x_g, y_g, w_g, h_g)$, the L1 loss is computed as:
$$ \mathcal{L}_{L1} = \sum_{i \in \{x, y, w, h\}} \left| B_i^{\mathrm{pred}} - B_i^{\mathrm{gt}} \right| $$
    This formulation ensures that each coordinate of the predicted box is regressed toward the ground-truth value, contributing to precise localization in combination with the IoU-based supervision.
  • Wasserstein loss. Beyond the standard localization losses (i.e., IoU and L1), we introduce a Wasserstein loss [44] as a complementary objective to further enhance the model’s ability to capture the spatial distribution and geometric alignment of predicted bounding boxes. Unlike IoU and L1 losses, which focus on overlap and coordinate-wise differences, respectively, the Wasserstein loss provides a more holistic measure of distance between box distributions in a metric space.
    This loss is inspired by optimal transport theory and is designed to measure the “effort” required to transform one bounding box into another. Given two boxes $B_{\mathrm{pred}} = (x_p, y_p, w_p, h_p)$ and $B_{\mathrm{gt}} = (x_g, y_g, w_g, h_g)$, we define the Wasserstein loss as:
$$ \mathcal{L}_{\mathrm{Wass}} = \sqrt{(x_p - x_g)^2 + (y_p - y_g)^2} + \sqrt{(w_p - w_g)^2 + (h_p - h_g)^2} $$
    The first term captures the Euclidean distance between the centers of the predicted and ground-truth boxes, while the second term reflects the distance in size (width and height). This loss encourages predicted boxes not only to match the location but also to mimic the spatial extent of the ground truth, which is particularly beneficial in complex aerial scenes with high object variability.
    Since this component is not present in prior works [18,41], it represents our novel contribution to the overall loss design, aimed at improving robustness and fine-grained localization performance in open-vocabulary UAV detection.
To jointly optimize the model for both semantic alignment and precise localization, we integrate all aforementioned loss components into a unified detection objective. The total loss is defined as:
$$ \mathcal{L}_{\mathrm{det}} = \lambda \, \mathcal{L}_{\mathrm{RTC}} + \delta \, \mathcal{L}_{\mathrm{IoU}} + \nu \, \mathcal{L}_{L1} + \tau \, \mathcal{L}_{\mathrm{Wass}} $$
where $\mathcal{L}_{\mathrm{RTC}}$ denotes the region–text contrastive loss, which serves as the cornerstone for aligning visual and textual modalities in a shared embedding space. The terms $\mathcal{L}_{\mathrm{IoU}}$ and $\mathcal{L}_{L1}$ provide complementary supervision for bounding box regression, capturing overlap quality and coordinate-wise accuracy, respectively. Our proposed $\mathcal{L}_{\mathrm{Wass}}$ further enhances the geometric consistency between predicted and ground-truth boxes by modeling their spatial distribution through optimal transport principles.
This multi-component loss design ensures that the model not only learns robust image–text associations but also achieves accurate and stable object localization, which is crucial for open-vocabulary detection in UAV imagery.
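To make the loss composition concrete, below is a minimal PyTorch sketch of how the region–text similarity, the Varifocal-style contrastive term, and the three localization terms could be combined. It is an illustration under stated assumptions rather than our released implementation: the sigmoid squashing of the scaled similarity, the focal parameters alpha and gamma, the small epsilon terms, and the loss weights are placeholders, and query-to-ground-truth matching is assumed to have already been performed.

```python
import torch
import torch.nn.functional as F

def region_text_similarity(queries, text_emb, mu=1.0, beta=0.0):
    """Scaled cosine similarity between K object queries (K, d) and C class
    embeddings (C, d); the sigmoid keeps scores in [0, 1] (an assumption)."""
    q = F.normalize(queries, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    return torch.sigmoid(mu * q @ t.t() + beta)          # (K, C)

def rtc_loss(sim, q_target, alpha=0.75, gamma=2.0):
    """Varifocal-style region-text contrastive loss; q_target holds the IoU of
    matched (positive) pairs and 0 for negatives, same shape as sim."""
    pos = q_target > 0
    loss = torch.zeros_like(sim)
    # positives: IoU-weighted binary cross-entropy
    loss[pos] = -q_target[pos] * (
        q_target[pos] * torch.log(sim[pos] + 1e-8)
        + (1 - q_target[pos]) * torch.log(1 - sim[pos] + 1e-8))
    # negatives: focally down-weighted
    loss[~pos] = -alpha * sim[~pos].pow(gamma) * torch.log(1 - sim[~pos] + 1e-8)
    return loss.sum()

def box_losses(pred, gt):
    """IoU, L1 and Wasserstein-style terms for matched boxes in (cx, cy, w, h)."""
    l1 = (pred - gt).abs().sum(-1)
    p1, p2 = pred[..., :2] - pred[..., 2:] / 2, pred[..., :2] + pred[..., 2:] / 2
    g1, g2 = gt[..., :2] - gt[..., 2:] / 2, gt[..., :2] + gt[..., 2:] / 2
    inter = (torch.min(p2, g2) - torch.max(p1, g1)).clamp(min=0).prod(-1)
    union = pred[..., 2:].prod(-1) + gt[..., 2:].prod(-1) - inter
    iou_loss = 1 - inter / (union + 1e-8)
    # center-distance term plus size-distance term
    wass = ((pred[..., :2] - gt[..., :2]).pow(2).sum(-1).sqrt()
            + (pred[..., 2:] - gt[..., 2:]).pow(2).sum(-1).sqrt())
    return iou_loss, l1, wass

def detection_loss(sim, q_target, pred_boxes, gt_boxes,
                   lam=1.0, delta=2.0, nu=5.0, tau=1.0):
    """Total objective: weighted sum of the four terms (weights are placeholders)."""
    iou_l, l1_l, wass_l = box_losses(pred_boxes, gt_boxes)
    return (lam * rtc_loss(sim, q_target)
            + delta * iou_l.sum() + nu * l1_l.sum() + tau * wass_l.sum())
```

In practice, the contrastive term is computed between all selected queries and all class embeddings, whereas the localization terms apply only to queries matched to ground-truth boxes via Hungarian assignment.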

3.2. Multi-Level Text-Guided Fusion Decoder

As illustrated in Figure 3, the proposed multi-level text-guided fusion decoder (MTFD) is designed to enhance the detection capability for small and scale-varying objects. Previous CNN-based detectors [25,45,46] have demonstrated that emphasizing low-level visual features can significantly improve sensitivity to small objects. Inspired by this insight, MTFD incorporates multi-level semantic guidance to modulate the decoding process more effectively.
Specifically, we first enhance the original class embeddings from the text encoder using multi-scale image features extracted from the visual backbone. This fusion is achieved through multi-head cross-attention (MHCA), allowing each class embedding to absorb relevant visual context from different feature levels. Formally, the enhanced class embeddings at level i are computed as:
$$ T_i = T + \mathrm{MHCA}(T, I_i, I_i), \quad i \in \{1, 2, 3\} $$
where $T$ is the original class embedding obtained from the CLIP text encoder and $I_i$ represents the visual feature map at scale level $i$. Each MHCA module is defined as:
$$ \mathrm{MHCA}(T, I_i, I_i) = \mathrm{Concat}(H_1, \ldots, H_h) \, W^{O} $$
$$ H_j = \mathrm{Softmax}\!\left( \frac{T W_j^{T} \left( I_i W_j^{I_i} \right)^{\top}}{\sqrt{d_h}} \right) I_i W_j^{I_i} $$
where $W_j^{T}, W_j^{I_i} \in \mathbb{R}^{d \times d_h}$ are learned projection matrices for the $j$-th head and $d_h = d / h$ is the dimension of each head. $W^{O} \in \mathbb{R}^{d \times d}$ is the final output projection matrix after concatenating all heads. Softmax is applied along the key dimension to compute attention weights.
Using the same feature map for both key (K) and value (V) inputs in MHCA enables the model to perform more focused attention, ensuring that the semantic query (class embedding) retrieves information that is both semantically and spatially aligned. This self-consistency in K and V helps preserve feature integrity while minimizing noise from unrelated regions.
We denote the resulting multi-level class embeddings as:
$$ T_{\mathrm{high}} = T_1, \quad T_{\mathrm{mid}} = T_2, \quad T_{\mathrm{low}} = T_3 $$
These representations contain both textual semantics and level-specific visual cues, enabling more informed interaction with object queries during decoding.
In each decoder layer, we inject these class-aware embeddings into the object queries using MHCA and Deformable MHCA [28] to guide the object decoding process. The object query $Q_i$ at decoder layer $i$ is updated as follows:
$$ \tilde{Q}_i = Q_i + \mathrm{MHCA}(Q_i, T_s, T_s) $$
$$ Q_i = \tilde{Q}_i + \mathrm{DeformMHCA}(\tilde{Q}_i, I_s) $$
$$ \mathrm{DeformMHCA}(\tilde{Q}, I_s) = \sum_{m=1}^{M} A_m \cdot \mathcal{I}(I_s, \, p_0 + \Delta p_m) $$
where $T_s$ and $I_s$ correspond to the class embedding and image feature map at the current scale (high/mid/low); $p_0$ is a reference point predicted from $\tilde{Q}$; $\Delta p_m$ are learned offsets for the $m$-th sampling point; $A_m$ are attention weights predicted from $\tilde{Q}$; and $\mathcal{I}(I_s, p_0 + \Delta p_m)$ is bilinear interpolation of the key/value feature map $I_s$ at location $p_0 + \Delta p_m$. Deformable MHCA [28] allows the model to focus on a sparse set of key points across different spatial locations, improving efficiency and accuracy in capturing relevant features from the input. It is especially useful for handling high-resolution images or dense scenes by reducing computational cost while maintaining performance. $T_s$ and $I_s$ are selected based on the decoder layer index:
  • $T_s = T_{\mathrm{high}}$, $I_s = I_{\mathrm{high}}$ for $i = 1$
  • $T_s = T_{\mathrm{mid}}$, $I_s = I_{\mathrm{mid}}$ for $i = 2, 3$
  • $T_s = T_{\mathrm{low}}$, $I_s = I_{\mathrm{low}}$ for $i = 4, 5, 6$
This hierarchical fusion strategy ensures that different decoder layers attend to semantic information that best matches their target object scales. In particular, allocating three decoder layers to T low enhances the network’s focus on fine-grained details, thus significantly improving the detection of small objects in UAV imagery.
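To clarify how the multi-level guidance is wired, the following PyTorch sketch mirrors the structure described in this subsection. It is a simplified illustration, not the exact module: standard nn.MultiheadAttention stands in for Deformable MHCA, the hidden size and head count are assumed values, and the class and method names are ours.

```python
import torch
import torch.nn as nn

class MultiLevelTextGuidedFusion(nn.Module):
    """Sketch of the MTFD idea: class embeddings T are enhanced per feature level
    via cross-attention, then injected into object queries at each decoder layer
    according to the high/mid/low schedule listed above."""

    def __init__(self, d_model=256, num_heads=8, num_layers=6):
        super().__init__()
        self.text_fuse = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.query_text = nn.ModuleList(
            [nn.MultiheadAttention(d_model, num_heads, batch_first=True)
             for _ in range(num_layers)])
        self.query_image = nn.ModuleList(      # stand-in for Deformable MHCA
            [nn.MultiheadAttention(d_model, num_heads, batch_first=True)
             for _ in range(num_layers)])

    def enhance_text(self, T, feats):
        """T: (B, C, d) class embeddings; feats: three flattened maps (B, HW_i, d).
        Returns [T_high, T_mid, T_low]."""
        enhanced = []
        for I_i in feats:
            attn, _ = self.text_fuse(T, I_i, I_i)   # K = V = the same feature map
            enhanced.append(T + attn)
        return enhanced

    def forward(self, queries, T, feats):
        T_levels = self.enhance_text(T, feats)
        schedule = [0, 1, 1, 2, 2, 2]   # layer 1 -> high, 2-3 -> mid, 4-6 -> low
        for i, level in enumerate(schedule):
            T_s, I_s = T_levels[level], feats[level]
            q_txt, _ = self.query_text[i](queries, T_s, T_s)    # text guidance
            queries = queries + q_txt
            q_img, _ = self.query_image[i](queries, I_s, I_s)   # visual cross-attention
            queries = queries + q_img
        return queries
```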

3.3. Extend Class

To address the linguistic diversity inherent in open-world scenarios, we introduce a class extension strategy that enhances the model’s ability to generalize across varied natural language expressions. In such settings, a single object may be referred to using multiple synonymous terms, which poses a challenge for traditional detection methods relying on a fixed vocabulary. To overcome this limitation, we manually curate a set of synonyms for each base class label and incorporate them into the training process. For existing UAV-based detection datasets such as DIOR [1], although class labels are defined and verified by domain experts, they typically do not include corresponding synonyms. To address this, we first leverage GPT-4o [47] to automatically generate candidate synonyms for each original category name. These candidates are then carefully reviewed by human annotators to ensure that the expanded vocabulary accurately and unambiguously reflects the semantics of the original terms. This hybrid approach guarantees both the linguistic richness and the semantic consistency of the label space. For example, the class name “swimming-pool” is extended to include synonymous expressions such as “artificial swimming pool”, “outdoor pool facility”, and “man-made pool”. Specifically, for each original class label $class_{\mathrm{original}}$, we define a set of extended class names $class_{\mathrm{extend}}$. Given an image region feature $q_i$ and text embeddings $t_{\mathrm{original}}$ and $t_{\mathrm{extend}}$, we compute both the similarity score with the original class, $S_{\mathrm{original}}(q_i, t_{\mathrm{original}})$, and with the extended synonym class, $S_{\mathrm{extend}}(q_i, t_{\mathrm{extend}})$, following the formulation introduced in Equation (1). Thus, the total visual-text similarity $S_{i,j}$ is calculated by:
$$ S_{i,j} = \mathrm{AVG}\!\left( S_{\mathrm{original}}(q_i, t_{\mathrm{original}}), \; S_{\mathrm{extend}}(q_i, t_{\mathrm{extend}}) \right) $$
where $q_i$ is the query embedding, $t_j$ stands for the class, and the function $\mathrm{AVG}$ denotes the operation of taking the average. By integrating both into the loss computation, we effectively guide the model to learn more diverse and robust vision-language associations. Figure 2 clearly illustrates the architecture of our class extension strategy. This method not only improves performance in open-vocabulary detection tasks but also offers a valuable direction for future research in enhancing language grounding for real-world applications. This enables our model to exhibit excellent performance in zero-shot evaluations where the original class names in the test set are replaced with their synonymous expressions. For detailed results, please refer to the ablation studies in Section 4.3.
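A minimal sketch of this synonym-averaged scoring is given below. The SYNONYMS table and the encode_text helper are illustrative placeholders (the actual synonym lists are GPT-4o-generated and manually verified, as described above), and plain cosine similarity stands in for the full scaled similarity of Equation (1).

```python
import torch
import torch.nn.functional as F

# Hypothetical synonym table; in practice the lists are generated by GPT-4o
# and then manually verified.
SYNONYMS = {
    "swimming-pool": ["artificial swimming pool", "outdoor pool facility", "man-made pool"],
}

def extended_similarity(query_emb, class_name, encode_text):
    """query_emb: (d,) region/query embedding; encode_text maps a list of strings
    to (N, d) text embeddings (e.g., from the CLIP text encoder)."""
    t_original = encode_text([class_name])                          # (1, d)
    t_extend = encode_text(SYNONYMS.get(class_name, [class_name]))  # (M, d)

    def sim(q, t):
        # mean cosine similarity between the query and a set of text embeddings
        return F.cosine_similarity(q.unsqueeze(0), t, dim=-1).mean()

    s_original = sim(query_emb, t_original)
    s_extend = sim(query_emb, t_extend)
    return 0.5 * (s_original + s_extend)   # AVG of original and extended scores
```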

4. Results

In this section, we demonstrate that the proposed UAV-OVD can effectively overcome category constraints in object detection. Specifically, we evaluate its Zero-Shot Detection capabilities on two open-vocabulary benchmark datasets [1,48].

4.1. Experimental Setup

To ensure fair comparison and reproducibility, we initially follow the dataset partitioning strategy proposed in DescReg [49], which divides data across three standard aerial image datasets: xView [48], DIOR [1], and DOTA [50]. However, we observe that the DIOR and DOTA datasets contain overlapping images and object instances, which introduces a risk of data leakage during evaluation. In addition, many categories are shared between the two datasets. As a result, certain classes may appear multiple times during training under different dataset splits, effectively receiving disproportionately more training exposure. This can lead to inflated performance when these categories are later evaluated in the testing phase, resulting in overly optimistic and less reliable results.
To address this, we design two splitting protocols: one that follows the original setup and another that carefully reassigns training and testing categories to ensure no category-level overlap between the splits. This revised partitioning enables a cleaner and more reliable evaluation of open-vocabulary detection performance, especially in cross-dataset generalization scenarios. We report results under both the original DescReg [49] split and our curated split for a comprehensive assessment.
In addition, for consistency and training efficiency, we uniformly crop all images in the datasets to a fixed size of 800 × 800 and convert their annotations into the COCO format [51]. This normalization simplifies data processing across datasets. To better capture the dense object distribution in aerial scenes, we utilize 500 object queries in the decoder. Unlike traditional methods that initialize queries based on classification scores, we rank encoder features using the similarity between image and text embeddings, and select the top 500 features as initial queries. This same similarity-driven strategy is also employed during label assignment, ensuring semantic consistency between image regions and textual category representations. We build our model on MMDetection [52], using RT-DETR [18] with ResNet-50 [53] as the backbone. Both the image and text encoders are initialized from R50-CLIP [20]. Training is performed on four RTX 4090 GPUs with a batch size of 8. Following [15], we dynamically sample positive classes per image and randomly select negatives, while shuffling class embedding order to avoid positional bias. For fair comparison with YOLO-World [10], we adopt their training protocol and use their released models pretrained on Objects365V1 [54], GQA [55], Flickr30K [56], and CC3M [57].
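The similarity-driven query initialization can be sketched as follows; the function name, tensor shapes, and the use of plain cosine similarity are simplifications introduced here only to illustrate ranking encoder features against the text embeddings.

```python
import torch
import torch.nn.functional as F

def select_initial_queries(encoder_feats, text_emb, k=500):
    """Rank flattened encoder features (N, d) by their best cosine similarity to
    any class embedding (C, d) and keep the top-k as initial decoder queries."""
    f = F.normalize(encoder_feats, dim=-1)      # (N, d)
    t = F.normalize(text_emb, dim=-1)           # (C, d)
    sim = f @ t.t()                             # (N, C) region-text similarity
    scores, _ = sim.max(dim=-1)                 # best-matching class per feature
    topk = scores.topk(k=min(k, f.size(0))).indices
    return encoder_feats[topk], topk            # (k, d) initial queries + indices
```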
We conduct experiments on two well-established benchmarks for zero-shot aerial object detection: xView [48] and DIOR [1].
  • xView contains over 1 million object instances across 60 categories. The imagery is captured at a ground resolution of 0.3 m, offering significantly higher spatial detail compared to most public satellite datasets. It features a wide variety of small, rare, fine-grained, and multi-type objects with bounding box annotations, making it particularly suitable for UAV-based object detection tasks.
  • DIOR is a large-scale benchmark dataset for object detection in optical remote sensing imagery. It consists of 23,463 images and 192,472 annotated object instances, covering 20 object categories. The dataset provides a rich diversity of scenes and object types that support evaluation of detection performance in more generalized aerial contexts.

4.2. Evaluation Metrics

We evaluate our model using standard object detection metrics, including mean average precision (mAP), recall (R), and the harmonic mean (HM), following the protocols in [39,49]. All metrics are computed at an Intersection over Union (IoU) threshold τ = 0.5 to ensure consistent comparison.
The mean Average Precision is defined as:
$$ \mathrm{mAP} = \frac{1}{C} \sum_{c=1}^{C} \mathrm{AP}_c , $$
where $\mathrm{AP}_c$ denotes the area under the Precision–Recall curve for class $c$ and $C$ is the total number of evaluated categories.
Recall is defined as:
$$ \mathrm{Recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} $$
where TP and FN refer to the number of true positives and false negatives, respectively.
To measure the balance between base (seen) and novel (unseen) class performance, we compute the harmonic mean:
$$ \mathrm{HM} = \frac{2 \times \mathrm{mAP}_B \times \mathrm{mAP}_N}{\mathrm{mAP}_B + \mathrm{mAP}_N} , $$
where $\mathrm{mAP}_B$ and $\mathrm{mAP}_N$ are the mAP scores on base and novel categories, respectively. A higher HM indicates better generalization and balanced recognition across both seen and unseen categories.
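As a quick illustration of how these metrics combine, the short sketch below computes mAP from precomputed per-class AP values, Recall from TP/FN counts, and the harmonic mean used for GZSD reporting; the numbers in the usage comment are arbitrary examples, not results from our experiments.

```python
def mean_ap(ap_per_class):
    """mAP over the evaluated categories (per-class AP assumed precomputed at IoU 0.5)."""
    return sum(ap_per_class) / len(ap_per_class)

def recall(tp, fn):
    """Recall = TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0

def harmonic_mean(map_base, map_novel):
    """Harmonic mean of base and novel scores; 0 if either is 0."""
    if map_base + map_novel == 0:
        return 0.0
    return 2 * map_base * map_novel / (map_base + map_novel)

# Example with arbitrary values: harmonic_mean(80.0, 20.0) -> 32.0
```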
We report results under two evaluation settings. The Zero-Shot Detection (ZSD) setting evaluates the model’s ability to recognize and localize novel categories that were not present during training. In contrast, the Generalized Zero-Shot Detection (GZSD) setting assesses the model’s performance on both base and novel classes jointly, requiring it to distinguish between seen and unseen categories in an open-world scenario. This setup provides a more realistic and challenging evaluation of open-vocabulary detection systems.

4.3. Ablation Experiment

As shown in Table 1, we conduct a comprehensive ablation study to evaluate the effectiveness of each key component in UAV-OVD on the xView dataset under both Generalized Zero-Shot Detection (GZSD) and Zero-Shot Detection (ZSD) settings. The main modules analyzed include the multi-level text-guided fusion decoder (MTFD), image–text alignment, and extend class.
In the baseline configuration without the decoder and class extension, the model achieves 3.4 mAP and 28.7 Recall for novel classes under the GZSD setting, and only 7.3 mAP under the ZSD setting. After incorporating the MTFD decoder, the model shows a slight drop in overall Recall ($\mathrm{Recall}_{HM}$ = 33.2) under GZSD, while ZSD mAP improves to 9.6, suggesting that the text-guided structure enhances semantic alignment for novel classes, possibly at the cost of generalization to all categories.
When image–text alignment is further applied, the model demonstrates improved generalization to varied linguistic expressions. Novel class Recall increases to 27.0, and ZSD mAP reaches 10.3, showing that replacing traditional region-level classification loss with contrastive alignment facilitates better recognition of novel objects beyond predefined categories.
Finally, after fully integrating the class extension, the model exhibits a significant performance boost. In this configuration, novel class mAP under GZSD increases to 8.0, and Recall improves to 32.0, achieving a harmonic mean (HM) of 12.5—a gain of over 6.5 compared to the baseline. Under the ZSD setting, mAP and Recall rise to 9.9 and 67.3, respectively, significantly outperforming all other ablation variants.
Figure 4 presents a comparison of object detection results from different modules on the xView [48] dataset. We selected several representative examples of novel classes from the Zero-Shot Detection (ZSD) setting for illustration. In the first row, the example showcases a scene with densely packed and overlapping objects. While most baseline variants miss several instances due to occlusion or clutter, UAV-OVD accurately detects all relevant objects. This highlights its strong capability in handling crowded scenes and distinguishing overlapping targets through enhanced semantic reasoning. In the second row, we examine a large object instance, specifically a “barge.” Although multiple models are able to localize it, they vary in terms of bounding box alignment. UAV-OVD produces the most precise prediction with the highest spatial overlap, demonstrating its advantage in localization accuracy—a critical factor under IoU-based evaluation. The third row focuses on a challenging small-object detection case involving a “truck-tractor-with-box-trailer.” The baseline model fails by incorrectly predicting a nearby demolished structure. With the introduction of the MTFD decoder and image–text alignment modules, predictions gradually shift closer to the correct region. Ultimately, UAV-OVD successfully identifies and localizes the small object, illustrating its enhanced capacity for fine-grained recognition in zero-shot small-object scenarios. As shown, our proposed model demonstrates significant improvements in both detection accuracy and localization precision.
These results confirm the individual contributions of each module, with the contrastive loss being especially crucial for enhancing image–text alignment and achieving robust open-vocabulary detection in UAV scenarios. It is evident that, while the incorporation of the Extend Class module causes a minor reduction in ZSD mAP compared to the prior configuration, it significantly outperforms the other ablation variants overall. This demonstrates the effectiveness of our class extension strategy in enhancing the model’s generalization to both seen and unseen categories. In addition, all other modules consistently contribute performance gains over the baseline in the ablation studies, further validating the effectiveness of each component in the proposed framework.

4.4. Comparison with the State of the Art

As shown in Table 2 and Table 3, we compare UAV-OVD with current state-of-the-art open-vocabulary object detection methods on the xView and DIOR datasets under both Generalized Zero-Shot Detection (GZSD) and Zero-Shot Detection (ZSD) settings.
On the xView dataset (Table 2), UAV-OVD demonstrates superior performance in both base and novel categories. Under the GZSD setting, UAV-OVD achieves $\mathrm{mAP}_B$ = 19.4, $\mathrm{mAP}_{HM}$ = 7.4, $\mathrm{Recall}_B$ = 57.3, $\mathrm{Recall}_N$ = 32.1, and $\mathrm{Recall}_{HM}$ = 42.1; under the ZSD setting, it achieves 9.1 mAP and 63.2 Recall, outperforming all existing methods. Compared with the previous best-performing method YOLO-World-L, UAV-OVD improves $\mathrm{mAP}_B$ by 1.9 and $\mathrm{Recall}_N$ by 16.9, leading to a significant gain of 19.0 in $\mathrm{Recall}_{HM}$ under the GZSD setting. These results highlight UAV-OVD’s strong capability in balancing detection accuracy between seen and unseen categories. Under the ZSD setting, UAV-OVD outperforms YOLO-World-L by a large margin of 7.8 mAP and 43.3 Recall. This substantial improvement confirms the effectiveness of UAV-OVD in generalizing to previously unseen object categories in zero-shot scenarios.
To further verify the generality of our proposed framework, we enhance YOLO-World-L by training it with our Extend Class module. With these enhancements, YOLO-World-L* achieves $\mathrm{mAP}_B$ = 21.7, $\mathrm{Recall}_B$ = 46.6, and $\mathrm{Recall}_{HM}$ = 24.4 under the GZSD setting. Compared with the initial model, YOLO-World-L* increases $\mathrm{mAP}_B$, $\mathrm{Recall}_B$, and $\mathrm{Recall}_{HM}$ by 3.2, 5.2, and 2.0, respectively. These improved results confirm the compatibility and effectiveness of our proposed techniques when integrated with existing models.
On the DIOR dataset (Table 3), UAV-OVD also demonstrates strong performance under both the GZSD and ZSD settings. Under the GZSD setting, UAV-OVD achieves $\mathrm{mAP}_B$ = 77.9, $\mathrm{mAP}_N$ = 17.2, $\mathrm{mAP}_{HM}$ = 28.1, $\mathrm{Recall}_B$ = 93.1, $\mathrm{Recall}_N$ = 67.3, and $\mathrm{Recall}_{HM}$ = 78.1. Compared with the strongest baseline YOLO-World-L, UAV-OVD improves $\mathrm{Recall}_N$ and $\mathrm{Recall}_{HM}$ by 29.1 and 24.3, respectively, while maintaining comparable mAP performance. These results demonstrate UAV-OVD’s superior ability to generalize across novel categories without sacrificing accuracy on base classes. Under the ZSD setting, UAV-OVD achieves 32.6 mAP and 81.1 Recall, outperforming YOLO-World-L by a large margin of 7.8 mAP and 43.3 Recall. This significant improvement highlights the model’s robustness in zero-shot scenarios, where test categories are unseen during training.
To further validate the effectiveness of our proposed components, we extend YOLO-World-L with our synonym-enhanced class extension and contrastive learning mechanisms. Although YOLO-World-L* shows moderate gains over its original version in $\mathrm{Recall}_B$ and $\mathrm{Recall}_{HM}$ (improving from 91.1 to 91.5 and from 53.8 to 55.6, respectively), its novel class performance remains substantially lower than that of UAV-OVD, with $\mathrm{mAP}_N$ = 3.1 and $\mathrm{Recall}_N$ = 18.6. In contrast, UAV-OVD* achieves $\mathrm{Recall}_N$ = 55.6 and $\mathrm{Recall}_{HM}$ = 69.2, confirming the effectiveness of our design in enhancing linguistic diversity and open-vocabulary detection.
To evaluate the inference efficiency of different open-vocabulary detectors in UAV scenarios, we benchmarked DescReg [49], YOLO-World-L [10], and our proposed UAV-OVD using the standardized speed testing framework provided by MMDetection [52]. All models were tested under identical conditions, with a single RTX 4090 GPU and batch size set to 1, to reflect real-time deployment scenarios. As shown in Figure 5, the results are measured in frames per second (FPS), and are summarized as follows: DescReg achieves 11 FPS, YOLO-World-L reaches 29.4 FPS, while UAV-OVD significantly outperforms both with 53.8 FPS. This indicates that UAV-OVD is nearly 5× faster than DescReg and 1.8× faster than YOLO-World-L, highlighting its advantage in real-time aerial applications where low latency and high throughput are critical.
Figure 6 compares the performance and feature attention maps of UAV-OVD and YOLO-World-L [10]. The first example illustrates the basketball court category, which belongs to the novel set. While UAV-OVD successfully detects the previously unseen object, YOLO-World-L [10] produces a false positive. The other two examples depict typical small-object scenes involving the vehicle category, where instances are densely distributed and relatively small in size. UAV-OVD is able to accurately detect all relevant targets, whereas YOLO-World-L exhibits both missed and false detections. The corresponding heatmaps show that UAV-OVD focuses more precisely on vehicle regions, indicating stronger semantic alignment and attention for both novel and base categories.
In summary, UAV-OVD consistently outperforms state-of-the-art methods on both the xView and DIOR datasets, achieving higher accuracy and Recall for both seen and unseen classes under ZSD and GZSD conditions. The results validate the scalability and robustness of our method in diverse aerial imagery scenarios.

4.5. Visualization of Detections

As illustrated in Figure 7, we compare the detection performance of the baseline model RT-DETR [18] and our proposed UAV-OVD across three representative aerial scenes. While RT-DETR [18] performs reasonably well on large and isolated objects, it struggles with complex visual conditions such as densely packed small objects, partial occlusions, or cluttered backgrounds. In contrast, UAV-OVD demonstrates markedly improved precision and localization, particularly in these challenging scenarios.
In scenes with densely distributed small targets, such as parked boats or compact vehicle clusters, RT-DETR [18] tends to produce large, diffuse activation regions—evident from the widespread red overlays—leading to ambiguous attention and missed detections. UAV-OVD, however, generates highly concentrated and localized responses. Its attention maps reveal focused activations tightly aligned with object centers, allowing for accurate separation of closely positioned instances.
In cluttered or occluded scenes, UAV-OVD shows stronger robustness in distinguishing targets from surrounding structures. The enhanced region–text alignment and multi-level fusion enable it to better isolate objects that are partially hidden or visually blended into the background.
These qualitative results highlight UAV-OVD’s superior ability to perceive and detect small, overlapping, or visually ambiguous targets in complex aerial environments, making it particularly suitable for high-density UAV applications.

5. Conclusions

In this paper, we propose UAV-OVD, a novel and efficient open-vocabulary object detector specifically designed for aerial imagery. To address the inherent challenges of small object recognition and category scalability in UAV-based scenarios, we introduce a region–text contrastive loss that replaces conventional categorical classification loss by aligning image regions with textual embeddings in a shared semantic space. Complementing this, a multi-level text-guided fusion decoder (MTFD) is designed to enhance the model’s capacity for fine-grained recognition by incorporating hierarchical class embeddings into each decoding stage, thereby improving detection of densely packed and small-scale targets. Additionally, we use a Wasserstein distance-based localization loss to further refine box regression performance. To extend the model’s semantic coverage, we implement a class synonym expansion strategy, combining GPT-4o-generated [47] candidates with manual validation to improve language generalization under zero-shot settings.
Extensive experiments conducted on two representative aerial benchmarks, xView and DIOR, under both Generalized Zero-Shot Detection (GZSD) and Zero-Shot Detection (ZSD) settings, demonstrate that UAV-OVD consistently outperforms existing state-of-the-art methods. The results highlight not only its strong generalization ability across novel categories but also its superior localization accuracy for small and overlapping objects in dense aerial scenes. These findings validate the effectiveness of UAV-OVD as a practical and robust open-vocabulary solution for real-world UAV applications.

6. Future Work

While UAV-OVD demonstrates strong open-vocabulary detection capabilities in static aerial imagery, several research directions remain open for future exploration. First, to better serve real-world UAV missions, future work could focus on adapting the detector to diverse operational scenarios, such as disaster response, infrastructure inspection, and agricultural monitoring, where object distributions and category vocabularies vary significantly. Second, integrating UAV-OVD with onboard navigation and planning systems could enable closed-loop perception-planning pipelines that support downstream decision making. Third, in dynamic and evolving environments, supporting real-time vocabulary expansion and adaptation—for instance, incorporating user-provided class prompts or contextual cues—would enhance the flexibility of open-vocabulary recognition. Additionally, combining UAV-OVD with multi-modal information sources, such as thermal imagery, LiDAR, or mission reports, could further improve object recognition under complex or low-visibility conditions. These extensions will help bridge the gap between open-vocabulary research and practical deployment in autonomous UAV systems.

Author Contributions

Conceptualization, H.Z.; Methodology, L.T. and G.W.; Software, L.T., G.W. and Z.W.; Validation, L.T. and G.W.; Formal analysis, L.T., G.W. and Z.W.; Investigation, L.T. and G.W.; Resources, G.W.; Data curation, G.W. and Z.W.; Writing—original draft, L.T., G.W. and Z.W.; Writing—review & editing, Z.Q. and Y.L.; Visualization, L.T.; Supervision, H.Z.; Project administration, H.Z.; Funding acquisition, H.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China under Grant 62401471 and in part by the 2024 Gusu Innovation and Entrepreneurship Leading Talents Program (Young Innovative Leading Talents) under Grant ZXL2024333.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Li, K.; Wan, G.; Cheng, G.; Meng, L.; Han, J. Object detection in optical remote sensing images: A survey and a new benchmark. ISPRS J. Photogramm. Remote Sens. 2020, 159, 296–307.
  2. Ding, J.; Xue, N.; Xia, G.S.; Bai, X.; Yang, W.; Yang, M.Y.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; et al. Object detection in aerial images: A large-scale benchmark and challenges. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 7778–7796.
  3. Sadgrove, E.J.; Falzon, G.; Miron, D.; Lamb, D.W. Real-time object detection in agricultural/remote environments using the multiple-expert colour feature extreme learning machine (MEC-ELM). Comput. Ind. 2018, 98, 183–191.
  4. Sun, W.; Dai, L.; Zhang, X.; Chang, P.; He, X. RSOD: Real-time small object detection algorithm in UAV-based traffic monitoring. Appl. Intell. 2022, 52, 8448–8463.
  5. Dong, J.; Ota, K.; Dong, M. UAV-based real-time survivor detection system in post-disaster search and rescue operations. IEEE J. Miniaturization Air Space Syst. 2021, 2, 209–219.
  6. Liu, H.; Yu, Y.; Liu, S.; Wang, W. A military object detection model of UAV reconnaissance image and feature visualization. Appl. Sci. 2022, 12, 12236.
  7. Gu, X.; Lin, T.Y.; Kuo, W.; Cui, Y. Open-vocabulary object detection via vision and language knowledge distillation. arXiv 2021, arXiv:2104.13921.
  8. Liu, S.; Zeng, Z.; Ren, T.; Li, F.; Zhang, H.; Yang, J.; Jiang, Q.; Li, C.; Yang, J.; Su, H.; et al. Grounding dino: Marrying dino with grounded pretraining for open-set object detection. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 38–55.
  9. Zhao, S.; Zhang, Z.; Schulter, S.; Zhao, L.; Vijay Kumar, B.; Stathopoulos, A.; Chandraker, M.; Metaxas, D.N. Exploiting unlabeled data with vision and language models for object detection. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 159–175.
  10. Cheng, T.; Song, L.; Ge, Y.; Liu, W.; Wang, X.; Shan, Y. Yolo-world: Real-time open-vocabulary object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 16901–16911.
  11. Wu, S.; Zhang, W.; Jin, S.; Liu, W.; Loy, C.C. Aligning bag of regions for open-vocabulary object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 15254–15264.
  12. Li, J.; Zhang, J.; Li, J.; Li, G.; Liu, S.; Lin, L.; Li, G. Learning background prompts to discover implicit knowledge for open vocabulary object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 16678–16687.
  13. Zhong, Y.; Yang, J.; Zhang, P.; Li, C.; Codella, N.; Li, L.H.; Zhou, L.; Dai, X.; Yuan, L.; Li, Y.; et al. Regionclip: Region-based language-image pretraining. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 16793–16803.
  14. Chen, F.; Zhang, H.; Yang, Z.; Chen, H.; Hu, K.; Savvides, M. Rtgen: Generating region-text pairs for open-vocabulary object detection. arXiv 2024, arXiv:2405.19854.
  15. Li, L.H.; Zhang, P.; Zhang, H.; Yang, J.; Li, C.; Zhong, Y.; Wang, L.; Yuan, L.; Zhang, L.; Hwang, J.N.; et al. Grounded language-image pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10965–10975.
  16. Zhang, H.; Zhang, P.; Hu, X.; Chen, Y.C.; Li, L.; Dai, X.; Wang, L.; Yuan, L.; Hwang, J.N.; Gao, J. Glipv2: Unifying localization and vision-language understanding. Adv. Neural Inf. Process. Syst. 2022, 35, 36067–36080.
  17. Yao, L.; Han, J.; Liang, X.; Xu, D.; Zhang, W.; Li, Z.; Xu, H. Detclipv2: Scalable open-vocabulary object detection pre-training via word-region alignment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 23497–23506.
  18. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs Beat YOLOs on Real-time Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 16965–16974.
  19. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 8748–8763.
  20. Cherti, M.; Beaumont, R.; Wightman, R.; Wortsman, M.; Ilharco, G.; Gordon, C.; Schuhmann, C.; Schmidt, L.; Jitsev, J. Reproducible scaling laws for contrastive language-image learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 2818–2829.
  21. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587.
  22. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149.
  23. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  24. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  25. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 7464–7475. [Google Scholar]
  26. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  27. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 213–229. [Google Scholar]
  28. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable detr: Deformable transformers for end-to-end object detection. arXiv 2020, arXiv:2010.04159. [Google Scholar]
  29. Shehata, M.; Abo-Al-Ez, R.; Zaghlool, F.; Abou-Kreisha, M.T. Vehicles detection based on background modeling. arXiv 2019, arXiv:1901.04077. [Google Scholar]
  30. Chen, W.; Baojun, Z.; Linbo, T.; Boya, Z. Small vehicles detection based on UAV. J. Eng. 2019, 2019, 7894–7897. [Google Scholar] [CrossRef]
  31. Wang, S.; Jiang, H.; Li, Z.; Yang, J.; Ma, X.; Chen, J.; Tang, X. Phsi-rtdetr: A lightweight infrared small target detection algorithm based on UAV aerial photography. Drones 2024, 8, 240. [Google Scholar] [CrossRef]
  32. Li, Y.; Hou, Q.; Zheng, Z.; Cheng, M.M.; Yang, J.; Li, X. Large selective kernel network for remote sensing object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 16794–16805. [Google Scholar]
  33. Tan, M.; Pang, R.; Le, Q.V. Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10781–10790. [Google Scholar]
  34. Ashraf, M.W.; Sultani, W.; Shah, M. Dogfight: Detecting drones from drones videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 7067–7076. [Google Scholar]
  35. Zhao, Y.; Ju, Z.; Sun, T.; Dong, F.; Li, J.; Yang, R.; Fu, Q.; Lian, C.; Shan, P. Tgc-yolov5: An enhanced yolov5 drone detection model based on transformer, gam & ca attention mechanism. Drones 2023, 7, 446. [Google Scholar] [CrossRef]
  36. Bansal, A.; Sikka, K.; Sharma, G.; Chellappa, R.; Divakaran, A. Zero-shot object detection. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 384–400. [Google Scholar]
  37. Pennington, J.; Socher, R.; Manning, C.D. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1532–1543. [Google Scholar]
  38. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186. [Google Scholar]
  39. Li, Y.; Guo, W.; Yang, X.; Liao, N.; He, D.; Zhou, J.; Yu, W. Toward open vocabulary aerial object detection with clip-activated student-teacher learning. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 431–448. [Google Scholar]
  40. Gupta, A.; Dollar, P.; Girshick, R. Lvis: A dataset for large vocabulary instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5356–5364. [Google Scholar]
  41. Zhang, H.; Li, F.; Liu, S.; Zhang, L.; Su, H.; Zhu, J.; Ni, L.M.; Shum, H.Y. Dino: Detr with improved denoising anchor boxes for end-to-end object detection. arXiv 2022, arXiv:2203.03605. [Google Scholar]
  42. Zhang, H.; Wang, Y.; Dayoub, F.; Sunderhauf, N. Varifocalnet: An iou-aware dense object detector. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8514–8523. [Google Scholar]
  43. Yu, J.; Jiang, Y.; Wang, Z.; Cao, Z.; Huang, T. UnitBox: An Advanced Object Detection Network. In Proceedings of the 24th ACM International Conference on Multimedia, New York, NY, USA, 15–19 October 2016; MM’16. pp. 516–520. [Google Scholar] [CrossRef]
  44. Frogner, C.; Zhang, C.; Mobahi, H.; Araya, M.; Poggio, T.A. Learning with a Wasserstein loss. In Advances in Neural Information Processing Systems 28 (NIPS 2015); Curran Associates, Inc.: New York, NY, USA, 2015; Volume 28. [Google Scholar]
  45. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part I 14. Springer: Berlin/Heidelberg, Germany, 2016; pp. 21–37. [Google Scholar]
  46. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A single-stage object detection framework for industrial applications. arXiv 2022, arXiv:2209.02976. [Google Scholar]
  47. OpenAI. GPT-4o Technical Report. 2024. Available online: https://openai.com/index/gpt-4o (accessed on 3 July 2025).
  48. Lam, D.; Kuzma, R.; McGee, K.; Dooley, S.; Laielli, M.; Klaric, M.; Bulatov, Y.; McCord, B. xView: Objects in context in overhead imagery. arXiv 2018, arXiv:1802.07856. [Google Scholar]
  49. Zang, Z.; Lin, C.; Tang, C.; Wang, T.; Lv, J. Zero-Shot Aerial Object Detection with Visual Description Regularization. Proc. AAAI Conf. Artif. Intell. 2024, 38, 6926–6934. [Google Scholar] [CrossRef]
  50. Xia, G.S.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; Zhang, L. DOTA: A large-scale dataset for object detection in aerial images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3974–3983. [Google Scholar]
  51. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Proceedings of the Computer Vision—ECCV 2014; Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T., Eds.; Springer: Cham, Switzerland, 2014; pp. 740–755. [Google Scholar]
  52. Chen, K.; Wang, J.; Pang, J.; Cao, Y.; Xiong, Y.; Li, X.; Sun, S.; Feng, W.; Liu, Z.; Xu, J.; et al. MMDetection: Open mmlab detection toolbox and benchmark. arXiv 2019, arXiv:1906.07155. [Google Scholar]
  53. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  54. Shao, S.; Li, Z.; Zhang, T.; Peng, C.; Yu, G.; Zhang, X.; Li, J.; Sun, J. Objects365: A large-scale, high-quality dataset for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8430–8439. [Google Scholar]
  55. Hudson, D.A.; Manning, C.D. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 6700–6709. [Google Scholar]
  56. Plummer, B.A.; Wang, L.; Cervantes, C.M.; Caicedo, J.C.; Hockenmaier, J.; Lazebnik, S. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 2641–2649. [Google Scholar]
  57. Sharma, P.; Ding, N.; Goodman, S.; Soricut, R. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, 15–20 July 2018; pp. 2556–2565. [Google Scholar]
  58. Huang, P.; Han, J.; Cheng, D.; Zhang, D. Robust Region Feature Synthesizer for Zero-Shot Object Detection. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 7612–7621. [Google Scholar] [CrossRef]
  59. Yan, C.; Chang, X.; Luo, M.; Liu, H.; Zhang, X.; Zheng, Q. Semantics-Guided Contrastive Network for Zero-Shot Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 1530–1544. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Overview of the UAV-OVD architecture. The model integrates three key components: (1) a region–text contrastive loss to align visual and semantic features and enable open-vocabulary detection; (2) a multi-level text-guided fusion decoder (MTFD) designed to improve the detection of small and dense objects in complex aerial imagery; and (3) a class extension mechanism that incorporates synonyms during training to enhance linguistic generalization.
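As a reading aid for the caption above, the following is a minimal PyTorch-style sketch of a region–text contrastive loss of this general kind: region embeddings are scored against class-name text embeddings and trained with cross-entropy over the similarity logits. The function name, the temperature value, and the use of plain cross-entropy here are illustrative assumptions, not UAV-OVD's exact formulation.

```python
import torch
import torch.nn.functional as F

def region_text_contrastive_loss(region_emb, text_emb, target_idx, temperature=0.07):
    """Align region features with class-name text embeddings.

    region_emb: (R, D) embeddings of predicted/assigned regions
    text_emb:   (C, D) embeddings of the class vocabulary (from a text encoder)
    target_idx: (R,)   index of the matched class for each region
    """
    region_emb = F.normalize(region_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # Similarity of every region to every class name acts as the logit matrix.
    logits = region_emb @ text_emb.t() / temperature  # (R, C)
    return F.cross_entropy(logits, target_idx)

# Toy usage: 5 regions, an 8-class vocabulary, 256-d embeddings.
loss = region_text_contrastive_loss(
    torch.randn(5, 256), torch.randn(8, 256), torch.randint(0, 8, (5,)))
```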
Figure 2. Detailed architecture illustrating image–text alignment and Extend Class modules in UAV-OVD.
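One simple way to realize the Extend Class idea shown in Figure 2 is to encode several synonym prompts per category and pool them into a single class embedding. The sketch below is illustrative only, assuming a CLIP-style text encoder; the synonym lists, the stand-in encoder, and the mean-pooling strategy are assumptions rather than the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

# Hypothetical synonym-augmented vocabulary; the real lists come from the dataset/paper.
CLASS_SYNONYMS = {
    "vehicle": ["vehicle", "car", "automobile"],
    "basketball court": ["basketball court", "ball court"],
}

def build_class_embeddings(text_encoder, class_synonyms):
    """Encode every synonym prompt and average them into one embedding per class."""
    class_embs = []
    for name, synonyms in class_synonyms.items():
        embs = text_encoder(synonyms)                          # (S, D), one row per synonym
        class_embs.append(F.normalize(embs, dim=-1).mean(dim=0))
    return F.normalize(torch.stack(class_embs), dim=-1)       # (C, D)

# Stand-in encoder for the sketch; in practice this would be a pretrained text encoder.
dummy_encoder = lambda prompts: torch.randn(len(prompts), 256)
text_emb = build_class_embeddings(dummy_encoder, CLASS_SYNONYMS)
```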
Figure 3. Detailed architecture illustrating multi-level text-guided fusion decoder modules in UAV-OVD.
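The multi-level text-guided fusion in Figure 3 can be pictured as cross-attention from each feature-pyramid level to the class-text embeddings. The sketch below shows that general pattern; the layer sizes, the residual update, and the fusion order are assumptions for illustration and not the exact MTFD design.

```python
import torch
import torch.nn as nn

class TextGuidedLevelFusion(nn.Module):
    """Cross-attend each pyramid level's tokens to the class-text embeddings."""
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feats, text_emb):
        # feats: list of (B, N_l, D) flattened feature maps, one per level
        # text_emb: (B, C, D) class-name embeddings
        fused = []
        for x in feats:
            attn_out, _ = self.attn(query=x, key=text_emb, value=text_emb)
            fused.append(self.norm(x + attn_out))  # residual, text-guided update
        return fused

# Toy usage: three pyramid levels of decreasing resolution, a 10-class vocabulary.
levels = [torch.randn(2, n, 256) for n in (4096, 1024, 256)]
fused = TextGuidedLevelFusion()(levels, torch.randn(2, 10, 256))
```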
Figure 4. Comparison of detection results from different models. Red boxes denote ground truth, while blue boxes indicate predicted results. For the ground truth, the right column shows the original aerial image and the left column shows an enlarged region for better visualization. The proposed method produces the fewest missed detections when targets are dense (row 1), the most accurate detection boxes for salient targets (row 2), and the most robust performance on typical small aerial targets (row 3).
Figure 5. Inference speed (FPS) comparison of DescReg [49], YOLO-World-L [10], and UAV-OVD.
Figure 6. Performance and feature map comparison between UAV-OVD and YOLO-World-L [10]. (1) Detection result for the basketball court category (novel class); (2,3) detection results for the vehicle category (base class).
Figure 7. Qualitative comparison of detection results produced by different algorithms.
Table 1. Ablation study on the xView dataset under Generalized Zero-Shot Detection (GZSD) and Zero-Shot Detection (ZSD) settings. We ablate the contributions of the MTFD, I-T Align, and Extend Class. The checkmark indicates the component is used. Bold indicates the best results.
MTFD | I-T Align | Extend Class | GZSD mAP_B | GZSD mAP_N | GZSD mAP_HM | GZSD Recall_B | GZSD Recall_N | GZSD Recall_HM | ZSD mAP | ZSD Recall
 | | | 24.6 | 3.4 | 6.0 | 58.2 | 28.7 | 38.4 | 7.3 | 57.4
 | | | 28.6 | 3.6 | 6.2 | 59.1 | 23.2 | 33.2 | 9.6 | 62.3
 | | | 26.6 | 3.7 | 6.4 | 58.0 | 27.0 | 36.7 | 10.3 | 64.3
 | | | 29.7 | 8.0 | 12.5 | 61.3 | 32.0 | 42.1 | 9.9 | 67.3
✓ indicates the module is enabled.
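The HM columns in Tables 1–3 read as the harmonic mean of the corresponding base (B) and novel (N) scores, the usual aggregate metric in GZSD evaluation; the reported values are consistent with

mAP_HM = 2 · mAP_B · mAP_N / (mAP_B + mAP_N),

for example 2 × 24.6 × 3.4 / (24.6 + 3.4) ≈ 6.0 in the first row above, and analogously for Recall_HM (small deviations can arise from rounding of the tabulated values).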
Table 2. Performance comparison on the xView dataset under the Generalized Zero-Shot Detection (GZSD) and Zero-Shot Detection (ZSD) settings. Metrics include mean Average Precision (mAP) and Recall for both base and novel categories. Bold indicates the best results.
Method | Source | GZSD mAP_B | GZSD mAP_N | GZSD mAP_HM | GZSD Recall_B | GZSD Recall_N | GZSD Recall_HM | ZSD mAP | ZSD Recall
RRFS [58] | CVPR22 | 10.2 | 1.6 | 2.7 | 19.1 | 5.8 | 8.9 | 2.2 | 14.3
ContrastZSD [59] | TPAMI22 | 16.8 | 2.9 | 5.0 | 27.6 | 13.9 | 18.5 | 4.1 | 27.1
DescReg [49] | AAAI24 | 17.1 | 5.8 | 8.7 | 28.0 | 12.8 | 17.6 | 8.3 | 43.0
YOLO-World-M [10] | CVPR24 | 17.3 | 3.0 | 5.1 | 42.9 | 15.4 | 22.7 | 6.8 | 38.1
YOLO-World-L [10] | CVPR24 | 18.5 | 3.3 | 5.6 | 41.4 | 15.2 | 22.2 | 7.9 | 37.1
UAV-OVD (Ours) | – | 19.4 | 4.6 | 7.4 | 57.3 | 32.1 | 41.2 | 9.1 | 63.2
YOLO-World-L [10] * | CVPR24 | 21.7 | 3.2 | 5.6 | 46.6 | 16.5 | 24.4 | 8.8 | 41.7
UAV-OVD (Ours) * | – | 29.7 | 8.0 | 12.5 | 61.3 | 32.0 | 42.1 | 9.9 | 67.3
* Trained with the new split method.
Table 3. Performance comparison on the DIOR dataset under the Generalized Zero-Shot Detection (GZSD) and Zero-Shot Detection (ZSD) settings. Metrics include mean Average Precision (mAP) and Recall for both base and novel categories. Bold indicates the best results.
Method | Source | GZSD mAP_B | GZSD mAP_N | GZSD mAP_HM | GZSD Recall_B | GZSD Recall_N | GZSD Recall_HM | ZSD mAP | ZSD Recall
RRFS [58] | CVPR22 | 41.9 | 2.8 | 5.2 | 60.0 | 19.9 | 29.9 | 9.7 | 19.8
ContrastZSD [59] | TPAMI22 | 51.4 | 3.9 | 7.2 | 69.2 | 25.9 | 37.7 | 8.7 | 22.3
DescReg [49] | AAAI24 | 68.7 | 7.9 | 14.2 | 82.0 | 34.3 | 48.4 | 15.2 | 34.6
YOLO-World-M [10] | CVPR24 | 78.4 | 12.0 | 20.8 | 90.6 | 37.2 | 52.7 | 23.4 | 38.9
YOLO-World-L [10] | CVPR24 | 80.2 | 17.3 | 28.5 | 91.1 | 38.2 | 53.8 | 24.8 | 37.8
UAV-OVD (Ours) | – | 77.9 | 17.2 | 28.1 | 93.1 | 67.3 | 78.1 | 32.6 | 81.1
YOLO-World-L [10] * | CVPR24 | 78.5 | 3.1 | 5.9 | 91.5 | 18.6 | 30.9 | 6.7 | 20.6
UAV-OVD (Ours) * | – | 76.6 | 2.8 | 5.3 | 91.8 | 55.6 | 69.2 | 14.3 | 72.6
* Trained with the new split method.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

