Review

Open-Vocabulary Object Detection in UAV Imagery: A Review and Future Perspectives

School of Cybersecurity, Northwestern Polytechnical University, Xi’an 710060, China
*
Author to whom correspondence should be addressed.
Drones 2025, 9(8), 557; https://doi.org/10.3390/drones9080557
Submission received: 3 July 2025 / Revised: 30 July 2025 / Accepted: 31 July 2025 / Published: 8 August 2025

Abstract

Due to its extensive applications, aerial image object detection has long been a hot topic in computer vision. In recent years, advancements in unmanned aerial vehicle (UAV) technology have further propelled this field to new heights, giving rise to a broader range of application requirements. However, traditional UAV aerial object detection methods primarily focus on detecting predefined categories, which significantly limits their applicability. The advent of cross-modal text–image alignment (e.g., CLIP) has overcome this limitation, enabling open-vocabulary object detection (OVOD), which can identify previously unseen objects through natural language descriptions. This breakthrough significantly enhances the intelligence and autonomy of UAVs in aerial scene understanding. This paper presents a comprehensive survey of OVOD in the context of UAV aerial scenes. We begin by aligning the core principles of OVOD with the unique characteristics of UAV vision, setting the stage for a specialized discussion. Building on this foundation, we construct a systematic taxonomy that categorizes existing OVOD methods for aerial imagery and provides a comprehensive overview of the relevant datasets. This structured review enables us to critically dissect the key challenges and open problems at the intersection of these fields. Finally, based on this analysis, we outline promising future research directions and application prospects. This survey aims to provide a clear road map and a valuable reference for both newcomers and seasoned researchers, fostering innovation in this rapidly evolving domain. We keep track of related works in a public GitHub repository.

1. Introduction

In recent years, unmanned aerial vehicle (UAV) technology has developed rapidly, and its application scenarios have expanded accordingly. UAVs, commonly known as drones, have evolved from early entertainment devices into indispensable tools [1,2]. Their application scenarios have extended from precision agriculture [3] and infrastructure inspection to public safety [4,5] and disaster response [6,7], where UAVs are being developed for critical tasks such as delivery and object retrieval in dangerous environments [8]. Particularly in remote or disaster-stricken areas with limited or no network connectivity, the need for autonomous, onboard perception is paramount, opening up critical use cases in search-and-rescue operations or low-cost agricultural support.
The significant impact of UAVs stems primarily from their unique aerial perspective, which enables the collection of extensive visual data and provides comprehensive environmental awareness. To fully utilize aerial imagery, UAVs must possess autonomous perception and environmental interpretation capabilities [9,10]. As a fundamental computer vision task, object detection serves as the core technology enabling this intelligent perception. The development of deep learning has led to major breakthroughs in this field. Advanced object detection algorithms based on architectures like the YOLO series [11,12] and Faster R-CNN [13] have achieved remarkable results. When applied to UAV-captured scenes, these models can efficiently and accurately identify and locate target objects, supporting various intelligent applications ranging from automated wildlife monitoring [14] to real-time traffic analysis [15].
However, the efficacy of these traditional methods is fundamentally constrained by their “closed-set” design [16,17]. They are trained on large-scale, meticulously annotated datasets and can only recognize a predefined, fixed set of object categories. This inherent limitation creates a significant bottleneck for UAVs operating in the real world, which is often dynamic, unstructured, and unpredictable. For instance, a UAV tasked with post-earthquake assessment cannot be pretrained to recognize every possible type of debris or sign of human activity. The prohibitive cost of annotating exhaustive datasets for every potential scenario, coupled with the inability to adapt to novel objects, severely curtails the autonomy and practical utility of UAV systems.
Before delving into the specifics of open-vocabulary object detection (OVOD), we first discuss its conceptual predecessor, zero-shot detection (ZSD) [18]. Historically, ZSD was the initial paradigm to tackle the detection of unseen object categories. In a typical setting, a model is trained on a set of base classes and evaluated on its ability to detect a disjoint, predefined set of novel classes. This is often achieved by learning a mapping from visual features to a shared semantic space, which allows the model to infer classifiers for the novel classes based on their semantic descriptions. However, a key limitation of ZSD is that the vocabulary of novel classes is fixed and known during testing. This constrains its applicability in truly dynamic, open-world scenarios, where objects of interest cannot be enumerated beforehand. OVOD emerges as a more powerful and general paradigm designed to overcome this very limitation [19,20,21]. As illustrated in Figure 1, OVOD reframes the task from classifying against a fixed set of categories to a flexible region–text alignment problem. Empowered by large-scale vision–language models (VLMs) like CLIP [22], OVOD systems can detect objects described by arbitrary, open-ended textual queries, thus eliminating the need for a predefined novel category list.
The combination of OVOD and UAV technology is especially powerful, offering new possibilities for smarter and more flexible operations. The ability to find objects using simple text commands is particularly useful when we do not know exactly what we are looking for in advance. Here are some practical examples:
  • Emergency rescue: After disasters, rescue teams can tell UAVs to look for important things like “people needing help”, “temporary shelters”, “red cross signs”, or “damaged buildings”. With OVOD, the UAV can understand these new instructions immediately without extra training.
  • Wildlife protection: Conservationists can use UAVs to watch large natural areas for problems like hunting or logging. They can give commands like “find hunter camps”, “look for hurt elephants”, or “spot illegal cutting tools”, even if these things are not in the original training data.
  • Smart cities: City officials can use UAVs with OVOD to check urban areas for problems. The system can understand commands like “find cars parked illegally near fire hydrants”, “look for big road holes”, or “find fallen branches blocking sidewalks”.
In simple terms, OVOD turns regular UAVs into smart helpers that can understand what people need and find things accordingly. This makes them much more useful in real-world situations, where needs can change quickly. While combining OVOD and UAV technology shows great promise, this research area is still new but growing fast. A systematic review is crucial to summarize existing methods, identify key challenges, and guide future research. This paper aims to provide the first comprehensive survey on this topic. The main contributions of our work are as follows:
  • To the best of our knowledge, this is the first survey to provide a systematic taxonomy of OVOD methods specifically for the UAV aerial imagery domain. We categorize existing studies based on their fundamental learning paradigms, which allows us to critically analyze the inherent trade-offs of each approach in the face of UAV-specific challenges.
  • We go beyond separate discussions of OVOD and UAV vision by being the first to systematically align and confront the core principles of OVOD with the unique constraints of aerial scenes. This synthesis allows us to identify and clearly articulate a set of open research problems that exist specifically at the intersection of these two fields.
  • Based on our in-depth analysis of the current landscape and open challenges, we chart a clear and structured roadmap for the future. We outline six pivotal research directions, from lightweight models and multi-modal fusion to ethical considerations, and discuss their interdependencies, providing a valuable guide for researchers and practitioners.
The remainder of this survey is organized as follows: Section 2 discusses the foundational background and related work in both traditional and open-vocabulary detection. Section 3 classifies the existing methods into two major types and presents a concise introduction and comparison. Section 4 reviews the datasets and evaluation metrics pertinent to the field. Section 5 provides a deep dive into the prevalent challenges and open issues. Finally, Section 6 concludes the survey by offering a perspective on future research directions.

2. Background

In this section, we introduce the basic concepts related to this survey. First, we look at standard object detection methods used with UAV images and discuss their main limitations. Next, we explain the fundamentals of OVOD, including its core ideas and key supporting technologies. Finally, we explore the special challenges of UAV-captured images that require customized OVOD approaches.

2.1. Traditional Object Detection

Object detection in aerial images captured by UAVs has been a hot topic for over a decade. Early approaches relied on hand-crafted features, but the field was revolutionized by the advent of deep convolutional neural networks (DCNNs). Modern UAV object detection is dominated by deep learning-based methods, which can be broadly categorized into two families [23]: two-stage detectors and one-stage detectors.
Two-stage detectors approach the detection task using a propose-then-refine pipeline. This architecture first generates a sparse set of class-agnostic candidate object region proposals, or regions of interest (RoIs), from the input image. In the second stage, it extracts features for each RoI and performs fine-grained classification and bounding box regression to produce the final detections. This methodology, while often computationally intensive, is renowned for its high accuracy and precise object localization. The evolution of this family is marked by several seminal works:
  • R-CNN [24] was the progenitor of this approach, pioneering the use of a deep convolutional network for feature extraction on region proposals generated by an external algorithm like Selective Search. While groundbreaking, its multi-stage training and slow inference limited its practical application.
  • Fast R-CNN [25] significantly improved speed and efficiency by sharing the convolutional feature computation across all proposals on an image. It introduced the RoIPool layer to extract fixed-size feature maps from variable-sized RoIs, enabling an end-to-end training process for the classification stage.
  • Faster R-CNN [26] represented a major leap by integrating the proposal generation step directly into the network. It introduced the Region Proposal Network (RPN), a fully convolutional network that shares features with the detection network, allowing for nearly cost-free, data-driven region proposals and enabling a unified, end-to-end trainable detection framework. Due to its robustness and high accuracy, Faster R-CNN became a dominant baseline for numerous aerial object detection tasks.
  • Mask R-CNN [27] extended Faster R-CNN by adding a parallel branch for predicting a pixel-level object mask in addition to the bounding box. It also introduced the RoIAlign layer, which replaced RoIPool to more precisely align the extracted features with the input, leading to significant gains in both detection and instance segmentation accuracy.
In contrast to the two-stage paradigm, one-stage detectors eschew the explicit region proposal step and instead treat object detection as a direct regression and classification problem. These models predict bounding boxes and class probabilities simultaneously in a single pass over the image, typically by dividing the image into a grid and having each cell predict potential objects. This unified architecture grants them a significant advantage in inference speed, making them highly suitable for real-time applications, a critical requirement for onboard processing on UAVs. Key innovations within the one-stage family have progressively addressed their initial accuracy limitations, particularly for the challenging small objects common in aerial scenes.
  • YOLO [11,28,29] was the first model to frame object detection as a single regression problem, directly predicting bounding box coordinates and class probabilities from full images in one evaluation. Its architecture enabled unprecedented real-time performance, fundamentally shifting the research landscape.
  • SSD [30] introduced the concept of using multiple feature maps from different layers of a network to detect objects at various scales. By making predictions from both deep (low-resolution) and shallow (high-resolution) feature maps, SSD significantly improved the detection of small objects, a pervasive challenge in UAV imagery.
  • RetinaNet [31] identified the extreme imbalance between the foreground and background classes during training as a primary obstacle to the accuracy of one-stage detectors. It introduced Focal Loss [32], a novel loss function that dynamically adjusts the cross-entropy loss to down-weight the contribution of easy, well-classified examples, thereby focusing the training on a sparse set of hard-to-detect objects.
To address the unique challenges of aerial scenes, researchers have introduced numerous modifications to these baseline architectures. The prevalence of small objects, for instance, pushed the development of enhanced multi-scale feature fusion networks beyond the standard FPN. Representative works like [33,34,35,36] proposed novel pathways to better preserve high-resolution spatial details in deep feature maps. Other approaches focused on leveraging contextual information [37] to reconstruct fine-grained details of small targets before detection. Furthermore, the top-down perspective of UAVs results in objects with arbitrary orientations, rendering standard horizontal bounding boxes (HBBs) inefficient. This led to a significant research branch on oriented bounding box (OBB) detection. Frameworks were adapted to predict rotated rectangles, such as the work by [38], which introduced a rotation-aware region proposal mechanism.
Despite these advancements, a fundamental and pervasive limitation underpins nearly all traditional object detectors: the closed-set assumption [39,40]. These models [41,42] are designed to operate within a “closed world”, where the set of object categories is predefined and fixed at training time. The final classification layer of the network has a fixed number of output neurons, each corresponding to a specific class seen during training (e.g., ‘car’, ‘person’, ‘building’). Consequently, these systems are inherently incapable of recognizing objects belonging to novel unseen categories. If a new object class (e.g., “a dropped parcel” or “a rare species of animal”) needs to be detected, the entire pipeline, from data collection and annotation to model retraining and deployment, must be repeated. This process is not only resource-intensive and time-consuming, but also renders the system brittle and nonadaptive in dynamic, real-world scenarios, where unforeseen objects are the norm, not the exception. This closed-set constraint is the primary bottleneck that motivates the exploration of open-vocabulary detection for more flexible and scalable aerial intelligence.

2.2. Fundamentals of Open-Vocabulary Object Detection

OVOD emerges as a revolutionary paradigm designed to dismantle the barriers of the closed-set assumption. It empowers a detector to identify and localize objects described by arbitrary, open-ended textual queries, including categories never seen during the model’s training phase [20,21,43,44]. The core idea of OVOD is to reframe object detection from a task of classification over a fixed set of categories to a task of region–text alignment. Instead of learning to map visual features to a specific class index, an OVOD model learns to map visual features of an object to the semantic representation of its corresponding textual description in a shared embedding space. Figure 1 illustrates the differences among traditional object detection, few-shot detection, and open-vocabulary object detection, mainly from the perspective of the training and testing sets. Among them, few-shot methods require a small number of annotated samples for training when novel classes are encountered, whereas open-vocabulary object detection requires none at all.
The advanced capabilities of open-vocabulary object detection (OVOD) systems stem from cross-modal alignment technologies, particularly vision–language models (VLMs) such as CLIP [22]. CLIP employs an elegant yet highly effective training approach. The model is trained on an extensive dataset containing 400 million image–text pairs collected from the internet. Architecturally, CLIP consists of two core components: a visual encoder that processes image inputs and a text encoder that processes the corresponding natural language descriptions. During training, the model is presented with a batch of these image–text pairs; the visual encoder generates an embedding for each image, while the text encoder does the same for each corresponding caption. The core learning mechanism is driven by a contrastive objective. For positive pairs, i.e., each correctly matched image and its corresponding caption in the batch, the model is trained to make the vector embeddings as similar as possible, teaching it to associate an image with its authentic description. For negative pairs, i.e., a given image paired with any other caption from the same batch, the model is trained to make the embeddings as dissimilar as possible, forcing it to learn fine-grained distinctions rather than just general concepts. Through this process, repeated on a colossal scale, CLIP learns a rich shared embedding space in which visual concepts and semantic meanings are aligned. An image of a dog and the text “a photo of a dog” will be positioned very close to each other in this space, while the text “a photo of a cat” will be pushed far away. This powerful and generalized alignment is what enables the model to understand and detect novel object categories described purely by text.
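To make the contrastive objective above concrete, the following minimal PyTorch sketch computes the symmetric image–text loss over a batch of paired embeddings. The tensors stand in for encoder outputs, and the temperature value is an illustrative choice rather than CLIP's exact configuration.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    image_emb, text_emb: (B, D) tensors produced by the two encoders for
    B matched image-caption pairs (pair i is the positive for row i).
    """
    # L2-normalize so the dot product equals cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (B, B) similarity matrix: diagonal = positive pairs, off-diagonal = negatives.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Pull matched pairs together and push mismatched pairs apart, in both directions.
    loss_i2t = F.cross_entropy(logits, targets)        # image -> matching caption
    loss_t2i = F.cross_entropy(logits.t(), targets)    # caption -> matching image
    return (loss_i2t + loss_t2i) / 2

# Example with random embeddings standing in for encoder outputs.
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```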
For OVOD, CLIP provides the “engine” for the alignment step, as shown in Figure 2. OVOD architectures leverage CLIP’s pretrained encoders or fine-tune them to ensure that the visual features from image regions and the semantic features from arbitrary text prompts are comparable within this meaningful shared space [45,46]. This pretrained knowledge of the visual world, linked to natural language, is precisely what allows an OVOD detector to recognize a “solar panel” or a “swimming pool” without ever having been explicitly trained on annotated bounding boxes for those classes.
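As a concrete illustration of this alignment step, the sketch below scores class-agnostic region embeddings against text-prompt embeddings in the shared space; the prompt wording, temperature, and score threshold are illustrative assumptions, not a prescribed OVOD configuration.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def classify_regions(region_feats, text_feats, class_names, score_thresh=0.3):
    """Assign open-vocabulary labels to region features via cosine similarity.

    region_feats: (R, D) embeddings of candidate boxes from the visual encoder.
    text_feats:   (C, D) embeddings of prompts such as "an aerial photo of a {class}".
    """
    region_feats = F.normalize(region_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)

    sims = region_feats @ text_feats.t()          # (R, C) region-text similarity
    probs = (sims * 100.0).softmax(dim=-1)        # temperature-scaled, as in CLIP
    scores, idx = probs.max(dim=-1)

    results = []
    for r, (s, c) in enumerate(zip(scores.tolist(), idx.tolist())):
        if s >= score_thresh:                     # keep confident matches only
            results.append((r, class_names[c], s))
    return results

# Example: 5 region features scored against 3 prompt embeddings.
dets = classify_regions(torch.randn(5, 512), torch.randn(3, 512),
                        ["car", "swimming pool", "solar panel"])
```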

2.3. Unique Characteristics of UAV Imagery

While OVOD presents a powerful general framework, its direct and naive application to UAV imagery is fraught with challenges. The unique properties of aerial scenes can severely degrade the performance of models pretrained on ground-level, generic web data. Understanding these characteristics is the first step toward developing robust UAV-centric OVOD solutions.
  • Small objects: This is perhaps the most frequently cited challenge in aerial imaging. Due to high flight altitudes and wide-angle lenses, objects of interest (e.g., people, vehicles) often occupy a minuscule portion of the image, sometimes spanning only a few dozen or even a handful of pixels. For traditional detectors, this leads to the loss of distinguishing features after successive down-sampling operations in deep neural networks. For OVOD, the challenge is even more nuanced: the low-resolution visual features extracted from such small objects may be too coarse and ambiguous to align reliably with a rich textual description in the VLM’s embedding space. The visual signature of a 10×10-pixel “person” is weak and could be easily confused with other small, vertical structures, making a confident visual–linguistic match difficult.
  • High density and cluttered backgrounds: UAV platforms are often used to monitor crowded scenes like parking lots, public squares, or disaster sites. This results in images containing a high density of objects, often with significant occlusion. Differentiating between individual instances in a dense crowd or a packed car park is a severe test for any detector’s localization capabilities. Furthermore, aerial scenes are rife with complex and cluttered backgrounds—rooftop paraphernalia, varied terrain, foliage, and urban infrastructure. For an OVOD system, this clutter introduces a high number of “distractors”. The visual features of background elements (e.g., a complex pattern of shadows, an air conditioning unit) might accidentally have a high similarity score with one of the text prompts, leading to false positives.
  • Varying viewpoints: The vast majority of images used to train VLMs like CLIP are taken from ground level, depicting objects from a frontal or near-frontal perspective. However, UAVs primarily capture imagery from nadir (top-down) or oblique viewpoints. This creates a significant domain gap. The visual appearance of a “car” or a “person” from directly above is drastically different from its appearance in a typical photograph. An OVOD model relying on CLIP’s pretrained knowledge may struggle because its internal concept of “car” is strongly tied to side views, not roof views. This viewpoint disparity can weaken the visual–linguistic alignment that is the cornerstone of OVOD’s success.
  • Varying scales: The operational nature of UAVs, which can dynamically change their flight altitude, introduces extreme variations in object scale—often within the same mission or video sequence. An object that appears large when the UAV is flying low can become a tiny speck as it ascends. This multi-scale challenge requires a detector to be robustly invariant to scale. For OVOD, this means the visual encoder must produce consistent embeddings for the same object category across a wide range of resolutions, a non-trivial requirement that pushes the limits of standard VLM backbones.
  • Challenging imaging conditions: UAV operations are not confined to perfect, sunny days. The quality of captured imagery can be significantly degraded by a variety of factors. Illumination changes (e.g., harsh sunlight creating deep shadows, or low-light conditions at dawn/dusk) can obscure object details. Adverse weather like rain, fog, or haze can reduce contrast and introduce artifacts. Finally, motion blur, caused by the UAV’s movement or a low shutter speed, can smear features. Each of these factors introduces noise and corrupts the input to the visual encoder, weakening the feature representations and making the subsequent alignment with clean text embeddings less reliable and more prone to error.
In summary, this section has established the limitations of traditional detection methods, introduced the promising paradigm of OVOD, and critically outlined the domain-specific challenges of UAV imagery. The confluence of these factors defines the core research problem: How can we adapt and advance open-vocabulary object detection to perform robustly and accurately despite the severe challenges posed by small objects, high density, viewpoint and scale variations, and difficult imaging conditions in aerial scenes? The following sections will survey the emerging body of work that seeks to answer this question.

3. Methodology: OVOD for UAV Aerial Images

The rapid development of OVOD has led to many different methods. While early methods for ground-view images could be easily grouped, drone images create special challenges that need custom solutions. To help organize these methods, this section categorizes the OVOD approaches designed for aerial images. According to how they learn to recognize new objects, we divide the current methods into two major types. (1) Pseudo-labeling methods: These help when there are not many labeled drone images. They use a teacher–student system to create labels for unlabeled images, which helps the model learn better. (2) CLIP-driven integration methods: This is the most common approach. It adds CLIP’s ability to recognize objects without training (zero-shot) into different object detectors. The related methods are shown in Figure 3.

3.1. Pseudo-Labeling-Based Methods

This class of methods addresses the critical bottleneck of limited annotated data in the aerial domain. As illustrated in Figure 4a, the core principle is to use a powerful, pretrained model to generate “pseudo-labels” for vast quantities of unlabeled images. These pseudo-labeled data are then used to train or fine-tune a primary object detector, effectively expanding its knowledge base beyond the initial, small set of labeled examples. This paradigm is particularly relevant for UAV applications, where acquiring unlabeled video footage is often trivial but expert annotation is prohibitively expensive.
A pioneering example in this category is CastDet [47], which introduces a CLIP-enhanced teacher–student framework specifically designed for aerial OVOD. The system consists of three core components: a student model (the primary detector), a localization teacher (a stabilized version of the student that generates reliable object proposals), and an external teacher (the fixed domain-specific vision–language model RemoteCLIP [54]). The framework operates through an iterative self-training process: the localization teacher first identifies potential object regions in unlabeled images, which are then labeled by the external teacher. These temporary annotations train the student model to recognize novel classes, and the student’s improved knowledge is subsequently transferred back to the localization teacher, creating a continuous self-improvement cycle termed the “flywheel effect”. To maintain training stability, CastDet employs a dynamic label queue and a proposal selection strategy based on regression stability analysis, ensuring only high-quality candidate regions are used for learning.
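The following schematic sketch captures the general CLIP-assisted teacher–student self-training loop described above (class-agnostic proposals from a stabilized teacher, zero-shot labels from an external vision–language teacher, and an EMA update back to the teacher). The module interfaces (`propose`, `label`, `training_step`) are hypothetical placeholders; this is not CastDet's actual implementation.

```python
import copy
import torch

def pseudo_label_self_training(student, clip_teacher, unlabeled_loader,
                               optimizer, num_iters=1000, ema_decay=0.999,
                               score_thresh=0.6):
    """Generic teacher-student loop: an EMA 'localization teacher' proposes boxes,
    a frozen vision-language teacher names them, and the student trains on the result.
    `propose`, `label`, and `training_step` are hypothetical interfaces."""
    loc_teacher = copy.deepcopy(student).eval()   # stabilized copy of the student

    for _, images in zip(range(num_iters), unlabeled_loader):
        with torch.no_grad():
            proposals = loc_teacher.propose(images)                 # class-agnostic boxes
            labels, scores = clip_teacher.label(images, proposals)  # zero-shot class names
            keep = scores > score_thresh                            # filter noisy pseudo-labels

        # Train the student on the retained pseudo-annotations.
        loss = student.training_step(images, proposals[keep], labels[keep])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Transfer improved knowledge back to the localization teacher (EMA update).
        with torch.no_grad():
            for p_t, p_s in zip(loc_teacher.parameters(), student.parameters()):
                p_t.mul_(ema_decay).add_(p_s, alpha=1.0 - ema_decay)
```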
Another highly relevant methodology, presented in [48], pushes this concept further towards open-set discovery. Instead of just detecting predefined novel classes, this work proposes a pipeline to discover and assign semantic labels to entirely unknown objects. It first employs a standard closed-set detector to identify both known objects and background regions, which are treated as potential unknown objects. These unknown regions are then fed into a powerful multi-modal large language model (MLLM) with a descriptive prompt, such as “What are the structured objects in this image?”. The MLLM generates rich and natural language descriptions, from which novel class nouns (e.g., “sand volleyball court”, “industrial building”) are extracted using NLP techniques [55,56]. Finally, a domain-adapted vision–language model like RemoteCLIP [54] is used to refine and validate these discovered labels by computing the image–text similarity.
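A compact sketch of this discovery pipeline is given below; `closed_set_detector`, `mllm`, `remoteclip`, and `noun_extractor` are hypothetical callables standing in for the corresponding models and NLP tools, and the similarity threshold is illustrative.

```python
def discover_novel_labels(image, closed_set_detector, mllm, remoteclip,
                          noun_extractor, sim_thresh=0.25):
    """Sketch of the open-set discovery pipeline described above.
    All four callables are hypothetical placeholders for the actual models/tools."""
    known_dets, unknown_regions = closed_set_detector(image)

    discovered = []
    for region in unknown_regions:
        # Ask the MLLM for a free-form description of the unknown region.
        caption = mllm(region, prompt="What are the structured objects in this image?")
        # Extract candidate class nouns (e.g., "sand volleyball court") from the caption.
        for noun in noun_extractor(caption):
            # Validate each candidate with a domain-adapted VLM via image-text similarity.
            if remoteclip(region, noun) > sim_thresh:
                discovered.append((region, noun))
    return known_dets, discovered
```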
The primary strength of these methods is their exceptional data efficiency. They significantly reduce the reliance on manual annotation and expand the detector’s vocabulary. However, their performance is heavily dependent on the quality of the generated pseudo-labels. This can lead to error propagation, where incorrect pseudo-labels from the teacher model can mislead the student, degrading performance. Furthermore, multi-stage pipelines, as seen in the MLLM-based discovery approach, often suffer from high latency, making them more suitable for offline data annotation and analysis rather than real-time UAV deployment.

3.2. CLIP-Driven Integration Methods

This paradigm represents the mainstream approach for building high-performance OVOD detectors. As shown in Figure 4b, the core idea is to tightly couple the vision and language modalities within a single trainable network. Instead of treating classification as a separate step, these models reformulate object detection as a vision–language grounding task, where the goal is to localize image regions that correspond to given text prompts. This deep fusion allows the model to learn a rich and shared embedding space, leading to superior detection accuracy. Within this category, we observe a divergence in design philosophy, primarily centered on balancing performance, efficiency, and universality.
LAE-DINO [50] embodies the data-driven, performance-first approach. Its authors argue that the primary obstacle for aerial OVOD is the “data domain gap”. Their main contribution is the construction of LAE-1M, the first million-instance, large-scale OVOD dataset for remote sensing, created using a semi-automatic LAE-Label Engine. Built upon this massive dataset, they propose LAE-DINO, an enhanced version of the powerful Grounding DINO detector. To handle the unprecedented vocabulary size of LAE-1M, they introduce dynamic vocabulary construction (DVC), which samples a smaller and relevant vocabulary for each training batch. To better capture the context of aerial scenes, they designed Visual-Guided Text Prompt Learning (VisGT), a module that enriches text features with scene-level visual information. LAE-DINO establishes a new state of the art in raw performance, demonstrating the immense power of training a foundation model on large-scale, in-domain data.
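The snippet below illustrates the general idea behind per-batch vocabulary sampling (keep the classes annotated in the batch and pad with random negatives from the full vocabulary); it is a simplified stand-in for, not a reproduction of, LAE-DINO's DVC module.

```python
import random

def sample_batch_vocabulary(positive_classes, full_vocabulary, vocab_size=80, seed=None):
    """Build a per-batch vocabulary: all classes present in the batch plus
    randomly sampled negatives from the full (very large) vocabulary.
    Illustrative of dynamic vocabulary sampling in general."""
    rng = random.Random(seed)
    vocab = set(positive_classes)                        # classes annotated in this batch
    negatives = [c for c in full_vocabulary if c not in vocab]
    num_neg = max(0, vocab_size - len(vocab))
    vocab.update(rng.sample(negatives, min(num_neg, len(negatives))))
    return sorted(vocab)

# Example: a batch containing two annotated classes drawn from a ~1600-term vocabulary.
# batch_vocab = sample_batch_vocabulary(["airplane", "storage tank"], full_vocabulary)
```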
In stark contrast, OVA-Det [49] prioritizes efficiency and real-time performance, critical requirements for onboard UAV processing. Built on the lightweight RT-DETR [57] architecture, OVA-Det introduces several clever, low-overhead modules to achieve image–text collaboration. The core innovations are the Text-Guided Feature Enhancement (TG-FE) and Text-Guided Query Enhancement (TG-QE) modules. These modules use class embeddings as “clues” to guide the encoder and decoder, respectively. They employ cross-attention and a sigmoid gating mechanism to enhance class-relevant features while suppressing background interference, a common issue in cluttered aerial views. This design allows OVA-Det to achieve remarkable zero-shot performance (especially in recall) while operating at an impressive 36 FPS, making it one of the first truly real-time aerial OVOD methods.
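The module below sketches the general text-guided enhancement idea, cross-attention from visual tokens to class embeddings followed by a sigmoid gate on the attended features. The dimensions and layer arrangement are illustrative assumptions, not OVA-Det's actual TG-FE/TG-QE design.

```python
import torch
import torch.nn as nn

class TextGuidedEnhancement(nn.Module):
    """Enhance visual tokens with class-embedding 'clues' via cross-attention,
    then gate the result so class-relevant features are amplified and
    background responses suppressed. Illustrative sketch only."""
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual_tokens, text_embeds):
        # visual_tokens: (B, N, D) encoder features; text_embeds: (B, C, D) class embeddings.
        attended, _ = self.cross_attn(query=visual_tokens, key=text_embeds, value=text_embeds)
        gate = self.gate(attended)                 # per-channel relevance in [0, 1]
        return self.norm(visual_tokens + gate * attended)

# Example shapes: 1 image, 900 tokens, 20 candidate classes, 256-dim embeddings.
out = TextGuidedEnhancement()(torch.randn(1, 900, 256), torch.randn(1, 20, 256))
```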
Bridging the gap between performance and practicality, OpenRSD [51] aims to be a universal and flexible framework. It is designed to be a “Swiss Army knife” for aerial detection. Its key innovation is a dual-head, multi-task architecture. It features a lightweight alignment head for fast inference and a more complex fusion head for high-precision detection, allowing users to choose the right balance for their application. Critically, OpenRSD is the first framework in this context to explicitly support both horizontal (HBBs) and oriented (OBBs) bounding boxes, a crucial feature for accurately localizing objects in aerial imagery. It also supports both text and image prompts, further enhancing its flexibility. To maximize generalization, it employs a sophisticated multi-stage training pipeline, including pretraining, fine-tuning, and a carefully designed self-training stage that uses the model’s own predictions to create pseudo-labels and refine its knowledge across different datasets.
LLaMA-Unidetector [52] takes a decoupled approach, splitting the complex OVOD task into two simpler, distinct stages: object localization and category recognition. In the first stage, a class-agnostic detector is trained to perform only one task: to identify all potential objects in an image, distinguishing them from the background, without any knowledge of their specific categories. This detector is optimized purely for localization accuracy and generalization. In the second stage, the foreground object regions proposed by the first stage are passed to TerraOV-LLM, a specialized MLLM built upon LLaMA [58] and fine-tuned on TerraVQA, a large-scale visual question-answering dataset for remote sensing created by the authors. This MLLM then performs the recognition task, inferring the category of each object based on its visual features and a text prompt. The main advantage of this decoupled approach is its ability to tap into the unparalleled semantic understanding and generalization capabilities of MLLMs. This allows LLaMA-Unidetector to achieve outstanding zero-shot performance and even recognize objects that require a high degree of contextual or abstract reasoning. The modular design also offers great flexibility, as the localization and recognition components can be upgraded independently. However, this two-stage process introduces significant latency, making it unsuitable for real-time applications. It is also susceptible to compounded errors: if the class-agnostic detector fails to propose a region in the first stage, the MLLM will never have the chance to recognize it, regardless of its power.
DescReg [53] is a more theoretical and fundamental work that does not propose a new detector architecture but instead focuses on enhancing the quality of the learned visual–semantic embedding space. It identifies a core challenge in aerial zero-shot detection (ZSD): the weak semantic–visual correlation. Unlike in natural images, an object’s semantic label in aerial imagery may not correlate well with its visual appearance. To solve this, DescReg proposes to regularize the learning process with explicit visual descriptions. For each class, a text description of its typical visual appearance is provided. These descriptions are used to compute a visual similarity matrix between all classes. The core innovation is a novel similarity-aware, adaptive triplet loss. This loss function forces the learned visual feature space to conform to the structure of the predefined visual similarity matrix. It ensures that objects described as visually similar are closer in the embedding space than objects described as visually dissimilar, regardless of their semantic labels.
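The function below sketches a similarity-aware triplet objective in the spirit described above, where the margin between an anchor and a negative shrinks when their classes are described as visually similar. The margin scheme and the loop-based formulation are simplifications for clarity, not DescReg's exact loss.

```python
import torch
import torch.nn.functional as F

def similarity_aware_triplet_loss(features, labels, class_sim, base_margin=0.5):
    """Triplet-style loss whose margin adapts to a precomputed class-similarity
    matrix derived from textual visual descriptions: visually similar classes
    are allowed to lie closer than visually dissimilar ones. Simplified sketch.

    features:  (N, D) region embeddings; labels: (N,) class indices;
    class_sim: (C, C) visual-description similarity matrix in [0, 1].
    """
    features = F.normalize(features, dim=-1)
    dist = 1.0 - features @ features.t()                 # cosine distance matrix (N, N)

    loss, count = features.new_zeros(()), 0
    for a in range(len(labels)):
        pos = (labels == labels[a]).nonzero(as_tuple=True)[0]
        neg = (labels != labels[a]).nonzero(as_tuple=True)[0]
        for p in pos:
            if p == a:
                continue
            for n in neg:
                # Smaller margin when the two classes are described as visually similar.
                margin = base_margin * (1.0 - class_sim[labels[a], labels[n]])
                loss = loss + F.relu(dist[a, p] - dist[a, n] + margin)
                count += 1
    return loss / max(count, 1)
```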
The primary strength of DescReg is its theoretical elegance and high efficiency. It addresses a fundamental problem with a lightweight, mathematically sound solution. As a regularization method, it is highly versatile and can be integrated into various ZSD/OVOD frameworks to boost their performance. However, its effectiveness depends on the quality of the visual descriptions provided. It also focuses primarily on the classification aspect and does not directly address the challenge of improving proposal recall for novel objects.

3.3. Summary and Comparison

The preceding subsections provide a comparative summary of the two categories of aerial OVOD methods, highlighting their core principles and representative models and analyzing their strengths and challenges, with a specific focus on their suitability for UAV applications. The analysis reveals a clear spectrum of solutions, ranging from data-efficient semi-supervised methods ideal for low-resource scenarios, to high-performance end-to-end models for accuracy-critical tasks, to highly efficient models for real-time deployment, and powerful but slow MLLM-based systems for deep semantic analysis. This diverse landscape underscores the richness of the field and points toward a future where hybrid approaches may combine the best of these paradigms.
Table 1 presents a concise performance comparison of various state-of-the-art models on the DIOR and DOTA v1.0 datasets. The evaluation is stratified into two distinct protocols: an mAP-based assessment for few-shot or open-vocabulary settings, and an AP50-based assessment for general detection performance. It is crucial to emphasize that these two protocols are not directly comparable. The mAP metric, often averaged over a range of IoU thresholds, is significantly more stringent than AP50, which is calculated at a single, more lenient IoU threshold of 0.5. Therefore, a higher AP50 score does not necessarily imply superior performance over a model evaluated with mAP. Our analysis will thus consider the trends within each protocol separately, avoiding cross-protocol comparisons.
In the mAP-based evaluation, we analyze the models’ ability to detect both base and novel classes, using the harmonic mean (HM) to gauge the balance. A clear trend emerges: while DescReg achieves a high base mAP of 68.7 on both datasets, its performance on novel classes is critically low, leading to poor HM scores of 14.2 and 8.8, respectively. This indicates significant overfitting to the base categories. In contrast, CastDet and OVA-DETR show a much better trade-off. On DIOR, OVA-DETR leads with the highest HM of 39.3, driven by a strong base mAP of 79.6. However, on the more challenging DOTA v1.0 dataset, CastDet demonstrates superior generalization to novel classes (36.0 novel mAP), achieving the best HM of 45.1. DescReg and CastDet were contemporaneous works, while OVA-DETR emerged slightly later. Comparing the two contemporaneous works, DescReg and CastDet, we observe that pseudo-labeling-based algorithms generally achieve superior performance on novel classes. Even when compared to the later-proposed OVA-DETR, CastDet remains highly competitive on novel classes and demonstrates superior performance on the DOTA v1.0 dataset. However, pseudo-labeling-based methods tend to exhibit weaker performance on base classes. We attribute this observation to the inherent characteristics of pseudo-labeling mechanisms. Once pseudo-labels are generated, the model treats novel and base classes similarly during fine-tuning. However, since novel classes receive more focused optimization during this process, the model gradually develops a bias toward novel categories. This bias enhances detection performance for novel classes while simultaneously diverting some of the model’s capacity away from base classes, ultimately resulting in comparatively weaker performance on base categories. It should be noted that while pseudo-labeling-based methods demonstrate measurable advantages in novel-class detection during evaluation, this performance characteristic does not fully meet the expectations for open-world scenarios.
In the AP50-based evaluation, which measures overall detection accuracy at an IoU threshold of 0.5, we report results for a different set of methods. It is important to note that these scores are not directly comparable to the mAP-based results due to the differing evaluation criteria. LAE-DINO exhibits strong performance on DIOR, with an AP50 of 85.5. Similarly, OpenRSD achieves a competitive score of 77.7 on DOTA v1.0. LLaMA-Unidetector, evaluated on both datasets, records scores of 51.38 and 50.22, respectively.

4. Datasets and Evaluation Metrics

A strong and consistent evaluation system is crucial for tracking progress in any research area [59,60]. For OVOD research, this system depends on two key elements: good datasets that provide realistic challenges, and proper metrics that can fully measure a model’s ability to recognize both known and new objects. This section gives a complete review of the main datasets and metrics used in OVOD research, with special attention to how well they work for drone images and where they fall short.

4.1. General OVOD Datasets

The foundations of modern OVOD research were laid using large-scale, general-purpose object detection datasets originally designed for closed-set scenarios [61,62,63,64]. To adapt them for the open-vocabulary task, a standardized protocol of partitioning classes into base and novel sets was established. This split is crucial as it simulates a realistic scenario, where a model is trained on a limited set of annotated categories but is expected to operate in an open world with unseen objects.
Base classes provide the core visual–semantic grounding for the model. The model learns to associate specific visual features with textual class embeddings using the provided bounding box annotations. The primary goal during this phase is to learn a robust and generalizable alignment in a shared embedding space, rather than simply memorizing the base classes; novel classes represent the “open world”. The model must detect these objects at test time without ever having seen an annotated example during training. Success on novel classes is the ultimate measure of a model’s generalization capability, demonstrating its ability to transfer knowledge from the base classes to new, unseen concepts purely through semantic understanding of their class names.
The COCO dataset [61] is arguably the most influential benchmark in object detection and has become the primary testbed for OVOD. It contains 80 object categories from everyday scenes. For OVOD evaluation, a common protocol is to partition these 80 classes into 65 base classes and 15 novel classes. An alternative, more challenging split designates 48 classes as base and 17 as novel, with the remaining 15 categories removed because they lack synonym sets in the WordNet hierarchy. During training, models have access to both images and bounding box annotations for the base classes only. At inference time, the model’s performance is evaluated on its ability to detect all 80 classes, with a particular focus on the 15 or 17 novel classes, for which textual names are the only supervisory signal provided.
LVIS [62] was designed to address the long-tail distribution of objects in the real world, featuring over 1200 categories. This characteristic makes it an excellent benchmark for open-vocabulary learning, as it naturally contains a large set of rare classes that can be designated as “novel”. The class vocabulary of LVIS is often categorized into frequent, common, and rare classes. In the OVOD setting, the frequent and common classes are typically used as the base set, while the large set of rare classes serves as the novel set. This setup rigorously tests a model’s ability to generalize from a well-represented base to a sparsely represented, diverse set of unseen categories.
While foundational, these general-purpose datasets, captured primarily from a ground-level perspective, do not fully encapsulate the unique challenges of aerial scenes. The significant domain gap, characterized by top-down viewpoints, vast scale variations, complex backgrounds, and arbitrary object orientations, requires the use of specialized aerial datasets.

4.2. UAV-Specific Object Detection Datasets

The rapid advancement of UAV technology has led to the creation of numerous high-resolution aerial and satellite imagery datasets for object detection. In the following, we review some of the most prominent datasets in this domain; their characteristics are summarized in Table 2.
Early and HBB-based datasets: Initial efforts in aerial object detection produced datasets like UCAS-AOD [65], which focuses on two main categories (car and airplane), and NWPU VHR-10 [68], which expands to 10 common object classes. These datasets were instrumental in early research but are limited by their relatively small scale and use of horizontal bounding boxes (HBBs). HBBs are often imprecise for aerial objects like ships and airplanes, which are non-axis-aligned, leading to the inclusion of significant background noise within the bounding box. DIOR [73] is a much larger-scale HBB dataset, with 20 classes and over 190,000 instances, serving as a comprehensive benchmark for HBB-based detection in complex remote sensing scenes.
The rise of OBBs for precise localization: A major leap forward came with the introduction of oriented bounding boxes (OBBs). The HRSC [67] dataset was a pioneer in this area, providing OBB annotations for ships and highlighting the need for rotation-aware detection. This trend was solidified by the DOTA series [70,75]. DOTA v1.0 [70] became a de facto standard, with 15 categories, 2806 large images, and over 188,000 OBB-annotated instances. Its successor, DOTA v2.0 [75], further expanded this with 18 categories and a staggering 1.97 million instances, presenting immense challenges in scale, orientation, and aspect ratio.
Towards million-instance and fine-grained recognition: The last few years have seen the emergence of massive-scale datasets. xView [76] contains over 1 million instances across 60 fine-grained categories, although it uses HBB annotations. FAIR1M [78] pushed the boundaries further by providing over 1 million instances with high-quality OBB annotations, organized into a hierarchical structure of five super-categories and 37 sub-categories. More recently, STAR [85] continued this trend with a fine-grained hierarchy of eight super-categories and 48 sub-categories and OBB annotations. These datasets are crucial for training data-hungry models and evaluating fine-grained recognition capabilities. Other specialized datasets, like GLH-Bridge [79], focus on a single, challenging category with OBBs.
The LAE-1M [50] dataset represents a significant effort to solve the problem of data scarcity in the remote sensing community. It is the first large-scale remote sensing dataset to reach one million labeled objects with a broad category coverage. The construction of LAE-1M is particularly innovative, employing a two-pronged LAE-Label Engine. Fine-grained data (LAE-FOD): For existing, human-labeled datasets, the engine unifies these by performing image slicing, format alignment, and strategic sampling. This creates a high-quality, fine-grained object detection subset. Coarse-grained data (LAE-COD): To leverage the vast amount of unlabeled aerial imagery, the engine uses a semi-automated pipeline. It first employs a segmentation model to generate region proposals from unlabeled images. Then, a powerful large VLM is used to assign categorical labels to these regions in a zero-shot manner. Finally, rule-based filtering cleans the auto-generated labels. By combining both fine-grained and coarse-grained data, LAE-1M provides unprecedented scale and diversity, containing around 1600 unique vocabulary terms. This makes it an ideal pretraining corpus for developing foundational OVOD models for the aerial domain.
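The pseudocode-style sketch below illustrates such a semi-automated coarse labeling pipeline (class-agnostic proposals, zero-shot VLM naming, rule-based filtering); `segmenter` and `vlm` are hypothetical interfaces and the filtering rules are illustrative, not the LAE-Label Engine's actual rules.

```python
def coarse_grained_auto_label(image, segmenter, vlm, vocabulary,
                              min_area=64, min_score=0.3):
    """Sketch of a semi-automated coarse labeling pipeline: class-agnostic region
    proposals from a segmentation model are named zero-shot by a VLM and then
    filtered by simple rules. `segmenter` and `vlm` are hypothetical placeholders."""
    annotations = []
    for box, mask in segmenter.propose_regions(image):        # class-agnostic proposals
        if mask.sum() < min_area:                              # rule: drop tiny fragments
            continue
        label, score = vlm.classify(image, box, vocabulary)    # zero-shot naming
        if score >= min_score:                                 # rule: drop low-confidence labels
            annotations.append({"bbox": box, "label": label, "score": float(score)})
    return annotations
```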
The ORSD+ [51] dataset, proposed alongside the OpenRSD framework, is another large-scale training dataset constructed to enhance cross-domain generalization. It comprises over 470,000 images spanning 200 categories. The key innovation of ORSD+ lies in its multi-stage construction and training pipeline. Data aggregation: It begins by integrating numerous existing public datasets, both labeled and unlabeled, which creates an initial and heterogeneous data pool. ORSD+ is explicitly designed to train a universal remote sensing detector that can handle both HBB and OBB tasks and perform robustly across datasets it was not explicitly fine-tuned on, making it highly relevant for open-world scenarios.
The MI-OAD [84] dataset directly tackles the most significant limitation of prior works: the lack of rich, descriptive textual annotations. While datasets like DOTA [75] use simple word-level labels, MI-OAD pioneers the creation of a massive dataset with sentence-level descriptions for objects. It is by far the largest dataset of its kind, containing 163,023 images and 2 million image–caption pairs, which is approximately 40 times larger than any previous remote sensing visual grounding dataset. Its “Open-Source Word-to-Sentence (OS-W2S) Label Engine” is designed to generate rich annotations. Instead of just categories, the engine uses a powerful VLM to generate detailed captions for each object. These captions describe not only the object’s class but also its attributes, its position relative to other objects, and its absolute position within the image. It provides annotations at three distinct levels: the vocabulary level, phrase level, and sentence level. Meanwhile, it breaks the one-to-one correspondence between a caption and a single object. A single descriptive caption can be associated with multiple instances in the image, mimicking real-world language use. MI-OAD is the first benchmark truly designed to evaluate fine-grained, open-vocabulary aerial detection, moving beyond simple category names to complex, descriptive natural language prompts.
Despite these incredible new dataset-building efforts, a standardized benchmark specifically partitioned for open-vocabulary object detection in the UAV aerial domain is still in its infancy. While LAE-1M [50] and MI-OAD [84] propose their own evaluation splits, a community-wide consensus has yet to form. Therefore, the community urgently needs to develop a dedicated UAV-OVOD benchmark that will facilitate progress, achieve fair comparisons, and guide research to address the practical challenges of open-world perception in aerial scenes. In addition, from the dataset summarized in Table 2, we can see a paradigm shift in dataset creation for the aerial domain:
  • First, we observe a dramatic increase in category diversity. While conventional remote sensing datasets typically contain several dozen classes at most, open-vocabulary datasets like ORSD+ [51] and LAE-1M [50] push this to hundreds or even over a thousand categories, reflecting a significant step towards capturing real-world semantic diversity.
  • Second, this leap in scale is enabled by a fundamental change in data creation methodology. Rather than relying solely on manual annotation, each of these pioneering datasets is constructed using custom-designed annotation engines. These engines are critically dependent on the powerful zero-shot and text-generation capabilities of modern VLMs, to not only scale up instance numbers but also to increase the semantic richness of the supervision, thus laying the groundwork for a new generation of foundation models for Earth observation.

4.3. Evaluation Metrics

To comprehensively evaluate an OVOD model, metrics must capture its performance on both the familiar base classes and the unseen novel classes [86,87]. Standard object detection metrics are adapted for this dual-objective evaluation. There are several widely used metrics, as described below:
  • Average precision. The cornerstone of object detection evaluation is average precision (AP). AP provides a single-figure measure that summarizes the quality of a detector by considering both its ability to correctly classify objects (precision) and its ability to find all relevant objects (recall). To understand AP, we must first define its components: precision and recall. For a given object class, precision measures the fraction of correct predictions among all predictions made for that class. Recall measures the fraction of correct predictions among all ground-truth instances of that class. These two metrics can be formulated as follows:
    Precision = TP / (TP + FP)
    Recall = TP / (TP + FN)
    where true positives (TPs) are correctly detected objects, false positives (FPs) are incorrect detections, and false negatives (FNs) are missed ground-truth objects. A detection is typically considered a true positive if its intersection over union (IoU) with a ground-truth box is above a certain threshold. The IoU threshold is a critical hyperparameter that defines the required strictness of spatial accuracy. A widely used threshold is 0.5 (or 50%). Metrics reported with this threshold are often denoted as AP50 or mAP@0.5.
  • Precision–recall curve. An ideal detector would achieve high precision and high recall simultaneously. However, there is often a trade-off: to increase recall (find more objects), a model may lower its confidence threshold, which can lead to more false positives and thus lower precision. The precision–recall (PR) curve visualizes this trade-off by plotting precision against recall for various confidence thresholds.
  • Calculating average precision. AP is conceptually defined as the area under the PR curve. To create a more stable and representative metric, modern evaluation protocols [88,89] employ an interpolation method. The precision at any given recall level is set to the maximum precision achieved at any recall level greater than or equal to it. This creates a monotonically decreasing PR curve, and the AP is the area under this interpolated curve.
  • Mean average precision. Mean average precision (mAP) is the primary metric for object detection. It is calculated separately for the base and novel class sets. mAP_base: this metric is computed over the set of base classes. It quantifies the model’s ability to retain its detection performance on the classes it was explicitly trained on. A high mAP_base indicates that the model has not suffered from “catastrophic forgetting” while learning to generalize. mAP_novel: this is the most critical metric for OVOD. It is computed exclusively over the set of novel classes. It directly measures the model’s generalization power, its ability to locate and classify objects it has never seen before. A high mAP_novel signifies effective knowledge transfer from the seen to the unseen.
  • Harmonic mean. To provide a single, balanced score that reflects a model’s overall OVOD capability, the harmonic mean (HM) of the base and novel mAP scores is widely used. It is calculated as
    HM = (2 · mAP_base · mAP_novel) / (mAP_base + mAP_novel)
    The HM is more informative than a simple arithmetic mean because it heavily penalizes models that exhibit a large disparity between base- and novel-class performance. For instance, a model that achieves a very high mAP_base but a near-zero mAP_novel will receive a very low HM score. This metric thus encourages the development of models that achieve a strong balance between retaining knowledge of seen classes and successfully generalizing to new ones, which is the central goal of open-vocabulary object detection. A minimal computational sketch of these metrics is given after this list.
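The sketch below computes these quantities on toy inputs, following the formulas above; the numbers in the usage comments are illustrative only.

```python
import numpy as np

def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)

def interpolated_ap(recalls, precisions):
    """Area under the interpolated PR curve: precision at each recall level is
    replaced by the maximum precision at any recall >= that level."""
    r = np.concatenate(([0.0], recalls, [1.0]))
    p = np.concatenate(([0.0], precisions, [0.0]))
    for i in range(len(p) - 2, -1, -1):          # make precision monotonically decreasing
        p[i] = max(p[i], p[i + 1])
    idx = np.where(r[1:] != r[:-1])[0]           # points where recall changes
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

def harmonic_mean(map_base, map_novel):
    """HM = 2 * mAP_base * mAP_novel / (mAP_base + mAP_novel)."""
    total = map_base + map_novel
    return 2 * map_base * map_novel / total if total else 0.0

# Toy example: a model with base mAP 80 and novel mAP 20 gets HM 32,
# well below the arithmetic mean of 50, penalizing the large disparity.
print(harmonic_mean(80.0, 20.0))   # 32.0
```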

5. Challenges and Open Issues

The combination of open-vocabulary object detection (OVOD) and drone technology offers exciting possibilities for real-world applications. However, this integration also faces several key challenges. Aerial images have special characteristics, and current OVOD methods still have limitations, which create difficulties that need to be solved to make this technology fully effective [90,91,92]. This section examines these challenges in detail and discusses open research questions. We divide them into two main categories: (1) challenges caused by the nature of drone-captured aerial scenes, and (2) challenges related to current OVOD techniques and how they align text and visual data.

5.1. Limitations of UAV Scenarios

5.1.1. The Domain Gap

A major challenge in applying open-vocabulary object detection (OVOD) to drone imagery stems from the significant differences between standard visual–language model (VLM) training data and actual UAV-captured footage. Advanced VLMs, like CLIP [22] and similar models, form the foundation of most OVOD systems, achieving their impressive zero-shot recognition through pretraining on billions of web images paired with text descriptions. However, these training images primarily feature ground-level perspectives, showing objects in familiar side or frontal views, typically well-framed and centered against common everyday backgrounds that match human visual experience.
In contrast, drone-captured imagery presents fundamentally different characteristics. The aerial perspective introduces substantial visual distortion—when viewed from above, cars appear as simple rectangles while people become small dots, losing the distinctive visual features models rely on for recognition. Objects also exhibit extreme size variations depending on altitude, requiring models to identify the same object whether it fills most of the frame at low altitude or appears just a few pixels wide when flying high. Additionally, the complex, cluttered backgrounds typical of urban or natural aerial scenes make object detection particularly challenging, as targets must be distinguished from dense environmental features [93,94,95].
This domain gap significantly impacts model performance. The semantic connections learned during training, such as associating “passenger bus” with side-view features like length and windows, become unreliable when applied to overhead views where these characteristics are not visible. Consequently, models often fail to recognize objects that are obvious to human observers, even when the visual evidence is clear in the aerial imagery. This fundamental mismatch between ground-level training data and aerial operational conditions remains a critical obstacle for effective UAV-based object detection systems.
Here are some open issues:
  • Domain adaptation for VLMs: How can we adapt pretrained VLMs to the aerial domain without compromising their open-vocabulary capabilities? Fine-tuning on limited aerial data risks overfitting and catastrophic forgetting of the vast knowledge learned during pretraining. Research into parameter-efficient fine-tuning (PEFT) techniques [96,97], such as adapters [98] and prompt tuning [99], is a promising direction.
  • Synthetic data generation: Can we leverage simulation engines to generate large-scale, photorealistic aerial datasets with precise, multi-perspective annotations? This could help bridge the domain gap by exposing the VLM to top-down views during a secondary pretraining or fine-tuning stage.
  • Viewpoint-invariant feature learning: Developing novel network architectures or training strategies that encourage the learning of viewpoint-invariant object representations is a key research goal. This might involve contrastive learning objectives that pull representations of the same object from different viewpoints closer in the embedding space.

5.1.2. Small-Object Detection

The detection of small objects is a long-standing problem in computer vision, and it is particularly acute in UAV imagery due to high flight altitudes [35,100,101]. Quantitatively, a small object is often defined, following standard benchmarks such as COCO [89], as one occupying an area of less than 32×32 pixels. In the context of OVOD, this challenge is amplified, and small objects present a dual dilemma:
Information scarcity in visual features: Standard hierarchical vision backbones progressively down-sample the input image to build semantic representations. This process, while effective for large objects, can cause the features of small objects to diminish or vanish entirely in deeper, more semantically rich layers. The resulting feature vector for a small-object proposal is often weak, noisy, and lacks the discriminative information necessary for robust recognition.
Difficulty in visual–semantic alignment: The core mechanism of OVOD is to align region-specific visual features with text-embedding vectors. When the visual feature vector is sparse and non-descriptive due to the object’s small size [102], achieving a meaningful and confident alignment with a rich textual description becomes exceptionally difficult. The model struggles to differentiate a few blurry pixels corresponding to a “life raft” from background noise or other small, irrelevant objects, especially when both have low-quality feature representations.
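To make this alignment step concrete, the following minimal sketch scores region proposals against open-vocabulary class embeddings by cosine similarity; the random tensors stand in for a CLIP-style encoder’s outputs and the temperature value is an assumption, so this illustrates the mechanism rather than any specific detector.

```python
import torch
import torch.nn.functional as F

def classify_regions(region_feats: torch.Tensor,
                     text_embeds: torch.Tensor,
                     temperature: float = 0.01) -> torch.Tensor:
    """Score region proposals against open-vocabulary class embeddings.

    region_feats: (num_regions, dim) visual features pooled from proposals.
    text_embeds:  (num_classes, dim) embeddings of class names or prompts.
    Returns per-region class probabilities, shape (num_regions, num_classes).
    """
    region_feats = F.normalize(region_feats, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    logits = region_feats @ text_embeds.t() / temperature  # cosine similarities
    return logits.softmax(dim=-1)

# Toy usage: a weak, noisy feature for a tiny object yields flat, ambiguous scores.
probs = classify_regions(torch.randn(5, 512), torch.randn(3, 512))
```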
While traditional small-object detection methods employ specialized techniques like feature pyramid networks (FPNs) [103], high-resolution feature fusion [104], and context-aware modules [37], integrating these seamlessly into the OVOD framework is non-trivial. The challenge lies not just in enhancing small-object features but in ensuring that these enhanced features are compatible with the VLM’s pretrained embedding space.
Here are some open issues:
  • High-resolution OVOD architectures: Designing OVOD models that can process high-resolution imagery efficiently and maintain fine-grained spatial detail throughout the network is crucial. This may involve exploring novel multi-scale backbones or attention mechanisms tailored for small-object feature preservation.
  • Context-enhanced alignment: For small objects, surrounding context often provides critical clues. How can we design models that explicitly leverage contextual information (e.g., a “boat” is on “water”) to aid the visual–semantic alignment for the small object itself? This could involve graph-based reasoning or attention mechanisms that correlate object proposals with their environmental context.
  • Super-resolution as a pre-processing step: Investigating the use of generative super-resolution techniques to “hallucinate” details for small objects before feature extraction could be a viable, albeit computationally expensive, strategy. The challenge is to ensure the generated details are faithful and do not introduce misleading artifacts.

5.1.3. Fine-Grained Recognition

UAVs are often deployed for tasks that require not just detecting object categories but distinguishing between visually similar sub-categories, a task known as fine-grained recognition [78,105]. For example, in traffic monitoring, it is essential to differentiate between a “truck”, a “bus”, a “van”, an “SUV”, and a “sedan”. From a high-altitude, top-down perspective, these distinctions are incredibly subtle. This challenge is magnified in the OVOD setting for two primary reasons:
Loss of distinguishing features: The key visual cues that differentiate fine-grained categories are often subtle and localized (e.g., the length-to-width ratio, the presence of a cargo bed on a truck, the specific shape of the hood). From an aerial view, these features can be obscured, distorted, or simply too small to be resolved. All vehicles may appear as similarly colored rectangles, making visual differentiation based on intrinsic features nearly impossible.
High demands on language specificity: OVOD relies on the text prompt to guide detection. To perform fine-grained recognition, the model must understand the subtle semantic differences between prompts like “truck” and “bus” and link them to the minimal available visual evidence. The VLM may have learned these distinctions from ground-level images, where a “bus” has a long row of windows and a “truck” has a separate cab and trailer. When these features are absent in the aerial view, the model’s ability to ground the specific text prompt correctly is severely compromised.
This challenge pushes the limits of a VLM’s ability to generalize. The model must infer the correct category from indirect cues like relative size, location (e.g., on a highway vs. a city street), or roof features, which may not have been explicitly encoded in its original training.
Here are some open issues:
  • Hierarchical and attribute-based prompting: Instead of a single class name, can we use more descriptive, attribute-based prompts (e.g., “a long vehicle with a flat roof”, “a small four-wheeled vehicle”)? Developing methods that can parse and reason about such compositional queries is a key research area. Hierarchical prompting (e.g., querying for “vehicle” first, then refining to “truck” or “bus”) could also be a viable strategy.
  • Injecting domain-specific knowledge: Can we explicitly inject domain knowledge into the model? For example, a knowledge graph could inform the model that “buses are typically longer than SUVs” or that “fishing boats are found near coastlines”. Integrating this symbolic knowledge with the model’s learned representations could significantly improve fine-grained accuracy.
  • Few-shot fine-grained learning: In many applications, an operator may want to find a new, specific type of object. This requires the model to learn a new fine-grained category from just one or a few visual examples, a challenging few-shot learning problem within the OVOD context.

5.1.4. The Trade-Off Between Efficiency and Performance

A significant practical barrier to the widespread deployment of OVOD on UAVs is the classic trade-off between model performance and computational efficiency. This challenge is especially acute in applications with high social impact, such as disaster response in remote locations or supporting subsistence agriculture, where reliance on powerful cloud servers is impossible due to lack of connectivity. High-performing OVOD models, particularly those based on large backbones like ViT [106], are computationally demanding and have a large memory footprint.
Computational cost: The standard attention mechanism in Transformers [107] has a quadratic complexity with respect to the number of image patches, which makes processing high-resolution imagery computationally expensive. This is a primary bottleneck for large backbones like ViT [106]. However, it is important to note that the research community has actively addressed this limitation. A significant line of work has focused on developing efficient Transformers that approximate the attention mechanism to achieve linear or near linear complexity. Seminal works like Linformer [108], Performer [109], and others [110] have demonstrated various techniques, such as low-rank projections or kernel methods, to drastically reduce this computational burden. The key challenge for UAV OVOD, therefore, is not just acknowledging the problem, but effectively adapting and integrating these efficient Transformer architectures into detection frameworks without sacrificing the rich feature representation needed for open-vocabulary tasks.
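As a concrete illustration of this direction, the sketch below implements a kernelized linear attention in the spirit of these efficient Transformers: a positive feature map replaces the softmax so that cost grows linearly with the number of image tokens. It is a simplified example under assumed tensor shapes, not the exact formulation of any cited method.

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps: float = 1e-6):
    """Kernelized attention with O(N * d^2) cost instead of O(N^2 * d).

    q, k: (batch, num_tokens, dim); v: (batch, num_tokens, dim_v).
    """
    phi_q = F.elu(q) + 1.0  # positive feature map
    phi_k = F.elu(k) + 1.0
    kv = torch.einsum("bnd,bne->bde", phi_k, v)             # summarize keys/values once
    norm = torch.einsum("bnd,bd->bn", phi_q, phi_k.sum(1))  # per-query normalizer
    return torch.einsum("bnd,bde->bne", phi_q, kv) / (norm.unsqueeze(-1) + eps)

# For a 1024-token image at dim 256, this avoids forming a 1024x1024 attention matrix.
out = linear_attention(torch.randn(1, 1024, 256),
                       torch.randn(1, 1024, 256),
                       torch.randn(1, 1024, 256))
```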
Resource-constrained platforms: UAVs are fundamentally resource-constrained edge devices. They operate on limited battery power, and their onboard processing units have a fraction of the computational power and memory of a server-grade GPU.
This creates a dilemma: deploying a large, powerful model is often infeasible due to hardware and power limitations, while deploying a smaller, more efficient model may lead to an unacceptable drop in detection accuracy, especially for the challenging aerial scenarios described above. Real-time performance is often a strict requirement for applications like autonomous navigation or immediate threat detection, further tightening the constraints.
Here are some open issues:
  • Model compression for OVOD: Research is urgently needed on adapting model compression techniques to OVOD models. These techniques include quantization (reducing the precision of model weights to decrease memory usage and accelerate computation on compatible hardware), pruning (removing redundant weights or network structures to create a smaller, sparser model), and knowledge distillation (training a small, efficient “student” model to mimic the output of a large, high-performance “teacher” VLM). The key challenge is how to effectively distill the teacher’s rich, open-vocabulary knowledge into a compact student.
  • Efficient OVOD architectures: There is a need for the design of novel, lightweight network architectures specifically for edge-based OVOD. This might involve hybrid CNN-Transformer models or architectures that are optimized for the specific hardware accelerators found on UAVs.
  • Cloud–edge collaborative systems: An alternative approach is a hybrid system where the UAV performs lightweight, onboard pre-processing (e.g., detecting potential regions of interest) and transmits only relevant data to a more powerful ground station or cloud server for full OVOD analysis. The primary challenge here is managing communication latency and bandwidth.

5.2. Limitations of Text-Prompting Methods

5.2.1. Ambiguity and Robustness of Text Prompts

The open-vocabulary capability of OVOD is both its greatest strength and a potential source of fragility. The performance of the system is highly dependent on the quality and specificity of the user-provided text prompts. This introduces challenges in ambiguity and robustness.
Semantic ambiguity: The choice of words can have a significant impact on detection results [111]. For instance, the prompts “car”, “automobile”, and “vehicle” may seem synonymous to a human, but they can produce different results. “Vehicle” is broad and might correctly identify cars but also trigger false positives on buses and trucks. “Car” might be too specific and fail to detect SUVs or minivans if the VLM’s internal concept of “car” is biased towards sedans. This prompt engineering is currently more of an art than a science.
Varying levels of abstraction: Users may wish to query for abstract concepts rather than concrete objects. For example, in a disaster response scenario, a relevant query might be “signs of damage”, “a dangerous situation”, or “a gathering crowd”. These prompts do not correspond to a well-defined object category. They require a higher level of scene understanding and reasoning that current OVOD models, which are primarily trained for object-level recognition, are not equipped to handle. Grounding such abstract concepts in visual evidence is a frontier research problem.
Lack of robustness: Models can be brittle. A slight rephrasing of a prompt or the inclusion of descriptive adjectives can sometimes lead to unpredictable changes in performance [112]. Furthermore, most models lack the ability to understand negation or complex compositional queries involving spatial relationships.
Here are some open issues:
  • Automated prompt engineering and refinement: Can we develop methods that automatically generate or refine text prompts to be optimal for a given task or dataset? This could involve learning a mapping from a user’s high-level intent to a set of effective, low-level prompts.
  • Learning from multiple prompts: Instead of relying on a single prompt, models could be designed to leverage a set of synonymous or related prompts to produce more robust and reliable detections, for example by averaging their text embeddings as sketched after this list.
  • Abstract and compositional reasoning: A major leap forward would be the development of OVOD systems that can handle abstract queries by decomposing them into recognizable visual components. For example, a “dangerous situation” on a highway might be decomposed into “overturned car”, “traffic jam”, and “emergency vehicles”.
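One simple way to realize the multi-prompt idea above, assuming a CLIP-style text tower, is to encode several synonymous prompts and average their normalized embeddings into a single, more robust class vector. The `text_encoder` callable below is a placeholder assumption, not a specific library API.

```python
import torch
import torch.nn.functional as F

def ensemble_class_embedding(prompts, text_encoder):
    """Average the embeddings of synonymous prompts into one class vector.

    `text_encoder` is any callable mapping a list of strings to a
    (num_prompts, dim) tensor, e.g., a CLIP-style text tower (assumed here).
    """
    embeds = F.normalize(text_encoder(prompts), dim=-1)
    return F.normalize(embeds.mean(dim=0), dim=-1)

# Hypothetical usage: several phrasings of the same concept.
# car_embed = ensemble_class_embedding(
#     ["a car", "an automobile", "a sedan seen from above"], text_encoder)
```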

5.2.2. The Lack of Benchmarks

A final, overarching challenge that impedes progress in the entire field of UAV-OVOD is the critical lack of standardized benchmarks. The development and evaluation of new methods are currently fragmented and difficult to compare. Most existing UAV datasets [70,73,75,89] were created for traditional, closed-set object detection. While they provide high-quality aerial imagery and bounding box annotations, their class vocabularies are fixed and relatively small. Researchers wishing to evaluate OVOD models on this data must manually define their own “base” and “novel” class splits, leading to inconsistent evaluation protocols. Furthermore, these datasets lack the rich, descriptive, and hierarchical text labels needed to fully test the capabilities of VLMs.
A professional benchmark needs to meet the following requirements:
  • Enable fair comparison: Provide a common ground with standardized training, validation, and test sets, along with predefined base and novel vocabularies, to allow for the direct and fair comparison of different methods.
  • Drive progress on key challenges: The benchmark should be explicitly designed to include challenging scenarios that target the problems outlined in this section: a wide distribution of object scales (especially small objects), fine-grained categories with subtle differences, and diverse viewpoints and backgrounds.
  • Standardize evaluation metrics: Define clear and comprehensive evaluation metrics, including not only mAP on base and novel classes but also metrics for evaluating performance on hierarchical and descriptive queries, robustness to prompt variations, and computational efficiency.
In conclusion, while the fusion of OVOD and UAV technology holds immense promise, the path to robust, real-world deployment is fraught with challenges. Addressing the domain gap, solving the small-object and fine-grained recognition puzzles, balancing performance with efficiency, improving prompt robustness, and establishing standardized benchmarks are the key open issues that will define the research agenda in this exciting field for years to come.

5.3. Ethical, Privacy, and Bias Challenges

Beyond the technical hurdles, the deployment of UAV OVOD systems in sensitive applications like public surveillance and smart city management introduces a new class of profound challenges that are critical open issues for the field. The same capabilities that make this technology so useful also create potential for misuse and unintended societal harm. Addressing these issues is central to the responsible development and public acceptance of the technology. Key challenges include the following:
  • Privacy erosion and intrusive surveillance: Traditional detectors identify categories. OVOD can interpret and log specific, nuanced activities. This qualitative leap in surveillance capability poses an unprecedented threat to personal privacy and practical obscurity, creating a pressing need for research into privacy-preserving OVOD architectures and ethical-by-design principles.
  • Algorithmic bias and fairness: OVOD models inherit societal biases from their web-scale training data. In aerial imagery, this can lead to discriminatory outcomes. For example, a system might exhibit lower accuracy for certain demographic groups or misinterpret cultural objects and activities, posing significant fairness risks in applications like law enforcement or disaster response. Developing methods to audit and mitigate these biases in the aerial domain is a critical, unresolved problem.
  • Potential for misuse and accountability: The ease of describing any object for detection lowers the barrier for malicious applications, from unauthorized tracking to oppressive monitoring. This creates an urgent need for robust safeguards, access control, and transparent, explainable models to ensure accountability when systems fail or are misused.

6. Future Perspectives and Directions

The integration of OVOD into UAV aerial scenes, while promising, is still in its nascent stages. The challenges detailed in the previous section not only highlight the current limitations but also illuminate a clear path forward for future research. To transition this technology from a laboratory concept to a robust, deployable real-world system, the community must focus on several key research thrusts. This section outlines our vision for the future of UAV-OVOD, presenting six pivotal directions that we believe will shape the landscape of this powerful field.

6.1. Domain Adaptation for UAV-OVOD

The performance of current OVOD models is fundamentally constrained by the domain gap between ground-level web data and top-down aerial imagery. Bridging this gap is arguably the most critical step towards unlocking reliable performance. Future research must move beyond the naive application of off-the-shelf VLMs and focus on creating aerial-aware models that retain their open-vocabulary prowess.
A primary direction is the exploration of PEFT techniques. Methods such as adapters [98], which insert small, trainable modules between the frozen layers of a pretrained model, or LoRA [96], which injects trainable low-rank matrices into Transformer layers, offer a compelling solution. These approaches allow the model to learn aerial-specific features using a small amount of UAV data while keeping the vast majority of the VLM’s parameters frozen. This strategy mitigates the risk of “catastrophic forgetting”, thereby preserving the rich semantic knowledge learned from web-scale data.
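To make the LoRA idea concrete, the sketch below wraps a frozen linear projection with a trainable low-rank residual. It is a schematic illustration of the general technique; layer sizes, rank, and scaling are assumptions rather than settings used by any cited UAV-OVOD system.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer augmented with a trainable low-rank (LoRA) residual."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)  # keep pretrained weights frozen
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)       # the residual starts as a no-op
        self.scaling = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scaling * self.lora_b(self.lora_a(x))

# Example: wrap one projection of a (hypothetical) frozen VLM layer.
adapted = LoRALinear(nn.Linear(768, 768), rank=8)
```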
Another promising avenue is secondary pretraining on a curated, mid-scale corpus of aerial imagery. This involves taking a general-purpose VLM and continuing its pretraining on a dataset composed of existing aerial imagery and, crucially, large-scale synthetic aerial data. Advanced simulation platforms can generate photorealistic aerial scenes with perfect, automatic annotations, providing a cost-effective way to expose the model to a massive volume of top-down visual concepts. The research challenge lies in optimizing this secondary pretraining phase to maximize domain adaptation without incurring prohibitive computational costs.

6.2. Lightweight and Efficient OVOD Models

The practical utility of UAV-OVOD is contingent upon its ability to run in real time on resource-constrained onboard processors. As highlighted by use cases in emergency rescue and remote agriculture, where network access is unreliable, lightweight and efficient OVOD architectures are not just an optimization but a core enabling factor, and the “performance vs. efficiency” trade-off must be addressed through dedicated research into such architectures [113,114].
Model compression techniques will be paramount. These include the following:
  • Quantization: Systematically reducing the numerical precision of model weights and activations [115]. Post-training quantization and quantization-aware training tailored for OVOD models need to be investigated to minimize the accuracy loss.
  • Knowledge distillation: This is a particularly powerful paradigm for OVOD [116]. A large, high-performance “teacher” model can be used to train a small, efficient “student” model. The key research question is what knowledge to distill. Beyond simply matching the final detection outputs, the student could be trained to mimic the teacher’s intermediate region–text feature alignments, thereby learning the rich cross-modal relationships that enable open-vocabulary recognition (a minimal loss sketch is given at the end of this subsection).
  • Network pruning and architecture search: Exploring structured pruning to remove entire filters or attention heads [117], and employing neural architecture search (NAS) to discover novel [118], hardware-aware network designs that are inherently optimized for the computational patterns of edge GPUs.
The ultimate goal is to create a family of OVOD models that offer a flexible trade-off, allowing operators to choose a model that meets the specific latency and accuracy requirements of their mission.
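As a hedged illustration of this distillation strategy, the sketch below defines a loss in which a student detector mimics a teacher’s region–text alignment distribution while also imitating its region features; tensor shapes, the temperature, and the weighting factor are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def ovod_distillation_loss(student_region_feats, teacher_region_feats,
                           text_embeds, tau: float = 2.0, feat_weight: float = 1.0):
    """Distill a teacher's region-text alignment into a student detector.

    student_region_feats, teacher_region_feats: (num_regions, dim)
    text_embeds: (num_classes, dim) class/prompt embeddings shared by both models.
    """
    s = F.normalize(student_region_feats, dim=-1)
    t = F.normalize(teacher_region_feats, dim=-1)
    e = F.normalize(text_embeds, dim=-1)

    # KL divergence between softened region-text similarity distributions.
    s_logits = s @ e.t() / tau
    t_logits = t @ e.t() / tau
    align_loss = F.kl_div(s_logits.log_softmax(-1), t_logits.softmax(-1),
                          reduction="batchmean") * tau * tau

    # Optional direct imitation of the teacher's region embeddings.
    feat_loss = F.l1_loss(s, t)
    return align_loss + feat_weight * feat_loss
```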

6.3. Multi-Modal Data Fusion

UAVs are often equipped with a suite of sensors beyond standard RGB cameras, such as thermal infrared (IR), multispectral, or even LiDAR sensors. The future of UAV-OVOD lies in moving beyond the RGB–text paradigm to a more holistic multi-modal fusion framework. This will enable robust perception in challenging conditions where RGB data is ambiguous or unavailable, such as at night, in fog, or through smoke.
Imagine a search-and-rescue mission at night. A query for “person” using an RGB camera would likely fail. However, by fusing data from a thermal camera, the system could be prompted to find a “heat source” or “a warm object with a human-like shape”. This requires developing novel fusion architectures. Instead of simple early or late fusion, sophisticated cross-modal attention mechanisms could allow features from one modality to dynamically inform and enhance features from another. For instance, thermal features could guide the attention of the RGB-processing stream towards salient regions, and vice versa. The text embedding would then be aligned with this fused, multi-modal feature representation, creating a powerful system for all-weather detection. Research in this area will need to address challenges in cross-modal alignment, data synchronization, and training with heterogeneous data sources.
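A minimal sketch of such a cross-modal attention block is given below: thermal tokens act as keys and values that re-weight the RGB stream before the fused features are aligned with text embeddings. The token shapes, single attention layer, and residual design are simplifying assumptions rather than a proposed architecture.

```python
import torch
import torch.nn as nn

class ThermalGuidedFusion(nn.Module):
    """Fuse RGB and thermal token features with cross-modal attention."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, rgb_tokens, thermal_tokens):
        # rgb_tokens, thermal_tokens: (batch, num_tokens, dim)
        attended, _ = self.cross_attn(query=rgb_tokens,
                                      key=thermal_tokens,
                                      value=thermal_tokens)
        fused = self.norm(rgb_tokens + attended)  # residual fusion of the two streams
        return fused                              # later aligned with text embeddings

# Toy usage with random features standing in for the two sensor streams.
fusion = ThermalGuidedFusion()
fused = fusion(torch.randn(2, 196, 256), torch.randn(2, 196, 256))
```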

6.4. Interactive and Conversational Detection

The current interaction model for OVOD—a single, static text prompt—is limiting. The future will see a shift towards interactive and conversational systems that allow for a natural, multi-turn dialogue between the human operator and the UAV. This paradigm, inspired by advances in embodied AI and vision–language navigation, would enable complex, context-aware instructions. Consider the interaction illustrated in Figure 5.
This level of interaction requires more than just detection. It necessitates models that can handle the following:
  • Dialogue history and state tracking: Maintaining the context of the conversation across turns.
  • Co-reference resolution: As shown in Figure 5, understanding that “it” in the second prompt refers to the “bridge” from the first.
  • Task grounding: Translating natural language commands (“track that truck”) into specific actions for the perception and control modules.
This research direction points towards a powerful synergy between LLMs for understanding conversational intent and OVOD models for grounding that intent in the visual world.

6.5. Building Large-Scale UAV-OVOD Benchmarks

Progress in any data-driven field is catalyzed by high-quality public benchmarks. The lack of a standard benchmark is a major bottleneck for UAV-OVOD research. A concerted effort from both academia and industry is imperative to build a large-scale, richly annotated UAV-OVOD benchmark.
This future benchmark should possess several key characteristics:
  • Scale and diversity: It must contain tens of thousands of images from diverse geographical locations, altitudes, times of day, and weather conditions.
  • Expansive and hierarchical vocabulary: The vocabulary should encompass hundreds or even thousands of object classes, from common categories to rare instances and fine-grained sub-categories.
  • Rich annotations: Crucially, annotations must go beyond simple bounding boxes: each instance should include a tight bounding box and/or a segmentation mask, attribute labels, and multiple free-form textual descriptions capturing the object’s appearance and context (a hypothetical example record is sketched at the end of this subsection).
Such a benchmark would not only enable fair and reproducible comparisons but also drive research towards solving the core challenges of fine-grained recognition, context-based reasoning, and descriptive language grounding.
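To illustrate what such rich annotations might look like in practice, the snippet below sketches one hypothetical per-instance record combining a box, an optional mask, hierarchical category and attribute labels, and free-form descriptions; all field names and values are assumptions, not an existing standard.

```python
# A hypothetical annotation record for a UAV-OVOD benchmark (field names assumed).
annotation = {
    "image_id": "scene_000123",
    "bbox_xywh": [412.0, 108.5, 36.0, 17.5],      # tight bounding box in pixels
    "segmentation": None,                          # optional polygon / RLE mask
    "category": "vehicle/truck/pickup_truck",      # hierarchical class label
    "attributes": {"color": "white", "state": "parked"},
    "descriptions": [
        "a white pickup truck parked beside a warehouse",
        "small light-colored truck near the loading dock",
    ],
    "split": "novel",                              # base vs. novel vocabulary
}
```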

6.6. Integration with Downstream Tasks

Ultimately, OVOD is an enabling perception technology, not an end in itself. Its true value will be realized through its seamless integration into a broader ecosystem of intelligent UAV tasks, evolving from object detection to holistic situational awareness. Future research will focus on creating unified models or pipelines that leverage OVOD as a foundational component for more complex capabilities:
  • Open-vocabulary tracking (OVT): Extending the “detect-by-description” capability to “track-by-description”. An operator could initiate tracking by simply describing the target, and the system would maintain a persistent track of that specific object across time and viewpoint changes.
  • Open-vocabulary segmentation (OVS): A natural and crucial evolution from object detection is to move beyond bounding boxes and provide pixel-level masks for any described object or region. This task is helpful for applications requiring precise area measurement, fine-grained damage assessment, or land cover analysis. Far from being a mere future prospect, OVS for aerial imagery is an active and emerging research front, with several pioneering works [119,120,121] already laying the groundwork.
  • UAV-based visual question answering (VQA) and scene captioning: Enabling an operator to ask complex questions about the aerial scene or receive dense, language-based summaries of the dynamic environment.
By integrating these capabilities, we move towards a future where a UAV can autonomously perceive, understand, and describe its surroundings in human-like terms, transforming it from a simple remote sensor into a truly intelligent partner for a wide range of critical applications.

Author Contributions

Conceptualization, Y.Z., H.Z. and X.X.; methodology, Y.Z.; software, Y.Z.; validation, Y.Z., J.L., C.O. and D.Y.; formal analysis, Y.Z. and J.L.; investigation, Y.Z., J.L., C.O. and D.Y.; resources, H.Z. and X.X.; data curation, Y.Z.; writing—original draft preparation, Y.Z.; writing—review and editing, Y.Z., H.Z. and X.X.; visualization, Y.Z.; supervision, H.Z. and X.X.; project administration, H.Z. and X.X.; funding acquisition, H.Z. and X.X. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China under Grant 62401471 and in part by the 2024 Gusu Innovation and Entrepreneurship Leading Talents Program (Young Innovative Leading Talents) under Grant ZXL2024333.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Javed, S.; Hassan, A.; Ahmad, R.; Ahmed, W.; Ahmed, R.; Saadat, A.; Guizani, M. State-of-the-art and future research challenges in uav swarms. IEEE Internet Things J. 2024, 11, 19023–19045. [Google Scholar] [CrossRef]
  2. Mao, K.; Zhu, Q.; Wang, C.X.; Ye, X.; Gomez-Ponce, J.; Cai, X.; Miao, Y.; Cui, Z.; Wu, Q.; Fan, W. A survey on channel sounding technologies and measurements for UAV-assisted communications. IEEE Trans. Instrum. Meas. 2024, 73, 8004624. [Google Scholar] [CrossRef]
  3. Sadgrove, E.J.; Falzon, G.; Miron, D.; Lamb, D.W. Real-time object detection in agricultural/remote environments using the multiple-expert colour feature extreme learning machine (MEC-ELM). Comput. Ind. 2018, 98, 183–191. [Google Scholar] [CrossRef]
  4. Chaturvedi, V.; de Vries, W.T. Machine learning algorithms for urban land use planning: A review. Urban Sci. 2021, 5, 68. [Google Scholar] [CrossRef]
  5. Fang, Z.; Savkin, A.V. Strategies for optimized uav surveillance in various tasks and scenarios: A review. Drones 2024, 8, 193. [Google Scholar] [CrossRef]
  6. Albahri, A.; Khaleel, Y.L.; Habeeb, M.A.; Ismael, R.D.; Hameed, Q.A.; Deveci, M.; Homod, R.Z.; Albahri, O.; Alamoodi, A.; Alzubaidi, L. A systematic review of trustworthy artificial intelligence applications in natural disasters. Comput. Electr. Eng. 2024, 118, 109409. [Google Scholar] [CrossRef]
  7. Reilly, V.; Idrees, H.; Shah, M. Detection and tracking of large number of targets in wide area surveillance. In Proceedings of the Computer Vision–ECCV 2010: 11th European Conference on Computer Vision, Heraklion, Crete, Greece, 5–11 September 2010; Proceedings, Part III 11. Springer: Berlin/Heidelberg, Germany, 2010; pp. 186–199. [Google Scholar]
  8. Sy Nguyen, V.; Jung, J.; Jung, S.; Joe, S.; Kim, B. Deployable Hook Retrieval System for UAV Rescue and Delivery. IEEE Access 2021, 9, 74632–74645. [Google Scholar] [CrossRef]
  9. Wang, D.; Li, W.; Liu, X.; Li, N.; Zhang, C. UAV environmental perception and autonomous obstacle avoidance: A deep learning and depth camera combined solution. Comput. Electron. Agric. 2020, 175, 105523. [Google Scholar] [CrossRef]
  10. Nelson, J.R.; Grubesic, T.H.; Wallace, D.; Chamberlain, A.W. The view from above: A survey of the public’s perception of unmanned aerial vehicles and privacy. J. Urban Technol. 2019, 26, 83–105. [Google Scholar] [CrossRef]
  11. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271. [Google Scholar]
  12. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  13. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149. [Google Scholar] [CrossRef]
  14. Yi, Z.F.; Frederick, H.; Mendoza, R.L.; Avery, R.; Goodman, L. AI Mapping Risks to Wildlife in Tanzania: Rapid scanning aerial images to flag the changing frontier of human-wildlife proximity. In Proceedings of the 2021 IEEE International Geoscience and Remote Sensing Symposium IGARSS, Brussels, Belgium, 11–16 July 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 5299–5302. [Google Scholar]
  15. Zhao, T.; Nevatia, R. Car detection in low resolution aerial images. Image Vis. Comput. 2003, 21, 693–703. [Google Scholar] [CrossRef]
  16. Han, K.; Huang, X.; Li, Y.; Vaze, S.; Li, J.; Jia, X. What’s in a Name? Beyond Class Indices for Image Recognition. arXiv 2024, arXiv:2304.02364. [Google Scholar] [CrossRef]
  17. Vaze, S.; Han, K.; Vedaldi, A.; Zisserman, A. Open-set recognition: A good closed-set classifier is all you need? In Proceedings of the International Conference on Learning Representations, Virtual, 25–29 April 2022. [Google Scholar]
  18. Bansal, A.; Sikka, K.; Sharma, G.; Chellappa, R.; Divakaran, A. Zero-Shot Object Detection. arXiv 2018, arXiv:1804.04340. [Google Scholar] [CrossRef]
  19. Zhao, H.; Puig, X.; Zhou, B.; Fidler, S.; Torralba, A. Open Vocabulary Scene Parsing. arXiv 2017, arXiv:1703.08769. [Google Scholar] [CrossRef]
  20. Zareian, A.; Rosa, K.D.; Hu, D.H.; Chang, S.F. Open-Vocabulary Object Detection Using Captions. arXiv 2021, arXiv:2011.10678. [Google Scholar] [CrossRef]
  21. Gu, X.; Lin, T.Y.; Kuo, W.; Cui, Y. Open-vocabulary Object Detection via Vision and Language Knowledge Distillation. arXiv 2022, arXiv:2104.13921. [Google Scholar] [CrossRef]
  22. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models From Natural Language Supervision. arXiv 2021, arXiv:2103.00020. [Google Scholar] [CrossRef]
  23. Soviany, P.; Ionescu, R.T. Optimizing the Trade-Off between Single-Stage and Two-Stage Deep Object Detectors using Image Difficulty Prediction. In Proceedings of the 2018 20th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC), Timisoara, Romania, 20–23 September 2018; pp. 209–214. [Google Scholar] [CrossRef]
  24. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition; 2014; pp. 580–587. [Google Scholar] [CrossRef]
  25. Girshick, R. Fast R-CNN. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar] [CrossRef]
  26. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; Cortes, C., Lawrence, N., Lee, D., Sugiyama, M., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2015; Volume 28. [Google Scholar]
  27. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar] [CrossRef]
  28. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. arXiv 2016, arXiv:1506.02640. [Google Scholar] [CrossRef]
  29. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar] [CrossRef]
  30. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Computer Vision—ECCV 2016; Springer International Publishing: Berlin/Heidelberg, Germany, 2016; pp. 21–37. [Google Scholar] [CrossRef]
  31. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. arXiv 2018, arXiv:1708.02002. [Google Scholar] [CrossRef]
  32. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollar, P. Focal Loss for Dense Object Detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Paris, France, 4–6 October 2017. [Google Scholar]
  33. Gong, Y.; Yu, X.; Ding, Y.; Peng, X.; Zhao, J.; Han, Z. Effective Fusion Factor in FPN for Tiny Object Detection. In Proceedings of the 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2021; pp. 1159–1167. [Google Scholar] [CrossRef]
  34. Yang, F.; Fan, H.; Chu, P.; Blasch, E.; Ling, H. Clustered object detection in aerial images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8311–8320. [Google Scholar]
  35. Wang, J.; Yang, W.; Guo, H.; Zhang, R.; Xia, G.S. Tiny object detection in aerial images. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 3791–3798. [Google Scholar]
  36. Li, C.; Yang, T.; Zhu, S.; Chen, C.; Guan, S. Density Map Guided Object Detection in Aerial Images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  37. Du, B.; Huang, Y.; Chen, J.; Huang, D. Adaptive Sparse Convolutional Networks With Global Context Enhancement for Faster Object Detection on Drone Images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 13435–13444. [Google Scholar]
  38. Ding, J.; Xue, N.; Long, Y.; Xia, G.S.; Lu, Q. Learning RoI Transformer for Oriented Object Detection in Aerial Images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  39. Geng, C.; Huang, S.j.; Chen, S. Recent advances in open set recognition: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 3614–3631. [Google Scholar] [CrossRef]
  40. Yang, H.M.; Zhang, X.Y.; Yin, F.; Yang, Q.; Liu, C.L. Convolutional prototype network for open set recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 2358–2370. [Google Scholar] [CrossRef]
  41. Reis, D.; Kupec, J.; Hong, J.; Daoudi, A. Real-time flying object detection with YOLOv8. arXiv 2023, arXiv:2305.09972. [Google Scholar] [CrossRef]
  42. Varghese, R.; Sambath, M. Yolov8: A novel object detection algorithm with enhanced performance and robustness. In Proceedings of the 2024 International Conference on Advances in Data Engineering and Intelligent Computing Systems (ADICS), Chennai, India, 18–19 April 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1–6. [Google Scholar]
  43. Minderer, M.; Gritsenko, A.; Stone, A.; Neumann, M.; Weissenborn, D.; Dosovitskiy, A.; Mahendran, A.; Arnab, A.; Dehghani, M.; Shen, Z.; et al. Simple open-vocabulary object detection. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 728–755. [Google Scholar]
  44. Du, Y.; Wei, F.; Zhang, Z.; Shi, M.; Gao, Y.; Li, G. Learning to prompt for open-vocabulary object detection with vision-language model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 14084–14093. [Google Scholar]
  45. Rasheed, H.; Maaz, M.; Khattak, M.U.; Khan, S.; Khan, F.S. Bridging the Gap between Object and Image-level Representations for Open-Vocabulary Detection. arXiv 2022, arXiv:2207.03482. [Google Scholar] [CrossRef]
  46. Zhong, Y.; Yang, J.; Zhang, P.; Li, C.; Codella, N.; Li, L.H.; Zhou, L.; Dai, X.; Yuan, L.; Li, Y.; et al. RegionCLIP: Region-based Language-Image Pretraining. arXiv 2021, arXiv:2112.09106. [Google Scholar] [CrossRef]
  47. Li, Y.; Guo, W.; Yang, X.; Liao, N.; He, D.; Zhou, J.; Yu, W. Toward open vocabulary aerial object detection with clip-activated student-teacher learning. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2024; pp. 431–448. [Google Scholar]
  48. Saini, N.; Dubey, A.; Das, D.; Chattopadhyay, C. Advancing open-set object detection in remote sensing using multimodal large language model. In Proceedings of the Winter Conference on Applications of Computer Vision, Tucson, Arizona, 28 February–4 March 2025; pp. 451–458. [Google Scholar]
  49. Wei, G.; Yuan, X.; Liu, Y.; Shang, Z.; Yao, K.; Li, C.; Yan, Q.; Zhao, C.; Zhang, H.; Xiao, R. OVA-DETR: Open vocabulary aerial object detection using image-text alignment and fusion. arXiv 2024, arXiv:2408.12246. [Google Scholar]
  50. Pan, J.; Liu, Y.; Fu, Y.; Ma, M.; Li, J.; Paudel, D.P.; Van Gool, L.; Huang, X. Locate anything on earth: Advancing open-vocabulary object detection for remote sensing community. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 6281–6289. [Google Scholar]
  51. Huang, Z.; Feng, Y.; Yang, S.; Liu, Z.; Liu, Q.; Wang, Y. Openrsd: Towards open-prompts for object detection in remote sensing images. arXiv 2025, arXiv:2503.06146. [Google Scholar]
  52. Xie, J.; Wang, G.; Zhang, T.; Sun, Y.; Chen, H.; Zhuang, Y.; Li, J. LLaMA-Unidetector: An LLaMA-Based Universal Framework for Open-Vocabulary Object Detection in Remote Sensing Imagery. IEEE Trans. Geosci. Remote Sens. 2025, 63, 4409318. [Google Scholar] [CrossRef]
  53. Zang, Z.; Lin, C.; Tang, C.; Wang, T.; Lv, J. Zero-shot aerial object detection with visual description regularization. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 26–27 February 2024; Volume 38, pp. 6926–6934. [Google Scholar]
  54. Liu, F.; Chen, D.; Guan, Z.; Zhou, X.; Zhu, J.; Ye, Q.; Fu, L.; Zhou, J. RemoteCLIP: A Vision Language Foundation Model for Remote Sensing. arXiv 2024, arXiv:2306.11029. [Google Scholar] [CrossRef]
  55. Anil, R.; Dai, A.M.; Firat, O.; Johnson, M.; Lepikhin, D.; Passos, A.; Shakeri, S.; Taropa, E.; Bailey, P.; Chen, Z.; et al. PaLM 2 Technical Report. arXiv 2023, arXiv:2305.10403. [Google Scholar] [CrossRef]
  56. Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv 2023, arXiv:2307.09288. [Google Scholar] [CrossRef]
  57. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs Beat YOLOs on Real-time Object Detection. arXiv 2024, arXiv:2304.08069. [Google Scholar] [CrossRef]
  58. Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. LLaMA: Open and Efficient Foundation Language Models. arXiv 2023, arXiv:2302.13971. [Google Scholar] [CrossRef]
  59. Nilsen, P. Making sense of implementation theories, models, and frameworks. In Implementation Science 3.0; Springer: Berlin/Heidelberg, Germany, 2020; pp. 53–79. [Google Scholar]
  60. Venable, J.; Pries-Heje, J.; Baskerville, R. FEDS: A framework for evaluation in design science research. Eur. J. Inf. Syst. 2016, 25, 77–89. [Google Scholar] [CrossRef]
  61. Lin, T.Y.; Maire, M.; Belongie, S.; Bourdev, L.; Girshick, R.; Hays, J.; Perona, P.; Ramanan, D.; Zitnick, C.L.; Dollár, P. Microsoft COCO: Common Objects in Context. arXiv 2015, arXiv:1405.0312. [Google Scholar] [CrossRef]
  62. Gupta, A.; Dollár, P.; Girshick, R. LVIS: A Dataset for Large Vocabulary Instance Segmentation. arXiv 2019, arXiv:1908.03195. [Google Scholar] [CrossRef]
  63. Wang, J.; Zhang, P.; Chu, T.; Cao, Y.; Zhou, Y.; Wu, T.; Wang, B.; He, C.; Lin, D. V3Det: Vast Vocabulary Visual Detection Dataset. arXiv 2023, arXiv:2304.03752. [Google Scholar] [CrossRef]
  64. Yao, Y.; Liu, P.; Zhao, T.; Zhang, Q.; Liao, J.; Fang, C.; Lee, K.; Wang, Q. How to Evaluate the Generalization of Detection? A Benchmark for Comprehensive Open-Vocabulary Detection. arXiv 2023, arXiv:2308.13177. [Google Scholar] [CrossRef]
  65. Zhu, H.; Chen, X.; Dai, W.; Fu, K.; Ye, Q.; Jiao, J. Orientation robust object detection in aerial images using deep convolutional neural network. In Proceedings of the 2015 IEEE International Conference on Image Processing (ICIP), Quebec City, QC, Canada, 1–27 September 2015; pp. 3735–3739. [Google Scholar] [CrossRef]
  66. Long, Y.; Gong, Y.; Xiao, Z.; Liu, Q. Accurate Object Localization in Remote Sensing Images Based on Convolutional Neural Networks. IEEE Trans. Geosci. Remote Sens. 2017, 55, 2486–2498. [Google Scholar] [CrossRef]
  67. Liu, Z.; Yuan, L.; Weng, L.; Yang, Y. A high resolution optical satellite image dataset for ship recognition and some new baselines. In Proceedings of the International Conference on Pattern Recognition Applications and Methods, Porto, Portugal, 24–26 February 2017; SciTePress: Setúbal, Portugal, 2017; Volume 2, pp. 324–331. [Google Scholar]
  68. Cheng, G.; Han, J.; Zhou, P.; Guo, L. Multi-class geospatial object detection and geographic image classification based on collection of part detectors. ISPRS J. Photogramm. Remote Sens. 2014, 98, 119–132. [Google Scholar] [CrossRef]
  69. Zou, Z.; Shi, Z. Random Access Memories: A New Paradigm for Target Detection in High Resolution Aerial Remote Sensing Images. IEEE Trans. Image Process. 2018, 27, 1100–1111. [Google Scholar] [CrossRef]
  70. Xia, G.S.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; Zhang, L. DOTA: A large-scale dataset for object detection in aerial images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3974–3983. [Google Scholar]
  71. Zhang, Y.; Yuan, Y.; Feng, Y.; Lu, X. Hierarchical and Robust Convolutional Neural Network for Very High-Resolution Remote Sensing Object Detection. IEEE Trans. Geosci. Remote Sens. 2019, 57, 5535–5548. [Google Scholar] [CrossRef]
  72. Haroon, M.; Shahzad, M.; Fraz, M.M. Multisized Object Detection Using Spaceborne Optical Imagery. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 3032–3046. [Google Scholar] [CrossRef]
  73. Li, K.; Wan, G.; Cheng, G.; Meng, L.; Han, J. Object detection in optical remote sensing images: A survey and a new benchmark. ISPRS J. Photogramm. Remote Sens. 2020, 159, 296–307. [Google Scholar] [CrossRef]
  74. Cheng, G.; Wang, J.; Li, K.; Xie, X.; Lang, C.; Yao, Y.; Han, J. Anchor-free oriented proposal generator for object detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5625411. [Google Scholar] [CrossRef]
  75. Ding, J.; Xue, N.; Xia, G.S.; Bai, X.; Yang, W.; Yang, M.Y.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; et al. Object detection in aerial images: A large-scale benchmark and challenges. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 7778–7796. [Google Scholar] [CrossRef]
  76. Lam, D.; Kuzma, R.; McGee, K.; Dooley, S.; Laielli, M.; Klaric, M.; Bulatov, Y.; McCord, B. xview: Objects in context in overhead imagery. arXiv 2018, arXiv:1802.07856. [Google Scholar] [CrossRef]
  77. Bansal, A.; Sikka, K.; Sharma, G.; Chellappa, R.; Divakaran, A. Zero-shot object detection. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 384–400. [Google Scholar]
  78. Sun, X.; Wang, P.; Yan, Z.; Xu, F.; Wang, R.; Diao, W.; Chen, J.; Li, J.; Feng, Y.; Xu, T.; et al. FAIR1M: A benchmark dataset for fine-grained object recognition in high-resolution remote sensing imagery. ISPRS J. Photogramm. Remote Sens. 2022, 184, 116–130. [Google Scholar] [CrossRef]
  79. Li, Y.; Luo, J.; Zhang, Y.; Tan, Y.; Yu, J.G.; Bai, S. Learning to holistically detect bridges from large-size vhr remote sensing imagery. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 11507–11523. [Google Scholar] [CrossRef]
  80. Cheng, G.; Yuan, X.; Yao, X.; Yan, K.; Zeng, Q.; Xie, X.; Han, J. Towards Large-Scale Small Object Detection: Survey and Benchmarks. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 13467–13488. [Google Scholar] [CrossRef]
  81. Sun, Y.; Feng, S.; Li, X.; Ye, Y.; Kang, J.; Huang, X. Visual Grounding in Remote Sensing Images. In Proceedings of the 30th ACM International Conference on Multimedia, New York, NY, USA, 10 October 2022; MM ’22. pp. 404–412. [Google Scholar] [CrossRef]
  82. Li, K.; Wang, D.; Xu, H.; Zhong, H.; Wang, C. Language-Guided Progressive Attention for Visual Grounding in Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5631413. [Google Scholar] [CrossRef]
  83. Zhan, Y.; Xiong, Z.; Yuan, Y. RSVG: Exploring Data and Models for Visual Grounding on Remote Sensing Data. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5604513. [Google Scholar] [CrossRef]
  84. Wei, G.; Liu, Y.; Yuan, X.; Xue, X.; Guo, L.; Yang, Y.; Zhao, C.; Bai, Z.; Zhang, H.; Xiao, R. From Word to Sentence: A Large-Scale Multi-Instance Dataset for Open-Set Aerial Detection. arXiv 2025, arXiv:2505.03334. [Google Scholar]
  85. Li, Y.; Wang, L.; Wang, T.; Yang, X.; Luo, J.; Wang, Q.; Deng, Y.; Wang, W.; Sun, X.; Li, H.; et al. Star: A first-ever dataset and a large-scale benchmark for scene graph generation in large-size satellite imagery. IEEE Trans. Pattern Anal. Mach. Intell 2025, 47, 1832–1849. [Google Scholar] [CrossRef] [PubMed]
  86. Zheng, L.; Shen, L.; Tian, L.; Wang, S.; Wang, J.; Tian, Q. Scalable Person Re-Identification: A Benchmark. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–15 December 2015. [Google Scholar]
  87. Lin, C.Y. ROUGE: A Package for Automatic Evaluation of Summaries. In Proceedings of the Text Summarization Branches Out, Barcelona, Spain, 25–26 July 2004; pp. 74–81. [Google Scholar]
  88. Everingham, M.; Gool, L.; Williams, C.K.; Winn, J.; Zisserman, A. The Pascal Visual Object Classes (VOC) Challenge. Int. J. Comput. Vis. 2010, 88, 303–338. [Google Scholar] [CrossRef]
  89. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Proceedings of the Computer Vision—ECCV, Zurich, Switzerland, 6–12 September 2014; Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T., Eds.; Springer: Cham, Switzerland, 2014; pp. 740–755. [Google Scholar]
  90. Leng, J.; Ye, Y.; Mo, M.; Gao, C.; Gan, J.; Xiao, B.; Gao, X. Recent Advances for Aerial Object Detection: A Survey. ACM Comput. Surv. 2024, 56, 1–36. [Google Scholar] [CrossRef]
  91. Zhu, P.; Wen, L.; Du, D.; Bian, X.; Fan, H.; Hu, Q.; Ling, H. Detection and Tracking Meet Drones Challenge. arXiv 2021, arXiv:2001.06303. [Google Scholar] [CrossRef] [PubMed]
  92. Cazzato, D.; Cimarelli, C.; Sanchez-Lopez, J.L.; Voos, H.; Leo, M. A Survey of Computer Vision Methods for 2D Object Detection from Unmanned Aerial Vehicles. J. Imaging 2020, 6, 78. [Google Scholar] [CrossRef] [PubMed]
  93. Mueller, M.; Smith, N.; Ghanem, B. A Benchmark and Simulator for UAV Tracking. In Proceedings of the Computer Vision—ECCV, Amsterdam, The Netherlands, 11–14 October 2016; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Springer: Cham, Switzerland, 2016; pp. 445–461. [Google Scholar]
  94. Kalra, I.; Singh, M.; Nagpal, S.; Singh, R.; Vatsa, M.; Sujit, P.B. DroneSURF: Benchmark Dataset for Drone-based Face Recognition. In Proceedings of the 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019), Lille, France, 14–18 May 2019; IEEE Press: Piscataway, NJ, USA, 2019; pp. 1–7. [Google Scholar] [CrossRef]
  95. Du, D.; Qi, Y.; Yu, H.; Yang, Y.; Duan, K.; Li, G.; Zhang, W.; Huang, Q.; Tian, Q. The Unmanned Aerial Vehicle Benchmark: Object Detection and Tracking. arXiv 2018, arXiv:1804.00518. [Google Scholar] [CrossRef]
  96. Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-Rank Adaptation of Large Language Models. arXiv 2021, arXiv:2106.09685. [Google Scholar] [CrossRef]
  97. Li, X.L.; Liang, P. Prefix-Tuning: Optimizing Continuous Prompts for Generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online, 1–6 August 2021; Zong, C., Xia, F., Li, W., Navigli, R., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 4582–4597. [Google Scholar] [CrossRef]
  98. Hu, Z.; Wang, L.; Lan, Y.; Xu, W.; Lim, E.P.; Bing, L.; Xu, X.; Poria, S.; Lee, R.K.W. LLM-Adapters: An Adapter Family for Parameter-Efficient Fine-Tuning of Large Language Models. arXiv 2023, arXiv:2304.01933. [Google Scholar] [CrossRef]
  99. Lester, B.; Al-Rfou, R.; Constant, N. The Power of Scale for Parameter-Efficient Prompt Tuning. arXiv 2021, arXiv:2104.08691. [Google Scholar] [CrossRef]
  100. Shi, T.; Gong, J.; Hu, J.; Zhi, X.; Zhang, W.; Zhang, Y.; Zhang, P.; Bao, G. Feature-Enhanced CenterNet for Small Object Detection in Remote Sensing Images. Remote Sens. 2022, 14, 5488. [Google Scholar] [CrossRef]
  101. Wang, J.; Xu, C.; Yang, W.; Yu, L. A Normalized Gaussian Wasserstein Distance for Tiny Object Detection. arXiv 2022, arXiv:2110.13389. [Google Scholar] [CrossRef]
  102. Jing, R.; Zhang, W.; Li, Y.; Li, W.; Liu, Y. Feature aggregation network for small object detection. Expert Syst. Appl. 2024, 255, 124686. [Google Scholar] [CrossRef]
  103. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. arXiv 2017, arXiv:1612.03144. [Google Scholar] [CrossRef]
  104. Ma, Z.; Zhou, L.; Wu, D.; Zhang, X. A small object detection method with context information for high altitude images. Pattern Recognit. Lett. 2025, 188, 22–28. [Google Scholar] [CrossRef]
  105. Zhang, R.; Xie, C.; Deng, L. A fine-grained object detection model for aerial images based on yolov5 deep neural network. Chin. J. Electron. 2023, 32, 51–63. [Google Scholar] [CrossRef]
  106. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2021, arXiv:2010.11929. [Google Scholar] [CrossRef]
  107. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2023, arXiv:1706.03762. [Google Scholar] [CrossRef]
  108. Wang, S.; Li, B.Z.; Khabsa, M.; Fang, H.; Ma, H. Linformer: Self-Attention with Linear Complexity. arXiv 2020, arXiv:2006.04768. [Google Scholar] [CrossRef]
  109. Choromanski, K.; Likhosherstov, V.; Dohan, D.; Song, X.; Gane, A.; Sarlos, T.; Hawkins, P.; Davis, J.; Mohiuddin, A.; Kaiser, L.; et al. Rethinking Attention with Performers. arXiv 2022, arXiv:2009.14794. [Google Scholar] [CrossRef]
  110. Kitaev, N.; Kaiser, Ł.; Levskaya, A. Reformer: The Efficient Transformer. arXiv 2020, arXiv:2001.04451. [Google Scholar] [CrossRef]
  111. Manhardt, F.; Arroyo, D.M.; Rupprecht, C.; Busam, B.; Birdal, T.; Navab, N.; Tombari, F. Explaining the ambiguity of object detection and 6d pose from visual data. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6841–6850. [Google Scholar]
  112. He, H.; Ding, J.; Xu, B.; Xia, G.S. On the robustness of object detection models on aerial images. IEEE Trans. Geosci. Remote Sens. 2024, 63, 5600512. [Google Scholar] [CrossRef]
  113. Yue, M.; Zhang, L.; Huang, J.; Zhang, H. Lightweight and efficient tiny-object detection based on improved YOLOv8n for UAV aerial images. Drones 2024, 8, 276. [Google Scholar] [CrossRef]
  114. Hu, M.; Li, Z.; Yu, J.; Wan, X.; Tan, H.; Lin, Z. Efficient-lightweight YOLO: Improving small object detection in YOLO for aerial images. Sensors 2023, 23, 6423. [Google Scholar] [CrossRef]
  115. Plastiras, G.; Siddiqui, S.; Kyrkou, C.; Theocharides, T. Efficient embedded deep neural-network-based object detection via joint quantization and tiling. In Proceedings of the 2020 2nd IEEE International Conference on Artificial Intelligence Circuits and Systems (AICAS), Genova, Italy, 31 August–2 September 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 6–10. [Google Scholar]
  116. Yang, Y.; Sun, X.; Diao, W.; Li, H.; Wu, Y.; Li, X.; Fu, K. Adaptive knowledge distillation for lightweight remote sensing object detectors optimizing. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5623715. [Google Scholar] [CrossRef]
  117. Zhao, P.; Yuan, G.; Cai, Y.; Niu, W.; Liu, Q.; Wen, W.; Ren, B.; Wang, Y.; Lin, X. Neural pruning search for real-time object detection of autonomous vehicles. In Proceedings of the 2021 58th ACM/IEEE Design Automation Conference (DAC), San Francisco, CA, USA, 5–9 December 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 835–840. [Google Scholar]
  118. Wang, Y.; Yang, Y.; Zhao, X. Object detection using clustering algorithm adaptive searching regions in aerial images. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 651–664. [Google Scholar]
  119. Cao, Q.; Chen, Y.; Ma, C.; Yang, X. Open-Vocabulary Remote Sensing Image Semantic Segmentation. arXiv 2024, arXiv:2409.07683. [Google Scholar] [CrossRef]
  120. Li, K.; Liu, R.; Cao, X.; Bai, X.; Zhou, F.; Meng, D.; Wang, Z. SegEarth-OV: Towards Training-Free Open-Vocabulary Segmentation for Remote Sensing Images. arXiv 2024, arXiv:2410.01768. [Google Scholar] [CrossRef]
  121. Ye, C.; Zhuge, Y.; Zhang, P. Towards Open-Vocabulary Remote Sensing Image Semantic Segmentation. arXiv 2024, arXiv:2412.19492. [Google Scholar] [CrossRef]
Figure 1. Comparison between traditional closed-set detection, few-shot detection, and open-vocabulary detection. (a) UAV image object detection. (b) Traditional closed-set detection: All test categories have appeared in the training set. (c) Few-shot detection: Some of the test categories have only a few training samples. (d) Open-vocabulary detection: Some categories have never been seen in the training set, only their names are provided.
Figure 2. Illustration of how CLIP is used in open-vocabulary object detection. Region features extracted from the image are matched against category embeddings generated from textual prompts, allowing the detection vocabulary to be expanded to categories that were never annotated during training.
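To make the matching step in Figure 2 concrete, the following minimal sketch scores class-agnostic region crops against prompt embeddings with a pretrained CLIP model. The prompt wording, category list, and helper function are illustrative assumptions, not the pipeline of any specific surveyed method.

```python
# Minimal sketch of CLIP-style region-text matching for open-vocabulary detection.
# Assumes class-agnostic region proposals are already available (e.g., from an RPN).
import torch
import clip  # OpenAI CLIP package: https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Open-vocabulary categories supplied only as text at inference time (illustrative).
category_names = ["airplane", "wind turbine", "storage tank"]
prompts = [f"an aerial photo of a {name}" for name in category_names]
text_tokens = clip.tokenize(prompts).to(device)

def classify_regions(image: Image.Image, boxes):
    """Score each region crop against the text prompts; boxes are (x1, y1, x2, y2)."""
    crops = torch.stack([preprocess(image.crop(b)) for b in boxes]).to(device)
    with torch.no_grad():
        region_feats = model.encode_image(crops)      # (N, D) region embeddings
        text_feats = model.encode_text(text_tokens)   # (C, D) category embeddings
        region_feats = region_feats / region_feats.norm(dim=-1, keepdim=True)
        text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
        logits = 100.0 * region_feats @ text_feats.T  # scaled cosine similarity
        probs = logits.softmax(dim=-1)                # per-region category scores
    return probs  # (N, C): each row is a distribution over the open vocabulary
```

In practical OVOD detectors, region features typically come from the detector's own backbone aligned to CLIP's embedding space rather than from cropping and re-encoding every proposal, which would be too slow for onboard UAV deployment.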
Figure 3. Typical open-vocabulary object detection paradigms and their representative works in UAV imagery: end-to-end vision–language fusion methods [47,48], semi-supervised pseudo-labeling methods [49,50,51], decoupled recognition with MLLMs [52], and representation regularization methods [53].
Figure 4. Comparison of open-vocabulary object detection strategies. (a) Pseudo-labeling-based methods. (b) CLIP-driven integration methods.
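As a rough illustration of the pseudo-labeling route in Figure 4a (a generic sketch under assumed names and a chosen threshold, not the procedure of any particular surveyed method), confident CLIP predictions on unlabeled regions can be kept as pseudo ground truth and later mixed with base-class annotations for retraining.

```python
# Generic pseudo-labeling sketch for Figure 4a; the threshold value and all names
# below are illustrative assumptions, not taken from a specific surveyed method.
import torch

def mint_pseudo_labels(region_probs: torch.Tensor, boxes, novel_class_names,
                       score_thresh: float = 0.8):
    """Keep confident predictions on unlabeled regions as pseudo ground truth.

    region_probs: (N, C) per-region scores over novel categories (e.g., CLIP softmax).
    boxes: list of N (x1, y1, x2, y2) proposals from a class-agnostic proposer.
    """
    scores, labels = region_probs.max(dim=-1)  # best novel category per region
    pseudo_labels = []
    for box, score, label in zip(boxes, scores.tolist(), labels.tolist()):
        if score >= score_thresh:              # confidence filtering step
            pseudo_labels.append({"bbox": box,
                                  "category": novel_class_names[label],
                                  "score": score})
    return pseudo_labels

# The retained pseudo-labels are merged with base-class ground truth and the detector
# is retrained or finetuned, expanding its vocabulary without new human annotation.
```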
Figure 5. Interaction between human operators and drones.
Table 1. Comparison on DIOR and DOTA v1.0 datasets.

mAP-based evaluation:

| Model | DIOR Base mAP | DIOR Novel mAP | DIOR HM | DOTA v1.0 Base mAP | DOTA v1.0 Novel mAP | DOTA v1.0 HM |
|---|---|---|---|---|---|---|
| DescReg | 68.7 | 7.9 | 14.2 | 68.7 | 4.7 | 8.8 |
| CastDet | 51.3 | 24.3 | 33.0 | 60.6 | 36.0 | 45.1 |
| OVA-DETR | 79.6 | 26.1 | 39.3 | 75.5 | 23.7 | 36.1 |

AP50-based evaluation:

| Model | DIOR AP50 | DOTA v1.0 AP50 |
|---|---|---|
| LAE-DINO | 85.5 | – |
| OPEN-RSD | 77.7 | – |
| LLaMA-Unidetector | 51.38 | 50.22 |

– indicates that the result was not reported in the original paper.
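For readability, the HM column in Table 1 is the standard harmonic mean of base- and novel-class mAP,

$$\mathrm{HM} = \frac{2\,\mathrm{mAP}_{\mathrm{base}}\cdot\mathrm{mAP}_{\mathrm{novel}}}{\mathrm{mAP}_{\mathrm{base}} + \mathrm{mAP}_{\mathrm{novel}}},$$

which the reported numbers satisfy; for example, DescReg on DIOR gives 2 × 68.7 × 7.9 / (68.7 + 7.9) ≈ 14.2.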
Table 2. Statistics of relevant datasets. The upper section covers typical ground-view detection datasets, the middle section focuses on remote sensing aerial detection datasets, and the lower section is dedicated to datasets specifically designed for open-vocabulary aerial object detection.

| Domain | Dataset | Classes | Images | Instances | Annotation Method |
|---|---|---|---|---|---|
| General scene | COCO [61] | 80 | 328 K | 2.5 M | - |
| | LVIS [62] | 1200 | 164 K | 2.2 M | - |
| Remote Sensing | UCAS-AOD [65] | 2 | 2420 | 14,596 | HBB |
| | RSOD [66] | 4 | 3644 | 22,221 | OBB |
| | HRSC [67] | 19 | 1061 | 2976 | OBB |
| | NWPU-VHR [68] | 10 | 800 | 3651 | HBB |
| | LEVIR [69] | 3 | 21,952 | 11,028 | HBB |
| | DOTA v1.0 [70] | 15 | 2806 | 188,282 | OBB |
| | HRRSD [71] | 13 | 21,761 | 55,740 | HBB |
| | SIMD [72] | 15 | 5000 | 45,096 | HBB |
| | DIOR [73] | 20 | 23,463 | 190,288 | HBB |
| | DIOR-R [74] | 20 | 23,463 | 192,512 | OBB |
| | DOTA v2.0 [75] | 18 | 11,268 | 1,973,658 | OBB |
| | xView [76] | 60 | 1127 | >1 M | HBB |
| | Visdrone [77] | 10 | 29,040 | 740,419 | HBB |
| | FAIR1M [78] | 37 | 15,266 | >1 M | OBB |
| | GLH-Bridge [79] | 1 | 6000 | 59,737 | ALL |
| | SODA [80] | 9 | 31,798 | 1,008,346 | HBB |
| | RSVG [81] | - | 4239 | 7933 | HBB |
| | OPT-RSVG [82] | 14 | 25,452 | 48,952 | HBB |
| | DIOR-RSVG [83] | 20 | 17,402 | 38,320 | HBB |
| Open vocabulary | MI-OAD [84] | 100 | 163,023 | 2 M | HBB |
| | ORSD+ [51] | 200 | 474,058 | - | ALL |
| | LAE-1M [50] | 1600 | - | 1 M | HBB |