1. Introduction
Recent progress in unmanned aerial vehicle (UAV) technology and artificial intelligence has propelled object detection in UAV imagery to the forefront as a key enabling technology [1,2,3]. Its applications are expanding rapidly across diverse domains, from smart agriculture and military reconnaissance to urban surveillance. However, existing object detection models typically fail to achieve satisfactory accuracy when applied directly to UAV-captured images. Aerial photographs usually cover extensive geographic areas, so most objects appear at very small scales against intricate backgrounds. Detecting small objects in drone imagery presents three core challenges: minimal pixel coverage of the objects (frequently under 0.1%), highly varied and complex backgrounds (such as forest canopies or urban landscapes), and dense, multi-scale object distribution. This complexity of aerial images is clearly demonstrated in Figure 1. Consequently, aerial-image-based detection models are now subject to more stringent requirements regarding accuracy, robustness, and adaptability [4].
Object detection in aerial imagery has attracted considerable attention and research interest. Yan et al. [5] proposed ST-YOLO, a model based on YOLOv5s that was specifically improved for small object detection. However, its performance on the VisDrone dataset remains relatively low, with an mAP50 of 33.2%. This suggests that, although the introduced modules provide some benefit to detection, they may still be insufficient in capturing the multi-scale features and contextual information essential for robust aerial object detection. Zhu et al. [6] proposed TPH-YOLOv5, which improves detection performance by incorporating transformer prediction heads and attention mechanisms. The model achieves an mAP50 of 36.2% on the VisDrone dataset. Although this work notably enhances model accuracy and delivers strong performance, the incorporation of numerous Transformer encoder blocks in the feature extraction stage significantly inflates the model’s size and markedly slows down inference. Yue et al. [7] proposed a lightweight small object detection model named LE-YOLO, which integrates depthwise separable convolution with channel shuffling modules to enhance multi-level extraction of local details and channel-wise features. They also designed the LGS bottleneck and LGSCSP fusion modules to reduce computational complexity. However, this approach exhibits limited capacity in modeling global contextual information, making it difficult to effectively capture the semantic relationships between tiny objects and their extensive backgrounds in aerial images. Moreover, excessive model lightweighting compromises representational power, thereby constraining further improvements in detection accuracy.
Existing object detection models for UAV aerial imagery continue to face significant challenges. Conventional detectors are typically not tailored to the unique characteristics of small objects, leading to consistently suboptimal performance in such scenarios. Current efforts to address these limitations largely follow two paradigms: (1) pursuing higher detection accuracy at the cost of substantial computational complexity—often resulting in impractically slow inference even on high-end servers; and (2) prioritizing model lightweighting, which frequently sacrifices model capacity and representational power, thereby causing performance bottlenecks in challenging conditions such as complex backgrounds and densely clustered small objects. Overall, existing approaches lack a design strategy that effectively maximizes small object detection performance within a controllable and reasonable computational budget. This calls for moving beyond simplistic solutions such as naive module stacking or aggressive channel pruning. Instead, the focus must shift toward more efficient mechanisms for feature representation and utilization—specifically, through the adoption of superior network architectures and more intelligent contextual information fusion—to achieve substantially improved detection accuracy without exceeding acceptable computational costs. Thanks to advances in communication technology, images captured by UAVs can be transmitted in real time to a server for processing, with detection results sent back promptly. This makes models that achieve high accuracy while maintaining moderate computational complexity highly practical and valuable.
Motivated by these observations, this paper proposes a dual-backbone detection model based on wavelet-enhanced contextual information, referred to as WCDB-YOLO. It adopts the current state-of-the-art (SOTA) model YOLOv11s as the baseline, a model that has already demonstrated an excellent balance between detection accuracy and speed. The proposed model effectively improves small object detection performance through structural decoupling and targeted enhancement, achieving competitive overall performance that surpasses several current SOTA models, with only a moderate increase in computational cost. The main contributions of this paper are as follows:
(1) We propose a “target-context decoupled perception” paradigm, which leverages two structurally complementary backbone networks to separately process local object features and global background information. One backbone focuses on extracting fine-grained local object features, while the other innovatively incorporates a wavelet convolution module to efficiently model the global contextual semantics of complex scenes with minimal computational cost by constructing a large receptive field.
(2) We incorporate the Dilation-wise Residual (DWR) module into both the object-extraction backbone and the neck fusion network. By deploying convolutional branches with different dilation rates in parallel, the network can simultaneously capture local fine-grained features and global contextual information. This enables the model to establish multi-level representations—from pixel-level details to region-level semantics—within a single layer, providing crucial scale adaptability for small object detection.
(3) Building upon the original detection head, we incorporate a high-resolution feature map from the shallow layer P2/4 to enrich fine-grained details of small objects. This design significantly enhances the model’s ability to perceive and localize tiny objects in the image.
(4) Through structural decoupling and targeted enhancement, the model effectively improves the detection performance for small objects. Experiments on the VisDrone dataset show that our model achieves an 8.4% improvement in mAP50 over the baseline and outperforms current SOTA small object detection models. Additionally, generalization experiments on the VEDAI dataset further demonstrate the effectiveness of the proposed enhancements.
Following this introduction, the paper is organized as follows. Section 2 reviews both foundational and recent advances in related fields to establish the necessary context. Section 3 then details the proposed WCDB-YOLO algorithm, including its network architecture and key improvements. Section 4 provides a thorough evaluation of WCDB-YOLO’s performance through comprehensive experiments and comparative analysis. Finally, Section 5 wraps up the paper by highlighting the key contributions and outlining possible avenues for future work.
3. Materials and Methods
We adopt YOLOv11—the current SOTA model in the YOLO family—as our baseline. YOLOv11 preserves the canonical YOLO architecture, consisting of three core components: a backbone network, a neck module, and a detection head. Its key advancements lie in the introduction of two novel modules: C3K2 and C2PSA. The C3K2 module constitutes a significant refinement of the C2f block in YOLOv8. It is specifically engineered to enhance feature representation capacity, improve multi-scale perception, and optimize computational efficiency—without compromising detection accuracy. The C2PSA module is an advanced feature enhancement component that synergistically combines the Cross-Stage Partial (CSP) network structure with a Pyramid Spatial Attention (PSA) mechanism. This design substantially strengthens the model’s spatial awareness and contextual reasoning ability, particularly for challenging cases such as small-scale and occluded objects, while maintaining low computational overhead. Owing to these architectural innovations, YOLOv11 demonstrates strong suitability for object detection in UAV-captured imagery, where both accuracy and efficiency are critical. YOLOv11 is available in five scaled variants—namely, n, s, m, l, and x. To strike an optimal balance between detection precision and inference speed, we select the YOLOv11s variant as our baseline model.
3.1. WCDB-YOLO Network
In this work, we propose WCDB-YOLO, a novel small object detection model tailored for drone imagery, featuring three key architectural enhancements. First, we propose a “target-context decoupled perception” paradigm and design a wavelet-enhanced contextual dual-backbone network: one branch focuses on extracting fine-grained object-level features, while the other incorporates wavelet convolution to explicitly expand the receptive field and capture rich background contextual information. The fusion of these complementary feature streams enables the model to jointly leverage local details and global semantics, significantly improving detection accuracy for small objects in complex aerial scenes. Second, we integrate the DWR module into both the object-extraction backbone and the neck fusion network. By parallelizing convolutional branches with diverse dilation rates, the DWR module allows the network to simultaneously encode local textures and long-range contextual dependencies, thereby establishing multi-level representations—from pixel-level details to region-level semantics—within a single layer and endowing the model with strong scale adaptability. Third, we enhance the detection head by incorporating a high-resolution feature map from the shallow P2/4 layer, which preserves spatial fidelity and enriches fine-grained cues critical for tiny objects. Through optimized network architecture design and more intelligent context fusion, the model enhances its capacity to detect, localize, and distinguish small objects under challenging UAV imaging conditions, while maintaining computational efficiency suitable for practical applications. The overall architecture of the WCDB-YOLO model is illustrated in Figure 2.
3.2. WCDB Structure
Inspired by CBNet, we designed a dual-backbone architecture [26]. Unlike CBNet, where the two backbones share homogeneous functionality and jointly enhance target feature extraction, the dual-backbone network proposed in this paper achieves a functional differentiation at the architectural level: one backbone is dedicated to modeling background context, while the other focuses on extracting fine-grained representations of the targets themselves. This “target-background decoupling” design enables the model to more accurately separate foreground and background information in complex scenes, thereby significantly improving small object detection performance.
The dual-backbone architecture is illustrated in Figure 3. The dual-backbone fusion strategy adopts the Dense Higher-Level Composition (DHLC) approach, which has been thoroughly validated in CBNet and identified as the optimal connection scheme through systematic experimental comparisons. DHLC effectively facilitates feature reuse and cross-branch information interaction, providing a solid foundation for efficient collaboration within the dual-backbone architecture.
In the background-extraction backbone, we have designed wavelet transform convolutional (WTConv) layers that are capable of expanding the receptive field. These layers replace traditional convolution kernels with wavelet convolution kernels, capturing background information across different frequency domains in the image through a multi-scale decomposition mechanism. Compared to conventional convolutional layers, wavelet convolutional layers can significantly extend the receptive field without increasing computational complexity, while maintaining spatial resolution. This feature allows them to effectively extract widely distributed background patterns, such as large-scale texture features, continuous shadow distributions, and global illumination gradients—providing rich background semantic information. The structure of the wavelet convolution is illustrated in Figure 4.
We employ the Haar wavelet basis—a computationally lightweight yet highly effective choice—for constructing the wavelet convolution kernels [27]. For a given image, performing a one-level Haar wavelet transform along a single spatial dimension (either width or height) can be achieved by applying depth-wise convolution with the two kernels (1/√2)[1, 1] and (1/√2)[1, −1], followed by a standard downsampling operation with a factor of 2. To carry out the 2D Haar wavelet transform, this procedure is applied sequentially along both spatial dimensions. This operation can be equivalently implemented using a depth-wise convolution with a stride of 2, achieved by applying the following four filters:

f_LL = (1/2) [[1, 1], [1, 1]],  f_LH = (1/2) [[1, −1], [1, −1]],  f_HL = (1/2) [[1, 1], [−1, −1]],  f_HH = (1/2) [[1, −1], [−1, 1]]    (1)

Among them, f_LL acts as a low-pass filter, while f_LH, f_HL, and f_HH constitute a group of high-pass filters.
For each input channel X, WTConv performs the following operation:

[X_LL, X_LH, X_HL, X_HH] = WT(X) = Conv([f_LL, f_LH, f_HL, f_HH], X), with stride 2    (2)

It can be seen that the output is divided into four channels, where X_LL denotes the low-frequency subband, while X_LH, X_HL, and X_HH represent the high-frequency subbands along the horizontal, vertical, and diagonal orientations, respectively.
Subsequently, a learnable scaling operation is applied to these four components to dynamically adjust their importance weights.
Finally, the inverse wavelet transform (IWT) is performed, as shown in Equation (3). Because the Haar filter bank is orthonormal, the IWT can be implemented exactly as a transposed depth-wise convolution with a stride of 2 using the same four filters:

Y = IWT(X_LL, X_LH, X_HL, X_HH) = Conv-Transposed([f_LL, f_LH, f_HL, f_HH], [X_LL, X_LH, X_HL, X_HH])    (3)
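The decomposition and reconstruction described above can be sketched in a few lines of NumPy. This is an illustrative single-level, single-channel Haar transform written with explicit loops for clarity; the actual model implements the equivalent operation as strided depth-wise convolutions in the training framework.

```python
import numpy as np

# 2D Haar filter bank: one low-pass (LL) and three high-pass filters (LH, HL, HH).
F_LL = 0.5 * np.array([[1.0, 1.0], [1.0, 1.0]])
F_LH = 0.5 * np.array([[1.0, -1.0], [1.0, -1.0]])
F_HL = 0.5 * np.array([[1.0, 1.0], [-1.0, -1.0]])
F_HH = 0.5 * np.array([[1.0, -1.0], [-1.0, 1.0]])
FILTERS = [F_LL, F_LH, F_HL, F_HH]

def haar_wt(x):
    """One-level 2D Haar transform of an (H, W) array with even H and W.
    Equivalent to depth-wise convolution with the four filters at stride 2."""
    h, w = x.shape
    subbands = []
    for f in FILTERS:
        sub = np.zeros((h // 2, w // 2))
        for i in range(0, h, 2):
            for j in range(0, w, 2):
                sub[i // 2, j // 2] = np.sum(x[i:i + 2, j:j + 2] * f)
        subbands.append(sub)
    return subbands  # [X_LL, X_LH, X_HL, X_HH]

def haar_iwt(subbands):
    """Inverse transform: transposed convolution with the same filters, stride 2.
    Exact reconstruction holds because the Haar filter bank is orthonormal."""
    h, w = subbands[0].shape
    y = np.zeros((2 * h, 2 * w))
    for f, sub in zip(FILTERS, subbands):
        for i in range(h):
            for j in range(w):
                y[2 * i:2 * i + 2, 2 * j:2 * j + 2] += sub[i, j] * f
    return y
```

Round-tripping any even-sized array through `haar_wt` followed by `haar_iwt` recovers the input exactly, which is why the learnable per-subband scaling in WTConv is the only component that alters the signal.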
In the background branch, the wavelet convolution module is incorporated to guide the network toward extracting global contextual information. This branch operates in parallel with the conventional main detail branch. Through subsequent feature fusion, the model acquires the dual capability of “perceiving fine details” and “capturing the overall context,” thereby significantly enhancing scene understanding accuracy. As can be seen from the heatmaps in Figure 5, the object extraction backbone primarily focuses on the object region, while the background extraction backbone mainly concentrates on the background area.
3.3. DWR Module
In the object extraction backbone and neck networks, we introduce the DWR module, which deeply integrates the advantages of dilated convolution and residual connections [28]. This effectively addresses the challenges of multi-scale perception and feature degradation faced by traditional convolutional neural networks in drone scenarios.
The structure of the DWR module is illustrated in Figure 6. The outer layer employs a residual connection, which utilizes identity mapping to ensure stable gradient propagation in deep networks and prevent gradual attenuation of feature information during multi-layer transmission. The inner layer integrates multi-branch dilated convolutions, with each branch configured with different dilation rates (1, 3, 5) to form parallel multi-scale feature extraction pathways. The feature maps output by each branch are fused through concatenation, followed by channel adjustment and information integration via 1 × 1 convolution. Finally, the result is added to the outer layer input to complete the residual connection.
Dilated convolution expands the receptive field by introducing “holes” into the standard convolution kernel, without increasing parameters or requiring downsampling. For example, a 3 × 3 convolution kernel with a dilation rate of 2 has an effective receptive field equivalent to that of a standard 5 × 5 kernel. The DWR module deploys parallel convolutional branches with different dilation rates, enabling the network to simultaneously capture local fine-grained features (small dilation rates) and global contextual information (large dilation rates). This design allows the model to establish multi-level representations from pixel-level details to region-level semantics within a single layer, providing essential scale adaptability for small object detection.
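The receptive-field arithmetic above follows the standard formula k_eff = k + (k − 1)(d − 1) for a k × k kernel with dilation rate d. A short helper (illustrative only, not taken from the paper's code) confirms both the 5 × 5 equivalence and the fields covered by the three DWR branches:

```python
def effective_kernel(k: int, d: int) -> int:
    """Effective receptive field of a k x k convolution with dilation rate d."""
    return k + (k - 1) * (d - 1)

# A 3x3 kernel with dilation 2 covers the same extent as a standard 5x5 kernel.
# The DWR branches use dilation rates 1, 3, and 5, giving parallel pathways
# with effective fields of 3x3, 7x7, and 11x11 at identical parameter cost.
dwr_fields = [effective_kernel(3, d) for d in (1, 3, 5)]  # [3, 7, 11]
```

The parameter count of each branch is unchanged by dilation, which is what makes the parallel multi-scale design affordable.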
3.4. Small-Object Detection Head
YOLO-family models employ a three-level detection head—P3, P4, and P5—corresponding to feature maps with downsample ratios of 8×, 16×, and 32×, respectively. While this design has proven highly effective for general object detection tasks, it exhibits inherent limitations in the context of small object detection from drone aerial imagery [29].
The P3–P5 detection heads are primarily optimized for medium- and large-scale objects. Among them, the P5 level possesses the largest receptive field and richest semantic information but suffers from the lowest spatial resolution; as a result, fine-grained spatial details of small objects are nearly lost after repeated downsampling. Although the P3 level offers relatively higher resolution (8× downsampled), each pixel in its feature map still corresponds to a relatively large region in the original image. For objects that occupy only a few pixels, this representation remains overly coarse and is easily overlooked during training.
Fundamentally, this architecture lacks a dedicated detection head specifically designed to process high-resolution, fine-grained features—making it difficult for the network to accurately localize and recognize tiny objects. To address this limitation, we introduce an additional P2 detection head at the top of the FPN structure in YOLOv11, as illustrated in Figure 2. This new head is connected to the shallowest layer of the backbone network, which retains the highest spatial resolution (only 4× downsampled). The resulting P2 feature map is twice the size of the P3 map and preserves significantly richer low-level details—such as edges, corners, and textures—that are critical for distinguishing minute objects from background clutter. By leveraging these high-resolution features, the network can perceive the complete structural cues of tiny objects rather than fragmented, ambiguous pixel clusters, thereby substantially improving localization accuracy.
Notably, the newly introduced P2 detection head synergizes effectively with the aforementioned DWR module. The DWR module enhances feature discriminability through multi-scale context awareness, while the P2 head provides a dedicated, high-resolution detection pathway for these enriched features. Their integration further amplifies the performance gains in small object detection.
4. Experiments and Results
4.1. Implementation Details
The experimental setup of this study, including both software and hardware specifications, is detailed in Table 1, while the training hyperparameters—such as the learning rate—are provided in Table 2. Each experiment was independently repeated three times, and the reported results are the average of the three runs. The maximum standard deviation across all experimental results is 0.25%, indicating the good stability of the proposed model.
It should be noted that the batch size was set to 2, as this represents the maximum stable training capacity allowed by our current hardware limitations. Since aerial images typically contain numerous small object instances, even a small batch size provides rich sample diversity and dense gradient information per image, thereby promoting stable model convergence. Experimental results show that this setup is sufficient for effective learning.
4.2. Datasets
Our experiments were conducted on three publicly available drone imagery datasets: VisDrone [30], VEDAI [31], and UAVDT [32]. The majority of quantitative evaluations were performed on the VisDrone dataset. To assess the model’s generalization capability across different aerial scenarios, we further evaluated it on the VEDAI dataset. Additionally, we selected representative images from the UAVDT dataset for qualitative visualization experiments, providing intuitive insights into the model’s detection performance under diverse real-world conditions.
The VisDrone dataset was collected by the AISKYEYE team at Tianjin University. It comprises 6471 training images and 548 validation images, capturing diverse scenes and viewpoints from aerial perspectives. The dataset includes annotations for 10 object categories: pedestrian, people, bicycle, car, van, truck, tricycle, awning-tricycle, bus, and motor.
VEDAI is a benchmark dataset specifically designed for vehicle detection in aerial imagery and is widely used to evaluate the performance of automatic object recognition algorithms in unconstrained environments. The vehicles in this dataset are not only small in scale but also exhibit significant variability, including diverse orientations, complex lighting and shadow variations, specular reflections, and partial or severe occlusions. The VEDAI dataset contains both RGB and infrared image modalities; for consistency with the other datasets, we used RGB images for training and validation, and selected several infrared images for inference. The dataset includes annotations for nine object categories: boat, car, camping car, plane, pickup, tractor, truck, van, and other.
The UAVDT dataset, captured by drones over urban environments, encompasses diverse weather conditions and varying flight altitudes, presenting a challenging benchmark for object detection in computer vision. It contains a total of 25,137 training images and 15,598 test images, annotated across three object classes: car, truck, and bus.
4.3. Evaluation Metrics
We adopt four key metrics for performance assessment: precision (P), recall (R), mAP calculated at a fixed IoU threshold of 0.5 (mAP50), and the mean AP integrated over multiple IoU thresholds in the interval [0.5, 0.95] (mAP50-95) [33]. Their formal expressions are provided below:

P = TP / (TP + FP)    (4)

R = TP / (TP + FN)    (5)

AP = ∫₀¹ P(r) dr    (6)

mAP = (1/C) Σ_{i=1}^{C} AP_i    (7)

The symbols are defined as follows: TP, FP, and FN correspond to true positives, false positives, and false negatives, respectively; P(r) gives the precision when recall equals r; AP quantifies the average precision for one category; and C denotes the overall number of categories in the dataset.
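These definitions can be sketched directly in Python. The snippet below is a minimal illustration of the standard formulas (the paper's own evaluation uses the usual detection tooling, with AP computed over matched predictions at each IoU threshold):

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple:
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)

def average_precision(recalls, precisions):
    """Area under the precision-recall curve via trapezoidal integration.
    Inputs must be paired samples sorted by increasing recall."""
    ap = 0.0
    for i in range(1, len(recalls)):
        ap += (recalls[i] - recalls[i - 1]) * (precisions[i] + precisions[i - 1]) / 2
    return ap

def mean_ap(per_class_ap):
    """mAP: mean of per-category AP values over all C classes."""
    return sum(per_class_ap) / len(per_class_ap)
```

mAP50 applies this pipeline at a single IoU threshold of 0.5, while mAP50-95 additionally averages the result over thresholds from 0.5 to 0.95 in steps of 0.05.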
4.4. Enhanced Performance Verification Experiments
On the VisDrone dataset, the baseline model YOLOv11s and the enhanced WCDB-YOLO model were trained independently. The performance metrics of both models after training are summarized in Table 3 and Figure 7. As indicated by the results, WCDB-YOLO surpasses the baseline across all evaluation metrics. Specifically, it shows gains of 7.8% in precision (P), 7.5% in recall (R), 8.4% in mAP50, and 5.8% in mAP50-95, confirming a substantial overall enhancement.
As shown in Figure 8, our improved model consistently outperforms the baseline across all ten object categories on the PR curves. The most significant improvements are achieved for the “pedestrian” and “people” classes, both of which represent small objects, with AP increases of 13.6 and 12.7 percentage points, respectively.
Figure 9 compares the confusion matrices of the proposed model and the baseline model. The values along the diagonal indicate improved recognition accuracy for all ten object categories.
4.5. Ablation Experiments
WCDB-YOLO introduces three key enhancements upon the YOLOv11s baseline: (1) a dual-backbone architecture, termed WCDB, to strengthen contextual feature extraction; (2) the integration of the DWR convolutional module to enlarge the receptive field while preserving fine-grained semantic information; and (3) the addition of a P2 detection layer to improve sensitivity to small-scale objects. To systematically evaluate the contribution of each component to the overall performance, we designed and conducted ablation studies, with the results presented in Table 4 and Figure 10.
The ablation study results clearly demonstrate that each introduced module positively contributes to the overall model performance. When applied individually, WCDB yields the most significant improvement in mAP50, achieving a gain of 4.0%, while the DWR module delivers the largest boost in mAP50-95, with an increase of 3.4%. Among all pairwise combinations, the integration of WCDB and the P2 detection layer consistently achieves the highest gains across all four metrics—P, R, mAP50, and mAP50-95. Moreover, when all three modules are jointly employed, these metrics reach their peak values, substantially outperforming both the baseline and all other partial configurations. These findings confirm that WCDB, DWR, and the P2 detection layer are individually effective and work together synergistically. This successful collaboration underscores the strength of our multi-module design in boosting detection performance.
4.6. Comparative Experiments
To comprehensively and objectively evaluate the performance of the proposed method, we conduct comparative experiments against two representative categories of SOTA models: (1) Strong general-purpose detectors: These are classical object detection models that have demonstrated outstanding performance on generic detection benchmarks and are widely regarded as robust industrial baselines. The comparison results are presented in Table 5 and Figure 11. (2) Specialized models for small object detection: These methods are typically built upon classical SOTA architectures but incorporate specific enhancements tailored to improve small object detection performance, representing the current frontier in this domain. Their comparative results are shown in Table 6 and Figure 12.
Among classical object detection models, RT-DETR-L achieves the highest mAP50 of 45.0%, which is slightly lower than that of our proposed WCDB-YOLO. However, its computational cost—measured in GFLOPs—is nearly twice that of our model, suggesting that its performance gain is partly attributable to significantly higher computational overhead. To further verify that our improvements do not merely stem from increased parameter count or computational complexity, we compare our method with YOLOv11m. Despite having both more parameters and higher computational demands than our model, YOLOv11m attains only 40.3% mAP50, substantially underperforming our approach.
Among specialized small-object detection models, Drone-YOLO-N has the smallest parameter footprint but achieves only 38.1% mAP50, indicating limited detection accuracy. In contrast, EdgeYOLO-S delivers the highest accuracy in this category; however, its parameter count reaches 40.5 M—more than double that of our model.
In summary, WCDB-YOLO achieves competitive, if not superior, detection accuracy while maintaining a relatively low model complexity, clearly demonstrating the effectiveness and efficiency of our proposed architectural enhancements in striking an optimal balance between performance and computational cost.
4.7. Generalization Experiments
To validate the effectiveness of WCDB-YOLO, we conducted comparative experiments against the baseline model on the VEDAI dataset. The experimental results are shown in Table 7 and Figure 13. The results demonstrate that the proposed model consistently outperforms the baseline across all evaluation metrics: it achieves a 2.2% improvement in mAP50 and a more substantial gain of 4.2% in the stricter mAP50-95 metric. Moreover, as shown in Figure 13—which depicts the evolution of these metrics over training epochs—the proposed model not only converges more rapidly but also attains superior final performance, with consistently larger improvements observed across all indicators throughout the training process. These results strongly corroborate the effectiveness of the proposed method in enhancing detection accuracy, robustness, and overall generalization capability.
4.8. Visualization
To intuitively demonstrate the detection performance of the improved model, we selected several aerial images from the VisDrone dataset for testing and compared the results with those of the baseline model. The visual comparison is presented in Figure 14.
In Figure 14a, YOLOv11s fails to detect a person in a seated posture; similarly, in Figure 14b, it misses a white vehicle. In contrast, the proposed WCDB-YOLO model successfully and accurately detects both objects in these two scenarios, demonstrating stronger detection robustness and generalization capability—particularly excelling in handling challenging cases such as people with varying postures or vehicles in low-contrast environments.
To comprehensively evaluate the generalization capability of the proposed model, we additionally selected several representative image samples from the UAVDT dataset and conducted visual detection experiments. As shown in Figure 15, YOLOv11s fails to detect multiple small-scale vehicles in Figure 15a,c, whereas the proposed WCDB-YOLO model successfully achieves accurate recognition and localization of these small objects. These results convincingly demonstrate WCDB-YOLO’s superior performance in detecting small objects under complex aerial-view scenarios, further confirming its effectiveness and robustness in enhancing small-object detection capabilities.
Although the VisDrone dataset used for training does not contain complex scenarios such as adverse weather or low-light conditions, our model nonetheless demonstrates strong generalization ability and robust detection performance when deployed in such challenging environments. Specifically, evaluations on aerial images captured under foggy conditions and at night in complex urban road settings reveal that the proposed model achieves a significant improvement in detection effectiveness over the baseline, as illustrated in Figure 16.
Aerial drone imagery is widely employed for object detection in infrared scenarios. The proposed WCDB-YOLO also exhibits excellent performance on IR data. As illustrated in Figure 17, when applied to two infrared aerial images, YOLOv11s produces suboptimal detection results, while WCDB-YOLO successfully identifies multiple cars.
4.9. Efficiency Analysis
Object detection in drone-captured aerial imagery imposes strict real-time requirements. However, the onboard hardware of current drones still struggles to support the real-time execution of high-accuracy detection models. Thanks to advances in communication technologies, drones can now establish real-time connections with remote servers—transmitting captured images for processing and receiving detection results in return—thereby enabling the effective deployment of high-precision models. Consequently, employing a medium-scale detection model that balances accuracy and computational efficiency has emerged as a practical solution to meet real-time demands. WCDB-YOLO is specifically designed with this objective in mind. On a server equipped with an NVIDIA RTX 3090 GPU, it achieves an average per-image processing time of 1.8 ms for preprocessing, 28.4 ms for inference, and 2.8 ms for postprocessing, resulting in an overall throughput of 30.3 FPS. When deployed on higher-end server hardware, its processing speed is expected to increase further. These results demonstrate that WCDB-YOLO effectively satisfies the real-time requirements of object detection in drone-based aerial imagery.
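The reported throughput follows directly from the per-stage latencies quoted above. The helper below is a simple arithmetic check (illustrative, not part of any released code):

```python
def throughput_fps(pre_ms: float, infer_ms: float, post_ms: float) -> float:
    """Frames per second from per-image stage latencies in milliseconds."""
    return 1000.0 / (pre_ms + infer_ms + post_ms)

# Stage latencies reported for WCDB-YOLO on an RTX 3090:
# 1.8 ms preprocessing + 28.4 ms inference + 2.8 ms postprocessing = 33.0 ms,
# i.e. roughly 30.3 frames per second.
fps = throughput_fps(1.8, 28.4, 2.8)
```

Note that this is per-image latency on a single stream; batching or pipelining the three stages on a server could raise the effective throughput further.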
5. Conclusions
To address the challenges inherent in small object detection within drone-captured aerial imagery—such as extremely limited object scale, dense spatial distribution, and highly complex backgrounds—this paper proposes a novel object detection model termed WCDB-YOLO. Using YOLOv11s as the baseline, the model introduces a “target-context decoupled perception” paradigm by constructing a dual-backbone structure. One backbone focuses on extracting fine local features, while the other expands the receptive field through wavelet convolution to efficiently capture global contextual information. Through a multi-level feature fusion mechanism, the model achieves the synergistic utilization of local details and global semantics. Additionally, the DWR module is introduced, which utilizes parallel convolutional branches with different dilation rates to simultaneously capture fine local features and global contextual information. By first strengthening the representation of the object itself and then injecting scale-adaptive contextual information, the discriminative ability for small objects is effectively enhanced. Furthermore, high-resolution P2/4 feature maps are fused in the detection head to further improve the localization and recognition accuracy for tiny objects. Experimental results demonstrate that WCDB-YOLO outperforms current mainstream methods, validating the effectiveness and robustness of the proposed approach in complex scenarios.
Although the proposed method achieves strong detection performance, its dual-backbone design leads to a higher parameter count and computational burden, pointing to a clear direction for future improvement. Future work will focus on the following directions: (1) developing a lightweight dual-backbone architecture or integrating neural network pruning and quantization techniques to reduce computational overhead and enhance real-time inference capabilities; and (2) improving the model’s cross-domain generalization and robustness under diverse and adverse weather conditions, thereby enabling reliable object detection in increasingly complex and varied aerial scenarios. Overall, WCDB-YOLO presents a novel and effective approach to small object detection in challenging drone-captured environments, offering a solid foundation for future research on efficient, robust, and deployable visual perception systems.