1. Introduction
Recent progress in unmanned aerial vehicle (UAV) technology and artificial intelligence has propelled object detection in UAV imagery to the forefront as a key enabling technology [1,2,3]. Its applications are expanding rapidly across diverse domains, from smart agriculture and military reconnaissance to urban surveillance. However, existing object detection models typically fail to achieve satisfactory accuracy when applied directly to UAV-captured images. Aerial photographs usually cover extensive geographic areas, so most objects appear at very small scales against intricate backgrounds. Detecting small objects in drone imagery presents three core challenges: minimal pixel coverage of the objects (frequently under 0.1%), highly varied and complex backgrounds (such as forest canopies or urban landscapes), and dense, multi-scale object distribution. This complexity of aerial images is clearly demonstrated in Figure 1. Consequently, aerial-image-based detection models are now subject to more stringent requirements regarding accuracy, robustness, and adaptability [4].
Object detection in aerial imagery has attracted considerable attention and research interest. Yan et al. [5] proposed ST-YOLO, a model based on YOLOv5s that was specifically improved for small object detection. However, its performance on the VisDrone dataset remains relatively low, with an mAP50 of 33.2%. This suggests that, although the introduced modules provide some benefit to detection, they may still be insufficient in capturing the multi-scale features and contextual information essential for robust aerial object detection. Zhu et al. [6] proposed TPH-YOLOv5, which improves detection performance by incorporating transformer prediction heads and attention mechanisms. The model achieves an mAP50 of 36.2% on the VisDrone dataset. Although this work notably enhances model accuracy and delivers strong performance, the incorporation of numerous Transformer encoder blocks in the feature extraction stage significantly inflates the model’s size and markedly slows down inference. Yue et al. [7] proposed a lightweight small object detection model named LE-YOLO, which integrates depthwise separable convolution with channel shuffling modules to enhance multi-level extraction of local details and channel-wise features. They also designed the LGS bottleneck and LGSCSP fusion modules to reduce computational complexity. However, this approach exhibits limited capacity in modeling global contextual information, making it difficult to effectively capture the semantic relationships between tiny objects and their extensive backgrounds in aerial images. Moreover, excessive model lightweighting compromises representational power, thereby constraining further improvements in detection accuracy.
Existing object detection models for UAV aerial imagery continue to face significant challenges. Conventional detectors are typically not tailored to the unique characteristics of small objects, leading to consistently suboptimal performance in such scenarios. Current efforts to address these limitations largely follow two paradigms: (1) pursuing higher detection accuracy at the cost of substantial computational complexity—often resulting in impractically slow inference even on high-end servers; and (2) prioritizing model lightweighting, which frequently sacrifices model capacity and representational power, thereby causing performance bottlenecks in challenging conditions such as complex backgrounds and densely clustered small objects. Overall, existing approaches lack a design strategy that effectively maximizes small object detection performance within a controllable and reasonable computational budget. This calls for moving beyond simplistic solutions such as naive module stacking or aggressive channel pruning. Instead, the focus must shift toward more efficient mechanisms for feature representation and utilization—specifically, through the adoption of superior network architectures and more intelligent contextual information fusion—to achieve substantially improved detection accuracy without exceeding acceptable computational costs. Thanks to advances in communication technology, images captured by UAVs can be transmitted in real time to a server for processing, with detection results sent back promptly. This makes models that achieve high accuracy while maintaining moderate computational complexity highly practical and valuable.
Motivated by these observations, this paper proposes a dual-backbone detection model based on wavelet-enhanced contextual information, referred to as WCDB-YOLO. It adopts the current state-of-the-art (SOTA) model YOLOv11s as the baseline, a model that has already demonstrated an excellent balance between detection accuracy and speed. The proposed model effectively improves small object detection performance through structural decoupling and targeted enhancement, achieving competitive overall performance that surpasses several current SOTA models, with only a moderate increase in computational cost. The main contributions of this paper are as follows:
(1) We propose a “target-context decoupled perception” paradigm, which leverages two structurally complementary backbone networks to separately process local object features and global background information. One backbone focuses on extracting fine-grained local object features, while the other innovatively incorporates a wavelet convolution module to efficiently model the global contextual semantics of complex scenes with minimal computational cost by constructing a large receptive field.
(2) We incorporate the Dilation-wise Residual (DWR) module into both the object-extraction backbone and the neck fusion network. By deploying convolutional branches with different dilation rates in parallel, the network can simultaneously capture local fine-grained features and global contextual information. This enables the model to establish multi-level representations—from pixel-level details to region-level semantics—within a single layer, providing crucial scale adaptability for small object detection.
(3) Building upon the original detection head, we incorporate a high-resolution feature map from the shallow layer P2/4 to enrich fine-grained details of small objects. This design significantly enhances the model’s ability to perceive and localize tiny objects in the image.
(4) Through structural decoupling and targeted enhancement, the model effectively improves the detection performance for small objects. Experiments on the VisDrone dataset show that our model achieves an 8.4% improvement in mAP50 over the baseline and outperforms current SOTA small object detection models. Additionally, generalization experiments on the VEDAI dataset further demonstrate the effectiveness of the proposed enhancements.
Following this introduction, the paper is organized as follows. Section 2 reviews both foundational and recent advances in related fields to establish the necessary context. Section 3 then details the proposed WCDB-YOLO algorithm, including its network architecture and key improvements. Section 4 provides a thorough evaluation of WCDB-YOLO’s performance through comprehensive experiments and comparative analysis. Finally, Section 5 wraps up the paper by highlighting the key contributions and outlining possible avenues for future work.
3. Materials and Methods
We adopt YOLOv11—the current SOTA model in the YOLO family—as our baseline. YOLOv11 preserves the canonical YOLO architecture, consisting of three core components: a backbone network, a neck module, and a detection head. Its key advancements lie in the introduction of two novel modules: C3K2 and C2PSA. The C3K2 module constitutes a significant refinement of the C2f block in YOLOv8. It is specifically engineered to enhance feature representation capacity, improve multi-scale perception, and optimize computational efficiency—without compromising detection accuracy. The C2PSA module is an advanced feature enhancement component that synergistically combines the Cross-Stage Partial (CSP) network structure with a Pyramid Spatial Attention (PSA) mechanism. This design substantially strengthens the model’s spatial awareness and contextual reasoning ability, particularly for challenging cases such as small-scale and occluded objects, while maintaining low computational overhead. Owing to these architectural innovations, YOLOv11 demonstrates strong suitability for object detection in UAV-captured imagery, where both accuracy and efficiency are critical. YOLOv11 is available in five scaled variants—namely, n, s, m, l, and x. To strike an optimal balance between detection precision and inference speed, we select the YOLOv11s variant as our baseline model.
3.1. WCDB-YOLO Network
In this work, we propose WCDB-YOLO, a novel small object detection model tailored for drone imagery, featuring three key architectural enhancements. First, we propose a “target-context decoupled perception” paradigm and design a wavelet-enhanced contextual dual-backbone network: one branch focuses on extracting fine-grained object-level features, while the other incorporates wavelet convolution to explicitly expand the receptive field and capture rich background contextual information. The fusion of these complementary feature streams enables the model to jointly leverage local details and global semantics, significantly improving detection accuracy for small objects in complex aerial scenes. Second, we integrate the DWR module into both the object-extraction backbone and the neck fusion network. By parallelizing convolutional branches with diverse dilation rates, the DWR module allows the network to simultaneously encode local textures and long-range contextual dependencies, thereby establishing multi-level representations—from pixel-level details to region-level semantics—within a single layer and endowing the model with strong scale adaptability. Third, we enhance the detection head by incorporating a high-resolution feature map from the shallow P2/4 layer, which preserves spatial fidelity and enriches fine-grained cues critical for tiny objects. Through optimized network architecture design and more intelligent context fusion, the model enhances its capacity to detect, localize, and distinguish small objects under challenging UAV imaging conditions, while maintaining computational efficiency suitable for practical applications. The overall architecture of the WCDB-YOLO model is illustrated in Figure 2.
3.2. WCDB Structure
Inspired by CBNet, we designed a dual-backbone architecture [26]. Unlike CBNet, where the two backbones share homogeneous functionality and jointly enhance target feature extraction, the dual-backbone network proposed in this paper achieves a functional differentiation at the architectural level: one backbone is dedicated to modeling background context, while the other focuses on extracting fine-grained representations of the targets themselves. This “target-background decoupling” design enables the model to more accurately separate foreground and background information in complex scenes, thereby significantly improving small object detection performance.
The dual-backbone architecture is illustrated in Figure 3. The dual-backbone fusion strategy adopts the Dense Higher-Level Composition (DHLC) approach, which has been thoroughly validated in CBNet and identified as the optimal connection scheme through systematic experimental comparisons. DHLC effectively facilitates feature reuse and cross-branch information interaction, providing a solid foundation for efficient collaboration within the dual-backbone architecture.
In the background-extraction backbone, we have designed wavelet transform convolutional (WTConv) layers that are capable of expanding the receptive field. These layers replace traditional convolution kernels with wavelet convolution kernels, capturing background information across different frequency domains in the image through a multi-scale decomposition mechanism. Compared to conventional convolutional layers, wavelet convolutional layers can significantly extend the receptive field without increasing computational complexity, while maintaining spatial resolution. This feature allows them to effectively extract widely distributed background patterns, such as large-scale texture features, continuous shadow distributions, and global illumination gradients—providing rich background semantic information. The structure of the wavelet convolution is illustrated in Figure 4.
We employ the Haar wavelet basis—a computationally lightweight yet highly effective choice—for constructing the wavelet convolution kernels [27]. For a given image, performing a one-level Haar wavelet transform along a single spatial dimension (either width or height) can be achieved by applying depth-wise convolution with the two kernels (1/√2)[1, 1] and (1/√2)[1, −1], followed by a standard downsampling operation with a factor of 2. To carry out the 2D Haar wavelet transform, this procedure is applied sequentially along both spatial dimensions. This operation can be equivalently implemented using a depth-wise convolution with a stride of 2, achieved by applying the following four filters:

f_LL = (1/2) [[1, 1], [1, 1]],  f_LH = (1/2) [[1, −1], [1, −1]],  f_HL = (1/2) [[1, 1], [−1, −1]],  f_HH = (1/2) [[1, −1], [−1, 1]]    (1)

Among them, f_LL acts as a low-pass filter, while f_LH, f_HL, and f_HH constitute a group of high-pass filters.
For each input channel X, WTConv performs the following operation:

[X_LL, X_LH, X_HL, X_HH] = WT(X) = Conv([f_LL, f_LH, f_HL, f_HH], X), with stride 2    (2)

It can be seen that the output is divided into four channels, where X_LL denotes the low-frequency subband, while X_LH, X_HL, and X_HH represent the high-frequency subbands along the horizontal, vertical, and diagonal orientations, respectively.
Subsequently, a learnable scaling operation is applied to these four components to dynamically adjust their importance weights.
Finally, the inverse wavelet transform (IWT) is performed, as shown in Equation (3). Because the Haar filter bank is orthonormal, the IWT can be implemented exactly as a transposed depth-wise convolution with a stride of 2 using the same four filters:

Y = IWT(X_LL, X_LH, X_HL, X_HH) = Conv-Transposed([f_LL, f_LH, f_HL, f_HH], [X_LL, X_LH, X_HL, X_HH])    (3)
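The decomposition and reconstruction described above can be sketched in a few lines of NumPy. This is an illustrative single-level, single-channel Haar transform written with explicit loops for clarity; the actual model implements the equivalent operation as strided depth-wise convolutions in the training framework.

```python
import numpy as np

# 2D Haar filter bank: one low-pass (LL) and three high-pass filters (LH, HL, HH).
F_LL = 0.5 * np.array([[1.0, 1.0], [1.0, 1.0]])
F_LH = 0.5 * np.array([[1.0, -1.0], [1.0, -1.0]])
F_HL = 0.5 * np.array([[1.0, 1.0], [-1.0, -1.0]])
F_HH = 0.5 * np.array([[1.0, -1.0], [-1.0, 1.0]])
FILTERS = [F_LL, F_LH, F_HL, F_HH]

def haar_wt(x):
    """One-level 2D Haar transform of an (H, W) array with even H and W.
    Equivalent to depth-wise convolution with the four filters at stride 2."""
    h, w = x.shape
    subbands = []
    for f in FILTERS:
        sub = np.zeros((h // 2, w // 2))
        for i in range(0, h, 2):
            for j in range(0, w, 2):
                sub[i // 2, j // 2] = np.sum(x[i:i + 2, j:j + 2] * f)
        subbands.append(sub)
    return subbands  # [X_LL, X_LH, X_HL, X_HH]

def haar_iwt(subbands):
    """Inverse transform: transposed convolution with the same filters, stride 2.
    Exact reconstruction holds because the Haar filter bank is orthonormal."""
    h, w = subbands[0].shape
    y = np.zeros((2 * h, 2 * w))
    for f, sub in zip(FILTERS, subbands):
        for i in range(h):
            for j in range(w):
                y[2 * i:2 * i + 2, 2 * j:2 * j + 2] += sub[i, j] * f
    return y
```

Round-tripping any even-sized array through `haar_wt` followed by `haar_iwt` recovers the input exactly, which is why the learnable per-subband scaling in WTConv is the only component that alters the signal.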
In the background branch, the wavelet convolution module is incorporated to guide the network toward extracting global contextual information. This branch operates in parallel with the conventional main detail branch. Through subsequent feature fusion, the model acquires the dual capability of “perceiving fine details” and “capturing the overall context,” thereby significantly enhancing scene understanding accuracy. As can be seen from the heatmaps in Figure 5, the object extraction backbone primarily focuses on the object region, while the background extraction backbone mainly concentrates on the background area.
3.3. DWR Module
In the object extraction backbone and neck networks, we introduce the DWR module, which deeply integrates the advantages of dilated convolution and residual connections [28]. This effectively addresses the challenges of multi-scale perception and feature degradation faced by traditional convolutional neural networks in drone scenarios.
The structure of the DWR module is illustrated in Figure 6. The outer layer employs a residual connection, which utilizes identity mapping to ensure stable gradient propagation in deep networks and prevent gradual attenuation of feature information during multi-layer transmission. The inner layer integrates multi-branch dilated convolutions, with each branch configured with different dilation rates (1, 3, 5) to form parallel multi-scale feature extraction pathways. The feature maps output by each branch are fused through concatenation, followed by channel adjustment and information integration via 1 × 1 convolution. Finally, the result is added to the outer layer input to complete the residual connection.
Dilated convolution expands the receptive field by introducing “holes” into the standard convolution kernel, without increasing parameters or requiring downsampling. For example, a 3 × 3 convolution kernel with a dilation rate of 2 has an effective receptive field equivalent to that of a standard 5 × 5 kernel. The DWR module deploys parallel convolutional branches with different dilation rates, enabling the network to simultaneously capture local fine-grained features (small dilation rates) and global contextual information (large dilation rates). This design allows the model to establish multi-level representations from pixel-level details to region-level semantics within a single layer, providing essential scale adaptability for small object detection.
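The receptive-field arithmetic above follows the standard formula k_eff = k + (k − 1)(d − 1) for a k × k kernel with dilation rate d. A short helper (illustrative only, not taken from the paper's code) confirms both the 5 × 5 equivalence and the fields covered by the three DWR branches:

```python
def effective_kernel(k: int, d: int) -> int:
    """Effective receptive field of a k x k convolution with dilation rate d."""
    return k + (k - 1) * (d - 1)

# A 3x3 kernel with dilation 2 covers the same extent as a standard 5x5 kernel.
# The DWR branches use dilation rates 1, 3, and 5, giving parallel pathways
# with effective fields of 3x3, 7x7, and 11x11 at identical parameter cost.
dwr_fields = [effective_kernel(3, d) for d in (1, 3, 5)]  # [3, 7, 11]
```

The parameter count of each branch is unchanged by dilation, which is what makes the parallel multi-scale design affordable.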
3.4. Small-Object Detection Head
YOLO-family models employ a three-level detection head—P3, P4, and P5—corresponding to feature maps with downsample ratios of 8×, 16×, and 32×, respectively. While this design has proven highly effective for general object detection tasks, it exhibits inherent limitations in the context of small object detection from drone aerial imagery [29].
The P3–P5 detection heads are primarily optimized for medium- and large-scale objects. Among them, the P5 level possesses the largest receptive field and richest semantic information but suffers from the lowest spatial resolution; as a result, fine-grained spatial details of small objects are nearly lost after repeated downsampling. Although the P3 level offers relatively higher resolution (8× downsampled), each pixel in its feature map still corresponds to a relatively large region in the original image. For objects that occupy only a few pixels, this representation remains overly coarse and is easily overlooked during training.
Fundamentally, this architecture lacks a dedicated detection head specifically designed to process high-resolution, fine-grained features—making it difficult for the network to accurately localize and recognize tiny objects. To address this limitation, we introduce an additional P2 detection head at the top of the FPN structure in YOLOv11, as illustrated in Figure 2. This new head is connected to the shallowest layer of the backbone network, which retains the highest spatial resolution (only 4× downsampled). The resulting P2 feature map is twice the size of the P3 map and preserves significantly richer low-level details—such as edges, corners, and textures—that are critical for distinguishing minute objects from background clutter. By leveraging these high-resolution features, the network can perceive the complete structural cues of tiny objects rather than fragmented, ambiguous pixel clusters, thereby substantially improving localization accuracy.
Notably, the newly introduced P2 detection head synergizes effectively with the aforementioned DWR module. The DWR module enhances feature discriminability through multi-scale context awareness, while the P2 head provides a dedicated, high-resolution detection pathway for these enriched features. Their integration further amplifies the performance gains in small object detection.
4. Experiments and Results
4.1. Implementation Details
The experimental setup of this study, including both software and hardware specifications, is detailed in Table 1, while the training hyperparameters—such as the learning rate—are provided in Table 2. Each experiment was independently repeated three times, and the reported results are the average of the three runs. The maximum standard deviation across all experimental results is 0.25%, indicating the good stability of the proposed model.
It should be noted that the batch size was set to 2, as this represents the maximum stable training capacity allowed by our current hardware limitations. Since aerial images typically contain numerous small object instances, even a small batch size provides rich sample diversity and dense gradient information per image, thereby promoting stable model convergence. Experimental results show that this setup is sufficient for effective learning.
4.2. Datasets
Our experiments were conducted on three publicly available drone imagery datasets: VisDrone [30], VEDAI [31], and UAVDT [32]. The majority of quantitative evaluations were performed on the VisDrone dataset. To assess the model’s generalization capability across different aerial scenarios, we further evaluated it on the VEDAI dataset. Additionally, we selected representative images from the UAVDT dataset for qualitative visualization experiments, providing intuitive insights into the model’s detection performance under diverse real-world conditions.
The VisDrone dataset was collected by the AISKYEYE team at Tianjin University. It comprises 6471 training images and 548 validation images, capturing diverse scenes and viewpoints from aerial perspectives. The dataset includes annotations for 10 object categories: pedestrian, people, bicycle, car, van, truck, tricycle, awning-tricycle, bus, and motor.
VEDAI is a benchmark dataset specifically designed for vehicle detection in aerial imagery and is widely used to evaluate the performance of automatic object recognition algorithms in unconstrained environments. The vehicles in this dataset are not only small in scale but also exhibit significant variability, including diverse orientations, complex lighting and shadow variations, specular reflections, and partial or severe occlusions. The VEDAI dataset contains both RGB and infrared image modalities; for consistency with the other datasets, we used RGB images for training and validation, and selected several infrared images for inference. The dataset includes annotations for nine object categories: boat, car, camping car, plane, pickup, tractor, truck, van, and other.
The UAVDT dataset, captured by drones over urban environments, encompasses diverse weather conditions and varying flight altitudes, presenting a challenging benchmark for object detection in computer vision. It contains a total of 25,137 training images and 15,598 test images, annotated across three object classes: car, truck, and bus.
4.3. Evaluation Metrics
We adopt four key metrics for performance assessment: precision (P), recall (R), mAP calculated at a fixed IoU threshold of 0.5 (mAP50), and the mean AP integrated over multiple IoU thresholds in the interval [0.5, 0.95] (mAP50-95) [33]. Their formal expressions are provided below:

P = TP / (TP + FP)    (4)

R = TP / (TP + FN)    (5)

AP = ∫₀¹ P(r) dr    (6)

mAP = (1/C) Σ_{i=1}^{C} AP_i    (7)

The symbols are defined as follows: TP, FP, and FN correspond to true positives, false positives, and false negatives, respectively; P(r) gives the precision when recall equals r; AP quantifies the average precision for one category; and C denotes the overall number of categories in the dataset.
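These definitions can be sketched directly in Python. The snippet below is a minimal illustration of the standard formulas (the paper's own evaluation uses the usual detection tooling, with AP computed over matched predictions at each IoU threshold):

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple:
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)

def average_precision(recalls, precisions):
    """Area under the precision-recall curve via trapezoidal integration.
    Inputs must be paired samples sorted by increasing recall."""
    ap = 0.0
    for i in range(1, len(recalls)):
        ap += (recalls[i] - recalls[i - 1]) * (precisions[i] + precisions[i - 1]) / 2
    return ap

def mean_ap(per_class_ap):
    """mAP: mean of per-category AP values over all C classes."""
    return sum(per_class_ap) / len(per_class_ap)
```

mAP50 applies this pipeline at a single IoU threshold of 0.5, while mAP50-95 additionally averages the result over thresholds from 0.5 to 0.95 in steps of 0.05.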
4.4. Enhanced Performance Verification Experiments
On the VisDrone dataset, the baseline model YOLOv11s and the enhanced WCDB-YOLO model were trained independently. The performance metrics of both models after training are summarized in Table 3 and Figure 7. As indicated by the results, WCDB-YOLO surpasses the baseline across all evaluation metrics. Specifically, it shows gains of 7.8% in precision (P), 7.5% in recall (R), 8.4% in mAP50, and 5.8% in mAP50-95, confirming a substantial overall enhancement.
As shown in Figure 8, our improved model consistently outperforms the baseline across all ten object categories on the PR curves. The most significant improvements are achieved for the “pedestrian” and “people” classes, both of which represent small objects, with AP increases of 13.6 and 12.7 percentage points, respectively.
Figure 9 compares the confusion matrices of the proposed model and the baseline model. The values along the diagonal indicate improved recognition accuracy for all ten object categories.
4.5. Ablation Experiments
WCDB-YOLO introduces three key enhancements upon the YOLOv11s baseline: (1) a dual-backbone architecture, termed WCDB, to strengthen contextual feature extraction; (2) the integration of the DWR convolutional module to enlarge the receptive field while preserving fine-grained semantic information; and (3) the addition of a P2 detection layer to improve sensitivity to small-scale objects. To systematically evaluate the contribution of each component to the overall performance, we designed and conducted ablation studies, with the results presented in Table 4 and Figure 10.
The ablation study results clearly demonstrate that each introduced module positively contributes to the overall model performance. When applied individually, WCDB yields the most significant improvement in mAP50, achieving a gain of 4.0%, while the DWR module delivers the largest boost in mAP50-95, with an increase of 3.4%. Among all pairwise combinations, the integration of WCDB and the P2 detection layer consistently achieves the highest gains across all four metrics—P, R, mAP50, and mAP50-95. Moreover, when all three modules are jointly employed, these metrics reach their peak values, substantially outperforming both the baseline and all other partial configurations. These findings confirm that WCDB, DWR, and the P2 detection layer are individually effective and work together synergistically. This successful collaboration underscores the strength of our multi-module design in boosting detection performance.
4.6. Comparative Experiments
To comprehensively and objectively evaluate the performance of the proposed method, we conduct comparative experiments against two representative categories of SOTA models: (1) Strong general-purpose detectors: These are classical object detection models that have demonstrated outstanding performance on generic detection benchmarks and are widely regarded as robust industrial baselines. The comparison results are presented in Table 5 and Figure 11. (2) Specialized models for small object detection: These methods are typically built upon classical SOTA architectures but incorporate specific enhancements tailored to improve small object detection performance, representing the current frontier in this domain. Their comparative results are shown in Table 6 and Figure 12.
Among classical object detection models, RT-DETR-L achieves the highest mAP50 of 45.0%, which is slightly lower than that of our proposed WCDB-YOLO. However, its computational cost—measured in GFLOPs—is nearly twice that of our model, suggesting that its performance gain is partly attributable to significantly higher computational overhead. To further verify that our improvements do not merely stem from increased parameter count or computational complexity, we compare our method with YOLOv11m. Despite having both more parameters and higher computational demands than our model, YOLOv11m attains only 40.3% mAP50, substantially underperforming our approach.
Among specialized small-object detection models, Drone-YOLO-N has the smallest parameter footprint but achieves only 38.1% mAP50, indicating limited detection accuracy. In contrast, EdgeYOLO-S delivers the highest accuracy in this category; however, its parameter count reaches 40.5 M—more than double that of our model.
In summary, WCDB-YOLO achieves competitive, if not superior, detection accuracy while maintaining a relatively low model complexity, clearly demonstrating the effectiveness and efficiency of our proposed architectural enhancements in striking an optimal balance between performance and computational cost.
4.7. Generalization Experiments
To validate the effectiveness of WCDB-YOLO, we conducted comparative experiments against the baseline model on the VEDAI dataset. The experimental results are shown in Table 7 and Figure 13. The results demonstrate that the proposed model consistently outperforms the baseline across all evaluation metrics: it achieves a 2.2% improvement in mAP50 and a more substantial gain of 4.2% in the stricter mAP50-95 metric. Moreover, as shown in Figure 13—which depicts the evolution of these metrics over training epochs—the proposed model not only converges more rapidly but also attains superior final performance, with consistently larger improvements observed across all indicators throughout the training process. These results strongly corroborate the effectiveness of the proposed method in enhancing detection accuracy, robustness, and overall generalization capability.
4.8. Visualization
To intuitively demonstrate the detection performance of the improved model, we selected several aerial images from the VisDrone dataset for testing and compared the results with those of the baseline model. The visual comparison is presented in Figure 14.
In Figure 14a, YOLOv11s fails to detect a person in a seated posture; similarly, in Figure 14b, it misses a white vehicle. In contrast, the proposed WCDB-YOLO model successfully and accurately detects both objects in these two scenarios, demonstrating stronger detection robustness and generalization capability—particularly excelling in handling challenging cases such as people with varying postures or vehicles in low-contrast environments.
To comprehensively evaluate the generalization capability of the proposed model, we additionally selected several representative image samples from the UAVDT dataset and conducted visual detection experiments. As shown in Figure 15, YOLOv11s fails to detect multiple small-scale vehicles in Figure 15a,c, whereas the proposed WCDB-YOLO model successfully achieves accurate recognition and localization of these small objects. These results convincingly demonstrate WCDB-YOLO’s superior performance in detecting small objects under complex aerial-view scenarios, further confirming its effectiveness and robustness in enhancing small-object detection capabilities.
Although the VisDrone dataset used for training does not contain complex scenarios such as adverse weather or low-light conditions, our model nonetheless demonstrates strong generalization ability and robust detection performance when deployed in such challenging environments. Specifically, evaluations on aerial images captured under foggy conditions and at night in complex urban road settings reveal that the proposed model achieves a significant improvement in detection effectiveness over the baseline, as illustrated in Figure 16.
Aerial drone imagery is widely employed for object detection in infrared scenarios. The proposed WCDB-YOLO also exhibits excellent performance on IR data. As illustrated in Figure 17, when applied to two infrared aerial images, YOLOv11s produces suboptimal detection results, while WCDB-YOLO successfully identifies multiple cars.
4.9. Efficiency Analysis
Object detection in drone-captured aerial imagery imposes strict real-time requirements. However, the onboard hardware of current drones still struggles to support the real-time execution of high-accuracy detection models. Thanks to advances in communication technologies, drones can now establish real-time connections with remote servers—transmitting captured images for processing and receiving detection results in return—thereby enabling the effective deployment of high-precision models. Consequently, employing a medium-scale detection model that balances accuracy and computational efficiency has emerged as a practical solution to meet real-time demands. WCDB-YOLO is specifically designed with this objective in mind. On a server equipped with an NVIDIA RTX 3090 GPU, it achieves an average per-image processing time of 1.8 ms for preprocessing, 28.4 ms for inference, and 2.8 ms for postprocessing, resulting in an overall throughput of 30.3 FPS. When deployed on higher-end server hardware, its processing speed is expected to increase further. These results demonstrate that WCDB-YOLO effectively satisfies the real-time requirements of object detection in drone-based aerial imagery.
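The reported throughput follows directly from the per-stage latencies quoted above. The helper below is a simple arithmetic check (illustrative, not part of any released code):

```python
def throughput_fps(pre_ms: float, infer_ms: float, post_ms: float) -> float:
    """Frames per second from per-image stage latencies in milliseconds."""
    return 1000.0 / (pre_ms + infer_ms + post_ms)

# Stage latencies reported for WCDB-YOLO on an RTX 3090:
# 1.8 ms preprocessing + 28.4 ms inference + 2.8 ms postprocessing = 33.0 ms,
# i.e. roughly 30.3 frames per second.
fps = throughput_fps(1.8, 28.4, 2.8)
```

Note that this is per-image latency on a single stream; batching or pipelining the three stages on a server could raise the effective throughput further.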
5. Conclusions
To address the challenges inherent in small object detection within drone-captured aerial imagery—such as extremely limited object scale, dense spatial distribution, and highly complex backgrounds—this paper proposes a novel object detection model termed WCDB-YOLO. Using YOLOv11s as the baseline, the model introduces a “target-context decoupled perception” paradigm by constructing a dual-backbone structure. One backbone focuses on extracting fine local features, while the other expands the receptive field through wavelet convolution to efficiently capture global contextual information. Through a multi-level feature fusion mechanism, the model achieves the synergistic utilization of local details and global semantics. Additionally, the DWR module is introduced, which utilizes parallel convolutional branches with different dilation rates to simultaneously capture fine local features and global contextual information. By first strengthening the representation of the object itself and then injecting scale-adaptive contextual information, the discriminative ability for small objects is effectively enhanced. Furthermore, high-resolution P2/4 feature maps are fused in the detection head to further improve the localization and recognition accuracy for tiny objects. Experimental results demonstrate that WCDB-YOLO outperforms current mainstream methods, validating the effectiveness and robustness of the proposed approach in complex scenarios.
Although the proposed method achieves strong detection performance, its dual-backbone design leads to a higher parameter count and computational burden, pointing to a clear direction for future improvement. Future work will focus on the following directions: (1) developing a lightweight dual-backbone architecture or integrating neural network pruning and quantization techniques to reduce computational overhead and enhance real-time inference capabilities; and (2) improving the model’s cross-domain generalization and robustness under diverse and adverse weather conditions, thereby enabling reliable object detection in increasingly complex and varied aerial scenarios. Overall, WCDB-YOLO presents a novel and effective approach to small object detection in challenging drone-captured environments, offering a solid foundation for future research on efficient, robust, and deployable visual perception systems.