Article

BDRNet: Background-Aware Dynamic-Scale Routing Network for UAV Remote Sensing Object Detection

College of Field Engineering, Army Engineering University of PLA, Nanjing 210007, China
*
Author to whom correspondence should be addressed.
Remote Sens. 2026, 18(12), 1987; https://doi.org/10.3390/rs18121987 (registering DOI)
Submission received: 18 April 2026 / Revised: 3 June 2026 / Accepted: 11 June 2026 / Published: 15 June 2026

Highlights

What are the main findings?
  • UAV aerial imagery suffers from three persistent bottlenecks—complex background clutter, drastic scale variation, and gradient instability for extremely small objects—that existing lightweight detectors fail to address simultaneously.
  • Standard attention mechanisms enhance foreground indiscriminately, static FPN fusion cannot adapt to content, and IoU-based losses are overly sensitive to small object offsets, collectively limiting detection accuracy in UAV scenarios.
What are the implications of the main findings?
  • The consistent performance gains across two diverse UAV benchmarks validate that jointly optimizing feature suppression, multi-scale fusion, and regression objectives is a promising direction for advancing small object detection in complex remote sensing scenarios.
  • The proposed approach has broad implications for remote sensing applications requiring real-time aerial perception, including urban traffic monitoring, infrastructure inspection, disaster assessment, and precision agriculture.

Abstract

Object detection in UAV remote sensing imagery remains challenging due to severe scale variation, dense object distributions, complex background clutter, and localization ambiguity caused by extremely small objects. To address these issues, this paper proposes BDRNet, a lightweight background-aware dynamic-scale routing network for UAV remote sensing object detection. First, a background-aware feature enhancement (BAFE) module is introduced into the backbone to enhance feature representation through horizontal and vertical contextual modeling, improving target-related responses in complex aerial scenes. Second, a dynamic-scale routing pyramid (DSRP) is designed to retain the high-resolution $P_2$ branch and adaptively integrate multi-scale features through spatially dynamic routing, alleviating the loss of fine-grained information and improving the representation of small and scale-varied objects. Third, a scale- and geometry-aware normalized Wasserstein distance (SGNW) loss is proposed by modeling bounding boxes as two-dimensional Gaussian distributions. By incorporating aspect-ratio-guided geometric weighting and scale-aware dynamic fusion, SGNW improves regression stability for small objects while preserving geometric constraints for medium and large targets. Extensive experiments on the VisDrone2019 and UAVDT datasets demonstrate that BDRNet consistently improves detection accuracy over the YOLOv10s detector while maintaining a comparable model size and computational cost. Compared with several mainstream lightweight detectors, BDRNet achieves a favorable accuracy–efficiency trade-off, demonstrating its effectiveness for UAV remote sensing object detection in complex aerial scenarios.

1. Introduction

Unmanned Aerial Vehicle (UAV) remote sensing has been widely applied in environmental monitoring, traffic management, disaster assessment, and precision agriculture owing to its high mobility, low deployment cost, and flexible imaging capability [1,2,3,4,5]. As a fundamental task in UAV visual perception, object detection aims to accurately localize and recognize targets of interest, such as vehicles, pedestrians, and traffic-related objects, in aerial images. Although UAV imagery provides a broad observation range and rich spatial context, high-altitude imaging also introduces severe scale variation, background clutter, and localization ambiguity, so accurate and efficient detection remains challenging.
Recent advances in deep learning have substantially promoted the development of object detection methods. Two-stage detectors, such as Faster R-CNN [6] and Mask R-CNN [7], usually achieve competitive accuracy by first generating region proposals and then refining object classification and localization. However, their proposal-based pipelines introduce considerable computational overhead, limiting their applicability to real-time UAV scenarios. In contrast, single-stage detectors, including SSD [8], RetinaNet [9], and YOLO [10], directly perform dense prediction on feature maps and provide a more favorable accuracy–efficiency trade-off. Benefiting from their compact architectures and high inference speed, YOLO-style detectors have become widely adopted for resource-constrained UAV platforms.
Despite these advances, directly applying existing detectors to UAV aerial imagery remains difficult, as illustrated in Figure 1. First, UAV images are commonly captured from high altitudes with a wide field of view, causing objects to occupy only a small number of pixels and exhibit drastic scale changes. After repeated downsampling in the backbone, fine-grained cues such as object boundaries, textures, and local structures are easily weakened, which reduces the discriminability of small object features. Second, aerial scenes often contain complex background interference, including road textures, building edges, shadows, and vegetation. These background patterns may produce responses similar to real targets, leading to false activations and weakening foreground-focused representation. Third, objects in UAV scenarios are frequently densely distributed and appear across multiple scales. Conventional feature pyramid structures usually rely on fixed fusion rules, which are insufficient to adaptively select useful cross-scale information and may introduce redundant background noise during feature aggregation. Moreover, for extremely small objects, slight coordinate deviations can cause large fluctuations in IoU-based regression signals, making bounding box optimization unstable under severe scale variation and diverse aspect ratios.
The above observations indicate that UAV object detection requires not only an efficient detector, but also targeted designs for background suppression, adaptive multi-scale feature interaction, and stable small object localization. To this end, this paper proposes BDRNet, a background-aware dynamic-scale routing network for UAV remote sensing object detection. Built upon YOLOv10 [11], BDRNet introduces coordinated improvements from three complementary aspects: background-aware feature extraction, dynamic-scale-adaptive feature fusion, and scale-geometry-aware bounding box regression. Specifically, BAFE is introduced into the backbone to suppress background-induced responses and enhance weak target features in complex aerial scenes. DSRP is designed in the neck to retain high-resolution details and adaptively route multi-scale features for small and scale-varied objects. Finally, SGNW loss reformulates bounding box regression from a scale- and geometry-aware perspective, improving localization robustness for extremely small objects and targets with diverse aspect ratios.
The main contributions of this paper are summarized as follows:
  • We propose the BAFE module to address complex background interference in UAV aerial imagery. By modeling directional contextual information and generating a background suppression weight map, BAFE strengthens weak target responses and improves backbone feature representation.
  • We design the DSRP module to improve scale-adaptive feature aggregation. By retaining the high-resolution $P_2$ branch and replacing static feature fusion with dynamic routing, DSRP enhances fine-detail preservation and cross-scale information interaction for small and multi-scale objects.
  • We introduce SGNW loss to improve localization robustness. By incorporating an aspect-ratio-guided geometric constraint and a scale-aware dynamic fusion strategy, SGNW alleviates regression instability for extremely small objects while maintaining reliable localization for multi-scale targets.

2. Related Work

2.1. UAV Remote Sensing Object Detection

Object detection in UAV remote sensing faces multiple challenges, such as small object sizes, drastic scale variations, dense distributions, and complex background interference. Consequently, general-purpose detectors often struggle to achieve robust performance in these scenarios. Recently, general-purpose detectors like Faster R-CNN [6], SSD [8], RetinaNet [9], and YOLO [10] have been widely adapted to UAV remote sensing. While two-stage methods typically achieve higher accuracy, they suffer from limited inference speed and deployment efficiency. Conversely, single-stage methods offer distinct advantages in processing speed and real-world deployment, attracting widespread attention.
The introduction of public datasets such as DOTA [12], UAVDT [13], and VisDrone2019 [14] has established a comprehensive evaluation framework for UAV object detection. To address challenges posed by densely distributed small objects and large-format aerial images, researchers have proposed tailored improvements. Yang et al. [15] proposed ClusDet, which enhances small object detection in large-format images by clustering region proposals for local high-resolution detection. Liu et al. [16] developed a multi-branch parallel feature pyramid network to bolster multi-scale representations of small objects. Chen et al. [17] introduced a high-resolution feature pyramid network to mitigate the loss of fine-grained details in UAV scenarios. Zhu et al. [18] proposed CGK-FSOD, which exploits stable diffusion and CLIP to generate high-quality training samples for few-shot object detection from optical remote sensing imagery. Zhang et al. [19] proposed UniconDet, which formulates HBB, OBB, and instance segmentation as a unified arbitrary-shaped contour detection problem and introduces Fourier contour parametric modeling to represent diverse object geometries in the frequency domain.
Despite remarkable progress in detection accuracy, existing methods still struggle with robust object representation against complex backgrounds and efficient cross-scale feature interaction [15].

2.2. Attention Mechanism

Attention mechanisms emulate human visual cognition by adaptively reweighting spatial and channel dimensions during feature extraction, enhancing target responses while suppressing background interference. Hu et al. [20] proposed SENet to improve feature representation by explicitly modeling inter-channel dependencies. Woo et al. [21] introduced CBAM, combining channel and spatial attention to reweight crucial features. Wang et al. [22] presented the Non-local Network, enhancing context-awareness by modeling global pixel-wise relationships. Recently, Transformer-based self-attention mechanisms have further strengthened global dependency modeling, providing new paradigms for complex scene object detection [23].
For object detection in UAV remote sensing images, researchers often embed attention mechanisms into the backbone, neck, or detection head to address challenges like small object sizes and complex, confusing backgrounds. Xu et al. [24] proposed FEA-Swin, integrating foreground enhancement attention into the Swin Transformer to improve dense object detection. Gong et al. [25] designed a framework integrating the Swin Transformer and attention mechanisms to strengthen small object feature extraction. Li et al. [5] presented LAST-Net, combining local adaptive spatial transformations with attention to enhance multi-object detection in thermal infrared aerial images. Additionally, several lightweight methods [26,27,28] have incorporated CBAM, SE, or multi-scale attention modules to balance feature enhancement with real-time processing demands.
Overall, while attention mechanisms alleviate background interference, they still lack fine-grained discriminative capacity for weak and small objects and struggle with synergistic cross-layer feature enhancement.

2.3. Feature Pyramid Networks

Multi-scale feature fusion is vital for addressing drastic object-scale variations. Lin et al. [29] proposed the feature pyramid network (FPN) to fuse multi-level features via a top-down pathway and lateral connections, boosting multi-scale representation. Building upon FPN, Liu et al. [30] introduced PANet to shorten feature propagation paths by enhancing bottom-up aggregation. Ghiasi et al. [31] developed NAS-FPN, using neural architecture search to automatically optimize the pyramid structure. Tan et al. [32] proposed EfficientDet with a bi-directional feature pyramid network (BiFPN), further advancing fusion efficiency.
To adapt FPN-like architectures to object detection in UAV imagery, researchers have modified them extensively. Liu et al. [16] proposed a multi-branch parallel network to enhance the semantic expression of small objects via parallel pathways. Chen et al. [17] presented the high-resolution feature pyramid network (HRFPN) to preserve high-resolution representations and reinforce cross-layer interaction. Sun et al. [33] introduced a decoupled feature pyramid learning method to separate the feature learning of objects at different scales. Moreover, several lightweight methods [34,35,36] have enhanced cross-scale feature propagation by modifying PANet or BiFPN, or by designing custom pyramids.
More recently, Bai et al. [37] proposed RFHA-YOLO, which combines dynamic receptive-field modeling with adaptive hybrid attention to enhance small object feature representation in remote sensing images. Despite these advances, cross-scale semantic interaction efficiency and small object feature enhancement in complex backgrounds still require improvement.

2.4. Loss Function Optimization

Loss functions directly influence the training stability and final performance of detection models. In UAV scenarios, where small objects are prevalent and localization requirements are strict, the choice of training objective is especially important. Lin et al. [9] proposed focal loss to mitigate foreground–background class imbalance by reducing the weight of easy-to-classify examples. For bounding box regression, Rezatofighi et al. [38] introduced GIoU loss to address the vanishing gradient problem in standard IoU for non-overlapping boxes. Subsequently, Zheng et al. proposed DIoU loss [39] and CIoU loss [40], incorporating center distance and aspect ratio consistency to improve regression precision. Furthermore, VarifocalNet [41] introduced varifocal loss to align classification scores with localization quality, while TOOD [42] alleviated task misalignment through Task-Aligned Learning.
Adapting to the characteristics of UAV remote sensing object detection, studies have implemented targeted loss function modifications. For instance, Ma et al. [43] adopted CIoU to improve small object localization precision in UAV images. Sun et al. [44] compared multiple regression losses (e.g., DIoU, CIoU, EIoU) and boosted detection efficacy through customized refinements. Recent studies [45,46] have also utilized weighted IoU, dynamic loss assignment, or quality-aware strategies to mitigate sample imbalance and improve small object detection.
Overall, while loss function optimizations have improved regression quality, achieving an optimal balance among hard example mining, localization precision, and training stability requires further investigation.

3. Materials and Methods

3.1. Overview

BDRNet is constructed based on YOLOv10, and its overall architecture is shown in Figure 2. In the backbone, BAFE is inserted to refine intermediate features by modeling directional contextual responses and suppressing background-dominated activations. DSRP replaces the original feature aggregation path by incorporating the high-resolution $P_2$ branch and learning spatially adaptive routing weights for scale-aware feature fusion. The detection head remains consistent with YOLOv10, while the bounding box regression objective is reformulated by SGNW, which augments NWD [47] with an aspect-ratio-guided anisotropic metric and combines it with CIoU through a scale-aware weighting strategy. Through this design, BDRNet preserves the efficiency of the baseline detector while improving feature discrimination, multi-scale representation, and localization robustness.

3.2. BAFE

In UAV remote sensing images, complex background regions frequently contain abundant texture and structural details that visually resemble the targets, such as road textures, building edges, shadows, and dense vegetation. These high-response backgrounds significantly interfere with the network’s ability to discriminate foreground targets. The original C2f module focuses on local feature extraction and cross-layer information aggregation but lacks an explicit suppression mechanism for complex background interference. To address this, the BAFE module is proposed to attenuate background responses and fortify the effective representation of target regions. Its structure is illustrated in Figure 2. BAFE shares with Strip R-CNN [48] the use of horizontal and vertical directional cues in aerial images, but their objectives are different. Strip R-CNN employs large strip convolutions to learn long-range representations for elongated objects, whereas BAFE uses directional contextual aggregation to estimate background-dominated responses and generate a background-aware modulation map for clutter suppression.
As the core component of BAFE, the Background-Guided Suppression Module (BGSM) explicitly models background contextual information and highlights target features via background response suppression. Its structure is shown in Figure 3. Specifically, let the input feature to the BGSM be denoted as $F \in \mathbb{R}^{C \times H \times W}$, where $F$ represents the output feature of the second convolutional layer in the bottleneck. To simultaneously model background distribution characteristics in the horizontal and vertical directions, we extract average-pooled and max-pooled contextual information along both axes. First, the horizontal contextual description is defined as
$$F_h^{\mathrm{avg}} = \mathrm{AvgPool}_h(F), \quad F_h^{\mathrm{max}} = \mathrm{MaxPool}_h(F). \tag{1}$$
Similarly, the vertical contextual description can be formulated as
$$F_v^{\mathrm{avg}} = \mathrm{AvgPool}_v(F), \quad F_v^{\mathrm{max}} = \mathrm{MaxPool}_v(F), \tag{2}$$
where $F_h^{\mathrm{avg}}, F_h^{\mathrm{max}} \in \mathbb{R}^{C \times 1 \times W}$ and $F_v^{\mathrm{avg}}, F_v^{\mathrm{max}} \in \mathbb{R}^{C \times H \times 1}$. $\mathrm{AvgPool}_h(\cdot)$ and $\mathrm{MaxPool}_h(\cdot)$ denote average and max pooling operations along the horizontal direction, while $\mathrm{AvgPool}_v(\cdot)$ and $\mathrm{MaxPool}_v(\cdot)$ represent the corresponding operations along the vertical direction.
Subsequently, to improve the stability of the directional contextual representations, the average and maximum statistics along the same direction are fused to obtain direction-aware features:
$$F_h = F_h^{\mathrm{avg}} + F_h^{\mathrm{max}}, \quad F_v = F_v^{\mathrm{avg}} + F_v^{\mathrm{max}}. \tag{3}$$
Next, lightweight directional convolutions are employed to map the horizontal and vertical contexts, yielding background responses in both directions:
$$A_h = \psi_h(F_h), \quad A_v = \psi_v(F_v), \tag{4}$$
where $\psi_h(\cdot)$ and $\psi_v(\cdot)$ denote the $1 \times k$ and $k \times 1$ convolutional mapping functions for the horizontal and vertical directions, respectively. $k$ is set to 7, which provides a sufficient receptive field to capture the continuous structural patterns of background bands while keeping the parameter overhead negligible.
Furthermore, the responses from both directions are broadcast to the original spatial dimensions, fused, and normalized using a Sigmoid function to generate the background suppression weight map
$$M_b = \sigma\left(B_h(A_h) + B_v(A_v)\right), \tag{5}$$
where $\sigma(\cdot)$ is the Sigmoid activation function, $B_h(\cdot)$ and $B_v(\cdot)$ represent the broadcast operations, and $M_b \in [0, 1]^{C \times H \times W}$ is the background suppression map. Intuitively, background regions such as roads and building rows produce uniformly high responses along their dominant axis, whereas foreground targets exhibit localized, non-uniform activations. Consequently, $M_b$ naturally assigns high suppression weights to background-dominated positions and low weights to target-occupied regions, without requiring any explicit foreground annotation.
Because complex background regions typically exhibit higher redundant responses, this suppression map is utilized to recalibrate the input features, thereby attenuating background interference and preserving target details. The output of the BGSM is defined as
$$\tilde{F} = F \odot (1 - M_b), \tag{6}$$
where $\odot$ denotes element-wise multiplication. The term $(1 - M_b)$ acts as a foreground gate: positions with strong background responses are down-weighted, while target regions with weak background responses retain their activations. As indicated by Equation (6), a strong background response at a specific position increases its suppression weight, thereby diminishing feature activation at that location. Conversely, effective features in target regions are preserved due to lower responses in the suppression map. Finally, the BGSM output feature is added to the bottleneck’s shortcut branch via a residual connection to produce the final output.
Unlike traditional joint channel–spatial attention modules, BGSM does not simultaneously perform generalized enhancement across channel and spatial dimensions. Instead, it focuses exclusively on complex background suppression by constructing a background suppression map through horizontal and vertical contextual statistics, effectively constraining the responses of complex background regions in UAV remote sensing images.
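The BGSM computation described above (directional pooling, directional mapping, broadcast, Sigmoid, and gating) can be illustrated with a minimal single-channel sketch in plain Python. It is not the authors' implementation: the function names, the toy input, and the hand-set kernel are hypothetical, the learned $1 \times k$ and $k \times 1$ convolutions are replaced by fixed placeholder weights, and the residual connection to the shortcut branch is omitted.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def conv1d_same(signal, kernel):
    """Zero-padded 1-D correlation that keeps the input length (odd kernel)."""
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, weight in enumerate(kernel):
            idx = i + j - half
            if 0 <= idx < len(signal):
                acc += weight * signal[idx]
        out.append(acc)
    return out

def bgsm_single_channel(feat, kernel_h, kernel_v):
    """feat: H x W nested list (one channel). Returns (gated feature, M_b)."""
    height, width = len(feat), len(feat[0])
    # Horizontal context (length W): average + max over the rows of each column.
    ctx_h = []
    for j in range(width):
        column = [feat[i][j] for i in range(height)]
        ctx_h.append(sum(column) / height + max(column))
    # Vertical context (length H): average + max over the columns of each row.
    ctx_v = [sum(row) / width + max(row) for row in feat]
    # Directional mappings; the kernels stand in for learned weights (assumption).
    resp_h = conv1d_same(ctx_h, kernel_h)
    resp_v = conv1d_same(ctx_v, kernel_v)
    # Broadcast both responses to H x W, add them, and squash to (0, 1).
    m_b = [[sigmoid(resp_h[j] + resp_v[i]) for j in range(width)]
           for i in range(height)]
    # Foreground gate: element-wise F * (1 - M_b).
    gated = [[feat[i][j] * (1.0 - m_b[i][j]) for j in range(width)]
             for i in range(height)]
    return gated, m_b

# Toy input: one uniformly bright row (a background band) and one isolated pixel.
feat = [[0.0] * 8 for _ in range(8)]
feat[3] = [1.0] * 8
feat[6][2] = 1.0
identity_k7 = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]  # hypothetical kernel, k = 7
gated, m_b = bgsm_single_channel(feat, identity_k7, identity_k7)
```

In this toy input the isolated pixel retains a larger fraction of its activation than the pixels on the bright row, which is the qualitative behavior the suppression map is meant to produce. With learned kernels the actual responses would differ.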

3.3. DSRP

Feature pyramid structures are commonly used to exchange information across different resolution levels. However, conventional pyramid fusion relies on predefined aggregation paths, where shallow fine-grained details may be weakened after repeated downsampling and cross-scale propagation. To improve the representation of small and densely distributed UAV targets, DSRP is designed to introduce a high-resolution branch and content-adaptive feature routing into the neck, as shown in Figure 2.
Specifically, the multi-scale backbone features are denoted as $P_2$, $P_3$, $P_4$, and $P_5$, where $P_2$ preserves the highest spatial resolution and $P_5$ contains the strongest semantic information. In contrast to the original neck structure, DSRP retains the $P_2$ branch to preserve edge, texture, and local structural details that are essential for small object localization. In addition, a dynamic routing module (DRM) is introduced to replace static feature fusion. Unlike fusion strategies with globally shared scalar weights, the DRM generates spatially adaptive routing maps, allowing each fusion node to select scale-specific information according to local image content.
For the $l$-th fusion node, its candidate input feature set is defined as
$$\mathcal{X}^l = \{X_1^l, X_2^l, \ldots, X_{N_l}^l\}, \tag{7}$$
where $N_l$ denotes the number of candidate input branches. These candidates can be obtained from adjacent-scale features after upsampling or downsampling, as well as same-level shortcut branches. Since different branches may have inconsistent spatial resolutions and channel dimensions, each candidate feature is first aligned into a unified feature space
$$\tilde{X}_i^l = \mathcal{A}(X_i^l), \quad i = 1, 2, \ldots, N_l, \tag{8}$$
where $\mathcal{A}(\cdot)$ denotes the alignment operator, including resolution adjustment and $1 \times 1$ convolutional channel mapping.
As illustrated in Figure 4, the DRM estimates the contribution of each candidate branch at each spatial location. For the aligned feature $\tilde{X}_i^l$, channel-wise average pooling and max pooling are first applied to obtain a compact spatial descriptor:
$$D_i^l = \left[\mathrm{Avg}_c(\tilde{X}_i^l);\ \mathrm{Max}_c(\tilde{X}_i^l)\right], \tag{9}$$
where $\mathrm{Avg}_c(\cdot)$ and $\mathrm{Max}_c(\cdot)$ denote average and max pooling along the channel dimension, respectively, and $[\cdot\,;\cdot]$ denotes channel concatenation. The routing score map is then generated by a lightweight mapping function
$$R_i^l = \phi_l(D_i^l), \tag{10}$$
where $\phi_l(\cdot)$ consists of a $3 \times 3$ convolution followed by a non-linear activation. The routing scores of all candidate branches are normalized along the branch dimension
$$\alpha_i^l = \frac{\exp(R_i^l)}{\sum_{j=1}^{N_l} \exp(R_j^l)}, \tag{11}$$
where $\alpha_i^l$ is a two-dimensional routing weight map rather than a globally shared scalar. The output of the $l$-th fusion node is obtained by
$$Y^l = \sum_{i=1}^{N_l} \alpha_i^l \odot \tilde{X}_i^l, \tag{12}$$
where $\odot$ denotes element-wise multiplication with channel-wise broadcasting. In this manner, DSRP enables spatially adaptive cross-scale aggregation: high-resolution features can be emphasized in regions containing small objects, while deeper semantic features can contribute more to regions requiring stronger contextual representation.
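The routing steps above (channel descriptor, score map, softmax over branches, weighted sum) can be sketched in plain Python on nested lists. This is an illustrative sketch under stated assumptions, not the published DRM: the inputs are taken as already aligned, all function names are hypothetical, and the learned $3 \times 3$ convolution with activation is replaced by a per-pixel linear stand-in.

```python
import math

def channel_descriptor(x):
    """Per-pixel mean and max over channels; x is a C x H x W nested list."""
    c, h, w = len(x), len(x[0]), len(x[0][0])
    avg = [[sum(x[k][i][j] for k in range(c)) / c for j in range(w)]
           for i in range(h)]
    mx = [[max(x[k][i][j] for k in range(c)) for j in range(w)]
          for i in range(h)]
    return avg, mx

def score_map(avg, mx, w_avg=1.0, w_max=1.0):
    """Stand-in for the learned scoring function (assumption: no spatial conv)."""
    return [[w_avg * a + w_max * m for a, m in zip(row_a, row_m)]
            for row_a, row_m in zip(avg, mx)]

def softmax_over_branches(scores):
    """Normalise N score maps (each H x W) across branches at every pixel."""
    n, h, w = len(scores), len(scores[0]), len(scores[0][0])
    alphas = [[[0.0] * w for _ in range(h)] for _ in range(n)]
    for i in range(h):
        for j in range(w):
            vals = [scores[b][i][j] for b in range(n)]
            top = max(vals)  # subtract the max for numerical stability
            exps = [math.exp(v - top) for v in vals]
            total = sum(exps)
            for b in range(n):
                alphas[b][i][j] = exps[b] / total
    return alphas

def fuse(branches):
    """Spatially weighted sum of aligned branches; returns (fused, alphas)."""
    scores = [score_map(*channel_descriptor(x)) for x in branches]
    alphas = softmax_over_branches(scores)
    n, c = len(branches), len(branches[0])
    h, w = len(branches[0][0]), len(branches[0][0][0])
    fused = [[[sum(alphas[b][i][j] * branches[b][k][i][j] for b in range(n))
               for j in range(w)] for i in range(h)] for k in range(c)]
    return fused, alphas

# Two single-channel 1 x 2 branches that are strong at different pixels.
branch_a = [[[2.0, 0.0]]]
branch_b = [[[0.0, 2.0]]]
fused, alphas = fuse([branch_a, branch_b])
```

Because the weights are computed per pixel, the first branch dominates at the first position and the second branch at the second, which a single globally shared scalar per branch could not express.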

3.4. SGNW Loss

In UAV aerial object detection, small targets are highly sensitive to slight coordinate deviations, making IoU-based regression losses unstable when the overlap region changes sharply. To obtain a smoother regression signal, NWD [47] models each bounding box as a two-dimensional Gaussian distribution. Specifically, for a box $B = (x, y, w, h)$, its Gaussian representation is defined as
$$\mathcal{N}_B = \mathcal{N}(\mu, \Sigma), \quad \mu = \begin{bmatrix} x \\ y \end{bmatrix}, \quad \Sigma = \begin{bmatrix} w^2/4 & 0 \\ 0 & h^2/4 \end{bmatrix}. \tag{13}$$
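For the predicted box $B_p$ and the ground-truth box $B_g$, the squared 2-Wasserstein distance between their Gaussian distributions has the closed form
$$W_2^2 = \lVert \mu_p - \mu_g \rVert_2^2 + \mathrm{Tr}\left(\Sigma_p + \Sigma_g - 2\left(\Sigma_g^{1/2} \Sigma_p \Sigma_g^{1/2}\right)^{1/2}\right). \tag{14}$$
Since the covariance matrices are diagonal, $\Sigma^{1/2} = \mathrm{diag}(w/2, h/2)$ and the trace term reduces to $(w_p - w_g)^2/4 + (h_p - h_g)^2/4$, so this distance can be simplified as
$$W_{\mathrm{NWD}}^2 = (x_p - x_g)^2 + (y_p - y_g)^2 + \frac{(w_p - w_g)^2 + (h_p - h_g)^2}{4}. \tag{15}$$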
Although NWD provides smoother gradients for small objects, it treats horizontal and vertical deviations uniformly and does not explicitly consider the geometric sensitivity caused by different aspect ratios.
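To make the contrast between IoU and NWD concrete, the following sketch compares both measures for a small box under growing center offsets. It uses the simplified diagonal-covariance distance and the exponential normalization of the original NWD formulation. The function names are hypothetical, and the constant `c = 12.8` is only an illustrative value, not a setting reported in this paper.

```python
import math

def iou_cxcywh(a, b):
    """IoU of two axis-aligned boxes given as (cx, cy, w, h)."""
    ax1, ax2 = a[0] - a[2] / 2, a[0] + a[2] / 2
    ay1, ay2 = a[1] - a[3] / 2, a[1] + a[3] / 2
    bx1, bx2 = b[0] - b[2] / 2, b[0] + b[2] / 2
    by1, by2 = b[1] - b[3] / 2, b[1] + b[3] / 2
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def nwd(a, b, c=12.8):
    """Normalized Wasserstein similarity; c is an assumed normalization constant."""
    w2 = ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
          + ((a[2] - b[2]) ** 2 + (a[3] - b[3]) ** 2) / 4.0)
    return math.exp(-math.sqrt(w2) / c)

small = (10.0, 10.0, 6.0, 6.0)
large = (100.0, 100.0, 48.0, 48.0)
for shift in (3.0, 6.0, 12.0):
    moved_small = (small[0] + shift, small[1], small[2], small[3])
    moved_large = (large[0] + shift, large[1], large[2], large[3])
    print(shift,
          round(iou_cxcywh(small, moved_small), 3),
          round(iou_cxcywh(large, moved_large), 3),
          round(nwd(small, moved_small), 3))
```

For the 6 × 6 box, a 3-pixel offset already cuts IoU to one third, and any offset of 6 pixels or more gives an IoU of exactly zero, so IoU can no longer distinguish a near miss from a far miss. The NWD score keeps decreasing smoothly with the offset, which is the property the regression loss relies on.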
To address this limitation, SGNW introduces an aspect-ratio-guided anisotropic metric into the NWD formulation. Let
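$$\Delta\mu = \begin{bmatrix} x_p - x_g \\ y_p - y_g \end{bmatrix}, \quad \Delta s = \begin{bmatrix} w_p - w_g \\ h_p - h_g \end{bmatrix}. \tag{16}$$
The aspect ratio descriptor and modulation factor of the ground-truth box are defined as
$$\rho_g = \ln\frac{w_g + \varepsilon}{h_g + \varepsilon}, \quad \eta_g = 1 + \alpha\,|\rho_g|, \tag{17}$$
where $\varepsilon$ is used for numerical stability and $\alpha$ controls the strength of geometric modulation. Based on $\rho_g$, the anisotropic metric matrix is defined as
$$A_g = \begin{cases} \begin{bmatrix} \eta_g & 0 \\ 0 & \eta_g^{-1} \end{bmatrix}, & \rho_g \ge 0, \\[2ex] \begin{bmatrix} \eta_g^{-1} & 0 \\ 0 & \eta_g \end{bmatrix}, & \rho_g < 0. \end{cases} \tag{18}$$
Here, $A_g$ is a geometry-aware metric matrix rather than a covariance matrix. It redistributes the regression penalty between horizontal and vertical directions according to the target aspect ratio.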
The proposed geometry-aware Wasserstein distance is then formulated as
$$W_{\mathrm{SG}}^2 = \Delta\mu^{\mathsf{T}} A_g\, \Delta\mu + \frac{1}{4}\, \Delta s^{\mathsf{T}} A_g\, \Delta s. \tag{19}$$
When $A_g$ degenerates to the identity matrix, $W_{\mathrm{SG}}^2$ becomes equivalent to the standard NWD distance. Therefore, SGNW can be regarded as a shape-aware extension of NWD.
Following NWD, the distance is normalized into a similarity score
$$S_{\mathrm{SG}} = \exp\left(-\frac{\sqrt{W_{\mathrm{SG}}^2}}{C}\right), \tag{20}$$
and the SGNW loss is defined as
$$L_{\mathrm{SGNW}} = 1 - S_{\mathrm{SG}}. \tag{21}$$
To balance the smooth regression property of SGNW and the explicit overlap constraint of CIoU, a scale-aware fusion factor is introduced
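$$\omega_s = \exp\left(-\frac{\sqrt{w_g h_g}}{\tau}\right), \tag{22}$$
where $\sqrt{w_g h_g}$ denotes the characteristic scale of the ground-truth box and $\tau$ controls the transition rate. The final regression loss is given by
$$L_{\mathrm{reg}} = \omega_s L_{\mathrm{SGNW}} + (1 - \omega_s)(1 - \mathrm{CIoU}). \tag{23}$$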
Compared with standard NWD, SGNW introduces two modifications: an aspect-ratio-guided anisotropic metric for shape-aware localization and a scale-aware fusion strategy for adaptive balance between distributional regression and overlap-based regression.
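The two modifications can be combined into a short numerical sketch. It is a hedged illustration rather than the authors' code: the function names are hypothetical, the values of `alpha`, `c`, and `tau` are placeholders that the text above does not specify, the CIoU value is supplied by the caller instead of being computed, and the square roots in the similarity score and in the characteristic scale follow the NWD convention as an assumption of this sketch.

```python
import math

def sgnw_similarity(pred, gt, alpha=0.5, c=12.8, eps=1e-7):
    """Shape-aware similarity for boxes (cx, cy, w, h); alpha and c are assumed."""
    xp, yp, wp, hp = pred
    xg, yg, wg, hg = gt
    rho = math.log((wg + eps) / (hg + eps))  # aspect ratio descriptor
    eta = 1.0 + alpha * abs(rho)             # modulation factor
    # Diagonal entries of the anisotropic metric matrix.
    ax, ay = (eta, 1.0 / eta) if rho >= 0 else (1.0 / eta, eta)
    w2 = (ax * (xp - xg) ** 2 + ay * (yp - yg) ** 2
          + 0.25 * (ax * (wp - wg) ** 2 + ay * (hp - hg) ** 2))
    return math.exp(-math.sqrt(w2) / c)

def scale_weight(gt, tau=32.0):
    """Scale-aware fusion factor; tau is an assumed transition rate."""
    return math.exp(-math.sqrt(gt[2] * gt[3]) / tau)

def regression_loss(pred, gt, ciou, alpha=0.5, c=12.8, tau=32.0):
    """Blend of the SGNW term and the CIoU term; ciou comes from the caller."""
    l_sgnw = 1.0 - sgnw_similarity(pred, gt, alpha=alpha, c=c)
    omega = scale_weight(gt, tau=tau)
    return omega * l_sgnw + (1.0 - omega) * (1.0 - ciou)

# Square ground truth: the metric reduces to the identity, i.e. plain NWD.
square_gt = (10.0, 10.0, 8.0, 8.0)
sim_square = sgnw_similarity((12.0, 10.0, 8.0, 8.0), square_gt)

# Wide ground truth: equal offsets in x and y are no longer penalised equally.
wide_gt = (10.0, 10.0, 16.0, 4.0)
sim_x = sgnw_similarity((12.0, 10.0, 16.0, 4.0), wide_gt)
sim_y = sgnw_similarity((10.0, 12.0, 16.0, 4.0), wide_gt)
```

With these placeholder settings, the fusion factor is close to 1 for a 4 × 4 ground-truth box and close to 0 for a 256 × 256 box, so the distributional term dominates for tiny targets and the CIoU term dominates for large ones.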

4. Results

4.1. Datasets

In this study, experiments were conducted on two public benchmark datasets, VisDrone2019 [14] and UAVDT [13]. Both datasets were collected from real-world UAV imaging scenarios and are characterized by complex backgrounds, substantial scale variations, and a high proportion of small objects, thereby effectively reflecting the practical challenges of aerial object detection.
VisDrone2019 is a large-scale UAV object detection dataset released by the AISKYEYE team at Tianjin University. It contains 6471 training images, 548 validation images, and 1610 test images. The dataset was collected from 14 cities and several rural areas in China under diverse weather and illumination conditions, with annotations for 10 object categories. The category distribution and object-scale statistics of the dataset are shown in Figure 5. Overall, the dataset exhibits a certain degree of class imbalance and a pronounced bias toward small objects. Representative samples are shown in Figure 6.
UAVDT is another widely used benchmark dataset for UAV-based aerial object detection. It was constructed from approximately 10 h of UAV video and contains about 80,000 image frames with a resolution of $1080 \times 540$. This dataset covers a variety of urban scenes, including roads, squares, and parking lots, and includes diverse weather conditions, flight altitudes, viewing angles, and occlusion levels. Unlike VisDrone2019, UAVDT is primarily designed for vehicle detection and includes annotations for three categories: car, truck, and bus. Its category and scale distributions are illustrated in Figure 7, which likewise indicate a prominent small object characteristic. Representative samples are shown in Figure 8.

4.2. Evaluation Metrics

To comprehensively evaluate the detection performance and computational complexity of the model, precision, recall, average precision (AP), mean average precision (mAP), F1-score, the number of parameters (Params), and floating-point operations (FLOPs) were adopted as evaluation metrics.
Specifically, precision and recall are used to measure the accuracy and completeness of the detection results, respectively, and are defined in Equations (24) and (25). When the Intersection over Union (IoU) between a predicted bounding box and a ground-truth box exceeds a predefined threshold and the prediction can be uniquely matched to a ground-truth object, it is counted as a true positive (TP).
Unmatched predicted boxes are counted as false positives (FP), whereas undetected ground-truth objects are regarded as false negatives (FN).
$$\mathrm{Precision} = \frac{TP}{TP + FP}. \tag{24}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}. \tag{25}$$
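The counting rule for TP, FP, and FN can be sketched as a greedy one-to-one matching. This is a common evaluation convention and an assumption here, since the exact matching procedure is not spelled out in the text. The function names and the default threshold of 0.5 are illustrative.

```python
def iou_xyxy(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def count_tp_fp_fn(predictions, ground_truths, iou_thr=0.5):
    """predictions: list of (confidence, box). Each ground truth matches once."""
    matched = set()
    tp = fp = 0
    # Visit predictions from most to least confident.
    for _, box in sorted(predictions, key=lambda p: p[0], reverse=True):
        best_iou, best_idx = 0.0, None
        for idx, gt in enumerate(ground_truths):
            if idx in matched:
                continue
            overlap = iou_xyxy(box, gt)
            if overlap > best_iou:
                best_iou, best_idx = overlap, idx
        if best_idx is not None and best_iou > iou_thr:
            matched.add(best_idx)
            tp += 1
        else:
            fp += 1  # duplicate or poorly localized prediction
    fn = len(ground_truths) - len(matched)
    return tp, fp, fn

gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
preds = [(0.9, (0, 0, 10, 10)), (0.8, (1, 1, 11, 11)), (0.7, (50, 50, 60, 60))]
tp, fp, fn = count_tp_fp_fn(preds, gts)
```

In this example the second prediction overlaps a ground-truth box that is already matched, so it counts as a false positive rather than a second true positive.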
AP represents the detection performance for a single category and corresponds to the area under the precision–recall curve. mAP denotes the mean of AP over all categories and is defined in Equations (26) and (27), where $n$ is the total number of categories and $AP_i$ is the average precision of the $i$-th category. In this study, both mAP@0.5 and mAP@0.5:0.95 are reported. The former is used to evaluate the overall detection capability of the model, whereas the latter provides a more stringent assessment of localization accuracy and overall performance.
$$AP = \int_0^1 P(R)\, dR. \tag{26}$$
$$\mathrm{mAP} = \frac{1}{n} \sum_{i=1}^{n} AP_i. \tag{27}$$
In addition, F1-score, defined in Equation (28) as the harmonic mean of precision and recall, is used to comprehensively reflect the balance between the two. Params and FLOPs are used to measure model size and computational cost, respectively. FPS is further reported to evaluate the inference efficiency of different detectors.
$$F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}. \tag{28}$$
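As a small worked example of these metrics, the sketch below computes precision, recall, and F1 from raw counts and approximates the AP integral from a ranked list of detections. The all-point interpolation with a monotone precision envelope is one common convention and is an assumption here, because the text gives only the integral form. All function names are hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and their harmonic mean from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2.0 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

def average_precision(scored_hits, num_gt):
    """scored_hits: list of (confidence, is_true_positive) for one category."""
    ranked = sorted(scored_hits, key=lambda s: s[0], reverse=True)
    tp = fp = 0
    recalls, precisions = [], []
    for _, hit in ranked:
        if hit:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_gt)
        precisions.append(tp / (tp + fp))
    # Make precision non-increasing in recall (interpolation envelope).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum rectangle areas under the precision-recall curve.
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap

def mean_ap(per_class_ap):
    """Mean of the per-category AP values."""
    return sum(per_class_ap) / len(per_class_ap)

p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
ap = average_precision([(0.9, True), (0.8, False), (0.7, True)], num_gt=2)
```

With 8 true positives, 2 false positives, and 2 false negatives, precision and recall are both 0.8, so F1 is also 0.8.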

4.3. Experimental Setup

To ensure the reproducibility of the experimental results and the fairness of the comparisons, all models were trained and evaluated under a unified software and hardware environment. The configuration of the experimental platform is listed in Table 1. Specifically, the experiments were conducted on a platform equipped with an Intel Core i9-14900K processor and an NVIDIA GeForce RTX 5070 GPU with 12 GB of memory. The operating system was Ubuntu 20.04 LTS, the deep learning framework was PyTorch 2.0.6, the GPU acceleration environment was based on CUDA 13.0, the programming language was Python 3.11, and the integrated development environment was PyCharm 2025.2.2.
The main training hyperparameters are presented in Table 2. Unless otherwise specified, all experiments were conducted using the same training settings. SGD was adopted as the optimizer, with an initial learning rate of 0.001 dynamically adjusted by a cosine annealing schedule. The model was trained for 200 epochs with a batch size of 32 and eight data-loading workers. The momentum and weight decay were set to 0.937 and 0.0005, respectively. The input image size was uniformly set to 640 × 640 for both training and inference. It is worth noting that UAVDT frames have a native resolution of 1080 × 540. To convert them to the unified input size, we used aspect-ratio-preserving letterbox resizing rather than direct stretching. Specifically, each frame was first scaled to fit within 640 × 640 and then padded to the target size, with the same scale factor and padding offsets applied to the bounding box annotations. This avoids object shape distortion, although the resolution reduction may still weaken fine-grained cues for very small objects.
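The letterbox geometry described above can be worked through for a 1080 × 540 UAVDT frame. The sketch below is only an illustration under stated assumptions: padding is split evenly on both sides and the resized dimensions are rounded to integers, neither of which is specified in the text. The helper names `letterbox_params` and `map_box` are hypothetical.

```python
def letterbox_params(src_w, src_h, dst=640):
    """Scale factor, resized size, and padding for a square dst x dst input."""
    scale = min(dst / src_w, dst / src_h)  # keep the aspect ratio
    new_w, new_h = round(src_w * scale), round(src_h * scale)
    pad_x = (dst - new_w) / 2  # assumption: symmetric left/right padding
    pad_y = (dst - new_h) / 2  # assumption: symmetric top/bottom padding
    return scale, new_w, new_h, pad_x, pad_y


def map_box(box, scale, pad_x, pad_y):
    """Apply the same scale and offsets to an (x1, y1, x2, y2) annotation."""
    x1, y1, x2, y2 = box
    return (x1 * scale + pad_x, y1 * scale + pad_y,
            x2 * scale + pad_x, y2 * scale + pad_y)


scale, new_w, new_h, pad_x, pad_y = letterbox_params(1080, 540)
# A hypothetical 20 x 10 px vehicle in the native frame.
small_box = map_box((500, 260, 520, 270), scale, pad_x, pad_y)
box_w = small_box[2] - small_box[0]
box_h = small_box[3] - small_box[1]
```

Under these assumptions the frame is resized to 640 × 320 with 160 px of padding above and below, and the scale factor is about 0.59. The hypothetical 20 × 10 px vehicle therefore shrinks to roughly 12 × 6 px, which illustrates the loss of fine-grained cues noted above.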
In terms of the training strategy, Mosaic data augmentation was employed to enrich the scene diversity and scale distribution of the samples, thereby improving the model’s adaptability to complex backgrounds and small objects. To mitigate the adverse effect of strong data augmentation on bounding box regression in the later stage of training, Mosaic augmentation was disabled during the last 10 epochs. In addition, an early stopping strategy was adopted according to the validation-set performance during training, so as to reduce the additional computational cost caused by ineffective iterations.
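The training schedule described above (cosine-annealed learning rate, Mosaic disabled for the last 10 epochs, and early stopping on the validation metric) can be sketched in a framework-independent way. The initial learning rate, epoch count, and 10-epoch Mosaic cut-off come from the text. The final learning rate `lr_min`, the early stopping `patience`, and the zero-based epoch indexing are assumptions introduced for the example, since they are not reported here.

```python
import math

EPOCHS = 200       # from the text
LR0 = 0.001        # from the text
CLOSE_MOSAIC = 10  # from the text: Mosaic is off in the last 10 epochs


def cosine_lr(epoch, lr0=LR0, epochs=EPOCHS, lr_min=0.0):
    """Cosine annealing from lr0 down to lr_min (lr_min is an assumption)."""
    return lr_min + 0.5 * (lr0 - lr_min) * (1 + math.cos(math.pi * epoch / epochs))


def use_mosaic(epoch, epochs=EPOCHS, close_mosaic=CLOSE_MOSAIC):
    """True while Mosaic augmentation is still enabled (epochs are 0-indexed)."""
    return epoch < epochs - close_mosaic


class EarlyStopper:
    """Stop when the validation metric has not improved for `patience` epochs."""

    def __init__(self, patience):
        self.patience = patience  # hypothetical value, not given in the text
        self.best = float("-inf")
        self.best_epoch = 0

    def should_stop(self, epoch, metric):
        if metric > self.best:
            self.best, self.best_epoch = metric, epoch
        return epoch - self.best_epoch >= self.patience


stopper = EarlyStopper(patience=3)
val_map = [0.30, 0.31, 0.31, 0.30, 0.29]  # made-up validation values
stop_epoch = next(e for e, m in enumerate(val_map) if stopper.should_stop(e, m))
```

With these made-up validation values the best epoch is epoch 1, so a patience of 3 stops training at epoch 4.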

4.4. Ablation Study

To verify the contribution of each proposed module to detection performance, a systematic ablation study was conducted on the VisDrone2019 and UAVDT datasets using YOLOv10s as the baseline. Specifically, a series of experiments was constructed using an incremental module stacking strategy, including configurations with only BAFE (A), only DSRP (B), only SGNW (C), the combination of BAFE and DSRP (D), and the fully integrated BDRNet. The experimental results are summarized in Table 3, Table 4 and Table 5.

4.4.1. Ablation on the VisDrone2019 Dataset

At a marginal cost of 0.14 M parameters and 0.7 G FLOPs, Model A improves mAP@0.5 and mAP@0.5:0.95 from the baseline’s 31.4% and 17.5% to 32.3% and 18.2%, respectively, alongside a 1.1 pp increase in precision. These results indicate that adding BAFE improves the overall detection performance of the baseline model on the VisDrone2019 dataset and support the effectiveness of BAFE.
With slightly fewer parameters and computational overhead than the baseline, Model B elevates mAP@0.5 to 31.7% and mAP@0.5:0.95 to 17.9%, achieving a dual improvement in accuracy and efficiency. This is attributed to dynamic weighted fusion, which reduces noise accumulation from redundant fusion, and the retention of the high-resolution $P_2$ branch, which prevents the irreversible loss of minute object edge details during deep downsampling. Higher-altitude UAV imaging usually produces smaller object regions in the image, whereas lower-altitude imaging generally corresponds to relatively larger targets. Under this setting, the improvement brought by Model B indicates that preserving the high-resolution $P_2$ branch and using spatially adaptive routing are beneficial for scale-varied UAV objects.
Without adding any parameters or computational load, Model C boosts mAP@0.5 to 32.3% and mAP@0.5:0.95 to 18.3%, while precision increases by 1.2 pp—the most prominent improvement among single-module configurations. This demonstrates that introducing scale and geometric constraints significantly improves bounding box regression quality in scenarios with drastic aspect ratio variations.
Model D achieves an mAP@0.5 of 32.8%, a further 0.5 pp improvement over the highest value achieved by individual modules, reflecting a clear positive synergistic effect. BAFE’s front-end suppression of background noise reduces interference for DSRP’s cross-layer routing, enabling dynamic weighted fusion to allocate semantic weights across scales more accurately, thereby generating cascaded gains superior to a simple superposition.
Furthermore, to clarify the specific advantages of SGNW in object localization, additional comparative experiments on loss functions were conducted using the VisDrone2019 dataset. As shown in Table 4, the compared losses include CIoU, EIoU, NWD, and the proposed SGNW. The table additionally reports the small object detection accuracy, $\mathrm{mAP}_S$, to evaluate the localization effect of each loss function on extremely small targets.
The baseline CIoU yields an mAP@0.5 of 31.4% and an $\mathrm{mAP}_S$ of only 16.2%. Replacing the loss with EIoU only marginally increases mAP@0.5 to 31.5%, indicating that decoupling the aspect ratio alone is insufficient to alleviate the gradient instability of IoU-based metrics on extremely small targets. Substituting it with NWD raises mAP@0.5 to 31.8%, which is consistent with the Wasserstein distance-based distribution similarity metric providing smoother regression gradients in regions with small overlaps. The proposed SGNW further introduces an aspect-ratio-guided anisotropic covariance matrix and a scale-aware dynamic fusion strategy on top of the NWD baseline, boosting mAP@0.5 to 32.3% and $\mathrm{mAP}_S$ to 17.2% (increases of 0.5 pp and 0.4 pp over NWD, respectively), the best small object localization accuracy among the compared losses. These results suggest that modeling target scale and geometric morphological discrepancies is crucial for further improving small object regression accuracy.
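To make the contrast between IoU-based and Wasserstein-based similarity concrete, the sketch below compares IoU with the plain NWD of [47] for a tiny box under increasing center offsets. It implements only the standard NWD form, in which each box $(c_x, c_y, w, h)$ is modeled as a Gaussian with mean $(c_x, c_y)$ and covariance $\mathrm{diag}(w^2/4, h^2/4)$. It is not the proposed SGNW, whose geometric weighting and scale-aware fusion are not reproduced here. The normalizing constant `c` is dataset dependent, and the value 12.8 is an illustrative assumption.

```python
import math


def iou_cxcywh(a, b):
    """IoU of two boxes given as (cx, cy, w, h)."""
    ax1, ay1, ax2, ay2 = a[0] - a[2] / 2, a[1] - a[3] / 2, a[0] + a[2] / 2, a[1] + a[3] / 2
    bx1, by1, bx2, by2 = b[0] - b[2] / 2, b[1] - b[3] / 2, b[0] + b[2] / 2, b[1] + b[3] / 2
    inter = max(0.0, min(ax2, bx2) - max(ax1, bx1)) * max(0.0, min(ay2, by2) - max(ay1, by1))
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def nwd(a, b, c=12.8):
    """Normalized Wasserstein distance between two (cx, cy, w, h) boxes.

    For Gaussians with diagonal covariances diag(w^2/4, h^2/4), the squared
    2-Wasserstein distance reduces to a squared Euclidean distance between
    the vectors (cx, cy, w/2, h/2). `c` is an assumed normalizing constant.
    """
    w2_sq = ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
             + ((a[2] - b[2]) / 2) ** 2 + ((a[3] - b[3]) / 2) ** 2)
    return math.exp(-math.sqrt(w2_sq) / c)


gt = (10.0, 10.0, 6.0, 6.0)  # a hypothetical 6 x 6 px object
shifts = [0, 3, 6, 9]        # horizontal center offsets in pixels
iou_vals = [iou_cxcywh((10.0 + s, 10.0, 6.0, 6.0), gt) for s in shifts]
nwd_vals = [nwd((10.0 + s, 10.0, 6.0, 6.0), gt) for s in shifts]
```

For this 6 × 6 px box, IoU falls to zero once the offset reaches 6 px and gives no signal to distinguish a 6 px error from a 9 px error. NWD keeps decreasing smoothly with the offset, which is the property the paragraph above refers to.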
BDRNet introduces SGNW on top of Model D. Its mAP@0.5 and mAP@0.5:0.95 reach 33.3% and 18.8%, respectively, outperforming the baseline by 1.9 pp and 1.3 pp, and achieving the best levels in both precision and recall. No performance saturation is observed among the three modules, indicating that BAFE, DSRP, and SGNW independently address the distinct bottlenecks of feature enhancement, multi-scale fusion, and regression optimization, exhibiting excellent functional complementarity.
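The percentage-point gains quoted in this subsection can be reproduced from the reported mAP@0.5 values. The snippet below is a bookkeeping aid only: the values are copied from the text above, and the dictionary keys and helper name `gain_pp` are hypothetical.

```python
# mAP@0.5 (%) on VisDrone2019 as quoted in this subsection.
MAP50 = {
    "baseline": 31.4,
    "A": 32.3,       # BAFE only
    "B": 31.7,       # DSRP only
    "C": 32.3,       # SGNW only
    "D": 32.8,       # BAFE + DSRP
    "BDRNet": 33.3,  # BAFE + DSRP + SGNW
}


def gain_pp(model, ref="baseline"):
    """Gain of `model` over `ref` in percentage points, rounded to 0.1 pp."""
    return round(MAP50[model] - MAP50[ref], 1)


best_single = max(MAP50[m] for m in ("A", "B", "C"))
d_over_best_single = round(MAP50["D"] - best_single, 1)
# Sum of the separate BAFE and DSRP gains, for comparison with Model D.
additive_ab = round(gain_pp("A") + gain_pp("B"), 1)
```

Model D gains 1.4 pp over the baseline, compared with 1.2 pp for the sum of the separate BAFE and DSRP gains, which matches the synergy described for Model D.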

4.4.2. Ablation on the UAVDT Dataset

Table 5 reports the ablation results of each module on the UAVDT dataset. Unlike the VisDrone2019 dataset, the UAVDT dataset primarily comprises urban road traffic scenes, placing greater emphasis on testing the model’s localization robustness in densely distributed object environments.
Model A improves mAP@0.5 from the baseline’s 63.8% to 64.7%, mAP@0.5:0.95 from 38.5% to 39.5%, and precision by 1.7 pp to 90.9%, representing the most notable gain among all single-module configurations. These results show that adding BAFE improves the overall detection performance on the UAVDT dataset.
Models B and C similarly yield mAP@0.5 improvements of approximately 0.4–0.8 pp, further supporting the interpretability of the respective mechanisms of the three modules.
Model D achieves an mAP@0.5 of 65.2% and an mAP@0.5:0.95 of 40.2%, both surpassing the optimal values of any single module, reaffirming the cross-dataset universality of their positive synergistic effect.
BDRNet ultimately attains an mAP@0.5 and mAP@0.5:0.95 of 65.9% and 40.8%, respectively, improving upon the baseline by 2.1 pp and 2.3 pp and achieving the highest precision. Notably, the recall of BDRNet is slightly lower than that of Model D. This phenomenon is not a performance degradation but a direct manifestation of SGNW’s geometric constraints: strict geometric alignment constraints filter out a large number of false-positive samples with insufficient bounding box overlap or imprecise localization, thereby concentrating prediction confidence on geometrically congruent positive samples. In other words, BDRNet trades a marginal loss in loosely bounded detections for significantly higher localization precision, a trade-off with clear practical value.

4.4.3. Sensitivity Analysis

To evaluate the robustness of BAFE with respect to the directional convolution kernel size $k$, we conduct a sensitivity analysis on the VisDrone2019 validation set. The full BDRNet model is trained with $k \in \{3, 5, 7, 9\}$, while all other hyperparameters remain fixed. The results are illustrated in Figure 9.
Performance peaks at $k = 7$ and degrades gracefully on both sides, suggesting that a moderate receptive field is sufficient to capture the continuous structural patterns of background bands. The performance variation across all tested values is within 0.7 pp in mAP@0.5, indicating that BDRNet is not highly sensitive to this hyperparameter over the tested range of $k$.

4.4.4. Qualitative Analysis

Figure 10 and Figure 11 present representative qualitative detection results of different module configurations, providing visual observations that complement the quantitative results in Table 3 and Table 5.
The detection results from VisDrone2019 in Figure 10 show that the model with BAFE achieves more stable detection results than the baseline in several complex aerial scenes. This observation is consistent with the overall performance improvement brought by BAFE in Table 3, but it is not used here to claim the suppression of specific background categories. After further incorporating DSRP, the model shows improved detection of occluded targets and objects with large scale variations, which is consistent with the role of DSRP in preserving high-resolution spatial information and adaptively integrating multi-scale features.
The detection results from UAVDT in Figure 11 show a similar trend. The baseline model tends to produce localization offsets and low-confidence predictions when detecting distant and small vehicles under low-contrast nighttime conditions. With the integration of BAFE, the displayed examples show more stable detection results, which is consistent with its overall improvement in detection metrics. After introducing DSRP, the model further improves the detection of distant and scale-varied vehicles, producing more complete predictions and more compact bounding boxes around target regions. These qualitative observations support the overall trends in Table 5, while they should not be interpreted as direct evidence for a specific background suppression mechanism.

4.5. Comparative Experiments

To fully validate the effectiveness and robustness of BDRNet in complex UAV aerial scenarios, this section conducts a detailed experimental analysis across three dimensions: comparison with state-of-the-art (SOTA) detectors, in-depth analysis against the baseline, and cross-dataset generalization evaluation. Unless otherwise specified, all models involved in the comparison strictly adhere to the same training configurations.

4.5.1. Comparison with State-of-the-Art Detectors

Twelve representative detectors are selected for comparison, including YOLOv10s [11], RT-DETR [49], YOLOv5s [50], YOLOv6s [51], YOLOX-S [52], YOLOv7-tiny [53], YOLOv8s [54], YOLOv9s [55], YOLOv11s [56], YOLO-MS-S [57], Gold-YOLO-S [58], and SFFNet [59].
Table 6 reports the comparative results on the VisDrone2019 dataset. BDRNet achieves the best precision, recall, mAP@0.5, and mAP@0.5:0.95 among the lightweight detectors in this comparison, while maintaining a parameter count and computational cost close to the YOLOv10s baseline. Compared with YOLOv8s and YOLOv9s, BDRNet improves mAP@0.5 by 2.0 pp and 2.1 pp, respectively. Compared with Gold-YOLO-S and SFFNet, BDRNet also achieves higher mAP@0.5 with lower or comparable computational cost, indicating its effectiveness for UAV object detection under a lightweight setting. Although RT-DETR obtains a higher mAP@0.5 than BDRNet, it requires about 4.1× more parameters and 6.3× more FLOPs, and its inference speed is lower than that of BDRNet, i.e., 74 FPS versus 148 FPS. These results show that BDRNet provides a favorable balance between detection accuracy and computational efficiency for resource-constrained UAV platforms.
To further investigate performance variations across different categories, Figure 12 presents a radar chart of per-class AP@0.5 for each model on the VisDrone2019 dataset. BDRNet achieves substantial improvements on dense categories characterized by extremely small objects (e.g., ‘motor’, ‘pedestrian’, ‘people’, and ‘bicycle’). This advantage is directly attributable to DSRP’s dynamic retention of high-resolution fine-grained features and SGNW’s ability to provide smooth gradients at minute scales.

4.5.2. In-Depth Analysis Against the Baseline

To further analyze BDRNet’s improvements over the YOLOv10s baseline, their P-R curves, F1 curves, confusion matrices, and Grad-CAM visualizations on the VisDrone2019 dataset are compared.
Figure 13 presents the P-R curve comparison between the baseline model and BDRNet on the VisDrone2019 dataset. Different from the overall quantitative comparison in Table 6, this visualization aims to further investigate how the proposed modules affect the precision–recall trade-off across different categories, especially for hard samples and challenging object classes. As can be observed, the overall P-R curve of BDRNet shifts toward the upper-right region compared with the baseline, indicating a better balance between precision and recall. In particular, for small object categories such as ‘people’, ‘pedestrian’, and ‘motor’, BDRNet achieves higher detection accuracy, demonstrating stronger adaptability to small object detection. Moreover, for categories such as ‘car’, ‘bicycle’, ‘tricycle’, and ‘van’, the P-R curves of BDRNet consistently outperform those of the baseline, suggesting more stable detection performance in complex backgrounds and multi-scale scenarios.
The F1 curves of the two models at different confidence thresholds are shown in Figure 14. BDRNet’s overall F1 curve is visibly higher than the baseline’s, with improvements particularly pronounced in low-to-medium confidence intervals. This further illustrates that the proposed method effectively elevates comprehensive detection performance. For hard-to-detect categories (e.g., ‘people’ and ‘motor’), BDRNet achieves higher peaks, suggesting enhanced discriminative capability for difficult samples.
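An F1-versus-confidence curve of the kind shown in Figure 14 can be produced by sweeping the confidence threshold over a list of score-ranked detections. The sketch below assumes that each detection has already been labeled as a true or false positive by IoU matching. The detections, the threshold grid, and the function name `f1_curve` are made up for illustration and do not come from the paper.

```python
def f1_curve(dets, num_gt, thresholds):
    """Return [(threshold, F1)] for one class.

    dets:   list of (score, is_tp) pairs, with is_tp fixed beforehand
    num_gt: number of ground-truth objects of this class
    """
    curve = []
    for t in thresholds:
        kept = [is_tp for score, is_tp in dets if score >= t]
        tp = sum(kept)
        fp = len(kept) - tp
        precision = tp / (tp + fp) if kept else 0.0
        recall = tp / num_gt if num_gt else 0.0
        denom = precision + recall
        curve.append((t, 2 * precision * recall / denom if denom else 0.0))
    return curve


dets = [(0.9, True), (0.8, True), (0.6, False),
        (0.5, True), (0.3, False), (0.2, False)]
curve = f1_curve(dets, num_gt=4, thresholds=[0.1, 0.3, 0.5, 0.7, 0.9])
best_threshold, best_f1 = max(curve, key=lambda point: point[1])
```

Low thresholds keep many false positives and high thresholds discard true positives, so the curve typically peaks at an intermediate confidence. In this toy example the peak lies at a threshold of 0.5.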
The confusion matrices (Figure 15) show that BDRNet exhibits stronger diagonal responses across most categories, indicating superior classification and discriminative capabilities compared to the baseline. Recognition for categories like ‘pedestrian’, ‘tricycle’, and ‘motor’ shows noticeable improvement, whereas ‘awning-tricycle’ still suffers from high miss rates and class confusion. This highlights that extremely small, visually similar targets with scarce training samples remain a formidable challenge in UAV detection.
To visualize shifts in the model’s regions of interest, Grad-CAM is employed to analyze the detection processes of the baseline and BDRNet (Figure 16). Compared to YOLOv10s, BDRNet’s response regions are more concentrated on target bodies, with weaker background interference. In scenarios with complex backgrounds, distant targets, and dense objects, BDRNet focuses more accurately on critical regions, aligning with the quantitative results.

4.5.3. Generalization on the UAVDT Dataset

To examine the model’s generalization robustness under different data distributions and extreme environmental conditions, qualitative visual comparisons were conducted on the UAVDT dataset.
Figure 17 displays the visual detection results of the baseline model and BDRNet across various complex scenes in the UAVDT dataset. Under low nighttime illumination, strong glare, rain/fog occlusion, and extremely dense distant traffic, the baseline model exhibits severe missed detections and localization drift. In stark contrast, BDRNet precisely captures minute vehicles hidden in the dark and within dense clusters, generating tight bounding boxes without confidence collapse. Benefiting from BAFE’s adaptive suppression of foggy backgrounds and DSRP’s retention of high-resolution features, BDRNet effectively localizes low-contrast vehicle targets. Notably, halations from streetlights and headlights often induce false positives. BDRNet achieves excellent background response suppression, significantly reducing false detections triggered by these light sources.

5. Discussion

The experimental results show that UAV remote sensing object detection remains challenging because small object scales, dense spatial distributions, complex backgrounds, and localization ambiguity often coexist in aerial images. The performance gains of BDRNet indicate that addressing these challenges requires coordinated improvements in feature representation, multi-scale information interaction, and bounding box regression. Specifically, BAFE strengthens feature representation through background-aware recalibration, DSRP improves the use of high-resolution and multi-scale features for small and scale-varied objects, and SGNW loss introduces scale- and geometry-aware constraints to enhance localization stability. These designs complement each other and jointly improve the detection accuracy of BDRNet in complex UAV scenes.
From a practical perspective, BDRNet achieves a favorable balance between detection accuracy and model complexity, which is important for UAV platforms with limited computational resources. Its lightweight design makes it potentially suitable for real-time UAV perception tasks. However, actual deployment performance may still depend on hardware platforms, inference engines, image resolution, and flight environments, which require further validation on real UAV or edge-computing devices.
Despite its effectiveness, BDRNet still has limitations in extremely challenging scenarios. Missed detections may occur when targets are very small, weakly illuminated, partially occluded, or visually similar to surrounding background regions. In particular, BAFE and DSRP may struggle when targets are both extremely small and structurally similar to background elements, because background-aware recalibration may not clearly separate weak target cues from clutter, and high-resolution routing may still preserve ambiguous background details. In future work, we will further explore infrared–visible image fusion to improve object perception under weak illumination and low-visibility conditions. We will also investigate more robust detection strategies for adverse environments, including dense fog, motion blur, nighttime scenes, and other degraded imaging conditions commonly encountered in UAV remote sensing.

6. Conclusions

This paper proposes BDRNet, a lightweight background-aware dynamic-scale routing network for UAV remote sensing object detection. The method is designed to address the challenges of small object scales, dense distributions, and complex aerial backgrounds while maintaining a compact model structure. BDRNet consists of three main components. The BAFE module improves feature representation through background-aware feature recalibration. The DSRP module enhances multi-scale feature interaction by preserving high-resolution details and adaptively routing cross-scale information. The SGNW loss introduces scale- and geometry-aware constraints to improve bounding box regression stability for small objects. Extensive experiments on the VisDrone2019 and UAVDT datasets demonstrate that BDRNet achieves consistent improvements over the baseline and several representative detectors while maintaining competitive computational efficiency. These results indicate that the proposed framework provides an effective and lightweight solution for UAV remote sensing object detection, and that background-aware feature enhancement and adaptive scale routing are promising directions for improving small object detection in complex aerial scenes.

Author Contributions

Conceptualization, X.Z. and F.S.; methodology, X.Z. and F.S.; software, X.Z. and Y.Y.; validation, X.Z. and Q.L.; resources, F.S. and J.D.; data curation, X.Z. and Y.Y.; writing—original draft preparation, X.Z. and C.C.; writing—review and editing, X.Z. and T.Z.; visualization, Q.L. and T.Z.; supervision, Q.L. and J.D.; project administration, C.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Tang, G.; Ni, J.; Zhao, Y.; Gu, Y.; Cao, W. A Survey of Object Detection for UAVs Based on Deep Learning. Remote Sens. 2023, 16, 149.
  2. Ding, J.; Xue, N.; Xia, G.S.; Bai, X.; Yang, W.; Yang, M.Y.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; et al. Object Detection in Aerial Images: A Large-Scale Benchmark and Challenges. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 7778–7796.
  3. Hua, W.; Chen, Q. A Survey of Small Object Detection Based on Deep Learning in Aerial Images. Artif. Intell. Rev. 2025, 58, 162.
  4. Leng, J.; Ye, Y.; Mo, M.; Gao, C.; Gan, J.; Xiao, B.; Gao, X. Recent Advances for Aerial Object Detection: A Survey. ACM Comput. Surv. 2024, 56, 1–36.
  5. Li, M.; Lan, J.; Zhang, Y.; Huang, K. LAST-Net: Local Adaptivity Spatial Transformer Network for Multiobject Detection in UAV Remote Sensing Thermal Infrared Imagery. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5003414.
  6. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
  7. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; IEEE: New York, NY, USA, 2017; pp. 2980–2988.
  8. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Proceedings of the Computer Vision – ECCV 2016; Springer International Publishing: Cham, Switzerland, 2016; Volume 9905, pp. 21–37.
  9. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. In 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; IEEE: New York, NY, USA, 2017; pp. 2999–3007.
  10. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767.
  11. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-Time End-to-End Object Detection. arXiv 2024, arXiv:2405.14458.
  12. Xia, G.S.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; Zhang, L. DOTA: A Large-Scale Dataset for Object Detection in Aerial Images. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; IEEE: New York, NY, USA, 2018; pp. 3974–3983.
  13. Yu, H.; Li, G.; Zhang, W.; Huang, Q.; Du, D.; Tian, Q.; Sebe, N. The Unmanned Aerial Vehicle Benchmark: Object Detection, Tracking and Baseline. Int. J. Comput. Vis. 2020, 128, 1141–1159.
  14. Du, D.; Zhu, P.; Wen, L.; Bian, X.; Lin, H.; Hu, Q.; Peng, T.; Zheng, J.; Wang, X.; Zhang, Y.; et al. VisDrone-DET2019: The Vision Meets Drone Object Detection in Image Challenge Results. In 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), Seoul, Republic of Korea, 27 October–2 November 2019; IEEE: New York, NY, USA, 2019; pp. 213–226.
  15. Yang, F.; Fan, H.; Chu, P.; Blasch, E.; Ling, H. Clustered Object Detection in Aerial Images. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; IEEE: New York, NY, USA, 2019; pp. 8310–8319.
  16. Liu, Y.; Yang, F.; Hu, P. Small-Object Detection in UAV-Captured Images via Multi-Branch Parallel Feature Pyramid Networks. IEEE Access 2020, 8, 145740–145750.
  17. Chen, Z.; Ji, H.; Zhang, Y.; Zhu, Z.; Li, Y. High-Resolution Feature Pyramid Network for Small Object Detection on Drone View. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 475–489.
  18. Zhang, T.; Zhuang, Y.; Wang, G.; Chen, H.; Wang, H.; Li, L.; Li, J. Controllable Generative Knowledge-Driven Few-Shot Object Detection from Optical Remote Sensing Imagery. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5612319.
  19. Zhang, T.; Zhuang, Y.; Wang, G.; Chen, H.; Li, L.; Li, J. A Unified Remote Sensing Object Detector Based on Fourier Contour Parametric Learning. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5611225.
  20. Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; IEEE: New York, NY, USA, 2018; pp. 7132–7141.
  21. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the Computer Vision – ECCV 2018; Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Springer: Cham, Switzerland, 2018; Volume 11211, pp. 3–19.
  22. Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-Local Neural Networks. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; IEEE: New York, NY, USA, 2018; pp. 7794–7803.
  23. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In Proceedings of the Computer Vision – ECCV 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M., Eds.; Springer: Cham, Switzerland, 2020; pp. 213–229.
  24. Xu, W.; Zhang, C.; Wang, Q.; Dai, P. FEA-Swin: Foreground Enhancement Attention Swin Transformer Network for Accurate UAV-Based Dense Object Detection. Sensors 2022, 22, 6993.
  25. Gong, H.; Mu, T.; Li, Q.; Dai, H.; Li, C.; He, Z.; Wang, W.; Han, F.; Tuniyazi, A.; Li, H.; et al. Swin-Transformer-Enabled YOLOv5 with Attention Mechanism for Small Object Detection on Satellite Images. Remote Sens. 2022, 14, 2861.
  26. Wan, Z.; Lan, Y.; Xu, Z.; Shang, K.; Zhang, F. DAU-YOLO: A Lightweight and Effective Method for Small Object Detection in UAV Images. Remote Sens. 2025, 17, 1768.
  27. Li, Y.l.; Feng, Y.; Zhou, M.l.; Xiong, X.c.; Wang, Y.h.; Qiang, B.h. DMA-YOLO: Multi-Scale Object Detection Method with Attention Mechanism for Aerial Images. Vis. Comput. 2024, 40, 4505–4518.
  28. Zhang, G.; Chen, X.; Tan, X.; Zhang, J.; Lan, X. U-YOLO: Improved YOLOv5 for Small Object Detection on UAV-Captured Images. In Proceedings of the Cognitive Computation and Systems; Sun, F., Li, J., Liu, H., Chu, Z., Eds.; Springer: Singapore, 2023; pp. 3–15.
  29. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; IEEE: New York, NY, USA, 2017; pp. 936–944.
  30. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path Aggregation Network for Instance Segmentation. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; IEEE: New York, NY, USA, 2018; pp. 8759–8768.
  31. Ghiasi, G.; Lin, T.Y.; Le, Q.V. NAS-FPN: Learning Scalable Feature Pyramid Architecture for Object Detection. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; IEEE: New York, NY, USA, 2019; pp. 7029–7038.
  32. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and Efficient Object Detection. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; IEEE: New York, NY, USA, 2020; pp. 10778–10787.
  33. Sun, H.; Chen, Y.; Lu, X.; Xiong, S. Decoupled Feature Pyramid Learning for Multi-Scale Object Detection in Low-Altitude Remote Sensing Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 6556–6567.
  34. Wang, C.; Wei, X.; Jiang, X. MA-YOLO: Multi-Scale Information Prediction Network Based on the Multi-Direction Weighted Pyramid for UAV Scene. In 2023 International Joint Conference on Neural Networks (IJCNN), Gold Coast, Australia, 18–23 June 2023; IEEE: New York, NY, USA, 2023; pp. 1–8.
  35. Liu, S.; Zhu, M.; Tao, R.; Ren, H. Fine-Grained Feature Perception for Unmanned Aerial Vehicle Target Detection Algorithm. Drones 2024, 8, 181.
  36. Wang, K.; Liu, Z. BA-YOLO for Object Detection in Satellite Remote Sensing Images. Appl. Sci. 2023, 13, 13122.
  37. Liu, X.; Zheng, Y.; Cai, Y.; Ding, Y.; Li, J.; Kang, W.; Cai, Z. RFHA-YOLO: Dynamic Receptive Field and Adaptive Hybrid Attention for Small-Object Detection in Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2026, 64, 5611813.
  38. Rezatofighi, H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized Intersection Over Union: A Metric and a Loss for Bounding Box Regression. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; IEEE: New York, NY, USA, 2019; pp. 658–666.
  39. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU Loss: Faster and Better Learning for Bounding Box Regression. Proc. AAAI Conf. Artif. Intell. 2020, 34, 12993–13000.
  40. Zheng, Z.; Wang, P.; Ren, D.; Liu, W.; Ye, R.; Hu, Q.; Zuo, W. Enhancing Geometric Factors in Model Learning and Inference for Object Detection and Instance Segmentation. IEEE Trans. Cybern. 2022, 52, 8574–8586.
  41. Zhang, H.; Wang, Y.; Dayoub, F.; Sünderhauf, N. VarifocalNet: An IoU-aware Dense Object Detector. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; IEEE: New York, NY, USA, 2021; pp. 8510–8519.
  42. Feng, C.; Zhong, Y.; Gao, Y.; Scott, M.R.; Huang, W. TOOD: Task-aligned One-stage Object Detection. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; IEEE: New York, NY, USA, 2021; pp. 3490–3499.
  43. Ma, C.; Fu, Y.; Wang, D.; Guo, R.; Zhao, X.; Fang, J. YOLO-UAV: Object Detection Method of Unmanned Aerial Vehicle Imagery Based on Efficient Multi-Scale Feature Fusion. IEEE Access 2023, 11, 126857–126878.
  44. Sun, H.R.; Shi, B.J.; Hu, Y.L. A Lightweight YOLO-Based Model in Small-Object Detection for AAV Optical Sensors. IEEE Sens. J. 2025, 25, 17585–17599.
  45. Di, X.; Cui, K.; Wang, R.F. Toward Efficient UAV-Based Small Object Detection: A Lightweight Network with Enhanced Feature Fusion. Remote Sens. 2025, 17, 2235.
  46. Cheng, Y.; Wang, T.; Zhang, W. Real-Time UAV Small Object Detection: An Efficient Approach Using SGFNet and Dynamic Loss Optimization. Digit. Signal Process. 2026, 168, 105543.
  47. Xu, C.; Wang, J.; Yang, W.; Yu, H.; Yu, L.; Xia, G.S. Detecting Tiny Objects in Aerial Images: A Normalized Wasserstein Distance and a New Benchmark. ISPRS J. Photogramm. Remote Sens. 2022, 190, 79–93.
  48. Yuan, X.; Zheng, Z.; Li, Y.; Liu, X.; Liu, L.; Li, X.; Hou, Q.; Cheng, M.M. Strip R-CNN: Large Strip Convolution for Remote Sensing Object Detection. Proc. AAAI Conf. Artif. Intell. 2026, 40, 12259–12267.
  49. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs Beat YOLOs on Real-time Object Detection. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; IEEE: New York, NY, USA, 2024; pp. 16965–16974.
  50. Jocher, G. YOLOv5 by Ultralytics. 2020. Available online: https://github.com/ultralytics/yolov5 (accessed on 10 June 2026).
  51. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A Single-Stage Object Detection Framework for Industrial Applications. arXiv 2022, arXiv:2209.02976.
  52. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. YOLOX: Exceeding YOLO Series in 2021. arXiv 2021, arXiv:2107.08430.
  53. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; IEEE: New York, NY, USA, 2023; pp. 7464–7475.
  54. Jocher, G.; Chaurasia, A.; Qiu, J. Ultralytics YOLOv8. 2023. Available online: https://github.com/ultralytics/ultralytics (accessed on 12 April 2026).
  55. Wang, C.Y.; Yeh, I.H.; Mark Liao, H.Y. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. In Proceedings of the Computer Vision – ECCV 2024; Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G., Eds.; Springer: Cham, Switzerland, 2025; pp. 1–21.
  56. Jocher, G.; Qiu, J. Ultralytics YOLO11. 2024. Available online: https://github.com/ultralytics/ultralytics (accessed on 10 June 2026).
  57. Chen, Y.; Yuan, X.; Wang, J.; Wu, R.; Li, X.; Hou, Q.; Cheng, M.M. YOLO-MS: Rethinking Multi-Scale Representation Learning for Real-Time Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 4240–4252. [Google Scholar] [CrossRef] [PubMed]
  58. Wang, C.; He, W.; Nie, Y.; Guo, J.; Liu, C.; Wang, Y.; Han, K. Gold-YOLO: Efficient Object Detector via Gather-and-Distribute Mechanism. Adv. Neural Inf. Process. Syst. 2023, 36, 51094–51112. [Google Scholar]
  59. Yang, Y.; Yuan, G.; Li, J. SFFNet: A Wavelet-Based Spatial and Frequency Domain Fusion Network for Remote Sensing Segmentation. IEEE Trans. Geosci. Remote Sens. 2024, 62, 3000617. [Google Scholar] [CrossRef]
Figure 1. Illustration of typical challenges in UAV aerial object detection. Yellow solid boxes indicate scale variations, red dashed ellipses denote densely distributed objects, and blue dashed rectangles highlight complex background interference.
Figure 2. The overall structure of the BDRNet model. The red arrow highlights the retained high-resolution P2 branch.
Figure 3. Structure of the BGSM block.
Figure 4. The schematic diagram of the DRM.
Figure 5. Category and scale distributions of the VisDrone2019 dataset.
Figure 6. Sample images from the VisDrone2019 dataset.
Figure 7. Category and scale distributions of the UAVDT dataset.
Figure 8. Sample images from the UAVDT dataset.
Figure 9. Sensitivity analysis of the BGSM directional convolution kernel size k on the VisDrone2019 dataset. Red-bordered bars indicate the best-performing configuration (k = 7).
Figure 10. Qualitative detection results under different module configurations on the VisDrone2019 dataset.
Figure 11. Qualitative detection results under different module configurations on the UAVDT dataset.
Figure 12. Per-class AP@0.5 comparison on the VisDrone2019 dataset.
Figure 13. Comparison of PR curves on the VisDrone2019 dataset.
Figure 14. Comparison of F1 curves on the VisDrone2019 dataset.
Figure 15. Confusion matrix comparison between YOLOv10s and BDRNet on the VisDrone2019 dataset.
Figure 16. Grad-CAM activation maps of YOLOv10s (top) and BDRNet (bottom) on representative VisDrone2019 scenes.
Figure 17. Qualitative detection comparison between YOLOv10s and BDRNet on representative UAVDT scenes, including foggy conditions, nighttime illumination, distant small vehicles, and dense traffic scenarios. Red boxes and dashed red lines indicate zoomed comparison regions.
Table 1. Experimental configuration of the computing platform.
Parameter | Configuration
CPU | Intel Core i9-14900K
GPU | RTX 5070
Operating system | Ubuntu 20.04 LTS
Deep learning framework | PyTorch 2.0.6
GPU accelerator | CUDA 13.0
Integrated development environment | PyCharm 2025.2.2
Programming language | Python 3.11
Table 2. Experimental configuration of training hyperparameters.
Parameter | Configuration
Optimizer | SGD
Learning rate | 0.001
LR scheduler | Cosine Annealing
Image size | 640 × 640
Batch size | 32
Workers | 8
Epochs | 200
Weight decay | 0.0005
Momentum | 0.937
Close mosaic | Last 10 epochs
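To make the schedule in Table 2 concrete, the following is a minimal sketch of the standard closed-form cosine annealing rule applied to the listed learning rate and epoch count, together with the "close mosaic" cutoff. The floor value `ETA_MIN`, the per-epoch stepping, and the helper names `cosine_lr` and `mosaic_enabled` are assumptions made for illustration; the table does not state them, and the training framework used by the authors may apply warm-up or a different final-rate factor.

```python
import math

# Values from Table 2.
BASE_LR = 0.001
EPOCHS = 200
CLOSE_MOSAIC_LAST = 10

# Assumption: the schedule decays to zero; the source does not give a floor.
ETA_MIN = 0.0


def cosine_lr(epoch: int, base_lr: float = BASE_LR,
              total: int = EPOCHS, eta_min: float = ETA_MIN) -> float:
    """Learning rate at 0-indexed `epoch` under closed-form cosine annealing."""
    cosine = 1.0 + math.cos(math.pi * epoch / total)
    return eta_min + 0.5 * (base_lr - eta_min) * cosine


def mosaic_enabled(epoch: int, total: int = EPOCHS,
                   close_last: int = CLOSE_MOSAIC_LAST) -> bool:
    """True while mosaic augmentation is still active (hypothetical helper)."""
    return epoch < total - close_last


if __name__ == "__main__":
    for e in (0, 50, 100, 150, 199):
        print(f"epoch {e:3d}  lr={cosine_lr(e):.6f}  mosaic={mosaic_enabled(e)}")
```

Under these assumptions the rate starts at 0.001, reaches half that value at epoch 100, and mosaic augmentation is switched off from epoch 190 onward.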
Table 3. Ablation experiments on the VisDrone2019 dataset. Bold and underlined values indicate the best and second-best results, respectively.
Model | Params (M) | FLOPs (G) | P (%) | R (%) | mAP50 (%) | mAP50-95 (%)
Baseline | 7.25 | 21.6 | 44.0 | 33.1 | 31.4 | 17.5
A | 7.39 | 22.3 | 45.1 | 33.6 | 32.3 | 18.2
B | 7.14 | 21.2 | 44.6 | 33.8 | 31.7 | 17.9
C | 7.25 | 21.6 | 45.2 | 34.3 | 32.3 | 18.3
D | 7.27 | 21.7 | 45.3 | 34.5 | 32.8 | 18.4
Ours | 7.27 | 21.7 | 45.8 | 34.9 | 33.3 | 18.8
Module columns (BAFE, DSRP, SGNW): the per-configuration selection marks are not preserved in this text.
Table 4. Comparison of bounding box regression loss functions on the VisDrone2019 dataset. † denotes the proposed loss. Bold and underlined values indicate the best and second-best results, respectively.
Loss Function | P (%) | R (%) | mAP50 (%) | mAP_S (%)
CIoU | 44.0 | 33.1 | 31.4 | 16.2
EIoU | 44.4 | 33.3 | 31.5 | 16.5
NWD | 44.7 | 34.0 | 31.8 | 16.8
SGNW † | 45.2 | 34.3 | 32.3 | 17.2
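As background for the NWD row in Table 4, the sketch below computes the plain normalized Wasserstein distance between two axis-aligned boxes modeled as 2D Gaussians, and contrasts it with IoU for a small offset between tiny boxes. It illustrates only the baseline NWD idea, not the proposed SGNW loss: the aspect-ratio weighting and scale-aware fusion are not reproduced here. The normalizing constant `c` is dataset dependent, and the default used below is an assumed placeholder, as are the function names.

```python
import math


def nwd(box_a, box_b, c: float = 12.8) -> float:
    """Normalized Wasserstein distance for (cx, cy, w, h) boxes.

    Each box is treated as a Gaussian with mean (cx, cy) and standard
    deviations (w/2, h/2). `c` is an assumed normalizing constant.
    """
    cxa, cya, wa, ha = box_a
    cxb, cyb, wb, hb = box_b
    w2_squared = ((cxa - cxb) ** 2 + (cya - cyb) ** 2
                  + ((wa - wb) / 2) ** 2 + ((ha - hb) / 2) ** 2)
    return math.exp(-math.sqrt(w2_squared) / c)


def iou(box_a, box_b) -> float:
    """Plain IoU for (cx, cy, w, h) boxes, included only for contrast."""
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    tiny_gt = (10.0, 10.0, 4.0, 4.0)
    tiny_pred = (14.0, 10.0, 4.0, 4.0)  # shifted by one box width
    print("IoU:", iou(tiny_gt, tiny_pred))
    print("NWD:", nwd(tiny_gt, tiny_pred))
```

For a 4 × 4 box shifted by its own width, IoU drops to zero and gives no ranking signal, whereas the Gaussian-based distance still varies smoothly with the offset. This is the kind of behavior that motivates Wasserstein-style regression terms for very small objects.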
Table 5. Ablation experiments on the UAVDT dataset. Bold and underlined values indicate the best and second-best results, respectively.
Model | Params (M) | FLOPs (G) | P (%) | R (%) | mAP50 (%) | mAP50-95 (%)
Baseline | 7.25 | 21.6 | 89.2 | 76.8 | 63.8 | 38.5
A | 7.39 | 22.3 | 90.9 | 77.7 | 64.7 | 39.5
B | 7.14 | 21.2 | 88.9 | 77.4 | 64.2 | 39.0
C | 7.25 | 21.6 | 89.5 | 77.1 | 64.6 | 39.3
D | 7.27 | 21.7 | 90.2 | 78.3 | 65.2 | 40.2
Ours | 7.27 | 21.7 | 90.7 | 77.8 | 65.9 | 40.8
Module columns (BAFE, DSRP, SGNW): the per-configuration selection marks are not preserved in this text.
Table 6. Comparison of experimental results on the VisDrone2019 dataset.
Model | Params (M) | FLOPs (G) | FPS | Precision | Recall | mAP@0.5 | mAP@0.5:0.95
YOLOv10s [11] | 7.3 | 21.6 | 151 | 0.44 | 0.33 | 0.314 | 0.175
RT-DETR [49] | 30.0 | 136.0 | 74 | 0.50 | 0.39 | 0.345 | 0.198
YOLOv5s [50] | 9.1 | 23.8 | 137 | 0.42 | 0.33 | 0.308 | 0.174
YOLOv6s [51] | 16.3 | 44.0 | 74 | 0.42 | 0.31 | 0.295 | 0.169
YOLOX-S [52] | 9.0 | 26.8 | 122 | 0.42 | 0.32 | 0.296 | 0.161
YOLOv7-tiny [53] | 6.2 | 13.8 | 220 | 0.41 | 0.32 | 0.294 | 0.159
YOLOv8s [54] | 11.1 | 28.5 | 114 | 0.44 | 0.33 | 0.313 | 0.178
YOLOv9s [55] | 7.2 | 26.7 | 122 | 0.44 | 0.33 | 0.312 | 0.179
YOLOv11s [56] | 9.4 | 21.3 | 153 | 0.44 | 0.33 | 0.309 | 0.176
YOLO-MS-S [57] | 8.5 | 24.6 | 133 | 0.44 | 0.33 | 0.312 | 0.174
Gold-YOLO-S [58] | 8.9 | 30.1 | 108 | 0.44 | 0.33 | 0.317 | 0.178
SFFNet [59] | 6.3 | 24.0 | 136 | 0.47 | 0.35 | 0.326 | 0.183
BDRNet (Ours) | 7.3 | 21.7 | 148 | 0.48 | 0.37 | 0.333 | 0.188
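The accuracy and efficiency trade-off in Table 6 can be read off with simple arithmetic. The sketch below takes the YOLOv10s and BDRNet rows as printed in the table and computes absolute and relative changes. The dictionary layout and the helper names `delta` and `rel_change` are illustrative assumptions; the numbers themselves are copied from the table.

```python
# Rows copied from Table 6 (VisDrone2019).
yolov10s = {"params_m": 7.3, "gflops": 21.6, "fps": 151,
            "map50": 0.314, "map50_95": 0.175}
bdrnet = {"params_m": 7.3, "gflops": 21.7, "fps": 148,
          "map50": 0.333, "map50_95": 0.188}


def delta(key: str) -> float:
    """Absolute change of BDRNet relative to the YOLOv10s baseline."""
    return bdrnet[key] - yolov10s[key]


def rel_change(key: str) -> float:
    """Relative change in percent of the baseline value."""
    return 100.0 * delta(key) / yolov10s[key]


if __name__ == "__main__":
    for key in ("map50", "map50_95", "gflops", "fps"):
        print(f"{key:9s} delta={delta(key):+.3f}  relative={rel_change(key):+.1f}%")
```

On these figures, mAP@0.5 rises by 0.019 (about 6.1% relative) and mAP@0.5:0.95 by 0.013 (about 7.4% relative), while FLOPs grow by 0.1 G and throughput drops by 3 FPS (about 2.0%).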

MDPI and ACS Style

Zheng, X.; Shao, F.; Liu, Q.; Dai, J.; Yue, Y.; Zhang, T.; Chen, C. BDRNet: Background-Aware Dynamic-Scale Routing Network for UAV Remote Sensing Object Detection. Remote Sens. 2026, 18, 1987. https://doi.org/10.3390/rs18121987