4.3.1. Performance Assessment
To comprehensively evaluate the effectiveness of FDE-YOLO, we conduct extensive comparisons with state-of-the-art object detection methods on the VisDrone2019 dataset under identical experimental conditions. The baseline methods include a representative two-stage detector (Faster R-CNN), classical one-stage detectors from the YOLO family (YOLOv5s, YOLOX, YOLOv8s, YOLOv10s, YOLOv11s, YOLOv12s, and YOLO26), transformer-based architectures (RT-DETR and D-Fine), and UAV-specific detection methods (Drone-YOLO and MASF-YOLO). All models are evaluated with consistent metrics: precision, recall, mAP50, mAP50:95, parameter count, and inference speed. The comprehensive comparison results are presented in Table 2.
The experimental results demonstrate that FDE-YOLO achieves superior detection performance while maintaining a favorable balance between accuracy and computational efficiency. Specifically, FDE-YOLO attains 53.6% precision, 42.5% recall, 43.3% mAP50, and 26.3% mAP50:95, improvements of 2.8%, 4.4%, 4.1%, and 2.8%, respectively, over the baseline YOLOv11s. Compared with the UAV-specialized Drone-YOLO and MASF-YOLO, FDE-YOLO achieves 2.8% and 0.2% higher mAP50, respectively, while reducing model parameters by 37.5% relative to Drone-YOLO (10.25 M vs. 16.4 M) and 29.8% relative to MASF-YOLO (10.25 M vs. 14.6 M), demonstrating the effectiveness of our lightweight architectural innovations. Against the other YOLO variants (YOLOv5s, YOLOX, YOLOv8s, YOLOv10s, YOLOv12s, and the recently proposed YOLO26), FDE-YOLO consistently leads across all evaluation metrics, validating its robustness for small target detection in aerial imagery. Among the transformer-based architectures, D-Fine exhibits the highest overall performance with 49.3% mAP50 and 30.3% mAP50:95, followed by RT-DETR with 44.8% mAP50 and 27.4% mAP50:95. Although FDE-YOLO trails them in accuracy (a 6.0% mAP50 gap to D-Fine and a 1.5% gap to RT-DETR), it requires only 35.7% of D-Fine’s parameters (10.25 M vs. 28.7 M) and 31.2% of RT-DETR’s (10.25 M vs. 32.8 M), highlighting the efficiency of our approach for resource-constrained UAV deployment, where computational budget and real-time performance are critical constraints.
4.3.2. Ablation Study
To verify the effectiveness of the proposed modules, ablation experiments were conducted on the VisDrone2019 dataset with YOLOv11s as the baseline. The contributions of the improved Fine-Grained Detection Pyramid (FGDP), the proposed DDFHead detection head, and EdgeSpaceNet (which replaces C3k2) were analyzed in turn. The experimental results are shown in Table 3.
First, YOLOv11s-F improves the original neck structure to enhance small-target detection accuracy: precision (Pre), recall (R), mAP50, and mAP50:95 increase by 2.3%, 3.2%, 3.4%, and 2.2%, respectively, while the network size grows by only 0.66 M parameters. Building on this, YOLOv11s-FD introduces the dynamic detection head to strengthen feature extraction and improve recognition of targets in different complex environments, adding a further 0.1% precision, 0.4% recall, 0.5% mAP50, and 0.4% mAP50:95. Finally, YOLOv11s-FDE replaces the traditional C3k2 convolution structure, combining edge and spatial information extraction to enhance image feature representation; this adds 0.4%, 0.8%, 0.2%, and 0.2% to the four metrics while reducing the parameter count (P) by 0.25 M. Compared with the original YOLOv11s, the final FDE-YOLO improves precision by 2.8% and recall by 4.4%, while mAP50 and mAP50:95 rise by 4.1% and 2.8%, respectively, a clear gain in detection performance.
A per-category comparison between YOLOv11s and the improved algorithm shows that the improved model achieves significant accuracy gains across all categories, most notably a 9.5% improvement for the Bus class. Although the Bicycle and Awning Tricycle classes carry limited feature information and are therefore difficult to detect, their accuracy also improved by 5.5% and 3.6%, respectively. The precision–recall curves for the two models are presented separately in Figure 7 and Figure 8 for clarity and detailed analysis.
The ablation experiments demonstrate that feature fusion mechanisms fundamentally contribute to improved object detection performance in aerial images. Among the proposed modules, the Fine-Grained Detection Pyramid (FGDP) exhibits the most substantial impact on detection accuracy. As shown in Table 3, introducing FGDP alone (YOLOv11s-F) improves mAP50 by 3.4% and mAP50:95 by 2.2%, accounting for 82.9% and 78.6% of the total improvement, respectively. This substantial gain stems from FGDP’s multi-scale feature fusion strategy, which integrates features across multiple granularity levels: local details through depthwise convolutions, anisotropic context via cross-shaped dilated convolutions, and global semantics through gated aggregation. Because small targets in UAV aerial images often span only 5–20 pixels, they are highly sensitive to feature scale. The multi-scale fusion architecture preserves subtle appearance cues at fine scales while semantic context from coarse scales provides discrimination against background clutter, which is particularly critical in complex aerial scenes with dense small objects.
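To make the anisotropic-context idea concrete, the sketch below implements a cross-shaped dilated convolution over a single-channel map in NumPy: each output pixel aggregates taps spaced d pixels apart along its row and along its column. The function name and tap layout are our own illustration, not the paper’s implementation.

```python
import numpy as np

def cross_dilated_conv(x, w_h, w_v, d=3):
    """Cross-shaped dilated convolution sketch for one channel.
    w_h / w_v: (2k+1) taps along the horizontal / vertical arm,
    sampled every d pixels (hypothetical layout, for illustration)."""
    k = (len(w_h) - 1) // 2
    h, w = x.shape
    pad = k * d                       # enough padding to reach the outermost tap
    xp = np.pad(x, pad, mode="edge")
    out = np.zeros((h, w))
    for t in range(-k, k + 1):
        # horizontal arm: shift the padded map left/right by t*d columns
        out += w_h[t + k] * xp[pad:pad + h, pad + t * d:pad + t * d + w]
        # vertical arm: shift the padded map up/down by t*d rows
        out += w_v[t + k] * xp[pad + t * d:pad + t * d + h, pad:pad + w]
    return out
```

With k = 1 and d = 3 the receptive field spans 7 pixels along each axis using only 6 taps, which is what makes the cross shape attractive for elongated targets such as vehicles.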
Beyond multi-scale integration, the hierarchical feature fusion architecture in CSP-MFE demonstrates how gradient flow optimization enhances representation learning. By splitting features into 25% shallow and 75% deep branches, the module enables parallel processing of features at different abstraction levels before fusion. This design alleviates gradient vanishing issues while allowing the network to jointly optimize multi-granularity features during backpropagation. The adaptive feature recalibration mechanism in the global branch further refines the fused features by learning channel-wise importance weights, suppressing irrelevant channels while amplifying discriminative ones. In aerial images where background patterns such as roads and buildings often exhibit similar textures to small vehicles or pedestrians, this channel-wise recalibration following fusion significantly improves feature discriminability, contributing to the 3.2% recall improvement observed in the ablation study.
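The split-then-recalibrate flow described above can be sketched in NumPy as follows. The branch operations are stand-ins (a plain ReLU for the deep branch) and the gating is a simplified SE-style squeeze, so this illustrates the data flow of CSP-MFE rather than the exact module.

```python
import numpy as np

def csp_mfe_sketch(x, shallow_ratio=0.25):
    """CSP-style split-and-recalibrate sketch (illustrative, not the paper's
    exact implementation). x: (C, H, W) feature map."""
    c = x.shape[0]
    cs = max(1, int(c * shallow_ratio))
    shallow, deep = x[:cs], x[cs:]          # 25/75 channel split
    deep = np.maximum(deep, 0.0)            # stand-in for the deep branch's processing
    fused = np.concatenate([shallow, deep], axis=0)
    # SE-style channel recalibration: global average pool -> sigmoid gate
    gap = fused.mean(axis=(1, 2))
    gate = 1.0 / (1.0 + np.exp(-gap))
    return fused * gate[:, None, None]
```

The key design point is that the gate is computed after fusion, so channel importance is learned over the joint shallow-plus-deep representation rather than within each branch separately.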
The fusion of edge and spatial features through EdgeSpaceNet provides another dimension of complementary information essential for precise localization. The ablation results indicate that adding EdgeSpaceNet (YOLOv11s-FDE) contributes an additional 0.8% recall improvement beyond FGDP and DDFHead. This gain is attributed to the explicit extraction and fusion of boundary information via Sobel operators with spatial contextual features. Small targets in aerial images frequently suffer from blurred boundaries due to motion blur, atmospheric effects, or low resolution. By fusing edge-enhanced features with spatial convolution outputs through residual connections, EdgeSpaceNet recovers boundary details typically lost in standard convolutions, enabling more accurate bounding box regression. The synergistic integration of multi-level feature fusion (FGDP), hierarchical channel fusion (CSP-MFE), and boundary-spatial fusion (EdgeSpaceNet) ultimately achieves 4.1% mAP50 improvement, demonstrating that comprehensive feature fusion across scales, channels, and feature types is fundamental to robust small target detection in challenging aerial scenarios.
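The boundary-spatial fusion can be illustrated with a minimal NumPy sketch: the Sobel pair extracts a gradient magnitude map, which is added back to the input feature through a residual connection. The real EdgeSpaceNet module additionally includes learned convolutions; this sketch captures only the core edge-plus-residual idea on a single-channel map.

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], float)
SOBEL_Y = SOBEL_X.T

def conv2d_same(img, k):
    """3x3 'same' convolution with edge padding (cross-correlation form)."""
    p = np.pad(img, 1, mode="edge")
    h, w = img.shape
    out = np.zeros((h, w))
    for i in range(3):
        for j in range(3):
            out += k[i, j] * p[i:i + h, j:j + w]
    return out

def edge_space_fuse(feat):
    """EdgeSpaceNet-style sketch: fuse the Sobel edge magnitude with the
    spatial feature via a residual connection (illustrative only)."""
    gx = conv2d_same(feat, SOBEL_X)
    gy = conv2d_same(feat, SOBEL_Y)
    edge = np.sqrt(gx ** 2 + gy ** 2)
    return feat + edge                 # residual fusion of edge and spatial cues
```

On a step edge the residual term is zero in flat regions and large at the boundary, which is exactly the behavior that helps bounding-box regression around blurred targets.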
To verify the precision and efficiency of the DDFHead detection head designed in this paper, comparison experiments against the Dynamic Head detection head were designed. The experiments use the VisDrone2019 dataset with a fixed test image size and YOLOv11s as the baseline model, substituting our designed detection head and the Dynamic Head detection head in turn. Experimental results are shown in Table 4. As the table shows, the detection head designed in this paper achieves improvements of 0.8%, 0.5%, 1.3%, and 1.3% in precision (Pre), recall (R), mAP50, and mAP50:95, respectively, over the Dynamic Head detection head, demonstrating that DDFHead improves overall detection performance relative to Dynamic Head.
To verify the superiority of the Sobel operator in the EdgeSpaceNet module designed in this paper, comparison experiments integrated EdgeSpaceNet with different edge detection operators (Sobel, Prewitt, and Roberts) in place of the traditional C3k2 convolution module in YOLOv11s, again on the VisDrone2019 dataset with a fixed test image size. Experimental results are shown in Table 5. As the table shows, the Sobel-based EdgeSpaceNet improves precision (Pre), recall (R), mAP50, and mAP50:95 by 2.6%, 2.8%, 2.4%, and 2.4%, respectively, over the Prewitt-based variant, and by 5.2%, 5.4%, 5.1%, and 4.2%, respectively, over the Roberts-based variant, demonstrating the superiority of the Sobel operator for edge feature extraction in EdgeSpaceNet.
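For reference, the three compared operators have the following standard kernel definitions. The toy computation below evaluates each operator’s gradient magnitude at the center of an ideal vertical step edge; it illustrates the kernels themselves (the Sobel pair weights the center row/column more heavily), not the detection experiments reported above.

```python
import numpy as np

# Standard first-order edge-detection kernel pairs (x- and y-direction)
SOBEL   = (np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], float),
           np.array([[-1, -2, -1], [0, 0, 0], [1, 2, 1]], float))
PREWITT = (np.array([[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]], float),
           np.array([[-1, -1, -1], [0, 0, 0], [1, 1, 1]], float))
ROBERTS = (np.array([[1, 0], [0, -1]], float),
           np.array([[0, 1], [-1, 0]], float))

def response_at_step(kx, ky):
    """Gradient magnitude of an operator at the center of an ideal
    vertical step edge (0 -> 1 between columns 2 and 3 of a 5x5 image)."""
    img = np.zeros((5, 5))
    img[:, 3:] = 1.0
    kh, kw = kx.shape
    patch = img[2:2 + kh, 2:2 + kw]      # window centered on the edge
    gx = float((kx * patch).sum())
    gy = float((ky * patch).sum())
    return (gx ** 2 + gy ** 2) ** 0.5
```

On this toy edge the responses are 4.0 (Sobel), 3.0 (Prewitt), and √2 (Roberts): the center-weighted Sobel taps and its 3×3 support yield the strongest, most noise-tolerant response, consistent with its use in EdgeSpaceNet.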
4.3.3. Parameter Sensitivity Analysis
To validate the robustness of FDE-YOLO and justify the parameter choices adopted in our experiments, we conduct comprehensive sensitivity analysis on four critical parameters: the CSP branch ratio in the CSP-MFE module, the dilation rate in the cross-shaped dilated convolution, the initial learning rate, and the batch size. These parameters directly influence the model’s feature representation capability, gradient flow optimization, and convergence behavior. For each parameter, we systematically vary its value while keeping all other configurations constant, evaluating the impact on detection performance using mAP50 and mAP50:95 on the VisDrone2019 validation set. The comprehensive sensitivity analysis results are presented in Figure 9, which reveals the relationship between parameter configurations and detection accuracy across different architectural and training dimensions.
The sensitivity curves presented in Figure 9 demonstrate that our adopted parameter configurations achieve optimal or near-optimal performance across all evaluated settings. For the CSP branch ratio, the 25/75 allocation achieves peak performance, substantially outperforming alternative configurations: the 20/80 ratio shows marginally lower performance due to insufficient shallow feature representation, while higher ratios (30/70, 35/65, and 40/60) exhibit progressive degradation as reduced deep branch capacity limits multi-scale extraction. For the dilation rate in CCDC, d = 3 achieves optimal performance with d = 2 yielding highly competitive results, while smaller rates provide insufficient contextual coverage for elongated targets and larger rates introduce excessive spatial gaps that degrade localization accuracy. The learning rate analysis reveals that 0.01 optimally balances convergence speed and final accuracy, with 0.005 producing comparable but slightly lower results that require longer training, whereas higher rates cause training instability and notable performance degradation. Similarly, batch size 32 achieves peak performance, with size 16 maintaining competitive accuracy at the cost of noisier gradient estimates, while larger batches exhibit diminishing returns. The relatively smooth sensitivity curves around the optimal configurations indicate stable performance under moderate parameter variations, suggesting practical robustness without exhaustive hyperparameter tuning. Notably, all evaluated configurations consistently surpass the baseline YOLOv11s on both metrics, validating that the architectural innovations in FGDP, DDFHead, and EdgeSpaceNet provide fundamental improvements beyond mere hyperparameter optimization, while also giving practitioners guidance for adapting FDE-YOLO to specific application requirements or hardware constraints.
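The one-at-a-time protocol used above (vary one parameter, hold the rest at the adopted defaults) can be sketched as a small sweep harness. Here `evaluate` is a hypothetical user-supplied routine that trains and validates a model for a given configuration and returns its mAP50; the harness itself only organizes the sweep.

```python
def grid_sweep(evaluate, grid):
    """One-at-a-time sensitivity sweep.
    grid: {param_name: [adopted_value, alt1, alt2, ...]} -- the first entry
    of each list is treated as the adopted default.
    Returns {(param_name, value): score}."""
    defaults = {name: values[0] for name, values in grid.items()}
    results = {}
    for name, values in grid.items():
        for v in values:
            cfg = dict(defaults)
            cfg[name] = v                      # perturb exactly one parameter
            results[(name, v)] = evaluate(cfg)
    return results
```

This isolates each parameter’s effect, which is why the curves in Figure 9 can be read independently per axis; it does not capture interactions between parameters, a standard limitation of one-at-a-time analysis.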
4.3.5. Generalization Analysis
To rigorously assess the generalization capability and cross-domain robustness of FDE-YOLO, we conduct comprehensive validation experiments on two additional benchmark datasets with characteristics distinct from the primary VisDrone2019 training corpus. The UAV-DT dataset comprises UAV imagery captured under diverse urban and open-space scenarios with varying illumination conditions, camera angles, and target densities, providing a stringent test of model adaptability to heterogeneous aerial imaging conditions. The NWPU VHR-10 dataset, established by Northwestern Polytechnical University, contains high-resolution satellite remote sensing images covering ten object categories including aircraft, ships, and storage tanks, characterized by significantly different imaging modalities, target scales, and background complexity compared to low-altitude UAV footage. These cross-dataset evaluations assess whether the architectural innovations in FDE-YOLO learn generalizable feature representations rather than dataset-specific patterns, a critical consideration for real-world deployment where imaging conditions inevitably deviate from training distributions.
The generalization performance on the UAV-DT dataset, presented in Table 6, demonstrates FDE-YOLO’s robust transferability to unseen aerial imaging scenarios. FDE-YOLO achieves 93.1% precision, 90.8% recall, 96.0% mAP50, and 67.1% mAP50:95, outperforming the baseline YOLOv11s by 0.5%, 3.0%, 1.6%, and 4.9%, respectively. The 3.0% recall improvement and 4.9% mAP50:95 gain are particularly significant, indicating that FDE-YOLO’s multi-scale feature fusion and edge-spatial feature extraction mechanisms capture diverse target appearances and localization cues that generalize beyond the training domain. Compared with the UAV-specialized Drone-YOLO, FDE-YOLO achieves 1.9% higher mAP50 (96.0% vs. 94.1%), validating that our architectural designs yield superior cross-scenario adaptability. The consistent gains across precision, recall, and mAP metrics suggest that FDE-YOLO balances false positive suppression and missed detection reduction under varying environmental conditions, demonstrating robust generalization to diverse UAV imaging scenarios beyond the VisDrone2019 training distribution.
The cross-domain generalization to satellite remote sensing imagery, evaluated on the NWPU VHR-10 dataset (Table 7), further substantiates FDE-YOLO’s capability to transfer learned representations across drastically different imaging modalities. Despite the substantial domain shift between low-altitude UAV footage (training domain) and high-altitude satellite imagery (test domain), which encompasses differences in spatial resolution, viewing geometry, atmospheric effects, and target appearance, FDE-YOLO maintains strong detection performance with 95.6% precision, 82.6% recall, 94.6% mAP50, and 63.9% mAP50:95. Compared to the baseline YOLOv11s, FDE-YOLO achieves improvements of 0.7%, 0.8%, 1.5%, and 3.0% across the four metrics, with the 3.0% mAP50:95 gain particularly noteworthy as it reflects improved localization accuracy under stricter IoU thresholds. The 2.2% mAP50 advantage over Drone-YOLO (94.6% vs. 92.4%) underscores the superior generalizability of FDE-YOLO’s feature extraction and fusion strategies. Collectively, the consistent improvements on both UAV-DT and NWPU VHR-10, spanning diverse imaging conditions, target categories, and domain characteristics, provide compelling evidence that FDE-YOLO’s architectural innovations (the multi-granularity feature fusion in FGDP, the adaptive attention mechanisms in DDFHead, and the edge-spatial feature integration in EdgeSpaceNet) learn robust, generalizable representations rather than overfitting to a single dataset’s characteristics, ensuring reliable deployment across heterogeneous real-world aerial detection scenarios.