For qualitative visualization in this section, YOLO11s is adopted as the representative baseline because FCML-YOLO is built on the YOLO11s architecture. The visual comparisons illustrate typical detection differences between YOLO11s and FCML-YOLO under challenging aerial imaging conditions. In each figure, red boxes indicate regions where YOLO11s produces missed detections, and the same regions are marked in the corresponding image columns to facilitate direct comparison.
4.6.1. Experiments on VisDrone
To further demonstrate the competitive performance of the proposed FCML-YOLO model, we compared it on the VisDrone dataset with several state-of-the-art object detection algorithms, namely YOLOv5s, YOLOv8s, YOLOv10s, YOLO11s, YOLOv12s, FFCA-YOLO, DDSC-YOLO, CS-YOLOv8, and Drone-YOLO. The results for DDSC-YOLO, CS-YOLOv8, and Drone-YOLO are cited directly from their original publications because their official implementations are not publicly available.
As shown in Table 7 and Figure 10, the proposed detection model achieves notable advantages in overall performance. Among the baselines, YOLO11s shows slightly lower precision than YOLOv5s, YOLOv8s, and YOLOv10s, but it outperforms the other YOLO-series baselines in recall, mAP50, and mAP50:95, demonstrating a clear advantage in detection accuracy. FCML-YOLO not only significantly improves all evaluation metrics over the YOLO baseline models but also outperforms advanced models such as DDSC-YOLO, CS-YOLOv8, and Drone-YOLO in precision, recall, mAP50, and mAP50:95. While delivering this accuracy, FCML-YOLO remains lightweight, with only 3.52 M parameters, approximately 62.6% and 67.7% fewer than YOLO11s and Drone-YOLO, respectively, which indicates strong potential for real-time applications.
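The parameter reductions quoted above follow from a relative-difference calculation. The following minimal sketch is illustrative only: it assumes the percentages are computed as (baseline - proposed) / baseline, and it back-computes the implied baseline parameter counts from the figures stated in the text, since those counts are not listed in this section. The function names are hypothetical.

```python
def relative_reduction(baseline: float, proposed: float) -> float:
    """Percentage by which `proposed` is smaller than `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - proposed) / baseline * 100.0


def implied_baseline(proposed: float, reduction_pct: float) -> float:
    """Invert relative_reduction: baseline size implied by a stated reduction."""
    if not 0.0 <= reduction_pct < 100.0:
        raise ValueError("reduction_pct must be in [0, 100)")
    return proposed / (1.0 - reduction_pct / 100.0)


# Values taken from the text: FCML-YOLO parameter count (in millions)
# and the stated reductions relative to each reference model.
FCML_PARAMS_M = 3.52
STATED_REDUCTIONS = {"YOLO11s": 62.6, "Drone-YOLO": 67.7}

for model, pct in STATED_REDUCTIONS.items():
    # Assumption: the implied count is derived here, not reported by the authors.
    base = implied_baseline(FCML_PARAMS_M, pct)
    print(f"{model}: implied baseline of about {base:.2f} M parameters")
```

Under this reading, the implied YOLO11s size is close to the roughly 9.4 M parameters commonly reported for that model, which suggests the stated percentages are relative reductions in parameter count.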
Table 8 compares the detection performance of FCML-YOLO and multiple mainstream object detection algorithms across the individual categories of the VisDrone dataset. The results show that our model yields improvements across all 10 target classes. Compared to popular models such as YOLOv8s and YOLOv10s, FCML-YOLO exhibits significant performance gains in small-object detection, particularly for classes such as pedestrian, people, bicycle, and motorcycle. For instance, compared with YOLO11s, the mAP50 for people detection increased by 10.3%, and for motorcycle detection by 9.2%, demonstrating the improved accuracy of the proposed model on densely clustered and small objects. For visually similar categories, such as truck and bus, FCML-YOLO improves mAP50 by 4.3% and 3.7%, respectively, over the baseline, further validating its ability to distinguish similar-looking targets.
Figure 11 illustrates qualitative comparisons between FCML-YOLO and the baseline YOLO11s model across diverse real-world detection contexts. In occlusion scenarios (see Figure 11a), the lightweight multi-level feature fusion module equips the model with stronger associative reasoning capabilities, allowing it to accurately detect vehicle targets even when partially obscured by branches, with noticeably improved bounding box completeness. In small-scale object scenes (Figure 11b), FCML-YOLO demonstrates heightened sensitivity through shallow feature enhancement mechanisms, successfully detecting small pedestrian targets at the far end of a sports field that the baseline model failed to capture. Under low-light conditions (Figure 11c), the improved model leverages cross-level feature interactions to mitigate edge blurring caused by uneven illumination, enabling the detection of previously missed motorcycles in dark areas and producing more precise contour fits for vehicles. In crowded scenes (Figure 11d), FCML-YOLO produces noticeably fewer false negatives and false positives for distant pedestrians than YOLO11s, achieving more accurate differentiation of individuals in high-density areas and generating bounding boxes that better conform to the actual object distribution. These visualization results illustrate the superior performance of FCML-YOLO across representative remote sensing scenarios and further highlight its robustness in complex aerial environments.
4.6.2. Comparison of Performance Across Different Datasets
The generalizability of a model is an important indicator for evaluating its robustness and adaptability across heterogeneous scenarios. In this study, we systematically evaluate the proposed FCML-YOLO model on four representative datasets, namely USOD, DOTA, TinyPerson, and SARD. These datasets cover remote-sensing vehicle detection, multi-scale aerial object detection, tiny pedestrian detection, and UAV-based search-and-rescue person detection, respectively. As reported in Table 9, FCML-YOLO consistently outperforms representative baseline detectors, including YOLOv5s, YOLOv8s, YOLOv10s, YOLO11s, and YOLOv12s.
On the USOD dataset, FCML-YOLO achieves a precision of 92.1% and an mAP50 value of 91.8%, surpassing all other models. Specifically, it improves precision by 1.1% over YOLO11s and mAP50 by 1.5% over YOLOv8s, which are the strongest corresponding baselines for these two metrics. On the DOTA dataset, which is characterized by dense target distributions and significant scale variations, FCML-YOLO improves mAP50 by 4.0%, 3.4%, 6.2%, 4.2%, and 4.9% over YOLOv5s, YOLOv8s, YOLOv10s, YOLO11s, and YOLOv12s, respectively. These results further validate the model’s strong feature representation capability across targets of varying scales. On the TinyPerson dataset, which primarily includes tiny pedestrian targets, FCML-YOLO achieves mAP50 improvements of 6.6%, 6.3%, 6.6%, 6.1%, and 7.1% over YOLOv5s, YOLOv8s, YOLOv10s, YOLO11s, and YOLOv12s, respectively, underscoring the model’s superior efficacy in tiny-object detection tasks.
Furthermore, on the SARD dataset, FCML-YOLO achieves the best performance across all metrics, with 96.6% precision, 92.2% recall, 96.5% mAP50, and 60.7% mAP50:95. Compared with the strongest baseline for each metric, FCML-YOLO improves precision, recall, mAP50, and mAP50:95 by 2.1%, 3.5%, 2.9%, and 3.7%, respectively. In particular, the notable recall improvement indicates that FCML-YOLO can more effectively detect human targets in UAV-based search-and-rescue scenarios, where reducing missed detections is critical. Overall, the results across the four datasets demonstrate that FCML-YOLO maintains strong generalization capability in both general aerial small-object detection and UAV-based emergency search-and-rescue person detection.
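The improvements over the strongest baseline can be read in two ways: as absolute differences in percentage points or as relative gains. The sketch below, which uses only the SARD values stated above, back-computes the baseline score implied by each reading. It is an illustrative aid rather than part of the reported evaluation; the function names are hypothetical, and which reading applies is an assumption that should be checked against Table 9.

```python
# SARD results and stated gains, copied from the text (all in percent).
FCML_SARD = {"precision": 96.6, "recall": 92.2, "mAP50": 96.5, "mAP50:95": 60.7}
STATED_GAIN = {"precision": 2.1, "recall": 3.5, "mAP50": 2.9, "mAP50:95": 3.7}


def implied_baseline_points(score: float, gain: float) -> float:
    """Baseline score if `gain` is an absolute difference in percentage points."""
    return score - gain


def implied_baseline_relative(score: float, gain_pct: float) -> float:
    """Baseline score if `gain_pct` is a relative improvement over the baseline."""
    return score / (1.0 + gain_pct / 100.0)


for metric, score in FCML_SARD.items():
    gain = STATED_GAIN[metric]
    absolute = implied_baseline_points(score, gain)
    relative = implied_baseline_relative(score, gain)
    # Both values are derived here for illustration; neither is quoted from Table 9.
    print(f"{metric}: {absolute:.1f} (points reading) vs {relative:.1f} (relative reading)")
```

The two readings diverge most for lower-valued metrics such as mAP50:95, so comparing the implied values with the tabulated baseline scores indicates which convention is in use.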
Across the evaluated datasets, YOLOv12s does not show the expected advantage over YOLO11s in UAV-based small-object detection. This phenomenon should not be interpreted as a general weakness of YOLOv12, but rather as a task-dependent architecture-suitability issue. Although YOLOv12 adopts an attention-centric design to enhance contextual representation, UAV-based tiny-object detection is dominated by low-resolution targets, weak texture cues, complex backgrounds, occlusion, and large scale variations. In such scenarios, detection accuracy depends more on high-resolution local details and fine-grained spatial cues, while stronger contextual modeling may not fully compensate for insufficient local target evidence. In addition, all YOLO-series baselines are trained under the same protocol without model-specific hyperparameter optimization. These results suggest that task-specific architectural adaptation, particularly high-resolution detail preservation and multi-level feature interaction, is more critical than model recency alone for UAV-based tiny-object detection.
Figure 12 presents qualitative comparisons between YOLO11s and FCML-YOLO across representative aerial detection scenarios, including occluded targets, wide-area surveillance scenes, multi-scale object distributions, tiny pedestrian targets, and UAV-based search-and-rescue cases. On the USOD dataset, which contains low-resolution grayscale images, YOLO11s exhibits evident missed detections when targets are small, moving, or partially occluded. In contrast, FCML-YOLO localizes these targets more accurately, demonstrating stronger robustness under low-resolution and occlusion conditions. On the DOTA dataset, where targets are densely distributed and exhibit significant scale variations, YOLO11s still suffers from noticeable missed detections. Benefiting from the enhanced multi-scale feature fusion strategy, FCML-YOLO improves detection completeness and localization quality in such large-area aerial scenes. On the TinyPerson dataset, FCML-YOLO also provides more reliable detection results for extremely small pedestrian targets, further confirming its effectiveness in tiny-object detection.
For the SARD visualization results, the red-highlighted regions show typical missed detections of YOLO11s. The simulated injured person in the high-grass area is small and highly blended with surrounding vegetation textures, making it difficult for YOLO11s to recognize. In contrast, FCML-YOLO accurately detects this target and also recovers small human targets missed by YOLO11s near shaded and forest-edge regions. These results indicate that FCML-YOLO can effectively reduce missed detections in UAV-based search-and-rescue scenarios and maintain stronger robustness under challenging conditions such as high-grass occlusion, shadow interference, and injured-like postures. Overall, the qualitative results across multiple datasets demonstrate that FCML-YOLO achieves more complete and stable detection performance under diverse aerial imaging conditions, especially for UAV-based emergency search-and-rescue person detection, where reducing missed detections of small human targets is critical.
4.6.3. Performance Comparison on the Self-Built OPVM-VIRD Dataset on Edge Devices
The OPVM-VIRD dataset was collected using our proprietary UAV platform, which captures diverse visual data from complex scenes, including hilly and mountainous areas. Using this dataset, we evaluated multiple key performance metrics on the NVIDIA Jetson AGX Orin platform. To validate the advantages of FCML-YOLO, we conducted comparative experiments between FCML-YOLO and several models from the YOLO series, including YOLOv5s, YOLOv8s, YOLOv10s, YOLO11s, and YOLOv12s. The experimental results, shown in Table 10, indicate that FCML-YOLO outperforms all other models across all evaluation metrics. Specifically, FCML-YOLO achieves the highest precision and recall, reaching 92.7% and 92.8%, respectively, which indicates its ability to reduce both false detections and missed detections in practical UAV-based monitoring scenarios. In terms of mAP50, FCML-YOLO achieves 95.2%, surpassing YOLO11s by 4.4% and the other YOLO versions by 4.1% to 5.9%. In terms of mAP50:95, FCML-YOLO reaches 69.6%, outperforming YOLO11s by 5.4% and thus demonstrating more stable localization performance under stricter IoU thresholds. These quantitative results demonstrate the effectiveness of FCML-YOLO on the self-built OPVM-VIRD dataset, especially for UAV-based hilly road monitoring scenes with small targets, vegetation occlusion, and complex background interference.
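To make the remark about stricter IoU thresholds concrete, the following minimal sketch shows the matching criterion that underlies mAP50 and mAP50:95. It assumes the standard COCO-style convention of ten IoU thresholds from 0.50 to 0.95 in steps of 0.05 and axis-aligned boxes given as (x1, y1, x2, y2); the function names and the example boxes are hypothetical. It does not compute average precision itself, only how many thresholds a single prediction would satisfy.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


# COCO-style thresholds used for mAP50:95: 0.50, 0.55, ..., 0.95.
IOU_THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]


def thresholds_passed(pred, gt):
    """Thresholds at which `pred` would count as a correct match for `gt`."""
    overlap = iou(pred, gt)
    return [t for t in IOU_THRESHOLDS if overlap >= t]


# Hypothetical boxes: the same one-pixel horizontal shift on a 10x10 and a 5x5 target.
large_gt, large_pred = (0, 0, 10, 10), (1, 0, 11, 10)
small_gt, small_pred = (0, 0, 5, 5), (1, 0, 6, 5)

print(len(thresholds_passed(large_pred, large_gt)), "thresholds passed (10x10 target)")
print(len(thresholds_passed(small_pred, small_gt)), "thresholds passed (5x5 target)")
```

The same one-pixel localization error costs the smaller target more thresholds, which is why gains in mAP50:95 on small-object datasets are usually interpreted as evidence of tighter localization rather than of detection alone.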
Figure 13 compares YOLO11s and the proposed FCML-YOLO under both low- and high-altitude perspectives. The red boxes in Figure 13 indicate typical missed detection regions of YOLO11s. In the low-altitude perspective, pedestrians are partially occluded by roadside vegetation and tree shadows. YOLO11s fails to detect the occluded pedestrian in the highlighted region, whereas FCML-YOLO successfully detects the missed target, showing better robustness to local occlusion and background interference. In the high-altitude perspective, pedestrians become smaller and are distributed along winding hilly roads, making them difficult to distinguish from tree shadows, road textures, and dense vegetation. YOLO11s produces more missed detections in the highlighted region, while FCML-YOLO detects more small pedestrian instances and provides more complete detection results. These results indicate that FCML-YOLO maintains stronger detection robustness on hilly mountain roads with challenging terrain, dense vegetation, and winding layouts, which supports its applicability to UAV-based road safety monitoring and rescue-oriented scenarios.