4.4.1. Quantitative Analysis
To evaluate the effectiveness of our proposed framework under occlusion, we compare DOMino-YOLO with a wide range of YOLO11-based state-of-the-art variants on the VOD-UAV dataset, as shown in Table 1. Our model consistently outperforms all baselines on the overall evaluation metrics and across most vehicle categories. In addition, we analyze several representative non-YOLO detectors to provide a broader perspective on occlusion-aware vehicle detection.
From an efficiency perspective, DOMino-YOLO achieves a favorable balance between detection accuracy and computational cost. As reported in Table 1, our model operates at 256.1 FPS with 56.4 GFLOPs, maintaining real-time inference capability while delivering the best overall accuracy. It should be noted that the inference speed (FPS) and computational complexity (GFLOPs) of FCOS, Faster R-CNN, and DETR are not reported, as these models were not re-implemented under the same YOLO11 framework and hardware settings, which would make a direct comparison unfair. Compared with high-capacity variants such as YOLO11-SK, which significantly increase computational complexity (102.3 GFLOPs) at the expense of inference speed, DOMino-YOLO attains higher mAP with nearly half the computational cost. Meanwhile, relative to lightweight YOLO11 variants (e.g., Rep, DCN, or ImplicitHead), our framework introduces only moderate overhead yet yields consistent and substantial performance gains under occlusion.
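For context, throughput and complexity figures of this kind are typically obtained by timing repeated forward passes and counting per-image FLOPs. The following is a minimal, illustrative sketch in PyTorch (assuming the third-party thop package for FLOP counting); it is not the authors' benchmarking code, and the exact measurement protocol may differ.

import time
import torch
from thop import profile  # third-party FLOP counter (an assumption; the paper does not name its tool)

def benchmark(model, img_size=640, warmup=50, iters=300, device="cuda"):
    """Estimate per-image GFLOPs and end-to-end FPS for a detector in eval mode."""
    model = model.to(device).eval()
    x = torch.randn(1, 3, img_size, img_size, device=device)
    flops, _ = profile(model, inputs=(x,), verbose=False)      # FLOPs of one forward pass
    sync = torch.cuda.synchronize if device.startswith("cuda") else (lambda: None)
    with torch.no_grad():
        for _ in range(warmup):                                 # warm-up so clocks and caches stabilize
            model(x)
        sync()
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
        sync()
        fps = iters / (time.perf_counter() - start)
    return flops / 1e9, fps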
Specifically, DOMino-YOLO achieves the highest overall mAP50 of 0.420 and mAP50-90 of 0.293, outperforming the strongest existing baseline, YOLO11-SK, which achieves 0.409 and 0.282, respectively. In categories characterized by severe occlusion, including BUS, MO, TC, and AT, our framework shows notable improvements. For example, the AP50 in these challenging cases increases by as much as 0.09 compared with competitive models such as YOLO11-Rep and YOLO11-ImplicitHead, demonstrating the effectiveness of our occlusion-aware design in recovering visibility-compromised targets.
These results validate the effectiveness of our three-pronged design. The DCEM module improves shape alignment and localization under irregular contours, the VASA module effectively retains visible structure while preserving scale-sensitive semantics, and the CSIM-Head mitigates contextual noise around occluded instances. In addition, our visibility-weighted loss contributes to more balanced learning across different occlusion levels, which is particularly evident in improved detection accuracy for low-visibility targets.
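To make the role of the visibility weighting concrete, the sketch below scales each instance's unreduced loss by a factor that grows as the annotated visible fraction shrinks, then renormalizes so the overall loss scale is unchanged. The linear weighting and the hyperparameter alpha are illustrative assumptions; the precise formulation of our visibility-weighted (OAR) loss is the one defined in the methodology.

import torch

def visibility_weighted_loss(per_instance_loss, visibility, alpha=1.0):
    """Illustrative visibility weighting (a sketch, not the exact OAR-Loss).

    per_instance_loss: (N,) unreduced detection loss for each ground-truth instance
    visibility:        (N,) visible fraction in [0, 1] taken from occlusion annotations
    alpha:             emphasis placed on heavily occluded instances (assumed hyperparameter)
    """
    weights = 1.0 + alpha * (1.0 - visibility)            # lower visibility -> larger weight
    weights = weights / weights.mean().clamp(min=1e-8)    # keep the average weight at 1
    return (weights * per_instance_loss).mean()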
In contrast to designs that emphasize architectural complexity (e.g., BiFormer or FasterNeXt), our method maintains a lightweight and real-time structure while delivering superior performance. This indicates that occlusion-specific modeling, rather than generic backbone upgrades, is key to improving detection in UAV-based occlusion scenarios.
Figure 11 shows a radar-chart-based comparison of precision and recall across five occlusion levels (OL 0-4) for eight state-of-the-art detectors, including RCS-OSA, CAFERE, GhostSlimFPN [62], AIFI, BiFormer, DCN, LOW-FAM [63], and DOMino-YOLO (ours), providing an intuitive per-category view of robustness under increasing occlusion.
As shown in the radar charts, DOMino-YOLO forms a consistently larger and more uniform radar profile than competing methods, indicating superior and more stable performance across multiple occlusion levels. This advantage is especially pronounced in recall, which is critical for occlusion-aware UAV detection, where missed targets pose greater risks than occasional false positives.
Quantitatively, across comparative experiments involving 8 vehicle categories and 5 occlusion levels, DOMino-YOLO achieves the best precision in 21 out of 40 cases and the best recall in 25 out of 40 cases, further confirming its robustness under diverse occlusion conditions.
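These win counts follow from a simple per-cell tally: each of the 8 × 5 category/occlusion cells is awarded to the model with the highest value. A short sketch, assuming the per-model results are stacked into a NumPy array, is given below.

import numpy as np

def count_wins(metric, our_index):
    """Count the (category, occlusion level) cells in which one model is best.

    metric: array of shape (num_models, 8, 5), e.g. precision or recall per cell
    """
    best = metric.argmax(axis=0)               # index of the winning model per cell
    return int((best == our_index).sum())       # out of 8 * 5 = 40 cells

# With the Figure 11 data, count_wins(precision, our_index) = 21 and
# count_wins(recall, our_index) = 25 for DOMino-YOLO.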
For unoccluded or lightly occluded targets (Level 0), DOMino-YOLO achieves competitive precision and the best recall in nearly all classes (e.g., TC: 0.626 recall vs. others ≤ 0.603; BUS: 0.894 recall vs. others ≤ 0.893), showing that our model retains strong generalization even without occlusion.
As occlusion severity increases from level 1 to level 3, DOMino-YOLO demonstrates consistent advantages in both recall and precision. For example, in the MO category at occlusion level 3, DOMino-YOLO achieves a precision of 0.353 and a recall of 0.525; its precision edges out the next best-performing model, LOW-FAM (0.348), while its recall is more than three times higher than LOW-FAM's 0.150. In the BUS category at occlusion level 4, DOMino-YOLO achieves the highest recall of 0.853 among all models, highlighting its strong capability to detect severely occluded targets.
However, a consistent observation across all evaluated models, including DOMino-YOLO, is a marked decline in precision for BC (bicycle) and AC (awning tricycle) at occlusion level 4, with precision even dropping to zero in some cases. This degradation is attributable to a dataset-level limitation rather than a model-specific deficiency: training samples are severely scarce under the highest occlusion condition, particularly for BC and AC. Such extreme data imbalance limits the ability of any model to learn reliable feature representations for these rare and heavily occluded categories. These results highlight the importance of maintaining a more balanced data distribution across both occlusion levels and object categories in future dataset construction.
Compared to models like CAFERE and GhostSlimFPN that struggle under higher occlusion (frequently showing near-zero recall or precision), DOMino-YOLO benefits from three key innovations: (1) the DCEM that adapts receptive fields to fragmented shapes; (2) the VASA module that amplifies partial cues; and (3) the CSIM-Head that filters contextual noise. These modules collectively bolster the model’s robustness to structural and spatial incompleteness, as evidenced by DOMino-YOLO’s superior recall under OL = 2–4.
Notably, although DOMino-YOLO achieves top-tier results, its performance in BC under OL = 4 (Precision: 0.388, Recall: 0.147) still reflects difficulty in detecting highly occluded, thin-structured objects like bicycles. This suggests that further gains could be realized by integrating topology-aware modules or leveraging generative augmentation for minority classes with severe occlusion.
In summary, DOMino-YOLO demonstrates consistent advantages across occlusion levels and categories, especially under severe occlusion, where the majority of baselines degrade substantially. These results validate the design of our occlusion-specific architecture and loss formulation, positioning DOMino-YOLO as a strong candidate for reliable UAV-based occlusion-aware detection.
4.4.2. Qualitative Analysis
Figure 12 presents the detection results of the top three models on the VOD-UAV dataset under three representative occlusion levels: slight occlusion, moderate occlusion, and heavy occlusion. We select three representative real-world images, each containing a high density of slightly, moderately, or heavily occluded vehicles. Missed detections and false positives are highlighted in red bounding boxes for clarity.
For slight occlusion, all three models successfully detect most vehicles, with only marginal differences in detection accuracy. However, both YOLO11-RepHELAN and YOLO11-SK fail to detect a tricycle in the upper-left corner. As shown previously in Figure 9, tricycles constitute a very small portion of the dataset, making them inherently more challenging to learn. Remarkably, our DOMino-YOLO is still able to correctly detect the tricycle, demonstrating its superior ability to generalize to underrepresented categories through its occlusion-aware structural aggregation and robust feature encoding.
For moderate occlusion, YOLO11-RepHELAN shows multiple missed detections and even false positives, while YOLO11-SK again misses the tricycle. In contrast, DOMino-YOLO achieves complete and accurate detection of all targets, highlighting its robustness in scenarios where discriminative features are partially obscured.
For heavy occlusion, the detection difficulty increases significantly. As shown in Figure 12, all three models exhibit missed detections and false positives, highlighted by the red bounding boxes. In particular, one motorcycle located at the bottom of the image is fully visible yet is still missed by all models. We attribute this failure to strong background interference, where complex contextual signals dominate the feature representation and suppress true object responses. In addition, YOLO11-RepHELAN fails to detect a heavily occluded vehicle entirely. Most notably, both YOLO11-RepHELAN and YOLO11-SK misclassify a van as a car, whereas DOMino-YOLO correctly identifies it even under partial occlusion. This demonstrates DOMino-YOLO's enhanced discriminative capability, which stems from its context-suppressed implicit modulation head that filters background noise and reinforces fine-grained object features.
In summary, the comparative results in Figure 12 clearly illustrate that while existing variants struggle with underrepresented classes, moderate-to-severe occlusions, and background interference, DOMino-YOLO consistently achieves more accurate detection and classification. Its improvements are particularly evident in handling small-sample categories and in distinguishing visually similar vehicle types under occlusion, thereby validating the effectiveness of its occlusion-aware design.
Figure 13 illustrates the model’s attention regions across three representative vehicle categories: car, truck, and van. These classes were selected based on their prevalence and structural diversity within the dataset.
In Figure 13a, we focus on the car category, which is the most abundant class in our dataset. From left to right, the images show cars being occluded by buildings at increasing severity. The attention maps indicate that the model dynamically adjusts its focus according to the visible parts of the vehicle. As occlusion intensifies, the attention region shifts and contracts toward the remaining visible areas, demonstrating the model's adaptability to partial visibility.
Figure 13b presents attention responses for trucks under varying occlusion angles caused by surrounding trees. In the first four cases, the model predominantly focuses on the rear portion of the truck, suggesting that the rear-end structure carries strong class-specific cues. When the rear is completely occluded, the attention expands to include the visible frontal area of the vehicle. This indicates the model’s flexibility in utilizing alternative visual features for recognition. However, in the final example, where the truck is almost entirely occluded, the attention region becomes extremely limited. Although the model still predicts the correct class, the bounding box is significantly undersized—highlighting the detrimental impact of extreme occlusion on localization accuracy.
In Figure 13c, we analyze vans under similar occlusion conditions. The model initially focuses on the frontal region of the vehicle. As the occlusion of the front increases (shown in the third to sixth images), the attention gradually shifts and expands to encompass a broader area of the vehicle, suggesting a redistribution of attention when key features are partially obstructed.
These observations provide evidence that our model adapts its attention to available visual cues under different occlusion configurations, and also reveal class-specific attention behaviors that are crucial for robust detection in UAV imagery.
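The paper does not specify how the attention maps in Figure 13 were generated; a common choice for CNN-based detectors is Grad-CAM, and the sketch below illustrates that technique under this assumption, using forward and backward hooks on a chosen convolutional layer. The resulting heat map can be overlaid on the input image to produce visualizations of the kind shown in Figure 13.

import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, score_fn):
    """Grad-CAM-style heat map (an assumption; the paper does not name its visualization method).

    image:        (1, 3, H, W) input tensor
    target_layer: convolutional module whose activations are visualized
    score_fn:     maps the model output to a scalar score for the detection of interest
    """
    feats, grads = [], []
    h1 = target_layer.register_forward_hook(lambda m, i, o: feats.append(o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))
    model.eval()
    score = score_fn(model(image))
    model.zero_grad()
    score.backward()
    h1.remove()
    h2.remove()
    act, grad = feats[0], grads[0]                          # (1, C, h, w)
    channel_weights = grad.mean(dim=(2, 3), keepdim=True)   # average gradient per channel
    cam = F.relu((channel_weights * act).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)   # normalized heat map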
4.4.3. Generalized Analysis
In the VisDrone generalization experiments, occlusion annotations are not available. Therefore, while DCEM, VASA, and CSIM-Head are retained to enhance feature representation, all occlusion-aware supervision is disabled, and the model is trained using the standard YOLO loss without occlusion weights.
As shown in Table 2, different detection frameworks exhibit distinct performance characteristics on the VisDrone2019 dataset. The two-stage detector Faster R-CNN achieves the best performance in terms of mAP50 and on medium-scale objects (APm), reflecting its strong capability in modeling relatively complete objects with sufficient visual features. DETR attains the highest mAP75, demonstrating the advantage of Transformer-based architectures under stricter IoU evaluation criteria and their effectiveness in global feature modeling.
In contrast, the proposed DOMino-YOLO achieves the best results in overall mAP50-90 as well as for very small (APvt) and small objects (APt). This advantage can be attributed to its occlusion-aware design tailored for partial visibility. Specifically, the DCEM alleviates spatial misalignment caused by scale variations, the VASA module effectively extracts discriminative multi-scale features from partially visible regions, and the CSIM-Head reduces false activations in dense scenes through adaptive context suppression. In addition, the proposed OAR-Loss further enhances the discrimination of heavily occluded and small-scale instances.
It is worth noting that no occlusion annotations are used during training on the VisDrone dataset, and the occlusion-aware supervision is disabled in this setting. Despite this, DOMino-YOLO still maintains competitive or superior performance on small-scale objects and under stricter evaluation metrics, indicating that the proposed method does not overfit to occlusion-specific annotations and generalizes well to complex UAV scenarios.