Article

DOMino-YOLO: A Deformable Occlusion-Aware Framework for Vehicle Detection in Aerial Imagery

1 College of Computer Science and Technology, Harbin Engineering University, Nantong Street, Harbin 150001, China
2 National Engineering Laboratory for E-Government Modeling and Simulation, Harbin Engineering University, Nantong Street, Harbin 150001, China
3 Defense Innovation Institute, Chinese Academy of Military Science, Beijing 100071, China
4 Intelligent Game and Decision Laboratory, Beijing 100071, China
* Author to whom correspondence should be addressed.
Remote Sens. 2026, 18(1), 66; https://doi.org/10.3390/rs18010066
Submission received: 25 November 2025 / Revised: 22 December 2025 / Accepted: 23 December 2025 / Published: 25 December 2025

Highlights

What are the main findings?
  • We introduce VOD-UAV, the first UAV-based vehicle detection dataset with fine-grained five-level occlusion annotations.
  • We develop DOMino-YOLO, a YOLOv11-based framework with DCEM, VASA, and CSIM-Head to enhance robustness under occlusion.
What are the implications of the main findings?
  • We propose an Occlusion-Aware Repulsion Loss that suppresses redundant predictions and emphasizes heavily occluded objects.
  • The new dataset and framework significantly improve vehicle detection accuracy under varying visibility conditions.

Abstract

Occlusion-aware vehicle detection in UAV imagery is challenging due to partial visibility from varied viewpoints, dense scenes, and limited features. To address this, we introduce two contributions. First, we present VOD-UAV, the first UAV-based vehicle detection dataset focused on occlusion, containing 712 synthetic and 1219 real-world images, each annotated with five discrete occlusion levels. These fine-grained labels enable structured supervision and detailed analysis under varying visibility conditions. Second, we propose DOMino-YOLO, a YOLOv11-based detection framework that enhances occlusion robustness via three components: the Deformable Convolution Enhanced Module (DCEM) for spatial alignment, the Visibility-Aware Structural Aggregation (VASA) module for multi-scale feature extraction from partially visible regions, and the Context-Suppressed Implicit Modulation Head (CSIM-Head) for reducing false activations through adaptive channel reweighting. An Occlusion-Aware Repulsion Loss (OAR-Loss) combines Repulsion Loss and Visibility-Weighted Classification Loss to suppress redundant predictions and emphasize heavily occluded objects. Extensive experiments on VOD-UAV demonstrate that DOMino-YOLO significantly improves detection accuracy and robustness under occlusion. The dataset and code will be made publicly available to support future research.

1. Introduction

Occluded object detection remains a fundamental yet challenging problem in computer vision, particularly in UAV-based (unmanned aerial vehicle) scenarios, where frequent object overlap, dynamic viewpoints, and limited visible cues severely degrade detection performance. Unlike ground-based imagery, UAV platforms often observe targets from oblique or top-down perspectives, resulting in complex inter-object occlusion, drastic scale variation, and fragmented object appearances. This challenge is especially critical in real-world applications such as intelligent traffic monitoring, aerial law enforcement, and autonomous navigation, where reliable recognition of partially visible vehicles is essential for downstream perception and decision-making systems. Recent advances in large-scale aerial benchmarks, including VisDrone [1], UAVDT [2], and TinyPerson [3], as well as other representative datasets [4,5,6,7,8], have substantially promoted progress in UAV-based object detection. However, these datasets are primarily designed for general-purpose detection and lack explicit occlusion annotations. Moreover, their exclusive reliance on real-world UAV imagery limits the controllability of key factors such as occlusion severity, object scale, and viewpoint diversity—making it difficult to systematically evaluate how occlusion impacts detection performance across conditions. To alleviate these limitations, recent studies have explored simulation-based and synthetic-real hybrid UAV datasets [9,10,11,12,13,14,15,16,17], demonstrating the advantages of controllable occlusion modeling and diversified viewpoints, yet still falling short of providing fine-grained, UAV-oriented occlusion supervision for vehicle detection.
In parallel, extensive efforts have been devoted to occlusion-aware detection methodologies. Early part-based and graphical models [18,19,20,21] explicitly modeled object components and their spatial relationships, while subsequent works explored deep occlusion modeling and tracking mechanisms [22,23,24]. More recent approaches introduced occlusion-aware loss functions and feature pooling strategies [25,26,27,28], as well as improvements within mainstream anchor-based and anchor-free detection frameworks [29,30,31,32,33,34,35,36,37,38,39,40,41]. In the UAV domain, several occlusion-aware designs have been proposed [42,43,44,45,46,47,48]. Despite these efforts, most existing methods still struggle under heavy occlusion due to three intrinsic limitations: (1) spatial ambiguity caused by irregular and unpredictable visible regions, (2) severe loss of discriminative visual cues, and (3) strong contextual interference from surrounding objects and cluttered backgrounds. These factors jointly lead to misalignment, false positives, and missed detections in densely occluded aerial scenes.
To address the above limitations, we introduce VOD-UAV, the first dedicated dataset for occluded vehicle detection from a UAV perspective. VOD-UAV consists of 712 high-fidelity synthetic images rendered using Unreal Engine 5 and 1219 real-world UAV images, where all vehicles are re-annotated with five discrete occlusion levels: no occlusion, slight occlusion, moderate occlusion, heavy occlusion, and extreme occlusion. This fine-grained annotation protocol enables structured supervision and quantitative analysis of detection performance across different visibility conditions. As illustrated in Figure 1, the dataset covers diverse occlusion patterns across multiple vehicle categories and illumination scenarios. By combining controllable synthetic data with visually realistic real-world imagery, VOD-UAV provides a balanced and practical benchmark for occlusion-aware UAV detection.
Building upon this dataset, we propose DOMino-YOLO (Deformable Occlusion-aware Modulation Network based on YOLOv11), a unified framework tailored for occluded vehicle detection in UAV scenarios. Unlike prior methods that mainly pursue overall accuracy gains, DOMino-YOLO explicitly addresses occlusion-induced challenges. Specifically, spatial ambiguity is alleviated by a Deformable Convolution Enhanced Module (DCEM), visual incompleteness is compensated by a Visibility-Aware Structural Aggregation (VASA) module, and contextual interference is suppressed via a Context-Suppressed Implicit Modulation Head (CSIM-Head). In addition, an Occlusion-Aware Repulsion Loss is introduced to reduce redundant predictions in crowded scenes while enhancing robustness across different occlusion levels.
Our main contributions are summarized as follows:
  • We present VOD-UAV, the first UAV-based occluded vehicle detection dataset that combines real-world and multi-scene synthetic aerial imagery, and provides fine-grained occlusion annotations to enable systematic evaluation and occlusion-aware learning.
  • We propose DOMino-YOLO, a unified occlusion-aware detection framework that explicitly addresses spatial ambiguity, visual information loss, and contextual interference in heavily occluded UAV scenes.
  • We design an Occlusion-Aware Repulsion Loss Function, which integrates repulsion constraints with visibility-weighted classification to enhance robustness under varying occlusion levels.
  • We conduct extensive experiments and establish comprehensive benchmarks on VOD-UAV, demonstrating consistent improvements over state-of-the-art methods, and release our dataset and code to facilitate future research in occlusion-aware UAV detection.

2. Methodology

DOMino-YOLO is a specialized occlusion-aware framework built upon the YOLOv11 architecture, designed to improve detection performance in UAV imagery where targets are frequently occluded, truncated, or surrounded by cluttered backgrounds. These challenges stem from spatial ambiguity caused by unpredictable occlusion patterns, visual information loss when key object parts are hidden, and contextual interference from surrounding objects or complex backgrounds.
To address these issues, DOMino-YOLO integrates three key modules: (1) the DCEM, which aligns receptive fields with irregular contours to alleviate spatial ambiguity; (2) the VASA, which enriches partial structural cues; and (3) the CSIM-Head, which filters misleading contextual signals. Additionally, an Occlusion-Aware Repulsion Loss Function (OAR-Loss) is employed during training. This loss combines a Repulsion Loss Function, which enforces spatial dispersion among foreground predictions, with a Visibility-Weighted Classification Loss that assigns higher weights to heavily occluded instances. Together, these components enhance both localization precision and classification robustness in crowded or occluded aerial scenes. The overall architecture is illustrated in Figure 2.

2.1. Deformable Convolution Enhanced Module

In complex aerial traffic scenarios, vehicle detection frequently suffers from partial occlusion, truncation, and dense object distributions. These issues lead to degraded spatial representations, shape distortion, and structural inconsistency—especially in the intermediate layers of conventional convolutional backbones. To overcome this, we introduce the DCEM, a spatially adaptive encoding module that enhances geometric alignment and structural stability for partially visible vehicles.
As depicted in Figure 3, DCEM begins with a dimensionality reduction layer, followed by three parallel residual branches. Each branch includes a combination of standard and deformable convolutional blocks. This parallel structure captures multi-scale spatial patterns while maintaining computational efficiency. The outputs of all branches are concatenated and fused via a 1 × 1 convolution, and a skip connection is used to preserve the input semantics.
From a spatial perspective, DCEM employs modulated deformable convolutions to dynamically adjust sampling positions and attention weights. This improves the alignment of receptive fields with irregular or occluded object boundaries, reducing false activations from unrelated regions. The deformable feature at spatial location $p_0$ is calculated as:
$$Y(p_0) = \sum_{p_n \in \mathcal{R}} w(p_n) \cdot F(p_0 + p_n + \Delta p_n) \cdot m_n$$
where $p_0 \in \mathbb{Z}^2$ denotes the spatial coordinate of the output feature map, $F(\cdot)$ represents the input feature map, and $\mathcal{R}$ is the predefined regular convolutional sampling grid, which is fixed as a $3 \times 3$ kernel in our implementation. $p_n$ denotes the $n$-th sampling offset within $\mathcal{R}$, $w(p_n)$ is the corresponding convolution kernel weight, $\Delta p_n$ is the learnable offset that adaptively shifts the sampling location, and $m_n \in [0, 2]$ is the modulation scalar controlling the relative importance of each sampled feature.
$$\Delta p = f_{\text{offset}}(F), \qquad m = 2 \cdot \sigma\big(f_{\text{mod}}(F)\big)$$
where $f_{\text{offset}}(\cdot)$ and $f_{\text{mod}}(\cdot)$ denote learnable convolutional functions for offset and modulation estimation, respectively, and $\sigma(\cdot)$ is the sigmoid activation function.
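To make the deformable sampling concrete, the following PyTorch sketch predicts the offsets and modulation scalars from the input feature map and applies them through torchvision's deform_conv2d. It is a minimal illustration under stated assumptions: the zero initialization of the offset/modulation branches, the channel sizes, and the single offset group are illustrative choices, not details taken from the paper.

```python
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class ModulatedDeformConv(nn.Module):
    """Minimal sketch of the modulated deformable convolution used in DCEM.

    Offsets and modulation scalars are predicted from the input feature map;
    the modulation is scaled to [0, 2] via 2 * sigmoid, as described above.
    """

    def __init__(self, in_ch, out_ch, k=3, padding=1):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.01)
        self.bias = nn.Parameter(torch.zeros(out_ch))
        # f_offset: two offsets (dx, dy) per kernel position
        self.f_offset = nn.Conv2d(in_ch, 2 * k * k, k, padding=padding)
        # f_mod: one modulation scalar per kernel position
        self.f_mod = nn.Conv2d(in_ch, k * k, k, padding=padding)
        nn.init.zeros_(self.f_offset.weight); nn.init.zeros_(self.f_offset.bias)
        nn.init.zeros_(self.f_mod.weight); nn.init.zeros_(self.f_mod.bias)
        self.padding = padding

    def forward(self, x):
        offset = self.f_offset(x)                  # Δp_n
        mask = 2.0 * torch.sigmoid(self.f_mod(x))  # m_n in [0, 2]
        return deform_conv2d(x, offset, self.weight, self.bias,
                             padding=self.padding, mask=mask)

# usage: y = ModulatedDeformConv(256, 256)(torch.randn(1, 256, 40, 40))
```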
From a structural standpoint, each residual block adopts a bottleneck design (1 × 1 convolution → 3 × 3 convolution → 3 × 3 deformable convolution), enabling both fine-grained feature extraction and semantic abstraction. This configuration enhances spatial continuity by first encoding local structural cues through standard convolutions and then adaptively linking spatially separated yet semantically consistent regions via deformable sampling, enabling coherent representation of fragmented objects.
The final output feature map is constructed as:
$$Y_{\text{out}} = F + \mathrm{Conv}_{1 \times 1}\big(\mathrm{Concat}(B_1, B_2, B_3)\big)$$
where $B_1$, $B_2$, and $B_3$ denote the output feature maps of the three parallel bottleneck branches, $\mathrm{Concat}(\cdot)$ represents channel-wise concatenation, and $\mathrm{Conv}_{1 \times 1}(\cdot)$ is a $1 \times 1$ convolution used for feature fusion and channel alignment. The residual connection with the input feature $F$ ensures stable gradient propagation and effective multi-level contextual aggregation.
In summary, DCEM provides a flexible mechanism to handle spatial deformation and incomplete visibility in occluded vehicles. It combines deformable sampling, which adapts receptive fields to visible features, with multi-branch residual fusion to enrich contextual information. Together, these mechanisms enhance localization accuracy in UAV scenes with occlusions and complex backgrounds.

2.2. Visibility-Aware Structural Aggregation

In aerial vehicle detection, occlusion often leads to severely limited visible regions, posing significant challenges for accurate localization and recognition. Traditional convolutional backbones treat spatial and channel dimensions uniformly, which may dilute subtle yet critical cues from partially visible objects. To explicitly address this issue, we propose a visibility-aware feature refinement module, termed VASA, as illustrated in Figure 4. VASA is designed to selectively emphasize informative visible structures by hierarchically modeling spatial patterns and inter-channel dependencies in a lightweight and efficient manner.
The architecture of VASA consists of four sequential stages: a RepVGG block, two Shuffle RepVGG (SR) blocks, a Residual Shuffle RepVGG (ResSR) block, and a concluding squeeze-and-excitation (SE) module. Each stage progressively enhances structural context while maintaining visibility-sensitive features across different abstraction levels.
At the first stage, the input feature map $X_{\text{in}}$ is processed through a RepVGG block, which acts as a reparameterizable encoder. This component captures essential spatial structures while suppressing background redundancy, thereby establishing a robust base representation for partially occluded objects.
Next, two SR blocks are applied in sequence. Each SR block splits the feature channels into two branches: one branch is processed by a RepVGG unit, while the other bypasses processing. The outputs of both branches are concatenated and subjected to a channel shuffle operation, promoting cross-branch feature interaction and enhancing mid-level representation diversity—particularly important for fragmented or non-contiguous object parts.
To further abstract semantic features while preserving spatial continuity, the output from the second SR block is passed to a ResSR module. This module comprises two stacked RepVGG layers within a residual framework. The shortcut path can be either an identity or a projection connection, depending on dimensional consistency. The ResSR design facilitates deeper transformation of visibility-aware features without sacrificing gradient flow, enabling stronger recognition of heavily occluded vehicles.
After hierarchical aggregation, the outputs from the RepVGG, two SR, and ResSR blocks are concatenated along the channel axis. The fused features are then processed by a reparameterizable convolutional layer, followed by an SE block that adaptively calibrates channel-wise importance. This recalibration highlights salient visible cues while suppressing distracting background information.
The final VASA output $F_{\text{VASA}}$ is computed as:
$$Z_1 = \phi_{\text{Rep}}(X), \quad Z_2 = \phi_{\text{SR}}(Z_1), \quad Z_3 = \phi_{\text{SR}}(Z_2), \quad Z_4 = \phi_{\text{ResSR}}(Z_3)$$
$$F_{\text{VASA}} = \mathrm{SE}\big(\phi_{\text{Rep}}^{\text{out}}(Z_1 \,\|\, Z_2 \,\|\, Z_3 \,\|\, Z_4)\big)$$
where $\phi_{\text{Rep}}$, $\phi_{\text{SR}}$, and $\phi_{\text{ResSR}}$ represent the operations of the RepVGG, SR, and ResSR modules, respectively. The operator $\|$ denotes channel-wise concatenation, and $\mathrm{SE}(\cdot)$ performs channel-wise recalibration on the concatenated multi-stage features to adaptively select informative cues from different depths.
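The data flow above can be sketched as follows. The RepVGG, SR, and ResSR blocks are not reproduced here; they are passed in as stand-in modules, and only the stage-wise aggregation (stage outputs, concatenation, fusion convolution, SE recalibration) follows the description in the text. Channel sizes and the plain 1 × 1 fusion convolution are illustrative assumptions.

```python
import torch
import torch.nn as nn

class VASA(nn.Module):
    """Illustrative sketch of the VASA aggregation flow described above."""

    def __init__(self, channels, rep, sr1, sr2, res_sr, se):
        super().__init__()
        self.rep, self.sr1, self.sr2, self.res_sr = rep, sr1, sr2, res_sr
        # stand-in for the reparameterizable fusion layer (phi_Rep^out)
        self.fuse = nn.Conv2d(4 * channels, channels, kernel_size=1)
        self.se = se  # squeeze-and-excitation recalibration

    def forward(self, x):
        z1 = self.rep(x)        # phi_Rep
        z2 = self.sr1(z1)       # phi_SR (first)
        z3 = self.sr2(z2)       # phi_SR (second)
        z4 = self.res_sr(z3)    # phi_ResSR
        fused = self.fuse(torch.cat([z1, z2, z3, z4], dim=1))  # channel-wise concat
        return self.se(fused)   # F_VASA

# smoke test with identity stand-ins for the blocks:
# VASA(64, nn.Identity(), nn.Identity(), nn.Identity(), nn.Identity(),
#      nn.Identity())(torch.randn(1, 64, 40, 40))
```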
In contrast to conventional feature extractors that overlook the variable visibility under occlusion, VASA introduces visibility-aware refinements at both the architectural and semantic levels. By progressively integrating features from low- to high-level stages, VASA enables the model to robustly localize and classify objects even when only fragmented cues are available. Furthermore, the use of lightweight, reparameterizable components ensures high efficiency, making VASA suitable for real-time applications and resource-constrained UAV platforms.

2.3. Context-Suppressed Implicit Modulation Head

While the proposed DCEM and VASA modules effectively mitigate spatial ambiguity and visible feature loss, a third major challenge remains: redundant contextual interference. In densely populated aerial scenes, nearby objects such as adjacent vehicles, road textures, or cast shadows can generate misleading signals that reduce localization accuracy and increase false positives. To overcome this, we propose the CSIM-Head, a lightweight and effective detection head that selectively suppresses irrelevant context while reinforcing foreground-relevant responses.
Unlike explicit spatial attention mechanisms that depend on external attention maps or auxiliary supervision, CSIM-Head performs implicit channel-wise modulation via two complementary stages: additive suppression and multiplicative enhancement. These operations dynamically reweight feature channels to filter out background noise while preserving important target-related activations, without incurring spatial attention overhead. The process of CSIM-Head is shown in Figure 5.
Let $F_i \in \mathbb{R}^{C_i \times H_i \times W_i}$ denote the intermediate feature map at detection scale $i$, where $i = 1, \dots, L$. In the first modulation stage, CSIM-Head applies a learnable channel-wise additive bias to shift the activation distribution:
$$\hat{F}_i = F_i + A_i, \qquad A_i \sim \mathcal{N}(0, \sigma^2), \qquad A_i \in \mathbb{R}^{C_i \times 1 \times 1}$$
where $A_i$ is a learnable bias parameter initialized from a zero-mean Gaussian distribution. This additive operation acts as a soft suppression mechanism that attenuates activations from background-sensitive channels while retaining meaningful object-level cues. The bias is optimized during training in a data-driven manner, allowing the model to discover which channels contribute to noise or confusion.
The modulated features $\hat{F}_i$ are then fed into two decoupled sub-heads for regression and classification:
$$R_i = H_i^{\text{reg}}(\hat{F}_i), \qquad C_i = H_i^{\text{cls}}(\hat{F}_i)$$
where $H_i^{\text{reg}}(\cdot)$ is the regression sub-head at scale $i$, responsible for predicting bounding box coordinates, and $H_i^{\text{cls}}(\cdot)$ is the classification sub-head at scale $i$, responsible for predicting category scores. To further emphasize confident foreground responses, CSIM-Head applies a second modulation phase using learnable channel-wise scaling factors:
$$\tilde{R}_i = M_i^{\text{reg}} \odot R_i, \qquad \tilde{C}_i = M_i^{\text{cls}} \odot C_i$$
where $\odot$ denotes channel-wise multiplication, and $M_i^{\text{reg}}$ and $M_i^{\text{cls}}$ are learnable channel-wise scaling vectors for the regression and classification outputs, respectively, initialized with positive constants (e.g., 1.0) and learned via backpropagation to amplify discriminative responses while downweighting ambiguous ones. $R_i$ and $C_i$ are the raw regression and classification outputs before modulation, and $\tilde{R}_i$ and $\tilde{C}_i$ are the corresponding modulated outputs. Together, these additive and multiplicative modulations enable fine-grained feature refinement across all detection layers.
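A minimal sketch of the two modulation stages at a single scale is given below. The decoupled sub-heads are passed in as plain convolutions standing in for the YOLOv11 heads; only the additive bias and the multiplicative scaling follow the description above, and the Gaussian standard deviation used for initialization is an assumed value.

```python
import torch
import torch.nn as nn

class CSIMHead(nn.Module):
    """Sketch of the implicit channel-wise modulation in CSIM-Head."""

    def __init__(self, channels, reg_head, cls_head, init_std=0.02):
        super().__init__()
        self.bias = nn.Parameter(torch.randn(1, channels, 1, 1) * init_std)       # A_i
        self.reg_head, self.cls_head = reg_head, cls_head
        self.m_reg = nn.Parameter(torch.ones(1, reg_head.out_channels, 1, 1))      # M_i^reg
        self.m_cls = nn.Parameter(torch.ones(1, cls_head.out_channels, 1, 1))      # M_i^cls

    def forward(self, f):
        f_hat = f + self.bias                           # additive suppression
        r, c = self.reg_head(f_hat), self.cls_head(f_hat)
        return self.m_reg * r, self.m_cls * c           # multiplicative enhancement

# e.g. head = CSIMHead(256, nn.Conv2d(256, 64, 1), nn.Conv2d(256, 8, 1))
```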
Finally, the outputs from all feature scales are concatenated and passed through a distribution-aware regression head and a sigmoid-activated classification head. Since CSIM-Head operates in a resolution-agnostic and channel-efficient manner, it introduces negligible computational overhead and fully retains the real-time inference capabilities of the YOLOv11 backbone.
By implicitly modeling context and adaptively highlighting foreground saliency, CSIM-Head enhances the detector’s capacity to suppress irrelevant noise and accurately detect targets in complex UAV scenes. This proves especially valuable in occlusion-prone environments with tight object distributions and visually cluttered backgrounds.

2.4. Occlusion-Aware Repulsion Loss Function

In UAV-based vehicle detection tasks, severe occlusion often leads to spatial ambiguity in object locations and overlapping prediction boxes, especially in densely populated scenes. Conventional detection losses typically fail to effectively penalize these redundant outputs, leading to degraded localization performance. To explicitly address this issue, this section proposes an Occlusion-Aware Repulsion Loss (OAR-Loss) that incorporates explicit repulsion constraints and an occlusion-weighted mechanism to enhance detection robustness under complex occlusion conditions.

2.4.1. Overall Objective Function

The overall training objective integrates four components—bounding box regression, classification, distribution modeling, and repulsion constraints—and is defined as follows:
$$\mathcal{L}_{\text{total}} = \lambda_{\text{box}} \mathcal{L}_{\text{box}} + \lambda_{\text{rep}} \mathcal{L}_{\text{rep}} + \lambda_{\text{cls}} \mathcal{L}_{\text{cls}}^{\text{occ}}$$
where $\mathcal{L}_{\text{box}}$ denotes the bounding box regression loss, $\mathcal{L}_{\text{rep}}$ represents the occlusion-aware repulsion loss, $\mathcal{L}_{\text{cls}}^{\text{occ}}$ is the occlusion-weighted classification loss, and $\lambda_{*}$ denotes the balance coefficient of each component.
This joint objective optimizes three aspects during training: (1) improving localization accuracy through $\mathcal{L}_{\text{box}}$, which consists of $\mathcal{L}_{\text{LCIoU}}$ and $\mathcal{L}_{\text{DFL}}$; (2) enhancing classification robustness through $\mathcal{L}_{\text{cls}}^{\text{occ}}$; and (3) reducing prediction redundancy and improving spatial separation through $\mathcal{L}_{\text{RepGT}}$ and $\mathcal{L}_{\text{RepBox}}$. A compact sketch of this weighted combination is given below.
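The sketch simply combines the three terms with their balance coefficients. The values $\lambda_{\text{rep}} = 1.0$ and $\lambda_{\text{cls}} = 0.5$ follow the ablation in Section 4.5.3; $\lambda_{\text{box}}$ is an assumed placeholder, since its value is not reported in this section.

```python
def oar_total_loss(l_box, l_rep, l_cls_occ,
                   lambda_box=7.5, lambda_rep=1.0, lambda_cls=0.5):
    """Weighted combination of the box, repulsion, and occlusion-weighted
    classification losses defined in this section (lambda_box is assumed)."""
    return lambda_box * l_box + lambda_rep * l_rep + lambda_cls * l_cls_occ
```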
The following subsections introduce the components of the proposed loss in detail.

2.4.2. Bounding Box Regression Loss

To ensure consistency between predicted and ground-truth bounding boxes in terms of spatial position and geometric shape, a hybrid regression strategy combining the lightweight Complete-IoU loss (LCIoU) and the Distribution-Focal Loss (DFL) is adopted. It is defined as:
$$\mathcal{L}_{\text{box}} = \mathcal{L}_{\text{LCIoU}} + \mathcal{L}_{\text{DFL}}$$
Based on the original CIoU, the LCIoU loss $\mathcal{L}_{\text{LCIoU}}$ simplifies the formulation while preserving consistency in overlap, center-point distance, and aspect-ratio geometry. It is defined as:
$$\mathcal{L}_{\text{LCIoU}} = 1 - \mathrm{IoU} + \frac{\rho^2(b^{\text{pred}}, b^{\text{gt}})}{c^2} + \alpha v$$
where IoU denotes the intersection-over-union between the predicted and ground-truth boxes:
$$\mathrm{IoU}\big(b_j^{\text{pred}}, b_k^{\text{gt}}\big) = \frac{\mathrm{area}\big(b_j^{\text{pred}} \cap b_k^{\text{gt}}\big)}{\mathrm{area}\big(b_j^{\text{pred}} \cup b_k^{\text{gt}}\big)}$$
where $\rho^2(b^{\text{pred}}, b^{\text{gt}})$ is the squared Euclidean distance between the predicted and ground-truth centers, $c$ is the diagonal length of the smallest enclosing box covering both boxes, and $v$ is the aspect-ratio consistency term:
$$v = \frac{4}{\pi^2} \left( \arctan \frac{w^{\text{gt}}}{h^{\text{gt}}} - \arctan \frac{w^{\text{pred}}}{h^{\text{pred}}} \right)^2$$
$\alpha$ is the weighting factor that balances the influence of aspect-ratio consistency:
$$\alpha = \frac{v}{(1 - \mathrm{IoU}) + v}$$
where $\mathcal{L}_{\text{DFL}}$ denotes the Distribution-Focal Loss. To improve localization accuracy, DFL expands the regression of each box coordinate from a single scalar to a discrete probability distribution, encouraging the predicted distribution to concentrate near the ground-truth boundary and thus achieving sub-pixel precision. For each coordinate, the model predicts a probability distribution over $K$ discrete bins:
$$p = [p_0, p_1, \dots, p_{K-1}], \qquad \sum_{i=0}^{K-1} p_i = 1$$
The final continuous offset $\hat{y}$ is obtained by computing the expectation:
$$\hat{y} = \sum_{i=0}^{K-1} p_i \cdot i$$
To optimize this distribution, the following DFL loss is used:
$$\mathcal{L}_{\text{DFL}} = -\sum_{i=0}^{K-1} q_i \log(p_i)$$
where $p_i$ denotes the predicted probability of the $i$-th discrete bin obtained by applying the softmax function to the regression logits, and $q_i$ represents the corresponding soft target distribution constructed by discretizing the continuous bounding box regression target. Specifically, the continuous target is linearly interpolated between its two neighboring bins, resulting in a normalized distribution satisfying $\sum_{i=0}^{K-1} q_i = 1$. Here, $K$ denotes the number of discrete bins used to model each bounding box offset.
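A minimal sketch of the DFL computation for one box coordinate is shown below: the continuous target is split between its two neighboring bins (the soft targets $q_i$) and supervised with cross-entropy, while the expectation over bin indices recovers the continuous offset. The bin count of 16 is an assumed value, not one taken from the paper.

```python
import torch
import torch.nn.functional as F

def distribution_focal_loss(reg_logits, target, num_bins=16):
    """DFL sketch. reg_logits: (..., num_bins) raw logits for one coordinate;
    target: continuous offset in [0, num_bins - 1]."""
    target = target.clamp(0, num_bins - 1 - 1e-4)
    left = target.floor().long()                 # lower neighbouring bin
    right = left + 1                             # upper neighbouring bin
    w_right = target - left.float()              # soft assignment weights (sum to 1)
    w_left = 1.0 - w_right
    log_p = F.log_softmax(reg_logits, dim=-1)    # log p_i
    loss = -(w_left * log_p.gather(-1, left.unsqueeze(-1)).squeeze(-1)
             + w_right * log_p.gather(-1, right.unsqueeze(-1)).squeeze(-1))
    # predicted continuous offset: expectation over the bin indices
    bins = torch.arange(num_bins, dtype=log_p.dtype, device=log_p.device)
    y_hat = (log_p.exp() * bins).sum(-1)
    return loss.mean(), y_hat
```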

2.4.3. Repulsion Loss Function

To reduce the issue of overlapping predictions in dense scenes, a repulsion loss $\mathcal{L}_{\text{rep}}$ is introduced to constrain the spatial relationships among predicted boxes. This loss consists of two complementary components: repulsion from ground-truth boxes $\mathcal{L}_{\text{RepGT}}$ and repulsion among predicted boxes $\mathcal{L}_{\text{RepBox}}$.
$$\mathcal{L}_{\text{rep}} = \mathcal{L}_{\text{RepGT}} + \mathcal{L}_{\text{RepBox}}$$
$\mathcal{L}_{\text{RepGT}}$: To prevent multiple predicted boxes from clustering around the same ground-truth object, the Intersection-over-Groundtruth (IoG) is computed between each high-confidence foreground prediction and its matched ground-truth box. A smoothed logarithmic penalty function is then applied to suppress highly overlapping predictions.
$$\mathrm{IoG}\big(b_i^{\text{pred}}, b_i^{\text{gt}}\big) = \frac{\mathrm{area}\big(b_i^{\text{pred}} \cap b_i^{\text{gt}}\big)}{\mathrm{area}\big(b_i^{\text{gt}}\big)}$$
$$\mathcal{L}_{\text{RepGT}} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{smooth\_ln}\big(\mathrm{IoG}(b_i^{\text{pred}}, b_i^{\text{gt}});\, \sigma_{\text{repgt}}\big)$$
where $N$ denotes the number of foreground predictions used for computing $\mathcal{L}_{\text{RepGT}}$, i.e., the number of high-confidence predicted boxes matched to a ground-truth object. $\mathcal{L}_{\text{RepBox}}$ promotes spatial separation between predicted boxes: the Intersection over Union (IoU) is computed pairwise for all foreground boxes and the same penalty function is applied:
$$\mathcal{L}_{\text{RepBox}} = \frac{1}{M} \sum_{j \neq k} \mathrm{smooth\_ln}\big(\mathrm{IoU}(b_j^{\text{pred}}, b_k^{\text{pred}});\, \sigma_{\text{rep}}\big)$$
where $M$ represents the number of foreground prediction-box pairs used to calculate $\mathcal{L}_{\text{RepBox}}$ (i.e., the number of pairs involved in the pairwise repulsion calculation between predicted boxes). Specifically, if there are $N_f$ foreground predicted boxes, then:
$$M = \frac{N_f (N_f - 1)}{2}$$
where $\mathrm{smooth\_ln}(\cdot)$ is a piecewise log-linear penalty function that transitions smoothly from logarithmic to linear growth at the threshold $\sigma$ (RepGT uses the threshold $\sigma_{\text{repgt}}$, and RepBox uses the threshold $\sigma_{\text{rep}}$). It imposes strong penalties in cases of high overlap while keeping the training gradients stable. It is defined as:
$$\mathrm{smooth\_ln}(x) = \begin{cases} -\ln(1 - x), & x < \sigma \\ \dfrac{x - \sigma}{1 - \sigma} - \ln(1 - \sigma), & x \geq \sigma \end{cases}$$
For the RepBox term, self-pairs and prediction pairs associated with the same ground-truth object are excluded to avoid gradient conflicts. This term suppresses overlapping predictions and improves localization accuracy in dense scenes. Both $\mathcal{L}_{\text{RepGT}}$ and $\mathcal{L}_{\text{RepBox}}$ are applied only to high-confidence foreground samples, with gradients decoupled between prediction and ground-truth boxes to ensure stable optimization.
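The repulsion terms can be sketched as follows, assuming the IoG values and the filtered pairwise IoU values have already been gathered (self-pairs and same-ground-truth pairs removed). The threshold values are illustrative assumptions, not values reported by the paper.

```python
import math
import torch

def smooth_ln(x, sigma):
    """Piecewise log-linear penalty defined above: logarithmic below the
    threshold sigma, linear (with matching value at sigma) above it."""
    log_branch = -torch.log((1.0 - x).clamp_min(1e-9))
    lin_branch = (x - sigma) / (1.0 - sigma) - math.log(1.0 - sigma)
    return torch.where(x < sigma, log_branch, lin_branch)

def repulsion_loss(iog, pair_iou, sigma_repgt=0.9, sigma_rep=0.5):
    """Sketch of L_rep = L_RepGT + L_RepBox.

    iog: (N,) IoG between each foreground prediction and its matched GT box;
    pair_iou: (M,) pairwise IoU between foreground predictions.
    """
    l_repgt = smooth_ln(iog, sigma_repgt).mean() if iog.numel() > 0 else iog.sum()
    l_repbox = smooth_ln(pair_iou, sigma_rep).mean() if pair_iou.numel() > 0 else pair_iou.sum()
    return l_repgt + l_repbox
```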

2.4.4. Occlusion-Weighted Classification Loss

To enhance the model’s classification robustness toward occluded targets, occlusion-level weights $w_i$ are introduced to adjust sample contributions in $\mathcal{L}_{\text{cls}}^{\text{occ}}$:
$$\mathcal{L}_{\text{cls}}^{\text{occ}} = \frac{1}{\sum_{i=1}^{N} w_i} \sum_{i=1}^{N} w_i \big[ -y_i \log(p_i) - (1 - y_i) \log(1 - p_i) \big]$$
The sample weight $w_i$ is determined by the occlusion level $o_i \in \{0, 1, 2, 3, 4\}$ of the $i$-th target, and two modes are supported:
$$w_i = \begin{cases} \alpha + \dfrac{o_i}{4} (\beta - \alpha), & \text{linear mode} \\ \mathrm{lookup}(o_i), & \text{custom mode} \end{cases}$$
The linear mode interpolates the weight from α to β (for example, from 1.0 to 2.0). The custom mode can directly specify user-defined weights for each occlusion level, such as 0: 1.0, 1: 1.2, 2: 1.5, 3: 1.8, 4: 2.0. This strategy ensures that samples with severe occlusion have a greater influence in the classification gradient, prompting the model to learn more discriminative features from hard samples.
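The weighting scheme can be sketched for per-sample binary scores as below. The lookup table mirrors the example values quoted above; integration with the YOLO label assigner and multi-class targets is omitted, so this is an illustration rather than the exact training implementation.

```python
import torch
import torch.nn.functional as F

def occlusion_weighted_bce(logits, targets, occ_levels, mode="linear",
                           alpha=1.0, beta=2.0, table=None):
    """Occlusion-weighted classification loss sketch.

    logits/targets: (N,) per-sample binary scores and labels;
    occ_levels: (N,) integer occlusion level o_i in {0,...,4}.
    """
    if table is None:
        table = {0: 1.0, 1: 1.2, 2: 1.5, 3: 1.8, 4: 2.0}
    if mode == "linear":
        w = alpha + occ_levels.float() / 4.0 * (beta - alpha)        # w_i, linear mode
    else:
        w = torch.tensor([table[int(o)] for o in occ_levels],
                         dtype=logits.dtype, device=logits.device)   # w_i, custom mode
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    return (w * bce).sum() / w.sum()   # weight-normalised, as in the loss definition
```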
This joint loss embodies two complementary objectives: reducing redundant localizations through repulsion and enhancing classification robustness through occlusion-weighted supervision. Together, they enable the detector to more accurately identify and localize occluded targets in aerial photography scenarios.

3. VOD-UAV Dataset: For Occluded Vehicle Detection

To facilitate occlusion-aware vehicle detection in UAV imagery, we introduce VOD-UAV, a benchmark dataset comprising 1931 high-resolution (1920 × 1080) images, including 712 synthetic and 1219 real-world scenes (approximately a 1:2 ratio). Unlike existing datasets such as VisDrone [1] or UAVDT [2], which provide limited occlusion diversity and lack fine-grained visibility annotations, VOD-UAV offers structured, high-resolution imagery with explicit, first-of-its-kind occlusion-level labels, enabling discriminative learning and robust evaluation. All images contain at least one occluded vehicle, and the label format is fully compatible with the YOLO series, with the occlusion-level field appended as the last column, allowing users to optionally leverage it for weighted learning or performance stratification.

3.1. Occlusion Annotations

To ensure annotation consistency and interpretability, occlusion levels are assigned based on the proportion of visually occluded regions within each vehicle instance. Specifically, for horizontally or vertically aligned vehicles, the tight bounding box of each vehicle is evenly divided into eight rectangular sub-regions along its major axis. The occlusion level is determined by the number of sub-regions that are visually occluded by other objects in the scene: no occluded sub-region corresponds to Level 0; occlusion affecting up to two sub-regions corresponds to Level 1; occlusion spanning two to four sub-regions corresponds to Level 2; occlusion covering four to six sub-regions corresponds to Level 3; and occlusion affecting more than six sub-regions is assigned to Level 4, as illustrated in Figure 6.
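To make the rectangular-partition rule concrete, the following sketch maps the count of occluded sub-regions (out of eight) to a level. How exact boundary counts of two, four, and six sub-regions are assigned is an assumption (here the lower level), since the text leaves these boundaries implicit.

```python
def occlusion_level_from_subregions(num_occluded: int) -> int:
    """Sketch of the rectangular-partition labelling rule (eight sub-regions)."""
    assert 0 <= num_occluded <= 8
    if num_occluded == 0:
        return 0   # Level 0: no occlusion
    if num_occluded <= 2:
        return 1   # Level 1: up to two sub-regions occluded
    if num_occluded <= 4:
        return 2   # Level 2: two to four sub-regions occluded
    if num_occluded <= 6:
        return 3   # Level 3: four to six sub-regions occluded
    return 4       # Level 4: more than six sub-regions occluded
```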
For oblique vehicles with arbitrary orientations (e.g., around 45°), where rectangular partitioning becomes ambiguous, a fan-shaped partitioning strategy is adopted. In this case, the vehicle is partitioned into multiple concentric fan-shaped sectors originating from the vehicle front, with the maximum radius set to four units. Occlusion within the innermost sector corresponds to Level 1, while occlusion extending to the annular regions between radii 1-2, 2-3, and 3-4 is assigned to Levels 2, 3, and 4, respectively, as illustrated in Figure 7.
Based on this strategy, only true occlusion caused by other objects in the scene is considered when assigning occlusion levels, whereas objects partially truncated by image boundaries are excluded from occlusion labeling.
All occlusion annotations were performed by a single experienced annotator over a five-month period, ensuring consistent application of the annotation criteria. While this design choice helps maintain internal consistency, it may introduce a degree of subjectivity, particularly for borderline cases between adjacent occlusion levels (e.g., between Levels 2 and 3 or Levels 3 and 4). This aspect is therefore considered a potential limitation of the dataset. To improve transparency and reproducibility, detailed annotation guidelines and representative examples for ambiguous cases will be released together with the dataset.
Importantly, the occlusion level is appended as the last column in the YOLO-format label file, allowing it to be optionally ignored during training. This design preserves compatibility with standard YOLO pipelines while providing additional flexibility and generalization.
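A minimal parsing sketch illustrates the extended label format. It assumes the standard YOLO layout of class index followed by normalized center coordinates and box size, with the occlusion level appended as a sixth field, as described above; the exact field conventions should be checked against the released dataset documentation.

```python
from dataclasses import dataclass

@dataclass
class VODLabel:
    cls_id: int   # vehicle category index
    cx: float     # normalised box centre x
    cy: float     # normalised box centre y
    w: float      # normalised box width
    h: float      # normalised box height
    occ: int      # occlusion level 0-4, appended as the last column

def parse_vod_uav_line(line: str) -> VODLabel:
    """Parse one VOD-UAV label line (assumed layout: cls cx cy w h occ)."""
    cls_id, cx, cy, w, h, occ = line.split()
    return VODLabel(int(cls_id), float(cx), float(cy), float(w), float(h), int(occ))

# standard YOLO pipelines can simply drop the extra column:
# cls_id, cx, cy, w, h = line.split()[:5]
```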

3.2. Synthetic Subset

The synthetic subset comprises 712 high-resolution images generated using Unreal Engine 5 (UE5). This virtual environment allows precise control over scene layout, lighting, vehicle density, and occlusion configurations, supporting diverse urban contexts, including:
  • High-rise metropolitan scenes with strong shadow interference.
  • Suburban roads embedded in complex environments.
  • Tree-lined park-like settings with variable visibility.
As illustrated in Figure 8, these three representative scenarios respectively emphasize variations in spatial layout, lighting conditions, and traffic density. By explicitly modeling these factors across distinct environments, the synthetic subset provides a broad spectrum of urban diversity, offering controllable yet challenging data that complements the complexity of real UAV imagery.
To further enhance this diversity, scenes are captured from multiple UAV altitudes and viewpoints, covering both top-down and oblique perspectives. Within these settings, four representative vehicle categories—car, van, truck (TR), and bus—are included, and each instance is annotated with both a bounding box and a five-level occlusion label. This ensures that the synthetic subset not only enriches the variability of visual conditions but also delivers fine-grained supervisory signals for occlusion-aware learning.

3.3. Real-World Subset

The real-world portion contains 1219 manually selected images from VisDrone [1], covering eight vehicle categories: bicycle (BC), car, van, truck (TR), tricycle (TC), awning-tricycle (AT), bus, and motor (MO). All annotations were carefully refined, and each object was re-annotated with its occlusion level in YOLO format, with the final column indicating the occlusion level.
By combining synthetic and real-world images, VOD-UAV provides both controlled, richly annotated samples and diverse, realistic scenes. The synthetic subset ensures high-quality training data with balanced coverage of common vehicles, while the real-world subset introduces additional categories and naturally occurring occlusion scenarios.

3.4. Data Distribution

As shown in Figure 9, most categories contain sufficient samples across different occlusion levels (typically more than 30 instances per level), while the most severe occlusion cases (Level 4) are less frequent, reflecting realistic urban traffic distributions. Specifically, the car and van categories dominate the dataset, with car instances exceeding 30,000 and exhibiting a balanced distribution across all five occlusion levels. In contrast, rare vehicle types such as awning-tricycles and buses contain relatively fewer samples, yet still maintain coverage across multiple occlusion levels to support reliable training and evaluation.
From the perspective of global occlusion statistics (Figure 9b), non-occluded vehicles (Level 0) account for the majority proportion at 81%. Slight (Level 1) and moderate occlusion (Level 2) cases together contribute about 13%, while heavy (Level 3) and extreme occlusion (Level 4) remain relatively rare, representing 3.3% and 2.8%, respectively. This long-tailed distribution mirrors real-world UAV surveillance scenarios, where most vehicles remain partially visible but heavily occluded targets still occur in dense traffic conditions or complex environments such as intersections.
To ensure the reliability of subsequent experiments, Figure 10 shows the distribution of vehicle categories and occlusion levels after splitting the dataset into training and validation sets at a 7:3 ratio.

3.5. Contribution and Advantages

By combining synthetic and real-world images with fine-grained, first-of-its-kind occlusion annotations, VOD-UAV provides a unique benchmark for systematic evaluation of occlusion-robust detection algorithms. The dataset supports controlled experiments on synthetic scenes, realistic testing on natural UAV imagery, and flexible utilization of occlusion labels. Its design enables discriminative learning under varying visibility conditions and facilitates the development of novel occlusion-aware methods. Section 4.5.1 further investigates the impact of synthetic-to-real ratios and occlusion stratification on detection performance, highlighting the dataset’s utility in advancing UAV-based occluded vehicle detection research.

4. Experiments

4.1. Dataset

As VOD-UAV is the first UAV dataset that provides fine-grained occlusion annotations for vehicles, it serves as the primary benchmark for evaluating occlusion-aware detection in this study. Most experiments are therefore conducted on VOD-UAV. Existing UAV datasets, such as VisDrone [1] and UAVDT [2], do not provide explicit occlusion-level labels, which limits their direct applicability for training or evaluating occlusion-aware models. Nevertheless, to further examine the generalization ability of the proposed method, we additionally conduct experiments on VisDrone, a widely used UAV benchmark without occlusion annotations. In this setting, the occlusion-level field is simply ignored during training and evaluation. VOD-UAV consists of high-resolution images from both synthetic and real-world scenes, covering multiple vehicle categories, diverse occlusion levels, and varied viewpoints. Its label format is fully compatible with the YOLO series, with the occlusion-level field optionally used for weighted loss design or performance stratification. This flexible design enables systematic evaluation of standard detection performance on both occlusion-annotated and non-annotated datasets, while allowing a focused assessment of robustness under partial visibility conditions on VOD-UAV.

4.2. Evaluation Metrics

To quantitatively assess the performance of occluded vehicle detection, we adopt several widely used evaluation metrics in object detection. These metrics provide a comprehensive view of the model’s detection accuracy, localization precision, and sensitivity under varying levels of occlusion. The primary metrics include Precision, Recall, Average Precision (AP), and mean Average Precision (mAP), evaluated at an IoU threshold of 0.5 and averaged over IoU thresholds from 0.5 to 0.95; a short numerical sketch of the counting-based metrics follows the list below.
  • Precision (P) measures the proportion of true positive detections among all positive predictions:
    $\mathrm{Precision} = \dfrac{TP}{TP + FP}$
  • Recall (R) measures the proportion of true positive detections among all ground-truth objects:
    $\mathrm{Recall} = \dfrac{TP}{TP + FN}$
  • F1-score provides a harmonic mean of precision and recall, reflecting a balance between the two:
    $F_1 = \dfrac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$
  • Average Precision (AP) is computed as the area under the precision-recall (PR) curve:
    $AP = \displaystyle\int_0^1 p(r)\, dr$
    where p ( r ) is the precision as a function of recall.
  • Mean Average Precision (mAP) is the mean of AP over all object categories. We report both mAP at an IoU threshold of 0.5:
    $\mathrm{mAP}_{50} = \dfrac{1}{C} \displaystyle\sum_{c=1}^{C} AP_c^{\mathrm{IoU}=0.5}$
    and mAP averaged over multiple IoU thresholds (from 0.5 to 0.95 with a step size of 0.05), following the COCO evaluation protocol:
    $\mathrm{mAP}_{50:95} = \dfrac{1}{10 \times C} \displaystyle\sum_{t=0.5}^{0.95} \sum_{c=1}^{C} AP_c^{\mathrm{IoU}=t}$
    where $AP_c$ denotes the average precision of class $c$, and $C$ is the total number of target categories in the dataset.
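As referenced above, the counting-based metrics can be computed directly from matched detections. The example counts are hypothetical, and matching at an IoU threshold of 0.5 is assumed.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Precision, recall, and F1 from per-class detection counts; a TP is a
    prediction matched to a ground-truth box at the chosen IoU threshold."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# e.g. 80 matched detections, 20 false alarms, 10 misses:
# precision_recall_f1(80, 20, 10) -> (0.800, 0.889, 0.842)
```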

4.3. Training Details

All experiments are conducted on a workstation equipped with dual NVIDIA RTX 4090 GPUs (24GB each), running Ubuntu 22.04 with PyTorch 2.0 and CUDA 11.7. The model is implemented based on the official Ultralytics YOLOv11 repository, with custom modifications for occlusion-aware modules and visibility-weighted loss functions.
The model is trained from scratch without using any pre-trained weights. We use a batch size of 8 and an input resolution of 640 × 640. The training process spans 300 epochs using synchronized dual-GPU parallelism. To enhance model robustness against varying occlusion patterns and spatial configurations, Mosaic data augmentation is applied during training.
For optimization, we adopt the AdamW optimizer, with an initial learning rate $lr_0 = 0.01$ and a final learning rate factor $lr_f = 0.01$. The momentum parameter is set to 0.937, and the weight decay coefficient is fixed at 0.0005. A warm-up strategy is employed during the first 4 epochs, where the momentum is gradually increased from 0.8 to its target value, and the bias learning rate is initialized to 0.1 to stabilize early-stage training.
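For orientation, a hypothetical Ultralytics-style training call approximating the reported settings might look as follows. The model scale, dataset YAML name, and the integration of the custom DOMino-YOLO modules and losses are assumptions not specified here, and the exact arguments should be checked against the released code.

```python
from ultralytics import YOLO

# building from a YAML (rather than a .pt checkpoint) trains from scratch
model = YOLO("yolo11l.yaml")            # assumed model scale
model.train(
    data="vod_uav.yaml",                # hypothetical dataset config
    epochs=300, imgsz=640, batch=8, device=[0, 1],
    optimizer="AdamW", lr0=0.01, lrf=0.01,
    momentum=0.937, weight_decay=0.0005,
    warmup_epochs=4, warmup_momentum=0.8, warmup_bias_lr=0.1,
    mosaic=1.0,                         # Mosaic augmentation enabled
)
```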
The dataset is split in a 7:3 ratio into training and validation sets. Unless otherwise specified, all evaluation follows standard object detection protocols.

4.4. Experiment Results

4.4.1. Quantitative Analysis

To evaluate the effectiveness of our proposed framework under occlusion, we compare DOMino-YOLO with a wide range of YOLOv11-based state-of-the-art variants on the VOD-UAV dataset, as shown in Table 1. Our model consistently outperforms all baselines across most vehicle categories and all evaluation metrics. In addition, we further analyze several representative non-YOLO detectors to provide a broader perspective on occlusion-aware vehicle detection.
From an efficiency perspective, DOMino-YOLO achieves a favorable balance between detection accuracy and computational cost. As reported in Table 1, our model operates at 256.1 FPS with 56.4 GFlops, maintaining real-time inference capability while delivering the best overall accuracy. It should be noted that the inference speed (FPS) and computational complexity (GFlops) of FCOS, Faster R-CNN, and DETR are not reported, as these models were not re-implemented under the same YOLO11 framework and hardware settings, making direct comparison unfair. Compared with high-capacity variants such as YOLO11-SK, which significantly increase computational complexity (102.3 GFlops) at the expense of inference speed, DOMino-YOLO attains higher mAP with nearly half the computational cost. Meanwhile, relative to lightweight YOLOv11 variants (e.g., Rep, DCN, or ImplicitHead), our framework introduces only moderate overhead, yet yields consistent and substantial performance gains under occlusion.
Specifically, DOMino-YOLO achieves the highest overall mAP50 of 0.420 and mAP50-90 of 0.293, outperforming the strongest existing baselines such as YOLO11-SK, which achieves 0.409 and 0.282 respectively. In categories characterized by severe occlusion, including BUS, MO, TC, and AT, our framework shows notable improvements. For example, the AP50 in these challenging cases increases by as much as 0.09 compared with competitive models such as YOLO11-Rep and YOLO11-ImplicitHead, demonstrating the effectiveness of our occlusion-aware design in recovering visibility-compromised targets.
These results validate the effectiveness of our three-pronged design. The DCEM module improves shape alignment and localization under irregular contours, the VASA module effectively retains visible structure while preserving scale-sensitive semantics, and the CSIM-Head mitigates contextual noise around occluded instances. In addition, our visibility-weighted loss contributes to more balanced learning across different occlusion levels, which is particularly evident in improved detection accuracy for low-visibility targets.
In contrast to designs that emphasize architectural complexity (e.g., Biformer or FasterNeXt), our method maintains a lightweight and real-time structure while delivering superior performance. This indicates that occlusion-specific modeling, rather than generic backbone upgrades, is key to improving detection in UAV-based occlusion scenarios.
Figure 11 shows a radar-chart-based comparison of precision and recall across five occlusion levels (OL 0-4) for eight state-of-the-art detectors, including RCS-OSA, CAFERE, GhostSlimFPN [62], AIFI, BiFormer, DCN, LOW-FAM [63], and DOMino-YOLO (OUR), providing an intuitive per-category view of robustness under increasing occlusion.
As shown in the radar charts, DOMino-YOLO forms a consistently larger and more uniform radar profile than competing methods, indicating superior and more stable performance across multiple occlusion levels. This advantage is especially pronounced in recall, which is critical for occlusion-aware UAV detection, where missed targets pose greater risks than occasional false positives.
Quantitatively, across comparative experiments involving 8 vehicle categories and 5 occlusion levels, DOMino-YOLO achieves the best precision in 21 out of 40 cases and the best recall in 25 out of 40 cases, further confirming its robustness under diverse occlusion conditions.
For lightly or unoccluded targets (Level 0), DOMino-YOLO achieves competitive precision and the best recall in nearly all classes (e.g., TC: 0.626 recall vs. others ≤ 0.603; BUS: 0.894 recall vs. others ≤ 0.893), showing that our model retains strong generalization even without occlusion.
As occlusion severity increases from level 1 to level 3, DOMino-YOLO demonstrates consistent advantages in both recall and precision. For example, in the MO category at occlusion level 3, DOMino-YOLO achieves a precision of 0.353 and a recall of 0.525, which are considerably higher than those of the next best-performing model, LOW-FAM, with 0.348 precision and 0.150 recall. In the BUS category at occlusion level 4, DOMino-YOLO achieves the highest recall value of 0.853 among all models, highlighting its strong capability to detect severely occluded targets.
However, a consistent observation across all evaluated models, including DOMino-YOLO, is a marked decline in precision for BC (bicycle) and AT (awning-tricycle) at occlusion level 4, with precision even dropping to zero in some cases. This performance degradation should be attributed to a dataset-level limitation rather than a model-specific deficiency, as it is primarily caused by the severe scarcity of training samples under the highest occlusion conditions, particularly for BC and AT at level 4. Such extreme data imbalance limits the ability of any model to learn reliable feature representations for these rare and heavily occluded categories. These results highlight the importance of maintaining a more balanced data distribution across both occlusion levels and object categories in future dataset construction.
Compared to models like CAFERE and GhostSlimFPN that struggle under higher occlusion (frequently showing near-zero recall or precision), DOMino-YOLO benefits from three key innovations: (1) the DCEM that adapts receptive fields to fragmented shapes; (2) the VASA module that amplifies partial cues; and (3) the CSIM-Head that filters contextual noise. These modules collectively bolster the model’s robustness to structural and spatial incompleteness, as evidenced by DOMino-YOLO’s superior recall under OL = 2–4.
Notably, although DOMino-YOLO achieves top-tier results, its performance in BC under OL = 4 (Precision: 0.388, Recall: 0.147) still reflects difficulty in detecting highly occluded, thin-structured objects like bicycles. This suggests that further gains could be realized by integrating topology-aware modules or leveraging generative augmentation for minority classes with severe occlusion.
In summary, DOMino-YOLO demonstrates consistent advantages across occlusion levels and categories, especially under severe occlusion, where the majority of baselines degrade substantially. These results validate the design of our occlusion-specific architecture and loss formulation, positioning DOMino-YOLO as a strong candidate for reliable UAV-based occlusion-aware detection.

4.4.2. Qualitative Analysis

Figure 12 presents the detection results of the top three models on the VOD-UAV dataset under three representative occlusion levels: slight occlusion, moderate occlusion, and heavy occlusion. We select three representative real-world images, each containing a high density of slightly, moderately, or heavily occluded vehicles. Missed detections and false positives are highlighted in red bounding boxes for clarity.
For slight occlusion, all three models successfully detect most vehicles, with only marginal differences in detection accuracy. However, both YOLO11-RepHELAN and YOLO11-SK fail to detect a tricycle in the upper-left corner. As shown previously in Figure 9, tricycles constitute a very small portion of the dataset, making them inherently more challenging to learn. Remarkably, our DOMino-YOLO is still able to correctly detect the tricycle, demonstrating its superior ability to generalize to underrepresented categories through its occlusion-aware structural aggregation and robust feature encoding.
For moderate occlusion, YOLO11-RepHELAN shows multiple missed detections and even false positives, while YOLO11-SK again misses the tricycle. In contrast, DOMino-YOLO achieves complete and accurate detection of all targets, highlighting its robustness in scenarios where discriminative features are partially obscured.
For heavy occlusion, the detection difficulty increases significantly. As shown in Figure 12, all three models exhibit missed detections and false detections, which are highlighted by the red bounding boxes. In particular, one motorcycle located at the bottom of the image is fully visible but is still incorrectly missed by all models. We attribute this failure to strong background interference, where complex contextual signals dominate feature representations and suppress true object responses. In addition, YOLO11-RepHELAN fails to detect a heavily occluded vehicle entirely. Most notably, both YOLO11-RepHELAN and YOLO11-SK misclassify a van as a car, whereas DOMino-YOLO correctly identifies it even under partial occlusion. This demonstrates DOMino-YOLO’s enhanced discriminative capability, which stems from its context-suppressed implicit modulation head that filters background noise and reinforces fine-grained object features.
In summary, the comparative results in Figure 12 clearly illustrate that while existing variants struggle with underrepresented classes, moderate-to-severe occlusions, and background interference, DOMino-YOLO consistently achieves more accurate detection and classification. Its improvements are particularly evident in handling small-sample categories and in distinguishing visually similar vehicle types under occlusion, thereby validating the effectiveness of its occlusion-aware design.
Figure 13 illustrates the model’s attention regions across three representative vehicle categories: car, truck, and van. These classes were selected based on their prevalence and structural diversity within the dataset.
In Figure 13a, we focus on the car category, which is the most abundant class in our dataset. From left to right, the images show cars being occluded by buildings at increasing severity. The attention maps indicate that the model dynamically adjusts its focus according to the visible parts of the vehicle. As occlusion intensifies, the attention region shifts and contracts toward the remaining visible areas, demonstrating the model’s adaptability to partial visibility.
Figure 13b presents attention responses for trucks under varying occlusion angles caused by surrounding trees. In the first four cases, the model predominantly focuses on the rear portion of the truck, suggesting that the rear-end structure carries strong class-specific cues. When the rear is completely occluded, the attention expands to include the visible frontal area of the vehicle. This indicates the model’s flexibility in utilizing alternative visual features for recognition. However, in the final example, where the truck is almost entirely occluded, the attention region becomes extremely limited. Although the model still predicts the correct class, the bounding box is significantly undersized—highlighting the detrimental impact of extreme occlusion on localization accuracy.
In Figure 13c, we analyze vans under similar occlusion conditions. The model initially focuses on the frontal region of the vehicle. As the occlusion of the front increases (shown in the third to sixth images), the attention gradually shifts and expands to encompass a broader area of the vehicle, suggesting a redistribution of attention when key features are partially obstructed.
These observations provide evidence that our model adapts its attention to available visual cues under different occlusion configurations, and also reveal class-specific attention behaviors that are crucial for robust detection in UAV imagery.

4.4.3. Generalized Analysis

In the VisDrone generalization experiments, occlusion annotations are not available. Therefore, while DCEM, VASA, and CSIM-Head are retained to enhance feature representation, all occlusion-aware supervision is disabled, and the model is trained using the standard YOLO loss without occlusion weights.
As shown in Table 2, different detection frameworks exhibit distinct performance characteristics on the VisDrone2019 dataset. The two-stage detector Faster R-CNN achieves the best performance in terms of mAP50 and medium-scale objects (APm), reflecting its strong capability in modeling relatively complete objects with sufficient visual features. DETR attains the highest mAP75, demonstrating the advantage of Transformer-based architectures under stricter IoU evaluation criteria and their effectiveness in global feature modeling.
In contrast, the proposed DOMino-YOLO achieves the best results in overall mAP50-90 as well as for very small (APvt) and small objects (APt). This advantage can be attributed to its occlusion-aware design tailored for partial visibility. Specifically, the DCEM alleviates spatial misalignment caused by scale variations, the VASA module effectively extracts discriminative multi-scale features from partially visible regions, and the CSIM-Head reduces false activations in dense scenes through adaptive context suppression. In addition, the proposed OAR-Loss further enhances the discrimination of heavily occluded and small-scale instances.
It is worth noting that no occlusion annotations are used during training on the VisDrone dataset, and the occlusion-aware components are explicitly ignored in this setting. Despite this, DOMino-YOLO still maintains competitive or superior performance on small-scale objects and under stricter evaluation metrics, indicating that the proposed method does not overfit to occlusion-specific datasets and exhibits good generalization capability in complex UAV scenarios.

4.5. Ablation Study

4.5.1. Ablation Analysis on the Ratio of Synthetic and Real Data

Table 3 presents an ablation study on the impact of different synthetic-to-real training data ratios on detection performance in the VOD-UAV dataset. For reproducibility and clarity, the number of training images used in each setting is explicitly reported as N train . Six configurations are evaluated, ranging from a real-only baseline (0:1) to synthetic-dominant ratios, with performance assessed using per-category AP50, overall mAP50, and mAP50–90.
The results indicate that increasing the number of training samples alone does not necessarily lead to improved performance. Although the 1:2 and 2:3 settings employ relatively large training sets ( N train = 1274 and 1246), their performance remains inferior to the 3:2 configuration, which uses fewer images ( N train = 829 ) yet achieves the highest mAP50 (0.427) and mAP50–90 (0.311). This observation suggests that detection performance is more sensitive to data composition and distribution balance than to dataset scale alone.
The real-only baseline (0:1) performs competitively on certain tail categories such as BC and AT, but exhibits notable performance degradation on occlusion-heavy and structurally complex categories including Truck, Van, and Bus, highlighting the limitations of relying solely on real-world data. As synthetic data is gradually introduced, detection robustness under occlusion consistently improves, particularly for Truck and Bus, accompanied by steady gains in mAP50–90. These results indicate that synthetic samples effectively enrich occlusion diversity and challenging spatial configurations that are difficult to capture in real UAV imagery.
Notably, the optimal performance is achieved at a synthetic-to-real ratio of 3:2. Despite not using the largest training set, this configuration yields peak AP values for occlusion-sensitive categories such as Truck (0.763) and Bus (0.800), while also delivering the strongest overall metrics. However, excessive reliance on synthetic data may negatively impact certain tail categories; for example, the AP for BC decreases from 0.0826 in the real-only setting to 0.0274 in the 2:1 configuration. Overall, this ablation study demonstrates that the effectiveness of hybrid training depends not only on data quantity but also on the balance and complementarity between synthetic and real samples.

4.5.2. Ablation Study of DOMino-YOLO Components

Table 4 presents an extensive ablation study that evaluates the individual and joint contributions of each module in DOMino-YOLO, including the DCEM, VASA, CSIM-Head, CA, and the Occlusion-Aware Repulsion Loss (OAR-Loss). All experiments are conducted on the VOD-UAV dataset using a synthetic-to-real data ratio of 3:2, which Section 4.5.1 showed to yield the best balance between generalization and robustness under occlusion.
Compared to the baseline YOLOv11, which achieves a moderate mAP50 of 0.343 and mAP50–90 of 0.245, the addition of each proposed module leads to consistent performance gains across occluded vehicle categories. Specifically, DCEM improves structural alignment and contributes a 3.2-point gain in mAP50 while maintaining real-time speed (553.9 FPS); VASA enhances multi-scale, visibility-sensitive features and slightly improves performance while keeping the model efficient (714.6 FPS); and the CSIM-Head, though lightweight, brings notable benefits in cluttered contexts with negligible impact on speed (875.7 FPS). The introduction of OAR-Loss further reinforces the model's ability to separate closely packed instances, particularly under severe occlusion, while the CA module contributes more modest improvements.
When integrated together, the proposed modules exhibit a clear synergistic effect. Combining the DCEM and VASA modules alone raises mAP50 to 0.422, although this also brings a substantial increase in model complexity, with the parameter count reaching 51.2 million. Enabling all modules, including the CA component and the OAR-Loss, yields the best overall performance, with mAP50 reaching 0.443 and mAP50–90 improving to 0.322. This full configuration achieves consistent improvements across all object categories, with the largest gains observed for BC, AT, and MO, which are typically harder to detect due to limited visible features or cluttered backgrounds.
Although the full model grows to 56.4M parameters, DOMino-YOLO still runs at 256.1 FPS on a dual RTX 4090 workstation, far above the 20–30 FPS typically attainable on UAV onboard platforms. This margin suggests that real-time operation remains feasible once the model is adapted to embedded hardware and its inference is optimized. Furthermore, the modular design allows computationally intensive components (e.g., CA or OAR-Loss) to be disabled, enabling flexible accuracy-efficiency trade-offs for resource-constrained deployment.
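In practice, this kind of component ablation is easiest to run when every module sits behind a configuration flag. The sketch below shows one hypothetical way to organize such a grid; the DominoConfig dataclass and its field names are illustrative assumptions and do not correspond to the released training code.

```python
# Hypothetical ablation-grid helper: each DOMino-YOLO component sits behind a
# boolean flag so subsets of modules can be enabled for accuracy/efficiency
# trade-offs. Names are illustrative, not the released API.
from dataclasses import dataclass, asdict
from itertools import product

@dataclass(frozen=True)
class DominoConfig:
    dcem: bool = False       # deformable alignment module
    vasa: bool = False       # visibility-aware structural aggregation
    csim_head: bool = False  # context-suppressed implicit modulation head
    ca: bool = False         # CA attention component (as named in Table 4)
    oar_loss: bool = False   # occlusion-aware repulsion loss

def ablation_grid(single_modules_only: bool = False):
    """Yield configurations: one module at a time, or the full 2^5 grid."""
    names = list(DominoConfig.__dataclass_fields__)
    if single_modules_only:
        for name in names:
            yield DominoConfig(**{name: True})
    else:
        for flags in product([False, True], repeat=len(names)):
            yield DominoConfig(**dict(zip(names, flags)))

for cfg in ablation_grid(single_modules_only=True):
    print(asdict(cfg))  # e.g. {'dcem': True, 'vasa': False, ...}
```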

4.5.3. Ablation Analysis on Hyperparameters

A comprehensive analysis of Table 5, Table 6 and Table 7 leads to the following observations.
First, Table 5 shows that the repulsion weight λ_rep plays a critical role in dense and occluded scenarios. When λ_rep = 0, the model degenerates into a form without repulsion constraints, resulting in redundant predictions and degraded recall for heavily occluded targets. As λ_rep increases, spatial separation among predictions is gradually improved, with λ_rep = 1.0 achieving the best balance between overall detection accuracy and occluded-object recall. Further increasing λ_rep to 2.0 introduces overly strong suppression, which slightly degrades performance.
Second, Table 6 validates the effectiveness of the occlusion-weighted classification strategy. Both linear and custom weighting schemes outperform the baseline without occlusion weighting, indicating that explicitly emphasizing occluded samples improves classification robustness. Among them, the custom weighting scheme achieves the best performance, especially for highly occluded instances, suggesting that non-uniform modeling across occlusion levels is more effective. However, excessively strong weights may disrupt sample balance and marginally affect overall accuracy.
Finally, Table 7 investigates the influence of the classification loss weight λ_cls. A small value (e.g., 0.25) provides insufficient supervision for occluded targets, while a large value (e.g., 1.0) causes the classification term to dominate the optimization, potentially compromising localization accuracy. The results indicate that λ_cls = 0.5 offers the most favorable trade-off between classification robustness and localization precision.
Overall, these ablation studies confirm both the effectiveness and the stability of the proposed Occlusion-Aware Repulsion Loss, with consistent performance trends observed across a reasonable range of hyperparameter settings. In all experiments, λ_box is fixed to 7.5 following the default YOLO setting, λ_rep = 1.0 and λ_cls = 0.5 are adopted, σ_rep^gt and σ_rep are fixed as defined in the loss formulation, and the custom occlusion-weight scheme [1.0, 1.2, 1.5, 1.8, 2.0] is used as the final configuration.
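To make the final configuration explicit, the sketch below shows how the reported weights could be combined into a single training objective. Only the weighting scheme mirrors the values stated above; the individual box, repulsion, and classification terms are treated as precomputed placeholders, and the function oar_total_loss is an illustrative assumption rather than the authors' implementation.

```python
# Sketch of how the loss terms and weights from the ablation combine:
# total = lambda_box * L_box + lambda_rep * L_rep + lambda_cls * L_cls_weighted,
# with lambda_box = 7.5, lambda_rep = 1.0, lambda_cls = 0.5 and the custom
# occlusion weights [1.0, 1.2, 1.5, 1.8, 2.0] for occlusion levels 0-4.
import torch

OCC_WEIGHTS = torch.tensor([1.0, 1.2, 1.5, 1.8, 2.0])  # custom scheme, levels 0..4

def oar_total_loss(box_loss: torch.Tensor,
                   rep_loss: torch.Tensor,
                   cls_loss_per_obj: torch.Tensor,   # per-object classification loss
                   occ_levels: torch.Tensor,         # integer occlusion level per object
                   lambda_box: float = 7.5,
                   lambda_rep: float = 1.0,
                   lambda_cls: float = 0.5) -> torch.Tensor:
    """Combine box regression, repulsion, and visibility-weighted classification."""
    w = OCC_WEIGHTS.to(cls_loss_per_obj.device)[occ_levels]
    cls_loss = (w * cls_loss_per_obj).sum() / w.sum()  # weighted mean over objects
    return lambda_box * box_loss + lambda_rep * rep_loss + lambda_cls * cls_loss
```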

5. Conclusions

This paper proposes DOMino-YOLO, an occlusion-robust vehicle detection framework for UAV imagery, together with VOD-UAV, the first aerial dataset annotated with fine-grained occlusion levels that integrates real and synthetic images for controllable diversity.
DOMino-YOLO addresses key challenges in occluded scenes by alleviating spatial ambiguity with DCEM, mitigating visual incompleteness via VASA, and suppressing contextual interference using the CSIM-Head. In addition, an occlusion-aware repulsion loss, which combines localization repulsion and visibility-weighted classification, further improves detection robustness in dense and occluded environments.
Extensive experiments on VOD-UAV demonstrate consistent superiority over state-of-the-art detectors, particularly under moderate and severe occlusion, while ablation studies verify the effectiveness of each component and the synthetic-real hybrid training strategy. Overall, this work provides a unified occlusion-aware solution for aerial object detection and a solid foundation for future research.
For rare categories such as bicycle and awning-tricycle, all evaluated methods exhibit a noticeable precision drop under level-4 occlusion due to the scarcity of severely occluded samples, reflecting an inherent dataset limitation. Future work will focus on enriching both real and synthetic data to better support these challenging cases. In addition, extending the proposed framework to weakly supervised and oriented object detection remains an important research direction.

Author Contributions

Conceptualization, T.F.; methodology, T.F.; software, T.F.; validation, T.F.; writing—original draft preparation, T.F.; writing—review and editing, B.Y.; visualization, T.F.; supervision, H.D., B.Y. and B.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original data presented in this study will be made openly available in a public repository at https://github.com/futianyi3, accessed on 25 December 2025.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Du, D.; Zheng, L.; Wang, L.; Li, Y. VisDrone-DET2019: The Vision Meets Drone Object Detection in Image Challenge Results. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 213–226. [Google Scholar]
  2. Du, D.; Wang, C.; Wang, L.; Zheng, L.; Li, Y.; Yuan, J. The unmanned aerial vehicle benchmark: Object detection and tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 370–386. [Google Scholar]
  3. Yu, X.; Gong, Y.; Jiang, N.; Ye, Q.; Han, Z. Scale match for tiny person detection. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; pp. 1257–1265. [Google Scholar]
  4. Cheng, G.; Zhou, P.; Han, J. Learning rotation-invariant convolutional neural networks for object detection in VHR optical remote sensing images. IEEE Trans. Geosci. Remote Sens. 2016, 54, 7405–7415. [Google Scholar] [CrossRef]
  5. Liu, Z.; Hu, J.; Weng, L.; Yang, Y. Rotated region based CNN for ship detection. In Proceedings of the IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; pp. 900–904. [Google Scholar]
  6. Razakarivony, S.; Jurie, F. Vehicle detection in aerial imagery: A small target detection benchmark. J. Vis. Commun. Image Represent. 2016, 34, 187–203. [Google Scholar] [CrossRef]
  7. Xia, G.-S.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; Zhang, L. DOTA: A Large-scale Dataset for Object Detection in Aerial Images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 3974–3983. [Google Scholar]
  8. Li, K.; Wan, G.; Cheng, G.; Meng, L.; Han, J. Object detection in optical remote sensing images: A survey and a new benchmark. ISPRS J. Photogramm. Remote Sens. 2020, 159, 296–307. [Google Scholar] [CrossRef]
  9. Yilmaz, C.; Maraş, B.; Arica, N.; Ertüzün, A.B. Creation of Annotated Synthetic UAV Video Dataset for Object Detection and Tracking. In Proceedings of the International Conference on Computer Vision (ICCV), Paris, France, 10–17 October 2023; pp. 1–4. [Google Scholar]
  10. Sama, A.K.; Sharma, A. Simulated UAV dataset for object detection. E3S Web Conf. 2023, 54, 02006. [Google Scholar] [CrossRef]
  11. Wang, J.; Teng, X.; Li, Z.; Yu, Q.; Bian, Y.; Wei, J. VSAI: A multi-view dataset for vehicle detection in complex scenarios using aerial images. Drones 2022, 6, 161. [Google Scholar] [CrossRef]
  12. Suo, J.; Wang, T.; Zhang, X.; Chen, H.; Zhou, W.; Shi, W. HIT-UAV: A high-altitude infrared thermal dataset for UAV-based object detection. Sci. Data 2023, 10, 227. [Google Scholar] [CrossRef]
  13. Zhao, Z.; Bo, K.; Hsu, C.-Y.; Liao, L. Lightweight UAV object detection algorithm based on improved YOLOv8. Intell. Data Anal. 2025, 29, 235–252. [Google Scholar] [CrossRef]
  14. Yu, C.; Shin, Y. MCG-RTDETR: Multi-convolution and context-guided network with cascaded group attention for object detection in UAV imagery. Remote Sens. 2024, 16, 3169. [Google Scholar] [CrossRef]
  15. Xiao, M.; Min, W.; Yang, C.; Song, Y. A novel network framework on simultaneous road segmentation and vehicle detection for UAV aerial traffic images. Sensors 2024, 24, 3606. [Google Scholar] [CrossRef] [PubMed]
  16. Bozcan, I.; Kayacan, E. AU-AIR: A multi-modal UAV dataset for low altitude traffic surveillance. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; pp. 8504–8510. [Google Scholar]
  17. Rahman, M.H.; Madria, S. An augmented dataset for vision-based UAV detection and tracking. In Proceedings of the 2023 IEEE Applied Imagery Pattern Recognition Workshop (AIPR), St. Louis, MO, USA, 27–29 September 2023; pp. 1–8. [Google Scholar]
  18. Zuo, G.-M.; Xu, L.-H. Recognition of partially occluded objects based on ARG model. J. Nanchang Inst. Aeronaut. Technol. 2003, 90, 217–241. [Google Scholar]
  19. Sovrano, V.A.; Bisazza, A. Recognition of partly occluded objects by fish. Anim. Cogn. 2008, 11, 161–166. [Google Scholar] [CrossRef]
  20. Lim, K.-B.; Du, T.-H.; Wang, Q. Partially occluded object recognition. Int. J. Comput. Appl. Technol. 2011, 40, 122–131. [Google Scholar] [CrossRef]
  21. Brahmbhatt, S. Detecting Partially Occluded Objects in Images. Doctoral Dissertation, University of Pennsylvania, Philadelphia, PA, USA, 2014. [Google Scholar]
  22. Ren, J.; Ren, M.; Liu, R.; Sun, L.; Zhang, K. An effective imaging system for 3D detection of occluded objects. In Proceedings of the 2021 4th International Conference on Image and Graphics Processing, Sanya, China, 1–3 January 2021; pp. 20–30. [Google Scholar]
  23. Cuhadar, C.; Tsao, H.N. A computer vision sensor for AI-accelerated detection and tracking of occluded objects. Adv. Intell. Syst. 2022, 4, 2100285. [Google Scholar] [CrossRef]
  24. Su, Y.; Sun, R.; Shu, X.; Zhang, Y.; Wu, Q. Occlusion-aware detection and re-id calibrated network for multi-object tracking. arXiv 2023, arXiv:2308.15795. [Google Scholar]
  25. Wang, Q.; Liu, H.; Peng, W.; Tian, C.; Li, C. A vision-based approach for detecting occluded objects in construction sites. Neural Comput. Appl. 2024, 36, 10825–10837. [Google Scholar] [CrossRef]
  26. Zhang, S.; Wen, L.; Bian, X.; Lei, Z.; Li, S.Z. Occlusion-aware R-CNN: Detecting pedestrians in a crowd. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 637–653. [Google Scholar]
  27. Wang, X.; Han, T.; Yan, S. Repulsion loss: Detecting pedestrians in a crowd. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 7774–7783. [Google Scholar]
  28. Ayvaci, A. Occlusions and Their Role in Object Detection in Video. Ph.D. Dissertation, University of California, Los Angeles, CA, USA, 2012. [Google Scholar]
  29. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef]
  30. Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2999–3007. [Google Scholar]
  31. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  32. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271. [Google Scholar]
  33. Redmon, J.; Farhadi, A. YOLOv3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar] [CrossRef]
  34. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. YOLOv4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar] [CrossRef]
  35. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. Scaled-YOLOv4: Scaling cross stage partial network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 13029–13038. [Google Scholar]
  36. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar]
  37. Wang, C.-Y.; Yeh, I.H.; Liao, H.-Y.M. You only learn one representation: Unified network for multiple tasks. arXiv 2021, arXiv:2105.04206. [Google Scholar] [CrossRef]
  38. Chen, J.; Wen, R.; Ma, L. Small object detection model for UAV aerial image based on YOLOv7. Signal Image Video Process. 2024, 18, 2695–2707. [Google Scholar] [CrossRef]
  39. Wang, C.-Y.; Yeh, I.H.; Liao, H.-Y.M. YOLOv9: Learning what you want to learn using programmable gradient information. In Proceedings of the Computer Vision-ECCV 2024, Milan, Italy, 29 September–4 October 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 1–21. [Google Scholar]
  40. Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: Fully convolutional one-stage object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9626–9635. [Google Scholar]
  41. Zhou, X.; Wang, D.; Krähenbühl, P. Objects as points. arXiv 2019, arXiv:1904.07850. [Google Scholar]
  42. Ringwald, T.; Sommer, L.; Schumann, A.; Beyerer, J.; Stiefelhagen, R. UAV-Net: A fast aerial vehicle detector for mobile platforms. In Proceedings of the International Conference on Computer Vision Workshops (ICCVW), Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
  43. Zhao, H.; Zhang, Y.; Hu, X. Improved YOLOv5 for object detection in UAV aerial images. In Proceedings of the International Conference on Mechatronic Engineering and Artificial Intelligence (MEAI 2023), Shenyang, China, 5–7 December 2024; Volume 13071, pp. 609–614. [Google Scholar]
  44. Fu, T.; Dong, H.; Yang, B.; Deng, B. DE-DFNet: Edge enhanced diversity feature fusion guided by differences in remote sensing imagery tiny object detection. Image Vis. Comput. 2025, 161, 105627. [Google Scholar] [CrossRef]
  45. Guo, B.; Zhang, H.; Wang, H.; Li, X.; Jin, L. Adaptive occlusion object detection algorithm based on OL-IoU. Sci. Rep. 2024, 14, 27644. [Google Scholar] [CrossRef] [PubMed]
  46. Fu, T.; Dong, H.; Yang, B.; Deng, B. TMBO-AOD: Transparent mask background optimization for accurate object detection in large-scale remote-sensing images. Remote Sens. 2025, 17, 1762. [Google Scholar] [CrossRef]
  47. Feng, Z.; Yang, B.; Deng, B. UTS-SAM: Adapting segment anything model for UAV target segmentation based on a complementary dual encoder. J. Supercomput. 2025, 81, 967. [Google Scholar] [CrossRef]
  48. Fu, T.; Yang, B.; Dong, H.; Deng, B. Enhanced tiny object detection in aerial images. In Proceedings of the Advances in Computer Vision-ECCV 2024, Milan, Italy, 29 September–4 October 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 149–161. [Google Scholar]
  49. Wang, J.; Chen, K.; Xu, R.; Liu, Z.; Loy, C.C.; Lin, D. CARAFE: Content-aware reassembly of features. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3007–3016. [Google Scholar]
  50. Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 764–773. [Google Scholar]
  51. Wang, R.; Shivanna, R.; Cheng, D.; Jain, S.; Lin, D.; Hong, L.; Chi, E. DCNv2: Improved deep & cross network and practical lessons for web-scale learning to rank systems. In Proceedings of the ACM SIGKDD, Virtual Event, 14–18 August 2021; pp. 1785–1797. [Google Scholar]
  52. Wang, W.; Dai, J.; Chen, Z.; Huang, Z.; Li, Z.; Zhu, X.; Hu, X.; Lu, T.; Lu, L.; Li, H.; et al. InternImage: Exploring large-scale vision foundation models with deformable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 14408–14419. [Google Scholar]
  53. Xiong, Y.; Li, Z.; Chen, Y.; Wang, F.; Zhu, X.; Luo, J.; Wang, W.; Lu, T.; Li, H.; Qiao, Y.; et al. Efficient deformable ConvNets: Rethinking dynamic and sparse operator for vision applications. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 5652–5661. [Google Scholar]
  54. Zhu, L.; Wang, X.; Ke, Z.; Zhang, W.; Lau, R.W.H. BiFormer: Vision transformer with bi-level routing attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 10323–10333. [Google Scholar]
  55. Li, X.; Wang, W.; Hu, X.; Yang, J. Selective kernel networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 510–519. [Google Scholar]
  56. Chen, J.; Kao, S.H.; He, H.; Zhuo, W.; Wen, S.; Lee, C.H.; Chan, S.H.G. Run, don’t walk: Chasing higher FLOPs for faster neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 12021–12031. [Google Scholar]
  57. Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C.; Xu, C. GhostNet: More features from cheap operations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 1580–1589. [Google Scholar]
  58. Yang, Z.; Guan, Q.; Zhao, K.; Yang, J.; Xu, X.; Long, H.; Tang, Y. Multi-branch auxiliary fusion YOLO with re-parameterization heterogeneous convolutional for accurate object detection. In Proceedings of the Computer Vision-ECCV 2024, Milan, Italy, 29 September–4 October 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 492–505. [Google Scholar]
  59. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs beat YOLOs on real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 16965–16974. [Google Scholar]
  60. Kang, M.; Ting, C.-M.; Ting, F.F.; Phan, R.C.W. RCS-YOLO: A fast and high-accuracy object detector for brain tumor detection. In Proceedings of the Medical Image Computing and Computer Assisted Intervention-MICCAI 2023, Vancouver, BC, Canada, 8–12 October 2023; Springer: Berlin/Heidelberg, Germany, 2023; pp. 600–610. [Google Scholar]
  61. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the Computer Vision-ECCV 2020, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 213–229. [Google Scholar]
  62. Li, H.; Li, J.; Wei, H.; Liu, Z.; Zhan, Z.; Ren, Q. Slim-neck by GSConv: A better design paradigm of detector architectures for autonomous vehicles. arXiv 2022, arXiv:2206.02424. [Google Scholar] [CrossRef]
  63. Azad, R.; Niggemeier, L.; Hüttemann, M.; Kazerouni, A.; Aghdam, E.K.; Velichko, Y.; Bagci, U.; Merhof, D. Beyond self-attention: Deformable large kernel attention for medical image segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 1287–1297. [Google Scholar]
  64. Weng, K.; Chu, X.; Xu, X.; Huang, J.; Wei, X. EfficientRep: An efficient RepVGG-style convnets with hardware-aware neural network design. arXiv 2023, arXiv:2302.00386. [Google Scholar]
Figure 1. Vehicle occlusion samples in the proposed VOD-UAV dataset under three representative scenarios.
Figure 2. The complete structure of DOMinoNet based on YOLOv11.
Figure 3. Architecture of the DCEM based on deformable convolutions.
Figure 4. Overview of the proposed VASA, which aggregates multi-stage features from RepVGG-, SR-, and ResSR-based blocks with SE recalibration to enhance visibility-aware representation.
Figure 5. Scale-aware detection head with implicit context suppression pipeline.
Figure 6. Illustration of vehicle occlusion levels.
Figure 7. Illustration of occlusion level annotation based on segmental and radial division strategies.
Figure 8. Illustration of representative urban scenarios with varying scene layout, lighting, and traffic density.
Figure 9. (a) Distribution of occlusion levels for each vehicle type. (b) Percentage of vehicles with different occlusion levels in VOD-UAV.
Figure 10. Distribution of vehicle categories and occlusion levels in the training and validation sets.
Figure 11. Precision and recall comparison across multiple occlusion levels for different vehicle categories.
Figure 12. Visualization of detection results for the top-3 models on the VOD-UAV dataset.
Figure 13. Region of interest visualization for (a) car, (b) truck, and (c) van under different occlusion conditions.
Table 1. Comparison with State-of-the-Art Methods on VOD-UAV. Per-category columns report AP50.
Methods | BC | CAR | VAN | TR | TC | AT | BUS | MO | mAP50 | mAP50–90 | GFlops | FPS
YOLO11-Rep [27] | 0.0278 | 0.672 | 0.361 | 0.536 | 0.0577 | 0.117 | 0.581 | 0.143 | 0.312 | 0.215 | 6.3 | 638.3
YOLO11-ImplicitHead [37] | 0.0340 | 0.678 | 0.371 | 0.552 | 0.0861 | 0.101 | 0.605 | 0.157 | 0.323 | 0.219 | 7.6 | 959.3
YOLO11-CARAFE [49] | 0.0324 | 0.673 | 0.363 | 0.547 | 0.0656 | 0.118 | 0.479 | 0.0587 | 0.319 | 0.218 | 6.9 | 721.1
YOLO11-DCN [50] | 0.0295 | 0.670 | 0.356 | 0.531 | 0.0481 | 0.108 | 0.593 | 0.151 | 0.311 | 0.211 | 7.4 | 780.2
YOLO11-DCN2 [51] | 0.0273 | 0.689 | 0.375 | 0.556 | 0.0512 | 0.109 | 0.582 | 0.149 | 0.327 | 0.209 | 7.5 | 770.8
YOLO11-DCN3 [52] | 0.0295 | 0.670 | 0.356 | 0.531 | 0.0481 | 0.108 | 0.593 | 0.151 | 0.311 | 0.211 | 7.6 | 776.0
YOLO11-DCN4 [53] | 0.0269 | 0.664 | 0.344 | 0.524 | 0.0648 | 0.09042 | 0.574 | 0.145 | 0.304 | 0.207 | 6.4 | 795.9
YOLO11-Biformer [54] | 0.0318 | 0.659 | 0.343 | 0.527 | 0.0551 | 0.0856 | 0.582 | 0.148 | 0.304 | 0.206 | 6.0 | 567.0
YOLO11-SK [55] | 0.0564 | 0.769 | 0.502 | 0.662 | 0.115 | 0.197 | 0.558 | 0.26 | 0.409 | 0.282 | 102.3 | 276.9
YOLO11-FasterNeXt [56] | 0.0311 | 0.677 | 0.37 | 0.535 | 0.0745 | 0.0821 | 0.59 | 0.156 | 0.314 | 0.215 | 6.4 | 575.3
YOLO11-CSCGhost [57] | 0.028 | 0.676 | 0.351 | 0.524 | 0.0671 | 0.106 | 0.57 | 0.147 | 0.309 | 0.21 | 6.3 | 627.8
YOLO11-RepHELAN [58] | 0.0324 | 0.691 | 0.38 | 0.566 | 0.0638 | 0.118 | 0.603 | 0.159 | 0.327 | 0.222 | 8.7 | 462.6
YOLO11-AIFI [59] | 0.0347 | 0.677 | 0.356 | 0.563 | 0.0728 | 0.0879 | 0.587 | 0.178 | 0.32 | 0.216 | 6.5 | 637.5
YOLO11-RCS-OSA [60] | 0.0278 | 0.672 | 0.361 | 0.536 | 0.0577 | 0.117 | 0.581 | 0.143 | 0.312 | 0.215 | 51.2 | 231.6
Faster R-CNN | 0.0248 | 0.621 | 0.336 | 0.501 | 0.0469 | 0.092 | 0.534 | 0.127 | 0.289 | 0.192 | - | -
FCOS [40] | 0.0215 | 0.598 | 0.314 | 0.482 | 0.0412 | 0.083 | 0.512 | 0.118 | 0.276 | 0.183 | - | -
DETR [61] | 0.0261 | 0.645 | 0.348 | 0.512 | 0.0493 | 0.098 | 0.547 | 0.131 | 0.297 | 0.185 | - | -
OUR | 0.0652 | 0.811 | 0.521 | 0.658 | 0.118 | 0.225 | 0.673 | 0.285 | 0.420 | 0.293 | 56.4 | 256.1
Note: Bold values indicate the best performance.
Table 2. Comparison of detection results with state-of-the-art models on VisDrone2019.
Methods | mAP50–90 | mAP50 | mAP75 | APvt | APt | APs | APm
Faster R-CNN | 25.8 | 47.6 | 26.9 | 6.8 | 15.9 | 26.8 | 40.3
FCOS | 14.6 | 27.5 | 14.0 | 0.3 | 2.7 | 9.2 | 21.5
EfficientRep [64] | 25.7 | 43.7 | 25.8 | 7.3 | 7.6 | 27.2 | 34.1
DETR | 26.1 | 45.1 | 27.6 | 6.5 | 15.2 | 25.6 | 38.9
OUR | 26.6 | 44.0 | 27.3 | 7.9 | 16.7 | 26.1 | 36.2
Note: Bold values indicate the best performance.
Table 3. Ablation Study on Synthetic-to-Real Data Ratio for Vehicle Detection. Per-category columns report AP50.
Synthetic:Real Ratio | N_train | BC | CAR | VAN | TR | TC | AT | BUS | MO | mAP50 | mAP50–90
0:1 (Real Only) | 856 | 0.0826 | 0.778 | 0.491 | 0.312 | 0.108 | 0.196 | 0.379 | 0.276 | 0.328 | 0.21
1:1 | 996 | 0.0362 | 0.791 | 0.462 | 0.708 | 0.115 | 0.136 | 0.759 | 0.235 | 0.405 | 0.286
1:2 | 1274 | 0.0719 | 0.807 | 0.51 | 0.65 | 0.154 | 0.173 | 0.68 | 0.302 | 0.419 | 0.291
2:1 | 747 | 0.0274 | 0.804 | 0.412 | 0.747 | 0.0343 | 0.153 | 0.797 | 0.186 | 0.395 | 0.288
2:3 | 1246 | 0.0602 | 0.809 | 0.521 | 0.669 | 0.14 | 0.144 | 0.677 | 0.232 | 0.407 | 0.284
3:2 | 829 | 0.0298 | 0.817 | 0.519 | 0.763 | 0.105 | 0.13 | 0.8 | 0.252 | 0.427 | 0.311
Note: Bold values indicate the best performance.
Table 4. Ablation study between DOMino-YOLO components on VOD-UAV. Per-category columns report AP50.
DCEM | VASA | CSIM-Head | CA | OAR-Loss | BC | CAR | VAN | TR | TC | AT | BUS | MO | mAP50 | mAP50–90 | Params (M) | FPS
 |  |  |  |  | 0.0123 | 0.682 | 0.394 | 0.658 | 0.0748 | 0.046 | 0.757 | 0.118 | 0.343 | 0.245 | 6.3 | 834.0
 |  |  |  |  | 0.0163 | 0.721 | 0.451 | 0.708 | 0.104 | 0.0933 | 0.773 | 0.137 | 0.375 | 0.27 | 9.3 | 553.9
 |  |  |  |  | 0.0137 | 0.706 | 0.425 | 0.689 | 0.0322 | 0.0737 | 0.762 | 0.142 | 0.355 | 0.251 | 12.5 | 714.6
 |  |  |  |  | 0.0129 | 0.69 | 0.417 | 0.681 | 0.0699 | 0.0799 | 0.776 | 0.132 | 0.357 | 0.255 | 7.6 | 875.7
 |  |  |  |  | 0.013 | 0.684 | 0.394 | 0.664 | 0.0419 | 0.137 | 0.751 | 0.112 | 0.35 | 0.251 | 6.3 | 638.7
 |  |  |  |  | 0.0141 | 0.705 | 0.411 | 0.675 | 0.0801 | 0.052 | 0.799 | 0.131 | 0.363 | 0.255 | 6.6 | 796.1
 |  |  |  |  | 0.0126 | 0.711 | 0.432 | 0.681 | 0.0699 | 0.0756 | 0.789 | 0.144 | 0.364 | 0.261 | 9.3 | 633.4
 |  |  |  |  | 0.0298 | 0.82 | 0.516 | 0.759 | 0.127 | 0.0645 | 0.807 | 0.255 | 0.422 | 0.308 | 51.2 | 257.6
 |  |  |  |  | 0.0387 | 0.816 | 0.507 | 0.764 | 0.0798 | 0.148 | 0.829 | 0.239 | 0.428 | 0.311 | 51.2 | 263.8
 |  |  |  |  | 0.0116 | 0.713 | 0.447 | 0.698 | 0.0534 | 0.123 | 0.768 | 0.132 | 0.368 | 0.262 | 13.8 | 584.5
 |  |  |  |  | 0.0389 | 0.815 | 0.521 | 0.755 | 0.0967 | 0.15 | 0.819 | 0.259 | 0.432 | 0.312 | 31.4 | 298.5
 |  |  |  |  | 0.0322 | 0.825 | 0.522 | 0.765 | 0.117 | 0.134 | 0.817 | 0.25 | 0.433 | 0.317 | 56.2 | 257.0
 |  |  |  |  | 0.0795 | 0.834 | 0.526 | 0.787 | 0.121 | 0.243 | 0.809 | 0.301 | 0.443 | 0.322 | 56.4 | 256.1
Note: Bold values indicate the best performance. A checkmark indicates that the module is enabled.
Table 5. Ablation study on loss weight coefficients on the VOD-UAV dataset.
λ_box | λ_rep | λ_cls | mAP50 | mAP50–90 | Recall (occ ≥ 3)
7.5 | 0.0 | 0.5 | 0.401 | 0.273 | 0.312
7.5 | 0.5 | 0.5 | 0.413 | 0.285 | 0.356
7.5 | 1.0 | 0.5 | 0.420 | 0.293 | 0.381
7.5 | 2.0 | 0.5 | 0.418 | 0.291 | 0.374
Note: Bold values indicate the best performance.
Table 6. Ablation of occlusion-weighted classification strategies.
Weight Mode | Weight Setting | mAP50 | Recall (occ ≥ 3)
None | w_i = 1 | 0.409 | 0.328
Linear | α = 1.0, β = 2.0 | 0.416 | 0.361
Custom | [1.0, 1.2, 1.5, 1.8, 2.0] | 0.420 | 0.381
Custom (strong) | [1.0, 1.3, 1.7, 2.1, 2.5] | 0.417 | 0.376
Note: Bold values indicate the best performance.
Table 7. Effect of λ_cls on detection performance.
λ_cls | mAP50 | mAP50–90 | Recall (occ ≥ 3)
0.25 | 0.414 | 0.287 | 0.352
0.50 | 0.420 | 0.293 | 0.381
1.00 | 0.418 | 0.291 | 0.369
Note: Bold values indicate the best performance.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
