Robust Occluded Object Detection in Multimodal Autonomous Driving: A Fusion-Aware Learning Framework

Li, Zhengqing; Singh, Baljit

doi:10.3390/electronics15010245

Open AccessArticle

Robust Occluded Object Detection in Multimodal Autonomous Driving: A Fusion-Aware Learning Framework

by

Zhengqing Li

^1,2,3

and

Baljit Singh

^2,*

¹

School of New Energy and Intelligent Networked Automobile, University of Sanya, Sanya 572022, China

²

Faculty of Mechanical Engineering, Universiti Teknologi MARA, Shah Alam 40450, Selangor, Malaysia

³

New Energy and Intelligent Vehicle Engineering Research Center of Hainan Province, Sanya 572022, China

^*

Author to whom correspondence should be addressed.

Electronics 2026, 15(1), 245; https://doi.org/10.3390/electronics15010245

Submission received: 5 November 2025 / Revised: 26 December 2025 / Accepted: 31 December 2025 / Published: 5 January 2026

(This article belongs to the Special Issue Celebrating the 70th Anniversary of Beijing University of Posts and Telecommunications—Computer Science and Engineering)

Download

Browse Figures

Versions Notes

Abstract

Reliable occluded object detection remains a persistent core challenge for autonomous driving perception systems, particularly in complex urban scenarios where targets are predominantly partially or fully obscured by static obstacles or dynamic agents. Conventional single-modality detectors often fail to capture adequate discriminative cues for robust recognition, while existing multimodal fusion strategies typically lack explicit occlusion modeling and effective feature completion mechanisms, ultimately degrading performance in safety-critical operating conditions. To address these limitations, we propose a novel Fusion-Aware Occlusion Detection (FAOD) framework that integrates explicit visibility reasoning with implicit cross-modal feature reconstruction. Specifically, FAOD leverages synchronized red–green–blue (RGB), light detection and ranging (LiDAR), and optional radar/infrared inputs, employs a visibility-aware attention mechanism to infer target occlusion states, and embeds a cross-modality completion module to reconstruct missing object features via complementary non-occluded modal information; it further incorporates an occlusion-aware data augmentation and annotation strategy to enhance model generalization across diverse occlusion patterns. Extensive evaluations on four benchmark datasets demonstrate that FAOD achieves state-of-the-art performance, including a +8.75% occlusion-level mean average precision (OL-mAP) improvement over existing methods on heavily occluded objects

O = 2

in the nuScenes dataset, while maintaining real-time efficiency. These findings confirm FAOD’s potential to advance reliable multimodal perception for next-generation autonomous driving systems in safety-critical environments.

Keywords:

multimodal autonomous driving; occluded object detection; cross-modal fusion; FAOD

1. Introduction

Occlusion has long been considered one of the most critical challenges in object detection for autonomous driving, especially in dense and dynamic urban environments [1]. Target objects may be partially obscured by static obstacles such as vehicles, poles, or roadside infrastructure; completely invisible due to severe full occlusion; or involved in complex interactive occlusions where multiple dynamic agents overlap. Such scenarios substantially deteriorate perception performance by distorting object appearances, blurring spatial boundaries, and introducing high levels of uncertainty in both localization and classification.

Recent advances in multimodal perception have opened promising avenues to mitigate these issues. By integrating complementary sensory modalities—RGB cameras, LiDAR sensors, and radar/infrared (IR) systems—perception systems can leverage distinct advantages: RGB imagery provides fine-grained texture and semantic cues; LiDAR offers precise 3D geometric measurements; and radar/IR sensors ensure resilience under adverse weather or low-illumination conditions [2]. This synergy allows for the construction of more comprehensive and robust object representations, particularly in scenarios characterized by severe occlusion or degraded visibility.

1.1. Motivation and Problem

Despite these benefits, existing multimodal fusion methods remain insufficient for reliable occlusion handling. First, while middle-fusion transformers such as BEVFormer can effectively align multimodal features in a shared bird’s-eye-view (BEV) space, they lack a dedicated mechanism to explicitly identify occluded regions and orchestrate targeted, directional information flow from unoccluded modalities for recovering missing object cues [3]. Second, feature alignment and cross-modal completion strategies are often inadequate, leading to fragmented or inconsistent representations across modalities, and recent BEV-space fusion frameworks such as BEVFusion primarily optimize geometric consistency and efficiency rather than explicit occlusion reasoning [4]. Third, most state-of-the-art fusion pipelines suffer from high computational complexity and limited scalability, hindering their deployment in real-time, safety-critical autonomous driving systems [1]. These limitations underscore the need for new frameworks that can explicitly reason about occlusion while efficiently exploiting multimodal complementarities.

We address these gaps by coupling (i) explicit visibility estimation, (ii) geometry-aware cross-modal attentive completion, and (iii) occlusion-adaptive fusion and calibration within a single trainable objective, designed to preserve efficiency for deployment.

1.2. Approach Overview

At a high level, FAOD introduces a visibility-guided multimodal detector. It first estimates multi-granular visibility cues, then uses geometry-aware cross-modal attention to complete features for occluded regions, and finally performs occlusion-adaptive fusion and calibrated post-processing. All components are trained end-to-end under a unified objective; the architectural details are provided in Section 6.

1.3. Contributions

In this work, we propose a novel framework termed Fusion-Aware Occlusion Detection (FAOD), which tightly integrates explicit occlusion modeling with implicit cross-modal feature reconstruction. The main contributions of this study are summarized as follows:

Explicit visibility reasoning for occlusion-aware BEV detection: We propose FAOD as a unified multimodal detection framework that explicitly models occlusion/visibility as learnable variables, including an instance-level occlusion classification and a region-level visibility map. These signals are supervised by occlusion-aware objectives and geometric consistency constraints, and are further used to guide downstream feature completion, fusion, and confidence scoring, rather than relying on implicit BEV aggregation.
Visibility-guided directed cross-modal attention (CMA) for alignment and feature completion: We design a geometry-aware CMA module that performs asymmetric, visibility-driven information transfer (donor → recipient): when a target modality is heavily occluded, complementary less-occluded modalities are selectively attended to reconstruct missing BEV features and align cross-modal representations. This goes beyond symmetric BEV fusion, enabling targeted restoration of occluded object regions.
Occlusion-aware dynamic fusion and score calibration at inference: FAOD couples visibility estimation with adaptive modality weighting and occlusion-aware post-processing. Fusion weights are adjusted conditioned on occlusion severity and modality reliability, while occlusion-aware Soft-NMS and confidence calibration mitigate false suppression of heavily occluded objects, improving detection stability under partial and complete occlusions.
Occlusion-oriented augmentation/labeling and comprehensive benchmarking with deployment considerations: To evaluate FAOD under controlled occlusion levels, we develop an occlusion-centric augmentation and labeling pipeline that explicitly accounts for different visibility regimes. Extensive experiments on four representative datasets (nuScenes, KITTI-MOD, DENSE, and JRDB) show consistent gains over strong baselines, and we additionally adopt streamlined fusion components to maintain practical efficiency for real-time safety-critical deployment.

2. Related Work

2.1. Single-Modality Occluded Detection

Under occlusion, purely image-based detectors rely on spatial context, part/instance completion, and occlusion-aware heads to stabilize recall in crowded scenes. CityPersons highlighted occlusion as a first-class challenge for pedestrian detection and catalyzed research on modeling visible parts and suppressing inter-instance interference [5]. Representative lines include explicit occlusion reasoning within detectors (e.g., Occlusion-Aware Region-based Convolutional Neural Network (R-CNN)) and repulsive/overlap-aware losses that discourage proposals from drifting toward neighboring instances in crowds [6,7]. Several single-modality image detectors have revisited occlusion from the perspective of feature design and loss shaping. For road-scene RGB detection, RE-YOLOv5 enhances the receptive field and introduces deformable feature extraction to better localize heavily occluded vehicles and pedestrians in dense traffic [8], while monocular 3D pipelines with thermodynamic-inspired object losses have reported improved accuracy on distant and partially occluded targets [9].

Meanwhile, LiDAR-based detectors benefit from accurate 3D geometry and can be comparatively robust to appearance changes, yet suffer at long range and under self-occlusion where returns become sparse. Milestones range from voxelized or pillarized backbones (SECOND, PointPillars) to modern center-based heads (CenterPoint) and hybrid point–voxel designs (PV-RCNN) [10,11,12,13]. Occlusion-aware formulations further leverage the notion of “measurable” geometry; for instance, WYSIWYG constrains learning to what is physically observable, decoupling prediction from unmeasurable, fully occluded regions [14]. Sparsity-aware backbones such as SPBA-Net emphasize robust aggregation under extremely sparse returns, and depth-guided monocular detectors like MonoDFNet refine regression for partially visible objects [15,16]. Despite these advances on both RGB and LiDAR fronts, these approaches remain fundamentally single-modality and cannot exploit complementary sensing when one modality is severely degraded, which motivates the multimodal, occlusion-aware design adopted in this work.

2.2. Multimodal Fusion Strategies

To overcome the limitations of single sensors, numerous multimodal fusion strategies have been proposed. Early fusion methods operate at the data level. PointPainting injects per-pixel semantics into points before 3D detection, yielding a simple pipeline but making the system sensitive to semantic noise and camera–LiDAR projection errors [17].

Feature-level or middle-fusion methods, MVX-Net couples image and LiDAR streams at the feature level; 3D-CVF performs cross-view projection/gating to align modalities in 3D/BEV; and more recent UVTR/UniTR families unify features in a shared BEV/voxel space and exchange information via attention, improving geometric consistency and efficiency [18,19,20,21,22]. Recent LiDAR–camera fusion architectures further refine BEV-space multimodal detection. Zhao et al. propose SimpleBEV, an improved LiDAR–camera fusion backbone that jointly reasons over LiDAR BEV and image features, achieving gains on KITTI and nuScenes compared with earlier middle-fusion designs [23]. MSAFusion introduces a multi-sensor adaptive BEV fusion scheme that adjusts modality contributions according to scene context and improves detection robustness in dense traffic [24]. SpaRC integrates sparse 4D imaging radar with cameras via spatially aligned fusion and attention, showing clear benefits under adverse weather and low-visibility conditions where radar complements RGB [25]. Samfusion employs a sensor-adaptive multimodal fusion mechanism that explicitly downweights degraded modalities under rain, fog, and low light [26].

Decision-level or late-fusion strategies, candidate-level fusion (e.g., CLOCs) combines per-modal proposals/scores with weaker sensitivity to alignment noise but provides limited capacity to recover fine-grained details in heavy occlusion [27]. TransFusion and BEVFormer aggregate multi-view/multimodal evidence with queries anchored in the object or BEV space, capturing long-range dependencies while preserving cross-view geometry, and have become a dominant paradigm for robust fusion [3,4,28]. However, these fusion pipelines still lack a dedicated, learnable visibility variable that explicitly drives cross-modal completion and occlusion-aware fusion, which is the focus of FAOD.

2.3. Occlusion Modeling

Explicit supervision of occlusion masks/levels or depth ordering improves reliability under partial and full occlusion by exposing the detector to visibility as a separate variable that gates features and calibrates scores [6,14,29].

In monocular 3D detection, modeling geometric/pose uncertainty in projection helps recover partially occluded instances (e.g., GUPNet), while cross-view attention in BEVFormer-style architectures maintains spatial coherence despite missing observations [3,30].

RadarOcc predicts a dense 3D occupancy grid from 4D imaging radar (optionally combined with LiDAR), explicitly modeling occluded free space and demonstrating robustness in fog and rain [31]. GS-Occ3D builds on the Occ3D benchmark by introducing visibility-aware labels and a geometry-guided network, enabling more accurate reconstruction of occupied, free, and occluded voxels from multi-view images [32]. LinkOcc extends occupancy prediction temporally by linking LiDAR–camera features across frames for 4D panoptic occupancy, which helps recover occluded regions in crowded scenes [33]. More recently, a BEV occlusion mitigation framework has been proposed to minimize the occlusion effect on multi-view camera perception by combining multi-sensor cues and layered depth reasoning [34]. While these works explicitly model visibility or occupancy, they typically do not couple visibility estimation directly with cross-modal completion and the detection head in a single objective, as FAOD does.

2.4. Feature Alignment and Completion

Middle-level fusion must handle parallax, time misalignment, and calibration errors. Cross-view mapping with gating (3D-CVF) and unification in BEV/voxel space (UVTR/UniTR/BEV-style fusion) reduce reprojection artifacts and enable stable geometry-aware interaction [3,18,20,21,22].

Methods such as EPNet and PointAugmenting inject image semantics into point features to increase separability in sparse regions; DeepInteraction and Cross-Modal Transformer variants deploy attention to pass information bidirectionally across modalities and to align latent spaces under occlusion/missing-data, improving robustness in

O = 1 / 2

regimes [35,36,37,38,39].

Beyond early BEVFusion-style alignment, several recent works focus on more precise cross-modal feature alignment and completion. High-performance sparse fusion frameworks project image features only at sparse LiDAR locations and refine them with attention in BEV, reducing computation while preserving alignment quality [40]. FusionFormer combines BEV-space LiDAR–camera fusion with a temporal transformer, enforcing temporal consistency of multi-sensory features across frames [41]. Wang et al. introduce a BEV fusion method based on mutual deformable attention and temporal aggregation, explicitly aligning LiDAR and camera features across depth and time [42]. These methods primarily target alignment quality and temporal smoothing, whereas FAOD ties cross-modal attention and completion directly to an explicit occlusion signal and learns them jointly with the detection losses.

2.5. Datasets and Metrics

KITTI provides occlusion/truncation tags and remains a foundational benchmark for 2D/3D detection; nuScenes offers a rich multi-sensor suite (6 cameras + LiDAR + radar) and a 10-class evaluation set; JRDB focuses on mobile robotics with 360° cameras and 3D LiDAR in crowded indoor/outdoor scenes; DENSE targets adverse weather and low-visibility conditions with additional sensing (e.g., radar/thermal) [43,44,45,46].

The standard mean average precision (mAP) and the nuScenes Detection Score (NDS) are widely adopted. For occlusion, occlusion-level mAP (OL-mAP)—computed over subsets partitioned by the visibility ratio or official occlusion flags—has emerged as a sensitive measure of robustness under

O = 1 / 2

conditions [44].

Overall, prior BEV and multimodal detectors either (a) lack explicit visibility modeling, (b) fuse modalities without reliability- or occlusion-aware weighting, (c) decouple completion from detection, or (d) rely on calibration-sensitive alignment without safeguards to misalignment. FAOD addresses these limitations by unifying visibility estimation, directed cross-modal completion with geometric bias, and occlusion-adaptive fusion/calibration within one objective.

3. Problem Formulation

3.1. Sensor Inputs and Metadata

The multimodal input includes synchronized RGB images, LiDAR point clouds, and IR maps, together with calibration parameters and temporal alignment and deskew pre-processing. The RGB image, denoted as

I_{RGB} \in R^{H \times W \times 3}

, is accompanied by intrinsic parameters

K

, extrinsic transform

T_{E \leftarrow Cam} \in SE (3)

, and a time stamp t. The LiDAR point cloud is represented as

P_{LiDAR} = {p_{i} = (x_{i}, y_{i}, z_{i}, [{int}_{i}, {ring}_{i}, t_{i}^{rel}])}_{i = 1}^{N}

. We keep the intensity

int

, the laser

ring

, and the relative sample time

t^{rel}

. Record

T_{E \leftarrow LiDAR}

, scan duration

Δ t

, and whether multi-sweep accumulation is used. The IR map is expressed as

T_{IR} \in R^{H^{'} \times W^{'} [\times C^{'}]}

. Its metadata comprise intrinsic parameters and resolution, extrinsics

T_{E \leftarrow IR}

, sampling rate, and beam model for mapping range–velocity.

For temporal alignment and deskewing, let E denote the unified world frame. We align all sensor timestamps to the LiDAR mid-scan time

t_{0}

and deskew LiDAR by continuous poses

T (t)

:

{\tilde{p}}_{i} = T (t_{0}) T {(t_{i})}^{- 1} p_{i} .

(1)

3.2. Frames, Projections, and Gridding

All modalities are geometrically aligned with a unified coordinate frame E or a common BEV representation for spatial consistency. For the camera projection (LiDAR → image), a 3D point

p

in the LiDAR frame is first transformed into the camera coordinate system as

X = T_{Cam \leftarrow LiDAR} [p; 1]

. The homogeneous pixel

\tilde{u} = K [R | t] X

is normalized to

u = (u, v) = (\tilde{u} / \tilde{w}, \tilde{v} / \tilde{w})

. For BEV mapping (points/voxels → BEV), the ground plane is discretized according to

(i, j) = (⌊ (x - x_{min}) / s ⌋, ⌊ (y - y_{min}) / s ⌋)

, and features are pooled/encoded along the vertical dimension z to obtain

F_{BEV}

.

3.3. Instance-Level Annotations and Visibility

Each object instance is associated with a semantic category label

c \in C

, where

C

denotes the predefined set of target classes (e.g., pedestrian, cyclist, passenger vehicle, large vehicle, traffic facility, etc.). The category label characterizes the semantic attributes of an object and serves as one of the fundamental prediction variables in multimodal detection tasks. Depending on the dataset configuration, the cardinality

| C |

can range from a small set of classes (e.g., three in KITTI) to a richer taxonomy (e.g., ten in nuScenes) and may be extended to support additional categories in more complex scenarios. To ensure cross-modality consistency, the category labels are defined and indexed in a unified manner across RGB images, LiDAR point clouds, and IR/radar annotations, enabling the detection model to align and share a common semantic space among heterogeneous sensors. Moreover, in order to evaluate robustness under occlusion, the category labels are further combined with visibility levels and bounding box annotations, which facilitates fine-grained performance analysis under varying occlusion conditions.

In addition to the semantic category label, each object is also described by a 3D bounding box

b = (x, y, z, w, h, l, θ)

with optional velocity

v

. The occlusion level is defined

O \in {0, 1, 2}

and unified via a visible ratio:

\begin{matrix} v_{img} & = \frac{# visible pixels}{# box pixels}, \end{matrix}

(2)

\begin{matrix} v_{pc} & = \frac{# valid in-box points}{# expected points}, \end{matrix}

(3)

\begin{matrix} v_{ratio} & = γ v_{img} + (1 - γ) v_{pc}, γ \in [0, 1], \end{matrix}

(4)

with thresholds

O = \{\begin{matrix} 0, & v_{ratio} \geq 0.75, \\ 1, & 0.25 \leq v_{ratio} < 0.75, \\ 2, & v_{ratio} < 0.25 . \end{matrix}

Here,

v_{img}

denotes the fraction of visible pixels within the 2D bounding box in the image plane,

v_{pc}

denotes the fraction of valid LiDAR points inside the 3D box relative to the expected number of points, and

v_{ratio}

is a weighted combination of the two, with

γ

controlling the relative contribution of image and point-cloud visibility. The thresholds above assign

O = 0

to mostly visible objects,

O = 1

to partially occluded objects, and

O = 2

to heavily or fully occluded ones.

The cutoffs at

0.75

and

0.25

follow common three-level occlusion protocols in driving benchmarks, where roughly three quarters of the object area being visible corresponds to “non-occluded”, and less than one quarter corresponds to “heavily occluded”. The intermediate band

[0.25, 0.75)

provides a sufficiently wide regime of partially occluded samples for learning while keeping the semantic interpretation of each level clear. Since

v_{ratio}

is a convex combination of

v_{img}

and

v_{pc}

, increasing

γ

shifts the occlusion decision towards image-based visibility, whereas decreasing

γ

emphasizes LiDAR-based visibility. The sensitivity of

v_{ratio}

to

γ

is bounded by

| v_{img} - v_{pc} |

; when the two modalities broadly agree, moderate changes of

γ

do not alter the assigned occlusion level O, and only strong disagreements lead to boundary cases.

Optional fine-grained labels include a pixel/BEV visibility mask

M_{vis} \in [0, 1]

, a depth-order/occlusion graph

G

(edges “occluder → occluded”), and sensor-availability flags.

3.4. Sample Organization and Occlusion-Aware Sampling

Under heavy occlusion, temporal windows with motion compensation are employed to increase visibility and maintain spatiotemporal continuity across frames.

To further address dataset imbalance, stratified sampling is applied to balance samples with different occlusion levels (

O = 0 / 1 / 2

or to oversample partially and fully occluded instances (

O \in {1, 2}

)), preventing domination by non-occluded samples.

For the assignment, both anchor-free and anchor-based strategies are adapted to handle occluded samples. In the anchor-free setting, a top-k dynamic assignment with center/distance priors is commonly used. For anchor-based methods, intersection over union (IoU) thresholds are relaxed for high-O samples, and additional center biases are introduced. Define a composite cost (anchor-free example):

C = λ_{cls} C_{cls} + λ_{box} C_{box} + λ_{occ} ϕ (O),

(5)

where

ϕ (O)

downweights penalties for highly occluded instances.

4. Task Objective and Subtasks

The learning objective follows the detection mapping introduced in Section 6 (Equation (9)), which transforms multimodal sensory inputs into instance-level predictions of category, 3D geometry, and occlusion state.

For readability, we describe the learning objective through three coupled subtasks rather than listing all loss terms in the main text: (i) visibility estimation that predicts instance-level occlusion and a region-level visibility cue, (ii) visibility-guided cross-modal completion that reconstructs missing BEV evidence using complementary modalities, and (iii) modality-aware detection that dynamically fuses completed features and decodes

(c, b)

with calibrated confidence. The full formulations of Subtask A/B/C (including all losses and constraints) are provided in Appendix A.

5. Overall Optimization Objective and Training Strategy

The visibility, completion, and fusion modules are trained jointly so the network learns not only to detect objects, but also to estimate occlusion and recover missing information in a coordinated manner. This section summarizes the global objective and the strategies used to emphasize low-visibility cases while keeping training stable.

5.1. Global Objective

We jointly optimize detection, multi-granular visibility estimation, and cross-modal completion using

L_{total} = L_{\det} + λ_{A} L_{A} + λ_{B} L_{B},

(6)

where

L_{\det}

is the detection objective (classification + box regression),

L_{A}

is the visibility-estimation objective (including occlusion classification and region-level visibility constraints), and

L_{B}

is the completion objective. The complete definitions of

L_{\det}

,

L_{A}

, and

L_{B}

and all constituent losses are given in Appendix A.

5.2. Strategy I: Occlusion-Aware Reweighting

This strategy upweights hard samples so partially and fully occluded instances contribute more strongly during training. Concretely, for difficult cases (

O \in {1, 2}

), we amplify the occlusion-related objectives and the completion consistency:

λ_{occ} (O) = λ_{occ}^{0} (1 + β_{occ} \cdot ⊮ [O \in {1, 2}]),

(7)

λ_{rec} (O) = λ_{rec}^{0} (1 + β_{rec} \cdot ⊮ [O = 2]),

(8)

with stronger amplification for

O = 2

to enforce completion consistency under full occlusion.

In addition, we optionally employ robust multi-task balancing (e.g., homoscedastic uncertainty weighting) and gradient normalization techniques (e.g., GradNorm/PCGrad) to prevent a single objective (typically classification) from dominating optimization. The explicit formulation used in our implementation is provided in Appendix B.

5.3. Strategy II: Spatiotemporal Consistency (Stable Completion with Multi-Frame Aggregation)

To stabilize completion across time, we enforce that point clouds and feature maps corresponding to the same physical object remain consistent across neighboring frames. This reduces flicker and overfitting to single-frame noise, which becomes noticeable when visibility is low.

Given a temporal window

T = {t - K, \dots, t + K}

and poses

T (\cdot)

, we incorporate (i) point-level consistency under ego-motion and (ii) feature-level consistency via geometric warping. The complete equations (including the Chamfer-like point loss and feature warping loss) are reported in Appendix B. In practice, we first converge a single-frame model, and then introduce the temporal consistency terms together with multi-frame aggregation.

5.4. Strategy III: Post-Processing (Occlusion-Aware NMS and Calibration)

Beyond the core network, we apply occlusion-aware post-processing to avoid suppressing hard occluded true positives and to calibrate confidence scores. The key idea is to soften suppression and adjust score calibration when a hypothesis is predicted as heavily occluded, because occlusion increases localization uncertainty and reduces IoU overlap.

We adopt occlusion-aware Soft-NMS and occlusion-conditioned temperature scaling (and, optionally, uncertainty-aware NMS). To keep the main text lightweight, the full post-processing equations are given in Appendix B.

5.5. Training Pipeline and Curriculum

We recommend a staged training pipeline. Training proceeds as follows: (i) train

L_{\det}

to stability; (ii) enable

L_{A}

for visibility estimation; and (iii) activate

L_{B}

together with temporal consistency (if used). The occlusion curriculum increases the synthetic occlusion strength

ρ

from

ρ_{0}

to

ρ_{max}

(linear/cosine schedule) while ramping up the completion weight. To improve robustness to missing sensors, we randomly drop modalities (guided by sensor-availability flags) so the completion and fusion modules generalize across sensor degradation. Unless otherwise stated, all weights and thresholds are treated as learnable or scheduled hyperparameters, supporting reproducibility and systematic ablations across datasets and sensor configurations.

6. Method

6.1. Overall Architecture

FAOD is an occlusion-robust multimodal detector that links modality-specific encoding → occlusion-aware representation → cross-modal attentive completion → multi-task detection in an end-to-end pipeline. To make the data flow easier to follow at first glance, Figure 1 provides a conceptual view of how visibility estimation, cross-modal completion, and occlusion-aware fusion operate together. Figure 2 then presents the detailed module design and training/inference signals.

Formally, given synchronized sensory inputs from RGB cameras

I_{RGB}

, LiDAR point clouds

P_{LiDAR}

, and optionally radar/IR maps

T_{IR}

, the objective is to learn a detection function:

f : (I_{RGB}, P_{LiDAR}, T_{IR}) \mapsto {(c, b, O)}

(9)

where

c \in C

denotes the semantic object category,

b = (x, y, z, w, h, l, θ)

defines the 3D bounding box that includes spatial position, dimensions, and orientation, and

O \in {0, 1, 2}

represents the occlusion state corresponding to no occlusion, partial occlusion, and full occlusion. FAOD comprises (i) modality-specific encoders for RGB, LiDAR, and IR/radar; (ii) an occlusion-aware feature extractor producing multi-granular visibility signals; (iii) CMA for selective fusion and completion; and (iv) a multi-task head that predicts

(c, b, O)

with occlusion-adaptive fusion and decoding (see Figure 2).

6.2. Feature Extraction Modules

With a ResNet/Swin backbone and an FPN, the image encoder produces multi-scale features

{F_{l}^{img}}_{l = 1}^{L}

, where

F_{l}^{img} \in R^{H_{l} \times W_{l} \times C_{l}}

. For alignment with BEV/point features, a perspective or learnable view transform is applied:

{\tilde{F}}_{l}^{img} = T_{view} (F_{l}^{img}; K, T_{E \leftarrow Cam}) .

(10)

For LiDAR, the voxel pathway (VoxelNet/SECOND) builds a tensor

V \in R^{D \times H \times W \times C}

and yields BEV features

F^{BEV} \in R^{H_{b} \times W_{b} \times C_{b}}

via 3D/2D convolutions. The point pathway (PointNet++) aggregates raw points

P = {p_{i}}

to

F^{pc} \in R^{N \times C_{p}}

, then pools to BEV with a voxel/grid operator

Γ (\cdot)

:

{\tilde{F}}^{pc} = Γ (P, F^{pc}) \in R^{H_{b} \times W_{b} \times {\tilde{C}}_{b}} .

(11)

For IR/Radar, a lightweight CNN/Transformer produces

F^{ir} \in R^{H^{'} \times W^{'} \times C^{'}}

; geometric calibration maps it to the unified view:

{\tilde{F}}^{ir} = Φ (F^{ir}; T_{E \leftarrow IR}) .

(12)

The aligned main-scale maps

F_{RGB}, F_{LiDAR}, F_{IR} \in R^{H_{b} \times W_{b} \times C}

are then used by subsequent modules in a common BEV/grid domain.

6.3. Occlusion-Aware Submodules

FAOD augments the backbone with auxiliary occlusion branches that provide explicit visibility cues for downstream completion and fusion. The goal of these submodules is to estimate, for each candidate and region in the scene, how strongly it is occluded, so that later stages can selectively trust or discount modality evidence.

At each candidate (instance or BEV grid), the occlusion branch outputs an instance probability

p_{O} = (p_{0}, p_{1}, p_{2})

for

O \in {0, 1, 2}

and a region-level visibility map

A_{vis} \in {[0, 1]}^{H_{b} \times W_{b}}

. Instance-level occlusion is trained with class-balanced cross-entropy or Focal loss, and the visibility map is supervised by BCE with total-variation (TV) regularization, as defined in Subtask A.

We obtain a semantic-guided visibility map by concatenating RGB and LiDAR features and projecting to a single channel:

A_{vis} = σ ({Conv}_{1 \times 1} ([F_{RGB} ∥ F_{LiDAR}])),

(13)

where

[\cdot ∥ \cdot]

denotes channel-wise concatenation.

Given a coarse fused map

F_{fusion}

, multi-head self-attention with positional encoding

P

and geometric bias

B_{geo}

is applied using the following geometry-aware attention operator:

Attn (Q, K, V) = softmax (\frac{Q K^{⊤} + B_{geo}}{\sqrt{d}}) V .

(14)

6.4. Cross-Modal Attention and Completion

Once visibility has been estimated, FAOD uses cross-modal attention to transfer information from less occluded “donor” modalities to more occluded “target” modalities. Intuitively, this module aims to complete or refine target features in regions where they are unreliable, by borrowing geometry-consistent evidence from other sensors.

Given a target modality

F_{a}

and a donor modality

F_{b}

, queries, keys, and values are obtained by linear projections:

Q = W_{Q} F_{a}, K = W_{K} F_{b}, V = W_{V} F_{b} .

(15)

The attended target features are then computed by the operator in Equation (14):

{\hat{F}}_{a} = Attn (Q, K, V) .

(16)

For occlusion-gated mixing, let

M_{occ} = 1 - A_{vis}

. A modality reliability score

r_{m} \in (0, 1)

(estimated from density/SNR/texture/motion blur) yields a donor weight

ω_{b} = \frac{exp {κ (1 - A_{vis}) r_{b}}}{\sum_{m \in {RGB, LiDAR, IR}} exp {κ (1 - A_{vis}) r_{m}}},

(17)

and the completed target features are updated by

F_{a}^{comp} = F_{a} + ω_{b} M_{occ} ⊙ ({\hat{F}}_{a} - F_{a}) .

(18)

6.5. Detection Head and Occlusion-Aware Fusion

The final detection stage converts the completed multimodal features into class, box, and occlusion predictions, while adaptively weighting each modality according to visibility and reliability. This head ties together the preceding modules and determines how much each sensor contributes to the final decision at each spatial location.

On the fused representation

F_{final}

, a multi-task head predicts class, box, and occlusion:

{p_{c}, \hat{b}, \hat{O}} = Head (F_{final}) .

(19)

The detection objective follows Subtask C, with

L_{\det} = λ_{cls} L_{cls} + λ_{box} L_{box} + λ_{occ} L_{occ}

; the formulations of

L_{cls}

,

L_{box}

(IoU/DIoU +

ℓ_{1}

with periodic angle), and

L_{occ}

are defined there and not repeated here.

Occlusion-aware dynamic fusion computes, at each location

x

, per-modality weights via a learnable gate

g_{m}

(e.g., a two-layer MLP over

[M_{occ} (x), r_{m} (x), GAP (F_{m})]

); the resulting logits are normalized to weights:

\begin{matrix} g_{m} (x) & = {MLP}_{m} ([M_{occ} (x), r_{m} (x), GAP (F_{m})]), \end{matrix}

(20)

\begin{matrix} α_{m} (x) & = \frac{exp {g_{m} (x)}}{\sum_{m^{'} \in {RGB, LiDAR, IR}} exp {g_{m^{'}} (x)}}, \end{matrix}

(21)

\begin{matrix} F_{final} (x) & = \sum_{m \in {RGB, LiDAR, IR}} α_{m} (x) F_{m}^{[comp]} (x) . \end{matrix}

(22)

Here,

M_{occ}

is the occlusion mask defined earlier,

r_{m} \in (0, 1)

denotes modality reliability, and

GAP (\cdot)

is global average pooling. Higher occlusion (larger

M_{occ}

) and higher reliability

r_{m}

increase

α_{m}

, prioritizing robust modalities (e.g., LiDAR/IR) when needed.

6.6. Training and Implementation Notes

All modalities are aligned to a unified BEV/grid; for multi-frame inputs, pose-based registration and LiDAR deskew are applied. Synthetic occlusion

T_{occ} (ρ)

is applied with strength

ρ

gradually increased, and

λ_{rec}

is ramped up in tandem to stabilize completion learning.

The overall objective follows the global formulation in Section 5 (Equation (6)). For samples with

O \in {1, 2}

,

λ_{occ}

and

λ_{rec}

are increased; homoscedastic uncertainty weights

{σ_{k}}

may be used for task balancing. During inference, occlusion-aware Soft-NMS (weaker suppression for

O = 2

), temperature scaling, and variable IoU thresholds are used to reduce misses and over-suppression under heavy occlusion.

The pipeline forms a causal loop from visibility estimation → cross-modal completion → dynamic fusion. The heatmap

A_{vis}

localizes occlusions, CMA performs geometry-consistent targeted recovery, and

α_{m}

assigns modality weights based on occlusion and reliability. The multi-task head jointly optimizes these components, yielding robust 3D detection under partial and full occlusion.

7. Experiments

7.1. Overall Performance and Stratified Analysis

Across four benchmarks, FAOD consistently outperforms multimodal and occlusion-specialized baselines. Averaged over three independent runs (distinct random seeds), the absolute improvements are typically

+ 3 %

to

+ 7 %

in mAP and

+ 2

to

+ 5

in NDS, with statistical significance verified by bootstrap testing

(p < 0.05)

.

To assess robustness under occlusion, results are partitioned by the unified visibility thresholds into

O \in {0, 1, 2}

(non-occluded, partially occluded, and severely/near-completely occluded) and reported as OL-mAP per subset. For non-occluded cases

(O = 0)

, FAOD matches or slightly exceeds strong baselines by

+ 0.5 %

to

+ 2 %

, indicating no loss of upper-bound accuracy. For partially occluded cases

(O = 1)

, gains of

+ 4 %

to

+ 9 %

are observed, attributable to visibility guidance and cross-modal completion that mitigate weak texture and sparse points. For severely/near-completely occluded cases

(O = 2)

, the gains are most pronounced

(+ 8 %

to

+ 15 %)

; recall increases more than precision, consistent with CMA and dynamic fusion recovering detectability under low-information conditions.

Category-wise and scale-wise analyses indicate especially notable improvements in small/distant classes (e.g., pedestrian, cyclist). Scale binning shows larger OL-mAP gains for small-to-medium objects, consistent with FAOD’s ability to compensate for sparse LiDAR returns and weak image cues. This trend is also aligned with typical road scenes, where small agents are the first to disappear behind occluders and the last to provide clean geometry.

The region-level visibility map

A_{vis}

attains higher IoU and lower cross-entropy against ground-truth

M_{vis}

than unsupervised baselines. The Spearman correlation between

A_{vis}

and classification confidence is also higher, supporting its use in score calibration. In practice, this correlation matters more for

(O = 1, 2)

: the score needs to reflect “how much evidence is really there”, otherwise the post-processing step tends to discard the hard positives.

On DENSE nighttime/low-light/rain–snow subsets, FAOD’s advantage widens (OL-mAP gains of about

+ 6 %

at

O = 1

and

+ 12 %

at

O = 2

). On JRDB’s crowded indoor scenes with long occlusion chains, FAOD maintains stable recall. These are the cases where symmetric fusion is easily confused by missing or noisy cues; the visibility-gated completion and occlusion-aware calibration are simply more forgiving, and the improvements show up consistently in the occlusion-stratified metrics.

7.2. Baseline Comparison and Component Contributions

Compared with multimodal baselines (e.g., PointPainting, MVX-Net, UVTR), PointPainting is vulnerable to noisy image semantics, particularly at

O = 2

; FAOD suppresses unreliable channels via

A_{vis}

and

r_{m}

, reducing false positives. MVX-Net/UVTR degrade under alignment errors or missing modalities; FAOD’s geometric bias

B_{geo}

and gated fusion

α_{m}

show greater robustness.

Versus occlusion-specialized baselines (e.g., GUPNet, ORN, DetZero), single-modality/view occlusion reasoning is limited under complete occlusion; FAOD’s cross-modal completion transfers information directionally (donor→recipient), reconstructing features for invisible recipients. In dense crowds, ORN/DetZero’s reliance on ordering/logic graphs is less robust to annotation noise; FAOD with

L_{rank}

yields smoother behavior.

Ablations show consistent trends. Removing the occlusion branch

(L_{occ} + L_{vis})

notably degrades OL-mAP at

O = 2

and weakens

A_{vis}

, limiting CMA completion. Removing the geometric bias

B_{geo}

hurts more in high-parallax camera–radar/camera–LiDAR settings. Removing reliability gating

r_{m}

increases mis-fusion in low-light/sparse segments and reduces the variance of

α_{m}

. Disabling consistency/contrastive losses

(L_{rec}, L_{align})

during the occlusion curriculum leads to over-completion or local overfitting with larger per-subset variance. Temporal consistency (optional) further improves

O = 2

recall at modest latency cost.

7.3. Performance Analysis: Efficiency, Resources, and Deployability

We use three model scales—FAOD-S (small), FAOD-M (medium), and FAOD-L (large). Unless otherwise stated, the latency breakdown reports the large scale (FAOD-L). Under a common protocol (nuScenes, single GPU, FP16, batch = 1), we report latency and key resource metrics in Table 1 and Table 2. CMA and the image backbone dominate compute; reducing image resolution/backbone width and triggering CMA sparsely (e.g., ROI-based) provide the largest speedups.

Speed–accuracy trade-offs are shown in Table 3. FAOD-M reduces latency by ≈39% vs. FAOD-L while losing ≈2.8 pts on

O = 2

, suitable for online use; FAOD-L favors offline high-accuracy.

In this context, FAOD-S can be regarded as the lightweight variant targeting resource-constrained or embedded deployments. Compared with FAOD-L, it reduces latency from

94.7

ms to

41.0

ms (about a

57 %

decrease) and peak memory (see Table 2) at the cost of several points in mAP and OL-mAP

(O = 2)

. Such a trade-off is acceptable for many automotive ECUs where on-board compute and memory are limited. On automotive-grade SoCs, additional gains are expected from TensorRT/ONNX engines, mixed precision, and moderate backbone width scaling; a full evaluation of FAOD-S on embedded hardware is left for future work.

Engine-level optimizations reduce memory and improve throughput (Table 4); e.g., TensorRT yields 20–

25 %

throughput gains.

Calibration and post-processing analyses on nuScenes val are given in Table 5. Temperature scaling improves calibration (ECE/Brier), and occlusion-aware Soft-NMS further improves detection under heavy occlusion (

O = 2

; higher OL-mAP) together with overall mAP.

Robustness to modality dropout at inference is summarized in Table 6. LiDAR is critical under strong occlusion; RGB/IR remain complementary in low light and sparse-point regimes.

Efficiency impacts of key components are shown in Table 7. CMA yields the largest accuracy gains with moderate cost;

B_{geo}

and

r_{m}

are highly cost-effective for high-occlusion accuracy.

FAOD delivers statistically significant gains on aggregate and occlusion-stratified metrics across four benchmarks, with the largest improvements at

O = 2

due to cross-modal completion and adaptive gating. Efficiency-wise, FAOD traces a clear Pareto frontier via image/BEV resolution and sparse attention, enabling both offline and online deployments. Interpretability (

A_{vis}

,

α_{m}

maps) and better calibration (temperature scaling) support practical deployment and safety analyses.

7.4. Occlusion-Stratified Results on nuScenes

To rigorously assess robustness under varying degrees of occlusion, the nuScenes validation set is stratified into three visibility tiers—non-occluded

(O = 0)

, partially occluded

(O = 1)

, and heavily occluded

(O = 2)

—and OL-mAP is reported for each tier. Overall, FAOD-L attains the best or tied-best performance across all tiers and exhibits a smaller degradation as occlusion increases than both multimodal and occlusion-specialized baselines (Figure 3).

For

(O = 0)

, FAOD-L achieves an OL-mAP of

71.0

, exceeding the mean of four baselines (69.25) by

+ 1.75

percentage points (pp) and outperforming the best baseline (70.0) by

+ 1.0

pp. This suggests that introducing explicit visibility reasoning does not come at the cost of peak accuracy: when observations are clean, the model largely behaves like a strong BEV fusion detector rather than “over-correcting” what is already reliable. For

(O = 1)

, FAOD-L reaches

63.0

, improving over the baseline mean (56.5) by

+ 6.5

pp and over the strongest baseline (58.0) by

+ 5.0

pp. In many nuScenes scenes, partial occlusion is the more common and also the more confusing case: one modality may still carry a usable fragment (e.g., a contour in RGB), while another becomes sparse or locally corrupted (e.g., missing returns in LiDAR). The visibility heatmap helps here by damping unreliable regions and letting the fusion focus on the parts that are still trustworthy; CMA then supplies complementary cues where the target stream is weak, instead of mixing all modalities symmetrically in BEV. For

(O = 2)

, FAOD-L attains

53.0

, surpassing the baseline mean (44.25) by

+ 8.75

pp and the strongest baseline (46.0) by

+ 7.0

pp. The improvement is strongest under

(O = 2)

and is driven mainly by recall: in these cases, the detector often needs to work with very limited evidence (a few points, a small edge fragment, or intermittent responses). Visibility-gated CMA, together with reliability weighting, reconstructs discriminative features only where information is genuinely missing, making the remaining cues usable without spreading artifacts across the scene. This also makes post-processing less brittle, because a hard true positive under severe occlusion may not achieve the “nice” overlap pattern that standard suppression heuristics assume.

The tiered results also hint at what kind of situations in nuScenes FAOD benefits from. Under

(O = 1)

, the gain tends to come from cases that are partially blocked but still geometrically consistent—for instance, an agent visible in one stream while partially missing in another due to occluders or viewpoint. The directed (donor → recipient) completion is especially useful in this regime: it transfers information from the less-occluded donor stream to the occluded target stream, which is a different behavior from symmetric BEV aggregation. Under

(O = 2)

, detections are closer to the decision boundary. Here, the visibility gating prevents the completion module from “guessing everywhere”, and the occlusion-aware calibration/NMS helps avoid over-suppressing these low-IoU, low-confidence but correct hypotheses. In short,

(O = 1)

benefits more from selective restoration, while

(O = 2)

benefits from both restoration and a more forgiving confidence/suppression policy.

Figure 4, Figure 5 and Figure 6 provide qualitative comparisons between the baseline fusion model (BEVFormer) and FAOD under heavy occlusion. For each scene, the top image shows the baseline result, while the bottom image shows FAOD. In scenarios where target objects are largely invisible in RGB and only sparsely observed in LiDAR, the baseline often fails to form meaningful responses, leading to missed detections or fragmented hypotheses. In contrast, FAOD produces more coherent BEV activations and more stable object predictions. The visibility cues highlight occluded regions, while cross-modal attention selectively transfers complementary geometric information from less-occluded modalities, resulting in more complete object representations.

Degradation with occlusion is quantified by

Δ_{occ} = OL-mAP (O = 0) - OL-mAP (O = 2)

. PointPainting: 26 pp; MVX-Net: 25 pp; UVTR: 26 pp; DetZero: 23 pp; FAOD-L: 18 pp. Relative to the best baseline(DetZero, 23 pp), FAOD-L reduces the penalty by 5 pp (relative reduction

\approx 21.7 %

), yielding a flatter performance–occlusion curve and stronger cross-tier consistency.

8. Discussion

8.1. Generalization Capability

The proposed FAOD framework demonstrates strong generalization ability when deployed in previously unseen environments and across novel object categories. By leveraging multimodal feature representations and explicit occlusion reasoning, the model is less reliant on dataset-specific appearance patterns, thereby enhancing robustness in diverse urban scenarios. Experimental results across four heterogeneous benchmarks confirm that FAOD can effectively adapt to varying sensor configurations and scene geometries without significant performance degradation.

8.2. Robustness to Occlusion Types

A key strength of our approach lies in its robustness against different types of occlusion. In addition to handling static and partial occlusions, FAOD exhibits stable performance in highly dynamic conditions where occlusions are caused by moving vehicles, pedestrians, or other agents. The explicit visibility reasoning module enables reliable estimation of occlusion levels, while the cross-modal feature completion mechanism recovers object representations even when large portions are visually obscured.

8.3. Computational Efficiency

Practical deployment in autonomous driving requires a balance between accuracy and efficiency. In this work, all runtime and resource measurements are obtained on a single NVIDIA RTX 3090 GPU under the evaluation protocol described in the Experiments section (FP16, batch

= 1

, with image resolution and LiDAR sweeps as specified for each FAOD-S/M/L configuration). The reference implementation has a compact model size of about 110 MB of learnable parameters, which fits comfortably within the memory budgets of current GPU and automotive SoC platforms.

Under this protocol, the three model scales trace a clear accuracy–latency frontier: FAOD-L targets offline or high-compute settings, FAOD-M offers a favorable trade-off between accuracy and speed, and FAOD-S is explicitly designed as a lightweight variant for resource-constrained or embedded deployments, using lower image resolution, fewer LiDAR sweeps, and narrower backbones while preserving most of the occlusion-stratified gains. Engine-level optimizations such as TensorRT/ONNX conversion, mixed-precision execution, operator fusion, and sparsified (e.g., ROI-triggered) attention further reduce latency and memory footprint. A detailed quantitative evaluation on specific automotive-grade embedded hardware is left for future work.

8.4. Limitations and Future Directions

Despite its effectiveness, the current framework has several limitations. First, the present FAOD implementation assumes fixed and precise extrinsic calibration between cameras, LiDAR, and IR/radar. The geometric bias term

B_{geo}

and the BEV projections are computed directly from these calibration parameters. In practice, LiDAR misalignment (e.g., due to mechanical tolerances, thermal drift, or mounting vibrations) can distort cross-modal attention and reliability gating, and we do not yet explicitly model or correct such effects. Future variants could incorporate calibration-robust feature encodings, online refinement of extrinsics, or uncertainty-aware fusion that downweights modalities suspected to be misaligned.

FAOD is evaluated in a single-frame setting and does not yet include explicit temporal modeling. Without temporal aggregation, the method cannot fully exploit motion cues and cross-frame visibility to stabilize occlusion estimates or recover objects that are only intermittently visible under heavy occlusion. A natural extension is to aggregate BEV features over short, pose-compensated temporal windows and apply a lightweight temporal attention module on top of the existing BEV representation, together with temporal consistency losses to regularize completion in heavily occluded scenes.

The current study focuses on passive sensing with a fixed sensor layout. We do not consider active strategies such as view planning, adaptive sensor scheduling, or dynamic exposure control, which may further mitigate severe occlusion and adverse-weather degradation. Exploring these directions, together with temporal reasoning and calibration-aware fusion, is left for future work.

9. Conclusions

In this work, we proposed FAOD, a novel Fusion-Aware Occlusion Detection framework designed to address the persistent challenge of object detection under occlusion in autonomous driving systems. By integrating explicit visibility reasoning with implicit cross-modal feature completion, FAOD is capable of reconstructing object representations even in highly cluttered and visually degraded scenarios. A central innovation of our approach lies in the attention-guided multimodal fusion mechanism, which dynamically aligns heterogeneous features from RGB, LiDAR, and infrared/radar modalities to maximize complementary strengths while mitigating occlusion-induced information loss.

Extensive experiments on four representative autonomous driving benchmarks demonstrate that FAOD achieves state-of-the-art performance across a wide range of occlusion conditions, including partial and full occlusions, static and dynamic obstacles, and diverse sensor configurations. Notably, the framework maintains both high accuracy and computational efficiency, reaching real-time inference rates with a compact model size, which highlights its potential for practical deployment in safety-critical driving environments.

Beyond empirical performance, FAOD contributes a methodological foundation that can generalize to multimodal perception research. Its explicit occlusion modeling, modality-aware feature reconstruction, and attention-driven alignment are not confined to detection; they could also support occlusion-aware tracking, improve the reliability of motion forecasting, and refine occupancy prediction, in addition to aiding cooperative multi-agent perception. In each of these tasks, the same principle applies: reasoning about which signals are missing and selectively completing them with information from other modalities can make the system more robust. More broadly, FAOD exemplifies a practical paradigm for dealing with incomplete multimodal data, offering a transferable approach that extends beyond autonomous driving and remains relevant wherever sensor degradation or partial observability pose challenges.

Looking ahead, future research directions include incorporating temporal reasoning to leverage motion dynamics across video sequences, as well as exploring active perception strategies that adapt sensor utilization to occlusion severity. By advancing towards these goals, FAOD can serve as a stepping stone for the development of next-generation robust, reliable, and intelligent perception systems for autonomous driving and broader real-world applications.

Author Contributions

Conceptualization, Z.L. and B.S.; methodology, Z.L.; software, Z.L.; validation, Z.L. and B.S.; formal analysis, Z.L.; investigation, Z.L.; resources, B.S.; data curation, Z.L.; writing—original draft preparation, Z.L.; writing—review and editing, B.S.; visualization, Z.L.; supervision, B.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All data in this study come from relevant open-source datasets in the field of autonomous driving.

Acknowledgments

The authors wants to thanks the anonymous referees for the constructive request.

Conflicts of Interest

The authors declares no conflicts of interest.

Appendix A. Full Task Objective and Loss Formulations

Appendix A.1. Subtask A: Occlusion Classification (Visibility Estimation)

This subtask predicts the instance-level occlusion label O and a region-level visibility heatmap

A_{vis} \in {[0, 1]}^{Ω}

(

Ω

: pixels or BEV grids) for downstream gating and completion. The input features include multi-scale image features

{F_{l}^{img}}

, BEV/voxel features

F^{pc}

, IR/Radar features

F^{ir}

, and geometric statistics

ψ

(point density, view angle, and count of neighboring occluders).

The instance-level occlusion loss adopts a class-balanced cross-entropy or Focal loss:

L_{occ} = - \sum_{k \in {0, 1, 2}} α_{k} {(1 - p_{k})}^{γ} ⊮ [O = k] log p_{k} .

(A1)

For region-level visibility prediction, a binary cross-entropy term with total variation (TV) regularization is used:

L_{vis} = BCE (A_{vis}, M_{vis}) + λ_{tv} \sum_{x \in Ω} {∥ \nabla A_{vis} (x) ∥}_{1} .

(A2)

Depth-order consistency is enforced using an occlusion graph

G

:

L_{rank} = \sum_{u \to v} [max (0, d (v) - d (u) + δ) + β max (0, \bar{A_{vis} (v)} - \bar{A_{vis} (u)} + ϵ)],

(A3)

where

d (\cdot)

is mean depth and

\bar{A_{vis} (\cdot)}

is instance-wise average visibility.

Cross-modal visibility consistency is encouraged through an

L_{1}

projection term:

L_{cons} = ∥ A_{vis}^{img} - Π (A_{vis}^{pc}) ∥_{1},

(A4)

where

Π (\cdot)

is the projection operator from the point cloud to the image plane.

The Subtask-A objective combines all components as

L_{A} = λ_{occ} L_{occ} + λ_{vis} L_{vis} + λ_{rank} L_{rank} + λ_{cons} L_{cons} .

(A5)

Appendix A.2. Subtask B: Feature Restoration (Cross-Modal Completion)

Within occluded regions indicated by

A_{vis}

, missing semantics/geometry are reconstructed using complementary modalities.

For modality reliability, let

m \in {RGB, LiDAR, IR}

. We compute statistics

ϕ_{m}

(texture strength, SNR, point density, motion blur) and

r_{m} = σ (w_{m}^{⊤} ϕ_{m} + b_{m}) \in (0, 1) .

(A6)

We then form donor weights:

ω_{m} = \frac{exp {κ (1 - A_{vis}) r_{m}}}{\sum_{m^{'}} exp {κ (1 - A_{vis}) r_{m^{'}}}},

(A7)

where

κ > 0

is a temperature and

r_{m} \in (0, 1)

are modality reliability scalars.

For CMA completion, given target

F_{a}

(Query) and donor

F_{b}

(Key/Value), after geometric alignment (projection or learnable deformation

Δ x

):

\begin{matrix} A_{a b} & = softmax (\frac{Q (F_{a}) K {(F_{b})}^{⊤} + B_{geo}}{\sqrt{d}}), \end{matrix}

(A8)

\begin{matrix} {\hat{F}}_{a} & = A_{a b} V (F_{b}), \end{matrix}

(A9)

with geometric bias

B_{geo}

. Occlusion-gated mixing is

F_{a}^{comp} = F_{a} + M_{occ} ⊙ ({\hat{F}}_{a} - F_{a}), M_{occ} = 1 - A_{vis} .

(A10)

With synthetic occlusion

T_{occ}

(Cutout/Copy-Paste, ray-drop), consistency and teacher supervision are enforced: the teacher on clean samples provides

y^{*} = (p^{*}, b^{*})

, while the student on occluded samples outputs

\hat{y} = (\hat{p}, \hat{b})

, with

L_{rec} = KL (\hat{p} ∥ p^{*}) + λ_{box} {∥ \hat{b} - b^{*} ∥}_{1} .

(A11)

We additionally use an instance-wise InfoNCE:

L_{align} = - log \frac{exp (〈 z_{a}, z_{b} 〉 / τ)}{\sum_{b^{'}} exp (〈 z_{a}, z_{b^{'}} 〉 / τ)} .

(A12)

Boundary smoothness is enforced as

L_{smooth} = \sum {∥ \nabla F_{a}^{comp} ∥}_{1} .

(A13)

The Subtask-B objective is

L_{B} = λ_{rec} L_{rec} + λ_{align} L_{align} + λ_{smooth} L_{smooth} .

(A14)

Appendix A.3. Subtask C: Modality-Aware Detection (Dynamic Fusion and Decoding)

Dynamic fusion weights modality features by visibility-guided reliability:

\begin{matrix} α_{m} & = \frac{exp {η 〈 (1 - A_{vis}), r_{m} 〉}}{\sum_{m^{'}} exp {η 〈 (1 - A_{vis}), r_{m^{'}} 〉}}, \end{matrix}

(A15)

\begin{matrix} F_{fused} & = \sum_{m} α_{m} F_{m}^{[comp]}, \end{matrix}

(A16)

emphasizing robust modalities (LiDAR/IR) under heavy occlusion.

For decoding, the detector predicts category c and box

b

, while the occlusion state O from Subtask A is used for occlusion-aware fusion and post-processing. Classification uses Focal or class-balanced cross-entropy:

L_{cls} = - \sum_{c \in C} α_{c} {(1 - p_{c})}^{γ} ⊮ [c = c^{*}] log p_{c} .

(A17)

Regression uses IoU/distance-IoU (DIoU) +

ℓ_{1}

with a periodic angle loss:

\begin{matrix} L_{box} & = L_{IoU} (b, b^{*}) + μ ∥ Δ d ∥_{1} + μ_{θ} ℓ_{angle} (θ, θ^{*}), \\ ℓ_{angle} & = | sin (θ - θ^{*}) | + | cos (θ - θ^{*}) | . \end{matrix}

(A18)

The detection objective is

L_{\det} = λ_{cls} L_{cls} + λ_{box} L_{box} .

(A19)

Appendix B. Additional Training and Post-Processing Details

This appendix provides the explicit formulations used for optional robust multi-task balancing, temporal consistency, and occlusion-aware post-processing.

Appendix B.1. Robust Multi-Task Balancing (Optional)

We employ homoscedastic uncertainty weighting to balance task gradients:

L_{total} = \sum_{k} \frac{1}{2 σ_{k}^{2}} L_{k} + log σ_{k}, k \in {\det, A, B},

(A20)

where

σ_{k}

are learnable scalars. In practice, this complements the occlusion-level amplification in Equations (7) and (8). We also use GradNorm/PCGrad to equalize per-task gradient norms within a mini-batch, preventing domination (e.g., by

L_{cls}

).

Appendix B.2. Spatiotemporal Consistency (Optional)

Given poses

T (\cdot)

, a source point

p_{s}

at time s is mapped to the reference frame t by

{\tilde{p}}_{s \to t} = T (t) T {(s)}^{- 1} p_{s} .

(A21)

Within a target box region

S

, we use a Chamfer-like loss:

L_{temp-pc} = \sum_{p \in S_{t}} min_{q \in {\tilde{S}}_{s \to t}} {∥ p - q ∥}_{1} + \sum_{q \in {\tilde{S}}_{s \to t}} min_{p \in S_{t}} {∥ q - p ∥}_{1},

(A22)

where

S_{t}

denotes the set of points inside the target box at frame t;

S_{s}

is the in-box set at frame s;

{\tilde{S}}_{s \to t} = {T (t) T {(s)}^{- 1} p_{s} ∣ p_{s} \in S_{s}}

; and

{∥ \cdot ∥}_{1}

denotes the

ℓ_{1}

norm.

For feature-level consistency (image/BEV), let

W_{s \to t} (\cdot)

be a cross-frame warp (by geometry/pose):

L_{temp-feat} = \sum_{l} ∥ F_{t}^{(l)} - W_{s \to t} (F_{s}^{(l)}) ∥_{1} .

(A23)

This term can be merged into

L_{B}

with weight

λ_{temp}

.

Appendix B.3. Occlusion-Aware Post-Processing (Optional)

In occlusion-aware Soft-NMS, given candidate i and another

j \neq i

, we apply Gaussian score decay with an occlusion-adaptive width:

\begin{matrix} s_{j}^{'} & = s_{j} \cdot exp (- \frac{IoU {(b_{i}, b_{j})}^{2}}{σ^{2} (O_{j})}), \\ σ (O) & = σ_{0} + Δ_{σ} ⊮ [O = 2] . \end{matrix}

(A24)

Here,

σ (O)

enlarges the decay width for fully occluded hypotheses and thus softens suppression.

For occlusion-aware thresholds and temperature scaling, classification logits are rescaled by an occlusion-dependent temperature:

s^{cal} = sigm (z / τ_{cal} (O)), τ_{cal} (O) = τ_{0} + Δ τ \cdot ⊮ [O = 2],

(A25)

where

sigm (x) = 1 / (1 + e^{- x})

and z is the classification logit. The NMS IoU threshold is lowered under full occlusion:

τ_{nms} (O) = τ_{0} - Δ_{nms} \cdot ⊮ [O = 2] .

(A26)

Ties can be broken in favor of higher-O boxes.

For uncertainty-aware NMS, if the regressor outputs a covariance

Σ

, we use the Mahalanobis distance between centers

c_{i}, c_{j}

:

D_{M} (b_{i}, b_{j}) = \sqrt{{(c_{i} - c_{j})}^{⊤} Σ^{- 1} (c_{i} - c_{j})},

(A27)

and threshold or decay by

D_{M}

;

Σ

can be enlarged for higher O to reflect occlusion-induced uncertainty.

References

Feng, D.; Haase-Schütz, C.; Rosenbaum, L.; Hertlein, H.; Gläeser, C.; Timm, F.; Wiesbeck, W.; Dietmayer, K. Deep multi-modal object detection and semantic segmentation for autonomous driving: Datasets, methods, and challenges. IEEE Trans. Intell. Transp. Syst. 2021, 22, 1341–1360. [Google Scholar] [CrossRef]
Li, Y.; Ibanez-Guzman, J. Lidar for autonomous driving: The principles, challenges, and trends for automotive lidar and perception systems. IEEE Signal Process. Mag. 2020, 37, 50–61. [Google Scholar] [CrossRef]
Li, Z.; Wang, W.; Li, H.; Xie, E.; Sima, C.; Lu, T.; Dai, J. BEVFormer: Learning bird’s-eye-view representation from LiDAR-Camera via spatiotemporal transformers. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 47, 2020–2036. [Google Scholar] [CrossRef] [PubMed]
Liu, Z.; Tang, H.; Amini, A.; Yang, X.; Mao, H.; Rus, D.L.; Han, S. BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird’s-Eye View Representation. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; pp. 2774–2781. [Google Scholar] [CrossRef]
Zhang, S.; Benenson, R.; Schiele, B. Citypersons: A diverse dataset for pedestrian detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3213–3221. [Google Scholar]
Zhang, S.; Wen, L.; Bian, X.; Lei, Z.; Li, S.Z. Occlusion-aware R-CNN: Detecting pedestrians in a crowd. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 637–653. [Google Scholar]
Wang, X.; Xiao, T.; Jiang, Y.; Shao, S.; Sun, J.; Shen, C. Repulsion loss: Detecting pedestrians in a crowd. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 7774–7783. [Google Scholar]
Li, T.; Xiong, X.; Zhang, Y.; Fan, X.; Zhang, Y.; Huang, H.; Hu, D.; He, M.; Liu, Z. RE-YOLOv5: Enhancing Occluded Road Object Detection via Visual Receptive Field Improvements. Sensors 2025, 25, 2518. [Google Scholar] [CrossRef] [PubMed]
Liu, G.; Xie, X.; Yu, Q. Monocular 3D object detection with thermodynamic loss and decoupled instance depth. Connect. Sci. 2024, 36, 2316022. [Google Scholar] [CrossRef]
Yan, Y.; Mao, Y.; Li, B. SECOND: Sparsely embedded convolutional detection. Sensors 2018, 18, 3337. [Google Scholar] [CrossRef] [PubMed]
Lang, A.H.; Vora, S.; Caesar, H.; Zhou, L.; Yang, J.; Beijbom, O. PointPillars: Fast Encoders for Object Detection From Point Clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 12697–12705. [Google Scholar]
Yin, T.; Zhou, X.; Krahenbuhl, P. Center-based 3D object detection and tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual Event, 19–25 June 2021; pp. 11784–11793. [Google Scholar]
Shi, S.; Guo, C.; Jiang, L.; Wang, Z.; Shi, J.; Wang, X.; Li, H. PV-RCNN: Point-voxel feature set abstraction for 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual Event, 14–19 June 2020; pp. 10529–10538. [Google Scholar]
Hu, P.; Ziglar, J.; Held, D.; Ramanan, D. What you see is what you get: Exploiting visibility for 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual Event, 14–19 June 2020; pp. 11001–11009. [Google Scholar]
Sha, H.; Gao, Q.; Zeng, H.; Li, K.; Li, W.; Zhang, X.; Wang, X. SPBA-Net point cloud object detection with sparse attention and box aligning. Sci. Rep. 2024, 14, 28420. [Google Scholar] [CrossRef] [PubMed]
Gao, Y.; Wang, P.; Li, X.; Sun, M.; Di, R.; Li, L.; Hong, W. MonoDFNet: Monocular 3D Object Detection with Depth Fusion and Adaptive Optimization. Sensors 2025, 25, 760. [Google Scholar] [CrossRef] [PubMed]
Vora, S.; Lang, A.H.; Helou, B.; Beijbom, O. PointPainting: Sequential fusion for 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual Event, 14–19 June 2020; pp. 4604–4612. [Google Scholar]
Sindagi, V.A.; Zhou, Y.; Tuzel, O. MVX-Net: Multimodal VoxelNet for 3D object detection. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 7276–7282. [Google Scholar]
Ku, J.; Mozifian, M.; Lee, J.; Harakeh, A.; Waslander, S.L. Joint 3D proposal generation and object detection from view aggregation. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 1–8. [Google Scholar]
Yoo, J.H.; Kim, Y.; Kim, J.; Choi, J.W. 3D-CVF: Generating joint camera and LiDAR features using cross-view spatial feature fusion for 3D object detection. In European Conference on Computer Vision; Springer International Publishing: Cham, Switzerland, 2020; pp. 720–736. [Google Scholar]
Li, Y.; Chen, Y.; Qi, X.; Li, Z.; Sun, J.; Jia, J. Unifying voxel-based representation with transformer for 3D object detection. Adv. Neural Inf. Process. Syst. 2022, 35, 18442–18455. [Google Scholar]
Wang, H.; Tang, H.; Shi, S.; Li, A.; Li, Z.; Schiele, B.; Wang, L. UniTR: A unified and efficient multi-modal transformer for bird’s-eye-view representation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; pp. 6792–6802. [Google Scholar]
Zhao, Y.; Gong, Z.; Zheng, P.; Zhu, H.; Wu, S. SimpleBEV: Improved LiDAR-camera fusion architecture for 3D object detection. arXiv 2024, arXiv:2411.05292. [Google Scholar]
Wang, M.; Wang, H.; Li, Y.; Chen, L.; Cai, Y.; Shao, Z. MSAFusion: Object Detection Based on Multi-Sensor Adaptive Fusion under BEV. IEEE Trans. Instrum. Meas. 2025; Early Access. [Google Scholar] [CrossRef]
Wolters, P.; Gilg, J.; Teepe, T.; Herzog, F.; Fent, F.; Rigoll, G. SpaRC: Sparse Radar-Camera Fusion for 3D Object Detection. arXiv 2024, arXiv:2411.19860. [Google Scholar]
Palladin, E.; Dietze, R.; Narayanan, P.; Bijelic, M.; Heide, F. Samfusion: Sensor-adaptive multimodal fusion for 3d object detection in adverse weather. In European Conference on Computer Vision; Springer International Publishing: Cham, Switzerland, 2024; pp. 484–503. [Google Scholar]
Pang, S.; Morris, D.; Radha, H. CLOCs: Camera-LiDAR object candidates fusion for 3D object detection. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 25–29 October 2020; pp. 10386–10393. [Google Scholar]
Bai, X.; Hu, Z.; Zhu, X.; Huang, Q.; Chen, Y.; Fu, H.; Tai, C.L. TransFusion: Robust LiDAR-Camera fusion for 3D object detection with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 1090–1099. [Google Scholar]
Shi, S.; Wang, X.; Li, H. PointRCNN: 3D object proposal generation and detection from point cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 770–779. [Google Scholar]
Lu, Y.; Ma, X.; Yang, L.; Zhang, T.; Liu, Y.; Chu, Q.; Ouyang, W. Geometry uncertainty projection network for monocular 3D object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Virtual Event, 11–17 October 2021; pp. 3111–3121. [Google Scholar]
Ding, F.; Wen, X.; Zhu, Y.; Li, Y.; Lu, C.X. RadarOcc: Robust 3D occupancy prediction with 4D imaging radar. Adv. Neural Inf. Process. Syst. 2024, 37, 101589–101617. [Google Scholar]
Ye, B.; Qin, M.; Zhang, S.; Gong, M.; Zhu, S.; Zhao, H.; Zhao, H. Gs-occ3d: Scaling vision-only occupancy reconstruction with gaussian splatting. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Honolulu, HI, USA, 19–23 October 2025; pp. 25925–25937. [Google Scholar]
Ouyang, W.; Xu, Z.; Shen, B.; Wang, J.; Xu, Y. LinkOcc: 3D semantic occupancy prediction with temporal association. IEEE Trans. Circuits Syst. Video Technol. 2025, 35, 1374–1384. [Google Scholar] [CrossRef]
Kumar, S.; Truong, H.; Sharma, S.; Sistu, G.; Scanlan, T.; Grua, E.; Eising, C. Minimizing Occlusion Effect on Multi-View Camera Perception in BEV with Multi-Sensor Fusion. arXiv 2025, arXiv:2501.05997. [Google Scholar] [CrossRef]
Huang, T.; Liu, Z.; Chen, X.; Bai, X. EPNet: Enhancing point features with image semantics for 3D object detection. In European Conference on Computer Vision; Springer International Publishing: Cham, Switzerland, 2020; pp. 35–52. [Google Scholar]
Wang, C.; Ma, C.; Zhu, M.; Yang, X. PointAugmenting: Cross-modal augmentation for 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 11794–11803. [Google Scholar]
Yang, Z.; Chen, J.; Miao, Z.; Li, W.; Zhu, X.; Zhang, L. DeepInteraction: 3D object detection via modality interaction. Adv. Neural Inf. Process. Syst. 2022, 35, 1992–2005. [Google Scholar]
Yan, J.; Liu, Y.; Sun, J.; Jia, F.; Li, S.; Wang, T.; Zhang, X. Cross-modal transformer: Towards fast and robust 3D object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; pp. 18268–18278. [Google Scholar]
Zhou, F.; Chen, H. Cross-modal translation and alignment for survival analysis. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; pp. 21485–21494. [Google Scholar]
Zhang, H.; Liang, L.; Zeng, P.; Song, X.; Wang, Z. SparseLIF: High-performance sparse LiDAR-camera fusion for 3D object detection. In European Conference on Computer Vision; Springer Nature: Cham, Switzerland, 2024; pp. 109–128. [Google Scholar]
Hu, C.; Zheng, H.; Li, K.; Xu, J.; Mao, W.; Luo, M.; Wang, L.; Chen, M.; Peng, Q.; Liu, K.; et al. FusionFormer: A multi-sensory fusion in bird’s-eye-view and temporal consistent transformer for 3D object detection. arXiv 2023, arXiv:2309.05257. [Google Scholar]
Wang, J.; Li, F.; An, Y.; Zhang, X.; Sun, H. Toward robust LiDAR-camera fusion in BEV space via mutual deformable attention and temporal aggregation. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 5753–5764. [Google Scholar] [CrossRef]
Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Providence, RI, USA, 16–21 June 2012; pp. 3354–3361. [Google Scholar]
Caesar, H.; Bankiti, V.; Lang, A.H.; Vora, S.; Liong, V.E.; Xu, Q.; Beijbom, O. nuScenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual Event, 14–19 June 2020; pp. 11621–11631. [Google Scholar]
Martin-Martin, R.; Patel, M.; Rezatofighi, H.; Shenoi, A.; Gwak, J.; Frankel, E.; Savarese, S. JRDB: A dataset and benchmark of egocentric robot visual perception of humans in built environments. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 6748–6765. [Google Scholar] [CrossRef] [PubMed]
Bijelic, M.; Gruber, T.; Mannan, F.; Kraus, F.; Ritter, W.; Dietmayer, K.; Heide, F. Seeing through fog without seeing fog: Deep multimodal sensor fusion in unseen adverse weather. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual Event, 14–19 June 2020; pp. 11682–11692. [Google Scholar]

Figure 1. Conceptual pipeline of FAOD. Multimodal features are first extracted by modality-specific encoders. The visibility estimation module outputs an instance-level occlusion score O and a region-level visibility map

A_{vis}

, which act as control signals to (i) gate cross-modal feature completion and (ii) modulate occlusion-aware fusion (modality weighting) before the final detection head.

Figure 1. Conceptual pipeline of FAOD. Multimodal features are first extracted by modality-specific encoders. The visibility estimation module outputs an instance-level occlusion score O and a region-level visibility map

A_{vis}

, which act as control signals to (i) gate cross-modal feature completion and (ii) modulate occlusion-aware fusion (modality weighting) before the final detection head.

Figure 2. Overview of FAOD. From left to right: (i) Modality-specific encoding extracts features from RGB, LiDAR and IR streams, producing

F_{RGB}, F_{LiDAR}, F_{IR}

(shown as

F_{Resnet}, F_{LiDAR}, F_{IR}

in the diagram). (ii) Occlusion-aware representation estimates a multi-granular visibility signal: an instance-level occlusion score and a region-level visibility map

A_{vis}

, supervised by BCE + TV and class-balanced losses. (iii) Cross-modal attentive completion performs geometry-aware CMA and occlusion-gated mixing. Queries come from a target modality, and keys/values from a donor modality; a geometric bias

B_{geo}

preserves cross-view consistency. The mixing mask is

M_{occ} = 1 - A_{vis}

, optionally modulated by modality reliability

r_{m}

. (iv) Multi-task detection fuses the completed features with dynamic weights

α_{m}

to obtain

F_{final}

, and predicts category c, 3D box

b = (x, y, z, w, h, l, θ)

, and occlusion level O. Training uses

L_{\det}

together with

L_{vis}, L_{rank}, L_{cons}, L_{rec}, L_{align}, L_{smooth}

; inference applies occlusion-aware calibration and Soft-NMS.

Figure 2. Overview of FAOD. From left to right: (i) Modality-specific encoding extracts features from RGB, LiDAR and IR streams, producing

F_{RGB}, F_{LiDAR}, F_{IR}

(shown as

F_{Resnet}, F_{LiDAR}, F_{IR}

in the diagram). (ii) Occlusion-aware representation estimates a multi-granular visibility signal: an instance-level occlusion score and a region-level visibility map

A_{vis}

, supervised by BCE + TV and class-balanced losses. (iii) Cross-modal attentive completion performs geometry-aware CMA and occlusion-gated mixing. Queries come from a target modality, and keys/values from a donor modality; a geometric bias

B_{geo}

preserves cross-view consistency. The mixing mask is

M_{occ} = 1 - A_{vis}

, optionally modulated by modality reliability

r_{m}

. (iv) Multi-task detection fuses the completed features with dynamic weights

α_{m}

to obtain

F_{final}

, and predicts category c, 3D box

b = (x, y, z, w, h, l, θ)

, and occlusion level O. Training uses

L_{\det}

together with

L_{vis}, L_{rank}, L_{cons}, L_{rec}, L_{align}, L_{smooth}

; inference applies occlusion-aware calibration and Soft-NMS.

Figure 3. nuScenes validation: occlusion-stratified mAP (OL-mAP).

Figure 4. Scene 1.

Figure 5. Scene 2.

Figure 6. Scene 3.

Table 1. Latency breakdown (nuScenes, single GPU, FP16, batch

= 1

).

Table 1. Latency breakdown (nuScenes, single GPU, FP16, batch

= 1

).

Component	Latency (ms)	Share (%)
Image backbone + FPN	38.4	41.0
LiDAR voxel/BEV	21.7	23.2
IR/Radar branch	4.8	5.1
CMA (with $B_{geo}$ )	17.6	18.8
Head + Occl-SoftNMS	12.2	13.0
Total (FAOD-L)	94.7	100

Table 2. Key resource metrics (same protocol).

Metric	Value
FPS (=1000/total latency)	10.6
Params (M)	79.2
FLOPs (G)	321.5
Peak memory (GB)	9.3

Table 3. Pareto trade-off across model scales (same protocol).

Config	Img Short	BEV Step	Sweeps	mAP	NDS	OL-mAP ( $O = 2$ )	Lat. (ms)	FPS
FAOD-L	960	0.20	10	66.0	68.1	52.8	94.7	10.6
FAOD-M	800	0.25	5	64.3	66.7	50.0	58.1	17.2
FAOD-S	640	0.30	3	62.5	64.9	47.2	41.0	24.4

Table 4. Memory and throughput with/without engine optimizations.

Variant	Peak Mem (GB)	TRT/ONNX	Throughput (FPS)
FAOD-L (native)	9.3	No	10.6
FAOD-L + TensorRT	7.8	Yes	12.9
FAOD-M (native)	6.1	No	17.2
FAOD-S (native)	4.7	No	24.4

Table 5. Effect of calibration and occlusion-aware NMS (lower is better for ECE/Brier; higher is better for mAP/OL-mAP).

Setting	ECE ↓	Brier ↓	mAP ↑	OL-mAP ( $O = 2$ ) ↑
No temperature scaling and standard NMS	4.6%	0.158	65.3	50.9
+Temperature scaling ( $τ$ constant)	2.9%	0.141	65.8	51.6
+Occl-SoftNMS ( $σ (O)$ adaptive)	2.8%	0.140	66.0	52.8

Table 6. Robustness to modality dropout at inference (nuScenes val).

Dropped Modality	ΔmAP	ΔOL-mAP (O = 2)
IR/Radar off	$- 0.6$	$- 1.1$
RGB off	$- 3.9$	$- 6.4$
LiDAR off	$- 12.7$	$- 18.9$

Table 7. Ablations on efficiency and occlusion accuracy (deltas vs. full FAOD).

Variant	ΔmAP	ΔOL-mAP (O = 2)	ΔLatency (ms)
w/o CMA	$- 2.8$	$- 6.9$	$- 13.7$
w/o $B_{geo}$	$- 1.3$	$- 3.2$	$- 2.1$
w/o $r_{m}$ /gating	$- 1.7$	$- 4.1$	$- 0.8$

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Li, Z.; Singh, B. Robust Occluded Object Detection in Multimodal Autonomous Driving: A Fusion-Aware Learning Framework. Electronics 2026, 15, 245. https://doi.org/10.3390/electronics15010245

AMA Style

Li Z, Singh B. Robust Occluded Object Detection in Multimodal Autonomous Driving: A Fusion-Aware Learning Framework. Electronics. 2026; 15(1):245. https://doi.org/10.3390/electronics15010245

Chicago/Turabian Style

Li, Zhengqing, and Baljit Singh. 2026. "Robust Occluded Object Detection in Multimodal Autonomous Driving: A Fusion-Aware Learning Framework" Electronics 15, no. 1: 245. https://doi.org/10.3390/electronics15010245

APA Style

Li, Z., & Singh, B. (2026). Robust Occluded Object Detection in Multimodal Autonomous Driving: A Fusion-Aware Learning Framework. Electronics, 15(1), 245. https://doi.org/10.3390/electronics15010245

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Robust Occluded Object Detection in Multimodal Autonomous Driving: A Fusion-Aware Learning Framework

Abstract

1. Introduction

1.1. Motivation and Problem

1.2. Approach Overview

1.3. Contributions

2. Related Work

2.1. Single-Modality Occluded Detection

2.2. Multimodal Fusion Strategies

2.3. Occlusion Modeling

2.4. Feature Alignment and Completion

2.5. Datasets and Metrics

3. Problem Formulation

3.1. Sensor Inputs and Metadata

3.2. Frames, Projections, and Gridding

3.3. Instance-Level Annotations and Visibility

3.4. Sample Organization and Occlusion-Aware Sampling

4. Task Objective and Subtasks

5. Overall Optimization Objective and Training Strategy

5.1. Global Objective

5.2. Strategy I: Occlusion-Aware Reweighting

5.3. Strategy II: Spatiotemporal Consistency (Stable Completion with Multi-Frame Aggregation)

5.4. Strategy III: Post-Processing (Occlusion-Aware NMS and Calibration)

5.5. Training Pipeline and Curriculum

6. Method

6.1. Overall Architecture

6.2. Feature Extraction Modules

6.3. Occlusion-Aware Submodules

6.4. Cross-Modal Attention and Completion

6.5. Detection Head and Occlusion-Aware Fusion

6.6. Training and Implementation Notes

7. Experiments

7.1. Overall Performance and Stratified Analysis

7.2. Baseline Comparison and Component Contributions

7.3. Performance Analysis: Efficiency, Resources, and Deployability

7.4. Occlusion-Stratified Results on nuScenes

8. Discussion

8.1. Generalization Capability

8.2. Robustness to Occlusion Types

8.3. Computational Efficiency

8.4. Limitations and Future Directions

9. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Acknowledgments

Conflicts of Interest

Appendix A. Full Task Objective and Loss Formulations

Appendix A.1. Subtask A: Occlusion Classification (Visibility Estimation)

Appendix A.2. Subtask B: Feature Restoration (Cross-Modal Completion)

Appendix A.3. Subtask C: Modality-Aware Detection (Dynamic Fusion and Decoding)

Appendix B. Additional Training and Post-Processing Details

Appendix B.1. Robust Multi-Task Balancing (Optional)

Appendix B.2. Spatiotemporal Consistency (Optional)

Appendix B.3. Occlusion-Aware Post-Processing (Optional)

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI