1. Introduction
Occlusion has long been considered one of the most critical challenges in object detection for autonomous driving, especially in dense and dynamic urban environments [1]. Target objects may be partially obscured by static obstacles such as vehicles, poles, or roadside infrastructure; completely invisible due to severe full occlusion; or involved in complex interactive occlusions where multiple dynamic agents overlap. Such scenarios substantially deteriorate perception performance by distorting object appearances, blurring spatial boundaries, and introducing high levels of uncertainty in both localization and classification.
Recent advances in multimodal perception have opened promising avenues to mitigate these issues. By integrating complementary sensory modalities—RGB cameras, LiDAR sensors, and radar/infrared (IR) systems—perception systems can leverage distinct advantages: RGB imagery provides fine-grained texture and semantic cues; LiDAR offers precise 3D geometric measurements; and radar/IR sensors ensure resilience under adverse weather or low-illumination conditions [2]. This synergy allows for the construction of more comprehensive and robust object representations, particularly in scenarios characterized by severe occlusion or degraded visibility.
1.1. Motivation and Problem
Despite these benefits, existing multimodal fusion methods remain insufficient for reliable occlusion handling. First, while middle-fusion transformers such as BEVFormer can effectively align multimodal features in a shared bird’s-eye-view (BEV) space, they lack a dedicated mechanism to explicitly identify occluded regions and orchestrate targeted, directional information flow from unoccluded modalities for recovering missing object cues [3]. Second, feature alignment and cross-modal completion strategies are often inadequate, leading to fragmented or inconsistent representations across modalities, and recent BEV-space fusion frameworks such as BEVFusion primarily optimize geometric consistency and efficiency rather than explicit occlusion reasoning [4]. Third, most state-of-the-art fusion pipelines suffer from high computational complexity and limited scalability, hindering their deployment in real-time, safety-critical autonomous driving systems [1]. These limitations underscore the need for new frameworks that can explicitly reason about occlusion while efficiently exploiting multimodal complementarities.
We address these gaps by coupling (i) explicit visibility estimation, (ii) geometry-aware cross-modal attentive completion, and (iii) occlusion-adaptive fusion and calibration within a single trainable objective, designed to preserve efficiency for deployment.
1.2. Approach Overview
At a high level, FAOD introduces a visibility-guided multimodal detector. It first estimates multi-granular visibility cues, then uses geometry-aware cross-modal attention to complete features for occluded regions, and finally performs occlusion-adaptive fusion and calibrated post-processing. All components are trained end-to-end under a unified objective; the architectural details are provided in Section 6.
1.3. Contributions
In this work, we propose a novel framework termed Fusion-Aware Occlusion Detection (FAOD), which tightly integrates explicit occlusion modeling with implicit cross-modal feature reconstruction. The main contributions of this study are summarized as follows:
Explicit visibility reasoning for occlusion-aware BEV detection: We propose FAOD as a unified multimodal detection framework that explicitly models occlusion/visibility as learnable variables, including an instance-level occlusion classification and a region-level visibility map. These signals are supervised by occlusion-aware objectives and geometric consistency constraints, and are further used to guide downstream feature completion, fusion, and confidence scoring, rather than relying on implicit BEV aggregation.
Visibility-guided directed cross-modal attention (CMA) for alignment and feature completion: We design a geometry-aware CMA module that performs asymmetric, visibility-driven information transfer (donor → recipient): when a target modality is heavily occluded, complementary less-occluded modalities are selectively attended to reconstruct missing BEV features and align cross-modal representations. This goes beyond symmetric BEV fusion, enabling targeted restoration of occluded object regions.
Occlusion-aware dynamic fusion and score calibration at inference: FAOD couples visibility estimation with adaptive modality weighting and occlusion-aware post-processing. Fusion weights are adjusted conditioned on occlusion severity and modality reliability, while occlusion-aware Soft-NMS and confidence calibration mitigate false suppression of heavily occluded objects, improving detection stability under partial and complete occlusions.
Occlusion-oriented augmentation/labeling and comprehensive benchmarking with deployment considerations: To evaluate FAOD under controlled occlusion levels, we develop an occlusion-centric augmentation and labeling pipeline that explicitly accounts for different visibility regimes. Extensive experiments on four representative datasets (nuScenes, KITTI-MOD, DENSE, and JRDB) show consistent gains over strong baselines, and we additionally adopt streamlined fusion components to maintain practical efficiency for real-time safety-critical deployment.
3. Problem Formulation
3.1. Sensor Inputs and Metadata
The multimodal input includes synchronized RGB images, LiDAR point clouds, and IR maps, together with calibration parameters and temporal alignment and deskew pre-processing. The RGB image, denoted as $I \in \mathbb{R}^{H \times W \times 3}$, is accompanied by intrinsic parameters $K$, extrinsic transform $T_{E \leftarrow C}$, and a time stamp $t$. The LiDAR point cloud is represented as $P = \{p_i = (x_i, y_i, z_i)\}_{i=1}^{N}$. We keep the intensity $r_i$, the laser (ring) index $l_i$, and the relative sample time $\tau_i$ of each point. We also record the scan start time $t_0$, the scan duration $\Delta t$, and whether multi-sweep accumulation is used. The IR map is expressed as $R \in \mathbb{R}^{H' \times W'}$. Its metadata comprise intrinsic parameters and resolution, extrinsics $T_{E \leftarrow R}$, sampling rate, and beam model for mapping range–velocity.
For temporal alignment and deskewing, let $E$ denote the unified world frame. We align all sensor timestamps to the LiDAR mid-scan time $t^\star = t_0 + \Delta t / 2$ and deskew LiDAR by continuously interpolated poses $T_{E \leftarrow L}(t)$:
$$\tilde{p}_i = T_{E \leftarrow L}(t^\star)^{-1}\, T_{E \leftarrow L}(t_0 + \tau_i)\, p_i,$$
so that every point is expressed at the common reference time $t^\star$.
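As a minimal sketch of this deskew step (assuming, for illustration only, planar ego-motion with linearly interpolated 2D poses; the function names are hypothetical, not part of the reference implementation):

```python
import numpy as np

def interp_pose(t, t0, t1, pose0, pose1):
    """Linearly interpolate a 2D rigid pose (x, y, yaw) between two stamps."""
    a = 0.0 if t1 == t0 else (t - t0) / (t1 - t0)
    return pose0 + a * (pose1 - pose0)

def se2_matrix(pose):
    """Build a 3x3 homogeneous transform from (x, y, yaw)."""
    x, y, th = pose
    c, s = np.cos(th), np.sin(th)
    return np.array([[c, -s, x], [s, c, y], [0.0, 0.0, 1.0]])

def deskew(points_xy, stamps, t_ref, t0, t1, pose0, pose1):
    """Re-express each point at the mid-scan reference time t_ref.

    Each point observed at time t_i is mapped into the world frame with
    the interpolated pose at t_i, then mapped back with the inverse of
    the pose at t_ref, mirroring the deskew equation above."""
    T_ref_inv = np.linalg.inv(se2_matrix(interp_pose(t_ref, t0, t1, pose0, pose1)))
    out = np.empty_like(points_xy)
    for i, (p, t) in enumerate(zip(points_xy, stamps)):
        T_i = se2_matrix(interp_pose(t, t0, t1, pose0, pose1))
        ph = T_ref_inv @ T_i @ np.array([p[0], p[1], 1.0])
        out[i] = ph[:2]
    return out
```

For a stationary platform the interpolated poses coincide and the deskew reduces to the identity, which is a convenient sanity check.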
3.2. Frames, Projections, and Gridding
All modalities are geometrically aligned with a unified coordinate frame $E$ or a common BEV representation for spatial consistency. For the camera projection (LiDAR → image), a 3D point $p^L$ in the LiDAR frame is first transformed into the camera coordinate system as $p^C = T_{C \leftarrow L}\, p^L$. The homogeneous pixel $\tilde{u} = K p^C$ is normalized to $u = (\tilde{u}_x / \tilde{u}_z,\ \tilde{u}_y / \tilde{u}_z)$. For BEV mapping (points/voxels → BEV), the ground plane is discretized according to a fixed cell resolution $(\Delta x, \Delta y)$, and features are pooled/encoded along the vertical dimension $z$ to obtain $F^{\mathrm{BEV}} \in \mathbb{R}^{X \times Y \times C}$.
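The two geometric operations can be sketched in a few lines of NumPy (a simplified illustration with hypothetical helper names; max-pooling over cells stands in for the learned vertical encoding):

```python
import numpy as np

def project_lidar_to_image(points_l, T_cam_from_lidar, K):
    """Transform LiDAR points into the camera frame, then to pixels.

    points_l: (N, 3); T_cam_from_lidar: 4x4 extrinsics; K: 3x3 intrinsics.
    Returns (N, 2) pixel coordinates and a mask of points in front of
    the camera (z > 0), which must be filtered before use."""
    ph = np.hstack([points_l, np.ones((len(points_l), 1))])
    pc = (T_cam_from_lidar @ ph.T).T[:, :3]          # camera frame
    front = pc[:, 2] > 1e-6
    uvw = (K @ pc.T).T                               # homogeneous pixels
    uv = uvw[:, :2] / np.clip(uvw[:, 2:3], 1e-6, None)
    return uv, front

def pool_to_bev(points_l, feats, x_range, y_range, cell):
    """Max-pool per-point features into a BEV grid over (x, y)."""
    nx = int((x_range[1] - x_range[0]) / cell)
    ny = int((y_range[1] - y_range[0]) / cell)
    bev = np.zeros((nx, ny, feats.shape[1]))
    ix = ((points_l[:, 0] - x_range[0]) / cell).astype(int)
    iy = ((points_l[:, 1] - y_range[0]) / cell).astype(int)
    ok = (ix >= 0) & (ix < nx) & (iy >= 0) & (iy < ny)
    for i, j, f in zip(ix[ok], iy[ok], feats[ok]):
        bev[i, j] = np.maximum(bev[i, j], f)
    return bev
```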
3.3. Instance-Level Annotations and Visibility
Each object instance is associated with a semantic category label $c \in \mathcal{C}$, where $\mathcal{C}$ denotes the predefined set of target classes (e.g., pedestrian, cyclist, passenger vehicle, large vehicle, traffic facility, etc.). The category label characterizes the semantic attributes of an object and serves as one of the fundamental prediction variables in multimodal detection tasks. Depending on the dataset configuration, the cardinality $|\mathcal{C}|$ can range from a small set of classes (e.g., three in KITTI) to a richer taxonomy (e.g., ten in nuScenes) and may be extended to support additional categories in more complex scenarios. To ensure cross-modality consistency, the category labels are defined and indexed in a unified manner across RGB images, LiDAR point clouds, and IR/radar annotations, enabling the detection model to align and share a common semantic space among heterogeneous sensors. Moreover, in order to evaluate robustness under occlusion, the category labels are further combined with visibility levels and bounding box annotations, which facilitates fine-grained performance analysis under varying occlusion conditions.
In addition to the semantic category label, each object is also described by a 3D bounding box $b = (x, y, z, w, l, h, \theta)$ with optional velocity $(v_x, v_y)$. The occlusion level is defined as $O \in \{0, 1, 2\}$ and unified via a visible ratio
$$v = \alpha\, v_{\mathrm{img}} + (1 - \alpha)\, v_{\mathrm{pc}}, \qquad \alpha \in [0, 1],$$
with thresholds
$$O = \begin{cases} 0, & v \geq 0.75,\\ 1, & 0.25 \leq v < 0.75,\\ 2, & v < 0.25. \end{cases}$$
Here, $v_{\mathrm{img}}$ denotes the fraction of visible pixels within the 2D bounding box in the image plane, $v_{\mathrm{pc}}$ denotes the fraction of valid LiDAR points inside the 3D box relative to the expected number of points, and $v$ is a weighted combination of the two, with $\alpha$ controlling the relative contribution of image and point-cloud visibility. The thresholds above assign $O = 0$ to mostly visible objects, $O = 1$ to partially occluded objects, and $O = 2$ to heavily or fully occluded ones.
The cutoffs at 0.75 and 0.25 follow common three-level occlusion protocols in driving benchmarks, where roughly three quarters of the object area being visible corresponds to “non-occluded”, and less than one quarter corresponds to “heavily occluded”. The intermediate band provides a sufficiently wide regime of partially occluded samples for learning while keeping the semantic interpretation of each level clear. Since $v$ is a convex combination of $v_{\mathrm{img}}$ and $v_{\mathrm{pc}}$, increasing $\alpha$ shifts the occlusion decision towards image-based visibility, whereas decreasing $\alpha$ emphasizes LiDAR-based visibility. The sensitivity of $v$ to $\alpha$ is bounded by $|v_{\mathrm{img}} - v_{\mathrm{pc}}|$; when the two modalities broadly agree, moderate changes of $\alpha$ do not alter the assigned occlusion level $O$, and only strong disagreements lead to boundary cases.
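The visible-ratio combination and the three-level thresholding can be written compactly (a direct transcription of the definitions above; the default α = 0.5 is illustrative):

```python
def visible_ratio(v_img, v_pc, alpha=0.5):
    """Weighted combination of image- and point-cloud-based visibility."""
    return alpha * v_img + (1.0 - alpha) * v_pc

def occlusion_level(v, hi=0.75, lo=0.25):
    """Map the unified visible ratio to the three-level occlusion label:
    0 = mostly visible, 1 = partially occluded, 2 = heavily/fully occluded."""
    if v >= hi:
        return 0
    if v >= lo:
        return 1
    return 2
```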
Optional fine-grained labels include a pixel/BEV visibility mask, a depth-order/occlusion graph (edges “occluder → occluded”), and sensor-availability flags.
3.4. Sample Organization and Occlusion-Aware Sampling
Under heavy occlusion, temporal windows with motion compensation are employed to increase visibility and maintain spatiotemporal continuity across frames.
To further address dataset imbalance, stratified sampling is applied to balance samples across occlusion levels ($O = 0, 1, 2$), or to oversample partially and fully occluded instances ($O \in \{1, 2\}$), preventing domination by non-occluded samples.
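A minimal sketch of such occlusion-stratified sampling weights (the equal one-third target shares are an illustrative choice, not the paper's exact sampling ratios):

```python
import numpy as np

def stratified_weights(occ_labels, target=(1 / 3, 1 / 3, 1 / 3)):
    """Per-sample weights so each occlusion level contributes a target share.

    occ_labels: sequence of levels in {0, 1, 2}; returns weights summing
    to 1, suitable for a weighted random sampler. Frequent levels (usually
    O = 0) receive smaller per-sample weight, preventing their domination."""
    occ_labels = np.asarray(occ_labels)
    w = np.zeros(len(occ_labels), dtype=float)
    for lvl, share in enumerate(target):
        idx = occ_labels == lvl
        n = idx.sum()
        if n > 0:
            w[idx] = share / n
    return w / w.sum()
```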
For the assignment, both anchor-free and anchor-based strategies are adapted to handle occluded samples. In the anchor-free setting, a top-$k$ dynamic assignment with center/distance priors is commonly used. For anchor-based methods, intersection over union (IoU) thresholds are relaxed for high-$O$ samples, and additional center biases are introduced. Define a composite cost (anchor-free example):
$$\mathrm{cost}_{ij} = w(O_j)\,\big(L_{\mathrm{cls}}(i, j) + \lambda_{\mathrm{box}} L_{\mathrm{box}}(i, j) + \lambda_{\mathrm{ctr}}\, d_{\mathrm{ctr}}(i, j)\big),$$
where $w(O_j) \leq 1$ downweights penalties for highly occluded instances.
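One possible instantiation of this cost (the linear discount multiplier below is an illustrative choice for $w(O)$, not the paper's exact form):

```python
import numpy as np

def assignment_cost(cls_cost, box_cost, center_dist, occ_level,
                    lam_box=1.0, lam_ctr=0.5, occ_discount=0.3):
    """Composite anchor-free matching cost with occlusion downweighting.

    A multiplier w(O) = 1 - occ_discount * O / 2 shrinks the penalty for
    highly occluded ground-truth instances, so they still attract enough
    positive candidates during top-k dynamic assignment."""
    w = 1.0 - occ_discount * occ_level / 2.0
    return w * (cls_cost + lam_box * box_cost + lam_ctr * center_dist)

def topk_assign(cost_matrix, k=2):
    """For each ground truth (column), pick the k lowest-cost candidates."""
    return np.argsort(cost_matrix, axis=0)[:k]
```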
5. Overall Optimization Objective and Training Strategy
The visibility, completion, and fusion modules are trained jointly so the network learns not only to detect objects, but also to estimate occlusion and recover missing information in a coordinated manner. This section summarizes the global objective and the strategies used to emphasize low-visibility cases while keeping training stable.
5.1. Global Objective
We jointly optimize detection, multi-granular visibility estimation, and cross-modal completion using
$$L = L_{\mathrm{det}} + \lambda_{\mathrm{vis}} L_{\mathrm{vis}} + \lambda_{\mathrm{comp}} L_{\mathrm{comp}},$$
where $L_{\mathrm{det}}$ is the detection objective (classification + box regression), $L_{\mathrm{vis}}$ is the visibility-estimation objective (including occlusion classification and region-level visibility constraints), and $L_{\mathrm{comp}}$ is the completion objective. The complete definitions of $L_{\mathrm{det}}$, $L_{\mathrm{vis}}$, and $L_{\mathrm{comp}}$ and all constituent losses are given in Appendix A.
5.2. Strategy I: Occlusion-Aware Reweighting
This strategy upweights hard samples so partially and fully occluded instances contribute more strongly during training. Concretely, for difficult cases ($O \in \{1, 2\}$), we amplify the occlusion-related objectives and the completion consistency, with stronger amplification for $O = 2$ to enforce completion consistency under full occlusion.
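The reweighting above can be sketched as follows (the amplification factors 1.5 and 2.5 are illustrative placeholders, not the values used in the implementation):

```python
def reweight_losses(losses, occ_level, gamma_partial=1.5, gamma_full=2.5):
    """Scale occlusion-related and completion losses for hard samples.

    losses: dict with 'det', 'vis', 'comp' terms for one sample.
    Partially occluded samples (O = 1) get a moderate boost; fully
    occluded ones (O = 2) get a stronger boost, emphasizing completion
    consistency when the object is invisible."""
    g = {0: 1.0, 1: gamma_partial, 2: gamma_full}[occ_level]
    return losses['det'] + g * losses['vis'] + g * losses['comp']
```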
In addition, we optionally employ robust multi-task balancing (e.g., homoscedastic uncertainty weighting) and gradient normalization techniques (e.g., GradNorm/PCGrad) to prevent a single objective (typically classification) from dominating optimization. The explicit formulation used in our implementation is provided in Appendix B.
5.3. Strategy II: Spatiotemporal Consistency (Stable Completion with Multi-Frame Aggregation)
To stabilize completion across time, we enforce that point clouds and feature maps corresponding to the same physical object remain consistent across neighboring frames. This reduces flicker and overfitting to single-frame noise, which becomes noticeable when visibility is low.
Given a temporal window of neighboring frames and the corresponding ego poses, we incorporate (i) point-level consistency under ego-motion and (ii) feature-level consistency via geometric warping. The complete equations (including the Chamfer-like point loss and feature warping loss) are reported in Appendix B. In practice, we first converge a single-frame model, and then introduce the temporal consistency terms together with multi-frame aggregation.
5.4. Strategy III: Post-Processing (Occlusion-Aware NMS and Calibration)
Beyond the core network, we apply occlusion-aware post-processing to avoid suppressing hard occluded true positives and to calibrate confidence scores. The key idea is to soften suppression and adjust score calibration when a hypothesis is predicted as heavily occluded, because occlusion increases localization uncertainty and reduces IoU overlap.
We adopt occlusion-aware Soft-NMS and occlusion-conditioned temperature scaling (and, optionally, uncertainty-aware NMS). To keep the main text lightweight, the full post-processing equations are given in Appendix B.
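A simplified sketch of the two post-processing ingredients (the Gaussian decay with an occlusion-widened sigma is an illustrative variant of occlusion-aware Soft-NMS, not the exact formulation deferred to Appendix B):

```python
import numpy as np

def iou_2d(a, b):
    """IoU of axis-aligned boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda c: (c[2] - c[0]) * (c[3] - c[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def occlusion_soft_nms(boxes, scores, occ_probs, sigma=0.5, occ_relief=0.5):
    """Gaussian Soft-NMS whose decay is weakened for occluded hypotheses.

    occ_probs: predicted probability of heavy occlusion per box; a larger
    value widens the effective sigma so hard positives decay less.
    (Single pass over the initial score order; a full implementation
    would re-sort after each decay.)"""
    boxes, scores = list(boxes), np.array(scores, dtype=float)
    order = np.argsort(-scores)
    keep_scores = scores.copy()
    for rank, i in enumerate(order):
        for j in order[rank + 1:]:
            ov = iou_2d(boxes[i], boxes[j])
            sig = sigma * (1.0 + occ_relief * occ_probs[j])
            keep_scores[j] *= np.exp(-(ov ** 2) / sig)
    return keep_scores

def temperature_scale(logits, T):
    """Temperature scaling of classification logits; T may be chosen
    conditioned on the predicted occlusion level."""
    z = np.asarray(logits, dtype=float) / T
    e = np.exp(z - z.max())
    return e / e.sum()
```

Note how a box flagged as occluded keeps more of its score after suppression, while a larger temperature flattens the calibrated distribution.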
5.5. Training Pipeline and Curriculum
We recommend a staged training pipeline. Training proceeds as follows: (i) train the detection objective $L_{\mathrm{det}}$ to stability; (ii) enable $L_{\mathrm{vis}}$ for visibility estimation; and (iii) activate $L_{\mathrm{comp}}$ together with temporal consistency (if used). The occlusion curriculum increases the synthetic occlusion strength from mild to severe (linear/cosine schedule) while ramping up the completion weight. To improve robustness to missing sensors, we randomly drop modalities (guided by sensor-availability flags) so the completion and fusion modules generalize across sensor degradation. Unless otherwise stated, all weights and thresholds are treated as learnable or scheduled hyperparameters, supporting reproducibility and systematic ablations across datasets and sensor configurations.
6. Method
6.1. Overall Architecture
FAOD is an occlusion-robust multimodal detector that links modality-specific encoding → occlusion-aware representation → cross-modal attentive completion → multi-task detection in an end-to-end pipeline. To make the data flow easier to follow at first glance, Figure 1 provides a conceptual view of how visibility estimation, cross-modal completion, and occlusion-aware fusion operate together. Figure 2 then presents the detailed module design and training/inference signals.
Formally, given synchronized sensory inputs from RGB cameras $I$, LiDAR point clouds $P$, and optionally radar/IR maps $R$, the objective is to learn a detection function
$$f: (I, P, R) \mapsto \{(c_k, b_k, O_k)\}_{k=1}^{M},$$
where $c_k$ denotes the semantic object category, $b_k$ defines the 3D bounding box that includes spatial position, dimensions, and orientation, and $O_k \in \{0, 1, 2\}$ represents the occlusion state corresponding to no occlusion, partial occlusion, and full occlusion. FAOD comprises (i) modality-specific encoders for RGB, LiDAR, and IR/radar; (ii) an occlusion-aware feature extractor producing multi-granular visibility signals; (iii) CMA for selective fusion and completion; and (iv) a multi-task head that predicts $(c, b, O)$ with occlusion-adaptive fusion and decoding (see Figure 2).
6.2. Feature Extraction Modules
With a ResNet/Swin backbone and an FPN, the image encoder produces multi-scale features $\{F^{\mathrm{RGB}}_s\}$, where $s$ indexes the pyramid scales. For alignment with BEV/point features, a perspective or learnable view transform $\mathcal{T}_{\mathrm{view}}$ is applied: $F^{\mathrm{RGB}}_{\mathrm{BEV}} = \mathcal{T}_{\mathrm{view}}(F^{\mathrm{RGB}})$. For LiDAR, the voxel pathway (VoxelNet/SECOND) builds a voxel tensor and yields BEV features $F^{\mathrm{LiDAR}}_{\mathrm{BEV}}$ via 3D/2D convolutions. The point pathway (PointNet++) aggregates raw points $P$ to point-wise features $F^{\mathrm{pt}}$, then pools to BEV with a voxel/grid operator $\mathcal{G}$: $F^{\mathrm{pt}}_{\mathrm{BEV}} = \mathcal{G}(F^{\mathrm{pt}})$. For IR/Radar, a lightweight CNN/Transformer produces $F^{\mathrm{IR}}$; geometric calibration maps it to the unified view: $F^{\mathrm{IR}}_{\mathrm{BEV}} = \mathcal{W}(F^{\mathrm{IR}})$.
The aligned main-scale maps are then used by subsequent modules in a common BEV/grid domain.
6.3. Occlusion-Aware Submodules
FAOD augments the backbone with auxiliary occlusion branches that provide explicit visibility cues for downstream completion and fusion. The goal of these submodules is to estimate, for each candidate and region in the scene, how strongly it is occluded, so that later stages can selectively trust or discount modality evidence.
At each candidate (instance or BEV grid cell), the occlusion branch outputs an instance probability $p(O)$ for $O \in \{0, 1, 2\}$ and a region-level visibility map $V$. Instance-level occlusion is trained with class-balanced cross-entropy or Focal loss, and the visibility map is supervised by BCE with total-variation (TV) regularization, as defined in Subtask A.
We obtain a semantic-guided visibility map by concatenating RGB and LiDAR features and projecting to a single channel:
$$V = \sigma\big(\mathrm{Conv}_{1 \times 1}([F^{\mathrm{RGB}}_{\mathrm{BEV}};\, F^{\mathrm{LiDAR}}_{\mathrm{BEV}}])\big),$$
where $[\cdot\,; \cdot]$ denotes channel-wise concatenation. Given a coarse fused map $F$, multi-head self-attention with positional encoding $\mathrm{PE}$ and geometric bias $B_{\mathrm{geo}}$ is applied using the following geometry-aware attention operator:
$$\mathrm{Attn}(Q, K, V') = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}} + B_{\mathrm{geo}}\right) V'.$$
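A single-head NumPy sketch of such geometry-aware attention (using a simple distance-penalty bias as a stand-in for the learned geometric bias; weight matrices would be learned in practice):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def geometry_aware_attention(feats, coords, W_q, W_k, W_v, tau=1.0):
    """Single-head self-attention over BEV cells with a geometric bias.

    feats: (N, d) cell features; coords: (N, 2) cell centers.
    The bias -tau * ||coords_i - coords_j|| discourages attending to
    distant cells, a simple proxy for a learned geometric prior."""
    Q, K, V = feats @ W_q, feats @ W_k, feats @ W_v
    d = Q.shape[-1]
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    logits = Q @ K.T / np.sqrt(d) - tau * dist
    return softmax(logits, axis=-1) @ V
```

With a very large `tau`, each cell attends almost exclusively to itself, which is a quick way to verify the bias behaves as intended.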
6.4. Cross-Modal Attention and Completion
Once visibility has been estimated, FAOD uses cross-modal attention to transfer information from less occluded “donor” modalities to more occluded “target” modalities. Intuitively, this module aims to complete or refine target features in regions where they are unreliable, by borrowing geometry-consistent evidence from other sensors.
Given a target modality $m$ and a donor modality $n$, queries, keys, and values are obtained by linear projections:
$$Q = W_Q F_m, \qquad K = W_K F_n, \qquad V' = W_V F_n.$$
The attended target features $\tilde{F}_m$ are then computed by the operator in Equation (14). For occlusion-gated mixing, let $o \in [0, 1]$ denote the local occlusion level of the target modality. A modality reliability score $r_n$ (estimated from density/SNR/texture/motion blur) yields a donor weight $\beta = o \cdot r_n$, and the completed target features are updated by
$$\hat{F}_m = (1 - \beta) \odot F_m + \beta \odot \tilde{F}_m.$$
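A compact sketch of this donor → recipient completion (NumPy, single head; the multiplicative donor weight combining occlusion and reliability is an illustrative gating choice):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_complete(F_tgt, F_don, occ, rel_don, W_q, W_k, W_v):
    """Directed completion of occluded target features from a donor.

    F_tgt, F_don: (N, d) aligned BEV features of target and donor.
    occ: per-location occlusion level in [0, 1] for the target modality.
    rel_don: scalar donor reliability (density/SNR proxy).
    The mixing weight beta = occ * rel_don replaces target features
    only where they are unreliable, leaving visible regions untouched."""
    Q, K, V = F_tgt @ W_q, F_don @ W_k, F_don @ W_v
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1)
    F_hat = attn @ V                       # attended donor evidence
    beta = (occ * rel_don)[:, None]        # per-location donor weight
    return (1.0 - beta) * F_tgt + beta * F_hat
```

When the target is fully visible (occ = 0), the update is the identity; when it is fully occluded with a reliable donor, the donor evidence takes over.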
6.5. Detection Head and Occlusion-Aware Fusion
The final detection stage converts the completed multimodal features into class, box, and occlusion predictions, while adaptively weighting each modality according to visibility and reliability. This head ties together the preceding modules and determines how much each sensor contributes to the final decision at each spatial location.
On the fused representation $F_{\mathrm{fuse}}$, a multi-task head predicts class, box, and occlusion:
$$(\hat{c}, \hat{b}, \hat{O}) = \mathrm{Head}(F_{\mathrm{fuse}}).$$
The detection objective follows Subtask C, with $L_{\mathrm{det}} = L_{\mathrm{cls}} + \lambda_{\mathrm{box}} L_{\mathrm{box}} + \lambda_{\mathrm{occ}} L_{\mathrm{occ}}$; the formulations of $L_{\mathrm{cls}}$, $L_{\mathrm{box}}$ (IoU/DIoU + $L_1$ with periodic angle), and $L_{\mathrm{occ}}$ are defined there and not repeated here.
Occlusion-aware dynamic fusion computes, at each BEV location, per-modality weights via a learnable gate $g_m$ (e.g., a two-layer MLP over $[V;\, r_m;\, \mathrm{GAP}(F_m)]$); the resulting logits are normalized to weights:
$$w_m = \frac{\exp(g_m)}{\sum_{m'} \exp(g_{m'})}.$$
Here, $V$ is the visibility map defined earlier, $r_m$ denotes modality reliability, and $\mathrm{GAP}$ is global average pooling. Higher occlusion (lower visibility) and higher reliability $r_m$ increase $w_m$, prioritizing robust modalities (e.g., LiDAR/IR) when needed.
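The gating behavior can be sketched with a fixed heuristic standing in for the learned two-layer MLP (an assumption for illustration; the real gate is trained end-to-end):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_modalities(feat_maps, occ_map, reliabilities, bias=2.0):
    """Occlusion-aware convex combination of per-modality BEV features.

    feat_maps: list of (H, W, C) arrays; occ_map: (H, W) occlusion in
    [0, 1]; reliabilities: one scalar per modality. In occluded cells
    the gate shifts weight toward more reliable modalities; in visible
    cells it stays close to uniform."""
    rel = np.asarray(reliabilities, dtype=float)
    # logits: (H, W, M); occlusion amplifies the reliability contrast
    logits = bias * occ_map[..., None] * (rel - rel.mean())
    w = softmax(logits, axis=-1)                   # sums to 1 over modalities
    stacked = np.stack(feat_maps, axis=-1)         # (H, W, C, M)
    return (stacked * w[:, :, None, :]).sum(-1), w
```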
6.6. Training and Implementation Notes
All modalities are aligned to a unified BEV/grid; for multi-frame inputs, pose-based registration and LiDAR deskew are applied. Synthetic occlusion is applied with strength gradually increased, and the completion weight $\lambda_{\mathrm{comp}}$ is ramped up in tandem to stabilize completion learning.
The overall objective follows the global formulation in Section 5 (Equation (6)). For samples with $O \in \{1, 2\}$, the visibility and completion terms are increased; homoscedastic uncertainty weights may be used for task balancing. During inference, occlusion-aware Soft-NMS (weaker suppression for hypotheses predicted as occluded), temperature scaling, and variable IoU thresholds are used to reduce misses and over-suppression under heavy occlusion.
The pipeline forms a causal loop from visibility estimation → cross-modal completion → dynamic fusion. The visibility heatmap localizes occlusions, CMA performs geometry-consistent targeted recovery, and the dynamic gate assigns modality weights based on occlusion and reliability. The multi-task head jointly optimizes these components, yielding robust 3D detection under partial and full occlusion.
7. Experiments
7.1. Overall Performance and Stratified Analysis
Across four benchmarks, FAOD consistently outperforms multimodal and occlusion-specialized baselines. Averaged over three independent runs (distinct random seeds), FAOD yields consistent absolute improvements in both mAP and NDS, with statistical significance verified by bootstrap testing.
To assess robustness under occlusion, results are partitioned by the unified visibility thresholds into the three tiers $O = 0, 1, 2$ (non-occluded, partially occluded, and severely/near-completely occluded) and reported as OL-mAP per subset. For non-occluded cases ($O = 0$), FAOD matches or slightly exceeds strong baselines, indicating no loss of upper-bound accuracy. For partially occluded cases ($O = 1$), clear gains are observed, attributable to visibility guidance and cross-modal completion that mitigate weak texture and sparse points. For severely/near-completely occluded cases ($O = 2$), the gains are most pronounced; recall increases more than precision, consistent with CMA and dynamic fusion recovering detectability under low-information conditions.
Category-wise and scale-wise analyses indicate especially notable improvements in small/distant classes (e.g., pedestrian, cyclist). Scale binning shows larger OL-mAP gains for small-to-medium objects, consistent with FAOD’s ability to compensate for sparse LiDAR returns and weak image cues. This trend is also aligned with typical road scenes, where small agents are the first to disappear behind occluders and the last to provide clean geometry.
The region-level visibility map attains higher IoU and lower cross-entropy against ground-truth visibility than unsupervised baselines. The Spearman correlation between the predicted visibility and classification confidence is also higher, supporting its use in score calibration. In practice, this correlation matters more for $O = 2$: the score needs to reflect “how much evidence is really there”, otherwise the post-processing step tends to discard the hard positives.
On DENSE nighttime/low-light/rain–snow subsets, FAOD’s advantage widens, with OL-mAP gains growing at $O = 1$ and further at $O = 2$. On JRDB’s crowded indoor scenes with long occlusion chains, FAOD maintains stable recall. These are the cases where symmetric fusion is easily confused by missing or noisy cues; the visibility-gated completion and occlusion-aware calibration are simply more forgiving, and the improvements show up consistently in the occlusion-stratified metrics.
7.2. Baseline Comparison and Component Contributions
Compared with multimodal baselines (e.g., PointPainting, MVX-Net, UVTR), PointPainting is vulnerable to noisy image semantics, particularly at $O = 2$; FAOD suppresses unreliable channels via the visibility map and reliability gating, reducing false positives. MVX-Net/UVTR degrade under alignment errors or missing modalities; FAOD’s geometric bias and gated fusion show greater robustness.
Versus occlusion-specialized baselines (e.g., GUPNet, ORN, DetZero), single-modality/view occlusion reasoning is limited under complete occlusion; FAOD’s cross-modal completion transfers information directionally (donor → recipient), reconstructing features for invisible recipients. In dense crowds, ORN/DetZero’s reliance on ordering/logic graphs is less robust to annotation noise; FAOD with soft visibility estimates yields smoother behavior.
Ablations show consistent trends. Removing the occlusion branch notably degrades OL-mAP at $O = 2$ and weakens the visibility map, limiting CMA completion. Removing the geometric bias hurts more in high-parallax camera–radar/camera–LiDAR settings. Removing reliability gating increases mis-fusion in low-light/sparse segments and reduces the variance of the fusion weights. Disabling consistency/contrastive losses during the occlusion curriculum leads to over-completion or local overfitting with larger per-subset variance. Temporal consistency (optional) further improves recall at modest latency cost.
7.3. Performance Analysis: Efficiency, Resources, and Deployability
We use three model scales—FAOD-S (small), FAOD-M (medium), and FAOD-L (large). Unless otherwise stated, the latency breakdown reports the large scale (FAOD-L). Under a common protocol (nuScenes, single GPU, FP16, batch = 1), we report latency and key resource metrics in Table 1 and Table 2. CMA and the image backbone dominate compute; reducing image resolution/backbone width and triggering CMA sparsely (e.g., ROI-based) provide the largest speedups.
Speed–accuracy trade-offs are shown in Table 3. FAOD-M reduces latency by ≈39% vs. FAOD-L while losing ≈2.8 pts in accuracy, making it suitable for online use; FAOD-L favors offline high-accuracy settings.
In this context, FAOD-S can be regarded as the lightweight variant targeting resource-constrained or embedded deployments. Compared with FAOD-L, it substantially reduces latency and peak memory (see Table 2) at the cost of several points in mAP and OL-mAP. Such a trade-off is acceptable for many automotive ECUs where on-board compute and memory are limited. On automotive-grade SoCs, additional gains are expected from TensorRT/ONNX engines, mixed precision, and moderate backbone width scaling; a full evaluation of FAOD-S on embedded hardware is left for future work.
Engine-level optimizations reduce memory and improve throughput (Table 4); e.g., TensorRT yields throughput gains of 20% or more.
Calibration and post-processing analyses on nuScenes val are given in Table 5. Temperature scaling improves calibration (ECE/Brier), and occlusion-aware Soft-NMS further improves detection under heavy occlusion ($O = 2$; higher OL-mAP) together with overall mAP.
Robustness to modality dropout at inference is summarized in Table 6. LiDAR is critical under strong occlusion; RGB/IR remain complementary in low light and sparse-point regimes.
Efficiency impacts of key components are shown in Table 7. CMA yields the largest accuracy gains with moderate cost; the visibility branch and reliability gating are highly cost-effective for high-occlusion accuracy.
FAOD delivers statistically significant gains on aggregate and occlusion-stratified metrics across four benchmarks, with the largest improvements at $O = 2$ due to cross-modal completion and adaptive gating. Efficiency-wise, FAOD traces a clear Pareto frontier via image/BEV resolution and sparse attention, enabling both offline and online deployments. Interpretability (occlusion predictions and visibility maps) and better calibration (temperature scaling) support practical deployment and safety analyses.
7.4. Occlusion-Stratified Results on nuScenes
To rigorously assess robustness under varying degrees of occlusion, the nuScenes validation set is stratified into three visibility tiers—non-occluded ($O = 0$), partially occluded ($O = 1$), and heavily occluded ($O = 2$)—and OL-mAP is reported for each tier. Overall, FAOD-L attains the best or tied-best performance across all tiers and exhibits a smaller degradation as occlusion increases than both multimodal and occlusion-specialized baselines (Figure 3).
For $O = 0$, FAOD-L achieves the highest OL-mAP, exceeding the mean of the four baselines (69.25) and outperforming the best baseline (70.0). This suggests that introducing explicit visibility reasoning does not come at the cost of peak accuracy: when observations are clean, the model largely behaves like a strong BEV fusion detector rather than “over-correcting” what is already reliable. For $O = 1$, FAOD-L again leads, improving over the baseline mean (56.5) and over the strongest baseline (58.0). In many nuScenes scenes, partial occlusion is the more common and also the more confusing case: one modality may still carry a usable fragment (e.g., a contour in RGB), while another becomes sparse or locally corrupted (e.g., missing returns in LiDAR). The visibility heatmap helps here by damping unreliable regions and letting the fusion focus on the parts that are still trustworthy; CMA then supplies complementary cues where the target stream is weak, instead of mixing all modalities symmetrically in BEV. For $O = 2$, FAOD-L surpasses both the baseline mean (44.25) and the strongest baseline (46.0). The improvement is strongest under $O = 2$ and is driven mainly by recall: in these cases, the detector often needs to work with very limited evidence (a few points, a small edge fragment, or intermittent responses). Visibility-gated CMA, together with reliability weighting, reconstructs discriminative features only where information is genuinely missing, making the remaining cues usable without spreading artifacts across the scene. This also makes post-processing less brittle, because a hard true positive under severe occlusion may not achieve the “nice” overlap pattern that standard suppression heuristics assume.
The tiered results also hint at what kind of situations in nuScenes FAOD benefits from. Under $O = 1$, the gain tends to come from cases that are partially blocked but still geometrically consistent—for instance, an agent visible in one stream while partially missing in another due to occluders or viewpoint. The directed (donor → recipient) completion is especially useful in this regime: it transfers information from the less-occluded donor stream to the occluded target stream, which is a different behavior from symmetric BEV aggregation. Under $O = 2$, detections are closer to the decision boundary. Here, the visibility gating prevents the completion module from “guessing everywhere”, and the occlusion-aware calibration/NMS helps avoid over-suppressing these low-IoU, low-confidence but correct hypotheses. In short, $O = 1$ benefits more from selective restoration, while $O = 2$ benefits from both restoration and a more forgiving confidence/suppression policy.
Figure 4, Figure 5 and Figure 6 provide qualitative comparisons between the baseline fusion model (BEVFormer) and FAOD under heavy occlusion. For each scene, the top image shows the baseline result, while the bottom image shows FAOD. In scenarios where target objects are largely invisible in RGB and only sparsely observed in LiDAR, the baseline often fails to form meaningful responses, leading to missed detections or fragmented hypotheses. In contrast, FAOD produces more coherent BEV activations and more stable object predictions. The visibility cues highlight occluded regions, while cross-modal attention selectively transfers complementary geometric information from less-occluded modalities, resulting in more complete object representations.
Degradation with occlusion is quantified by $\Delta = \text{OL-mAP}(O{=}0) - \text{OL-mAP}(O{=}2)$. PointPainting: 26 pp; MVX-Net: 25 pp; UVTR: 26 pp; DetZero: 23 pp; FAOD-L: 18 pp. Relative to the best baseline (DetZero, 23 pp), FAOD-L reduces the penalty by 5 pp (a relative reduction of ≈22%), yielding a flatter performance–occlusion curve and stronger cross-tier consistency.
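The quoted relative reduction follows directly from the per-method degradation values (a few lines of arithmetic as a sanity check):

```python
# Drop in OL-mAP from the non-occluded to the heavily occluded tier, in pp.
drop = {"PointPainting": 26, "MVX-Net": 25, "UVTR": 26, "DetZero": 23, "FAOD-L": 18}

best_baseline = min(v for k, v in drop.items() if k != "FAOD-L")  # DetZero
absolute_gain = best_baseline - drop["FAOD-L"]                    # gap in pp
relative_reduction = absolute_gain / best_baseline                # fraction of the best baseline's drop
```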
8. Discussion
8.1. Generalization Capability
The proposed FAOD framework demonstrates strong generalization ability when deployed in previously unseen environments and across novel object categories. By leveraging multimodal feature representations and explicit occlusion reasoning, the model is less reliant on dataset-specific appearance patterns, thereby enhancing robustness in diverse urban scenarios. Experimental results across four heterogeneous benchmarks confirm that FAOD can effectively adapt to varying sensor configurations and scene geometries without significant performance degradation.
8.2. Robustness to Occlusion Types
A key strength of our approach lies in its robustness against different types of occlusion. In addition to handling static and partial occlusions, FAOD exhibits stable performance in highly dynamic conditions where occlusions are caused by moving vehicles, pedestrians, or other agents. The explicit visibility reasoning module enables reliable estimation of occlusion levels, while the cross-modal feature completion mechanism recovers object representations even when large portions are visually obscured.
8.3. Computational Efficiency
Practical deployment in autonomous driving requires a balance between accuracy and efficiency. In this work, all runtime and resource measurements are obtained on a single NVIDIA RTX 3090 GPU under the evaluation protocol described in the Experiments section (FP16, batch size 1, with image resolution and LiDAR sweeps as specified for each FAOD-S/M/L configuration). The reference implementation has a compact model size of about 110 MB of learnable parameters, which fits comfortably within the memory budgets of current GPU and automotive SoC platforms.
Under this protocol, the three model scales trace a clear accuracy–latency frontier: FAOD-L targets offline or high-compute settings, FAOD-M offers a favorable trade-off between accuracy and speed, and FAOD-S is explicitly designed as a lightweight variant for resource-constrained or embedded deployments, using lower image resolution, fewer LiDAR sweeps, and narrower backbones while preserving most of the occlusion-stratified gains. Engine-level optimizations such as TensorRT/ONNX conversion, mixed-precision execution, operator fusion, and sparsified (e.g., ROI-triggered) attention further reduce latency and memory footprint. A detailed quantitative evaluation on specific automotive-grade embedded hardware is left for future work.
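As a rough sanity check on the ~110 MB figure above, the storage footprint of the learnable parameters follows directly from the parameter count and the numeric precision. The sketch below uses a hypothetical parameter count (not taken from this work) chosen so that FP32 storage lands near 110 MB, and shows how mixed-precision (FP16) execution halves the parameter budget.

```python
def param_size_mb(num_params: int, bytes_per_param: int) -> float:
    """Storage footprint of the learnable parameters in MB (2**20 bytes)."""
    return num_params * bytes_per_param / 2**20

# Hypothetical count of ~28.8M parameters; at 4 bytes each (FP32) this
# stores to roughly 110 MB, consistent with the reported model size.
num_params = 28_800_000
fp32_mb = param_size_mb(num_params, 4)  # ~110 MB
fp16_mb = param_size_mb(num_params, 2)  # half the FP32 footprint
```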
8.4. Limitations and Future Directions
Despite its effectiveness, the current framework has several limitations. First, the present FAOD implementation assumes fixed and precise extrinsic calibration between cameras, LiDAR, and IR/radar. The geometric bias term and the BEV projections are computed directly from these calibration parameters. In practice, LiDAR misalignment (e.g., due to mechanical tolerances, thermal drift, or mounting vibrations) can distort cross-modal attention and reliability gating, and we do not yet explicitly model or correct such effects. Future variants could incorporate calibration-robust feature encodings, online refinement of extrinsics, or uncertainty-aware fusion that downweights modalities suspected to be misaligned.
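To illustrate why fixed-extrinsics assumptions matter, the numpy sketch below (not the FAOD implementation) applies a small yaw perturbation to the camera-from-LiDAR extrinsic: at 40 m range, even a 0.5-degree misalignment displaces a projected point laterally by roughly 35 cm, enough to shift it into a neighboring BEV cell and corrupt cross-modal attention.

```python
import numpy as np

def to_camera(points_lidar: np.ndarray, T_cam_from_lidar: np.ndarray) -> np.ndarray:
    """Apply a 4x4 extrinsic transform to an Nx3 array of LiDAR points."""
    n = points_lidar.shape[0]
    pts_h = np.hstack([points_lidar, np.ones((n, 1))])
    return (T_cam_from_lidar @ pts_h.T).T[:, :3]

def yaw_perturbation(deg: float) -> np.ndarray:
    """4x4 transform modelling a small yaw misalignment about the z axis."""
    a = np.deg2rad(deg)
    T = np.eye(4)
    T[:2, :2] = [[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]]
    return T

# A point 40 m ahead of the sensor; a 0.5 deg yaw error shifts it
# laterally by about 40 * sin(0.5 deg) ~= 0.35 m.
p = np.array([[40.0, 0.0, 0.0]])
shift = to_camera(p, yaw_perturbation(0.5)) - p
```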
FAOD is evaluated in a single-frame setting and does not yet include explicit temporal modeling. Without temporal aggregation, the method cannot fully exploit motion cues and cross-frame visibility to stabilize occlusion estimates or recover objects that are only intermittently visible under heavy occlusion. A natural extension is to aggregate BEV features over short, pose-compensated temporal windows and apply a lightweight temporal attention module on top of the existing BEV representation, together with temporal consistency losses to regularize completion in heavily occluded scenes.
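A minimal sketch of the pose-compensated aggregation suggested above, under simplifying assumptions not taken from this work: a single-channel BEV grid, pure forward ego motion discretized to the nearest cell, and a fixed exponential-moving-average weight in place of learned temporal attention.

```python
import numpy as np

def ego_compensate(bev_prev: np.ndarray, forward_m: float, cell_m: float = 0.5) -> np.ndarray:
    """Shift the previous frame's BEV map backwards by the ego's forward
    motion (nearest-cell approximation) so static structure stays aligned."""
    cells = int(round(forward_m / cell_m))
    shifted = np.roll(bev_prev, -cells, axis=0)
    if cells > 0:
        shifted[-cells:] = 0.0  # cells newly exposed by ego motion have no history
    return shifted

def temporal_fuse(bev_cur: np.ndarray, bev_prev: np.ndarray,
                  forward_m: float, alpha: float = 0.7) -> np.ndarray:
    """Blend current BEV features with the pose-compensated previous frame
    via a fixed-weight exponential moving average."""
    return alpha * bev_cur + (1.0 - alpha) * ego_compensate(bev_prev, forward_m)
```

A learned temporal attention module would replace the fixed `alpha`, but the pose compensation step would remain the same.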
The current study focuses on passive sensing with a fixed sensor layout. We do not consider active strategies such as view planning, adaptive sensor scheduling, or dynamic exposure control, which may further mitigate severe occlusion and adverse-weather degradation. Exploring these directions, together with temporal reasoning and calibration-aware fusion, is left for future work.
9. Conclusions
In this work, we proposed FAOD, a novel Fusion-Aware Occlusion Detection framework designed to address the persistent challenge of object detection under occlusion in autonomous driving systems. By integrating explicit visibility reasoning with implicit cross-modal feature completion, FAOD is capable of reconstructing object representations even in highly cluttered and visually degraded scenarios. A central innovation of our approach lies in the attention-guided multimodal fusion mechanism, which dynamically aligns heterogeneous features from RGB, LiDAR, and infrared/radar modalities to maximize complementary strengths while mitigating occlusion-induced information loss.
Extensive experiments on four representative autonomous driving benchmarks demonstrate that FAOD achieves state-of-the-art performance across a wide range of occlusion conditions, including partial and full occlusions, static and dynamic obstacles, and diverse sensor configurations. Notably, the framework maintains both high accuracy and computational efficiency, reaching real-time inference rates with a compact model size, which highlights its potential for practical deployment in safety-critical driving environments.
Beyond empirical performance, FAOD contributes a methodological foundation that can generalize to multimodal perception research. Its explicit occlusion modeling, modality-aware feature reconstruction, and attention-driven alignment are not confined to detection; they could also support occlusion-aware tracking, improve the reliability of motion forecasting, and refine occupancy prediction, in addition to aiding cooperative multi-agent perception. In each of these tasks, the same principle applies: reasoning about which signals are missing and selectively completing them with information from other modalities can make the system more robust. More broadly, FAOD exemplifies a practical paradigm for dealing with incomplete multimodal data, offering a transferable approach that extends beyond autonomous driving and remains relevant wherever sensor degradation or partial observability pose challenges.
Looking ahead, future research directions include incorporating temporal reasoning to leverage motion dynamics across video sequences, as well as exploring active perception strategies that adapt sensor utilization to occlusion severity. By advancing towards these goals, FAOD can serve as a stepping stone for the development of next-generation robust, reliable, and intelligent perception systems for autonomous driving and broader real-world applications.