Article

Semantic-Enhanced Bidirectional Multimodal Fusion for 3D Object Detection Under Adverse Weather

Software College, Northeastern University, Shenyang 112000, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2026, 16(6), 2943; https://doi.org/10.3390/app16062943
Submission received: 14 February 2026 / Revised: 7 March 2026 / Accepted: 16 March 2026 / Published: 18 March 2026
(This article belongs to the Special Issue Deep Learning-Based Computer Vision Technology and Its Applications)

Abstract

Multimodal fusion methods leveraging various sensors provide strong support for 3D object detection. However, under adverse weather conditions such as rain, fog, snow, and intense glare, complex environmental factors can degrade sensor data quality, leading to increased false positives and missed detections. In addition, sensor modalities (e.g., LiDAR and cameras) inherently vary in information density, and directly fusing them can cause critical details in high-density data to be diluted by low-density data, thereby increasing errors. To address these issues, we propose a Semantic-Enhanced Bidirectional Multimodal Fusion (SeBFusion) framework. By introducing a semantic enhancement mechanism and a bidirectional fusion strategy, SeBFusion mitigates the impact of noise under adverse weather and alleviates information dilution in multimodal fusion. Specifically, SeBFusion first employs a virtual point generation and camera semantic injection module to selectively map image semantic features into 3D space, producing semantically enhanced LiDAR features to compensate for the sparsity of the raw LiDAR point cloud. Then, during cross-modal interaction, we design a bidirectional cross-attention fusion module. This module estimates the confidence of each modality and adaptively reweights the bidirectional information flow, thereby reducing the risk of noise propagation across modalities and improving the robustness and accuracy of 3D object detection in complex environments. Experiments on adverse-weather versions of datasets such as KITTI-C and nuScenes-C validate the effectiveness and superiority of the proposed method. On the nuScenes-C dataset, it achieves 66.2% mAP and 66.6% mAP under fog and snow conditions, respectively.

1. Introduction

With the rapid development of autonomous driving and intelligent perception technologies, multi-sensor-fusion-based 3D object detection methods have become an important research direction in the field of environmental perception [1,2]. By fusing data from LiDAR, cameras, and other sensors, multimodal approaches can exploit the complementary strengths of heterogeneous sensors, thereby improving detection accuracy and robustness in complex scenarios. However, real-world driving often involves adverse weather (e.g., rain, fog, and snow) and extreme lighting conditions, which can degrade sensor data quality and compromise system reliability and safety [3,4]. Therefore, how to reliably leverage complementary multimodal information in adverse weather to improve detection robustness remains a critical open challenge.
To address the degradation in perception performance caused by adverse weather and environmental interference, existing studies have mainly explored two directions. The first line of work focuses on improving the data quality of each modality by suppressing noise at the input or feature level. This includes image restoration techniques (e.g., dehazing, deraining, and low-light enhancement) [5,6,7] as well as robust training strategies (e.g., data augmentation, domain adaptation, and adversarial training) to mitigate distribution shifts between training and testing conditions [8,9], thereby improving the stability of modality-specific features. The second line of work aims to improve multimodal fusion mechanisms to enhance a model’s ability to select and exploit cross-modal information. Typical approaches introduce attention mechanisms, gating units, or dynamic weight assignment strategies to enable finer-grained cross-modal interactions in BEV space or at the point-cloud feature level [10], thereby better balancing the contributions of different modalities during fusion, as shown in Figure 1. However, these fusion strategies remain challenged under complex noise conditions. In particular, in fusion paradigms dominated by a single modality (e.g., the camera or LiDAR modality), when that modality becomes globally or locally unreliable, the fusion network often fails to suppress error propagation effectively, leading to unstable fusion results and degraded performance on downstream tasks.
In addition, differences in feature density between LiDAR and cameras can lead to instability in multimodal fusion in practice. Specifically, LiDAR point clouds are inherently sparse in 3D space, and this sparsity becomes more pronounced as target distance increases. At longer ranges, fewer returns are observed and spatial coverage becomes more discontinuous, making it difficult for point- or voxel-based 3D features to be sufficiently informative. In contrast, cameras provide continuous, dense semantic features on the image plane, offering richer texture and category cues that are especially important for distinguishing small and distant objects. However, most mainstream fusion paradigms require aligning image features to 3D representations (e.g., projecting them onto point clouds, voxels, or BEV grids) to enable cross-modal interaction. During this process, alignment and aggregation are often constrained by the underlying 3D representation. When the point cloud is extremely sparse in distant regions, there are not enough 3D support points to enable reliable alignment and matching, so dense image semantics can only be sampled or aggregated at a limited number of locations. This forces image features to enter the 3D space in a sparse form, causing a large amount of potentially useful semantic information to be underutilized or discarded during fusion.
To address the above issues, we propose a semantic-enhanced bidirectional multimodal fusion framework (SeBFusion) to simultaneously mitigate weather-induced noise and information dilution during multimodal fusion. SeBFusion first leverages reliable image semantics to selectively enrich 3D features, improving the semantic richness and density of point-cloud representations. Next, during cross-modal interaction, it explicitly models the confidence of each modality and suppresses cross-modal noise propagation via adaptive reweighting of bidirectional information flow, thereby obtaining a more robust fused representation, as shown in Figure 1. Specifically, SeBFusion first introduces a virtual point generation and camera semantic injection module. By selectively mapping semantic features from images into 3D space, it generates semantically enhanced LiDAR features while preserving geometric consistency, compensating for the sparsity of raw point clouds in long-range regions. Then, at the cross-modal interaction stage, we propose a bidirectional cross-attention fusion module that performs information exchange in both directions and estimates the confidence of each modality in the current scene, enabling adaptive reweighting of the bidirectional information flow. This strategy effectively reduces the risk that low-quality modalities propagate noise into the other modality under corruptions such as rain, fog, and snow, while also preventing the strong modality from being overly disturbed by the weak modality during fusion, thereby improving detection accuracy and stability in complex environments. We validate the effectiveness of our method on adverse-weather benchmarks such as KITTI-C and nuScenes-C, and achieve consistent improvements under corruptions including fog and snow. The results show that SeBFusion can more reliably exploit multimodal complementary information, reducing false positives and missed detections caused by improper fusion, and demonstrating strong robustness to adverse weather. The main contributions of this paper are summarized as follows:
1. We propose SeBFusion to address adverse-weather noise and multimodal information dilution via semantic enhancement and bidirectional fusion.
2. We design a virtual point generation and camera semantic injection mechanism to selectively map image semantics into 3D space, enhancing point-cloud feature representations.
3. We propose a confidence-aware bidirectional cross-attention fusion module that adaptively regulates bidirectional information flow to suppress noise propagation and improve fusion robustness.
The remainder of this paper is organized as follows: Section 2 reviews related work on multimodal 3D detection under adverse weather. Section 3 presents the proposed SeBFusion framework, including semantic enhancement via virtual point generation and confidence-aware bidirectional cross-attention fusion. Section 4 reports experimental settings and results on KITTI-C and nuScenes-C, with comparisons and ablations. Section 5 concludes the paper and discusses future directions.

2. Related Work

Adverse weather degrades multi-sensor observations in modality-specific and spatially non-uniform ways: camera images suffer from reduced contrast, streaks, and occlusions, while LiDAR point clouds exhibit scattering-induced outliers, attenuation, and echo dropouts that increase sparsity. To mitigate the resulting performance drop in 3D object detection, existing research largely follows two technical routes: improving per-modality data quality by suppressing noise at the input or feature level, and improving multimodal fusion mechanisms to better select and exploit cross-modal information.

2.1. Enhancing Data Quality Under Adverse Weather

A first line of work improves the quality and stability of each modality before (or alongside) detection. On the image side, adverse-weather restoration aims to recover informative textures and semantics via dedicated enhancement networks, including Transformer-based restoration and other high-capacity restoration models [11,12]. Such restoration-then-detection pipelines can strengthen cross-weather visual representations and benefit downstream perception, but they often introduce additional computational overhead and may propagate restoration artifacts to detection when the weather corruption is severe or spatially heterogeneous [13].
On the LiDAR side, many studies model weather-induced degradations and synthesize corruptions to augment training. Representative works simulate fog attenuation on real LiDAR scans and snowfall interference with explicit dropout or outlier modeling, enabling training and evaluation under controlled adverse conditions [14,15]. LISA further provides physics-based light scattering augmentation to generate adverse-weather point-cloud perturbations [16]. Beyond simulation, targeted denoising methods attempt to remove specific noise patterns using intensity and spatio-temporal cues [17]. Meanwhile, reconstruction-oriented approaches address the sparse or hollow far-field issue by filling missing points and densifying distant objects. For example, EquiDetect reconstructs point clouds via equirectangular projection using distance-constrained denoising and object-centric ray generation, targeting adverse-weather noise and long-range sparsity [7].
In addition, robust learning strategies tackle distribution shifts between training and testing environments. Unsupervised domain adaptation (UDA) via self-training and cross-modal adversarial alignment have been explored to reduce domain gaps for 3D detection [18,19]. Weather-based out-of-distribution (OOD) robustness has also been improved with adversarial training schemes tailored to LiDAR perception under unseen adverse weather [9]. Knowledge distillation across weather conditions provides another direction to transfer clean-weather knowledge to corrupted settings, while robust multimodal designs have been investigated in combination with large foundation models [20,21,22]. Robustness benchmarks for 3D detection under common corruptions further standardize evaluation and highlight failure modes under realistic degradations [23,24]. However, denoising and restoration-based methods are often brittle. Errors or artifacts introduced in sequential pre-processing can be amplified by downstream fusion and detection, especially under spatially non-uniform corruptions.

2.2. Improving Multimodal Fusion Mechanisms for Robust 3D Detection

A second line of work focuses on improving fusion to better exploit complementary sensor cues. Early LiDAR–camera fusion methods integrate image semantics into 3D representations, such as MVX-Net and EPNet, which enhance voxel or point features with image-derived cues [25,26]. Later efforts introduce stronger cross-modal interaction modules and Transformer-style fusion, including TransFusion, which refines LiDAR proposals with image features, and BEVFusion, which aligns multiple sensors into a unified BEV representation for end-to-end fusion and multi-task learning [27,28]. Other works emphasize interaction design and sparse multimodal representations, such as DeepInteraction and SparseFusion, to improve information exchange and efficiency [29,30]. Semantic injection strategies (e.g., FusionPainting) further explore injecting image semantics into the 3D detection pipeline via adaptive attention [31]. Beyond camera–LiDAR, radar-involved fusion improves robustness in adverse weather where radar is less affected by fog and rain, albeit with lower spatial resolution and more clutter [3,32].
More recently, sensor-adaptive multimodal fusion under adverse weather has been studied in several works. For example, Bijelic et al. [33] proposed an entropy-driven adaptive deep fusion method that handles asymmetric sensor distortions, improving generalization to previously unseen weather conditions without requiring large amounts of annotated adverse-weather data. In addition, SAMFusion [34] uses depth information to guide cross-modal interaction, facilitating feature transformation and fusion. In contrast, our method exploits depth priors for multi-depth candidate generation and virtual-point semantic lifting: under different depth hypotheses, image semantics are lifted into multiple plausible 3D locations and then aggregated into a semantic BEV representation. Thus, our method focuses on semantic lifting and uncertainty modeling before fusion.
While improved fusion architectures generally boost accuracy on clean data and provide gains under moderate corruption, two core issues remain. When a modality becomes globally or locally unreliable, fusion modules may fail to suppress erroneous information flow, allowing corrupted cues to contaminate the other modality and destabilize the fused representation. Moreover, the density mismatch between dense image semantics and sparse far-field LiDAR support can force image features to enter 3D in an overly sparse manner, leading to underutilization and dilution, while aggressive semantic injection can cause semantic contamination under misalignment.
These limitations provide a clear motivation for explicit, controllable fusion decisions: selectively enriching 3D features with camera semantics only when alignment is trustworthy, and regulating bidirectional cross-modal interaction based on estimated per-modality confidence to prevent noise propagation under adverse weather.

3. Method

3.1. Overall Framework

Adverse weather induces multimodal degradation, as LiDAR echoes become sparser and noisier due to rain droplets, snow particle scattering, and far-field attenuation, while camera images experience reduced contrast and texture loss under fog, rain streaks, and occlusion. More critically, multimodal degradation is spatially non-uniform and modality-dependent: within a single frame, LiDAR and camera reliability can vary by region. As a result, naïve fusion may propagate corrupted cues across modalities and dilute high-density geometric evidence with low-density or noisy features, causing misalignment, false positives, and missed detections. To address these challenges, we propose SeBFusion, a Semantic-Enhanced Bidirectional Multimodal Fusion framework for robust 3D object detection under adverse weather. Figure 2 illustrates the overall architecture of SeBFusion.
To mitigate weather-induced noise and alleviate information dilution in multimodal fusion, SeBFusion introduces a two-stage design. First, it performs semantic enhancement by generating virtual points and selectively injecting camera semantics into BEV space to obtain semantically enhanced LiDAR features, compensating for point-cloud sparsity in long-range regions. Second, it conducts confidence-aware bidirectional fusion: the model explicitly estimates per-modality confidence and adaptively reweights the bidirectional information flow, reducing cross-modal noise propagation and alleviating information dilution under corruptions. SeBFusion makes multimodal fusion more robust under adverse weather by integrating semantic enhancement and confidence-aware bidirectional fusion, which jointly suppress noise propagation and mitigate information dilution during cross-modal interaction.

3.2. Semantic LiDAR Feature Generation Module

The semantic LiDAR feature generation module aims to preserve the geometric stability of LiDAR BEV features while selectively injecting discriminative camera semantics into BEV space, so as to compensate for long-range sparsity and avoid diluting reliable LiDAR geometry with unreliable image cues under adverse weather. The module comprises two components: virtual-point generation with semantic BEV aggregation, and alignment assessment with dynamic injection. The core idea is to leverage camera-derived semantic information to reinforce LiDAR features while restricting the injection process to a guided, alignment-aware regime rather than unconditional concatenation or addition. This controlled, region-aware fusion reduces the propagation of cross-modal misalignments under harsh weather.

3.2.1. Virtual Point Generation

In multimodal 3D detection, translating camera semantic information into a 3D-space representation is a critical step. Directly lifting image features into 3D with a single depth hypothesis is fragile under adverse weather: LiDAR becomes sparse and noisy, image semantics are dense but depth-ambiguous, and sparse 3D support in the far field can cause dense semantics to enter 3D in an overly sparse form, leading to underutilization and information dilution. To address this, we introduce a virtual-point mechanism with multi-depth candidates to improve geometric tolerance and semantic coverage. For each seed pixel, the corresponding semantic feature is lifted to several plausible 3D points at different depths. This yields multiple camera-derived 3D hypotheses with stronger geometric tolerance, enabling more reliable and comprehensive semantic injection into the BEV space even in harsh weather.
First, we run a lightweight 2D detector on each camera view to obtain high-confidence foreground proposals [35], which are then used to select seed pixels for virtual point generation:
$$\mathcal{B}_v = \left\{ \left( \mathrm{box}_n^v, \; \mathrm{score}_n^v, \; \mathrm{class}_n^v \right) \right\}_{n=1}^{N_v},$$
where $\mathcal{B}_v$ denotes the set of 2D detection results in the $v$-th camera view, with $n$ indexing the detections and $N_v$ the number of results kept after non-maximum suppression. The terms $\mathrm{box}_n^v$, $\mathrm{score}_n^v$, and $\mathrm{class}_n^v$ represent the 2D bounding-box parameters, the confidence score, and the class label of the $n$-th detection in view $v$, respectively. To reduce background noise and occlusion artefacts under adverse weather, we restrict seed sampling to high-confidence 2D proposals instead of performing a depth search over the entire image.
Then, we construct the reference depths and the multi-depth candidates by projecting the LiDAR points $\mathbf{x}_i \in P$ onto the camera plane to obtain the reference set:
$$(u_i, v_i, d_i) = \pi_v(\mathbf{x}_i), \qquad \mathcal{R}_v = \left\{ (u_i, v_i, d_i) \right\},$$
where $\mathbf{x}_i$ is the coordinate vector of the $i$-th LiDAR point in the 3D coordinate system, $\pi_v(\cdot)$ is the projection function from 3D points to the $v$-th camera view, $(u_i, v_i)$ are the pixel coordinates of the projection of $\mathbf{x}_i$ in the $v$-th camera image (horizontal axis $u$, vertical axis $v$), and $d_i$ is the depth of $\mathbf{x}_i$ in that camera view. For each seed $(u_m, v_m)$, we perform a KNN search in pixel space over $\mathcal{R}_v$ to obtain $K$ neighboring reference depths as candidates:
$$\mathcal{D}(u_m, v_m) = \left\{ d_m^k \right\}_{k=1}^{K},$$
where $d_m^k$ denotes the $k$-th candidate depth, drawn from the reference set $\mathcal{R}_v$ obtained by projecting LiDAR points onto the $v$-th camera and performing a nearest-neighbor search in pixel space around $(u_m, v_m)$; $K$ is the number of depth candidates kept per seed, and $m$ indexes the seeds. Multiple depth candidates explicitly model depth ambiguity: when depth is unreliable due to rain or fog occlusion or sparse point clouds, multiple candidates increase the probability that the semantic is assigned to the correct BEV cell, reducing the negative impact of incorrect single-depth projections.
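To make the candidate-depth construction concrete, the following minimal PyTorch sketch (illustrative only and not the exact implementation; the tensor shapes and the helper name knn_depth_candidates are assumptions) gathers $K$ reference depths for each seed pixel via a nearest-neighbor search in pixel space:

```python
import torch

def knn_depth_candidates(seed_uv, ref_uv, ref_depth, k=4):
    """seed_uv: (M, 2) seed pixel coordinates; ref_uv: (N, 2) pixel coordinates of
    projected LiDAR points; ref_depth: (N,) their depths. Returns (M, k) candidate depths."""
    # Pairwise pixel-space distances between seeds and projected LiDAR points.
    dist = torch.cdist(seed_uv.float(), ref_uv.float())          # (M, N)
    # Indices of the k closest reference projections for each seed.
    knn_idx = dist.topk(k, dim=1, largest=False).indices         # (M, k)
    return ref_depth[knn_idx]                                    # (M, k)

# Toy example standing in for one camera view.
seed_uv = torch.randint(0, 1000, (16, 2))      # 16 high-confidence seed pixels
ref_uv = torch.randint(0, 1000, (2048, 2))     # projections of the LiDAR sweep
ref_depth = torch.rand(2048) * 60.0            # reference depths in metres
print(knn_depth_candidates(seed_uv, ref_uv, ref_depth).shape)   # torch.Size([16, 4])
```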
Subsequently, for each pair $\left( (u_m, v_m), d_m^k \right)$, we back-project through the camera model and transform to the ego coordinate system as follows:
$$\mathbf{x}_m^k = T_{\mathrm{ego} \leftarrow \mathrm{cam}_v} \, d_m^k \left( K_v \right)^{-1} \bar{U}_m, \qquad \bar{U}_m = \left[ u_m, v_m, 1 \right]^{\top},$$
where $\mathbf{x}_m^k$ denotes the 3D coordinates of the virtual point generated for the $m$-th seed under the $k$-th depth candidate, $T_{\mathrm{ego} \leftarrow \mathrm{cam}_v}$ denotes the rigid-body transformation from the $v$-th camera coordinate system to the ego coordinate system, $K_v$ is the intrinsic matrix of the $v$-th camera, and $\bar{U}_m$ denotes the homogeneous pixel coordinates of the seed pixel $(u_m, v_m)$. Sampling the camera feature map at $U_m$ yields $f_{\mathrm{img}}(U_m)$, which is fused with the depth encoding $\mathrm{PE}(d_m^k)$ to generate depth-aware semantics:
$$f_{vp}^{k} = \mathrm{MLP}\!\left( f_{\mathrm{img}}(U_m) \,\|\, \mathrm{PE}(d_m^k) \right),$$
where $f_{vp}^{k}$ denotes the semantic feature vector of the virtual point generated for the $m$-th seed under the $k$-th depth candidate, $\mathrm{MLP}(\cdot)$ is a multilayer perceptron that maps the concatenated vector to the target feature dimension, $\|$ denotes feature concatenation, and $\mathrm{PE}(\cdot)$ is the positional encoding function for depth values.
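A small sketch of this depth-aware semantic construction is given below; the sinusoidal form of $\mathrm{PE}(\cdot)$, the feature dimensions, and the module name are assumptions made for illustration:

```python
import math
import torch
import torch.nn as nn

class DepthAwareSemantics(nn.Module):
    def __init__(self, img_dim=256, pe_dim=64, out_dim=128):
        super().__init__()
        self.pe_dim = pe_dim
        self.mlp = nn.Sequential(
            nn.Linear(img_dim + pe_dim, out_dim), nn.ReLU(inplace=True),
            nn.Linear(out_dim, out_dim))

    def depth_pe(self, d):
        """Sinusoidal encoding of depth d: (M*K,) -> (M*K, pe_dim)."""
        half = self.pe_dim // 2
        freqs = torch.exp(torch.arange(half, device=d.device) * (-math.log(1e4) / half))
        angles = d.unsqueeze(-1) * freqs                     # (M*K, half)
        return torch.cat([angles.sin(), angles.cos()], dim=-1)

    def forward(self, f_img, d):
        """f_img: (M*K, img_dim) sampled image features; d: (M*K,) candidate depths."""
        # Concatenate the sampled image feature with its depth encoding, then project.
        return self.mlp(torch.cat([f_img, self.depth_pe(d)], dim=-1))

feats = DepthAwareSemantics()(torch.randn(64, 256), torch.rand(64) * 60.0)
print(feats.shape)  # torch.Size([64, 128])
```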
Finally, the virtual points are scattered onto the BEV grid; the BEV cell indices of each virtual point are computed as follows:
$$(i, j) = \left( \left\lfloor \frac{x - x_{\min}}{r_x} \right\rfloor, \; \left\lfloor \frac{y - y_{\min}}{r_y} \right\rfloor \right),$$
where $x$ and $y$ denote the coordinates of a virtual point on the BEV plane; $x_{\min}$ and $y_{\min}$ denote the minimum boundaries of the BEV detection region along the $x$ and $y$ directions (the ROI origin), used to translate continuous coordinates into a grid coordinate system starting from zero; $r_x$ and $r_y$ denote the BEV grid cell sizes in the $x$ and $y$ directions; and $(i, j)$ are the corresponding BEV grid indices. The virtual-point features $\{ f_{vp} \}$ falling into the same cell are aggregated to yield the camera fusion feature $F_C^{\mathrm{fuse}}$.
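The scattering and aggregation onto the BEV grid can be sketched as follows (the grid extent, cell sizes, and the use of mean pooling are illustrative assumptions; the actual configuration may differ):

```python
import torch

def scatter_virtual_points_to_bev(xyz, feats, x_min=-54.0, y_min=-54.0,
                                  rx=0.6, ry=0.6, H=180, W=180):
    """xyz: (P, 3) virtual-point ego coordinates; feats: (P, C) semantic features."""
    # Floor-divide continuous coordinates into BEV cell indices (i along x, j along y).
    i = ((xyz[:, 0] - x_min) / rx).long().clamp(0, W - 1)
    j = ((xyz[:, 1] - y_min) / ry).long().clamp(0, H - 1)
    cell = j * W + i                                         # flat cell index per point
    C = feats.shape[1]
    bev = torch.zeros(H * W, C)
    count = torch.zeros(H * W, 1)
    bev.index_add_(0, cell, feats)                           # sum features per cell
    count.index_add_(0, cell, torch.ones(len(cell), 1))
    bev = bev / count.clamp(min=1.0)                         # mean aggregation per cell
    return bev.view(H, W, C).permute(2, 0, 1)                # (C, H, W) camera fusion feature

F_C_fuse = scatter_virtual_points_to_bev(torch.rand(512, 3) * 100 - 50, torch.randn(512, 128))
print(F_C_fuse.shape)  # torch.Size([128, 180, 180])
```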

3.2.2. Camera Semantic Injection

Under adverse weather, cross-modal misalignment is more likely. We therefore adopt alignment-guided injection: a BEV-level semantic-consistency score is computed to gate camera-to-LiDAR semantic injection, preventing corrupted or misaligned image semantics from contaminating reliable LiDAR geometry. First, as shown below, we align the channel dimensions of the LiDAR features $F_L$ and the camera fusion features $F_C^{\mathrm{fuse}}$:
$$\tilde{F}_L = W_L(F_L), \qquad \tilde{F}_C = W_C(F_C^{\mathrm{fuse}}), \qquad \tilde{F}_L, \tilde{F}_C \in \mathbb{R}^{H \times W \times D},$$
Then, as shown below, we compute alignment scores using cosine similarity and map them to $[0, 1]$ with a Sigmoid:
$$S(x, y) = \sigma\!\left( \alpha \cdot \frac{\left\langle \tilde{F}_L(x, y), \tilde{F}_C(x, y) \right\rangle}{\left\| \tilde{F}_L(x, y) \right\| \left\| \tilde{F}_C(x, y) \right\|} + \beta \right), \qquad S \in [0, 1]^{H \times W \times 1},$$
where $\alpha$ and $\beta$ are learnable parameters and $\sigma(\cdot)$ denotes the Sigmoid function. Next, as shown below, we perform cell-level gating on the camera BEV features to address reliability selection in the spatial dimension, significantly reducing semantic contamination in misaligned regions:
$$\hat{F}_C^{\mathrm{fuse}} = S \cdot F_C^{\mathrm{fuse}},$$
We then concatenate the LiDAR feature $F_L$ with the gated camera feature $\hat{F}_C^{\mathrm{fuse}}$ and perform local convolutional fusion, as shown below:
$$F_{\mathrm{static}} = \mathrm{Conv}\!\left( F_L \,\|\, \hat{F}_C^{\mathrm{fuse}} \right) \in \mathbb{R}^{H \times W \times C},$$
Finally, as shown below, we use a channel attention mechanism to model the contributions of different channels and inject the camera semantic information into the LiDAR features accordingly, yielding the LiDAR semantic feature $F_L^{\mathrm{sem}}$, which combines LiDAR geometric stability with camera semantic discriminability:
$$F_L^{\mathrm{sem}} = \sigma\!\left( \mathrm{Conv}\!\left( \mathrm{GAP}\!\left( F_{\mathrm{static}} \right) \right) \right) \cdot F_{\mathrm{static}},$$
where $\mathrm{GAP}(\cdot)$ denotes channel-wise global average pooling and $\sigma(\cdot)$ denotes the Sigmoid function. When camera semantics are injected into the LiDAR features, the semantic-consistency score controls spatial reliability, while channel-wise modeling reveals which fusion channels contribute most to robust detection under adverse weather. This enables selective semantic enrichment at both the spatial and channel levels, improving robustness by reducing semantic contamination and mitigating information dilution during fusion.
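For clarity, a compact PyTorch sketch of the alignment-gated injection is shown below; the layer names, channel sizes, and the 1 × 1 convolution used for channel attention are assumptions, and the sketch only mirrors the equations above rather than the exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CameraSemanticInjection(nn.Module):
    def __init__(self, lidar_ch=256, cam_ch=128, align_ch=64, out_ch=256):
        super().__init__()
        self.w_l = nn.Conv2d(lidar_ch, align_ch, 1)    # channel alignment for LiDAR
        self.w_c = nn.Conv2d(cam_ch, align_ch, 1)      # channel alignment for camera
        self.alpha = nn.Parameter(torch.tensor(2.0))   # learnable scale of the cosine score
        self.beta = nn.Parameter(torch.tensor(-0.5))   # learnable bias of the cosine score
        self.fuse = nn.Conv2d(lidar_ch + cam_ch, out_ch, 3, padding=1)
        self.ca = nn.Conv2d(out_ch, out_ch, 1)         # channel-attention projection

    def forward(self, F_L, F_C_fuse):
        # BEV-level semantic-consistency score S in [0, 1].
        cos = F.cosine_similarity(self.w_l(F_L), self.w_c(F_C_fuse), dim=1, eps=1e-6)
        S = torch.sigmoid(self.alpha * cos + self.beta).unsqueeze(1)   # (B, 1, H, W)
        # Gate camera semantics cell by cell, then concatenate and locally fuse.
        F_static = self.fuse(torch.cat([F_L, S * F_C_fuse], dim=1))
        # Channel attention from global average pooling re-weights the fused channels.
        w = torch.sigmoid(self.ca(F.adaptive_avg_pool2d(F_static, 1)))
        return w * F_static                                            # F_L^sem

F_L_sem = CameraSemanticInjection()(torch.randn(2, 256, 180, 180), torch.randn(2, 128, 180, 180))
print(F_L_sem.shape)  # torch.Size([2, 256, 180, 180])
```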

3.3. Bidirectional Cross-Attention Fusion Module

The semantic LiDAR feature $F_L^{\mathrm{sem}}$ possesses geometric stability and semantic discriminability, while the camera BEV feature $F_C^{\mathrm{bev}}$, obtained by aggregation from the camera viewpoints, offers cross-view context and dense semantics. Although both are aligned to the same coordinate system in BEV space, two typical difficulties still arise under adverse weather. First, degradation can be modality-dependent and spatially non-uniform, making a modality unreliable in parts of a frame; second, cross-attention during information interaction may propagate noise from the degraded modality to the other, leading to false positives and unstable fusion. To address this, we introduce a confidence-aware bidirectional cross-attention module that first estimates modality confidence and then performs confidence-weighted bidirectional interaction, adaptively allocating cross-modal contributions while suppressing unreliable information flow.
We define the confidence used in the proposed bidirectional fusion module. For each modality $m \in \{\mathrm{LiDAR}, \mathrm{Camera}\}$, the confidence is a sample-level modality reliability scalar $C_m \in [0, 1]$ predicted by an auxiliary head. Rather than relying solely on fusion weights learned implicitly from the detection loss, we explicitly predict the confidence coefficients of the two modalities, $C_{\mathrm{LiDAR}}$ and $C_{\mathrm{Camera}}$, and use them to control the strength of cross-modal information interaction. First, we globally pool the semantic LiDAR feature $F_L^{\mathrm{sem}}$ and the camera BEV feature $F_C^{\mathrm{bev}}$ and map them into a shared embedding space, as shown below:
$$Z_L = \phi_L\!\left( \mathrm{GAP}\!\left( F_L^{\mathrm{sem}} \right) \right), \qquad Z_C = \phi_C\!\left( \mathrm{GAP}\!\left( F_C^{\mathrm{bev}} \right) \right),$$
where $\phi_L(\cdot)$ and $\phi_C(\cdot)$ are lightweight MLP networks that encourage semantic alignment between LiDAR and camera features while pushing apart non-matching samples, and $\mathrm{GAP}(\cdot)$ denotes global average pooling. We adopt the following contrastive loss to constrain embedding consistency:
$$\mathcal{L}_{\mathrm{con}} = -\log \frac{\exp\!\left( \mathrm{sim}(Z_L, Z_C) / \tau \right)}{\sum_{i=1}^{K} \exp\!\left( \mathrm{sim}(Z_L, Z_C^{i}) / \tau \right)},$$
where $\mathrm{sim}(\cdot, \cdot)$ is the cosine similarity, $\tau$ is the temperature parameter, and $K$ is the number of candidates in a batch. When a modality degrades severely, its embedding becomes harder to align with the other modality, providing a more discriminative basis for subsequent confidence estimation. Next, we predict the modality-level confidences $C_{\mathrm{LiDAR}}$ and $C_{\mathrm{Camera}}$ from the embeddings; the concrete procedure is shown below:
$$C_{\mathrm{LiDAR}} = \sigma\!\left( \omega_L^{\top} Z_L + b_L \right), \qquad C_{\mathrm{Camera}} = \sigma\!\left( \omega_C^{\top} Z_C + b_C \right),$$
where $\sigma(\cdot)$ is the Sigmoid function, and $\omega_L$, $\omega_C$ and $b_L$, $b_C$ are the weights and biases of the linear layers. During training, we apply stochastic modality degradation augmentation to generate supervisory signals with labels $y_m \in \{0, 1\}$, where 1 indicates a clean modality and 0 a degraded one, and compute the confidence loss via binary cross-entropy:
$$\mathcal{L}_{\mathrm{conf}} = \mathrm{BCE}\!\left( C_{\mathrm{LiDAR}}, y_{\mathrm{LiDAR}} \right) + \mathrm{BCE}\!\left( C_{\mathrm{Camera}}, y_{\mathrm{Camera}} \right),$$
Ultimately, the reliability loss is:
$$\mathcal{L}_{\mathrm{rel}} = \lambda_{\mathrm{con}} \mathcal{L}_{\mathrm{con}} + \lambda_{\mathrm{conf}} \mathcal{L}_{\mathrm{conf}},$$
where $\lambda_{\mathrm{con}}$ and $\lambda_{\mathrm{conf}}$ are weighting coefficients that balance the influence of $\mathcal{L}_{\mathrm{con}}$ and $\mathcal{L}_{\mathrm{conf}}$ during training.
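A hedged sketch of this reliability objective is given below, assuming in-batch negatives for the contrastive term and per-sample scalar confidences; the function name and tensor shapes are illustrative rather than the exact implementation:

```python
import torch
import torch.nn.functional as F

def reliability_loss(z_l, z_c, c_lidar, c_camera, y_lidar, y_camera,
                     tau=0.07, lam_con=1.0, lam_conf=1.0):
    """z_l, z_c: (B, D) modality embeddings; c_*: (B,) predicted confidences in [0, 1];
    y_*: (B,) binary labels (1 = clean, 0 = degraded)."""
    # Embedding consistency: each LiDAR embedding should match its paired camera
    # embedding against the other camera embeddings in the batch.
    z_l = F.normalize(z_l, dim=-1)
    z_c = F.normalize(z_c, dim=-1)
    logits = z_l @ z_c.t() / tau                              # (B, B) scaled cosine similarities
    targets = torch.arange(z_l.size(0), device=z_l.device)
    l_con = F.cross_entropy(logits, targets)                  # -log softmax of the matched pair
    # Confidence supervision from the stochastic degradation labels.
    l_conf = F.binary_cross_entropy(c_lidar, y_lidar.float()) + \
             F.binary_cross_entropy(c_camera, y_camera.float())
    return lam_con * l_con + lam_conf * l_conf

loss = reliability_loss(torch.randn(8, 128), torch.randn(8, 128),
                        torch.rand(8), torch.rand(8),
                        torch.randint(0, 2, (8,)), torch.randint(0, 2, (8,)))
print(loss.item())
```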
Then, as shown in Figure 3, we construct bidirectional interactions at the BEV-token level; the linear projections of the LiDAR and camera features are:
$$Q_{\mathrm{LiDAR}} = X_{\mathrm{LiDAR}} W_{\mathrm{LiDAR}}^{Q}, \quad K_{\mathrm{LiDAR}} = X_{\mathrm{LiDAR}} W_{\mathrm{LiDAR}}^{K}, \quad V_{\mathrm{LiDAR}} = X_{\mathrm{LiDAR}} W_{\mathrm{LiDAR}}^{V},$$
$$Q_{\mathrm{Camera}} = X_{\mathrm{Camera}} W_{\mathrm{Camera}}^{Q}, \quad K_{\mathrm{Camera}} = X_{\mathrm{Camera}} W_{\mathrm{Camera}}^{K}, \quad V_{\mathrm{Camera}} = X_{\mathrm{Camera}} W_{\mathrm{Camera}}^{V},$$
where $X_{\mathrm{LiDAR}}$ and $X_{\mathrm{Camera}}$ are the flattened BEV token sequences derived from the semantically enhanced LiDAR feature map $F_L^{\mathrm{sem}}$ and the camera BEV feature map $F_C^{\mathrm{bev}}$, respectively.
For the LiDAR → Camera branch, using the camera tokens as queries, we aggregate structural information from the LiDAR tokens and scale the result by $C_{\mathrm{LiDAR}}$:
$$X_{L \rightarrow C} = C_{\mathrm{LiDAR}} \cdot \mathrm{Softmax}\!\left( \frac{Q_{\mathrm{Camera}} \left( K_{\mathrm{LiDAR}} \right)^{\top}}{\sqrt{d_k}} \right) V_{\mathrm{LiDAR}},$$
When the LiDAR modality degrades, $C_{\mathrm{LiDAR}}$ decreases, reducing the injection strength of the LiDAR → Camera branch and preventing unreliable contextual cues from entering the camera branch. For the Camera → LiDAR branch, using the LiDAR tokens as queries, semantic information is aggregated from the camera BEV tokens and scaled by $C_{\mathrm{Camera}}$:
$$X_{C \rightarrow L} = C_{\mathrm{Camera}} \cdot \mathrm{Softmax}\!\left( \frac{Q_{\mathrm{LiDAR}} \left( K_{\mathrm{Camera}} \right)^{\top}}{\sqrt{d_k}} \right) V_{\mathrm{Camera}},$$
When rain or fog degrades imaging contrast and texture, $C_{\mathrm{Camera}}$ suppresses semantic propagation in this direction, reducing false detections caused by camera noise. Finally, as shown below, the outputs of the two branches are summed to obtain the final fused feature:
$$F_{\mathrm{fuse}} = X_{L \rightarrow C} + X_{C \rightarrow L},$$
After the semantic enhancement and bidirectional cross-modal fusion, we obtain the final fused BEV feature $F_{\mathrm{fuse}}$. We then feed $F_{\mathrm{fuse}}$ into the 3D detection head, which predicts the classification scores and regresses the 3D bounding-box parameters from this fused representation, thereby enabling 3D object detection under adverse weather conditions. Since the fused feature integrates LiDAR geometry with camera semantics while suppressing low-confidence modality noise, the detection head can produce more stable and reliable detection results.
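The confidence-weighted bidirectional interaction can be summarized by the following single-head PyTorch sketch; the dimensions, module structure, and the absence of multi-head splitting are simplifying assumptions made for illustration:

```python
import torch
import torch.nn as nn

class BidirectionalCrossAttentionFusion(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.scale = dim ** 0.5
        # Separate Q/K/V projections for each modality (single attention head).
        self.q_l = nn.Linear(dim, dim); self.k_l = nn.Linear(dim, dim); self.v_l = nn.Linear(dim, dim)
        self.q_c = nn.Linear(dim, dim); self.k_c = nn.Linear(dim, dim); self.v_c = nn.Linear(dim, dim)

    @staticmethod
    def attend(q, k, v, scale):
        attn = torch.softmax(q @ k.transpose(-2, -1) / scale, dim=-1)
        return attn @ v

    def forward(self, x_lidar, x_camera, c_lidar, c_camera):
        """x_*: (B, N, dim) flattened BEV tokens of each modality; c_*: (B,) confidences."""
        # LiDAR -> Camera: camera queries aggregate LiDAR structure, scaled by C_LiDAR.
        x_l2c = c_lidar.view(-1, 1, 1) * self.attend(
            self.q_c(x_camera), self.k_l(x_lidar), self.v_l(x_lidar), self.scale)
        # Camera -> LiDAR: LiDAR queries aggregate camera semantics, scaled by C_Camera.
        x_c2l = c_camera.view(-1, 1, 1) * self.attend(
            self.q_l(x_lidar), self.k_c(x_camera), self.v_c(x_camera), self.scale)
        return x_l2c + x_c2l   # fused BEV tokens F_fuse, fed to the detection head

fusion = BidirectionalCrossAttentionFusion()
f_fuse = fusion(torch.randn(2, 400, 256), torch.randn(2, 400, 256), torch.rand(2), torch.rand(2))
print(f_fuse.shape)  # torch.Size([2, 400, 256])
```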

4. Experiments

4.1. Dataset

We evaluate SeBFusion on adverse-weather variants of the KITTI and nuScenes datasets, as well as on the real-world Seeing Through Fog dataset. KITTI is a large-scale autonomous-driving dataset collected mainly under daytime and favorable weather conditions, with 7481 training samples and 7518 test samples. nuScenes comprises 1000 sequences, each about 20 seconds long, covering a range of urban and suburban driving scenes. Following prior work, we apply established weather-simulation techniques to generate fog, rain, snow, and strong illumination conditions [14,15,16,36], forming KITTI-C and nuScenes-C to validate the model’s generalizability. Seeing Through Fog is a real-world multimodal adverse-weather benchmark collected over more than 10,000 km of driving in northern Europe [33]. The dataset provides synchronized multimodal measurements, including RGB cameras, LiDAR, radar, and gated sensors, together with fine-grained weather annotations under clear and adverse conditions. Compared with synthetic corruption benchmarks, Seeing Through Fog contains real sensor degradations caused by fog and precipitation, making it more suitable for evaluating the robustness of multimodal 3D detection methods in practical adverse-weather scenarios. Following a distance-aware evaluation protocol, we report performance in three distance ranges, i.e., 0–30 m, 30–50 m, and 50–80 m, to further analyze the robustness of different fusion methods from near to long range.

4.2. Implementation Details

We implement our model within the MMDetection3D framework and train it on four NVIDIA GeForce RTX 4090 GPUs with a per-GPU batch size of 2 (effective batch size of 8). Training follows a two-stage protocol. We first pretrain the image branch to provide stable semantic cues and the LiDAR branch to learn geometry-aware BEV features. We then train SeBFusion end-to-end with both branches initialized, enabling selective semantic injection and confidence-aware bidirectional fusion under corruptions. The fusion stage uses AdamW with an initial learning rate of 1 × 10−4 and runs for 12 epochs. The parameters $\alpha$ and $\beta$ are initialized to 2.0 and −0.5, respectively. To supervise confidence prediction, we apply stochastic modality degradation during training by randomly corrupting either the LiDAR or the image input, and use the applied corruption state as the binary label for $\mathcal{L}_{\mathrm{conf}}$.
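As an illustration of how this supervision signal can be produced, the following sketch corrupts each modality independently with some probability and records the clean or degraded state as a binary label; the specific corruption operations shown here are simplified stand-ins for the weather simulators cited above, and the function name and probabilities are assumptions:

```python
import random
import torch

def degrade_modalities(points, image, p_corrupt=0.5):
    """points: (N, 4) LiDAR points (x, y, z, intensity); image: (3, H, W) tensor in [0, 1].
    Returns possibly-corrupted inputs and binary labels (1 = clean, 0 = degraded)."""
    y_lidar, y_camera = 1.0, 1.0
    if random.random() < p_corrupt:
        # Simplified LiDAR corruption: random dropout plus coordinate jitter.
        keep = torch.rand(points.shape[0]) > 0.3
        points = points[keep]
        points = points + torch.randn_like(points) * 0.02
        y_lidar = 0.0
    if random.random() < p_corrupt:
        # Simplified camera corruption: contrast reduction plus additive noise.
        image = (image * 0.5 + 0.25 + torch.randn_like(image) * 0.05).clamp(0.0, 1.0)
        y_camera = 0.0
    return points, image, torch.tensor(y_lidar), torch.tensor(y_camera)

pts, img, y_l, y_c = degrade_modalities(torch.rand(1024, 4), torch.rand(3, 256, 512))
print(pts.shape, img.shape, y_l.item(), y_c.item())
```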

4.3. Compare with Other Methods

We compare SeBFusion with representative LiDAR-only, camera-only, and LiDAR + camera fusion 3D detectors on KITTI-C and nuScenes-C under Clean, Fog, Rain, Snow, and Strong Light corruptions, and further evaluate camera–LiDAR fusion methods on SeeingThroughFog for pedestrian detection under real adverse-weather conditions.
Table 1 reports the performance of different detectors on KITTI-C under clean and four adverse conditions. A clear trend is that camera-only methods are the most fragile: even on clean data, SMOKE, PGD, and ImVoxelNet reach only 7.09, 8.10, and 11.49, and their performance collapses once visibility and texture cues are corrupted. For example, ImVoxelNet drops to 1.34 in fog and 0.22 in snow. This is consistent with the fact that monocular 3D detection relies heavily on appearance cues and implicit depth reasoning, both of which become unreliable when contrast is reduced or when rain or snow introduces structured artifacts and occlusions.
LiDAR-only methods are notably more stable than camera-only baselines, but they still suffer substantial degradation under severe corruptions. For example, although PV-RCNN reaches 85.17 on clean data, it falls to 52.21 in rain and 53.12 in snow, reflecting that scattering and weather-induced outliers can significantly reduce point-cloud quality. Simply adopting multimodal fusion does not automatically guarantee robustness either: EPNet and Focals Conv perform well on clean KITTI-C yet drop sharply in fog, suggesting that when degradation is spatially non-uniform, unconditional cross-modal interaction can amplify misalignment, propagate corrupted cues across modalities, and dilute reliable LiDAR geometry with low-quality image features.
In contrast, our method achieves the best results across all KITTI-C conditions, reaching 89.05, 83.60, 74.20, 69.10, and 82.10 under Clean, Fog, Rain, Snow, and Strong Light, respectively. Compared with AWARDistill, the strongest fusion baseline in Table 1, our gains are consistent and become more pronounced in harder weather. The larger improvements in rain and snow indicate that explicitly controlling what to fuse and how strongly to fuse is particularly beneficial when one modality becomes locally unreliable and cross-modal alignment errors are more likely. This is also reflected in corruption sensitivity: relative to clean performance, our drops are smaller than those of AWARDistill for every corruption type, showing that the benefit is not only higher peak accuracy but also improved stability.
We observe the same pattern on nuScenes-C in Table 2. Camera-only methods again degrade severely under adverse weather, especially snow, while LiDAR-only methods are more resilient but still affected; for example, CenterPoint drops from 59.28 on clean data to 43.78 in fog. Fusion-based approaches provide the strongest overall performance, and our method ranks first under all five conditions, achieving 69.90, 66.20, 67.80, 66.60, and 67.30 under Clean, Fog, Rain, Snow, and Strong Light, respectively. Compared with RoboFusion-B, our improvements are consistent across conditions, and we also obtain slightly smaller drops from clean to corrupted settings, indicating better robustness when the corruption type changes. Figure 4 shows qualitative comparisons between our method and other methods. Overall, the results on both KITTI-C and nuScenes-C support our motivation that making fusion decisions explicit, by gating semantic injection with alignment cues and modulating bidirectional interaction with reliability, helps prevent degraded-modality noise from propagating and enables more stable exploitation of multimodal complementarity under adverse weather.
To further verify the effectiveness of SeBFusion on real adverse-weather data, we additionally evaluate it on the SeeingThroughFog dataset for pedestrian detection. The results are reported in Table 3. Our method achieves the best performance across all weather and distance settings. Under snow, SeBFusion obtains 80.64, 74.63, and 38.59 AP in the 0–30 m, 30–50 m, and 50–80 m ranges, respectively, surpassing the strongest baseline by 4.41, 7.79, and 12.76 points. Under fog, it achieves 81.79, 63.37, and 26.51 AP, improving over the best competing method by 2.54, 4.98, and 9.46 points, respectively. Notably, the gains become more pronounced at longer distances, where adverse weather further aggravates point-cloud sparsity and modality imbalance. These results indicate that the proposed semantic enhancement and confidence-aware bidirectional fusion can more effectively exploit complementary camera–LiDAR information and maintain robust pedestrian detection performance in real adverse-weather environments.
Although SeBFusion achieves strong overall robustness under adverse weather, it still exhibits failure cases when modality degradation becomes extremely severe. Figure 5 presents several representative failure cases of SeBFusion under severe adverse-weather corruption. When the degradation simultaneously affects both camera and LiDAR observations, the model may still produce missed detections, false positives, or bounding-box misalignment. In particular, dense fog weakens image contrast and shortens the effective sensing range of LiDAR, making distant pedestrians harder to detect reliably. Rain and snow can introduce strong scattering noise and irregular sparse patterns, which further disturb multimodal correspondence. Similarly, strong illumination may locally impair image semantics and reduce the quality of cross-modal fusion. These results suggest that, while SeBFusion improves robustness by suppressing unreliable information flow, it is still fundamentally constrained by the quality of the available sensor observations in extremely degraded scenarios.

4.4. Ablation Study

We conduct a module-wise ablation study on nuScenes-C using mAP as the evaluation metric, and report results under five conditions: Clean, Fog, Rain, Snow, and Light. Our baseline adopts a simple fusion strategy. We then progressively add MDC, CSI, and BCAF to quantify the contribution of each component. The results are summarized in Table 4.
Adding MDC on top of the baseline yields consistent gains across all conditions, e.g., +5.00 mAP on Fog and +5.50 mAP on Snow, indicating that MDC provides richer candidate information and mitigates fusion uncertainty under corruptions, thereby improving robustness. Further introducing CSI on top of MDC brings additional improvements. This suggests that alignment-guided injection reduces cross-modal mismatch in corrupted scenarios and stabilizes the fusion process, with more pronounced benefits under severe conditions such as Fog and Snow.
Finally, incorporating BCAF into MDC + CSI gives the full model, achieving 69.90, 66.20, 67.80, 66.60 and 67.30 mAP across the five conditions. Relative to the previous setting (MDC + CSI), the improvements are particularly notable on Fog and Snow, which implies that BCAF suppresses negative transfer from low-confidence features while enhancing complementary information exchange via reliability-aware interactions, further boosting robustness under corruptions.

5. Conclusions

Adverse weather simultaneously causes point-cloud sparsity and noise as well as degradation in image contrast and texture, and can further lead to cross-modal misalignment and noise propagation. To address these issues, this paper proposes a robust LiDAR–camera multimodal 3D object detection framework for adverse weather. In the semantic enhancement stage, the proposed method robustly lifts camera semantics into the BEV space via virtual point generation and semantic BEV aggregation, and explicitly characterizes depth uncertainty using multi-depth candidates (MDC), thereby improving geometric tolerance and semantic coverage. Meanwhile, it introduces camera semantic injection (CSI), which gates and selectively injects camera semantics at the BEV grid level according to semantic consistency. Furthermore, in the cross-modal interaction stage, we propose bidirectional cross-attention fusion (BCAF), which explicitly estimates modality-level confidence to modulate the strength of bidirectional information flow, reducing the risk of noise from a degraded modality propagating to the other. Although our method performs well in most cases, its performance may still degrade when both modalities suffer severe degradation simultaneously. Future work may further target more complex real-world degradation mechanisms and distribution shifts, exploring stable fusion with temporal information, finer-grained reliability modeling, and adaptive mechanisms for engineering factors such as extrinsic calibration drift and sensor asynchrony, to enhance reliability in practical deployment.

Author Contributions

Conceptualization, T.J. and J.S.; methodology, T.J. and Y.C.; software, X.F.; validation, Y.C., T.J. and C.G.; formal analysis, T.J.; investigation, Y.C.; resources, T.J.; data curation, X.F.; writing—original draft preparation, T.J.; writing—review and editing, J.S.; visualization, Y.C.; supervision, J.S.; project administration, C.G.; funding acquisition, C.G. All authors have read and agreed to the published version of the manuscript.

Funding

This paper is supported by the National Natural Science Foundation of China (Grant No. 62302086).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Wan, R.; Zhao, T.; Lu, W. Robust 3D sparse object detection through multi-modal framework with dynamic feature encoding and hierarchical object-guided feature enhancement. Inf. Fusion 2026, 126, 103648. [Google Scholar] [CrossRef]
  2. Wang, Z.; Huang, Z.; Gao, Y.; Wang, N.; Liu, S. MV2DFusion: Leveraging Modality-Specific Object Semantics for Multi-Modal 3D Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2026, 48, 609–623. [Google Scholar] [CrossRef] [PubMed]
  3. Huang, X.; Xu, Z.; Wu, H.; Wang, J.; Xia, Q.; Xia, Y.; Li, J.; Gao, K.; Wen, C.; Wang, C. L4DR: LiDAR-4DRadar Fusion for Weather-Robust 3D Object Detection. In Proceedings of the AAAI-25, Sponsored by the Association for the Advancement of Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Walsh, T., Shah, J., Kolter, Z., Eds.; AAAI Press: Menlo Park, CA, USA, 2025; pp. 3806–3814. [Google Scholar] [CrossRef]
  4. Mudavath, T.; Mamidi, A. Object detection challenges: Navigating through varied weather conditions—A comprehensive survey. J. Ambient Intell. Humaniz. Comput. 2025, 16, 443–457. [Google Scholar] [CrossRef]
  5. Chen, Z.; Zhang, Z.; Su, Q.; Yang, K.; Wu, Y.; He, L.; Tang, X. Object detection for autonomous vehicles under adverse weather conditions. Expert Syst. Appl. 2026, 296, 128994. [Google Scholar] [CrossRef]
  6. Xu, J.; Hu, X.; Zhu, L.; Heng, P.A. Unifying Physically-Informed Weather Priors in a Single Model for Image Restoration Across Multiple Adverse Weather Conditions. IEEE Trans. Circuits Syst. Video Technol. 2025, 35, 9575–9591. [Google Scholar] [CrossRef]
  7. Yoon, J.H.; Jung, J.W.; Yoo, S.B. Equirectangular Point Reconstruction for Domain Adaptive Multimodal 3D Object Detection in Adverse Weather Conditions. In Proceedings of the AAAI-25, Sponsored by the Association for the Advancement of Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Walsh, T., Shah, J., Kolter, Z., Eds.; AAAI Press: Menlo Park, CA, USA, 2025; pp. 9553–9561. [Google Scholar] [CrossRef]
  8. Graf, M.; Steinhauser, D.; Vaculín, O.; Brandmeier, T. Impact of Adverse Weather on Road Safety: A Survey of Test Methods for Enhancing Safety of Automated Vehicles and Sensor Robustness in Challenging Environmental Conditions. IEEE Access 2025, 13, 179817–179838. [Google Scholar] [CrossRef]
  9. Batten, B.; Lomuscio, A. Improving Weather-based OOD Generalisation in Lidar-based Object Detection Models via Adversarial Training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Nashville, TN, USA, 11–12 June 2025; pp. 4360–4368. [Google Scholar]
  10. Xing, L.; Ye, J.; Deng, K.; Wu, H.; Ma, H.; Gao, J. Cerberus: Accurate Real-Time Object Detection System Under Adverse Weather Conditions via Multimodal Fusion. IEEE Internet Things J. 2025, 12, 52837–52849. [Google Scholar] [CrossRef]
  11. Valanarasu, J.M.J.; Yasarla, R.; Patel, V.M. TransWeather: Transformer-based Restoration of Images Degraded by Adverse Weather Conditions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, 18–24 June 2022; IEEE: New York, NY, USA, 2022; pp. 2343–2353. [Google Scholar] [CrossRef]
  12. Zamir, S.W.; Arora, A.; Khan, S.; Hayat, M.; Khan, F.S.; Yang, M. Restormer: Efficient Transformer for High-Resolution Image Restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, 18–24 June 2022; IEEE: New York, NY, USA, 2022; pp. 5718–5729. [Google Scholar] [CrossRef]
  13. Cui, Y.; Ren, W.; Cao, X.; Knoll, A. Focal Network for Image Restoration. In Proceedings of the IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, 1–6 October 2023; IEEE: New York, NY, USA, 2023; pp. 12955–12965. [Google Scholar] [CrossRef]
  14. Hahner, M.; Sakaridis, C.; Dai, D.; Gool, L.V. Fog Simulation on Real LiDAR Point Clouds for 3D Object Detection in Adverse Weather. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, 10–17 October 2021; IEEE: New York, NY, USA, 2021; pp. 15263–15272. [Google Scholar] [CrossRef]
  15. Hahner, M.; Sakaridis, C.; Bijelic, M.; Heide, F.; Yu, F.; Dai, D.; Gool, L.V. LiDAR Snowfall Simulation for Robust 3D Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, 18–24 June 2022; IEEE: New York, NY, USA, 2022; pp. 16343–16353. [Google Scholar] [CrossRef]
  16. Kilic, V.; Hegde, D.; Cooper, A.B.; Patel, V.M.; Foster, M.A. LiDAR Light Scattering Augmentation (LISA): Physics-based Simulation of Adverse Weather Conditions for 3D Object Detection. In Proceedings of the 2025 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2025, Hyderabad, India, 6–11 April 2025; IEEE: New York, NY, USA, 2025; pp. 1–5. [Google Scholar] [CrossRef]
  17. Li, B.; Li, J.; Chen, G.; Wu, H.; Huang, K. De-snowing LiDAR Point Clouds With Intensity and Spatial-Temporal Features. In Proceedings of the 2022 International Conference on Robotics and Automation (ICRA), Philadelphia, PA, USA, 23–25 May 2022; pp. 2359–2365. [Google Scholar] [CrossRef]
  18. Yang, J.; Shi, S.; Wang, Z.; Li, H.; Qi, X. ST3D++: Denoised Self-Training for Unsupervised Domain Adaptation on 3D Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 6354–6371. [Google Scholar] [CrossRef] [PubMed]
  19. Chang, G.; Roh, W.; Jang, S.; Lee, D.; Ji, D.; Oh, G.; Park, J.; Kim, J.; Kim, S. CMDA: Cross-Modal and Domain Adversarial Adaptation for LiDAR-Based 3D Object Detection. In Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence, AAAI 2024, Vancouver, BC, Canada, 20–27 February 2024; Wooldridge, M.J., Dy, J.G., Natarajan, S., Eds.; AAAI Press: Menlo Park, CA, USA, 2024; pp. 972–980. [Google Scholar] [CrossRef]
  20. Huang, X.; Wu, H.; Li, X.; Fan, X.; Wen, C.; Wang, C. Sunshine to Rainstorm: Cross-Weather Knowledge Distillation for Robust 3D Object Detection. In Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence, AAAI 2024, Vancouver, BC, Canada, 20–27 February 2024; Wooldridge, M.J., Dy, J.G., Natarajan, S., Eds.; AAAI Press: Menlo Park, CA, USA, 2024; pp. 2409–2416. [Google Scholar] [CrossRef]
  21. Liu, Y.; Zhang, Y.; Lan, R.; Cheng, C.; Wu, Z. AWARDistill: Adaptive and robust 3D object detection in adverse conditions through knowledge distillation. Expert Syst. Appl. 2025, 266, 126032. [Google Scholar] [CrossRef]
  22. Song, Z.; Zhang, G.; Liu, L.; Yang, L.; Xu, S.; Jia, C.; Jia, F.; Wang, L. RoboFusion: Towards Robust Multi-Modal 3D Object Detection via SAM. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, IJCAI 2024, Jeju, Republic of Korea, 3–9 August 2024; pp. 1272–1280. [Google Scholar]
  23. Dong, Y.; Kang, C.; Zhang, J.; Zhu, Z.; Wang, Y.; Yang, X.; Su, H.; Wei, X.; Zhu, J. Benchmarking Robustness of 3D Object Detection to Common Corruptions in Autonomous Driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, 17–24 June 2023; IEEE: New York, NY, USA, 2023; pp. 1022–1032. [Google Scholar] [CrossRef]
  24. Zhang, Y.; Sun, Y.; Li, H.; Zheng, S.; Zhu, C.; Yang, L. Benchmarking the Robustness of Deep Neural Networks to Common Corruptions in Digital Pathology. In Proceedings of the Medical Image Computing and Computer Assisted Intervention—MICCAI 2022—25th International Conference, Singapore, 18–22 September 2022; Proceedings, Part II; Lecture Notes in Computer Science; Wang, L., Dou, Q., Fletcher, P.T., Speidel, S., Li, S., Eds.; Springer: Berlin/Heidelberg, Germany, 2022; Volume 13432, pp. 242–252. [Google Scholar] [CrossRef]
  25. Sindagi, V.A.; Zhou, Y.; Tuzel, O. MVX-Net: Multimodal VoxelNet for 3D Object Detection. In Proceedings of the International Conference on Robotics and Automation, ICRA 2019, Montreal, QC, Canada, 20–24 May 2019; IEEE: New York, NY, USA, 2019; pp. 7276–7282. [Google Scholar] [CrossRef]
  26. Huang, T.; Liu, Z.; Chen, X.; Bai, X. EPNet: Enhancing Point Features with Image Semantics for 3D Object Detection. In Proceedings of the Computer Vision—ECCV 2020—16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XV; Lecture Notes in Computer Science; Vedaldi, A., Bischof, H., Brox, T., Frahm, J., Eds.; Springer: Berlin/Heidelberg, Germany, 2020; Volume 12360, pp. 35–52. [Google Scholar] [CrossRef]
  27. Bai, X.; Hu, Z.; Zhu, X.; Huang, Q.; Chen, Y.; Fu, H.; Tai, C. TransFusion: Robust LiDAR-Camera Fusion for 3D Object Detection with Transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, 18–24 June 2022; IEEE: New York, NY, USA, 2022; pp. 1080–1089. [Google Scholar] [CrossRef]
  28. Liu, Z.; Tang, H.; Amini, A.; Yang, X.; Mao, H.; Rus, D.L.; Han, S. BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird’s-Eye View Representation. In Proceedings of the IEEE International Conference on Robotics and Automation, ICRA 2023, London, UK, 29 May–2 June 2023; IEEE: New York, NY, USA, 2023; pp. 2774–2781. [Google Scholar] [CrossRef]
  29. Yang, Z.; Chen, J.; Miao, Z.; Li, W.; Zhu, X.; Zhang, L. DeepInteraction: 3D Object Detection via Modality Interaction. In Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, 28 November–9 December 2022; Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., Oh, A., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2022. [Google Scholar]
  30. Xie, Y.; Xu, C.; Rakotosaona, M.; Rim, P.; Tombari, F.; Keutzer, K.; Tomizuka, M.; Zhan, W. SparseFusion: Fusing Multi-Modal Sparse Representations for Multi-Sensor 3D Object Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, 1–6 October 2023; IEEE: New York, NY, USA, 2023; pp. 17545–17556. [Google Scholar] [CrossRef]
  31. Xu, S.; Zhou, D.; Fang, J.; Yin, J.; Zhou, B.; Zhang, L. FusionPainting: Multimodal Fusion with Adaptive Attention for 3D Object Detection. In Proceedings of the 24th IEEE International Intelligent Transportation Systems Conference, ITSC 2021, Indianapolis, IN, USA, 19–22 September 2021; IEEE: New York, NY, USA, 2021; pp. 3047–3054. [Google Scholar] [CrossRef]
  32. Nabati, R.; Qi, H. CenterFusion: Center-based Radar and Camera Fusion for 3D Object Detection. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision, WACV 2021, Waikoloa, HI, USA, 3–8 January 2021; IEEE: New York, NY, USA, 2021; pp. 1526–1535. [Google Scholar] [CrossRef]
  33. Bijelic, M.; Gruber, T.; Mannan, F.; Kraus, F.; Ritter, W.; Dietmayer, K.; Heide, F. Seeing Through Fog Without Seeing Fog: Deep Multimodal Sensor Fusion in Unseen Adverse Weather. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, 13–19 June 2020; Computer Vision Foundation/IEEE: New York, NY, USA, 2020; pp. 11679–11689. [Google Scholar] [CrossRef]
  34. Palladin, E.; Dietze, R.; Narayanan, P.; Bijelic, M.; Heide, F. SAMFusion: Sensor-Adaptive Multimodal Fusion for 3D Object Detection in Adverse Weather. arXiv 2025, arXiv:2508.16408. [Google Scholar] [CrossRef]
  35. Tian, Y.; Ye, Q.; Doermann, D. YOLOv12: Attention-Centric Real-Time Object Detectors. arXiv 2025, arXiv:2502.12524. [Google Scholar]
  36. Carballo, A.; Lambert, J.; Monrroy, A.; Wong, D.R.; Narksri, P.; Kitsukawa, Y.; Takeuchi, E.; Kato, S.; Takeda, K. LIBRE: The Multiple 3D LiDAR Dataset. In Proceedings of the IEEE Intelligent Vehicles Symposium, IV 2020, Las Vegas, NV, USA, 19 October–13 November 2020; IEEE: New York, NY, USA, 2020; pp. 1094–1101. [Google Scholar] [CrossRef]
  37. Yang, Z.; Sun, Y.; Liu, S.; Jia, J. 3DSSD: Point-Based 3D Single Stage Object Detector. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, 13–19 June 2020; Computer Vision Foundation/IEEE: New York, NY, USA, 2020; pp. 11037–11045. [Google Scholar] [CrossRef]
  38. Shi, S.; Jiang, L.; Deng, J.; Wang, Z.; Guo, C.; Shi, J.; Wang, X.; Li, H. PV-RCNN++: Point-Voxel Feature Set Abstraction With Local Vector Representation for 3D Object Detection. Int. J. Comput. Vis. 2023, 131, 531–551. [Google Scholar] [CrossRef]
  39. Yan, Y.; Mao, Y.; Li, B. SECOND: Sparsely Embedded Convolutional Detection. Sensors 2018, 18, 3337. [Google Scholar] [CrossRef] [PubMed]
  40. Shih, Y.; Liao, W.; Lin, W.; Wong, S.; Wang, C. Reconstruction and Synthesis of Lidar Point Clouds of Spray. IEEE Robot. Autom. Lett. 2022, 7, 3765–3772. [Google Scholar] [CrossRef]
  41. Lang, A.H.; Vora, S.; Caesar, H.; Zhou, L.; Yang, J.; Beijbom, O. PointPillars: Fast Encoders for Object Detection From Point Clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, 16–20 June 2019; Computer Vision Foundation/IEEE: New York, NY, USA, 2019; pp. 12697–12705. [Google Scholar] [CrossRef]
  42. Shi, S.; Wang, Z.; Shi, J.; Wang, X.; Li, H. From Points to Parts: 3D Object Detection From Point Cloud With Part-Aware and Part-Aggregation Network. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 2647–2664. [Google Scholar] [CrossRef] [PubMed]
  43. Liu, Z.; Wu, Z.; Tóth, R. SMOKE: Single-Stage Monocular 3D Object Detection via Keypoint Estimation. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR Workshops 2020, Seattle, WA, USA, 14–19 June 2020; Computer Vision Foundation/IEEE: New York, NY, USA, 2020; pp. 4289–4298. [Google Scholar] [CrossRef]
  44. Wang, T.; Zhu, X.; Pang, J.; Lin, D. Probabilistic and Geometric Depth: Detecting Objects in Perspective. In Proceedings of the Conference on Robot Learning, London, UK, 8–11 November 2021; Proceedings of Machine Learning Research; Faust, A., Hsu, D., Neumann, G., Eds.; PMLR: New York, NY, USA, 2021; Volume 164, pp. 1475–1485. [Google Scholar]
  45. Rukhovich, D.; Vorontsova, A.; Konushin, A. ImVoxelNet: Image to Voxels Projection for Monocular and Multi-View General-Purpose 3D Object Detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2022, Waikoloa, HI, USA, 3–8 January 2022; IEEE: New York, NY, USA, 2022; pp. 1265–1274. [Google Scholar] [CrossRef]
  46. Chen, Y.; Li, Y.; Zhang, X.; Sun, J.; Jia, J. Focal Sparse Convolutional Networks for 3D Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, 18–24 June 2022; IEEE: New York, NY, USA, 2022; pp. 5418–5427. [Google Scholar] [CrossRef]
  47. Zhu, X.; Ma, Y.; Wang, T.; Xu, Y.; Shi, J.; Lin, D. SSN: Shape Signature Networks for Multi-class Object Detection from Point Clouds. In Proceedings of the Computer Vision—ECCV 2020—16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XXV; Lecture Notes in Computer Science; Vedaldi, A., Bischof, H., Brox, T., Frahm, J., Eds.; Springer: Berlin/Heidelberg, Germany, 2020; Volume 12370, pp. 581–597. [Google Scholar] [CrossRef]
  48. Yin, T.; Zhou, X.; Krähenbühl, P. Center-Based 3D Object Detection and Tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, Virtual, 19–25 June 2021; Computer Vision Foundation/IEEE: New York, NY, USA, 2021; pp. 11784–11793. [Google Scholar] [CrossRef]
  49. Wang, T.; Zhu, X.; Pang, J.; Lin, D. FCOS3D: Fully Convolutional One-Stage Monocular 3D Object Detection. arXiv 2021, arXiv:2104.10956. [Google Scholar]
  50. Wang, Y.; Guizilini, V.; Zhang, T.; Wang, Y.; Zhao, H.; Solomon, J. DETR3D: 3D Object Detection from Multi-view Images via 3D-to-2D Queries. In Proceedings of the Conference on Robot Learning, London, UK, 8–11 November 2021; Proceedings of Machine Learning Research; Faust, A., Hsu, D., Neumann, G., Eds.; PMLR: New York, NY, USA, 2021; Volume 164, pp. 180–191. [Google Scholar]
  51. Li, Z.; Wang, W.; Li, H.; Xie, E.; Sima, C.; Lu, T.; Qiao, Y.; Dai, J. BEVFormer: Learning Bird’s-Eye-View Representation From LiDAR-Camera via Spatiotemporal Transformers. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 2020–2036. [Google Scholar] [CrossRef] [PubMed]
  52. Chen, X.; Zhang, T.; Wang, Y.; Wang, Y.; Zhao, H. FUTR3D: A Unified Sensor Fusion Framework for 3D Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023—Workshops, Vancouver, BC, Canada, 17–24 June 2023; IEEE: New York, NY, USA, 2023; pp. 172–181. [Google Scholar] [CrossRef]
Figure 1. Multimodal fusion frameworks.
Figure 2. Overall framework diagram of SeBFusion.
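Figure 2 outlines the SeBFusion pipeline, in which image semantic features are selectively mapped into 3D space to enrich the sparse LiDAR points before fusion. As an illustration of this camera-semantic-injection idea only, the sketch below projects LiDAR points into the image with a calibration matrix and samples per-pixel semantic features for each point; the function and tensor names (inject_camera_semantics, sem_map, lidar2img) are our own placeholders, not the authors' implementation.

```python
# A minimal, hypothetical sketch of camera semantic injection: LiDAR points are
# projected into the image plane and per-pixel semantic features are sampled and
# appended to each point. All names, shapes, and the projection convention
# (a single 4x4 LiDAR-to-image matrix) are illustrative assumptions, not the
# authors' released implementation.
import torch
import torch.nn.functional as F


def inject_camera_semantics(points, point_feats, sem_map, lidar2img):
    """points: (N, 3) LiDAR xyz; point_feats: (N, C_pt) per-point features;
    sem_map: (C_img, H, W) image semantic features; lidar2img: (4, 4) projection."""
    n = points.shape[0]
    homo = torch.cat([points, points.new_ones(n, 1)], dim=1)   # (N, 4) homogeneous
    cam = homo @ lidar2img.T                                    # (N, 4) in image space
    depth = cam[:, 2].clamp(min=1e-5)
    uv = cam[:, :2] / depth.unsqueeze(1)                        # (N, 2) pixel coords

    h, w = sem_map.shape[1:]
    # Normalize pixel coordinates to [-1, 1] for grid_sample.
    grid = torch.stack([uv[:, 0] / (w - 1) * 2 - 1,
                        uv[:, 1] / (h - 1) * 2 - 1], dim=1)     # (N, 2)
    valid = (cam[:, 2] > 0) & (grid.abs() <= 1).all(dim=1)      # in front and in frame

    sampled = F.grid_sample(sem_map.unsqueeze(0),               # (1, C_img, H, W)
                            grid.view(1, 1, n, 2),
                            align_corners=True).view(sem_map.shape[0], n).T  # (N, C_img)
    sampled = sampled * valid.unsqueeze(1).float()              # zero out invalid points
    return torch.cat([point_feats, sampled], dim=1)             # (N, C_pt + C_img)
```

Points that fall outside the image or behind the camera receive zero semantic features, so only points with valid image evidence are enhanced; the multi-depth virtual-point generation step itself is omitted from this sketch.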
Figure 3. Bidirectional Cross Attention Fusion of Camera and LiDAR Features.
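Figure 3 depicts the bidirectional cross-attention fusion module, in which each modality attends to the other and per-modality confidence estimates reweight the two information flows. The following PyTorch sketch shows one plausible realization of that idea; the class name, the confidence heads, and the final pooling are hypothetical simplifications and are not taken from the authors' code.

```python
# A minimal, hypothetical sketch of bidirectional cross-attention fusion with
# confidence-based reweighting. Names, shapes, and the pooling strategy are
# illustrative placeholders, not the paper's released implementation.
import torch
import torch.nn as nn


class BidirectionalCrossAttentionFusion(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.cam_to_lidar = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.lidar_to_cam = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Confidence heads estimate how trustworthy each modality currently is.
        self.cam_conf = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())
        self.lidar_conf = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())
        self.out = nn.Linear(2 * dim, dim)

    def forward(self, lidar_tokens, cam_tokens):
        """lidar_tokens: (B, N_l, dim); cam_tokens: (B, N_c, dim)."""
        # Each modality queries the other: the two directions of information flow.
        lidar_enh, _ = self.cam_to_lidar(lidar_tokens, cam_tokens, cam_tokens)
        cam_enh, _ = self.lidar_to_cam(cam_tokens, lidar_tokens, lidar_tokens)

        # Scalar confidences gate each direction, so a degraded modality
        # contributes less to the opposite branch.
        w_cam = self.cam_conf(cam_tokens.mean(dim=1, keepdim=True))       # (B, 1, 1)
        w_lidar = self.lidar_conf(lidar_tokens.mean(dim=1, keepdim=True))

        lidar_fused = lidar_tokens + w_cam * lidar_enh
        cam_fused = cam_tokens + w_lidar * cam_enh
        # Pool both streams and combine into a single fused feature per sample.
        fused = torch.cat([lidar_fused.mean(dim=1), cam_fused.mean(dim=1)], dim=-1)
        return self.out(fused)                                            # (B, dim)
```

Gating the camera-to-LiDAR flow by the camera confidence (and vice versa) is one simple way to realize the adaptive reweighting the figure suggests: under heavy rain or glare, a low camera confidence shrinks the camera's contribution instead of letting its noise propagate into the LiDAR branch.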
Figure 4. Visualization results of different methods.
Figure 5. Visualization results of detection in severe modal degradation.
Table 1. Performance results of different methods on the KITTI-C dataset.

Methods            Modality          Clean   Fog     Rain    Snow    Light
3DSSD [37]         LiDAR             80.18   46.26   28.31   28.33   25.14
PV-RCNN [38]       LiDAR             85.17   79.22   52.21   53.12   80.55
SECOND [39]        LiDAR             81.15   74.63   52.12   55.81   77.21
PointRCNN [40]     LiDAR             82.23   77.15   51.02   52.64   62.08
PointPillars [41]  LiDAR             78.41   64.28   36.18   36.47   62.28
Part-A2 [42]       LiDAR             82.45   71.61   41.63   42.70   76.45
SMOKE [43]         Camera             7.09    5.63    3.94    2.47    6.00
PGD [44]           Camera             8.10    0.87    3.06    0.63    7.07
ImVoxelNet [45]    Camera            11.49    1.34    1.24    0.22   10.08
EPNets [26]        LiDAR + Camera    85.13   44.16   40.12   34.71   69.12
Focals Conv [46]   LiDAR + Camera    84.16   44.15   40.12   35.23   80.75
AWARDistill [21]   LiDAR + Camera    88.62   82.74   70.92   65.74   80.19
Ours               LiDAR + Camera    89.05   83.60   74.20   69.10   82.10
Table 2. Performance results (mAP, %) of different methods on the nuScenes-C dataset.

Methods            Modality          Clean   Fog     Rain    Snow    Light
PointPillars [41]  LiDAR             27.69   24.49   27.71   27.57   23.71
SSN [47]           LiDAR             46.65   41.64   46.50   46.38   40.28
CenterPoint [48]   LiDAR             59.28   43.78   56.08   55.90   54.20
FCOS3D [49]        Camera            23.86   13.53   13.00    2.01   17.20
PGD [44]           Camera            23.19   12.83   13.51    2.30   22.77
DETR3D [50]        Camera            34.71   27.89   20.39    5.08   34.66
BEVFormer [51]     Camera            41.65   32.76   24.97    5.73   41.68
FUTR3D [52]        LiDAR + Camera    64.17   53.19   58.40   52.73   57.70
TransFusion [27]   LiDAR + Camera    66.38   53.67   65.35   63.30   55.14
BEVFusion [28]     LiDAR + Camera    68.45   54.10   66.13   62.84   64.42
RoboFusion-B [22]  LiDAR + Camera    69.40   65.54   67.01   66.07   66.71
AWARDistill [21]   LiDAR + Camera    68.11   60.11   66.93   66.03   62.92
Ours               LiDAR + Camera    69.90   66.20   67.80   66.60   67.30
Table 3. Pedestrian detection AP on the Seeing Through Fog dataset under snow and fog at different distance ranges.

Method             Modality                    Snow                          Fog
                                      0–30 m  30–50 m  50–80 m    0–30 m  30–50 m  50–80 m
MVXNet [25]        Camera + LiDAR     76.23   59.73    25.83      73.89   50.98    16.73
BEVFusion [28]     Camera + LiDAR     71.12   62.61    10.01      76.24   58.04     8.61
SparseFusion [30]  Camera + LiDAR     73.33   66.84    19.87      79.25   58.39    17.05
Ours               Camera + LiDAR     80.64   74.63    38.59      81.79   63.37    26.51
Table 4. Ablation results (mAP, %) on nuScenes-C. MDC denotes multi-depth candidate virtual-point generation; CSI denotes camera semantic injection; BCAF denotes bidirectional cross-attention fusion.

Setting        MDC   CSI   BCAF   Clean   Fog     Rain    Snow    Light
Baseline       –     –     –      62.15   53.08   58.11   51.96   57.52
+MDC           ✓     –     –      64.95   58.08   61.61   57.46   61.12
+CSI           ✓     ✓     –      67.25   62.08   64.61   61.96   64.12
+BCAF (Full)   ✓     ✓     ✓      69.90   66.20   67.80   66.60   67.30