Article

MTC-BEV: Semantic-Guided Temporal and Cross-Modal BEV Feature Fusion for 3D Object Detection

1 Digital Intelligence Industry College, Inner Mongolia University of Science and Technology, Baotou 014000, China
2 Department of Business Information, Beijing Commercial School, Beijing 100000, China
3 School of Information Engineering, Ordos Institute of Technology, Ordos 017000, China
* Author to whom correspondence should be addressed.
World Electr. Veh. J. 2025, 16(9), 493; https://doi.org/10.3390/wevj16090493
Submission received: 18 July 2025 / Revised: 17 August 2025 / Accepted: 20 August 2025 / Published: 1 September 2025
(This article belongs to the Special Issue Electric Vehicle Autonomous Driving Based on Image Recognition)

Abstract

We propose MTC-BEV, a novel multi-modal 3D object detection framework for autonomous driving that achieves robust and efficient perception by combining spatial, temporal, and semantic cues. MTC-BEV integrates image and LiDAR features in the Bird’s-Eye View (BEV) space, where heterogeneous modalities are aligned and fused through the Bidirectional Cross-Modal Attention Fusion (BCAP) module with positional encodings. To model temporal consistency, the Temporal Fusion (TTFusion) module explicitly compensates for ego-motion and incorporates past BEV features. In addition, a segmentation-guided BEV enhancement projects 2D instance masks into BEV space, highlighting semantically informative regions. Experiments on the nuScenes dataset demonstrate that MTC-BEV achieves a nuScenes Detection Score (NDS) of 72.4% at 14.91 FPS, striking a favorable balance between accuracy and efficiency. These results confirm the effectiveness of the proposed design, highlighting the potential of semantic-guided cross-modal and temporal fusion for robust 3D object detection in autonomous driving.

1. Introduction

Recent advancements in artificial intelligence, deep learning architectures, and sensor technologies have significantly accelerated the deployment of autonomous driving systems in real-world applications. Environment perception, a critical component of these systems, enables vehicles to build a detailed, real-time understanding of their dynamic surroundings, supporting downstream tasks such as motion planning, trajectory prediction, and decision-making [1].
Unlike 2D object detection, which focuses on localizing objects in image coordinates, 3D object detection aims to extract spatially rich information, including 3D bounding box locations, object orientation, dimensions, and categorical semantics. This requires integrating multiple sensory modalities, such as LiDAR point clouds, RGB images, millimeter-wave radar, and high-definition maps [2]. Each modality provides unique data: LiDAR offers precise geometric structure and depth; cameras deliver high-resolution texture and semantic details; radar ensures reliability in adverse conditions like fog, rain, and low light.
Single-modality 3D detection frameworks face inherent limitations. For example, LiDAR exhibits sparsity at long ranges, resulting in incomplete shape data for distant or occluded objects, while vision-based methods are sensitive to lighting and lack direct depth information. Multi-modal fusion has thus become a key research area, aiming to combine the strengths of diverse sensors while addressing their weaknesses [3].
Developing a robust and efficient fusion framework remains challenging due to issues such as spatial misalignment between modalities, temporal inconsistencies in dynamic scenes, and modality-specific noise. Additionally, existing fusion methods often underutilize semantic priors from images and temporal context from LiDAR or video sequences. To overcome these challenges, we introduce MTC-BEV, a novel semantic-guided, multi-modal 3D object detection framework that integrates temporal modeling and cross-modal interaction in a unified BEV representation. Its performance is illustrated in Figure 1.
The main contributions of this paper are summarized as follows:
  • We propose a location-aware, cross-modal bidirectional feature fusion mechanism to enhance geometric consistency and fine-grained representation in multi-modal fusion.
  • We design a temporal fusion module that aligns and aggregates historical features, improving spatio-temporal modeling and robustness in dynamic environments.
  • We introduce a segmentation-guided BEV feature enhancement module that projects semantic masks into the BEV space and reweights features to emphasize key object regions, thereby improving detection accuracy.

2. Related Work

2.1. Camera-Based 3D Object Detection

Camera-based 3D object detection has gained growing interest in autonomous driving research due to its low hardware cost, rich semantic data, and scalability for large-scale deployment. In contrast to LiDAR sensors, which rely on costly hardware and generate sparse point clouds, cameras deliver dense, high-resolution RGB images, facilitating detailed perception. Nevertheless, deriving 3D structure from monocular or multi-view images poses significant challenges due to the lack of direct depth information and sensitivity to occlusions and varying illumination.
Early methods, such as Mono3D [4] and Deep3DBox [5], aimed to derive 3D object properties—including position, orientation, and dimensions—directly from single images using geometric priors and 2D-3D constraints. Mono3D employed handcrafted features and 2D proposals with geometric constraints for 3D box estimation, but its effectiveness was limited in distant or cluttered scenes. Deep3DBox leveraged deep convolutional networks and geometric reasoning for pose estimation, though it exhibited reduced accuracy in occluded or complex environments. To address depth ambiguity, Pseudo-LiDAR [6] suggested creating pseudo point clouds from predicted depth maps, enabling the use of point-cloud-based detectors. While this method enhanced detection accuracy by incorporating spatial geometry, its performance depended heavily on the accuracy of monocular depth estimation, which remains a limitation under varying lighting or weather conditions.
Recent advancements have focused on multi-view fusion frameworks. Methods like PETR [7] and MVSFormer [8] utilize multiple synchronized images to enrich spatial context and reduce occlusion. PETR introduced position embedding transformation to improve 3D query alignment, while MVSFormer applied multi-view stereo depth estimation with transformer-based architectures to boost geometric consistency, significantly enhancing robustness over monocular approaches.
Despite these improvements, challenges remain: camera-based methods face difficulties with reliable long-range detection, depth estimation noise, and sensitivity to varying camera intrinsics or weather conditions. Additionally, many techniques require complex geometric calibration or post-processing, potentially impeding real-time deployment in safety-critical systems. Consequently, camera-only 3D detection remains an unresolved issue, driving the development of cross-modal fusion approaches that combine image semantics with LiDAR or radar geometry for improved perception.

2.2. LiDAR-Based 3D Object Detection

LiDAR-based 3D object detection remains a cornerstone of perception systems in autonomous driving, owing to its ability to provide highly accurate depth and spatial structure regardless of lighting conditions. Unlike image sensors, LiDAR captures the geometric layout of a scene through dense point measurements, making it particularly effective for localization and obstacle detection.
VoxelNet [9] was a pioneering work that introduced an end-to-end pipeline by voxelizing raw point clouds and applying 3D convolutions for feature extraction. While effective, its computational cost limited real-time applicability. To improve efficiency, SECOND [10] employed sparse 3D convolutional networks, significantly reducing inference time. However, it suffered from reduced accuracy in detecting small or distant objects due to the loss of fine-grained information.
Subsequent methods such as PointRCNN [11] and PointPillars [12] adopted point-based and pseudo-image strategies, respectively. These approaches offered a practical trade-off between speed and detection quality, but still struggled with challenges like long-range sparsity and occlusions in dense urban environments. PV-RCNN [13] combined voxel-based and point-based features in a hybrid architecture, achieving state-of-the-art accuracy but at the cost of increased model complexity and resource demands.
More recently, models like CenterPoint [14] and VoxelNeXt [15] embraced BEV representations and highly optimized sparse operations to improve scalability and runtime efficiency. These methods demonstrated strong performance in real-time settings, yet limitations remain—particularly in detecting small, partially occluded, or semantically ambiguous objects.
These unresolved issues highlight the inherent limitations of relying solely on geometric data and motivate the ongoing shift toward multi-modal fusion frameworks.

2.3. Multi-Modal Fusion for 3D Object Detection

In complex, dynamic urban driving scenarios, single-modality perception systems often fail to achieve the accuracy and robustness needed for safe autonomous navigation. Challenges such as sparse long-range LiDAR data, image-based depth ambiguity, and sensor-specific noise or degradation in adverse weather conditions drive the development of multi-modal fusion techniques. By combining complementary data from LiDAR, cameras, and radar, these systems enhance the capture of semantic cues, geometric structure, and temporal consistency.
Recently, BEV-based unified fusion frameworks have become a leading approach. These methods convert features from multiple modalities into a shared BEV space, enabling efficient and uniform joint reasoning. A notable example is BEVFusion [16], which uses Swin-Transformer and VoxelNet for visual and geometric feature extraction, followed by a lift–splat–shoot module to project features into BEV space, improving spatial alignment and modality interaction.
Building on this, UniBEVFusion [17] incorporates Radar Depth LSS and a unified feature fusion module, enhancing sensor integration and robustness in challenging conditions like fog, rain, or nighttime. These advancements have shown significant gains in detection accuracy and processing efficiency.
Despite progress, key challenges remain. Accurately modeling cross-modal interactions is complex, particularly with asynchronous or conflicting modality data. Semantic inconsistencies across sensors, especially at ambiguous object boundaries or reflective areas, impede effective fusion. Additionally, achieving real-time inference while preserving high detection performance remains a critical bottleneck for real-world deployment.
To tackle these challenges, we introduce MTC-BEV, a semantic-guided multi-modal fusion framework that integrates spatially aligned features, temporal context, and segmentation priors within a unified BEV representation.

3. Methodology

3.1. Overview

Figure 2 depicts the overall architecture of the proposed multi-modal BEV perception framework, developed to enable robust and efficient 3D object detection by integrating multi-view camera images and LiDAR point clouds in a unified BEV representation.
As illustrated in Figure 2, the image branch starts by encoding multi-view images with a convolutional neural network to extract high-dimensional semantic features. These features are then transformed into the BEV domain using a dedicated View Transformer module, which projects perspective view features into the BEV plane while retaining spatial geometry. A 2D semantic segmentation module generates instance-level masks to enhance semantic perception. These masks are further refined via a deep supervision mechanism to produce attention maps, directing the network to prioritize spatially significant BEV regions.
Concurrently, the LiDAR branch processes raw point clouds through voxelization and 3D encoding, producing high-resolution BEV features that maintain accurate depth and geometric structure.
For effective multi-modal fusion, we introduce a Bidirectional Cross-Modal Attention Fusion (BCAP) module, enabling two-way interaction between image-derived and LiDAR-derived BEV features. Guided by positional encodings, this module ensures spatial alignment and enhances cross-modal information exchange. The resulting fused BEV features form a dense representation of the current frame. To integrate temporal context, a Temporal Fusion (TTFusion) module aligns historical BEV features using motion compensation and aggregates them via attention-based fusion, improving temporal consistency and robustness in dynamic settings. Finally, a Top-K selection strategy identifies the most confident and discriminative BEV feature regions, which are passed to the detection head for final object prediction.

3.2. BCAP Module

Images offer rich semantic information, while LiDAR provides precise spatial structure but limited semantic expressiveness. To leverage the complementary strengths of these modalities, we developed the Bidirectional Cross-Modal Attention Fusion (BCAP) module, as shown in Figure 3. This module boosts the expressiveness and robustness of BEV representations, enhancing the accuracy and stability of 3D object detection in autonomous driving scenarios.
The BCAP module takes as input BEV features from both image and LiDAR modalities. These features are initially processed through three convolutional layers with varying dilation rates to extract multi-scale local contextual information. This approach ensures effective feature capture across different spatial scales, particularly benefiting the detection of small or distant objects.
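A minimal PyTorch sketch of this multi-scale context extraction is given below. The specific dilation rates (1, 2, 3), the channel count, and the 1×1 fusion convolution are assumptions, since the paper does not specify them.

```python
import torch
import torch.nn as nn

class MultiScaleContext(nn.Module):
    """Three parallel 3x3 convolutions with different dilation rates
    (rates 1/2/3 are assumed; the paper does not list them)."""
    def __init__(self, channels: int = 128):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, kernel_size=3, padding=d, dilation=d)
            for d in (1, 2, 3)
        ])
        # Fuse the concatenated branch outputs back to the original channel count.
        self.fuse = nn.Conv2d(3 * channels, channels, kernel_size=1)

    def forward(self, bev_feat: torch.Tensor) -> torch.Tensor:
        # bev_feat: [B, C, H, W] BEV feature map from one modality.
        multi_scale = torch.cat([branch(bev_feat) for branch in self.branches], dim=1)
        return self.fuse(multi_scale)
```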
To maintain spatial structure and facilitate accurate cross-modal alignment, we apply a fixed 2D sine–cosine positional encoding to the BEV feature maps prior to cross-modal attention. This encoding is computed independently along the height ($H$, corresponding to the Y-axis in BEV) and width ($W$, corresponding to the X-axis in BEV) dimensions. Each BEV grid cell is assigned a unique positional embedding based on its absolute position indices $(p_h, p_w)$, where $p_h \in [0, H-1]$ and $p_w \in [0, W-1]$. The encoding is formulated as follows:
For the height dimension:
$$\mathrm{PE}_h(p_h, i) = \begin{cases} \sin\!\left(\dfrac{p_h}{10000^{2i/C}}\right), & \text{if } i \text{ is even} \\ \cos\!\left(\dfrac{p_h}{10000^{2(i-1)/C}}\right), & \text{if } i \text{ is odd} \end{cases}$$
where $p_h \in [0, H-1]$ is the height position index, $i \in [0, C/2-1]$ is the channel index, and $C$ is the total number of channels in the embedding ($C = 128$).
For the width dimension:
$$\mathrm{PE}_w(p_w, i) = \begin{cases} \sin\!\left(\dfrac{p_w}{10000^{2i/C}}\right), & \text{if } i \text{ is even} \\ \cos\!\left(\dfrac{p_w}{10000^{2(i-1)/C}}\right), & \text{if } i \text{ is odd} \end{cases}$$
where $p_w \in [0, W-1]$ is the width position index.
The positional encodings from both dimensions are concatenated along the channel axis to form the final 2D positional vector for each grid cell, resulting in a tensor of shape $[1, C, H, W]$. This encoding is precomputed using a sine–cosine formulation with a scaling factor of 10,000 to control frequency, enabling the model to capture both fine-grained and coarse spatial relationships across dimensions. The encoding is added to the BEV features before the attention computation, enhancing the BCAP module’s ability to model spatial relationships and align image and LiDAR features, particularly for objects at varying scales or under partial occlusions.
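The sketch below illustrates how such a fixed 2D encoding can be precomputed in PyTorch. The exact channel-index bookkeeping and frequency normalization follow the standard Transformer sinusoidal formulation and are assumptions; half of the $C$ channels are devoted to each axis, as in the paper.

```python
import torch

def bev_positional_encoding(H: int, W: int, C: int = 128) -> torch.Tensor:
    """Fixed 2D sine-cosine positional encoding for a BEV grid of size H x W.

    Each spatial axis is encoded with C/2 channels (even channels: sin,
    odd channels: cos, scaling factor 10,000); the two halves are
    concatenated along the channel axis, giving a [1, C, H, W] tensor.
    """
    def encode_axis(length: int, dim: int) -> torch.Tensor:
        pos = torch.arange(length, dtype=torch.float32).unsqueeze(1)   # [length, 1]
        # One frequency per (sin, cos) channel pair.
        k = torch.arange(0, dim, 2, dtype=torch.float32)               # [dim/2]
        div = torch.pow(10000.0, k / float(dim))
        pe = torch.zeros(length, dim)
        pe[:, 0::2] = torch.sin(pos / div)                             # even channels
        pe[:, 1::2] = torch.cos(pos / div)                             # odd channels
        return pe                                                      # [length, dim]

    pe_h = encode_axis(H, C // 2)                                      # [H, C/2]
    pe_w = encode_axis(W, C // 2)                                      # [W, C/2]
    # Broadcast each 1D encoding over the other spatial axis, then concatenate.
    pe_h = pe_h.transpose(0, 1).unsqueeze(2).expand(C // 2, H, W)      # [C/2, H, W]
    pe_w = pe_w.transpose(0, 1).unsqueeze(1).expand(C // 2, H, W)      # [C/2, H, W]
    return torch.cat([pe_h, pe_w], dim=0).unsqueeze(0)                 # [1, C, H, W]
```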
The BCAP module centers on its two symmetrical cross-modal attention paths. In one path, image-derived BEV features act as queries (Q), with LiDAR-derived BEV features serving as keys (K) and values (V). In the reverse path, the roles are swapped. This bidirectional approach facilitates deep, mutual feature interaction between the image and LiDAR modalities, enabling each to compensate for the other’s limitations. Within each path, a cross-modal attention mechanism combines complementary contextual data, boosting the perception capability of the target modality. A multi-head attention structure captures diverse dependency patterns across subspaces, enhanced by residual connections and layer normalization. Residual connections retain original modality-specific features, while normalization ensures training stability and accelerates convergence. The outputs from both paths are concatenated along the channel dimension, yielding fused BEV features for subsequent network components to support downstream 3D object detection tasks.
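The two symmetric attention paths can be sketched as follows. The head count, channel width, and exact placement of layer normalization are assumptions; the multi-scale convolutions and positional encoding described above are assumed to have already been applied to the inputs.

```python
import torch
import torch.nn as nn

class BidirectionalCrossAttention(nn.Module):
    """Sketch of the two symmetric cross-attention paths in BCAP: flattened
    camera and LiDAR BEV features attend to each other, and the two outputs
    are concatenated along the channel dimension."""
    def __init__(self, channels: int = 128, num_heads: int = 8):
        super().__init__()
        self.img_to_lidar = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.lidar_to_img = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm_img = nn.LayerNorm(channels)
        self.norm_lidar = nn.LayerNorm(channels)

    def forward(self, img_bev: torch.Tensor, lidar_bev: torch.Tensor) -> torch.Tensor:
        # img_bev, lidar_bev: [B, C, H, W] -> [B, H*W, C] token sequences.
        B, C, H, W = img_bev.shape
        img = img_bev.flatten(2).transpose(1, 2)
        lidar = lidar_bev.flatten(2).transpose(1, 2)

        # Path 1: image queries attend to LiDAR keys/values (+ residual, norm).
        img_out, _ = self.img_to_lidar(query=img, key=lidar, value=lidar)
        img_out = self.norm_img(img + img_out)

        # Path 2: LiDAR queries attend to image keys/values (+ residual, norm).
        lidar_out, _ = self.lidar_to_img(query=lidar, key=img, value=img)
        lidar_out = self.norm_lidar(lidar + lidar_out)

        # Concatenate along channels and restore the BEV layout: [B, 2C, H, W].
        fused = torch.cat([img_out, lidar_out], dim=-1)
        return fused.transpose(1, 2).reshape(B, 2 * C, H, W)
```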

3.3. TTFusion Module

Single-frame perception is often limited by sensor blind spots, occlusions, noise, and transient dynamic interference. In contrast, multi-frame fusion utilizes temporal continuity, allowing the system to enhance incomplete or degraded spatial data in the current frame using historical observations. This approach proves particularly effective in challenging conditions, such as partial occlusions, distant or blurred targets, and significant lighting changes, where stable semantic and geometric features from prior frames offer critical contextual support, reducing errors from data sparsity and transient occlusions. To tackle these issues, we introduce the Temporal Fusion (TTFusion) module, with its architecture depicted in Figure 4.

3.3.1. Temporal Alignment with Motion Compensation

To mitigate spatial misalignment between consecutive frames caused by ego-motion, the TTFusion module explicitly aligns the previous BEV feature map before temporal fusion. Let $F_{t-1} \in \mathbb{R}^{C \times H \times W}$ denote the BEV feature from the $(t-1)$-th frame, where $H$ and $W$ are the BEV height and width and $C$ is the channel dimension. We first generate a normalized BEV coordinate grid $G \in [-1, 1]^{H \times W \times 2}$ covering the spatial extent of the BEV map. The ego-motion between frames $t-1$ and $t$ is represented by a planar rotation $R_t \in \mathbb{R}^{2 \times 2}$ and translation $t_t \in \mathbb{R}^{2}$, which are extracted from the vehicle pose information. These are combined into a homogeneous transformation matrix:
$$T_t = \begin{bmatrix} R_t & t_t \\ \mathbf{0} & 1 \end{bmatrix} \in \mathbb{R}^{3 \times 3}.$$
The normalized grid is flattened into $HW$ coordinates, transformed by $R_t$ and $t_t$, and then reshaped back to the BEV size:
$$G' = \mathrm{Reshape}\big(G R_t + t_t,\ H, W, 2\big).$$
Finally, the aligned historical feature is obtained via bilinear interpolation using a differentiable warping function:
$$F_{t-1}^{\mathrm{aligned}} = \mathrm{Warp}(F_{t-1}, G').$$
The $\mathrm{Warp}(\cdot)$ operation resamples a feature map by sampling values according to a displacement or transformation grid. In PyTorch, it is implemented with the grid_sample function, which uses bilinear interpolation to smoothly interpolate pixel values rather than taking the nearest neighbor. With align_corners=True, the four corner pixels of the input are aligned with the four corners of the sampling grid, ensuring accurate and consistent alignment at the feature-map boundaries during spatial transformation.
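A minimal sketch of this alignment step is shown below, assuming the planar ego translation has already been normalized to the $[-1, 1]$ BEV grid range (in a full pipeline, rot and trans would be derived from consecutive vehicle poses).

```python
import torch
import torch.nn.functional as F

def align_prev_bev(prev_bev: torch.Tensor, rot: torch.Tensor, trans: torch.Tensor) -> torch.Tensor:
    """Warp the previous BEV feature map into the current frame.

    prev_bev: [B, C, H, W] BEV features from frame t-1.
    rot:      [B, 2, 2] planar ego rotation from frame t-1 to t.
    trans:    [B, 2]    planar ego translation, normalized to [-1, 1] (assumed).
    """
    B, C, H, W = prev_bev.shape
    # Normalized BEV coordinate grid G in [-1, 1], shape [H, W, 2] in (x, y) order.
    ys, xs = torch.meshgrid(
        torch.linspace(-1.0, 1.0, H, device=prev_bev.device),
        torch.linspace(-1.0, 1.0, W, device=prev_bev.device),
        indexing="ij",
    )
    grid = torch.stack([xs, ys], dim=-1).view(1, H * W, 2).expand(B, -1, -1)

    # Apply the planar rigid transform to every grid cell, then reshape back.
    warped = torch.bmm(grid, rot.transpose(1, 2)) + trans.unsqueeze(1)
    warped = warped.view(B, H, W, 2)

    # Bilinear sampling; align_corners=True matches the description in the paper.
    return F.grid_sample(prev_bev, warped, mode="bilinear", align_corners=True)
```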
Compared to implicit temporal modeling methods like BEVFormer, the explicit motion compensation in the TTFusion module maintains spatial correspondence across frames, facilitating precise tracking of fast-moving objects and enhancing robustness in dynamic driving conditions.

3.3.2. Multi-Head Temporal Attention Fusion

After alignment, the transformed historical feature F t 1 aligned and the current-frame feature F t are jointly fed into a multi-head self-attention module to capture long-range dependencies and temporal correlations across frames. The attention mechanism operates as follows:
$$F_{\mathrm{fused}} = \mathrm{Attn}\big(Q = F_t,\ K = V = F_{t-1}^{\mathrm{aligned}}\big) + F_t$$
In this approach, the query Q and key/value pairs K, V are extracted from BEV features, with the current frame acting as the query and the aligned historical frame supplying the keys and values. Multi-head attention allows the TTFusion module to capture diverse temporal cues across subspaces, enhancing its ability to model motion patterns and maintain contextual continuity over time. Residual connections, applied post-attention, retain the original current-frame features, mitigating over-smoothing and information loss during fusion while also improving training stability and convergence.
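A compact sketch of this attention step follows. The number of heads and the flattening of BEV maps into token sequences are implementation assumptions.

```python
import torch
import torch.nn as nn

class TemporalAttentionFusion(nn.Module):
    """Current-frame BEV features query the motion-compensated historical
    features; a residual connection preserves the current-frame content."""
    def __init__(self, channels: int = 128, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, curr_bev: torch.Tensor, prev_bev_aligned: torch.Tensor) -> torch.Tensor:
        # curr_bev, prev_bev_aligned: [B, C, H, W] -> [B, H*W, C] token sequences.
        B, C, H, W = curr_bev.shape
        q = curr_bev.flatten(2).transpose(1, 2)
        kv = prev_bev_aligned.flatten(2).transpose(1, 2)

        # F_fused = Attn(Q = F_t, K = V = F_{t-1}^aligned) + F_t
        out, _ = self.attn(query=q, key=kv, value=kv)
        fused = q + out
        return fused.transpose(1, 2).reshape(B, C, H, W)
```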

3.4. Segmentation-Guided BEV Feature Enhancement

While image features inherently provide rich semantic information, their precision is often substantially reduced during projection into the BEV space via perspective transformations. This reduction is particularly noticeable at object boundaries and in distant areas, where downsampling, occlusions, and geometric distortions lead to diminished spatial focus and ambiguous semantic representation.
Furthermore, most existing multi-modal fusion frameworks depend solely on image backbones or attention mechanisms to implicitly acquire semantic awareness, lacking explicit semantic priors. This approach frequently results in suboptimal feature fusion, marked by redundant or biased data, which adversely affects detection accuracy, especially for small or distant objects.
To overcome these shortcomings, we introduce a 2D segmentation-guided BEV feature enhancement module that integrates semantic segmentation masks as explicit guidance during BEV construction. By projecting multi-view segmentation maps into the BEV space, this module enhances spatial semantic focus and boosts object localization performance, as demonstrated in Figure 5.

3.4.1. Depth-Aware BEV Projection with Segmentation Masks

As shown in Figure 5, we first derive multi-view semantic segmentation results from camera images using a pre-trained 2D segmentation network. To project these segmentation maps accurately into the BEV space, we calculate pixel-level depth information by mapping LiDAR point clouds onto the image plane using the sensor-to-camera transformation matrix (denoted as lidar2img). This procedure generates a dense depth map, supplying depth values for each pixel.
Next, using the estimated depth value $d$ and the inverse of the camera intrinsic matrix $K^{-1}$, we back-project each 2D pixel coordinate $(u, v)$ into the 3D camera coordinate system as follows:
$$p_{\mathrm{cam}} = K^{-1} \cdot \begin{bmatrix} u \cdot d \\ v \cdot d \\ d \end{bmatrix}$$
The resulting 3D points are then transformed into the vehicle (ego) coordinate system using the camera-to-ego transformation matrix $T_{\mathrm{cam2ego}}$:
$$p_{\mathrm{BEV}} = T_{\mathrm{cam2ego}} \cdot p_{\mathrm{cam}}$$
Subsequently, the $x_e$ and $y_e$ components of $p_{\mathrm{BEV}}$ are mapped to grid positions in the BEV feature map of spatial resolution $\mathrm{bev}_w \times \mathrm{bev}_h$ using the following:
$$\mathrm{bev}_x = \frac{x_e + x_{\mathrm{range}}}{2 \cdot x_{\mathrm{range}}} \cdot \mathrm{bev}_w$$
$$\mathrm{bev}_y = \frac{y_{\mathrm{range}} - y_e}{2 \cdot y_{\mathrm{range}}} \cdot \mathrm{bev}_h$$
where $x_{\mathrm{range}}$ and $y_{\mathrm{range}}$ define the spatial boundaries of the BEV coordinate space, set to 51.2 m in our implementation; $x_e$ denotes the coordinate of the point in the forward/backward direction of the ego coordinate system, and $y_e$ the coordinate in the left/right direction.
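The projection pipeline of this subsection can be summarized in the following sketch. The BEV grid size ($\mathrm{bev}_w \times \mathrm{bev}_h$) and the tensor layout of the inputs are assumptions for illustration.

```python
import torch

def project_mask_pixels_to_bev(uv: torch.Tensor, depth: torch.Tensor,
                               K: torch.Tensor, cam2ego: torch.Tensor,
                               bev_w: int = 128, bev_h: int = 128,
                               x_range: float = 51.2, y_range: float = 51.2):
    """Back-project segmented pixels into BEV grid indices.

    uv:      [N, 2] float pixel coordinates (u, v) inside segmentation masks.
    depth:   [N]    per-pixel depth from the projected LiDAR depth map.
    K:       [3, 3] camera intrinsics; cam2ego: [4, 4] camera-to-ego transform.
    """
    # p_cam = K^-1 * [u*d, v*d, d]^T
    uvd = torch.stack([uv[:, 0] * depth, uv[:, 1] * depth, depth], dim=-1)     # [N, 3]
    p_cam = (torch.linalg.inv(K) @ uvd.T).T                                    # [N, 3]

    # Homogeneous transform into the ego (vehicle) coordinate system.
    p_cam_h = torch.cat([p_cam, torch.ones_like(depth).unsqueeze(-1)], dim=-1) # [N, 4]
    p_ego = (cam2ego @ p_cam_h.T).T[:, :3]                                     # [N, 3]

    # Map ego-frame x/y onto integer BEV grid indices within [-range, range].
    x_e, y_e = p_ego[:, 0], p_ego[:, 1]
    bev_x = ((x_e + x_range) / (2 * x_range) * bev_w).long().clamp(0, bev_w - 1)
    bev_y = ((y_range - y_e) / (2 * y_range) * bev_h).long().clamp(0, bev_h - 1)
    return bev_x, bev_y
```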

3.4.2. BEV Mask Generation and Feature Enhancement

Based on the mapped BEV grid coordinates, we construct a binary attention mask M bev , where grid cells corresponding to segmented object regions are assigned a value of 1 and all other cells are set to 0:
$$M_{\mathrm{bev}}\big[\,\mathrm{bev}_y \cdot \mathrm{bev}_w + \mathrm{bev}_x\,\big] = 1$$
This mask is then used to selectively emphasize key spatial regions within the BEV feature map. The final enhanced BEV feature map is computed by modulating the original BEV features with the attention mask as follows:
$$F_{\mathrm{bev}}^{\mathrm{enhanced}} = F_{\mathrm{bev}} \cdot (1 + M_{\mathrm{bev}})$$
This straightforward yet effective approach enhances the weighting of object regions in the BEV representation, allowing the model to prioritize semantically informative areas while reducing background noise.
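Taken together with the projection above, the mask construction and reweighting reduce to a few lines; this sketch assumes a single mask shared across the batch.

```python
import torch

def enhance_bev_features(bev_feat: torch.Tensor, bev_x: torch.Tensor,
                         bev_y: torch.Tensor) -> torch.Tensor:
    """Build the binary BEV attention mask from projected mask pixels and
    reweight the features as F_enhanced = F * (1 + M)."""
    B, C, H, W = bev_feat.shape
    mask = torch.zeros(H, W, device=bev_feat.device)
    mask[bev_y, bev_x] = 1.0                       # cells hit by segmented pixels
    return bev_feat * (1.0 + mask.view(1, 1, H, W))
```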

4. Experiments

4.1. Datasets

The nuScenes dataset [18], released by the Motional team in 2019, is a leading multi-modal benchmark for autonomous driving. It comprises 1000 driving sequences, each 20 s long, collected from urban traffic scenes in Boston and Singapore, encompassing diverse traffic conditions, lighting, and weather scenarios. The dataset offers six high-resolution camera views (1600 × 900), 32-beam LiDAR point clouds at 20 Hz, five millimeter-wave radar streams at 13 Hz, and high-precision GPS/IMU data, with all sensor streams synchronized and spatio-temporally calibrated. For annotations, nuScenes provides over 1.4 million high-quality 3D bounding boxes (annotated at 2 Hz), spanning 23 object categories, including vehicles, pedestrians, and traffic cones. Each annotated instance includes 3D position, size, orientation, and eight additional attributes, such as speed, visibility, and activity status.

4.2. Evaluation Metrics

To thoroughly evaluate our method’s performance, we use the official nuScenes Detection Score (NDS) as the primary evaluation metric. NDS is a composite score that combines multiple detection performance indicators into a single value from 0 to 1, integrating mean Average Precision (mAP) with five True Positive (TP) metrics: mean Translation Error (mATE), mean Scale Error (mASE), mean Orientation Error (mAOE), mean Velocity Error (mAVE), and mean Attribute Error (mAAE). The NDS is calculated as follows:
$$\mathrm{NDS} = \frac{1}{10}\left[\,5 \times \mathrm{mAP} + \sum_{\mathrm{mTP} \in \mathbb{TP}} \big(1 - \min(1, \mathrm{mTP})\big)\right]$$
The mAP assesses detection accuracy based on the 2D center distance between predicted and ground-truth objects in the BEV. We also report Frames Per Second (FPS) as a metric, where higher values reflect improved sensor data processing efficiency, supporting timely detection of obstacles, pedestrians, and surrounding vehicles in real-world scenarios. Real-time inference capability, as indicated by FPS, is essential for autonomous driving systems to quickly detect and respond to dynamic traffic conditions.
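For reference, the NDS combination can be computed as in the short sketch below; the TP error values in the example are illustrative placeholders, not numbers reported in this paper.

```python
def nuscenes_detection_score(mAP: float, tp_errors: dict) -> float:
    """NDS = 1/10 * (5 * mAP + sum over the five TP metrics of (1 - min(1, err))).

    tp_errors holds mATE, mASE, mAOE, mAVE, and mAAE (lower is better).
    """
    tp_score = sum(1.0 - min(1.0, err) for err in tp_errors.values())
    return (5.0 * mAP + tp_score) / 10.0

# Example with this paper's reported mAP and purely illustrative TP errors:
print(nuscenes_detection_score(0.684, {"mATE": 0.27, "mASE": 0.25,
                                       "mAOE": 0.29, "mAVE": 0.24, "mAAE": 0.19}))
```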

4.3. Experimental Details

For image inputs, we adopt a resolution of 192 × 544 pixels. In the image branch, a ResNet-18 backbone, pretrained on ImageNet, serves as the camera feature encoder. The segmentation mask is obtained from a YOLOv8 [19] model pretrained on COCO, with category mappings aligned to nuScenes classes. For efficiency, we employ an FP16-quantized model running in parallel with the image backbone. On an NVIDIA RTX 3090, mask inference takes approximately 3.2 ms.
For the LiDAR branch, we use VoxelNet to process raw point clouds, applying a voxel size of (0.1 m × 0.1 m × 0.2 m), a BEV spatial range of [−51.2 m, 51.2 m] along the X and Y axes, and a vertical range of [−5 m, 3 m]. Training was completed for 20 epochs on a workstation equipped with 4 NVIDIA RTX 3090 GPUs using the AdamW optimizer with a batch size of 4.
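For convenience, the key settings of this section are collected below as a plain dictionary; the field names are illustrative and do not reflect an actual configuration schema.

```python
# Key experimental settings from Section 4.3 (field names are illustrative).
mtc_bev_config = {
    "image": {"resolution": (192, 544), "backbone": "ResNet-18 (ImageNet pretrained)"},
    "segmentation": {"model": "YOLOv8 (COCO pretrained, FP16)", "mask_latency_ms": 3.2},
    "lidar": {
        "encoder": "VoxelNet",
        "voxel_size_m": (0.1, 0.1, 0.2),
        "bev_range_m": {"x": (-51.2, 51.2), "y": (-51.2, 51.2), "z": (-5.0, 3.0)},
    },
    "training": {"epochs": 20, "optimizer": "AdamW", "batch_size": 4, "gpus": "4x RTX 3090"},
}
```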

4.4. Comparison with Other Methods

Table 1 compares MTC-BEV with representative camera-only, LiDAR-only, and LiDAR–camera fusion methods on the nuScenes validation set in terms of NDS, mAP, and FPS. Voxel size refers to the BEV grid resolution for voxel-based methods. Camera-only methods achieve lower accuracy than fusion-based approaches but can run at high FPS. LiDAR-only methods achieve stronger spatial accuracy but lack image semantics.
Among BEV-based fusion methods, BEVFusion and TransFusion explicitly perform feature-level fusion in the BEV space. While TransFusion attains the highest mAP (0.689) among fusion baselines, it runs significantly slower (6.51 FPS). MTC-BEV achieves the best overall NDS (0.724) and competitive mAP (0.684) while maintaining real-time speed (14.91 FPS), outperforming the baseline DAL-Tiny [20] in both accuracy metrics.
Table 1. Comparison on the nuScenes validation set. C denotes using only the camera modality, L denotes using only LiDAR, and L+C denotes LiDAR–camera fusion. CB refers to the pre-trained backbone used for image features. All evaluations are conducted on an RTX 3090 GPU.
| Method | Modality | CB | Voxel Size (m) | NDS | mAP | FPS |
|---|---|---|---|---|---|---|
| PETRv2 [21] | C | R50 | - | 0.456 | 0.349 | 18.90 |
| BEVDepth [22] | C | R50 | - | 0.475 | 0.351 | 15.68 |
| StreamPETR [23] | C | R50 | - | 0.546 | 0.449 | 31.70 |
| CenterPoint [14] | L | - | (0.075, 0.075, 0.2) | 0.648 | 0.564 | 31.21 |
| FSTR-L [24] | L | - | (0.075, 0.075, 0.2) | 0.703 | 0.655 | 8.92 |
| DSVT-Pillar [25] | L | - | (0.3, 0.3, 8.0) | 0.711 | 0.664 | 9.43 |
| UVTR [26] | L+C | R101-DCN | (0.075, 0.075, 0.075) | 0.704 | 0.663 | 1.77 |
| TransFusion [27] | L+C | R50 | (0.075, 0.075, 0.2) | 0.717 | 0.689 | 6.51 |
| BEVFusion (MIT) [16] | L+C | Swin-T | (0.075, 0.075, 0.2) | 0.714 | 0.685 | 9.58 |
| DAL-Tiny [20] | L+C | R18 | (0.1, 0.1, 0.2) | 0.713 | 0.674 | 16.55 |
| CMT [28] | L+C | R50 | (0.1, 0.1, 0.2) | 0.709 | 0.679 | 10.72 |
| Ours | L+C | R18 | (0.1, 0.1, 0.2) | 0.724 | 0.684 | 14.91 |

4.5. Ablation Study

To evaluate the contribution of each module in our proposed MTC-BEV framework, we conduct ablation studies on the nuScenes validation dataset. In each experiment, one or more modules are selectively disabled while keeping the network architecture otherwise unchanged, ensuring fair and interpretable comparisons.
As shown in Table 2, the baseline model without any of the proposed modules achieves an NDS of 0.713 and an mAP of 0.674. Incorporating any single module improves performance over this baseline, with the Bidirectional Cross-Modal Attention Fusion (BCAP) module providing the largest individual gain in NDS. Combining multiple modules progressively enhances both NDS and mAP, and integrating all three modules (BCAP, TTFusion, and Mask Guide) yields the best overall performance, with an NDS of 0.724 and an mAP of 0.684, highlighting the complementary benefits of the proposed components.

4.6. Effect of Historical Frame Number

We conducted ablation experiments to evaluate the impact of varying the number of historical frames in the TTFusion module, as shown in Table 3. The results indicate that increasing the number of historical frames from 1 to 3 yields only marginal changes in NDS and mAP, with optimal performance achieved using 2 frames (NDS of 0.717, mAP of 0.677). This is because dense BEV features, fused from LiDAR and camera modalities, benefit from a moderate amount of temporal information to enhance context modeling. However, excessive historical frames introduce feature redundancy and temporal misalignment in dynamic scenes, potentially degrading performance. Additionally, more frames increase computational overhead, reducing FPS. Given the nuScenes dataset’s annotation frequency of 2 Hz, using 2 historical frames provides the best balance of performance and efficiency for MTC-BEV.

4.7. Qualitative Results

We present qualitative visualization results on the nuScenes validation set. Figure 6 illustrates the BEV visualization, while Figure 7 shows the multi-view visualization. In Figure 6, each example (from left to right) displays ground-truth annotations, predictions from the baseline model (DAL-Tiny), and results from our proposed MTC-BEV framework, emphasizing pose accuracy and small object detection. Figure 7 focuses on comparing long-range detection performance across multi-view camera images, with the BEV visualization providing an auxiliary observation effect. Both figures demonstrate that MTC-BEV enhances detection accuracy and spatial consistency, particularly for long-range targets and vehicle pose estimation.

5. Discussion

5.1. Results Analysis

The quantitative and qualitative results show that the BCAP module provides the most significant performance improvement among the three proposed components. By facilitating deep, complementary feature interaction between camera images and LiDAR modalities, BCAP enhances spatial consistency and semantic richness in the BEV representation.
The TTFusion module contributes notable improvements in temporal stability, particularly in dynamic scenes with occlusions and fast-moving objects. Ablation studies indicate that using two historical frames achieves the optimal balance between accuracy and computational efficiency, as additional frames introduce feature redundancy and potential misalignment in dynamic environments.
The segmentation-guided BEV feature enhancement (Mask Guide) module further improves detection accuracy, especially for small and distant objects. By leveraging explicit semantic priors from multi-view segmentation masks, this module sharpens spatial focus on object regions, reducing false positives in cluttered environments.

5.2. Limitations

Despite its robust performance, the MTC-BEV framework has limitations. In adverse weather conditions, such as heavy rain or fog, the quality of both camera images and LiDAR data may degrade, reducing detection reliability. Furthermore, the segmentation-guided BEV feature enhancement (Mask Guide) module depends on accurate 2D segmentation predictions; errors in this stage can propagate to the BEV space, adversely affecting detection performance.

5.3. Potential Improvements

To address these issues, future work could explore the integration of adaptive feature fusion strategies that dynamically adjust the contribution of each modality based on environmental conditions or sensor quality. Moreover, leveraging radar data as an additional modality may enhance robustness in adverse weather and low-visibility scenarios. These approaches should be validated through comprehensive experiments across multiple datasets to ensure generalization and reliability under diverse real-world conditions. For example, in the Waymo Open Dataset, a higher point cloud density may amplify the benefits of semantic guidance and temporal fusion, while in Argoverse 2, TTFusion could improve stability for dynamic objects.

6. Conclusions

This paper introduces MTC-BEV, a novel multi-modal and temporal fusion framework for 3D object detection in autonomous driving. The framework integrates a position-aware Bidirectional Cross-Modal Attention Fusion (BCAP) mechanism, an explicit motion-compensated Temporal Fusion (TTFusion) module, and a segmentation-guided BEV feature enhancement strategy.
Extensive experiments on the nuScenes dataset confirm that MTC-BEV achieves state-of-the-art performance, striking an optimal balance between detection accuracy and real-time inference speed. Ablation studies substantiate the individual and collective contributions of each module.
Future work will focus on enhancing MTC-BEV by integrating adaptive multi-modal fusion approaches, advanced temporal modeling methods, and robust uncertainty-aware perception components to improve its generalization across varied and challenging driving conditions.

Author Contributions

Conceptualization, Q.X. and J.Z.; methodology, H.B., L.M. and Q.X.; software, Q.X.; validation, Q.X., J.Z. and Z.W.; formal analysis, Q.X.; investigation, Q.X.; resources, H.B. and L.M.; data curation, Q.X.; writing—original draft preparation, Q.X.; writing—review and editing, J.Z. and Z.W.; visualization, Q.X.; supervision, H.B. and L.M.; project administration, H.B.; funding acquisition, L.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Inner Mongolia Natural Science Foundation, grant numbers 2024LHMS06007 and 2022QN06003, the First-Class Discipline Scientific Research Special Project, grant number YLXKZX-NKD-012; and the Ordos Higher Education Institutions Scientific Research Innovation Project, grant number KYON25Z016.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

MTC-BEV: Semantic-Guided Temporal and Cross-Modal BEV Feature Fusion
BCAP: Bidirectional Cross-Modal Attention Fusion
BEV: Bird’s-Eye View
TTFusion: Temporal Fusion
NDS: nuScenes Detection Score
mAP: mean Average Precision
FPS: Frames Per Second

References

  1. Dal’Col, L.; Oliveira, M.; Santos, V. Joint perception and prediction for autonomous driving: A survey. arXiv 2024, arXiv:2412.14088. [Google Scholar] [CrossRef]
  2. Xu, H.; Chen, J.; Meng, S.; Wang, Y.; Chau, L.-P. A survey on occupancy perception for autonomous driving: The information fusion perspective. Inf. Fusion 2025, 114, 102671. [Google Scholar] [CrossRef]
  3. Li, Y.; Xu, L. Panoptic perception for autonomous driving: A survey. arXiv 2024, arXiv:2408.15388. [Google Scholar] [CrossRef]
  4. Chen, X.; Kundu, K.; Zhang, Z.; Ma, H.; Fidler, S.; Urtasun, R. Monocular 3D object detection for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2147–2156. [Google Scholar]
  5. Mousavian, A.; Anguelov, D.; Flynn, J.; Kosecka, J. 3D bounding box estimation using deep learning and geometry. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7074–7082. [Google Scholar]
  6. Wang, Y.; Chao, W.-L.; Garg, D.; Hariharan, B.; Campbell, M.; Weinberger, K.Q. Pseudo-LiDAR from visual depth estimation: Bridging the gap in 3D object detection for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 8445–8453. [Google Scholar]
  7. Liu, Y.; Wang, T.; Zhang, X.; Sun, J. PETR: Position embedding transformation for multi-view 3D object detection. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2022; pp. 531–548. [Google Scholar]
  8. Cao, C.; Ren, X.; Fu, Y. MVSFormer: Multi-view stereo by learning robust image features and temperature-based depth. arXiv 2022, arXiv:2208.02541. [Google Scholar]
  9. Zhou, Y.; Tuzel, O. VoxelNet: End-to-end learning for point cloud based 3D object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 4490–4499. [Google Scholar]
  10. Yan, Y.; Mao, Y.; Li, B. SECOND: Sparsely embedded convolutional detection. Sensors 2018, 18, 3337. [Google Scholar] [CrossRef] [PubMed]
  11. Shi, S.; Wang, X.; Li, H. PointRCNN: 3D object proposal generation and detection from point cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 770–779. [Google Scholar]
  12. Lang, A.H.; Vora, S.; Caesar, H.; Zhou, L.; Yang, J.; Beijbom, O. PointPillars: Fast encoders for object detection from point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 12697–12705. [Google Scholar]
  13. Shi, S.; Guo, C.; Jiang, L.; Wang, Z.; Shi, J.; Wang, X.; Li, H. PV-RCNN: Point-voxel feature set abstraction for 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10529–10538. [Google Scholar]
  14. Yin, T.; Zhou, X.; Krahenbuhl, P. Center-based 3D object detection and tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 11784–11793. [Google Scholar]
  15. Chen, Y.; Liu, J.; Zhang, X.; Qi, X.; Jia, J. VoxelNext: Fully sparse VoxelNet for 3D object detection and tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 21674–21683. [Google Scholar]
  16. Liu, Z.; Tang, H.; Amini, A.; Yang, X.; Mao, H.; Rus, D.; Han, S. BEVFusion: Multi-task multi-sensor fusion with unified bird’s-eye view representation. arXiv 2022, arXiv:2205.13542. [Google Scholar]
  17. Zhao, H.; Guan, R.; Wu, T.; Man, K.L.; Yu, L.; Yue, Y. UniBEVFusion: Unified radar-vision BEV fusion for 3D object detection. arXiv 2024, arXiv:2409.14751. [Google Scholar]
  18. Caesar, H.; Bankiti, V.; Lang, A.H.; Vora, S.; Liong, V.E.; Xu, Q.; Krishnan, A.; Pan, Y.; Baldan, G.; Beijbom, O. nuScenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11621–11631. [Google Scholar]
  19. Jocher, G.; Qiu, J.; Chaurasia, A. Ultralytics YOLO (Version 8.0.0) [Computer Software]. Available online: https://github.com/ultralytics/ultralytics (accessed on 10 December 2023).
  20. Huang, J.; Ye, Y.; Liang, Z.; Shan, Y.; Du, D. Detecting as labeling: Rethinking LiDAR-camera fusion in 3D object detection. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2024; pp. 439–455. [Google Scholar]
  21. Liu, Y.; Yan, J.; Jia, F.; Li, S.; Gao, A.; Wang, T.; Zhang, X. PETRv2: A Unified Framework for 3D Perception from Multi-Camera Images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; IEEE: New York, NY, USA, 2023; pp. 3262–3272. [Google Scholar]
  22. Li, Y.; Ge, Z.; Yu, G.; Yang, J.; Wang, Z.; Shi, Y.; Sun, J.; Li, Z. BevDepth: Acquisition of Reliable Depth for Multi-View 3D Object Detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 1477–1485. [Google Scholar]
  23. Wang, S.; Liu, Y.; Wang, T.; Li, Y.; Zhang, X. Exploring Object-Centric Temporal Modeling for Efficient Multi-View 3D Object Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; IEEE: New York, NY, USA, 2023; pp. 3621–3631. [Google Scholar]
  24. Zhang, D.; Zheng, Z.; Niu, H.; Wang, X.; Liu, X. Fully Sparse Transformer 3-D Detector for LiDAR Point Cloud. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5705212. [Google Scholar] [CrossRef]
  25. Wang, H.; Shi, C.; Shi, S.; Lei, M.; Wang, S.; He, D.; Schiele, B.; Wang, L. DSVT: Dynamic Sparse Voxel Transformer with Rotated Sets. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; IEEE: New York, NY, USA, 2023; pp. 13520–13529. [Google Scholar]
  26. Li, Y.; Chen, Y.; Qi, X.; Li, Z.; Sun, J.; Jia, J. Unifying voxel-based representation with transformer for 3D object detection. Adv. Neural Inf. Process. Syst. 2022, 35, 18442–18455. [Google Scholar]
  27. Bai, X.; Hu, Z.; Zhu, X.; Huang, Q.; Chen, Y.; Fu, H.; Tai, C.-L. TransFusion: Robust LiDAR-camera fusion for 3D object detection with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1090–1099. [Google Scholar]
  28. Yan, J.; Liu, Y.; Sun, J.; Jia, F.; Li, S.; Wang, T.; Zhang, X. Cross modal transformer: Towards fast and robust 3D object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 18268–18278. [Google Scholar]
Figure 1. Performance comparison of different methods in terms of inference speed (FPS) and NDS. MTC-BEV demonstrates a superior trade-off between the two.
Figure 2. The overall architecture of the proposed MTC-BEV framework.
Figure 3. Multi-scale bidirectional cross-attention for multi-modal BEV fusion.
Figure 4. Temporal feature alignment and fusion. Historical BEV features are first aligned with the current frame through an ego-motion transformation (rotation and translation) and then fused with the current features through a multi-head attention mechanism.
Figure 5. Segmentation-guided BEV mapping. Multi-view instance masks are projected into BEV space with depth and camera geometry.
Figure 6. nuScenes 3D object detection validation set prediction visualization. Ground-truth bounding boxes are overlaid on predicted results to highlight differences in pose accuracy. Red circles are used to emphasize areas where there are significant differences.
Figure 7. The visualization results with multiple views are presented, where the six-view images are arranged in their correct spatial positions. Red bounding boxes are used to highlight the differing regions.
Table 2. Ablation study of different components: BCAP, TTFusion, and Mask Guide. ✓ indicates the use of this module.
| BCAP | TTFusion | Mask Guide | NDS | mAP | FPS |
|---|---|---|---|---|---|
|  |  |  | 0.713 | 0.674 | 16.55 |
| ✓ |  |  | 0.718 | 0.677 | 16.05 |
|  | ✓ |  | 0.717 | 0.677 | 15.91 |
|  |  | ✓ | 0.716 | 0.678 | 15.64 |
| ✓ | ✓ |  | 0.720 | 0.679 | 15.45 |
| ✓ |  | ✓ | 0.721 | 0.681 | 15.19 |
|  | ✓ | ✓ | 0.719 | 0.678 | 15.06 |
| ✓ | ✓ | ✓ | 0.724 | 0.684 | 14.91 |
Table 3. The impact of the number of historical frames in temporal fusion on detection performance.
| Number of Historical Frames | NDS | mAP | FPS |
|---|---|---|---|
| 1 | 0.715 | 0.676 | 16.09 |
| 2 | 0.717 | 0.677 | 15.91 |
| 3 | 0.714 | 0.675 | 15.78 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

