Article

Asymmetric Explicit Synergy for Multi-Modal 3D Gaussian Pre-Training in Autonomous Driving

1 Hefei Institutes of Physical Science, Chinese Academy of Sciences, Hefei 230031, China
2 Science Island Branch, Graduate School of USTC, Hefei 230026, China
3 School of Computer Science and Artificial Intelligence, ChaoHu University, Hefei 238000, China
4 Southwest Institute of Technical Physics, Chengdu 610041, China
5 School of Electrical and Information Engineering, Changzhou Institute of Technology, Changzhou 213002, China
* Author to whom correspondence should be addressed.
World Electr. Veh. J. 2026, 17(2), 102; https://doi.org/10.3390/wevj17020102
Submission received: 20 January 2026 / Revised: 11 February 2026 / Accepted: 17 February 2026 / Published: 19 February 2026
(This article belongs to the Section Automated and Connected Vehicles)

Abstract

Generative pre-training via neural rendering has become a cornerstone for scaling 3D perception in autonomous driving. However, prevalent approaches relying on implicit Neural Radiance Fields (NeRFs) face two fundamental limitations: the shape-radiance ambiguity inherent in vision-centric optimization and the prohibitive computational overhead of volumetric ray marching. To address these challenges, we propose AES-Gaussian, a novel multi-modal pre-training framework grounded in the efficient 3D Gaussian Splatting (3DGS) representation. Diverging from symmetric fusion paradigms, our core innovation is an Asymmetric Encoder architecture that couples a deep semantic vision backbone with a lightweight, physics-aware LiDAR branch. In this framework, LiDAR data serve not merely for semantic extraction, but as sparse physical anchors. By employing a novel Explicit Feature Synergy mechanism, we directly inject raw LiDAR intensity and depth priors into the Gaussian decoding process, thereby rigidly constraining scene geometry in open-world environments. Extensive empirical validation on the nuScenes dataset demonstrates the superiority of our approach. AES-Gaussian achieves state-of-the-art transfer performance, yielding a substantial 7.0% improvement in NDS for 3D Object Detection and a 4.8% mIoU gain in 3D semantic occupancy prediction compared to baselines. Notably, our method reduces geometric reconstruction error by over 50% while significantly improving training and inference efficiency, attributed to the streamlined asymmetric design and rapid Gaussian rasterization. Ultimately, by enhancing both perception accuracy and system efficiency, this work contributes to the development of safer and more reliable autonomous driving systems.

1. Introduction

Accurate perception of the 3D environment constitutes a cornerstone for the safety and reliability of autonomous driving systems. In recent years, the field has witnessed a paradigm shift from 2D bounding box detection to holistic 3D scene understanding, encompassing tasks such as 3D Object Detection [1,2] and 3D Semantic Occupancy Prediction [3,4]. While fully supervised learning has achieved remarkable success, the dependence on massive quantities of annotated data remains a significant bottleneck. Consequently, self-supervised pre-training, which facilitates the learning of robust representations from unlabeled raw data, has emerged as a promising avenue to enhance both data efficiency and model generalization.
Current pre-training paradigms are predominantly categorized into contrastive learning [5] and generative modeling [6]. Within the latter, rendering-based methods such as UniPAD [7] and VisionPAD [8] have gained substantial traction by utilizing Neural Radiance Fields to reconstruct 3D scenes from 2D imagery. Despite their advancements, these vision-centric approaches struggle with a fundamental geometric ambiguity challenge. Lacking explicit depth constraints, the optimization process is prone to the shape-radiance ambiguity trap [9], where the model infers spurious geometry merely to satisfy RGB consistency. Furthermore, the volumetric ray marching inherent to NeRFs [10] necessitates sampling hundreds of points along each ray, resulting in prohibitive computational costs and high latency during inference.
To incorporate precise geometric cues, multi-modal fusion [11,12] offers a natural solution. However, existing fusion frameworks typically suffer from computational redundancy. These methods often employ symmetric dual-branch architectures, where the LiDAR branch relies on computationally intensive 3D sparse convolutions like VoxelNet [13]. This heavy design hinders the deployment of such models on resource-constrained onboard chips, creating a difficult trade-off between geometric accuracy and system efficiency.
In this paper, we present AES-Gaussian, a novel pre-training framework designed to resolve these conflicts by integrating an asymmetric architecture, Explicit Feature Synergy, and discrete 3D scene representation. Departing from implicit NeRFs, we adopt 3D Gaussian Splatting (3DGS) [14] as our core representation, leveraging its discrete nature for explicit geometric modeling and real-time rasterization capabilities.
The fundamental premise of our framework is that LiDAR data should serve primarily as a rigid geometric constraint rather than a semantic cue during the pre-training phase. Guided by this assumption, we propose an Asymmetric Encoder Design where a deep vision branch is responsible for extracting rich semantic context, whereas a lightweight LiDAR branch focuses on providing precise spatial guidance to calibrate the geometric structure. To effectively synthesize these modalities, we introduce an Explicit Feature Synergy mechanism. This module utilizes LiDAR intensity and relative depth to explicitly initialize and constrain the 3D Gaussians, effectively mitigating the geometric ambiguity prevalent in vision-only approaches.
By leveraging these physical constraints, our method produces high-fidelity geometry with minimal artifacts. When fine-tuned on downstream tasks, AES-Gaussian demonstrates superior transferability on the nuScenes dataset, validating its effectiveness for both sparse object detection and dense occupancy prediction.
The main contributions of this work are summarized as follows:
  • We propose AES-Gaussian, the first multi-modal pre-training framework that synergizes 3D Gaussian Splatting with an Asymmetric Encoder architecture, successfully balancing high geometric fidelity with computational efficiency.
  • We propose an Explicit Feature Synergy mechanism that utilizes LiDAR intensity as a physical anchor to resolve the shape-radiance ambiguity, reducing the depth reconstruction error by over 50% compared to vision-only baselines.
  • We design a Decoupled Gaussian Parameter Decoding strategy coupled with a physics-aware occupancy loss. By utilizing LiDAR intensity as a confidence metric to dynamically weight geometric supervision, this optimization objective strictly penalizes structural errors in significant regions, ensuring that the learned representation adheres rigidly to physical reality.

2. Related Work

2.1. Multi-Modal Perception Methods

Accurate perception of the 3D environment constitutes a cornerstone for the safety and reliability of autonomous driving systems. Existing approaches can be broadly categorized based on their input modalities. LiDAR-based methods, such as VoxelNet [13] and PointPillars, leverage precise depth measurements to capture 3D geometry but lack semantic richness. Conversely, camera-based approaches like FCOS3D [1] and BEVFormer [2] excel in semantic understanding but suffer from depth ambiguity. To leverage complementary information, state-of-the-art multi-modal fusion frameworks, such as BEVFusion [11] and TransFusion [12], project camera and LiDAR features into a shared representation space.
However, these mainstream frameworks typically rely on fully supervised learning paradigms trained from scratch. This dependency on massive manual annotations leads to significant data inefficiency. Furthermore, most existing methods employ deterministic regression for object detection [1,13], often neglecting the quantification of epistemic uncertainty, which is critical for safety-critical decision-making. Additionally, treating object detection and occupancy prediction [3,4] as isolated tasks ignores the intrinsic geometric correlation between sparse objects and dense scene structures.

2.2. 3D Pre-Training for Autonomous Driving

Self-supervised pre-training has emerged as a promising avenue to learn robust representations from unlabeled data. As summarized in Table 1, current paradigms primarily differ in their pretext tasks and scene representations.
Early works, such as PointContrast [5] and DepthContrast [15], maximize the similarity between augmented views of the same scene. While effective for instance-level discrimination, these methods often struggle to capture dense, scene-level geometric details required for occupancy prediction.
Inspired by success in NLP, Masked Autoencoders (MAEs) [6] have been extended to 3D domains. Methods like Point-MAE [16] and BEV-MAE [17] reconstruct masked regions to learn contextual dependencies. However, they typically require high masking ratios and may hallucinate plausible but geometrically inaccurate structures in large open scenarios.
Table 1. Comparison of representative self-supervised pre-training paradigms in autonomous driving.

| Method | Paradigm | Representation | Modality |
|---|---|---|---|
| PointContrast [5] | Contrastive | Point Cloud | LiDAR |
| Point-MAE [16] | Masked Modeling | Point Cloud | LiDAR |
| UniPAD [7] | Neural Rendering | NeRF (Implicit) | Camera+LiDAR |
| VisionPAD [8] | Neural Rendering | NeRF (Implicit) | Camera |
| AES-Gaussian (Ours) | Gaussian Rendering | 3DGS (Explicit) | Camera+LiDAR |
More relevant to our work is the emerging paradigm of rendering-based pre-training. UniPAD [7] and VisionPAD [8] utilize Neural Radiance Fields (NeRFs) to reconstruct 3D scenes from 2D images. Despite their advancements, NeRF-based methods rely on implicit volumetric ray marching, which is computationally prohibitive and prone to geometric over-smoothing (limitations detailed in Table 1). Different from these approaches, our AES-Gaussian framework employs 3D Gaussian Splatting as the rendering engine. By integrating explicit LiDAR priors, our method achieves faster training speeds and superior geometric sharpness, effectively bridging the gap between efficient rendering and precise physical reconstruction.

2.3. 3D Gaussian Splatting in Perception

3D Gaussian Splatting (3DGS) [14] has revolutionized scene reconstruction with its real-time rendering speed and explicit point-based representation. While recent works such as StreetGaussians [18] and DrivingGaussian [19] have adapted 3DGS for autonomous driving, they primarily focus on high-fidelity rendering and dynamic simulation for novel view synthesis. These methods generally prioritize photometric consistency over the learning of transferable semantic representations suitable for downstream perception tasks.
Consequently, specific efforts have shifted toward utilizing 3DGS for perception pre-training. GaussianPretrain [20] represents a pioneering step by lifting 2D image features to drive 3D Gaussians for detection. However, these predominantly vision-centric approaches struggle with inherent scale ambiguity and floating artifacts in open scenes due to the lack of reliable depth measurements [9]. Distinguishing our approach from both rendering-oriented and vision-only methods, AES-Gaussian enforces explicit physical interpretability. By leveraging LiDAR intensity as a rigid geometric anchor rather than a mere auxiliary feature, our Explicit Feature Synergy mechanism fundamentally resolves the shape-radiance ambiguity, ensuring structural stability in complex open-world environments.
The remainder of this paper is organized as follows. Section 3 details the methodology of our proposed AES-Gaussian framework, elaborating on the Asymmetric Encoder design, the Explicit Feature Synergy mechanism, and the Decoupled Gaussian Parameter Decoding strategy. Section 4 presents the experimental setup and provides a comprehensive evaluation of our method on 3D Object Detection and semantic occupancy prediction tasks, including mechanism validation and ablation studies. Section 5 discusses the theoretical implications of geometric correction, analyzes computational efficiency, and addresses the limitations of the current framework. Finally, Section 6 summarizes our main contributions and concludes the paper.

3. Materials and Methods

The overall architecture of our proposed LiDAR-enhanced Gaussian pre-training framework, AES-Gaussian, is illustrated in Figure 1. This framework is designed as a streamlined and efficient multi-modal pre-training paradigm grounded in the 3D Gaussian Splatting (3DGS) representation.
Taking multi-view images and LiDAR point clouds as inputs, our primary objective is to reconstruct scene-level RGB, Depth, and Occupancy signals by decoding the Gaussian anchor parameters $\{(\mu_j, \alpha_j, \Sigma_j, c_j)\}_{j=1}^{K}$ for each scene. To effectively exploit multi-modal information while preserving computational efficiency, we propose an Asymmetric Encoder architecture. Specifically, the vision branch leverages deep neural networks to extract implicit semantic features, whereas the LiDAR branch captures explicit physical attributes via a lightweight point encoder. These heterogeneous features are then fused within the Explicit Feature Synergy module and propagated to Decoupled Decoders. Structurally, these decoders are separated to independently predict geometric attributes and texture attributes.

3.1. Preliminaries: Visual-Only Gaussian Pre-Training

Our primary objective in this preliminary phase is to establish a unified 3D scene representation that accurately captures both the geometric structure and visual appearance of the environment. We aim to reconstruct a continuous radiance field from discrete sensor inputs, enabling the synthesis of novel views and depth maps. To achieve this efficiently, we adopt 3D Gaussian Splatting as our underlying representation due to its explicit nature and rendering speed. Ideally, this representation should seamlessly integrate semantic information from images with structural constraints from LiDAR.

3.1.1. 3D Gaussian Scene Representation

Formally, we represent the scene as a collection of discrete volumetric primitives. These 3D Gaussians are defined by their spatial attributes, including position and shape, and their appearance attributes, such as opacity and color. Mathematically, the geometry of each primitive is determined by a center $\mu \in \mathbb{R}^3$ and a covariance matrix $\Sigma$. To ensure numerical stability during optimization, the covariance is decomposed into a scaling vector $\mathbf{s} \in \mathbb{R}^3$ and a rotation quaternion $\mathbf{q} \in \mathbb{R}^4$. The visual appearance is controlled by an opacity scalar $\alpha \in [0, 1]$ and spherical harmonics coefficients $\mathbf{c}$.
To render an image from a specific viewpoint, we project these 3D primitives onto the 2D camera plane. The pixel color C ( p ) at pixel p is computed via differentiable splatting, which aggregates N ordered Gaussians overlapping the viewing ray:
$$C(p) = \sum_{i=1}^{N} c_i \alpha_i \prod_{j=1}^{i-1} (1 - \alpha_j),$$
where c i denotes the view-dependent color of the i-th Gaussian, and the product term represents the accumulated transmittance along the ray. This explicit formulation facilitates the direct manipulation of geometric attributes, providing a theoretical foundation for the injection of physical attributes proposed in our method.
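To make the front-to-back compositing above concrete, here is a minimal NumPy sketch (function name and shapes are illustrative, not taken from any released code):

```python
import numpy as np

def composite_ray(colors, alphas):
    """Front-to-back alpha compositing of N ordered Gaussians on one ray.

    colors: (N, 3) per-Gaussian RGB; alphas: (N,) opacities in [0, 1].
    Returns the composited pixel color C(p).
    """
    transmittance = 1.0          # accumulated product of (1 - alpha_j)
    pixel = np.zeros(3)
    for c, a in zip(colors, alphas):
        pixel += c * a * transmittance
        transmittance *= (1.0 - a)
    return pixel

# A fully opaque first Gaussian hides everything behind it:
front = composite_ray(np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]),
                      np.array([1.0, 0.5]))
# front -> [1.0, 0.0, 0.0]
```

The running `transmittance` variable is exactly the product term in the equation, which is why the order of the Gaussians along the ray matters.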

3.1.2. Baseline Limitations: Implicit Decoding

To initialize these Gaussians without dense depth labels, the baseline framework employs a ray-based strategy. It casts rays through LiDAR projection points to determine initial anchor positions μ . However, a critical limitation arises during the parameter decoding phase. While the positions are anchored, the remaining attributes are predicted solely from implicit visual features.
For a given anchor at $\mathbf{x}$, the baseline extracts a feature vector $f(\mathbf{x})$ via trilinear interpolation from a 3D visual voxel volume $V_{img}$:
$$f(\mathbf{x}) = \mathrm{TriInterp}(V_{img}, \mathbf{x}).$$
A shared decoder $\Phi$ then maps this implicit feature to the Gaussian parameters $\{\alpha, \mathbf{s}, \mathbf{q}, \mathbf{c}\}$. This design assumes that interpolated visual features contain sufficient geometric cues. However, in texture-less or low-light scenarios, visual features often exhibit significant ambiguity, resulting in ill-posed geometric inference [9]. This deficiency leads to slow convergence and geometric artifacts, directly motivating our proposal of Asymmetric Explicit Feature Synergy.
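For illustration, the trilinear interpolation step can be sketched as a toy NumPy routine (the actual framework operates on learned voxel features; names here are illustrative):

```python
import numpy as np

def tri_interp(volume, x):
    """Trilinearly interpolate a feature volume V (D, H, W, C) at a
    continuous voxel-space position x = (d, h, w)."""
    idx0 = np.floor(x).astype(int)
    frac = x - idx0
    out = np.zeros(volume.shape[-1])
    # Accumulate the 8 surrounding voxel corners, weighted by overlap.
    for corner in np.ndindex(2, 2, 2):
        i = np.clip(idx0 + np.array(corner), 0, np.array(volume.shape[:3]) - 1)
        w = np.prod(np.where(corner, frac, 1.0 - frac))
        out += w * volume[tuple(i)]
    return out

# A 2x2x2 volume whose single channel equals the W coordinate:
vol = np.zeros((2, 2, 2, 1))
vol[..., 0] = np.arange(2)                       # vol[d, h, w, 0] = w
feat = tri_interp(vol, np.array([0.5, 0.5, 0.5]))   # -> [0.5]
```

Because the interpolated feature is a smooth blend of its eight neighbors, it inherits any ambiguity present in those visual features, which is exactly the failure mode described above.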

3.2. Asymmetric Explicit Feature Synergy

The cornerstone of our framework is the Asymmetric Explicit Feature Synergy module. This design is predicated on the fundamental assumption that in the pre-training phase, the primary role of LiDAR is to function as a rigid geometric constraint to calibrate spatial structure, whereas the camera branch provides rich semantic context. Unlike conventional multi-modal fusion approaches that depend on computationally intensive 3D backbones [13] for voxel-level alignment, we introduce a lightweight asymmetric architecture designed to efficiently integrate these implicit visual semantics with explicit physical attributes.
Consistent with the baseline architecture, the Vision Branch is tasked with extracting high-level semantic context. Multi-view images are initially processed by a shared image backbone such as ResNet [21] or ConvNeXt [22] to yield 2D features. These features are subsequently lifted into 3D space via a View Transformer like LSS [23], forming a unified visual voxel volume $V_{img}$. For an arbitrary 3D anchor position $\mathbf{x}$, the implicit semantic feature $F_{img}(\mathbf{x})$ is derived via trilinear interpolation from $V_{img}$. We characterize this feature as implicit because it is inferred from 2D projections rather than acquired through direct 3D measurements.
To augment the visual representations, we incorporate a lightweight LiDAR Branch that directly encodes raw physical priors without relying on intermediate voxelization. During the ray sampling phase, for each sampled point, we extract not only its geometric coordinates but also a set of raw physical attributes $P_{raw} = \{I, d, \Delta \mathbf{x}\}$. Here, $I$ denotes the reflection intensity, $d$ represents the relative depth along the ray, and $\Delta \mathbf{x}$ indicates the relative coordinate within the local voxel.
Instead of feeding these points into a heavy 3D backbone, we employ a streamlined Point MLP encoder [24], denoted as $E_{lidar}$ and illustrated in Figure 2, to map these raw attributes into a high-dimensional feature space:
$$F_{lidar} = E_{lidar}(I, d, \Delta \mathbf{x}),$$
where $F_{lidar}$ represents the explicit physical feature vector. This design significantly reduces computational overhead while preserving precise physical cues.
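A hedged sketch of such a lightweight encoder follows. The three linear layers with hidden dimension 16 match the configuration reported in Section 4.2; the random initialization, ReLU activations, and 5-D input layout (I, d, and the 3-D offset) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

class PointMLP:
    """Toy stand-in for the lightweight LiDAR encoder E_lidar:
    three linear layers (hidden dim 16) mapping the raw physical
    attributes (I, d, Δx) to an explicit feature vector."""
    def __init__(self, in_dim=5, hidden=16, out_dim=16):
        self.ws = [rng.normal(0, 0.1, (in_dim, hidden)),
                   rng.normal(0, 0.1, (hidden, hidden)),
                   rng.normal(0, 0.1, (hidden, out_dim))]

    def __call__(self, p_raw):
        h = p_raw
        for i, w in enumerate(self.ws):
            h = h @ w
            if i < len(self.ws) - 1:
                h = np.maximum(h, 0.0)   # ReLU on hidden layers only
        return h

enc = PointMLP()
f_lidar = enc(np.array([0.8, 12.5, 0.1, -0.2, 0.3]))  # (I, d, Δx)
```

With only three small matrix multiplies per point, the cost is negligible next to a sparse-convolutional 3D backbone, which is the efficiency argument made above.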
To effectively synthesize these heterogeneous features, we employ a concatenation-based fusion strategy. The unified feature $F_{synergy}$ is formulated as:
$$F_{synergy} = \mathrm{Concat}(F_{img}, F_{lidar}).$$
The preference for concatenation over element-wise summation or attention mechanisms is driven by the necessity to maintain the integrity of sparse geometric signals. Summation runs the risk of allowing dominant visual features, which may contain noise in texture-less regions, to obscure the sparse yet precise LiDAR signals. In contrast, concatenation preserves the high-frequency physical information provided by the LiDAR branch. Furthermore, this approach offers superior computational efficiency compared to complex attention modules, while empowering the subsequent decoding layers to adaptively balance semantic context and geometric constraints based on the specific requirements of the scene.

3.3. Decoupled Gaussian Parameter Decoding

To further refine the representation learning process, we introduce a Decoupled Gaussian Parameter Decoding strategy, as shown in Figure 3. Fundamentally, the attributes of 3D Gaussian primitives can be categorized into two functionally distinct groups: Geometry Attributes, which define the physical structure, and Appearance Attributes, which govern surface radiance. Given that these attribute groups exhibit varying dependencies on input modalities, employing a monolithic shared decoder (as is standard in prior works) is suboptimal. Consequently, we engineer two specialized decoding heads to handle these distinct tasks.
Geometry Head. The Geometry Head is dedicated to reconstructing the precise spatial topology of the scene. It accepts the synergistic feature vector $F_{synergy}$ (the concatenation of visual and physical features) as input to regress the opacity $\alpha$, scaling $\mathbf{s}$, rotation $\mathbf{q}$, and position offset $\Delta\mu$:
$$\{\alpha, \mathbf{s}, \mathbf{q}, \Delta\mu\} = \Phi_{geo}(F_{synergy}),$$
where $\Phi_{geo}(\cdot)$ denotes the learnable geometry decoding module (implemented via MLPs) that maps the high-dimensional feature representations to the intrinsic physical attributes of 3D Gaussians.
The rationale behind this design is grounded in physical principles. Specifically, LiDAR intensity acts as a robust proxy for surface occupancy; high intensity typically correlates with solid surfaces, thereby providing a decisive supervisory signal for predicting high opacity $\alpha$. Furthermore, the explicit depth information encoded in $F_{lidar}$ imposes rigid constraints on position and scale, effectively mitigating the "floating artifacts" frequently observed in vision-only approaches.
Texture Head. In contrast, the Texture Head focuses on high-fidelity photometric modeling. Acknowledging that LiDAR point clouds are sparse and devoid of chromatic information, forcing geometric features into the color prediction branch risks introducing structural noise into the continuous color manifold. Therefore, we deliberately isolate the Texture Head to condition exclusively on the implicit visual feature $F_{img}$:
$$\mathbf{c} = \Phi_{tex}(F_{img}),$$
where $\Phi_{tex}(\cdot)$ represents the texture decoding head that projects the semantic-rich image features $F_{img}$ into the spherical harmonics coefficients $\mathbf{c}$. By decoupling the decoding architecture, we ensure that the scene geometry is rigorously constrained by physical measurements, while the appearance attributes preserve the rich semantic consistency derived from the visual context.
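The decoupling can be sketched as follows. The head widths, activations, and output parameterization (sigmoid for opacity, exponential for positive scales, degree-0 spherical harmonics for color) are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

rng = np.random.default_rng(1)

def mlp(in_dim, out_dim, hidden=32):
    """One-hidden-layer MLP as a stand-in for a decoding head."""
    w1 = rng.normal(0, 0.1, (in_dim, hidden))
    w2 = rng.normal(0, 0.1, (hidden, out_dim))
    return lambda x: np.maximum(x @ w1, 0.0) @ w2

C_IMG, C_LIDAR = 64, 16
# Geometry head sees the fused feature; texture head sees vision only.
geo_head = mlp(C_IMG + C_LIDAR, 1 + 3 + 4 + 3)   # α, s, q, Δμ
tex_head = mlp(C_IMG, 3)                          # degree-0 SH (RGB)

f_img = rng.normal(size=C_IMG)
f_lidar = rng.normal(size=C_LIDAR)
f_synergy = np.concatenate([f_img, f_lidar])

raw = geo_head(f_synergy)
alpha = 1.0 / (1.0 + np.exp(-raw[0]))   # squash opacity into (0, 1)
s = np.exp(raw[1:4])                    # strictly positive scales
q, d_mu = raw[4:8], raw[8:11]
color = tex_head(f_img)                 # color never sees LiDAR features
```

Note that the asymmetry is structural: `tex_head` simply never receives `f_lidar`, so sparse geometric noise cannot leak into the color manifold.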

3.4. Joint Optimization and Loss Functions

To establish robust representation learning without reliance on manual semantic annotations, we formulate the pre-training phase as a multi-task self-supervised reconstruction problem. The overall optimization objective is composed of photometric consistency constraints and geometric structure regularization. The total loss function is formally defined as:
$$\mathcal{L}_{total} = \mathcal{L}_{rgb} + \lambda_{depth} \mathcal{L}_{depth} + \lambda_{occ} \mathcal{L}_{occ},$$
where $\mathcal{L}_{rgb}$, $\mathcal{L}_{depth}$, and $\mathcal{L}_{occ}$ denote the Photometric Consistency Loss, Depth Geometric Regularization, and physics-aware occupancy loss, respectively. The scalars $\lambda_{depth}$ and $\lambda_{occ}$ serve as hyperparameters to balance the contribution of these geometric regularization terms.

3.4.1. Photometric Consistency Loss

The learning of texture features is governed primarily by the fidelity of the rendered images. For a set of valid pixels $P$, we quantify the photometric discrepancy using the $L_1$ distance between the rendered color $\hat{C}(p)$ and the ground truth pixel value $C_{gt}(p)$. The RGB loss is formulated as:
$$\mathcal{L}_{rgb} = \frac{1}{|P|} \sum_{p \in P} \left\lVert \hat{C}(p) - C_{gt}(p) \right\rVert_1 .$$
This objective compels the Texture Head to extract accurate appearance information from the image features, ensuring high-fidelity visual reconstruction.
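As a minimal illustration of this loss (array shapes assumed to be (|P|, 3); the function name is ours):

```python
import numpy as np

def l_rgb(c_pred, c_gt):
    """Photometric loss: mean per-pixel L1 norm ||C_hat(p) - C_gt(p)||_1
    over the valid pixel set P."""
    return np.abs(c_pred - c_gt).sum(axis=1).mean()

# One pixel off by 0.5 in two channels -> L1 norm 1.0:
loss = l_rgb(np.array([[0.5, 0.5, 0.5]]),
             np.array([[0.0, 1.0, 0.5]]))   # -> 1.0
```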

3.4.2. Depth Geometric Loss

To mitigate the inherent depth ambiguity associated with volumetric rendering, we leverage LiDAR projection points to provide sparse yet precise depth supervision. Analogous to chromatic rendering, the predicted depth $\hat{D}(p)$ is derived via the volumetric accumulation of Gaussian depths along the ray. We then minimize the deviation between this prediction and the ground truth depth $D_{gt}(p)$ captured by LiDAR:
$$\mathcal{L}_{depth} = \frac{1}{|P|} \sum_{p \in P} \left\lVert \hat{D}(p) - D_{gt}(p) \right\rVert_1 .$$
This term serves as a geometric anchor, constraining the spatial distribution of Gaussian primitives to ensure they adhere closely to the physical object surfaces.
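A sketch of the depth term with an explicit validity mask (restricting the loss to pixels that actually receive a LiDAR return is our reading of the sparse-supervision setup):

```python
import numpy as np

def l_depth(d_pred, d_gt, valid):
    """Sparse depth L1 loss: only pixels with a projected LiDAR
    return (the valid mask defines P) contribute supervision."""
    return np.abs(d_pred[valid] - d_gt[valid]).mean()

d_pred = np.array([10.0, 5.0, 3.0])
d_gt   = np.array([11.0, 0.0, 3.5])     # second pixel has no return
valid  = np.array([True, False, True])
loss = l_depth(d_pred, d_gt, valid)     # -> (1.0 + 0.5) / 2 = 0.75
```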

3.4.3. Physics-Aware Occupancy Loss

To inject explicit physical priors into the geometric learning process, we introduce a physics-aware occupancy loss ($\mathcal{L}_{occ}$). Departing from conventional occupancy supervision that treats all voxels uniformly, we utilize LiDAR reflectance intensity as a confidence metric to strictly penalize geometric errors in structurally significant regions while mitigating sensitivity to sensor noise.
We formulate the loss as an intensity-weighted binary cross-entropy. To derive the binary ground truth labels $y_v$, we employ a ray casting strategy aligned with the physical measurement process. Specifically, voxels traversed by the LiDAR beam between the sensor origin and the reflection point are defined as free space ($y_v = 0$), whereas the voxel containing the reflection point is marked as occupied ($y_v = 1$). Crucially, to address the ambiguity in occluded regions, grid cells located behind the reflection point along the ray are treated as unknown and are explicitly masked out from the loss calculation. The optimization objective is defined as:
$$\mathcal{L}_{occ} = -\frac{1}{|V|} \sum_{v \in V} \left[ (1 + \gamma \tilde{I}_v)\, y_v \log(p_v) + \alpha (1 - y_v) \log(1 - p_v) \right],$$
where $p_v$ denotes the predicted occupancy probability. Functionally, this objective serves to anchor the Gaussian attributes: the spatial constraints from ray casting implicitly regulate the position $\mu$ and scale $\mathbf{s}$, while the intensity weighting primarily refines the opacity $\alpha$ based on signal confidence. The weighting term $(1 + \gamma \tilde{I}_v)$ acts as a dynamic gain based on the normalized intensity $\tilde{I}_v$. This mechanism effectively handles the noise inherent in sparse LiDAR returns by assigning lower weights to points with weak reflection signals, which are statistically more prone to measurement instability. Consequently, the network prioritizes high-reflectivity objects which provide reliable geometric cues. The selection of hyperparameters $\gamma$ and $\alpha$ is pivotal for ensuring convergence stability, balancing the gradient contribution from high-confidence surfaces against the severe class imbalance typical of outdoor scenes.
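A hedged NumPy sketch of the intensity-weighted objective; the hyperparameter values and the explicit `known` mask (dropping the unknown voxels behind each return) are illustrative assumptions:

```python
import numpy as np

def l_occ(p_v, y_v, intensity, known, gamma=2.0, alpha_neg=1.0, eps=1e-7):
    """Intensity-weighted binary cross-entropy over voxels V.

    p_v: predicted occupancy probabilities; y_v: ray-cast labels
    (1 = reflection voxel, 0 = traversed free space); intensity:
    normalized LiDAR intensity I~_v; known: mask excluding voxels
    behind the return. gamma and alpha_neg stand for the γ and α
    hyperparameters (values here are illustrative, not the paper's).
    """
    p = np.clip(p_v[known], eps, 1.0 - eps)   # numerical safety for log
    y, i_n = y_v[known], intensity[known]
    pos = (1.0 + gamma * i_n) * y * np.log(p)         # occupied term
    neg = alpha_neg * (1.0 - y) * np.log(1.0 - p)     # free-space term
    return -np.mean(pos + neg)

# An error on a high-intensity (confident) surface voxel costs more
# than the same error on a weak return:
err_strong = l_occ(np.array([0.5]), np.array([1.0]),
                   np.array([1.0]), np.array([True]))
err_weak   = l_occ(np.array([0.5]), np.array([1.0]),
                   np.array([0.0]), np.array([True]))
```

The comparison at the bottom makes the gain mechanism concrete: with `gamma=2.0`, the same misprediction is weighted three times as heavily when the return intensity is maximal.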

4. Results

4.1. Datasets and Metrics

Dataset. Experimental validation is performed on the nuScenes dataset [25], a large-scale multimodal autonomous driving benchmark comprising 1000 driving sequences, each of 20 s duration. The dataset adheres to the official split: 700 scenes for training, 150 for validation, and 150 for testing. It provides synchronized sensory data from six surrounding cameras and a 32-beam LiDAR sensor. For the 3D Semantic Occupancy Prediction task, we utilize the Occ3D-nuScenes benchmark [4], which generates dense voxel-wise semantic labels derived from nuScenes keyframes. The volume of interest spans $[-40\,\mathrm{m}, 40\,\mathrm{m}]$ along the X and Y axes, and $[-1\,\mathrm{m}, 5.4\,\mathrm{m}]$ along the Z axis, with a voxel resolution of 0.4 m.
Metrics for Pre-training Reconstruction. Given that the primary objective of our pre-training is to resolve geometric ambiguity, we assess the fidelity of the learned representations from two complementary perspectives: photometric consistency and geometric accuracy.
  • Photometric Quality: We quantify the visual similarity between rendered images and ground truth views using Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index (SSIM), and Learned Perceptual Image Patch Similarity (LPIPS).
  • Geometric Quality: To explicitly verify the efficacy of our feature synergy in rectifying depth errors, we project accumulated LiDAR points onto the image plane to generate sparse ground truth depth maps. We evaluate depth consistency using the Root Mean Square Error (RMSE):
    $$\mathrm{RMSE} = \sqrt{\frac{1}{|D|} \sum_{i \in D} \left( d_i - \hat{d}_i \right)^2},$$
    where $d_i$ denotes the ground truth depth from projected LiDAR points, $\hat{d}_i$ represents the rendered depth from our 3D Gaussians, and $D$ denotes the set of valid pixels containing LiDAR returns.
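The metric can be computed as follows (the boolean mask selects the valid pixel set D; names are ours):

```python
import numpy as np

def depth_rmse(d_gt, d_pred, valid):
    """RMSE over the pixel set D with projected LiDAR returns."""
    err = d_gt[valid] - d_pred[valid]
    return float(np.sqrt(np.mean(err ** 2)))

rmse = depth_rmse(np.array([1.0, 2.0, 9.0]),
                  np.array([2.0, 4.0, 0.0]),
                  np.array([True, True, False]))   # errors 1 and 2
```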
Metrics for 3D Object Detection. For the downstream 3D object detection task, we report the official nuScenes detection metrics, including the Mean Average Precision (mAP) and the NuScenes Detection Score (NDS). The NDS is computed as a weighted combination of mAP and five True Positive (TP) error metrics: Mean Average Translation Error (mATE), Mean Average Scale Error (mASE), Mean Average Orientation Error (mAOE), Mean Average Velocity Error (mAVE), and Mean Average Attribute Error (mAAE). The NDS is formulated as:
$$\mathrm{NDS} = \frac{1}{10} \left[ 5 \times \mathrm{mAP} + \sum_{\mathrm{mTP} \in \mathbb{TP}} \left( 1 - \min(1, \mathrm{mTP}) \right) \right].$$
In our analysis, we place particular emphasis on mATE (measured in meters), as it serves as a direct indicator of geometric localization accuracy, thereby validating the benefits of our physics-aware pre-training paradigm.
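The NDS combination can be reproduced in a few lines (a straightforward transcription of the official formula; the ordering of the five TP errors does not affect the score):

```python
def nds(m_ap, tp_errors):
    """nuScenes Detection Score:
    NDS = (1/10) [5 * mAP + sum_mTP (1 - min(1, mTP))],
    where tp_errors holds the five TP metrics
    (mATE, mASE, mAOE, mAVE, mAAE)."""
    tp_score = sum(1.0 - min(1.0, e) for e in tp_errors)
    return 0.1 * (5.0 * m_ap + tp_score)

# mAP = 0.5 with all five TP errors at 0.5 yields NDS = 0.5:
score = nds(0.5, [0.5] * 5)
```

Note how each TP error is clipped at 1, so a pathologically bad orientation or velocity estimate can at most zero out its own contribution.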
Metrics for 3D Semantic Occupancy. For the dense prediction task, performance is evaluated using the mean Intersection over Union (mIoU) across 17 semantic classes. Let $TP_c$, $FP_c$, and $FN_c$ denote the true positives, false positives, and false negatives for class $c$, respectively. The mIoU is calculated as:
$$\mathrm{mIoU} = \frac{1}{N_{cls}} \sum_{c=1}^{N_{cls}} \frac{TP_c}{TP_c + FP_c + FN_c}.$$
Furthermore, to assess the quality of scene geometry reconstruction regardless of semantic categorization, we report the Geometric IoU ($\mathrm{IoU}_{geo}$). This metric evaluates the binary classification performance, distinguishing occupied voxels from free space.
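A reference computation of mIoU from flat voxel-label arrays (skipping classes absent from both prediction and ground truth is our assumption of the usual convention, not something the paper specifies):

```python
import numpy as np

def miou(pred, gt, n_cls):
    """Mean IoU over semantic classes, from flat voxel-label arrays."""
    ious = []
    for c in range(n_cls):
        tp = np.sum((pred == c) & (gt == c))
        fp = np.sum((pred == c) & (gt != c))
        fn = np.sum((pred != c) & (gt == c))
        denom = tp + fp + fn
        if denom > 0:              # skip classes absent from pred and gt
            ious.append(tp / denom)
    return float(np.mean(ious))

# Class 0: IoU 1/2; class 1: IoU 2/3 -> mIoU 7/12:
score = miou(np.array([0, 0, 1, 1]), np.array([0, 1, 1, 1]), 2)
```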

4.2. Implementation Details

Our framework is implemented within the MMDetection3D [26] codebase. For the vision branch, we utilize ConvNeXt-Small [22], pre-trained on ImageNet, as the backbone network to extract 2D image features. These features are subsequently lifted into 3D voxel space via a depth-based View Transformer. Conversely, for the LiDAR branch, we engineer a lightweight Point-MLP comprising three linear layers with a hidden dimension of 16 to encode explicit physical attributes. The voxel resolution is configured as $[0.6\,\mathrm{m}, 0.6\,\mathrm{m}, 1.6\,\mathrm{m}]$, and the spatial detection range is defined as $[-54\,\mathrm{m}, 54\,\mathrm{m}]$ along the X and Y axes, and $[-5\,\mathrm{m}, 3\,\mathrm{m}]$ along the Z axis.
Training Schedule. The training protocol is divided into two distinct phases: pre-training and fine-tuning.
  • Stage 1: Multi-modal Pre-training. We train the entire Asymmetric Explicit Feature Synergy framework for 12 epochs using the AdamW optimizer [27] with a weight decay of 0.01. The initial learning rate is set to $2 \times 10^{-4}$ and decays following a cosine annealing schedule, incorporating a linear warmup for the first 500 iterations. The optimization objective integrates RGB reconstruction, depth supervision, and the proposed physics-aware occupancy loss.
  • Stage 2: Downstream Fine-tuning. Following pre-training, the learned weights are transferred to downstream tasks. For 3D Object Detection, the model is fine-tuned for 30 epochs on the nuScenes training set, employing a transformer-based detection head structurally similar to UVTR [28]. For 3D Semantic Occupancy, we attach a dedicated occupancy head and fine-tune the model for 24 epochs. During this phase, the learning rate policy remains consistent with the cosine annealing strategy used in the pre-training stage.
Hardware and Efficiency. All experimental evaluations are executed on a server equipped with 6 NVIDIA A800 GPUs. The batch size is set to 1 per GPU, yielding a total batch size of 6. To ensure a fair comparison with baseline methods under a standard setting, we deliberately exclude complex training tricks such as Class-Balanced Grouping and Sampling (CBGS) or Test-Time Augmentation (TTA).

4.3. Pre-Training Mechanism Validation

To empirically validate the efficacy of our Asymmetric Explicit Feature Synergy, we conduct a comparative analysis against two representative paradigms: GaussianPretrain, a vision-only 3DGS method, and UniPAD, a multi-modal NeRF method. We evaluate reconstruction fidelity on the nuScenes validation set, focusing on photometric consistency, perceptual sharpness, and geometric accuracy. Specifically, we employ the Learned Perceptual Image Patch Similarity (LPIPS) metric to explicitly assess image sharpness and texture fidelity. Furthermore, to rigorously quantify the geometric fidelity of the reconstructed scene, we report the Depth L1 Error alongside the standard RMSE and AbsRel metrics.
Quantitative analysis confirms that vision-exclusive approaches such as GaussianPretrain exhibit severe scale ambiguity, resulting in a high RMSE of 4.51 m. By integrating LiDAR constraints, our method significantly rectifies these depth errors. Most notably, compared to the multi-modal NeRF baseline UniPAD, our approach achieves a substantial 24% reduction in Depth RMSE (2.153 m vs. 2.850 m). This trend is consistently reflected in the Depth L1 Error, where our method achieves the lowest error of 1.250 m, demonstrating superior per-point geometric fidelity. We attribute this gain to the fundamental difference in scene representation: implicit NeRF-based rendering is prone to over-smoothing high-frequency details, whereas our explicit Gaussian synergy utilizes LiDAR intensity to strictly anchor the geometry and preserve sharp discontinuities.

Beyond geometric accuracy, computational efficiency constitutes a vital metric for scalable pre-training. We report the inference latency in Table 2 to highlight the distinct costs associated with each scene representation. The NeRF-based UniPAD incurs a high latency of 145 ms due to the computationally intensive volumetric ray marching required for implicit rendering. Conversely, our method benefits from the rapid rasterization of 3D Gaussian Splatting, achieving a low latency of 23 ms. This is comparable to the 19 ms of the vision-only baseline, demonstrating that our asymmetric synergy incorporates physical priors with minimal computational overhead.
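For reference, the three depth metrics reported here follow their standard definitions and can be computed as below; restricting evaluation to pixels with valid (positive) ground-truth depth is the usual convention for sparse LiDAR supervision and an assumption in this sketch.

```python
import torch

def depth_errors(pred: torch.Tensor, gt: torch.Tensor):
    """Depth L1, RMSE, and AbsRel between rendered and ground-truth depth.

    Evaluated only at pixels with valid (positive) ground-truth depth,
    as is conventional when supervising with sparse LiDAR projections.
    """
    mask = gt > 0                              # valid LiDAR returns only
    d, g = pred[mask], gt[mask]
    l1 = (d - g).abs().mean()                  # mean absolute error (m)
    rmse = ((d - g) ** 2).mean().sqrt()        # root mean squared error (m)
    absrel = ((d - g).abs() / g).mean()        # scale-normalized relative error
    return l1.item(), rmse.item(), absrel.item()
```

Note that AbsRel normalizes each error by the true depth, which is why it complements RMSE: a 1 m error at 50 m is far less severe than at 5 m.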
Qualitative visualization further corroborates these quantitative results, as presented in Figure 4, which depicts reconstruction results across three distinct scenes. The vision-only baseline displays significant depth noise, characterized by floating artifacts in open space that manifest as speckles in the depth map. The NeRF-based UniPAD produces globally consistent but locally blurred depth maps that often lose definition at object boundaries. In contrast, our method generates high-fidelity depth maps with crisp structural boundaries between vehicles and the background, effectively eliminating both the floating artifacts observed in vision-only methods and the over-smoothing inherent to NeRF.

4.4. Performance on Downstream Tasks

To assess the transferability of the representations learned by our pre-training paradigm, we fine-tune the model on two critical downstream tasks: 3D Object Detection and 3D Semantic Occupancy Prediction. In all experiments, the UVTR-M [28] architecture serves as the base detector.
3D Object Detection. Table 3 presents the quantitative results on the nuScenes detection benchmark. Our AES-Gaussian achieves an NDS of 54.7% and an mAP of 48.2%, surpassing the scratch baseline by 7.0 and 6.5 points, respectively (relative gains of 14.7% and 15.6%). Notably, compared to the state-of-the-art NeRF-based method UniPAD [7], our approach demonstrates superior geometric precision, yielding a significantly lower mean Average Translation Error (mATE: 0.258 m vs. 0.275 m). This 18% reduction in translation error relative to the scratch baseline (0.315 m) corroborates that our explicit physical synergy effectively calibrates geometric features, thereby facilitating more precise object localization.
To further evaluate the robustness of our learned representations against environmental variations, we analyze the detection performance under different weather and lighting conditions. As shown in Table 4, vision-centric methods suffer significant performance degradation in Rainy and Night scenarios due to visual ambiguity. In contrast, AES-Gaussian maintains high stability with only a marginal drop in NDS. Specifically, in Night scenes, our method outperforms the vision-only baseline by over 8%, validating that the injected LiDAR physical priors effectively compensate for the loss of visual information.
We visualize the detection results in Figure 5. As illustrated in the second column, the No-Pretrain baseline is characterized by frequent false negatives (missed detections) and imprecise bounding box regression. While UniPAD-M (third column) improves recall, it remains prone to orientation misalignment and positional jitter. In contrast, our method (fourth column) demonstrates robust detection capabilities even in complex scenarios, achieving tighter bounding box alignment with the ground truth due to the enhanced geometric priors.
3D Semantic Occupancy Prediction. We further evaluate the generalization capability of our model on dense prediction tasks using the Occ3D-nuScenes benchmark [4]. As detailed in Table 5, our method achieves 41.3% mIoU, a gain of 4.8 points (13.2% relative) over the baseline. Crucially, the geometric IoU (IoU_geo) increases from 74.2% to 78.9%, providing strong evidence that our learned features possess inherent spatial awareness.
Qualitative comparisons presented in Figure 6 illustrate that our method recovers finer structural details. The No-Pretrain baseline (Column 2) produces fragmented predictions with significant sparsity. UniPAD-M (Column 3) tends to suffer from over-smoothing, resulting in volumetric dilation. Conversely, Ours (Column 4) generates dense, coherent, and sharp occupancy structures, successfully recovering fine-grained elements such as poles and vegetation that are often compromised in competing methods.

4.5. Efficiency Analysis

To evaluate the potential for real-world deployment, we benchmark the inference speed (Frames Per Second, FPS) and detection performance (NDS) of various methods on an NVIDIA A800 GPU platform. The trade-off between performance and latency is visualized in Figure 7.
AES-Gaussian is positioned in the optimal high-efficiency quadrant, achieving 54.7% NDS at 8.5 FPS. Notably, compared to the competing pre-training method UniPAD, our approach delivers an approximately 4× speedup while maintaining superior accuracy. This efficiency boost stems from our architectural design, which discards the computationally intensive volumetric ray marching inherent to NeRF baselines in favor of a lightweight Point-MLP encoder and explicit Gaussian rasterization.
Furthermore, relative to heavy multi-modal detectors such as BEVFusion, AES-Gaussian achieves a two-fold increase in inference speed with a negligible performance margin (0.5% NDS difference). This validates that high-quality pre-training combined with a streamlined architecture can effectively replace computationally heavy 3D backbones. Although PointPillars exhibits the highest inference speed due to its simplistic pillar encoding, its perception accuracy falls short of the requirements for complex scene understanding. Consequently, AES-Gaussian represents the optimal equilibrium between precision and latency, satisfying the quasi-real-time requirements of modern autonomous driving systems.

4.6. Ablation Studies

To rigorously verify the efficacy of individual components within our AES-Gaussian framework, we conduct comprehensive ablation studies on the nuScenes validation set. Following established protocols in self-supervised learning [5,7], we fine-tune the models on the downstream detection task using a semi-supervised setting with only 10% of the labeled training data. This low-data regime is deliberately chosen to evaluate the data efficiency of the pre-trained representations. A robust pre-training framework should enable the model to learn generalized features that require minimal annotated data for adaptation, thereby validating the quality of the learned structural constraints.
Effectiveness of Key Components. We dissect the contribution of our core designs by incrementally adding components to the baseline: the Asymmetric LiDAR Branch, the Explicit Feature Synergy, and the Decoupled Decoder. The quantitative evolution is summarized in Table 6.
  • Baseline (Model A): We start with the vision-only baseline (UVTR-C). As expected, it is plagued by geometric ambiguity, achieving a modest 42.5% NDS and a high Depth RMSE of 5.82 m.
  • w/ Asymmetric LiDAR Branch (Model B): We first equip the baseline with the lightweight LiDAR branch. Simply concatenating these explicit physical features yields a notable improvement (4.2% NDS). However, the mATE (0.295 m) remains suboptimal, indicating that naive fusion without alignment strategies fails to fully exploit the inherent geometric precision of LiDAR.
  • w/ Explicit Feature Synergy (Model C): Building on Model B, we incorporate the Explicit Feature Synergy module to inject LiDAR intensity and relative depth into visual features via Voxel Scatter. This addition brings a substantial gain of 3.5% NDS and drastically reduces Depth RMSE to 2.25 m. This step confirms that establishing explicit physical anchors is critical for compelling 3D Gaussians to align strictly with physical surfaces.
  • w/ Decoupled Decoder (Model D): Finally, we apply the Decoupled Decoder strategies to separate geometric reconstruction from appearance learning. This full model achieves the best performance (51.0% NDS on 10% data) and the lowest translation error (0.262 m), validating that decoupling effectively mitigates optimization conflicts between the two modalities.
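To make the Explicit Feature Synergy step of Model C concrete, the following sketch scatters per-point physical attributes (e.g. intensity, relative depth) into the visual voxel grid and concatenates them channel-wise. Mean-pooling points per voxel and the function name are illustrative assumptions, not the paper's exact implementation.

```python
import torch

def scatter_lidar_to_voxels(points, attrs, voxel_size, pc_min, grid_shape, vis_feats):
    """Scatter per-point physical attributes into a voxel grid and fuse them
    with visual features by channel concatenation (single-sample sketch).

    points: (N, 3) XYZ coordinates; attrs: (N, C_l) physical attributes;
    vis_feats: (C_v, X, Y, Z) visual voxel features.
    """
    X, Y, Z = grid_shape
    idx = ((points - points.new_tensor(pc_min)) /
           points.new_tensor(voxel_size)).long()            # (N, 3) voxel indices
    valid = ((idx >= 0) & (idx < idx.new_tensor([X, Y, Z]))).all(dim=1)
    idx, a = idx[valid], attrs[valid]
    flat = idx[:, 0] * (Y * Z) + idx[:, 1] * Z + idx[:, 2]  # flatten (x, y, z)
    C = a.shape[1]
    grid = torch.zeros(X * Y * Z, C)
    count = torch.zeros(X * Y * Z, 1)
    grid.index_add_(0, flat, a)                             # sum attributes per voxel
    count.index_add_(0, flat, torch.ones(flat.shape[0], 1))
    grid = grid / count.clamp(min=1.0)                      # mean pooling (assumption)
    lidar_vox = grid.view(X, Y, Z, C).permute(3, 0, 1, 2)   # (C_l, X, Y, Z)
    return torch.cat([vis_feats, lidar_vox], dim=0)         # channel concat
```

The key property this illustrates is that LiDAR attributes land at exact physical voxel locations, acting as sparse anchors rather than being diffused through a learned fusion layer.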
Impact of Loss Components and Hyperparameters. To explicitly quantify the contribution of each supervision signal defined in Equation (7), we conduct a component-wise ablation study by incrementally incorporating depth and occupancy constraints into the photometric baseline. As summarized in Table 7, relying solely on RGB supervision results in a modest NDS of 42.5% and a high translation error, primarily due to the inherent scale ambiguity of monocular rendering. The integration of LiDAR-guided depth supervision significantly improves geometric localization, increasing the NDS to 47.3%. Furthermore, incorporating the proposed physics-aware occupancy loss yields the most substantial performance gain, boosting the NDS to 51.0%. This confirms that explicit occupancy supervision is critical for suppressing floating artifacts and enforcing structural consistency in open-world environments.
We further investigate the sensitivity of the framework to the occupancy loss weight λ_occ, varying it while keeping all other hyperparameters constant. The results presented in Table 8 demonstrate that the model is robust to hyperparameter variations within a reasonable range. Performance peaks at λ_occ = 1.0, where the network achieves an optimal balance between geometric constraints and semantic learning. Excessively low weights fail to provide sufficient structural guidance, while excessively high weights over-penalize appearance learning, leading to a slight degradation in performance.
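Since Equation (7) is not reproduced in this section, the following sketch assumes the common weighted-sum form of the objective; the per-term losses (L1 photometric, masked L1 depth, binary cross-entropy occupancy) and λ_depth are plausible stand-ins rather than the paper's exact choices, while λ_occ = 1.0 matches the optimum reported in Table 8.

```python
import torch
import torch.nn.functional as F

def total_loss(pred_rgb, gt_rgb, pred_depth, gt_depth, pred_occ, gt_occ,
               lambda_depth: float = 1.0, lambda_occ: float = 1.0):
    """Assumed weighted-sum form of the pre-training objective:
    photometric + depth + occupancy terms. Per-term loss choices and
    lambda_depth are illustrative assumptions.
    """
    l_rgb = F.l1_loss(pred_rgb, gt_rgb)               # photometric reconstruction
    mask = gt_depth > 0                               # sparse LiDAR supervision
    l_depth = F.l1_loss(pred_depth[mask], gt_depth[mask])
    l_occ = F.binary_cross_entropy_with_logits(pred_occ, gt_occ)
    return l_rgb + lambda_depth * l_depth + lambda_occ * l_occ
```

Sweeping lambda_occ over {0.1, 0.5, 1.0, 2.0, 5.0} with this signature reproduces the experimental protocol of Table 8.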
Sensitivity to Masking Ratio. The masking ratio serves as a pivotal hyperparameter in generative pre-training. We investigate its impact by sweeping ratios from 0% to 75%, with results summarized in Table 9. Empirical evidence indicates that performance maximizes at a moderate ratio of 40%. This finding diverges from standard vision-only Masked Autoencoders, which typically necessitate aggressive masking (e.g., 75%) to counteract the high spatial redundancy of dense imagery. In our multi-modal context, however, LiDAR point clouds are inherently sparse. Excessive masking at 75% severely compromises the geometric integrity of the scene, diminishing the density of physical anchors required by the Explicit Feature Synergy module. Conversely, a negligible masking ratio fails to enforce robust representation learning, resulting in overfitting. Consequently, 40% emerges as the optimal equilibrium, preserving sufficient geometric structure to guide the synergy mechanism while inducing adequate difficulty for feature abstraction.
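A masking step of the kind swept above can be sketched as follows; whether the framework masks voxels, Gaussians, or image patches is not restated in this section, so masking voxel features is an illustrative assumption.

```python
import torch

def mask_voxels(vox_feats: torch.Tensor, ratio: float = 0.4, generator=None):
    """Randomly zero out a fraction of voxel features for generative
    pre-training. Returns the masked features and a boolean keep-mask.

    The 40% default matches the optimum in Table 9; applying the mask to
    voxel features (rather than image patches) is an assumption.
    """
    flat = vox_feats.flatten(1)                  # (C, V) with V = X*Y*Z
    V = flat.shape[1]
    n_mask = int(ratio * V)
    perm = torch.randperm(V, generator=generator)
    masked = flat.clone()
    masked[:, perm[:n_mask]] = 0.0               # drop the selected voxels
    keep = torch.ones(V, dtype=torch.bool)
    keep[perm[:n_mask]] = False
    return masked.view_as(vox_feats), keep.view(vox_feats.shape[1:])
```

At a 75% ratio the surviving anchors become too sparse to constrain geometry, which is consistent with the degradation reported in Table 9.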

5. Discussion

The empirical results substantiate that incorporating explicit physical priors is pivotal for resolving geometric ambiguity in vision-centric perception.
Reasons for Superiority. Our performance gains stem primarily from resolving the shape-radiance ambiguity. By leveraging LiDAR intensity as a physical anchor, the Asymmetric Explicit Feature Synergy compels 3D Gaussians to align strictly with physical surfaces. This explicit constraint effectively eliminates floating artifacts inherent in vision-only optimization and significantly reduces translation errors, directly translating to superior detection accuracy.
Comparison with NeRF Baselines. The performance advantage of AES-Gaussian over NeRF-based methods extends beyond the choice of scene representation. Standard NeRF baselines typically optimize for photometric consistency without incorporating the explicit physical or occupancy constraints utilized in our framework. Therefore, the observed gains are attributable to the dual benefits of the discrete 3D Gaussian structure and the proposed physics-aware regularization, which collectively resolve geometric ambiguities that implicit representations struggle to address.
Deterministic and Probabilistic Analysis. While quantifying uncertainty is critical for safety, our primary objective is to mitigate the root cause of epistemic uncertainty—geometric ambiguity—at the representation level. By rigorously constraining 3D Gaussians with physical priors, we inherently reduce geometric variance. Consequently, we employ a deterministic architecture to isolate the efficacy of our synergy mechanism, though extending this explicit representation to probabilistic modeling remains a vital direction for future research.
Implications for Theory and Practice. Theoretically, our study indicates that discrete, physics-aware modeling offers a more robust solution to ill-posed 3D reconstruction than implicit representations relying solely on photometric consistency. Practically, the proposed framework establishes a cost-effective paradigm: it empowers camera-based systems to approach multi-modal performance by distilling LiDAR geometry during pre-training, without incurring additional hardware costs during inference.
Limitations. We acknowledge two limitations. First, pre-training requires synchronized LiDAR data, which restricts the utilization of large-scale vision-only datasets despite the model being vision-centric at inference. Second, the current framework operates on static sequences, limiting the exploitation of long-term temporal consistency required for modeling complex 4D dynamics.

6. Conclusions

In this work, we propose AES-Gaussian to resolve the geometric ambiguity and computational redundancy in autonomous driving pre-training. By synergizing an Asymmetric Encoder with explicit physical feature injection, our framework effectively utilizes sparse LiDAR intensity as a physical anchor for 3D Gaussian Splatting. Empirical results on the nuScenes dataset demonstrate state-of-the-art performance, achieving a 7.0% NDS improvement in 3D object detection and 41.3% mIoU in semantic occupancy prediction, while significantly reducing geometric reconstruction errors. Future work will extend this paradigm to model 4D temporal dynamics and explore its integration into end-to-end planning pipelines.

Author Contributions

Conceptualization, D.Z. and B.Y.; methodology, D.Z. and C.Q.; software, J.J. and Z.Y.; validation, C.H. (Chengjun Huang), B.L. and C.Y.; formal analysis, D.Z. and B.L.; investigation, D.Z. and C.H. (Chengjun Huang); resources, B.Y. and C.H. (Chen Hua); data curation, D.Z. and J.J.; writing—original draft preparation, D.Z.; writing—review and editing, B.Y. and C.H. (Chen Hua); visualization, D.Z. and C.Y.; supervision, B.Y. and C.H. (Chen Hua); project administration, B.Y.; funding acquisition, B.Y. and C.H. (Chen Hua). All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Youth Innovation Promotion Association of the CAS, grant number Y2021115; the Dreams Foundation of Jianghuai Advance Technology Center, grant number 2023-ZM01G002; the National Natural Science Foundation of China, grant number 62503066; and the Changzhou Science and Technology Program, grant number CJ20250080.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Wang, T.; Zhu, X.; Pang, J.; Lin, D. FCOS3D: Fully Convolutional One-Stage Monocular 3D Object Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; IEEE: New York, NY, USA, 2021; pp. 913–922. [Google Scholar]
  2. Li, Z.; Wang, W.; Li, H.; Xie, E.; Sima, C.; Lu, T.; Qiao, Y.; Dai, J. BEVFormer: Learning Bird’s-Eye-View Representation from LiDAR-Camera via Spatiotemporal Transformers. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 47, 2020–2036. [Google Scholar] [CrossRef] [PubMed]
  3. Huang, Y.; Zheng, W.; Zhang, Y.; Zhou, J.; Lu, J. Tri-Perspective View for Vision-Based 3D Semantic Occupancy Prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; IEEE: New York, NY, USA, 2023; pp. 9223–9232. [Google Scholar]
  4. Chen, D.; Zheng, H.; Zhou, Y.; Li, X.; Liao, W.; He, T.; Peng, P.; Shen, J. Semantic Causality-Aware Vision-Based 3D Occupancy Prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; IEEE: New York, NY, USA, 2023; pp. 24878–24888. [Google Scholar]
  5. Xie, S.; Gu, J.; Guo, D.; Qi, C.R.; Guibas, L.; Litany, O. PointContrast: Unsupervised Pre-training for 3D Point Cloud Understanding. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 574–591. [Google Scholar]
  6. He, K.; Chen, X.; Xie, S.; Li, Y.; Dollár, P.; Girshick, R. Masked Autoencoders Are Scalable Vision Learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; IEEE: New York, NY, USA, 2022; pp. 16000–16009. [Google Scholar]
  7. Yang, H.; Zhang, S.; Huang, D.; Liu, W.; Wang, T.; Ouyang, W.; Bai, L. UniPAD: A Universal Pre-Training Paradigm for Autonomous Driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; IEEE: New York, NY, USA, 2024; pp. 15238–15250. [Google Scholar]
  8. Zhang, H.; Zhou, W.; Zhu, Y.; Yan, X.; Gao, J.; Bai, D.; Cai, Y.; Liu, B.; Cui, S.; Li, Z. VisionPAD: A Vision-Centric Pre-Training Paradigm for Autonomous Driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; IEEE: New York, NY, USA, 2024; pp. 17165–17175. [Google Scholar]
  9. Chen, G.; He, Y.; He, L.; Zhang, H. PISR: Polarimetric Neural Implicit Surface Reconstruction for Textureless and Specular Objects. In Proceedings of the European Conference on Computer Vision (ECCV), Milan, Italy, 29 September–4 October 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 205–222. [Google Scholar]
  10. Mildenhall, B.; Srinivasan, P.P.; Tancik, M.; Barron, J.T.; Ramamoorthi, R.; Ng, R. NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. Commun. ACM 2021, 65, 99–106. [Google Scholar] [CrossRef]
  11. Liang, T.; Xie, H.; Yu, K.; Xia, Z.; Lin, Z.; Wang, Y.; Tang, T.; Wang, B.; Tang, Z. BEVFusion: A Simple and Robust LiDAR-Camera Fusion Framework. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), New Orleans, LA, USA, 28 November–9 December 2022; Volume 35, pp. 10421–10434. [Google Scholar]
  12. Zhou, C.; Yu, L.; Babu, A.; Tirumala, K.; Yasunaga, M.; Shamis, L.; Kahn, J.; Ma, X.; Zettlemoyer, L.; Levy, O. TransFusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model. arXiv 2024, arXiv:2408.11039. [Google Scholar] [CrossRef]
  13. Zhou, Y.; Tuzel, O. VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; IEEE: New York, NY, USA, 2018; pp. 4490–4499. [Google Scholar]
  14. Kerbl, B.; Kopanas, G.; Leimkühler, T.; Drettakis, G. 3D Gaussian Splatting for Real-Time Radiance Field Rendering. ACM Trans. Graph. 2023, 42, 139. [Google Scholar] [CrossRef]
  15. Chhipa, P.C.; Upadhyay, R.; Saini, R.; Lindqvist, L.; Nordenskjold, R.; Uchida, S.; Liwicki, M. Depth Contrast: Self-Supervised Pretraining on 3DPM Images for Mining Material Classification. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 212–227. [Google Scholar]
  16. Pang, Y.; Tay, E.H.F.; Yuan, L.; Liu, W.; Tian, Y. Masked Autoencoders for 3D Point Cloud Self-Supervised Learning. World Sci. Annu. Rev. Artif. Intell. 2023, 1, 2440001. [Google Scholar]
  17. Zheng, W.; Chen, W.; Huang, Y.; Zhang, B.; Duan, Y.; Lu, J. OccWorld: Learning a 3D Occupancy World Model for Autonomous Driving. In Proceedings of the European Conference on Computer Vision (ECCV), Milan, Italy, 29 September–4 October 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 55–72. [Google Scholar]
  18. Yan, Y.; Lin, H.; Zhou, C.; Wang, W.; Sun, H.; Wang, K.; Zhou, H.; Guo, L.; Peng, S.; Zhang, G.; et al. Street Gaussians: Modeling Dynamic Urban Scenes with Gaussian Splatting. In Proceedings of the European Conference on Computer Vision (ECCV), Milan, Italy, 29 September–4 October 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 156–173. [Google Scholar]
  19. Zhou, X.; Lin, Z.; Shan, X.; Wang, Y.; Sun, D.; Yang, M.H. DrivingGaussian: Composite Gaussian Splatting for Surrounding Dynamic Autonomous Driving Scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; IEEE: New York, NY, USA, 2024; pp. 21634–21643. [Google Scholar]
  20. Xu, S.; Li, F.; Jiang, S.; Yuan, H.; Wang, Y.; Zhang, L. GaussianPretrain: A Simple Unified 3D Gaussian Representation for Visual Pre-training in Autonomous Driving. arXiv 2024, arXiv:2411.12452. [Google Scholar]
  21. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; IEEE: New York, NY, USA, 2016; pp. 770–778. [Google Scholar]
  22. Liu, Z.; Mao, H.; Wu, C.Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A ConvNet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; IEEE: New York, NY, USA, 2022; pp. 11976–11986. [Google Scholar]
  23. Philion, J.; Fidler, S. Lift, Splat, Shoot: Encoding Images from Arbitrary Camera Rigs by Implicitly Unprojecting to 3D. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 194–210. [Google Scholar]
  24. Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; IEEE: New York, NY, USA, 2017; pp. 652–660. [Google Scholar]
  25. Caesar, H.; Bankiti, V.; Lang, A.H.; Vora, S.; Liong, V.E.; Xu, Q.; Krishnan, A.; Pan, Y.; Baldan, G.; Beijbom, O. nuScenes: A Multimodal Dataset for Autonomous Driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; IEEE: New York, NY, USA, 2020; pp. 11621–11631. [Google Scholar]
  26. MMDetection3D Contributors. MMDetection3D: OpenMMLab Next-Generation Platform for General 3D Object Detection. 2020. Available online: https://github.com/open-mmlab/mmdetection3d (accessed on 27 December 2025).
  27. Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. arXiv 2017, arXiv:1711.05101. [Google Scholar]
  28. Li, Y.; Chen, Y.; Qi, X.; Li, Z.; Sun, J.; Jia, J. Unifying Voxel-Based Representation with Transformer for 3D Object Detection. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), New Orleans, LA, USA, 28 November–9 December 2022; Volume 35, pp. 18442–18455. [Google Scholar]
Figure 1. The architecture of the proposed LiDAR-Enhanced Gaussian Pretrain.
Figure 2. Detailed architecture of the LiDAR Branch.
Figure 3. Illustration of the Decoupled Gaussian Parameter Decoding strategy.
Figure 4. Qualitative comparison of reconstruction quality.
Figure 5. Qualitative visualization of 3D Object Detection across three scenarios.
Figure 6. Qualitative comparison of 3D Semantic Occupancy Prediction.
Figure 7. Efficiency Analysis of Different Models.
Table 2. Quantitative comparison of reconstruction quality and efficiency. Note that while GaussianPretrain is plagued by scale ambiguity and UniPAD tends to over-smooth geometric details, our method leverages explicit physical anchors to achieve superior performance. Notably, our lowest LPIPS score of 0.261 indicates that AES-Gaussian produces the sharpest reconstruction with minimal blurring artifacts compared to the NeRF baseline.

| Method | Representation | Modality | PSNR ↑ | SSIM ↑ | LPIPS ↓ | L1 (m) ↓ | RMSE (m) ↓ | AbsRel ↓ | Latency (ms) ↓ |
|---|---|---|---|---|---|---|---|---|---|
| GaussianPretrain [20] | 3D-GS | C | 23.89 | 0.695 | 0.278 | 3.150 | 4.510 | 0.142 | 19 |
| UniPAD [7] | NeRF | C + L | 24.10 | 0.702 | 0.285 | 1.850 | 2.850 | 0.115 | 145 |
| Ours (AES-Gaussian) | 3D-GS | C + L | 24.65 | 0.718 | 0.261 | 1.250 | 2.153 | 0.082 | 23 |
Table 3. Comparison of 3D object detection performance on the nuScenes validation set. Our method achieves the optimal trade-off between detection accuracy (NDS) and geometric error (mATE). Percentage improvements over the scratch baseline are indicated in parentheses.

| Pre-Training Strategy | Modality | NDS ↑ | mAP ↑ | mATE ↓ | mASE ↓ | mAOE ↓ | mAVE ↓ | mAAE ↓ |
|---|---|---|---|---|---|---|---|---|
| UVTR-M (Scratch) | C + L | 47.7 | 41.7 | 0.315 | 0.285 | 0.460 | 0.385 | 0.205 |
| GaussianPretrain [20] | C | 51.5 | 45.3 | 0.291 | 0.265 | 0.410 | 0.345 | 0.192 |
| UniPAD [7] (NeRF) | C + L | 53.8 | 47.5 | 0.275 | 0.258 | 0.380 | 0.310 | 0.188 |
| Ours (AES-Gaussian) | C + L | 54.7 (+14.7%) | 48.2 (+15.6%) | 0.258 | 0.252 | 0.365 | 0.295 | 0.180 |
Table 4. Robustness analysis under varying environmental conditions. The NDS (%) scores demonstrate that our method is significantly more resilient to rain and low-light conditions compared to baselines.

| Method | Weather: Sunny | Weather: Rainy | Lighting: Day | Lighting: Night |
|---|---|---|---|---|
| GaussianPretrain [20] | 52.4 | 46.1 | 52.1 | 44.8 |
| UniPAD [7] | 54.5 | 51.8 | 54.3 | 51.2 |
| Ours (AES-Gaussian) | 55.6 | 53.2 | 55.4 | 53.5 |
Table 5. Comparison of 3D Semantic Occupancy Prediction on Occ3D-nuScenes. Our method achieves superior performance in both semantic (mIoU) and geometric (IoU_geo) metrics. Relative percentage improvements over the baseline are indicated in parentheses.

| Method | mIoU (%) ↑ | IoU_geo (%) ↑ | Gain (mIoU) |
|---|---|---|---|
| UVTR-M (Scratch) | 36.5 | 74.2 | - |
| +PointContrast [5] | 37.8 | 75.1 | +1.3 (+3.6%) |
| +UniPAD [7] | 39.0 | 76.3 | +2.5 (+6.8%) |
| +Ours | 41.3 | 78.9 | +4.8 (+13.2%) |
Table 6. Component-wise ablation study. We perform an incremental analysis starting from the Baseline. L-Branch: Asymmetric LiDAR Branch; Ex-Synergy: Explicit Physical Synergy; Decoupled: Decoupled Gaussian Decoder. Metrics are reported on 10% training data. Model C exhibits the most significant geometric improvement (lower mATE/RMSE), validating the effectiveness of our core design.

| Model | L-Branch | Ex-Synergy | Decoupled | NDS ↑ | mATE ↓ | PSNR ↑ | RMSE ↓ |
|---|---|---|---|---|---|---|---|
| A (Baseline) | - | - | - | 42.5 | 0.352 | 22.51 | 5.824 |
| B | ✓ | - | - | 46.7 | 0.295 | 23.10 | 4.105 |
| C | ✓ | ✓ | - | 50.2 | 0.268 | 24.15 | 2.250 |
| D (Ours) | ✓ | ✓ | ✓ | 51.0 | 0.262 | 24.65 | 2.153 |
Table 7. Ablation study on loss components. The physics-aware occupancy loss provides crucial geometric regularization, yielding the best detection performance.

| L_rgb | L_depth | L_occ | NDS (%) ↑ | mATE (m) ↓ |
|---|---|---|---|---|
| ✓ | - | - | 42.5 | 0.352 |
| ✓ | ✓ | - | 47.3 | 0.288 |
| ✓ | ✓ | ✓ | 51.0 | 0.262 |
Table 8. Sensitivity analysis of the occupancy loss weight λ_occ. The model demonstrates stable performance across a wide range of values, achieving optimal results at λ_occ = 1.0.

| λ_occ | 0.1 | 0.5 | 1.0 | 2.0 | 5.0 |
|---|---|---|---|---|---|
| NDS (%) | 49.5 | 50.4 | 51.0 | 50.1 | 48.8 |
Table 9. Impact of masking ratio on downstream detection performance (10% labeled data). The model achieves optimal performance at a moderate ratio of 40%.

| Masking Ratio | 0% | 20% | 40% | 60% | 75% |
|---|---|---|---|---|---|
| NDS (%) | 48.5 | 50.1 | 51.0 | 50.4 | 49.2 |