Article

Semantic-Aligned Multimodal Vision–Language Framework for Autonomous Driving Decision-Making

1 The Intelligent Transportation Systems Research Center, Wuhan University of Technology, Wuhan 430063, China
2 School of Automotive Intelligent Manufacturing, Hubei University of Automotive Technology, Shiyan 442002, China
3 Department of Data and Systems Engineering, The University of Hong Kong, Hong Kong SAR 999077, China
* Author to whom correspondence should be addressed.
Machines 2026, 14(1), 125; https://doi.org/10.3390/machines14010125
Submission received: 24 December 2025 / Revised: 16 January 2026 / Accepted: 18 January 2026 / Published: 21 January 2026
(This article belongs to the Special Issue Control and Path Planning for Autonomous Vehicles)

Abstract

Recent advances in Large Vision–Language Models (LVLMs) have demonstrated strong cross-modal reasoning capabilities, offering new opportunities for decision-making in autonomous driving. However, existing end-to-end approaches still suffer from limited semantic consistency, weak task controllability, and insufficient interpretability. To address these challenges, we propose SemAlign-E2E (Semantic-Aligned End-to-End), a semantic-aligned multimodal LVLM framework that unifies visual, LiDAR, and task-oriented textual inputs through cross-modal attention. This design enables end-to-end reasoning from scene understanding to high-level driving command generation. Beyond producing structured control instructions, the framework also provides natural-language explanations to enhance interpretability. We conduct extensive evaluations on the nuScenes dataset and CARLA simulation platform. Experimental results show that SemAlign-E2E achieves substantial improvements in driving stability, safety, multi-task generalization, and semantic comprehension, consistently outperforming state-of-the-art baselines. Notably, the framework exhibits superior behavioral consistency and risk-aware decision-making in complex traffic scenarios. These findings highlight the potential of LVLM-driven semantic reasoning for autonomous driving and provide a scalable pathway toward future semantic-enhanced end-to-end driving systems.

1. Introduction

Autonomous driving represents a frontier in artificial intelligence research, where the core challenge lies in achieving precise perception, prediction, and planning for vehicles and their surrounding environments within complex and dynamic traffic scenarios. Currently, autonomous driving methodologies are primarily categorized into traditional modular pipelines and end-to-end approaches. While traditional modular pipelines—which decouple perception, prediction, and planning—offer a degree of interpretability, they face significant bottlenecks regarding long-tail scenario handling, generalization capabilities, and system integration. Conversely, end-to-end learning approaches attempt to directly map raw sensor inputs to control outputs via deep neural networks, thereby simplifying the system architecture. However, this “black-box” paradigm suffers from insufficient semantic representation, leading to severe challenges in interpretability and safety assurance. Bridging the “semantic gap” between low-level sensor data and high-level driving decisions has thus emerged as a critical bottleneck constraining the advancement of end-to-end autonomous driving.
In recent years, the rapid evolution of Large Vision–Language Models (LVLMs) has presented new opportunities to address these challenges. LVLMs demonstrate robust capabilities in cross-modal understanding, semantic reasoning, and prior knowledge retrieval, endowing them with the potential to comprehend complex driving scenarios and execute human-like decision-making. To address the semantic deficiencies and limited interpretability of conventional end-to-end methods, this paper proposes a unified decision-making framework based on LVLMs. This framework aims to establish a novel “Perception–Semantics–Decision” paradigm. It first efficiently fuses visual perception, spatial geometry, and textual navigation instructions via a multimodal encoder. Subsequently, leveraging the powerful semantic reasoning capabilities of the LVLM, it synthesizes structured driving decisions based on scene abstraction and causal reasoning.
A core innovation of this work is the design of a dual-channel output mechanism. On the one hand, it generates machine-readable standardized JSON instructions (including action types, target speeds, etc.) to ensure the real-time performance and precision of vehicle control; on the other hand, it simultaneously produces human-readable natural-language descriptions that explain the rationale behind current decisions, significantly enhancing system transparency, interpretability, and trustworthiness.
We conducted comprehensive validations of the proposed framework on the nuScenes dataset and the CARLA simulation platform. Experimental results demonstrate that the proposed method significantly outperforms existing mainstream baselines in terms of driving success rate, trajectory accuracy, and robustness in complex scenarios. The main contributions of this paper are summarized as follows:
(1)
We propose a unified “Perception–Semantics–Decision” autonomous driving framework based on LVLMs, effectively resolving the semantic deficiency of traditional end-to-end models by introducing semantic reasoning.
(2)
We design a dual-channel decision output mechanism comprising machine-readable (JSON) and human-readable (natural language) components, which significantly enhances model interpretability and safety while ensuring control precision.
(3)
Extensive closed-loop and open-loop experiments on nuScenes and CARLA validate the superiority of our framework over State-of-the-Art (SOTA) methods in terms of performance, robustness, and generalization capability.

2. Related Work

In recent years, autonomous driving systems have evolved from traditional modular pipelines to end-to-end learning paradigms. Conventional approaches rely on independent perception, prediction, and planning modules, where the planning component typically adopts classical algorithms such as A* search [1], RRT* [2], or MPC [3,4]. These methods perform reliably in structured environments but depend heavily on hand-crafted rules, making them susceptible to error accumulation and limiting their generalization ability in complex or long-tail scenarios. To reduce manual design and improve system consistency, end-to-end (E2E) approaches have gained increasing attention. Early work such as ALVINN [5] directly mapped sensor observations to control commands, while DeepDriving [6] introduced an affordance-based “direct perception” framework. Subsequent studies combined imitation learning (IL) and deep reinforcement learning (DRL). For instance, CIL [7] utilized conditional commands to enhance navigational consistency, while GAIL [8] learned expert policies through adversarial training. Notably, Kendall et al. [9] were the first to demonstrate DDPG in real-world driving. To further improve policy learning in interactive traffic, advanced hybrid strategies have been introduced, such as CIRL [10], hierarchical reinforcement learning (HRL) [11,12], and multi-agent reinforcement learning (MARL) [13,14,15]. However, despite reducing system complexity, E2E approaches still suffer from limited interpretability, sensitivity to distribution shift, and a lack of high-level semantic reasoning.
To address these limitations, researchers have explored structured intermediate representations to enhance semantic consistency and decision transparency. Bird’s-Eye View (BEV) representations have become a key direction. BEVFormer [16] constructs unified BEV features via spatiotemporal self-attention, while VAD [17] adopts vectorized scene descriptions to improve planning efficiency and collision avoidance. In multimodal fusion, TransFuser [18] integrates image and LiDAR BEV features at multiple scales, and ST-P3 [19] jointly models perception, prediction, and planning through spatiotemporal learning. The interpretable neural motion planner by Zeng et al. [20] further enables controllable planning via cost maps. Although these methods enhance spatial structure modeling, they still lack high-level semantic understanding and explicit intent reasoning.
With the rapid development of LVLMs, cross-modal semantic reasoning has emerged as a promising direction for overcoming these challenges. DriveGPT4 [21] leverages vision instruction tuning to jointly generate control actions and natural-language explanations, while Driving with LLMs [22] incorporates object-centric vector representations into LLMs for interpretable action prediction. Recent hybrid systems further explore the integration of language reasoning with traditional planning modules, such as the hierarchical Chain-of-Thought planner in DriveVLM [23], the high-frequency 3D perception + LVLM fusion in DriveVLM-Dual [24], the decoupled high-low-level control in Senna [24], and VisioPath [25], which uses language-based reasoning to guide differentiable trajectory optimization. Despite these advances, LVLM-based approaches still face challenges in numerical prediction accuracy, real-time performance, and evaluability of language outputs.
Therefore, there is a pressing need for a more engineering-oriented and structured end-to-end framework. Such a framework should effectively integrate high-level semantic reasoning with controllable trajectory decision-making, thereby improving the accuracy, safety, and efficiency of autonomous driving systems.

3. The SemAlign-E2E Framework

The framework proposed in this paper is centered on vision-semantic fusion-driven end-to-end decision-making. As illustrated in Figure 1, the overall processing pipeline consists of four major stages: data acquisition and preprocessing, feature modeling and multimodal fusion, semantic reasoning and task execution, and standardized output for vehicle-level control.
The system first performs real-time perception of the traffic environment through onboard multimodal sensors. Cameras capture semantic visual information, LiDAR provides 3D structural data, and millimeter-wave radar combined with GNSS/HD maps offer localization and task-level constraints. In the preprocessing stage, we apply image rectification, point cloud filtering, temporal synchronization, and extrinsic calibration to obtain temporally and spatially consistent multimodal inputs. The core reasoning stage is powered by Llama-3.2-11B-Vision-Instruct, a state-of-the-art large vision–language model that performs scene understanding and high-level policy generation. Through large-scale image-text pretraining, the model acquires rich traffic semantics, spatial relationships, and behavioral logic, enabling human-like semantic reasoning in complex driving environments. Using a unified vision–language architecture, the model encodes multimodal features via visual and textual encoders and aligns them through cross-modal attention.
To enable effective multimodal understanding within this framework, we adapt Llama-3.2-11B-Vision-Instruct using a lightweight parameter-efficient tuning strategy and a unified tokenization interface for all sensor modalities. RGB images are encoded into a sequence of 256 visual tokens, each projected to a 4096-dimensional embedding. LiDAR point clouds are voxelized using a 0.2 m grid and processed through a PointPillars-style encoder, producing 128 BEV tokens of the same dimensionality. Navigation text is tokenized using the native Llama tokenizer, and all modality tokens are concatenated into a unified sequence supplied to the LVLM.
To align these heterogeneous representations, a text-guided cross-attention module is inserted after the 12th transformer block of Llama-3.2. Text tokens serve as Query vectors, enabling intent-conditioned selection of relevant visual and LiDAR evidence, while the pretrained attention layers retain their capability for semantic reasoning. Instead of full fine-tuning, we apply LoRA adapters (rank = 16, α = 32, dropout = 0.05) to the attention and feed-forward projections within the fusion layers, keeping all remaining LVLM parameters frozen. This design preserves the LVLM’s pretrained semantic priors while enabling efficient adaptation to autonomous-driving scenarios.
During training, the entire model is optimized end-to-end using AdamW with a learning rate of 1 × 10−4 and cosine decay. The learning objective combines maneuver classification, target-speed regression, and short-horizon trajectory prediction, enabling both symbolic-level decisions and continuous control outputs to emerge from the shared multimodal embedding space. This adaptation makes the LVLM capable of performing fine-grained scene reasoning and high-level policy generation in complex driving environments.
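The LoRA adaptation described above can be sketched in a few lines. The following is a minimal, scaled-down illustration (64-dimensional layers instead of 4096, rank 4 instead of 16); the variable names and initialization scheme are illustrative assumptions, not the paper's implementation, though the zero-initialized up-projection matches standard LoRA practice:

```python
import numpy as np

# Minimal LoRA sketch: y = Wx + (alpha/r) * B(Ax).
# Scaled-down stand-ins for the paper's 4096-dim layers, rank=16, alpha=32.
rng = np.random.default_rng(0)
d, r, alpha = 64, 4, 8

W = rng.standard_normal((d, d))            # frozen pretrained weight
A = rng.standard_normal((r, d)) * 0.01     # trainable down-projection
B = np.zeros((d, r))                       # trainable up-projection, zero-init

def lora_forward(x):
    """Frozen path plus scaled low-rank update; only A and B train."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d)
# With B zero-initialized, the adapter is an exact no-op at step 0,
# preserving the pretrained semantic priors before fine-tuning begins.
assert np.allclose(lora_forward(x), W @ x)
```

Zero-initializing B is what lets the frozen LVLM behave identically to its pretrained form at the start of training, so adaptation proceeds from the pretrained priors rather than from noise.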
This allows joint interpretation of image cues, LiDAR BEV structures, and textual instructions. The framework enables understanding of driving intent and generates both structured driving commands and natural-language explanations. For instance, it can interpret that “a pedestrian approaching the lane requires deceleration,” significantly enhancing decision interpretability.
The reasoning outputs are then parsed into executable control parameters, followed by confidence and semantic consistency checks to filter unreliable decisions. In high-risk or uncertain situations, the system automatically switches to a more conservative strategy. The final control commands are executed by the vehicle’s low-level controller, forming a complete end-to-end loop from multimodal perception to planning and control.

4. Unified Multimodal Input Representation

To achieve robust and interpretable driving decisions, the proposed framework incorporates three types of inputs at the modality level: visual perception, spatial context, and textual navigation information. Through a unified Transformer-based fusion encoder, as illustrated in Figure 2, these heterogeneous modalities are projected into a shared semantic embedding space, providing a coherent representation foundation for subsequent causal reasoning and task-oriented planning.
In this study, we adopt Llama-3.2-11B-Vision-Instruct as the multimodal backbone model to jointly process visual, LiDAR, and textual inputs. The model exhibits strong image–language understanding capabilities, where its visual branch contains a vision adapter that projects visual features into the same embedding space as linguistic features. To ensure effective multimodal fusion, RGB images, LiDAR point clouds, and textual navigation instructions are encoded separately. We provide below the definitions of all variables used in the subsequent formulations.
For an RGB image input:
$$I \in \mathbb{R}^{H \times W \times 3}$$
where $H$ and $W$ denote image height and width and the three channels correspond to RGB values, features are extracted using a visual encoder $\mathrm{Enc}_{vision}(\cdot)$, consisting of convolutional layers and Vision Transformer (ViT) blocks. These components progressively capture both local and global semantic patterns. The encoded visual feature is defined as:
$$f_{img} = \mathrm{Enc}_{vision}(I)$$
The resulting feature vector $f_{img} \in \mathbb{R}^{d_v}$ resides in a visual embedding space whose dimensionality is set to $d_v = 4096$, ensuring compatibility with the language embedding dimension of Llama-3.2.
In multi-camera systems, extracted visual features may be further transformed into a unified world coordinate frame via inverse-perspective mapping or BEV projection, ensuring spatial consistency and facilitating alignment with 3D point cloud representations.
For a LiDAR point cloud:
$$P = \{ p_i \}_{i=1}^{N}, \quad p_i = (x_i, y_i, z_i)$$
each point represents a 3D spatial coordinate. To capture accurate geometric structure, a point cloud encoder $\mathrm{Enc}_{lidar}(\cdot)$ based on the PointPillars architecture is used. It includes voxelization, pillar aggregation, and attention-based feature processing. The raw global LiDAR feature is defined as:
$$f_{lidar}^{raw} = \mathrm{Enc}_{lidar}(P), \quad f_{lidar}^{raw} \in \mathbb{R}^{d_{raw}}, \; d_{raw} = 512$$
When using a BEV representation, the point cloud is projected onto a 2D bird's-eye-view plane aligned with the camera coordinate system. To match the embedding dimensionality required for multimodal integration, the raw LiDAR feature is further transformed by a learnable linear projection:
$$f_{lidar} = W_{proj} \, f_{lidar}^{raw}, \quad f_{lidar} \in \mathbb{R}^{4096}$$
The textual input:
$$T = \{ w_1, w_2, \ldots, w_L \}$$
represents navigation instructions or semantic prompts. The text is encoded by a language encoder $\mathrm{Enc}_{text}(\cdot)$ built upon Llama-3.2-11B, using a multi-layer Transformer architecture. The resulting text feature is:
$$f_{text} = \mathrm{Enc}_{text}(T), \quad f_{text} \in \mathbb{R}^{d_t}, \; d_t = 4096$$
In summary, each modality is projected into a unified 4096-dimensional embedding space:
$$f_{img} \in \mathbb{R}^{4096}, \quad f_{lidar} \in \mathbb{R}^{4096}, \quad f_{text} \in \mathbb{R}^{4096}$$
Aligning the feature dimensions ensures consistency within the multimodal Transformer fusion module, enabling effective cross-modal interaction and joint embedding. This unified representation integrates visual perception, spatial understanding, and language-based reasoning, forming the foundation for downstream multimodal scene interpretation and decision-making tasks.
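The dimension alignment and token concatenation described above can be illustrated concretely. The token counts (256 visual, 128 BEV LiDAR) and the 512-to-4096 linear projection follow the text; the text length and random features are placeholders:

```python
import numpy as np

rng = np.random.default_rng(1)
d_model = 4096

# Per-modality token sequences (counts from the framework description).
f_img = rng.standard_normal((256, d_model))       # visual tokens, already 4096-dim
f_lidar_raw = rng.standard_normal((128, 512))     # raw PointPillars BEV tokens
W_proj = rng.standard_normal((512, d_model)) * 0.01

f_lidar = f_lidar_raw @ W_proj                    # learnable projection to 4096
f_text = rng.standard_normal((12, d_model))       # tokenized instruction (length varies)

# Unified sequence supplied to the LVLM.
tokens = np.concatenate([f_img, f_lidar, f_text], axis=0)
assert tokens.shape == (256 + 128 + 12, d_model)
```

Concatenating along the sequence axis is what lets the pretrained transformer attend across modalities without architectural changes to its attention layers.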
To obtain the BEV semantic segmentation and future occupancy forecasting results reported in Table 1, the unified multimodal embedding is further processed using a BEVerse-style BEV prediction head. This head consists of a spatial decoder followed by category-wise segmentation and occupancy forecasting branches, which operate on the 2D BEV feature map derived from the fused multimodal representation. The BEV prediction head is trained jointly with the multimodal LVLM backbone, enabling consistent optimization across perception, prediction, and semantic reasoning tasks.

5. Semantic-Aligned Perception and Decision Making

The LVLM-driven reasoning process designed in this work comprises three stages: perception, semantic reasoning, and decision-making. In the perception stage, the system receives multimodal inputs from various onboard sensors, including front-view RGB images, LiDAR point clouds, and textual navigation instructions. These inputs are then encoded into a unified semantic representation through multimodal feature extraction and spatial alignment.
In the semantic reasoning stage, the SemAlign-E2E leverages its language understanding and cross-modal inference capabilities to analyze the fused representation and perform causal modeling. Through attention mechanisms, the model captures the logical dependencies between task objectives and scene elements, establishes explicit semantic relations and causal chains, and thereby enables the transition from “seeing” to “understanding.” This equips the system with the ability to perform language-driven reasoning in complex traffic scenarios.
Finally, in the decision-making stage, the model generates high-level driving commands based on the inferred semantic representation. The system outputs both machine-readable structured control instructions in JSON format and natural-language explanations, achieving dual-channel decision outputs that are simultaneously executable and interpretable.

5.1. Cross-Modal Feature Encoding and Fusion

In the perception stage, each modality is first encoded independently. The image encoder extracts spatial and semantic features from the RGB input $I$, yielding a high-dimensional representation $f_{img}$. A point cloud encoder processes the LiDAR input $P$ to obtain geometric descriptors, object shapes, and spatial relations, producing the feature vector $f_{lidar}$. The navigation instruction $T$ is encoded into a semantic embedding $f_{text}$, which provides high-level driving intent.
To integrate modality-specific features into a unified semantic space, we employ a Transformer-based fusion module. The fusion process is centered around a text-guided Cross-Attention mechanism, where the navigation embedding acts as the Query, enabling intent-aware selection of relevant visual and geometric cues.
Formally, the Query, Key, and Value vectors are computed as:
$$Q = f_{text} W_Q, \quad K = [f_{img}, f_{lidar}] W_K, \quad V = [f_{img}, f_{lidar}] W_V$$
The multimodal interaction is obtained through:
$$F = \mathrm{softmax}\!\left( \frac{Q K^{\top}}{\sqrt{d_k}} \right) V$$
where $d_k$ is the Key dimension. Using text as the Query allows the model to emphasize scene regions that align with the navigation goal (e.g., lane-keeping, upcoming turns, obstacle avoidance), thereby improving cross-modal alignment and semantic consistency.
The fused representation $F$ serves as a shared semantic feature that supports downstream perception and reasoning tasks. Based on $F$, the model predicts a structured set of scene elements:
$$S = \{ o_i \mid o_i = (cls_i, b_i, a_i), \; i = 1, \ldots, N \}$$
where each element includes a class label $cls_i$, a spatial descriptor $b_i$, and a dynamic attribute vector $a_i$. This symbolic-like representation provides a compact abstraction of the driving scene, forming the foundation for subsequent semantic reasoning and decision-making.
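The text-guided cross-attention at the heart of this fusion stage can be sketched as follows. Dimensions are scaled down from the paper's 4096 for readability, and the random features stand in for real encoder outputs:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # numerically stable
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(2)
d, d_k = 64, 64   # scaled-down embedding sizes

f_text = rng.standard_normal((8, d))     # Query source: navigation tokens
f_scene = rng.standard_normal((24, d))   # Keys/Values: concatenated [f_img, f_lidar]

W_Q = rng.standard_normal((d, d_k)) * 0.1
W_K = rng.standard_normal((d, d_k)) * 0.1
W_V = rng.standard_normal((d, d_k)) * 0.1

Q, K, V = f_text @ W_Q, f_scene @ W_K, f_scene @ W_V
# F = softmax(QK^T / sqrt(d_k)) V: each text token aggregates the scene
# evidence most relevant to the stated navigation intent.
F = softmax(Q @ K.T / np.sqrt(d_k)) @ V
assert F.shape == (8, d_k)
```

Because the Query comes from the instruction, the attention weights act as an intent-conditioned relevance filter over visual and LiDAR tokens, which is the mechanism the text describes for cross-modal alignment.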

5.2. Language-Driven Chain-of-Thought Reasoning

After obtaining the structured semantic representation S from the multimodal fusion stage, the SemAlign-E2E framework performs goal-conditioned causal reasoning to anticipate future interactions and infer context-consistent driving decisions.
By incorporating the navigation goal G , the system is able not only to interpret the current scene but also to project how semantic entities may influence upcoming behaviors, enabling more anticipatory and safety-aware planning.
To support reasoning, the semantic abstraction S encodes object categories, spatial states, and dynamic attributes, jointly summarizing critical contextual cues—such as pedestrian intention, lead-vehicle deceleration, or intersection priority rules—required for downstream decision-making (see Figure 3 for an overview of the reasoning pipeline).
Given the scene semantics S and navigation goal G , the LVLM predicts a distribution over candidate driving actions:
$$\pi(a \mid S, G) = \mathrm{LVLM}(S, G)$$
The core reasoning module is a goal-guided Cross-Attention mechanism, which aligns semantic elements with the driving intent. The navigation goal is encoded as a Query vector, while semantic objects serve as Key–Value pairs:
$$Q = f_{goal} W_Q^{r}, \quad K = S W_K^{r}, \quad V = S W_V^{r}$$
This attention operation highlights the semantic entities most relevant to the current driving objective—such as pedestrians crossing the ego lane or approaching vehicles at intersections—and aggregates these relationships into a reasoning embedding h t , which is subsequently mapped to the driving policy.
Through stacked attention layers, the LVLM performs multi-step relational inference akin to Chain-of-Thought reasoning. The model progressively integrates cues such as:
  • Interaction priorities (e.g., yield to pedestrian);
  • Predicted motion tendencies (e.g., lead vehicle slowing);
  • Traffic-rule implications (e.g., red-light → mandatory stop);
  • Spatial feasibility (e.g., safe gap for lane merging).
This layered reasoning process enables the system to produce interpretable and context-aligned decisions such as “the vehicle should decelerate to yield” or “prepare for a right turn after clearing the intersection.”

5.3. Dual-Channel Interpretable Policy Synthesis

After the semantic reasoning stage, the LVLM generates high-level driving decisions based on the fused semantic representation and the navigation goal. The core objective is to map the scene semantics S and task objective G into an executable driving behavior while adhering to traffic rules and system-level control constraints. The decision process can be formally expressed as:
$$d = \{ \mathrm{maneuver}, \, v, \, \tau \}$$
where $\mathrm{maneuver}$ denotes the action type (e.g., going straight, lane change, deceleration, or stopping), $v$ represents the recommended target speed, and $\tau$ is the planning horizon for executing the action.
During decision generation, the system first constructs a candidate action set A based on the inferred scene semantics and the navigation goal. The selection of feasible actions considers semantic factors such as pedestrian intent, vehicle dynamics, traffic light states, and roadway structure. For instance, if the semantic reasoning stage indicates the presence of a pedestrian ahead or a red traffic signal, the system prioritizes conservative actions such as “decelerate” or “stop.” Once the action type is determined, the model computes an appropriate target speed v , which is adaptively adjusted according to traffic conditions, regulatory constraints, and safety margins. The planning horizon τ is then assigned based on task requirements and road characteristics to ensure coherent and stable execution.
After generating the high-level command, the system outputs both a structured JSON-format instruction for machine execution and a natural-language explanation, enabling downstream control modules to interpret the policy while providing human-understandable justification for the decision, thus enhancing transparency and interpretability.
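A hypothetical instance of this dual-channel output is sketched below. The field names (`maneuver`, `target_speed_mps`, `horizon_s`) are illustrative assumptions; the paper does not specify the exact JSON schema:

```python
import json

# Machine-readable channel: structured decision d = {maneuver, v, tau}.
# Field names and values are hypothetical examples, not the paper's schema.
decision = {
    "maneuver": "decelerate",
    "target_speed_mps": 4.5,
    "horizon_s": 2.0,
}

# Human-readable channel: natural-language justification of the decision.
explanation = ("Decelerating: a pedestrian is approaching the ego lane "
               "ahead of the crosswalk.")

machine_channel = json.dumps(decision)
assert json.loads(machine_channel)["maneuver"] == "decelerate"
```

Emitting the structured channel as strict JSON is what makes the decision directly parseable by the downstream controller, while the free-text channel serves only the human operator.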
To ensure safety and robustness under complex traffic conditions, the decision stage also incorporates rule checking and fallback mechanisms. When a generated action violates traffic rules, exhibits low confidence, or presents potential risk, the system triggers a minimum-risk policy that selects a more conservative alternative. This mechanism ensures that the system maintains stable and safe behavior even in uncertain or extreme scenarios.

5.4. Safety Mechanisms and Robustness Enhancement

In complex traffic environments, ensuring the stability and safety of decision-making is essential. To enhance reliability under long-tail or high-risk scenarios, our framework introduces a hierarchical safety mechanism consisting of three components: Constraint-Based Rule Checking, Uncertainty-Aware Confidence Gating, and a Risk-Minimizing Fallback Strategy. These mechanisms jointly enable robust decision filtering and safety-aware action selection.
(1)
Constraint-Based Rule Checking
The system first validates the action inferred by the LVLM against predefined traffic regulations and kinematic constraints (e.g., red-light compliance, stop-line rules, speed limits, collision-avoidance boundaries).
We define the set of all semantically compliant actions as $A_{total}$. For a given scene state $S_t$, the subset of admissible safe actions is defined as:
$$A_{safe}(S_t) = \{ a \mid C_{rule}(a, S_t) \leq 0 \}$$
where $C_{rule}(\cdot)$ denotes the constraint function encoding traffic rules.
To determine whether a predicted action $a$ is valid, we define an indicator function:
$$I_{safe}(a) = \begin{cases} 1, & a \in A_{safe}(S_t), \\ 0, & \text{otherwise}. \end{cases}$$
If $I_{safe}(a_{pred}) = 0$, the action is rejected and the system immediately triggers the fallback policy.
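A minimal sketch of this rule check follows. The concrete constraints (red-light compliance, a time-to-collision floor) and the scene fields are illustrative assumptions standing in for the paper's full constraint set:

```python
# Constraint function C_rule: <= 0 when the action satisfies all encoded
# traffic rules. Constraints and scene fields here are illustrative.
def c_rule(action, scene):
    if scene["light"] == "red" and action != "stop":
        return 1.0                      # red-light violation
    if action == "accelerate" and scene["ttc_s"] < 2.0:
        return 1.0                      # unsafe time-to-collision
    return -1.0                         # all constraints satisfied

# Indicator I_safe(a): 1 if a is in A_safe(S_t), else 0.
def i_safe(action, scene):
    return 1 if c_rule(action, scene) <= 0 else 0

scene = {"light": "red", "ttc_s": 5.0}
assert i_safe("accelerate", scene) == 0   # rejected -> fallback triggered
assert i_safe("stop", scene) == 1
```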
(2)
Uncertainty-Aware Confidence Gating
LVLM reasoning may be uncertain, especially under ambiguous or out-of-distribution (OOD) scenarios. Instead of using a heuristic sigmoid score, we compute a sequence-level confidence score based on the normalized log-probability of the generated decision tokens.
Let $y = \{ y_1, y_2, \ldots, y_L \}$ be the sequence of LVLM-generated decision tokens. The confidence score is defined as:
$$C_{seq} = \exp\!\left( \frac{1}{L} \sum_{i=1}^{L} \log P(y_i \mid y_{<i}, F_{multi}) \right)$$
where $F_{multi}$ denotes the fused multimodal features (image + LiDAR + text).
The final action is determined by:
$$a_{final} = \begin{cases} a_{LVLM}, & C_{seq} \geq \gamma, \\ a_{fallback}, & \text{otherwise}, \end{cases}$$
where $\gamma$ is a dynamic confidence threshold.
This ensures the system only executes aggressive or complex driving maneuvers when the LVLM is sufficiently confident in its semantic understanding.
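The confidence gate reduces to the geometric mean of the per-token probabilities, as the short sketch below shows; the token probabilities and the threshold value are made-up illustrations:

```python
import math

# C_seq = exp(mean of token log-probabilities) = geometric mean of P(y_i).
def sequence_confidence(token_probs):
    return math.exp(sum(math.log(p) for p in token_probs) / len(token_probs))

gamma = 0.6                               # illustrative confidence threshold
token_probs = [0.9, 0.8, 0.95, 0.7]       # hypothetical decision-token probs
c_seq = sequence_confidence(token_probs)

# Gate: execute the LVLM action only when confidence clears the threshold.
a_final = "a_lvlm" if c_seq >= gamma else "a_fallback"

# exp(mean log p) equals the L-th root of the product of probabilities.
assert abs(c_seq - (0.9 * 0.8 * 0.95 * 0.7) ** 0.25) < 1e-9
```

Using the geometric rather than arithmetic mean means a single very low-probability token (e.g., an out-of-distribution maneuver name) sharply drags down $C_{seq}$, which is the desired behavior for an OOD guard.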
(3)
Risk-Minimizing Fallback Strategy
When the primary LVLM decision is rejected by either the Rule Checker or the Confidence Gate, the system switches to a deterministic fallback strategy. This is formulated as an optimization problem that selects the action minimizing a predefined risk cost function.
The fallback action is given by:
$$a_{fallback} = \arg\min_{a \in A_{basic}} \left[ \lambda_1 J_{collision}(a, S_t) + \lambda_2 J_{comfort}(a) \right]$$
where:
  • A basic : a set of primitive safe maneuvers (e.g., maintain-lane, slow-down, emergency-stop);
  • J collision : collision-risk cost based on predicted object occupancy or TTC (time-to-collision);
  • J comfort : comfort penalty, discouraging abrupt acceleration or jerk.
This guarantees that the system always converges to a minimum-risk state, even when semantic reasoning partially fails.
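The fallback selection is a small discrete optimization, sketched below. The cost values are hypothetical stand-ins for TTC-based collision risk and jerk penalties, and the weights are illustrative:

```python
# Risk-minimizing fallback over primitive safe maneuvers A_basic.
# All cost values and weights are hypothetical illustrations.
LAMBDA_COLLISION, LAMBDA_COMFORT = 1.0, 0.2

def risk(action):
    j_collision = {"maintain-lane": 0.8, "slow-down": 0.2,
                   "emergency-stop": 0.1}[action]   # stand-in for TTC-based risk
    j_comfort = {"maintain-lane": 0.0, "slow-down": 0.3,
                 "emergency-stop": 1.0}[action]     # stand-in for jerk penalty
    return LAMBDA_COLLISION * j_collision + LAMBDA_COMFORT * j_comfort

A_BASIC = ["maintain-lane", "slow-down", "emergency-stop"]
a_fallback = min(A_BASIC, key=risk)     # argmin over the primitive set
assert a_fallback == "slow-down"        # 0.2 + 0.2*0.3 = 0.26 is the minimum
```

Because $A_{basic}$ contains only a handful of conservative primitives, the argmin is exact and deterministic, which is what guarantees convergence to a minimum-risk state even when the LVLM's semantic reasoning fails.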

6. Experimental Results and Analysis

All experiments are conducted following standard dataset splits to ensure reproducibility. For nuScenes, we adopt the official training, validation, and test partitions, consisting of 700, 150, and 150 driving episodes, respectively. Only the training split is used for model training, while the validation split is used for hyperparameter tuning and ablation studies. The test split is reserved exclusively for final performance reporting.
For the CARLA closed-loop evaluation, we use CARLA 0.9.14 with the standard benchmark towns and routes. Training data are collected in Town01, Town03, and Town04, while generalization and evaluation are performed in Town02, Town05, Town06, and Town07. Weather conditions follow the Dynamic Weather Suite, including ClearNoon, CloudySunset, WetCloudyNoon, HardRainNoon, and SoftRainSunset. Driving routes are selected from the LeaderBoard 2.0 evaluation set, covering intersections, roundabouts, multi-lane urban roads, and highway ramps, thereby ensuring consistency with widely used end-to-end driving benchmarks.
To avoid information leakage, no evaluation towns or routes appear in the training data. All experiments are repeated using three random seeds, and the averaged results are reported.
To comprehensively evaluate the effectiveness of the proposed SemAlign-E2E framework, we conduct experiments on both the nuScenes dataset and the CARLA simulation environment. The evaluation covers five aspects—namely perception and prediction performance, driving performance comparison, qualitative analysis, ablation studies, and inference efficiency—thereby providing a multifaceted assessment of the model’s validity and robustness.

6.1. Perception and Motion Forecasting Performance

We first conduct a systematic and comprehensive evaluation of the model’s semantic perception and future prediction capabilities under the Bird’s-Eye-View (BEV) representation. To ensure fair comparison, we select several representative BEV-based perception and prediction frameworks as baselines, including Lift-Splat-Shoot (LSS), FIERY, ST-P3, and the recently proposed BEVerse. These methods span different methodological directions, ranging from implicit spatial voxelization to spatio-temporal instance prediction and unified panoptic modeling, thus providing a multi-dimensional benchmark for assessing the overall performance of our approach. The comparison results are reported in Table 1.
Table 1 reports several key metrics for BEV semantic segmentation and future prediction. The Drivable, Lane, Vehicle, Pedestrian, and Prediction IoU metrics measure the segmentation accuracy of different semantic categories, whereas PQ, SQ, and RQ evaluate panoptic segmentation quality, instance consistency, and recognition accuracy. These metrics collectively characterize a model’s scene understanding capability in autonomous driving.
On the Lane and Vehicle categories, which are most closely tied to driving safety, our method achieves IoU scores of 38.51% and 39.08%, respectively. These results represent clear improvements over segmentation-based approaches such as LSS and prediction-based models such as FIERY, while remaining competitive with more recent BEV fusion methods including ST-P3 and BEVerse. In the Pedestrian category, our method attains an IoU of 17.49%, outperforming both ST-P3 (14.06%) and BEVerse (16.31%), highlighting its robustness in detecting small and sparsely distributed objects.
For the prediction task, SemAlign-E2E achieves the best overall performance across both segmentation and panoptic metrics, with Pred IoU, PQ, SQ, and RQ reaching 38.54%, 29.83%, 69.56%, and 42.88%, respectively. Notably, RQ improves by 1.17 percentage points (a 2.8% relative gain) over ST-P3, indicating stronger recognition stability in multi-agent interaction scenarios under complex driving environments.

6.2. Closed-Loop Driving Evaluation

To comprehensively evaluate the performance of SemAlign-E2E, we conduct experiments under both the CARLA closed-loop driving environment and the nuScenes open-loop planning benchmark. The former assesses the model’s stability and safety within an actual control loop, while the latter evaluates its semantic understanding and trajectory prediction capabilities. The comparison includes a wide range of representative end-to-end frameworks, such as AD-MLP, UniAD, ST-P3, VAD/VADv2, BEVFormer, DriveGPT4, and DriveTransformer, which cover mainstream methodologies from lightweight perception-to-control pipelines to multimodal BEV modeling and large-model semantic decision-making.
Six key metrics are employed in the evaluation: Driving Score (DS) and Success Rate (SR) measure overall driving quality and task completion; Infraction Penalty (IP) reflects adherence to traffic rules; L2 Error evaluates trajectory prediction accuracy; Collision Rate quantifies safety; and Multi-Ability assesses multi-skill driving performance, including car-following, lane-changing, obstacle avoidance, and turning behaviors.
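As one concrete example among these metrics, the open-loop L2 error can be sketched as a per-waypoint Euclidean average. This is a minimal illustration; exact horizons and averaging conventions differ between benchmarks:

```python
import numpy as np

def avg_l2_error(pred_traj, gt_traj):
    """Mean Euclidean distance (m) between predicted and ground-truth future
    waypoints; one common form of the open-loop L2 metric (the horizon and
    averaging convention vary across benchmarks)."""
    pred, gt = np.asarray(pred_traj), np.asarray(gt_traj)
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

# Toy 3-waypoint horizon: the prediction drifts 1 m laterally at the last step,
# so per-waypoint errors are 0, 0, and 1 m.
pred = [[0.0, 0.0], [5.0, 0.0], [10.0, 1.0]]
gt   = [[0.0, 0.0], [5.0, 0.0], [10.0, 0.0]]
```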
As shown in Table 2, SemAlign-E2E achieves the best or near-best performance across multiple key metrics. It attains a Driving Score of 76.6 and a Success Rate of 87.6%, outperforming both DriveGPT4 and DriveTransformer and demonstrating superior overall driving stability and robustness in complex scenarios. In terms of safety, the collision rate is reduced to 0.40%, the lowest among all methods, indicating stronger safety margins in high-risk situations such as vehicle interactions, pedestrian crossings, and obstacle avoidance. The L2 error decreases to 1.02 m, outperforming DriveGPT4 and DriveTransformer and reflecting more accurate prediction of future motion trajectories. The Infraction Penalty (IP) is reduced from 0.31 (the best baseline, DriveTransformer) to 0.25, suggesting improved rule compliance and semantic consistency. Meanwhile, the Multi-Ability score reaches 28.33, the highest among all methods, highlighting the model’s superior multi-task capability across diverse driving scenarios.
Qualitative results further support these findings. In CARLA, the model demonstrates stable behavior in intersections, car-following, and obstacle avoidance scenarios, and maintains reliable lane perception and trajectory planning under nighttime or adverse weather conditions. In nuScenes, SemAlign-E2E produces more continuous and coherent trajectory predictions in multi-agent traffic and complex road structures, reflecting stronger semantic consistency and scene generalization.
In summary, SemAlign-E2E exhibits clear advantages in both closed-loop driving and open-loop prediction, particularly in driving stability, safety performance, temporal consistency, and multi-task capability. These results validate the effectiveness of the proposed cross-modal language-enhanced semantic reasoning mechanism for understanding and decision-making in complex traffic environments.

6.3. Qualitative Analysis and Interpretability

To further demonstrate the model’s behavior in real-world complex scenarios, we evaluate its open-loop planning performance on the nuScenes dataset. nuScenes is one of the most widely used multimodal real-world datasets in autonomous driving research, consisting of 1000 driving scenes of roughly 20 s each, with approximately 1.4 million camera images alongside LiDAR and radar sweeps. The data collection vehicle is equipped with six surround-view RGB cameras, five radar units, and a 32-beam LiDAR, providing 360° visual and depth coverage.
The dataset includes scenes from Singapore and Boston, covering a wide range of traffic environments such as urban roads, intersections, highway entrances, ports, and residential areas. Each keyframe is annotated with high-precision 3D labels, semantic maps, road topology, and multi-agent future trajectories. As such, nuScenes serves as a standard benchmark for evaluating the semantic understanding and trajectory prediction capabilities of end-to-end autonomous driving models.
However, nuScenes does not natively provide navigation instructions or textual goal descriptions. To incorporate the navigation-text modality required by our framework, we construct an auxiliary set of human-annotated navigation prompts. For each keyframe, annotators examine the corresponding ego-trajectory and HD-map topology—such as upcoming intersections, lane merges, and branching points—and generate concise, goal-aligned navigation descriptions (e.g., “keep straight for 50 m,” “prepare to turn left at the next intersection,” “merge right after the divider”).
These annotations are added only at decision-relevant frames to ensure semantic effectiveness while preventing redundancy. During training and inference, the generated navigation text is concatenated with the visual and LiDAR inputs and fed into the LVLM as part of the multimodal token sequence. This controlled annotation procedure ensures reproducibility and maintains consistency with the original driving behaviors in nuScenes without introducing external knowledge.
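The concatenation step described above can be sketched at the shape level. All dimensions, token counts, and the helper name below are assumptions for illustration, not the released implementation:

```python
# Illustrative sketch of how the navigation text is concatenated with visual
# and LiDAR tokens into one LVLM input sequence (shapes are hypothetical).
import numpy as np

def build_multimodal_sequence(visual_tokens, lidar_tokens, text_tokens):
    """Concatenate per-modality token embeddings along the sequence axis.
    All inputs are (num_tokens, embed_dim) arrays already projected into the
    shared embedding space by their modality encoders."""
    return np.concatenate([visual_tokens, lidar_tokens, text_tokens], axis=0)

d = 256                              # shared embedding width (assumed)
seq = build_multimodal_sequence(
    np.zeros((196, d)),              # n_v visual patch tokens
    np.zeros((64, d)),               # n_g LiDAR-derived geometric tokens
    np.zeros((12, d)),               # n_t navigation-text tokens
)
# total sequence length n = n_v + n_g + n_t = 272
```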
As illustrated in Figure 4, SemAlign-E2E demonstrates strong adaptability, semantic understanding, and decision stability across a variety of complex traffic environments. The evaluation covers intersections, urban roads, unstructured parking lots, and nighttime driving scenarios, enabling assessment of the model’s generalization capability under varying illumination conditions, traffic densities, and semantic complexities. The results show that, supported by its multimodal fusion mechanism, the model accurately identifies key elements such as traffic lights, pedestrians, road structures, and obstacles, and consistently produces stable high-level driving decisions.
In intersection scenarios, the model responds appropriately to traffic light states and pedestrian movements. In dense traffic or visually constrained urban environments, it maintains smooth trajectory planning. In unstructured parking lots and other open spaces, SemAlign-E2E effectively distinguishes drivable regions from obstacles, avoiding entry into restricted or hazardous areas and demonstrating strong spatial understanding.
Furthermore, in nighttime or low-light conditions, the complementary strengths of visual, LiDAR, and textual navigation inputs enable the model to sustain reliable perception and prediction performance. Overall, SemAlign-E2E exhibits robust semantic understanding and consistent decision-making across diverse real-world driving scenarios. The multimodal alignment mechanism significantly enhances generalization under varying illumination, traffic complexity, and semantic uncertainty, validating the feasibility and reliability of the proposed approach for end-to-end autonomous driving in real traffic environments.

6.4. Ablation Studies

To analyze the contribution of each module, we conduct a series of ablation experiments; the results are summarized in Table 3.
Collectively, the ablations demonstrate the contribution of each component of SemAlign-E2E to end-to-end driving decision-making. Removing navigation text reduces DS and SR to 74.1% and 81.3%, respectively, while increasing the Collision Rate to 0.56%. This indicates that navigation text provides more than coarse route-level guidance: it constrains high-level intent and helps the model resolve ambiguities in road topology.
Disabling the semantic reasoning module further degrades performance, with the Collision Rate rising from 0.40% to 0.47%. This highlights the role of semantic reasoning in capturing logical and temporal relationships among traffic elements—such as right-of-way and motion trends—ensuring decisions remain consistent with latent scene structures. Cross-modal attention proves to be a critical component; when removed, SR drops significantly to 73.6%, confirming that effective alignment is essential for integrating heterogeneous modalities into a coherent representation.
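The cross-modal attention whose removal causes this drop can be illustrated with a minimal single-head sketch in NumPy. This is didactic only; the dimensions and the assignment of text tokens as queries over fused scene tokens are assumptions, not the paper's exact module:

```python
# Minimal single-head cross-attention: text queries attend over scene
# (visual + LiDAR) keys and values. Didactic sketch, hypothetical shapes.
import numpy as np

def cross_attention(queries, keys, values):
    """softmax(Q K^T / sqrt(d)) V over the key sequence."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)        # (n_q, n_k)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ values                       # (n_q, d_v)

rng = np.random.default_rng(0)
text_q  = rng.standard_normal((12, 64))   # navigation-text queries
scene_k = rng.standard_normal((260, 64))  # fused visual + LiDAR keys
scene_v = rng.standard_normal((260, 64))
fused = cross_attention(text_q, scene_k, scene_v)  # (12, 64)
```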
To evaluate the system’s robustness against sensor degradation, we conducted experiments with varying LiDAR point cloud densities. As shown in Table 3, reducing the density to 50% and 25% leads to a gradual decline in SR to 85.3% and 82.5%, respectively. Notably, even with only 25% density, SemAlign-E2E outperforms the configuration without cross-modal attention, suggesting that our framework leverages visual and textual cues to compensate for sparse geometric data.
Furthermore, a dual-component ablation was performed by simultaneously removing navigation text and semantic reasoning, which yields the lowest DS (68.6%) and SR (72.8%) of all configurations. This sharp decline underlines the synergistic necessity of linguistic guidance and causal reasoning: without this combined semantic layer, the model’s ability to navigate complex urban intersections is severely compromised. Finally, the LR-Visual baseline, which uses downsampled visual features, yielded the highest Collision Rate (0.61%), underscoring the importance of high-quality visual inputs for accurate spatial reasoning.

6.5. Inference Latency Analysis

As shown in Figure 5, the proposed end-to-end framework yields an average latency of about 180 ms from data input to control output. The per-frame latency mainly fluctuates within 170–190 ms, with only a few spikes approaching 200 ms, which indicates reasonably stable temporal behavior without catastrophic stalls in the processing pipeline.
However, this latency is not negligible when compared with typical control frequencies in real autonomous driving systems. In practice, low-level vehicle control often runs at 50–100 Hz (10–20 ms), while high-level decision-making or trajectory planning modules are usually updated at 10–20 Hz, corresponding to a time budget of roughly 50–100 ms per cycle. Under this reference, the current end-to-end latency of ≈180 ms exceeds the budget of high-frequency control loops and is therefore better suited to low-speed or low-frequency operating regimes, where a high-level semantic planner can run at around 5–7 Hz and its outputs are tracked by a faster downstream controller. In other words, the proposed LVLM-based model is more appropriate to serve as a high-level decision module that can be integrated into practical autonomous driving stacks, while further model and system-level optimization would be required for deployment in aggressive, high-speed scenarios.
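The rate relationship described above reduces to simple arithmetic. The frequencies below are the reference values quoted in the text; the helper itself is illustrative:

```python
# Timing sketch of the suggested hierarchy: a ~180 ms LVLM planner (~5.6 Hz)
# whose latest output is tracked by a 50 Hz low-level controller.
PLANNER_LATENCY_MS = 180   # measured average end-to-end latency of the model
CONTROL_PERIOD_MS = 20     # 50 Hz low-level control loop

def control_ticks_per_plan(planner_latency_ms, control_period_ms):
    """Number of control cycles that reuse the same high-level plan
    before a fresh plan arrives."""
    return planner_latency_ms // control_period_ms

planner_rate_hz = 1000 / PLANNER_LATENCY_MS  # ~5.6 Hz, inside the 5-7 Hz regime
```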
Further analysis shows that the total latency decomposes into three components: data preprocessing, model inference, and post-processing. Data preprocessing, including image decoding, point cloud alignment, and modality synchronization, accounts for roughly 20% of the overall latency. Model inference is the primary bottleneck, contributing approximately 60% of the total execution time. This dominance is rooted in the self-attention mechanism of the Transformer-based LVLM, whose computational complexity is O(n²) in the input sequence length n. Here, n is the total length of the multimodal input, i.e., the sum of the visual patches n_v, the LiDAR-derived geometric tokens n_g, and the textual navigation embeddings n_t (n = n_v + n_g + n_t). Self-attention computes an n × n compatibility matrix to capture the dependencies between every pair of tokens, so both the number of dot-product operations and the memory required to store the attention scores grow quadratically in n. Increasing the LiDAR point cloud density or the visual resolution therefore enlarges n directly, and every doubling of the sequence length quadruples the attention cost. Because this quadratic scaling demands intensive floating-point operations and memory bandwidth, the inference stage dominates end-to-end latency. The remaining 20% arises from post-processing, where multimodal outputs are parsed into vehicle control commands. This breakdown confirms that reducing the quadratic complexity of the LVLM is the primary target for deployment-oriented acceleration, especially when high-fidelity spatial reasoning is required in complex urban environments.
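A back-of-the-envelope check makes the quadratic scaling explicit. The token counts and embedding width below are illustrative assumptions:

```python
# The n x n attention score matrix (n = n_v + n_g + n_t) means doubling the
# sequence length quadruples the dot-product work. Counts are hypothetical.
def attention_score_flops(n_v, n_g, n_t, d):
    """Multiply-accumulate count for forming the n x n QK^T score matrix."""
    n = n_v + n_g + n_t
    return n * n * d

base = attention_score_flops(196, 64, 12, 256)      # n = 272
doubled = attention_score_flops(392, 128, 24, 256)  # n = 544: 4x the cost
```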
Nevertheless, through operator fusion and GPU-parallel acceleration strategies, the proposed method achieves lower latency than other models of comparable scale. The occasional latency peaks observed in the curve are largely due to jitter in data buffering or communication synchronization, rather than anomalies in model computation. Further improvements may be obtained through optimized data pipelines and tensor-level pipelining.
As shown in Figure 6, during end-to-end training and validation, the utilization of the four A10 GPUs consistently remains within the 65–70% range, with an average around 67% and only minor fluctuations. This indicates that the computational workload is well balanced across devices throughout the training process. GPU memory consumption also stays relatively stable, demonstrating reliable memory allocation and synchronization behavior under multi-GPU distributed training.
The overall consistency in GPU utilization suggests that the pipeline operates efficiently during data preprocessing, forward and backward computation, and parameter synchronization. In multimodal systems, imbalanced workloads or excessive communication overhead typically appear as reduced utilization or divergent usage patterns across GPUs. However, the monitoring results show no signs of such effects: all GPUs follow similar utilization curves without stalls, idling, or desynchronization. These observations provide evidence for the robustness of the distributed training setup and the effectiveness of the scheduling strategy employed.

7. Conclusions and Future Work

This work presents an end-to-end autonomous driving framework that integrates large vision–language models with multimodal perception. The proposed approach unifies visual understanding, spatial reasoning, and language-guided navigation within a single semantic decision pipeline. Compared with traditional end-to-end architectures, our method provides more structured and interpretable high-level decision-making behavior in complex traffic environments.
Extensive evaluations on the nuScenes dataset and the CARLA simulation platform show that the proposed method achieves competitive or superior performance across driving success rates, trajectory prediction, and multi-agent interaction scenarios. Furthermore, although the current inference latency is better suited to low-frequency decision modules than to high-speed control loops, the results indicate that the framework can be integrated as a high-level planner within practical autonomous driving stacks. With model compression and system-level optimization, such as operator fusion or hardware-accelerated inference, further reductions in latency and improvements in throughput are expected.
The proposed semantic alignment mechanism also highlights a promising research direction, offering a language-enhanced decision paradigm that tightly couples perception, reasoning, and action generation. Nevertheless, several limitations remain. First, real-world generalization under long-tail conditions requires broader testing and domain adaptation strategies. Second, the computational requirements of LVLM-based models pose challenges for onboard deployment, motivating future work on lightweight architectures and efficient inference accelerators. Finally, improved robustness in complex scenarios may be achieved through human feedback, reinforcement learning, and uncertainty-aware training.

Author Contributions

Conceptualization, F.P. and Z.D.; methodology, F.P. and Z.D.; software, F.P. and S.S.; validation, F.P. and S.S.; formal analysis, F.P. and S.S.; investigation, F.P. and S.S.; resources, Z.D.; data curation, S.S.; writing—original draft preparation, F.P. and S.S.; writing—review and editing, Z.D.; visualization, F.P. and S.S.; supervision, Z.D.; project administration, Z.D.; funding acquisition, Z.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded in part by the National Natural Science Foundation of China under Grant 52172393, and in part by the Science and Technology Innovation Special Project of National Industrial Base of Wuhan Science and Technology Bureau under Grant 2023010402010778.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Hart, P.E.; Nilsson, N.J.; Raphael, B. A formal basis for the heuristic determination of minimum cost paths. IEEE Trans. Syst. Sci. Cybern. 1968, 4, 100–107. [Google Scholar] [CrossRef]
  2. Chen, L.; Shan, Y.; Tian, W.; Li, B.; Cao, D. A fast and efficient double-tree RRT*-like sampling-based planner applying on mobile robotic systems. IEEE/ASME Trans. Mechatron. 2018, 23, 2568–2578. [Google Scholar] [CrossRef]
  3. Kuwata, Y.; Teo, J.; Fiore, G.; Karaman, S.; Frazzoli, E.; How, J.P. Real-time motion planning with applications to autonomous urban driving. IEEE Trans. Control Syst. Technol. 2009, 17, 1105–1118. [Google Scholar] [CrossRef]
  4. Liniger, A.; Domahidi, A.; Morari, M. Optimization-based autonomous racing of 1:43 scale RC cars. Optim. Control Appl. Methods 2015, 36, 628–647. [Google Scholar] [CrossRef]
  5. Pomerleau, D.A. ALVINN: An autonomous land vehicle in a neural network. In Proceedings of the Annual Conference on Neural Information Processing Systems, Denver, CO, USA, 27–30 November 1989; pp. 305–313. [Google Scholar]
  6. Chen, C.; Seff, A.; Kornhauser, A.; Xiao, J. Deepdriving: Learning affordance for direct perception in autonomous driving. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 2722–2730. [Google Scholar]
  7. Codevilla, F.; Müller, M.; Dosovitskiy, A.; López, A.M.; Koltun, V. End-to-end driving via conditional imitation learning. In Proceedings of the IEEE International Conference on Robotics and Automation, Brisbane, Australia, 21–25 May 2018; pp. 4693–4700. [Google Scholar]
  8. Finn, C.; Levine, S.; Abbeel, P. Guided cost learning: Deep inverse optimal control via policy optimization. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 20–22 June 2016; pp. 49–58. [Google Scholar]
  9. Kendall, A.; Hawke, J.; Janz, D.; Mazur, P.; Reda, D.; Allen, J.M.; Lam, V.D.; Bewley, A.; Shah, A. Learning to drive in a day. In Proceedings of the International Conference on Robotics and Automation, Montreal, QC, Canada, 20–24 May 2019; pp. 8248–8254. [Google Scholar]
  10. Liang, X.; Wang, T.; Yang, L.; Xing, E. CIRL: Controllable imitative reinforcement learning for vision-based self-driving. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 604–620. [Google Scholar]
  11. Chen, Y.; Dong, C.; Palanisamy, P.; Mudalige, P.; Muelling, K.; Dolan, J.M. Attention-based hierarchical deep reinforcement learning for lane change behaviors in autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–17 June 2019; pp. 3697–3703. [Google Scholar]
  12. Duan, J.; Eben Li, S.; Guan, Y.; Sun, Q.; Cheng, B. Hierarchical reinforcement learning for self-driving decision-making without reliance on labelled driving data. IET Intell. Transp. Syst. 2020, 14, 297–305. [Google Scholar] [CrossRef]
  13. Bacchiani, G.; Molinari, D.; Patander, M. Microscopic traffic simulation by cooperative multi-agent deep reinforcement learning. In Proceedings of the International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS), Montreal, QC, Canada, 13–17 May 2019; pp. 1547–1555. [Google Scholar]
  14. Chen, B.; Xu, M.; Liu, Z.; Li, L.; Zhao, D. Delay-aware multi-agent reinforcement learning for cooperative and competitive environments. arXiv 2020, arXiv:2005.05441. [Google Scholar]
  15. Hu, Y.; Nakhaei, A.; Tomizuka, M.; Fujimura, K. Interaction-aware decision making with adaptive strategies under merging scenarios. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, 3 November 2019; pp. 151–158. [Google Scholar]
  16. Li, Z.; Wang, W.; Li, H.; Xie, E.; Sima, C.; Lu, T.; Yu, Q.; Dai, J. BEVFormer: Learning bird’s-eye-view representation from LiDAR-camera via spatiotemporal transformers. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 2020–2036. [Google Scholar] [CrossRef] [PubMed]
  17. Jiang, B.; Chen, S.; Xu, Q.; Liao, B.; Chen, J.; Zhou, H.; Zhang, Q.; Liu, W.; Huang, C.; Wang, X. Vad: Vectorized scene representation for efficient autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 8306–8316. [Google Scholar]
  18. Chitta, K.; Prakash, A.; Jaeger, B.; Yu, Z.; Renz, K.; Geiger, A. TransFuser: Imitation with transformer-based sensor fusion for autonomous driving. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 12878–12895. [Google Scholar] [CrossRef] [PubMed]
  19. Hu, S.; Chen, L.; Wu, P.; Li, H.; Yan, J.; Tao, D. ST-P3: End-to-end vision-based autonomous driving via spatial-temporal feature learning. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23 October 2022; pp. 533–549. [Google Scholar]
  20. Zeng, W.; Luo, W.; Suo, S.; Sadat, A.; Yang, B.; Casas, S.; Urtasun, R. End-to-end interpretable neural motion planner. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 8652–8661. [Google Scholar]
  21. Xu, Z.; Zhang, Y.; Xie, E.; Zhao, Z.; Guo, Y.; Wong, K.Y.K.; Li, Z.; Zhao, H. DriveGPT4: Interpretable end-to-end autonomous driving via large language model. IEEE Robot. Autom. Lett. 2024, 9, 8186–8193. [Google Scholar] [CrossRef]
  22. Chen, L.; Sinavski, O.; Hünermann, J.; Karnsund, A.; Willmott, A.J.; Birch, D.; Maund, D.; Shotton, J. Driving with LLMs: Fusing Object-Level Vector Modality for Explainable Autonomous Driving. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024; pp. 14093–14100. [Google Scholar]
  23. Tian, X.; Gu, J.; Li, B.; Liu, Y.; Hu, C.; Wang, Y.; Zhan, K.; Jia, P.; Lang, X.; Zhao, H. DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models. arXiv 2024, arXiv:2402.12289. [Google Scholar]
  24. Jiang, B.; Chen, S.; Liao, B.; Zhang, X.; Yin, W.; Zhang, Q.; Huang, C.; Liu, W.; Wang, X. Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving. arXiv 2024, arXiv:2410.22313. [Google Scholar]
  25. Wang, S.; Typaldos, P.; Li, C.; Malikopoulos, A.A. VisioPath: Vision-Language Enhanced Model Predictive Control for Safe Autonomous Navigation in Mixed Traffic. IEEE Open J. Control Syst. 2025, 4, 562–580. [Google Scholar] [CrossRef]
  26. Zhai, J.-T.; Feng, Z.; Du, J.; Mao, Y.; Liu, J.-J.; Tan, Z.; Zhang, Y.; Ye, X.; Wang, J. Rethinking the open-loop evaluation of end-to-end autonomous driving in nuscenes. arXiv 2023, arXiv:2305.10430. [Google Scholar]
  27. Hu, Y.; Yang, J.; Chen, L.; Li, K.; Sima, C.; Zhu, X.; Chai, S.; Du, S.; Lin, T.; Wang, W.; et al. Planning-oriented autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 17853–17862. [Google Scholar]
  28. Chen, S.; Jiang, B.; Gao, H.; Liao, B.; Xu, Q.; Zhang, Q.; Huang, C.; Liu, W.; Wang, X. VADv2: End-to-end vectorized autonomous driving via probabilistic planning. arXiv 2024, arXiv:2402.13243. [Google Scholar]
  29. Jia, X.; You, J.; Zhang, Z.; Yan, J. DriveTransformer: Unified transformer for scalable end-to-end autonomous driving. arXiv 2025, arXiv:2503.07656. [Google Scholar]
Figure 1. Overall pipeline of the proposed end-to-end autonomous driving framework based on LVLM. The pipeline encompasses on-vehicle sensor data acquisition and preprocessing, cloud-side multimodal fusion and semantic reasoning, standardized interface outputs, and control signal execution.
Figure 2. Illustration of the multimodal input representation and fusion framework. RGB images, LiDAR point clouds, and navigation text are first encoded by their respective encoders. The resulting features are then aligned into a unified embedding space through a Transformer-based fusion module, enabling joint cross-modal modeling.
Figure 3. Multimodal semantic-driven reasoning pipeline. This figure illustrates how structured semantics S and navigation goal G are processed through a goal-guided Cross-Attention mechanism inside the LVLM to generate the final driving policy.
Figure 4. (a) Raw front-view image with overlaid predicted trajectory, (b) corresponding depth estimation, and (c) semantic segmentation results. The selected samples cover typical driving environments including intersections, urban roads, unstructured parking lots, and nighttime scenarios.
Figure 5. Inference latency over selected time intervals (Note: The latency corresponds to the per-frame end-to-end processing time—including perception encoding, semantic reasoning, and control output—and is measured under the same GPU environment).
Figure 6. GPU utilization.
Table 1. Comparison of perception and prediction performance under BEV representation.
| Method | Drivable IoU | Lane IoU | Vehicle IoU | Pedestrian IoU | Pred IoU | PQ | SQ | RQ |
|---|---|---|---|---|---|---|---|---|
| FIERY | 71.97 | 33.58 | 38.00 | 17.15 | 36.20 | 27.80 | - | - |
| ST-P3 | 74.38 | 38.47 | 38.79 | 14.06 | 36.89 | 29.10 | 69.77 | 41.71 |
| Lift-Splat-Shoot | 72.0 | 34.0 | 36.0 | 15.0 | - | 26.44 | 67.81 | 38.21 |
| BEVerse | 75.21 | 37.98 | 38.51 | 16.31 | 37.42 | 29.65 | 69.33 | 41.95 |
| Ours (SemAlign-E2E) | 74.69 | 38.51 | 39.08 | 17.49 | 38.54 | 29.83 | 69.56 | 42.88 |
Table 2. Comparison of end-to-end driving performance across CARLA and nuScenes.
| Method | DS | SR (%) | IP | L2 Error (m) | Collision (%) | Multi-Ability |
|---|---|---|---|---|---|---|
| AD-MLP [26] | 62.5 | 68.4 | 0.48 | 2.00 | 1.92 | 15.94 |
| UniAD [27] | 65.2 | 70.1 | 0.44 | 1.85 | 1.59 | 16.75 |
| ST-P3 [19] | 70.2 | 75.6 | 0.37 | 1.33 | 0.62 | 20.22 |
| VAD [17] | 73.9 | 79.8 | 0.35 | 1.20 | 0.52 | 22.62 |
| VADv2 [28] | 74.8 | 80.5 | 0.34 | 1.15 | 0.50 | 24.12 |
| BEVFormer [16] | 72.6 | 78.9 | 0.36 | 1.25 | 0.55 | 21.88 |
| DriveGPT4 [21] | 75.1 | 81.2 | 0.33 | 1.10 | 0.48 | 25.30 |
| DriveTransformer [29] | 76.4 | 83.0 | 0.31 | 1.05 | 0.42 | 27.80 |
| Ours (SemAlign-E2E) | 76.6 | 87.6 | 0.25 | 1.02 | 0.40 | 28.33 |
Table 3. Ablation study results on CARLA.
| Configuration | DS (%) | SR (%) | Collision (%) | IP (%) |
|---|---|---|---|---|
| Full SemAlign-E2E | 76.6 | 87.6 | 0.40 | 0.08 |
| w/o navigation text | 74.1 | 81.3 | 0.56 | 0.15 |
| w/o semantic reasoning | 71.8 | 84.0 | 0.47 | 0.12 |
| w/o Cross-Attention | 69.4 | 73.6 | 0.55 | 0.29 |
| LiDAR (50% density) | 74.3 | 85.3 | 0.45 | 0.10 |
| LiDAR (25% density) | 71.2 | 82.5 | 0.51 | 0.13 |
| w/o (Nav. Text + Reasoning) | 68.6 | 72.8 | 0.48 | 0.21 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Peng, F.; She, S.; Deng, Z. Semantic-Aligned Multimodal Vision–Language Framework for Autonomous Driving Decision-Making. Machines 2026, 14, 125. https://doi.org/10.3390/machines14010125

