Review

Research on World Models for Connected Automated Driving: Advances, Challenges, and Outlook

School of Electronic and Electrical Engineering, Shanghai University of Engineering Science, Shanghai 201620, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(16), 8986; https://doi.org/10.3390/app15168986
Submission received: 1 July 2025 / Revised: 7 August 2025 / Accepted: 13 August 2025 / Published: 14 August 2025

Abstract

Connected Autonomous Vehicle (CAV) technology holds immense potential for enhancing traffic safety and efficiency; however, its inherent complexity presents significant challenges for conventional autonomous driving. World Models (WMs), an advanced deep learning paradigm, offer an innovative approach to address these CAV challenges by learning environmental dynamics and precisely predicting future states. This survey systematically reviews the advancements of WMs in connected automated driving, delving into the key methodologies and technological breakthroughs across six core application domains: cooperative perception, prediction, decision-making, control, human–machine collaboration, and scene generation. Furthermore, this paper critically analyzes the current limitations of WMs in CAV scenarios, particularly concerning multi-source heterogeneous data fusion, physical law mapping, long-term temporal memory, and cross-scenario generalization capabilities. Building upon this analysis, we prospectively outline future research directions aimed at fostering the development of more robust, efficient, and interpretable WMs. Ultimately, this work aims to provide a crucial reference for constructing safe, efficient, and sustainable connected automated driving systems.

1. Introduction

1.1. Background and Motivation

Intelligent transportation systems (ITS) are profoundly transforming human mobility patterns at an unprecedented pace. As a strategic core component of the ITS framework, Connected and Automated Vehicle (CAV) technology [1] harbors revolutionary potential for enhancing traffic safety, operational efficiency, and travel comfort [2] by leveraging real-time interconnection among vehicles, road infrastructure, and other traffic participants. However, the inherent high dynamism, complex openness, and tightly coupled interactions among traffic participants in CAV environments pose unprecedented challenges for conventional autonomous driving technologies, which traditionally rely on passive sensing and pre-defined rules [3]. The progression towards full autonomy is commonly categorized into distinct Levels of Driving Automation (Level 0 to Level 5) by standards such as those of the Society of Automotive Engineers (SAE), where higher levels signify greater vehicle autonomy [4]. In highly unstructured and dynamically evolving traffic scenarios, the inherent limitations of existing methods are becoming increasingly apparent, and these methods struggle to meet the evolving demands of future intelligent transportation systems.
To effectively address the limitations inherent in traditional autonomous driving approaches within the dynamic CAVs environment, researchers have recognized the significant potential of World Models (WMs) [5]. WMs, an innovative concept originating from the broader field of deep learning [6], represent an advanced modeling paradigm capable of deeply learning complex environmental dynamic regularities and precisely predicting future traffic scenarios’ evolutionary states. This capability positions them as a transformative technological force in autonomous driving [7]. By constructing cognitive models of traffic environments that are highly consistent with the real world, WMs empower autonomous driving vehicles to more profoundly understand and accurately predict the intentions, driving decisions, and motion trajectories of other traffic participants [8], ultimately enabling safer, more rational, and highly efficient driving decisions [9] and cooperative control [10] in complex traffic scenarios. As illustrated in Figure 1, research related to World Models has shown a significant growth trend in recent years, particularly evidenced by the surge in publications on major academic platforms, like IEEE Xplore, fully affirming the field’s status as a frontier research hotspot.
Notably, the period from 2015 to 2017 reflects a nascent or ‘dormant’ phase for the term ‘World Model’ within the deep learning community. During this time, the research landscape was dominated by breakthroughs in foundational architectures, exemplified by the revolutionary ResNet [11] from He et al., which was proposed in 2015 and published at CVPR in 2016, as well as the success of model-free reinforcement learning paradigms, highlighted by the groundbreaking achievements of Deep Q-Networks (DQN) [12] and AlphaGo [13]. The concept of the ‘World Model’ had not yet been unified under a concrete and compelling deep learning framework. The turning point occurred around 2018, catalyzed by several seminal papers such as Ha & Schmidhuber’s “World Models”, Hafner et al.’s PlaNet, and the subsequent Dreamer series, which collectively provided practical and powerful implementations. These works, coupled with the growing recognition of the sample inefficiency limitations inherent in model-free methods [14], ignited substantial research interest, leading to the subsequent accelerated growth in publications.
While World Models present tremendous application prospects, their effective implementation in connected automated driving environments still faces numerous scientific and technical challenges that urgently need to be addressed. For instance, how can we construct World Models capable of accurately modeling complex physical laws [15], characterizing the dynamic evolution of traffic scenarios [16], and effectively integrating multi-source heterogeneous data [17]? How can we address the complexities of multi-agent interaction [18], communication delays, and data loss uncertainty [19]? Furthermore, while pursuing technological advancements, critical issues such as ethical safety, responsibility attribution, and societal impacts [20] also require thorough investigation.
This survey aims to systematically review the latest research progress of World Models in connected automated driving environments. The paper first provides a concise introduction to the fundamental concepts and core architecture of World Models, laying the groundwork for subsequent discussions. Subsequently, it delves into the key methodologies, cutting-edge advancements, and typical applications of World Models across six core application domains in connected automated driving: cooperative perception, cooperative prediction, cooperative decision-making, cooperative control, human–machine collaboration, and real-world scene generation. Building on this, the paper systematically analyzes the primary technical challenges and ethical–safety considerations currently being faced, effectively highlighting the limitations and shortcomings of existing approaches. Finally, it offers a forward-looking perspective on future research trends and directions, proposing actionable follow-up prospects. We hope this survey will provide comprehensive and in-depth insights for researchers in related fields, jointly fostering the innovative development and widespread application of connected automated driving technology.

1.2. Review Methodology and Guiding Questions

To ensure this survey provides a scholarly and structured review, we adopted a systematic methodology guided by defined research questions.

1.2.1. Literature Search and Selection Strategy

We employed a multi-stage strategy combining systematic and exploratory approaches to build the comprehensive literature base. (1) Primary Systematic Search: A systematic search was conducted in the IEEE Xplore database, which is central to our field. The exact query used was (“Document Title”:“World Model”) OR (“Document Title”:“World Models”). This initial search formed the basis for our trend analysis in Figure 1 and identified a core set of foundational papers. (2) Secondary Broad Search: We conducted broader searches in other prominent academic databases, including the ACM Digital Library, Google Scholar, and the arXiv pre-print server, using combined keywords such as ((“World Model” OR “World Models”) AND (“Connected Autonomous Vehicle” OR “CAV” OR “autonomous driving”)). (3) Citation Chaining (Snowballing): We manually screened the reference lists of the identified core articles and relevant existing surveys to find additional foundational works. We also tracked recent citations of these key papers to include the latest relevant studies. This process was instrumental in incorporating high-impact works often highlighted at top-tier conferences and within the research community (e.g., through academic forums and leading research publications). (4) Targeted Supplementary Searches: To ensure depth in each of our application domains and challenge areas, we performed targeted searches on Google Scholar for highly specific topics (e.g., “World Model” AND “long-tail scene generation” AND “CAV”).

1.2.2. Inclusion and Exclusion Criteria

During the screening process, which involved reviewing titles, abstracts, and, subsequently, full texts, we applied the following explicit criteria to all identified articles:
Inclusion Criteria: (1) The paper must be written in English. (2) It must be a peer-reviewed journal article, a full paper from a top-tier conference (e.g., CVPR, NeurIPS, and ICRA), or a highly influential pre-print on arXiv. (3) The work must either explicitly identify as a “World Model” or contribute a foundational methodology highly relevant to the core components of the World Model framework (i.e., environmental perception, dynamic prediction, and agent decision-making). (4) The application context is primarily focused on autonomous driving or CAVs, or the proposed methodology is demonstrably transferable and highly influential to the CAV domain (e.g., seminal works in reinforcement learning or generative modeling from domains like game playing).
Exclusion Criteria: (1) Editorials, commentaries, patents, books (unless a specific chapter is seminal), and non-peer-reviewed workshop papers. (2) Papers where the methods are only tangentially related and do not align with the core conceptual framework of World Models.
Additional Criteria for Targeted Searches: For supplementary searches, more specific criteria were applied based on the research question. For instance, studies focusing on scene generation were required to address complex, non-simplistic scenarios and demonstrate clear relevance to vehicle–road collaboration.

1.2.3. Quality Assessment and Data Synthesis

A systematic process for assessing the quality and risk of bias of the selected studies is critical for a rigorous review. Inspired by the systematic spirit of structured appraisal methodologies, such as those from the Joanna Briggs Institute (JBI) [21], we developed and consistently applied a customized seven-item critical appraisal checklist suitable for the diverse research paradigms in our domain. This multi-faceted qualitative assessment strategy, which combines rigor with the flexibility required for a rapidly evolving field, is a robust and common practice in engineering and computer science reviews. Each candidate paper was qualitatively appraised against the following key questions:
  • Clarity of Objectives: Are the research questions, objectives, or hypotheses clearly stated?
  • Significance of Contribution: Is the contribution novel, or does it represent a significant advancement over existing work?
  • Appropriateness of Methodology: Is the chosen methodology (e.g., theoretical, experimental, and system design) appropriate for the stated objectives?
  • Transparency of Methodology: Is the methodology described in sufficient detail to allow for conceptual understanding and, in principle, reproducibility?
  • Validity of Claims: Are the claims and conclusions adequately supported by the results, proof, or arguments presented?
  • Community Impact: Has the work demonstrated a notable impact on the field (e.g., through citations or its role as a foundational reference)?
  • Relevance to the Review: Does the paper’s contribution directly and substantially address one or more of our review’s guiding research questions?
A paper was included in our synthesis only if it received a positive appraisal across these key criteria. This structured, qualitative appraisal process ensures that our synthesized findings, which are detailed throughout Section 3, are based on a body of work that is not only recent but also of high academic quality and low selection bias.

1.2.4. Guiding Questions and Review Structure

We employed a thematic synthesis approach, where the findings from the selected literature were analyzed and structured around four guiding research questions (RQs). This systematic approach ensures that our review is comprehensive, structured, and provides a state-of-the-art critical synthesis. The RQs are as follows:
  • RQ1: What are the foundational architectures and core principles of modern World Models relevant to the autonomous driving domain?
  • RQ2: How are these World Models specifically applied across the six core domains of CAVs, and what are the key methodologies in each domain?
  • RQ3: What are the primary technical, ethical, and safety challenges that currently limit the practical deployment of World Models in CAV scenarios?
  • RQ4: Based on the current advancements and limitations, what are the most promising future research directions?

2. World Model Research Progress

World Models (WMs) are not a new concept; however, they have experienced a resurgence amid the rapid advancement of Artificial Intelligence, particularly deep learning technology [22]. The core idea behind modern WMs lies in their ability to autonomously learn the intrinsic regularities of an environment from vast amounts of data and predict future state evolution [23] without explicitly defined, hand-crafted rules. The generic architecture of such a WM-based agent is illustrated in Figure 2. To ensure prediction accuracy and reliability, WMs typically employ self-supervised learning, comparing predictions against true observations and minimizing the discrepancy (e.g., with a Mean Squared Error (MSE) or similar loss), thereby continuously enhancing the model’s understanding of environmental dynamics [24].
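To make this predict-compare-optimize loop concrete, the following minimal sketch encodes an observation, predicts the next latent state given an action, decodes a predicted next observation, and minimizes an MSE loss against the true next observation. All module names, dimensions, and data are illustrative placeholders rather than the implementation of any cited model.

```python
# Minimal sketch of the self-supervised predict-compare-optimize loop described above.
import torch
import torch.nn as nn

class TinyWorldModel(nn.Module):
    def __init__(self, obs_dim=64, act_dim=2, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim))
        self.dynamics = nn.Sequential(nn.Linear(latent_dim + act_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, obs_dim))

    def forward(self, obs, action):
        z = self.encoder(obs)                                     # current latent state
        z_next = self.dynamics(torch.cat([z, action], dim=-1))    # predicted next latent state
        return self.decoder(z_next)                               # predicted next observation

model = TinyWorldModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
obs, action, next_obs = torch.randn(16, 64), torch.randn(16, 2), torch.randn(16, 64)  # placeholder batch

pred_next_obs = model(obs, action)
loss = nn.functional.mse_loss(pred_next_obs, next_obs)   # compare prediction with the true observation (MSE)
opt.zero_grad(); loss.backward(); opt.step()              # optimize to refine the environmental model
```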
Currently, Recurrent State Space Models (RSSM) [25] and Joint-Embedding Predictive Architectures (JEPA) [26] have emerged as two foundational pillars for constructing modern World Models.
The Recurrent State Space Model (RSSM) offers an elegant and powerful framework for probabilistic state representation and the uncertainty modeling of complex dynamic systems, which are particularly crucial in inherently stochastic environments like traffic. The core architecture of the RSSM is schematically depicted in Figure 3, adapted from the work presented in [27]. Its core idea is to abstract the true state of a system into unobservable latent variables, assuming that system dynamics and observation generation are driven by these latent variables. The RSSM typically comprises three key components: a state transition model that probabilistically describes the evolution of latent variable states over time, capturing dynamic system uncertainties; an observation model that defines how observable outputs are generated from latent variable states; and an initial state distribution that describes the latent variable states at the initial time. By incorporating recurrent connection mechanisms, the RSSM effectively processes temporal input data, leveraging historical information for a more accurate inference and prediction of current states, thereby adapting to complex and continuous dynamic environments. The application of deep neural networks [28] further parameterizes the RSSM, enabling it to learn effective latent variable representations from high-dimensional sensory data and make precise future state predictions. Google DeepMind’s Dreamer series models (e.g., DreamerV2 [29], DreamerV3 [30]) notably utilize the RSSM as their core technical foundation, achieving an exceptional performance across various complex tasks and fully demonstrating the RSSM’s central role in constructing World Models with predictive and uncertainty-handling capabilities.
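The simplified sketch below illustrates the RSSM ingredients described above: a state transition (prior) model over a recurrent deterministic path, a posterior that additionally conditions on the current observation, and an observation model. It is a didactic approximation with arbitrary dimensions, not the Dreamer implementation.

```python
# Simplified RSSM-style components (illustrative only).
import torch
import torch.nn as nn
import torch.distributions as D

class SimpleRSSM(nn.Module):
    def __init__(self, obs_dim=64, act_dim=2, det_dim=64, stoch_dim=16):
        super().__init__()
        self.rnn = nn.GRUCell(stoch_dim + act_dim, det_dim)          # deterministic recurrence
        self.prior_net = nn.Linear(det_dim, 2 * stoch_dim)            # transition model p(z_t | h_t)
        self.post_net = nn.Linear(det_dim + obs_dim, 2 * stoch_dim)   # posterior q(z_t | h_t, o_t)
        self.obs_decoder = nn.Linear(det_dim + stoch_dim, obs_dim)    # observation model

    def _dist(self, stats):
        mean, log_std = stats.chunk(2, dim=-1)
        return D.Normal(mean, log_std.exp().clamp(1e-3, 10.0))

    def step(self, prev_z, prev_h, action, obs=None):
        h = self.rnn(torch.cat([prev_z, action], dim=-1), prev_h)
        prior = self._dist(self.prior_net(h))
        if obs is None:                        # imagination: sample the next latent from the prior
            z = prior.rsample()
            return z, h, prior, None
        posterior = self._dist(self.post_net(torch.cat([h, obs], dim=-1)))
        z = posterior.rsample()                # filtering: condition on the current observation
        return z, h, prior, posterior

rssm = SimpleRSSM()
z, h = torch.zeros(8, 16), torch.zeros(8, 64)                         # initial latent state
z, h, prior, post = rssm.step(z, h, torch.randn(8, 2), torch.randn(8, 64))
recon = rssm.obs_decoder(torch.cat([h, z], dim=-1))                   # reconstruct the observation from latents
```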
The Joint-Embedding Predictive Architecture (JEPA) brings revolutionary insights to the paradigm of World Model construction. A schematic diagram of its predictive architecture is presented in Figure 4. Unlike traditional generative model approaches, the JEPA does not focus on pixel-level precise reconstruction. Instead, it concentrates on learning efficient joint-embedding representations between input and prediction targets, maximizing the consistency between predicted and true target representations through a contrastive learning [31] mechanism. This non-generative and contrastive learning strategy effectively circumvents the computational overhead and detail redundancy associated with pixel-level reconstruction. Instead, it focuses on learning semantically rich feature representations that are crucial for prediction tasks. JEPA constructs positive and negative sample pairs to effectively pull similar samples closer and push dissimilar samples further apart in the latent space, thereby learning highly discriminative feature representations. The construction of its joint-embedding space enables WMs to perform efficient predictions and inferences in low-dimensional representation spaces [32], significantly improving computational efficiency and generalization capabilities. The emergence of the JEPA architecture provides a new technical paradigm for building more efficient and generalizable World Models.
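A compact sketch of the joint-embedding predictive idea, following the contrastive formulation described above: a context encoder and a target encoder map inputs into a shared embedding space, a predictor estimates the target embedding from the context, and an InfoNCE-style loss pulls matching pairs together while pushing non-matching pairs apart. This is an illustrative toy, not the I-JEPA/V-JEPA training recipe.

```python
# Toy joint-embedding predictive sketch with an InfoNCE-style contrastive objective.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyJEPA(nn.Module):
    def __init__(self, input_dim=128, embed_dim=64):
        super().__init__()
        self.context_encoder = nn.Linear(input_dim, embed_dim)
        self.target_encoder = nn.Linear(input_dim, embed_dim)    # often a momentum/EMA copy in practice
        self.predictor = nn.Sequential(nn.Linear(embed_dim, embed_dim), nn.ReLU(),
                                       nn.Linear(embed_dim, embed_dim))

    def forward(self, context, target):
        pred = self.predictor(self.context_encoder(context))     # predicted target embedding
        with torch.no_grad():
            tgt = self.target_encoder(target)                    # true target embedding (no gradient)
        return F.normalize(pred, dim=-1), F.normalize(tgt, dim=-1)

def info_nce(pred, tgt, temperature=0.1):
    logits = pred @ tgt.t() / temperature       # similarity of every prediction to every target in the batch
    labels = torch.arange(pred.size(0))         # matching pairs are positives, all others are negatives
    return F.cross_entropy(logits, labels)

model = TinyJEPA()
context, target = torch.randn(32, 128), torch.randn(32, 128)    # e.g., current and future frame features
loss = info_nce(*model(context, target))
```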
In summary, driven by advancements in deep learning, World Models have evolved from early concepts into a cutting-edge paradigm possessing powerful environmental modeling and future state prediction capabilities. The evolution of architectures like RSSM and JEPA has enabled WMs to demonstrate exceptional performance in handling complex dynamic environments, laying a solid foundation for their subsequent exploration in complex application scenarios such as connected automated driving.

3. World Models in Connected Automated Driving Environments

3.1. Uniqueness and Research Needs of Connected Automated Driving Environments

The Connected and Automated Vehicles (CAVs) environment, serving as the core architecture of future intelligent transportation systems, demonstrates a revolutionary transformative potential by deeply integrating infrastructure and vehicle data, thereby establishing an unprecedented open, interconnected, and intelligent transportation ecosystem. However, the inherent high dynamism, complex openness, and tightly coupled interactions among traffic participants in CAV environments [33] pose unprecedented challenges for conventional autonomous driving technologies, which traditionally rely on passive sensing and pre-defined rules [34]. A profound understanding of the uniqueness of CAV environments and a clear definition of the derived research needs are crucial prerequisites for exploring the application of World Models in CAVs. To this end, Figure 5 illustrates an integrated framework that visually encapsulates how WMs serve as a core component to address these needs by enabling key applications.
The uniqueness of the connected automated driving environment primarily manifests in the following aspects, which impose higher demands on World Models:
(A)
Multi-source Heterogeneous Perception and Fusion: Unlike traditional single-vehicle intelligent systems that primarily rely on onboard sensors, CAV systems can integrate multi-modal data from roadside infrastructure (e.g., high-definition cameras, LiDAR, and millimeter-wave radars) and the vehicles’ own sensors. This provides global, multi-view, and multi-spatiotemporal environmental perception information [8]. While this diverse and heterogeneous data source offers the potential to construct more comprehensive and precise environmental models, it also presents significant challenges to World Models in terms of efficient fusion and unified representation across data formats, coordinate systems, timestamp alignment, and leveraging complementary advantages of different modalities [35].
(B)
Cooperative Decision Making and Global Optimization: The core objective of CAVs is to shift from local optimality for individual vehicles towards the global optimization of traffic flow, for instance, by improving road throughput, reducing congestion, and mitigating accident risks [36]. This implies that World Models must not only consider the driving behavior of the vehicle itself but also profoundly understand and simulate the complex interactions and behavioral intentions among multiple agents in the traffic system [37], thereby providing a reliable prediction basis and decision support for cooperative decision making and control [38].
(C)
Deep Human–Machine Collaboration and Trust Building: As the level of autonomous driving advances, future CAV systems will become complex systems deeply integrating human, vehicle, and road elements [39], with human–machine collaborative driving becoming the norm [40]. This necessitates that World Models not only accurately perceive the environment but also comprehend human drivers’ intentions, driving styles, and behavioral patterns. This capability is essential for achieving natural, efficient, and safe interactions, and for gradually establishing user trust in the system [41].
(D)
High-Quality, Real-World Scene Generation and Validation: The development and testing of CAVs technology urgently require high-quality, diverse, and controllable driving scenario simulation environments, particularly those encompassing rare (corner case) and hazardous scenarios [42]. World Models, leveraging their powerful generative capabilities, are expected to provide realistic and controllable scenario data for CAV systems, thereby accelerating algorithm iterations, testing validation, and standardization [37]. However, ensuring the physical realism and diversity of generated scenarios remains a challenge.

3.2. Research and Applications of World Models in Connected Automated Driving

This section delves into the key application domains where World Models are making significant impacts. To provide a structured and comprehensive overview of the representative works in each domain, we have compiled a detailed summary in Table 1. This table outlines key information for each model, including its publication year, input modalities, core structure, and, most importantly, its key contributions to V2X scenarios. It serves as a roadmap for the detailed discussions in the following subsections.

3.2.1. Cooperative Perception Based on World Models

In Connected Autonomous Vehicle (CAV) environments, cooperative perception aims to fuse multi-source information from vehicle-borne sensors and roadside infrastructure. This fusion overcomes the limitations of traditional single-vehicle perception, leading to a more comprehensive, precise, and robust understanding of the environment. World Models (WMs), with their powerful environmental modeling and data fusion capabilities, are key technologies for enhancing cooperative perception performance. Currently, the application of WMs in the field of cooperative perception primarily manifests in the following core aspects:
Firstly, the efficient fusion of multi-modal and 4D spatiotemporal data serves as a cornerstone for World Models to play their role in cooperative perception. Given the inherent multi-source heterogeneous characteristics of CAV environments, WMs can efficiently integrate sensor data such as images, point clouds, and millimeter-wave radars, and construct high-dimensional spatiotemporal occupancy representations. This provides a more global and accurate environmental awareness across multiple views and time frames. For example, Uniworld [76] innovatively utilizes multi-frame point cloud registration and fusion techniques to generate high-quality 4D occupancy labels, subsequently pre-training a World Model. This enables the model to effectively capture spatiotemporal correlations and dynamic environmental characteristics within image sequences. This methodology offers significant insights into the multi-sensor fusion perception in CAV environments, particularly in complex urban traffic, markedly improving the reliability and robustness of multi-source perception information provided by roadside infrastructure. In contrast to Uniworld’s focus on 4D occupancy labels, Drive-WM [33] stands as the first end-to-end multi-view autonomous driving World Model. It achieves significant improvements in multi-view prediction consistency, thereby enhancing the spatial consistency and completeness of environmental perception by jointly generating multi-view frames and predicting intermediate views between adjacent perspectives. Its multi-view modeling approach is crucial for fully leveraging multi-view cameras deployed in CAV scenarios. Furthermore, MUVO [35] emphasizes enhancing the WM’s perception and prediction capabilities in multi-modal fusion by utilizing geometric information (LiDAR point clouds), deeply integrating point clouds and visual images to predict 3D occupancy grids. This is of critical importance for CAVs to accurately perceive the geometric structure and spatial layout of the surrounding environment.
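As a hedged illustration of multi-modal fusion toward occupancy prediction (not the architecture of Uniworld, Drive-WM, or MUVO), the sketch below fuses pre-computed camera and LiDAR bird’s-eye-view feature maps and predicts per-cell occupancy logits that could be supervised with 4D occupancy labels over time.

```python
# Illustrative camera + LiDAR BEV fusion with an occupancy prediction head.
import torch
import torch.nn as nn

class BEVFusionOccupancy(nn.Module):
    def __init__(self, cam_ch=64, lidar_ch=64, hidden=128, height_bins=8):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(cam_ch + lidar_ch, hidden, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1), nn.ReLU())
        self.occ_head = nn.Conv2d(hidden, height_bins, kernel_size=1)   # per-cell occupancy logits

    def forward(self, cam_bev, lidar_bev):
        fused = self.fuse(torch.cat([cam_bev, lidar_bev], dim=1))   # channel-wise fusion in BEV space
        return self.occ_head(fused)                                  # (B, height_bins, H, W) occupancy logits

model = BEVFusionOccupancy()
cam_bev = torch.randn(2, 64, 100, 100)     # camera features already projected to BEV (assumed given)
lidar_bev = torch.randn(2, 64, 100, 100)   # pillarized/voxelized LiDAR features (assumed given)
occ_logits = model(cam_bev, lidar_bev)     # supervise against occupancy labels from fused sensor data
```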
Beyond the advancements in multi-modal data fusion, World Models also demonstrate immense potential in efficient and generalizable representation learning, addressing the challenges of large-scale, multi-modal perceptual data. WMs achieve this by learning compact latent representations of the environment, which significantly boost data processing efficiency. IRIS [84], for instance, proposes a World Model based on discrete autoencoders (DAE) and autoregressive Transformers. By learning highly compressed discrete latent representations, IRIS significantly improves the model’s sampling efficiency and generalization capabilities. Although IRIS is not directly tailored for cooperative perception tasks, its efficient modeling philosophy offers significant insights for constructing lightweight, real-time cooperative perception models for CAVs. Simultaneously, V-JEPA [44], as a lightweight video prediction model, centers its core on non-generative and contrastive learning strategies, focusing on learning semantically condensed feature representations rather than pixel-level precise reconstructions. This characteristic makes it highly suitable for the deployment of resource-constrained, vehicle-borne, and roadside equipment, where low latency is critical, providing efficient and versatile environmental perceptions for cooperative driving.
The aforementioned representative works have significantly advanced research in the direction of cooperative perception through World Models. Concurrently, other related studies have enriched the field of World Model-driven cooperative perception from various angles. For example, EnerVerse [49], while primarily developed for robotic manipulation, offers valuable insights for CAV cooperative perception through its capabilities in multi-view fusion and scene reconstruction. MILE [86], on the other hand, focuses on constructing multi-view bird’s-eye view (BEV) representations to enhance the completeness and accuracy of scene perception. Collectively, these works have propelled progress in World Models for fusing multi-source heterogeneous data, deepening environmental understanding, and improving generalization capabilities, thereby laying a solid foundation for cooperative perception in CAVs.
Despite these advancements, significant gaps remain in achieving the real-time fusion of high-dimensional, asynchronous data streams from diverse sensors and in ensuring robust performance under adverse weather conditions. These persistent shortcomings underscore the challenges that will be further elaborated in Section 4.

3.2.2. Cooperative Prediction Based on World Models

In connected automated driving environments, the accurate and reliable prediction of traffic flow and traffic participant behavior is crucial for achieving intelligent traffic control and optimization. Cooperative prediction methods based on World Models, by leveraging their powerful dynamic environmental modeling and future scenario inference capabilities, integrate global situational awareness information from vehicle perception and roadside infrastructure. This enables the precise prediction of traffic participant trajectories, traffic event evolution trends, and future traffic states. Compared to traditional prediction methods, World Models can more effectively capture the complex dynamics and uncertainties within traffic systems, significantly enhancing prediction accuracy and robustness, thereby providing a reliable basis for cooperative decision making and control.
World Models demonstrate exceptional capabilities in long-term temporal and multi-modal scenario prediction, owing to their proficiency in environmental dynamic modeling. This means they can process and generate predictions for complex traffic scenarios spanning extended durations, while efficiently fusing multi-source heterogeneous information. For instance, VideoRAG [45], a World Model designed for generating and understanding extremely long-term videos, utilizes a dual-channel architecture and knowledge-enhanced generation method. This allows it to process and predict video content lasting several minutes, maintaining content consistency and coherence. VideoRAG’s long-term temporal modeling capability holds significant value for predicting traffic flow over extended periods in CAV scenarios, as well as vehicle trajectory evolution. In parallel, MotionBridge [47] focuses on generating dynamic video interpolations, showcasing its potential in cooperative prediction through its cross-modal video generation capabilities. MotionBridge can integrate various modalities such as “trajectory strokes,” “key frames,” and “text” for video generation, and incorporates flexible control mechanisms. In cooperative prediction, this assists in fusing multi-source perception data to predict the evolution of future traffic scenarios and adjust predictive behavior based on specific driving scenarios and prediction targets.
Beyond macroscopic scenario evolution prediction, the fine-grained modeling and prediction of individual traffic participants’ behavior patterns and their interactions with the environment are equally critical. World Models can deeply understand and predict individual and collective dynamics. TrafficBots [38], for example, specializes in constructing World Models for autonomous driving simulation and motion prediction. Its core innovation lies in modeling and predicting the behavior of individual agents within a scenario and attempting to learn each agent’s unique “personality.” TrafficBots uses conditional variational autoencoders to learn individual agent behavior patterns, enabling action prediction from a bird’s-eye view perspective. By taking vehicle type and driving intention information acquired from roadside sensors as conditional inputs, TrafficBots can enhance the accuracy and personalization of traffic participant behavior prediction in CAV scenarios. On the other hand, MILE [86] adopts a model-driven imitation learning approach, utilizing a World Model to learn a dynamic driving environment model and achieve autonomous driving through future scenario inference and planning. MILE’s proposed “generalized inference algorithm” can be viewed as an implicit decision evaluation mechanism that selects optimal strategies by imagining different future scenarios and assessing the effectiveness of driving behavior. Although primarily applied to single-vehicle intelligence, its future scenario inference capabilities offer significant insights for CAVs’ cooperative prediction.
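The conditional variational autoencoder idea underlying this style of per-agent behavior modeling can be sketched as follows. The code is purely illustrative, with hypothetical dimensions and a generic conditioning vector standing in for encoded attributes such as vehicle type or driving intention; it is not the TrafficBots implementation.

```python
# Simplified conditional VAE for per-agent trajectory prediction (illustrative only).
import torch
import torch.nn as nn

class BehaviorCVAE(nn.Module):
    def __init__(self, hist_dim=40, cond_dim=8, latent_dim=16, future_dim=30):
        super().__init__()
        self.enc = nn.Linear(hist_dim + future_dim + cond_dim, 2 * latent_dim)   # q(z | history, future, condition)
        self.dec = nn.Sequential(nn.Linear(hist_dim + cond_dim + latent_dim, 128), nn.ReLU(),
                                 nn.Linear(128, future_dim))                      # p(future | history, condition, z)

    def forward(self, history, future, cond):
        mean, log_var = self.enc(torch.cat([history, future, cond], dim=-1)).chunk(2, dim=-1)
        z = mean + torch.randn_like(mean) * (0.5 * log_var).exp()   # reparameterization trick
        recon = self.dec(torch.cat([history, cond, z], dim=-1))
        kl = -0.5 * (1 + log_var - mean.pow(2) - log_var.exp()).sum(-1).mean()
        return recon, kl

model = BehaviorCVAE()
history, future = torch.randn(4, 40), torch.randn(4, 30)   # flattened past and future (x, y) trajectories
cond = torch.randn(4, 8)                                    # e.g., encoded vehicle type / intention from roadside data
recon, kl = model(history, future, cond)
loss = nn.functional.mse_loss(recon, future) + 0.1 * kl     # ELBO-style training objective
```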
The advancements in cooperative prediction based on World Models extend far beyond the aforementioned typical contributions, with other related research also providing valuable explorations in this domain. For instance, DriveDreamer [37] shows potential in generating future driving actions and corresponding scenarios; OccWorld [68] and Think2Drive [34] directly leverage 3D occupancy information to predict environmental evolution and plan autonomous driving actions; STORM [75] utilizes stochastic Transformers to enhance prediction capabilities in complex environments; and TinyLLaVA-Video [46] proposes a small-scale multi-modal video understanding framework, demonstrating the potential for lightweight prediction models on resource-constrained CAV edge computing platforms. While these diverse explorations each have their specific focuses, they collectively contribute invaluable experience in spatiotemporal modeling, multi-modal fusion, and uncertainty handling, offering beneficial insights for CAVs’ cooperative prediction research based on World Models.
While promising, the current works often struggle with long-term prediction accuracy and interpretability, particularly in complex, multi-agent interaction scenarios. The shortcomings in handling uncertainty and rare events highlight the need for more advanced modeling, as will be discussed in Section 4.

3.2.3. Cooperative Decision-Making Based on World Models

Cooperative decision making is a critical component for ensuring the safe and efficient operation of connected automated driving systems. In complex traffic environments, vehicles must collaborate with roadside infrastructure and other traffic participants to jointly formulate optimal driving strategies. World Models, leveraging their powerful prediction and planning capabilities, offer a novel approach for intelligent cooperative decision making. This section will primarily elaborate on the research progress of World Models in cooperative decision making for connected automated driving.
A core application of World Models in cooperative decision making is enabling latent space planning and behavior evaluation. By learning compact latent representations of the environment, World Models can perform efficient planning and decision making in low-dimensional spaces and support the evaluation of future behaviors. For example, Think2Drive [34] is a representative work that directly applies World Models to cooperative decision making. Its core idea lies in utilizing World Models for efficient reinforcement learning in the latent space to achieve quasi-realistic autonomous driving. Think2Drive employs an RSSM architecture to learn the environmental dynamics model and performs planning in latent space, thereby avoiding the computational complexity associated with high-dimensional pixel-space planning. This concept of latent space planning extends naturally to CAV scenarios: for example, a World Model can predict the future states of other vehicles at a complex intersection and thereby enable cooperative path planning and speed control, significantly boosting decision-making efficiency.
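The essence of latent-space planning can be conveyed with a simple random-shooting, model-predictive-control sketch: candidate action sequences are rolled out inside a learned latent dynamics model, scored with a learned reward model, and only the best first action is executed. The dynamics and reward functions below are toy stand-ins; Think2Drive itself learns a policy via RSSM-based reinforcement learning rather than online shooting.

```python
# Conceptual latent-space planning by "imagined" rollouts (illustrative stand-ins only).
import torch

def plan_in_latent_space(dynamics, reward_model, z0, horizon=10, candidates=256, act_dim=2):
    # dynamics(z, a) -> next latent; reward_model(z) -> scalar reward; both assumed pre-trained.
    actions = torch.randn(candidates, horizon, act_dim)        # random-shooting candidate action sequences
    z = z0.expand(candidates, -1).clone()
    total_reward = torch.zeros(candidates)
    for t in range(horizon):
        z = dynamics(z, actions[:, t])                          # roll the world model forward in latent space
        total_reward += reward_model(z).squeeze(-1)             # accumulate predicted reward (e.g., progress, safety)
    best = total_reward.argmax()
    return actions[best, 0]                                     # execute only the first action (MPC style)

# Toy stand-ins for a learned dynamics and reward model:
latent_dim = 32
dynamics = lambda z, a: torch.tanh(z + 0.1 * torch.cat([a, torch.zeros(a.size(0), latent_dim - a.size(1))], dim=-1))
reward_model = lambda z: -z.pow(2).mean(dim=-1, keepdim=True)   # placeholder objective: prefer latents near the origin
first_action = plan_in_latent_space(dynamics, reward_model, torch.randn(1, latent_dim))
```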
The enhancement of cooperative decision-making capabilities through World Models extends beyond their latent space planning abilities. Researchers have further discovered that World Models can better understand and utilize environmental contextual information, thereby empowering more intelligent and safer cooperative decision making. For instance, UNICORN [56] proposes an information-theoretic, context-aware offline meta-reinforcement learning framework for cooperative decision making. UNICORN quantifies and models environmental contextual information using an information-theoretic framework and then leverages this contextual information within offline meta-reinforcement learning to improve decision performance. This context-aware capability, particularly in highly dynamic and complex connected automated driving environments, aids models in learning and comprehending traffic participant behaviors, traffic rules, and road conditions, thereby improving decision-making intelligence and adaptability. Concurrently, in the safety-critical domain of autonomous driving decision making, SafeDreamerV3 [36] specifically focuses on enhancing the safety of reinforcement learning using World Models. The core innovation of SafeDreamerV3 lies in integrating Lagrangian methods into the Dreamer framework, explicitly considering safety constraints to prevent the agent from executing unsafe behaviors. In CAV scenarios, safety constraints (such as maintaining safe distances and adhering to traffic rules) can be incorporated into cooperative decision-making models to train safe cooperative driving strategies, thereby improving the safety and reliability of vehicles in complex scenarios like highway merging ramps and intersection negotiations.
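The Lagrangian relaxation referenced above can be sketched generically as follows: a constrained objective (maximize expected reward subject to an expected safety cost staying below a limit) is converted into an unconstrained one with a learnable multiplier that grows when the constraint is violated. This is a schematic of the general technique, not the SafeDreamerV3 code; the cost limit and statistics are placeholders.

```python
# Generic Lagrangian-relaxation sketch for safety-constrained policy learning.
import torch

cost_limit = 0.05                                   # e.g., tolerated rate of safe-distance violations
log_lambda = torch.zeros(1, requires_grad=True)     # multiplier kept positive via softplus
lambda_opt = torch.optim.Adam([log_lambda], lr=1e-2)

def constrained_policy_loss(expected_reward, expected_cost):
    lam = torch.nn.functional.softplus(log_lambda).detach()
    # The policy maximizes reward while paying a penalty proportional to constraint violation.
    return -(expected_reward - lam * (expected_cost - cost_limit))

def update_multiplier(expected_cost):
    lam = torch.nn.functional.softplus(log_lambda)
    # The multiplier rises when the constraint is violated and decays when it is satisfied.
    multiplier_loss = -lam * (expected_cost.detach() - cost_limit)
    lambda_opt.zero_grad(); multiplier_loss.backward(); lambda_opt.step()

# Toy usage with reward/cost estimates from imagined world-model rollouts:
policy_loss = constrained_policy_loss(torch.tensor(1.2), torch.tensor(0.08))
update_multiplier(torch.tensor(0.08))
```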
Researchers’ exploration of World Model-based cooperative decision making extends far beyond the aforementioned contributions. For example, EfficientZero [88] improves upon MuZero by achieving efficient sample utilization through lightweight optimization, thereby enhancing decision-making planning capabilities and exploration efficiency; this holds significant reference value for resource-constrained CAV systems, as its high sample efficiency directly contributes to faster model convergence and potentially lower computational overhead during planning. Models like CityDreamer4D [48] and StarGen [50] have demonstrated outstanding performance in generating simulation scenarios that are diverse, realistic, and controllable, providing rich scenario resources for the testing and validation of cooperative decision-making algorithms. These multi-dimensional research efforts collectively drive the development of World Models in the CAV cooperative decision-making domain, offering robust support for achieving higher-level intelligent decision making.
However, a primary shortcoming of existing methods is the challenge of ensuring safe and explainable decisions under uncertainty. The trade-off between optimality and safety, along with the lack of transparent reasoning, remains a critical area for future work, which is a central theme in our discussion in Section 4.

3.2.4. Cooperative Control Based on World Models

Cooperative control represents a significant development direction for connected automated driving systems, aiming for higher operational efficiency and safety. While cooperative decision making focuses on strategy formulation, cooperative control emphasizes the execution layer, aiming to achieve the precise coordination of vehicle movements through vehicle–road information exchange. World Models, by virtue of their accurate environmental dynamic prediction and vehicle movement planning capabilities, provide new technical means for realizing fine-grained cooperative control.
A core contribution of World Models in cooperative control is enabling high-precision motion and safety control. They can precisely control vehicle movements and incorporate safety constraints to ensure system reliability. For example, AC3D [52] focuses on improving the performance of video diffusion models in 3D camera controls, thereby enabling more precise motion control. Although AC3D is primarily applied to video generation, its refined modeling and control of 3D camera motion offers a useful reference for precise actuation in CAV cooperative control systems, facilitating fine adjustments of vehicle speed, steering angle, and other parameters to achieve the desired driving behaviors. Crucially, SafeDreamerV3 [36] (whose core innovation was mentioned in Section 3.2.3) integrates Lagrangian methods into the Dreamer framework, explicitly considering safety constraints to prevent agents from performing unsafe actions. This safety-enhanced reinforcement learning approach is vital for CAVs’ cooperative control, especially in high-density, high-dynamic traffic environments. By introducing vehicle spacing safety constraints or collision avoidance constraints, safer cooperative control strategies can be trained, significantly improving system safety and reliability in scenarios such as platoon control or cooperative negotiation at intersections.
In addition to high-precision motion and safety control, World Models also demonstrate immense potential in multi-agent interaction and efficient planning, which are crucial for cooperative control in complex traffic scenarios. WMs support the construction of more sophisticated cooperative control systems by understanding and predicting multi-agent behaviors. For example, ROCKET-1 [61] proposes an open-world interaction model based on visual-temporal context prompting, showcasing significant potential in multi-agent interaction control within the cooperative control domain. ROCKET-1, by comprehending and predicting multi-agent behaviors, supports the development of cooperative control systems capable of managing complex traffic scenarios involving multiple autonomous vehicles, human-driven vehicles, and pedestrians, thereby enabling safe and efficient traffic flow. Furthermore, EfficientZero [88] improves upon MuZero by achieving efficient sample utilization through lightweight optimization, enhancing decision-making planning and exploration efficiency. Its efficient modeling and learning methods are well-suited for resource-constrained CAV control systems, helping to reduce computational and communication overhead and improve system scalability and real-time performance, which is particularly important when controlling large numbers of vehicles and roadside equipment in large-scale CAV systems.
Beyond these, other related research has also contributed valuable technical advancements to World Model-based cooperative control. For instance, TWM [73] utilizes Transformer-XL to capture long-term environmental dependencies, thereby improving prediction accuracy. PlaNet [91] proposes a reinforcement learning method based on latent variable dynamic models, whose ability to model environmental uncertainty contributes to enhancing the robustness of cooperative control systems in uncertain traffic environments. These diverse research efforts collectively drive the development of World Models in the CAV cooperative control domain, offering robust assurance for achieving higher-level intelligent control.
Despite progress in precise control, significant shortcomings persist in guaranteeing stability and safety in large-scale, heterogeneous CAV systems with communication delays. The development of scalable and resilient control strategies is a key prospect that will be explored in Section 4.

3.2.5. Human–Machine Collaboration Based on World Models

Human–machine collaborative driving represents a significant development trend for future intelligent transportation systems. In co-driving scenarios, autonomous driving systems need to understand human drivers’ intentions and behaviors, interact with them naturally and efficiently, and jointly accomplish driving tasks. World Models, leveraging their capabilities in predicting human behavior and understanding complex environments, offer a new technical pathway for achieving more intelligent and human-centric human–machine collaborative driving.
A core application of World Models in human–machine collaboration is multi-modal intention understanding and natural language interaction. By processing and fusing information from various modalities, World Models can significantly enhance their understanding of human intentions and behaviors and support intuitive natural language instruction interactions. For example, V2PE [51] focuses on improving the performance of vision-language models in multi-modal, long-context tasks, demonstrating an immense potential, particularly in human–machine interaction scenarios. V2PE introduces “variable visual position encoding” to enhance the model’s understanding of long-sequence data and complex scenes. In human–machine collaboration, V2PE’s multi-modal long-context processing capabilities can be utilized to integrate drivers’ voice commands, gestures, gaze, and vehicle sensor data, thereby enabling a more accurate understanding of human drivers’ intentions and driving states. This lays a solid foundation for natural and fluent human–machine interactions. Building upon this, Adriver-I [40], as a general driving World Model, further emphasizes leveraging multi-modal large language models (MLLMs) to enhance the World Model’s expressiveness and generalization capabilities. Adriver-I combines MLLMs with Video Latent Diffusion Models (VDMs) to improve the World Model’s understanding and prediction capabilities for driving scenarios, granting it the potential to understand and generate natural language instructions. In the co-driving mode, drivers can interact with the autonomous driving system via natural language commands (e.g., “navigate to the nearest gas station,” “maintain safe following distance”), and Adriver-I can understand and integrate these commands into the World Model’s prediction and planning processes, thus achieving more natural and convenient human–machine collaborative driving.
The ability to generate interactive virtual environments, providing a crucial platform for testing human–machine collaboration strategies and building user trust, marks a significant advancement in human–machine collaboration for connected automated driving environments through World Models. Genie [41] specializes in generating interactive virtual environments, with its core innovation lying in the “interactiveness” of the generated environment. Users can interact with the virtual environment via natural language instructions and observe real-time feedback from the environment. Genie’s model learns spatiotemporal dynamic information and interaction patterns from videos, enabling it to learn “intuitive physics” and “causal relationships” from unlabeled videos. This facilitates the generation of virtual environments that are physically realistic and responsive to interactions. In human–machine collaboration, the Genie model can be utilized to construct CAV human–machine interaction simulation platforms, allowing human drivers to interact with autonomous driving systems in a virtual setting. This enables the testing and evaluation of various human–machine collaborative driving strategies. Furthermore, its interactiveness can also be leveraged to develop more intuitive and natural human–machine interfaces for driving assistance systems, by visualizing the virtual scenarios generated by the World Model, thereby enhancing drivers’ understanding of the system status and intention, fostering trust and cooperation between humans and machines.
Beyond these, more World Model-based human–machine collaboration schemes and strategies are quietly unfolding. For example, S4WM [66] proposes a long-sequence World Model based on the S4 model, which enhances long-term memory capabilities and supports understanding human drivers’ long-term driving habits and intentions. JEPA series models, such as V-JEPA [44] and MC-JEPA [79], have made progress in efficient video representation learning, offering new options for constructing efficient and robust human–machine collaborative driving models. Concurrently, TinyLLaVA-Video [46] provides lightweight multi-modal video understanding capabilities, making it suitable for resource-constrained CAVs’ onboard platforms and supporting voice commands and simple gesture recognition. While these diverse explorations each have their specific focuses, they collectively promote the application of World Models in human–machine collaborative driving and provide a significant impetus for the development of CAV human–machine collaboration technologies.
Nevertheless, current models often lack a deep, common sense understanding of human intent, leading to brittle and unnatural interactions. The shortcoming lies in bridging the gap between pattern recognition and genuine cognitive understanding, a critical prospect for future research detailed in Section 4.

3.2.6. Real-World Scene Generation Based on World Models

High-quality driving scenario simulation environments are critical for the research, development, and testing of autonomous driving technology. They are essential for reducing testing costs, shortening development cycles, and enabling the evaluation of extreme and hazardous scenarios. World Models, leveraging their powerful generative capabilities and ability to learn real-world data distributions, offer a new technical pathway for constructing more realistic, diverse, and controllable driving scenario simulation environments.
World Models can better achieve large-scale and controllable scenario generation. They are capable of generating extensive, unbounded driving scenarios and support fine-grained control over scene elements, which are crucial for simulating diverse and complex traffic situations. For example, CityDreamer4D [48] focuses on generating unbounded, composable 4D urban scenes, aiming to break the limitations of traditional scenario generation models. Its core innovation lies in its compositional generation capabilities and unbounded scene generation, achieved through modules like the “Unbounded Layout Generator,” enabling independent control and combinations of city layouts, instances, buildings, and vehicles. CityDreamer4D’s ability to generate unbounded urban scenes provides unprecedented scenario resources for large-scale urban traffic simulation and testing, particularly suitable for evaluating CAV systems under various complex city layouts, traffic flows, and architectural styles. Similarly, StarGen [50] proposes a spatiotemporal autoregressive framework based on video diffusion models, capable of generating large-scale, controllable, and coherent driving scenarios, with support for fine-grained control over scene elements and driving behaviors. StarGen’s controllable scenario generation capability can be used to produce scenarios with varying traffic densities and driving styles, providing rich scenario resources for CAVs algorithm testing.
In addition to large-scale scenario generation capabilities, World Models have also made significant strides in achieving high fidelity and dynamic reconstruction, enabling them to learn from real-world data and reconstruct highly accurate dynamic driving scenarios. ReconDreamer [58] focuses on driving scene reconstruction and dynamic scene modeling, aiming to generate high-precision, structured, real-world driving scenes. Its core lies in utilizing online repair and large-scale motion modeling techniques to reconstruct high-precision 3D driving scenes from real driving videos and representing these scenes using Gaussian Splatting for efficient rendering and real-time interaction. The high-precision, structured driving scenes generated by ReconDreamer can be used to construct high-fidelity CAV simulation platforms, supporting the testing and validation of various autonomous driving algorithms and cooperative control strategies. Concurrently, while MotionBridge [47] primarily targets dynamic video interpolation, its cross-modal video generation capability offers a unique value in the domain of real-world scene generation. MotionBridge can fuse various modalities, such as vision, trajectories, and semantics, generating richer and more realistic driving scenarios and supporting flexible control, which is of great significance for training CAVs’ motion prediction and trajectory planning models.
Furthermore, DiffDreamer [70] proposes a single-view scene extrapolation method based on conditional diffusion models, enhancing long-term consistency. JEPA series models, such as V-JEPA [44] and I-JEPA [81], have advanced in efficient video representation learning, offering new options for building efficient and lightweight driving scene generation models. DriveDreamer [37], based on a conditional variational autoencoder architecture, can generate diverse and controllable driving scenarios. TinyLLaVA-Video [46] provides lightweight scene understanding and generation capabilities. While these diverse research efforts each have their specific focuses, they collectively promote the development of World Models in real-world driving scene generation technology and provide strong support for the research, development, testing, and validation of CAV systems.
While the generated scenes are visually impressive, their physical realism, dynamic consistency, and coverage of safety-critical corner cases remain significant shortcomings. The prospect of creating truly reliable and comprehensive simulation platforms is a major challenge that we will address in Section 4.

4. Challenges and Future Directions

The application of World Models in connected automated driving environments holds immense potential, heralding a leap in the intelligence level of transportation systems. However, like any emerging technology, the deeper application of World Models in the CAVs domain still confronts a series of severe challenges yet simultaneously presents broad prospects for development. Building upon the advancements reviewed in the previous section, this section provides a critical analysis of these limitations inherent in the current body of work and elaborates on the follow-up prospects. Addressing these challenges head-on and actively exploring future directions will help us more effectively advance the application of World Models in connected automated driving, ultimately building safe, efficient, and sustainable future intelligent transportation systems.

4.1. Technical and Computational Challenges

(A)
Efficient Fusion and Unified Representation of Multi-source Heterogeneous Data: The multi-source heterogeneous nature of data in connected automated driving environments is a salient characteristic. Data collected from vehicle-borne [92] and roadside sensors [93] exhibit significant differences in modality types (e.g., images, point clouds, and millimeter-wave radars), data formats, sampling frequencies, and noise characteristics. Although existing research (e.g., Uniworld [76], Drive-WM [33]) has made preliminary explorations into multi-modal fusion, constructing efficient and versatile fusion frameworks remains a key bottleneck. Future research needs to explore more advanced cross-modal attention mechanisms (drawing inspiration from the JEPA concept in V-JEPA [44]) and multi-sensor association methods based on Graph Neural Networks (GNNs) to achieve deeper levels of data correlation and information propagation. Concurrently, to meet real-time requirements, it is necessary to investigate techniques such as lightweight feature extraction networks, knowledge distillation, model pruning, and quantization [94] to build efficient multi-modal unified representation models.
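As one concrete and purely illustrative instance of the cross-modal attention mechanisms suggested above, the block below lets camera bird’s-eye-view tokens query LiDAR tokens so that each modality can borrow complementary information; all names and dimensions are hypothetical.

```python
# Illustrative cross-modal attention block for camera/LiDAR feature fusion.
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, cam_tokens, lidar_tokens):
        # cam_tokens: (B, N_cam, dim); lidar_tokens: (B, N_lidar, dim)
        fused, _ = self.attn(query=cam_tokens, key=lidar_tokens, value=lidar_tokens)
        return self.norm(cam_tokens + fused)      # residual fusion of the two modalities

block = CrossModalAttention()
cam_tokens, lidar_tokens = torch.randn(2, 200, 128), torch.randn(2, 400, 128)
fused_tokens = block(cam_tokens, lidar_tokens)
```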
(B)
Understanding Physical Laws and Real-World Mapping: Current World Models primarily lean towards data-driven statistical modeling, and their deep understanding and explicit modeling of fundamental laws governing the physical world (e.g., Newton’s laws of motion, conservation of energy) remain insufficient [5]. However, the ultimate goal of CAV systems is to achieve effective control and optimization of real traffic flow, which demands that World Model predictions and decision outcomes precisely map back to the physical world and produce expected, safe, and reliable physical effects. Future research needs to focus on how to explicitly integrate physical knowledge into World Model design [95], for instance, by embedding physical laws as constraints into models via Physics-Informed Neural Networks (PINNs) or by using symbolic regression and similar methods to automatically discover underlying physical laws [96]. Concurrently, attention should be paid to the model’s generalization capabilities across different physical scales and complexities, ensuring its effectiveness and reliability in various real-world environments.
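A minimal sketch of the physics-informed idea, assuming trajectory prediction as the task: alongside the usual data term, predicted trajectories are softly penalized for violating a simple kinematic plausibility constraint (a finite-difference acceleration bound). The constraint and weighting are illustrative, not a validated vehicle dynamics model.

```python
# Physics-informed loss sketch: data term plus a soft kinematic-consistency penalty.
import torch

def physics_informed_loss(pred_traj, true_traj, dt=0.1, a_max=4.0):
    # pred_traj, true_traj: (batch, T, 2) positions in meters; dt in seconds; a_max in m/s^2.
    data_loss = torch.nn.functional.mse_loss(pred_traj, true_traj)
    vel = (pred_traj[:, 1:] - pred_traj[:, :-1]) / dt          # finite-difference velocity
    acc = (vel[:, 1:] - vel[:, :-1]) / dt                      # finite-difference acceleration
    acc_norm = acc.norm(dim=-1)
    # Penalize physically implausible accelerations beyond a_max (soft constraint).
    physics_residual = torch.relu(acc_norm - a_max).pow(2).mean()
    return data_loss + 1.0 * physics_residual

pred = torch.randn(8, 20, 2, requires_grad=True)
true = torch.randn(8, 20, 2)
loss = physics_informed_loss(pred, true)
loss.backward()                                                # gradients respect both the data and physics terms
```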
(C)
Real-time Performance and Computational Efficiency on Edge Devices: Latency and processing capability are pressing concerns, as CAVs demand decisions within milliseconds. A significant shortcoming of many current WM approaches is that they are developed in research settings with abundant computational resources, often overlooking the stringent real-time constraints of in-vehicle systems. The high computational complexity of large Transformer-based architectures [73,75] and iterative planning algorithms can lead to inference latencies that are prohibitive under practical driving conditions. Future research must therefore prioritize the development of lightweight and efficient WMs tailored for edge deployment [46]. This involves exploring optimization techniques such as model quantization, pruning, and knowledge distillation to compress large models without significant performance degradation (a dynamic quantization sketch is given after this list); architectures like JEPA [44,79], which avoid expensive pixel-level reconstruction, represent a promising direction in this regard. Furthermore, hardware–software co-design, which optimizes WM algorithms for specific in-vehicle hardware accelerators (e.g., GPUs, TPUs, and FPGAs) [28], is crucial for maximizing throughput and minimizing latency. Investigating hybrid computing paradigms that divide tasks among in-vehicle, edge, and cloud computing [1] is another key prospect: latency-critical tasks could be handled by a lightweight WM on the vehicle, while more computationally intensive tasks are offloaded to the edge or cloud. Clarifying which of the reviewed models suit each paradigm remains an important follow-up task.
(D)
Memory Enhancement for Long-Term Tasks: In complex and dynamic real-world environments, constructing World Models that can effectively integrate and utilize long-term memory to maintain coherent reasoning and stable decision-making remains a formidable challenge in the CAV domain [97]. Although Transformer architectures have made progress in processing long sequences, they still fall short of truly scalable, human-like long-term memory integration, often exhibiting “context forgetting” and repetitive behavior. Future research needs to explore more efficient and scalable memory mechanisms, such as hierarchical memory structures, external memory modules, and sparse attention. Concurrently, training methods require continued optimization, including contrastive learning, memory-augmented learning, continual learning, and memory replay strategies, to stabilize the model’s long-term memory. Furthermore, effective context management strategies (e.g., sliding windows, memory compression, and information retrieval) are crucial for the fine-grained management and filtering of long-term information under limited computational resources (a sliding-window memory sketch is given after this list).
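To make the cross-modal fusion direction in item (A) concrete, the following is a minimal sketch (in PyTorch) of camera tokens attending to LiDAR tokens through cross-attention, assuming both streams have already been projected to a shared 256-dimensional feature space. The module name, token counts, and random inputs are illustrative assumptions rather than components of any reviewed model.
```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Illustrative cross-modal attention block: camera tokens query LiDAR tokens."""
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, cam_tokens: torch.Tensor, lidar_tokens: torch.Tensor) -> torch.Tensor:
        # Query: camera features; Key/Value: LiDAR features in the same embedding space.
        fused, _ = self.attn(query=cam_tokens, key=lidar_tokens, value=lidar_tokens)
        # Residual fusion yields a unified representation aligned with the camera tokens.
        return self.norm(cam_tokens + fused)

# Toy usage with random stand-ins for per-frame feature tokens.
cam = torch.randn(2, 100, 256)    # (batch, camera tokens, feature dim)
lidar = torch.randn(2, 400, 256)  # (batch, LiDAR pillar/voxel tokens, feature dim)
unified = CrossModalFusion()(cam, lidar)
print(unified.shape)  # torch.Size([2, 100, 256])
```
A deployed fusion stack would additionally require positional and temporal encodings, sensor calibration, and the lightweight backbones, distillation, pruning, and quantization steps discussed in item (A).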
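For item (B), the sketch below shows one way a physics-style residual could be added to a trajectory-prediction loss in the spirit of Physics-Informed Neural Networks. The jerk-based smoothness penalty, the 0.1 s time step, and the weighting factor are illustrative assumptions; real systems would encode richer vehicle dynamics (e.g., kinematic bicycle models or friction limits).
```python
import torch

def physics_informed_loss(pred_xy, target_xy, dt=0.1, lambda_phys=0.1):
    """Data loss plus a simple physics residual penalising kinematically
    implausible predictions (illustrative only).

    pred_xy, target_xy: (batch, T, 2) predicted / ground-truth positions.
    """
    data_loss = torch.mean((pred_xy - target_xy) ** 2)

    # Finite-difference velocity and acceleration from the predicted trajectory.
    vel = (pred_xy[:, 1:] - pred_xy[:, :-1]) / dt
    acc = (vel[:, 1:] - vel[:, :-1]) / dt

    # Residual: penalise jerk (rate of change of acceleration) as a soft
    # smooth-motion constraint, a stand-in for full Newtonian consistency.
    jerk = (acc[:, 1:] - acc[:, :-1]) / dt
    phys_loss = torch.mean(jerk ** 2)

    return data_loss + lambda_phys * phys_loss

# Toy usage with random stand-ins for a 12-step predicted trajectory.
pred = torch.randn(4, 12, 2, requires_grad=True)
gt = torch.randn(4, 12, 2)
loss = physics_informed_loss(pred, gt)
loss.backward()
```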
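For item (C), the snippet below applies post-training dynamic quantization in PyTorch as one low-effort compression step for edge deployment. The two-layer prediction head is a hypothetical stand-in for a world-model component; pruning, distillation, and hardware-specific compilation would follow in a full pipeline.
```python
import torch
import torch.nn as nn

# A stand-in for a small world-model prediction head (illustrative only).
model = nn.Sequential(
    nn.Linear(256, 512),
    nn.ReLU(),
    nn.Linear(512, 256),
)

# Post-training dynamic quantization: weights stored as int8, activations
# quantized on the fly at inference time.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 256)
with torch.no_grad():
    print(quantized(x).shape)  # torch.Size([1, 256])
```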
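For item (D), the following is a minimal sketch of a sliding-window external memory that keeps recent latent states verbatim and compresses older ones into a running summary vector. Mean pooling is used purely for illustration; learned compression, hierarchical memories, or retrieval-based mechanisms would be expected in practice.
```python
from collections import deque
import torch

class SlidingWindowMemory:
    """Keeps the most recent latent states verbatim and folds evicted states
    into a single running-mean summary (illustrative context management)."""
    def __init__(self, window: int = 64, dim: int = 256):
        self.window = window
        self.recent = deque(maxlen=window)
        self.summary = torch.zeros(dim)
        self.n_compressed = 0

    def write(self, state: torch.Tensor) -> None:
        if len(self.recent) == self.window:
            evicted = self.recent[0]
            # Running mean over all evicted states = cheap long-term summary.
            self.summary = (self.summary * self.n_compressed + evicted) / (self.n_compressed + 1)
            self.n_compressed += 1
        self.recent.append(state)

    def read(self) -> torch.Tensor:
        # Context = one long-term summary token followed by the recent states.
        recent = torch.stack(list(self.recent)) if self.recent else torch.empty(0, self.summary.numel())
        return torch.cat([self.summary.unsqueeze(0), recent], dim=0)

mem = SlidingWindowMemory(window=4, dim=8)
for t in range(10):
    mem.write(torch.randn(8))
print(mem.read().shape)  # torch.Size([5, 8]) -> 1 summary + 4 recent states
```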

4.2. Ethical and Safety Challenges

(A)
Responsibility Attribution and Explainability of Cooperative Decision Making and Control: Connected Autonomous Vehicle (CAV) systems are complex systems characterized by deep multi-agent cooperation (involving vehicles, roadside infrastructure, cloud platforms, communication networks, and other traffic participants), so the traditional framework for single-vehicle liability attribution is difficult to apply [98]. In the event of an accident, responsibility attribution may involve multiple factors and become exceptionally complex and ambiguous. Furthermore, the decision-making processes driven by World Models often lack transparency and interpretability, which exacerbates the difficulty of tracing accountability. Future research should integrate Explainable Artificial Intelligence (XAI) techniques into World Model design, leveraging attention mechanisms, visualization techniques, and causal inference models to enhance the transparency and interpretability of the model’s decision-making process. Concurrently, it is necessary to establish liability attribution mechanisms and legal regulations tailored to the characteristics of CAVs, clearly defining the responsibilities and obligations of all participating entities.
(B)
Data Privacy Protection and Information Security in Connected Automated Driving: The effective operation of CAVs relies heavily on large-scale data collection and real-time information exchange, which inevitably involves vast amounts of sensitive user information (e.g., vehicle location, driving trajectories, driving habits, and identity identifiers). Data breaches or misuse would pose severe threats and lead to significant ethical and social issues [99]. Future research must prioritize data privacy protection, exploring and applying advanced techniques such as federated learning, differential privacy, and homomorphic encryption to enable data sharing and value extraction within a secure and trustworthy environment (a minimal differential-privacy sketch is given below). Simultaneously, it is imperative to establish data security management norms, clearly defining permissions, procedures, and responsible entities for data collection, storage, transmission, and usage, while strengthening cybersecurity defenses to counter increasingly severe cyber threats and ensure stable and reliable system operation.
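As a concrete illustration of the privacy-preserving techniques mentioned above, the sketch below clips a client’s gradient update and adds Gaussian noise before it is shared with an aggregation server, in the style of DP-SGD within a federated setting. The clipping norm and noise multiplier are illustrative assumptions; a production system would use audited privacy libraries and explicitly track the cumulative privacy budget.
```python
import torch

def privatize_update(grads, clip_norm=1.0, noise_multiplier=1.1):
    """Clip a client's gradient update and add Gaussian noise before sharing
    it with the aggregation server (a DP-SGD-style sketch, not a full
    accounting of the privacy budget)."""
    flat = torch.cat([g.flatten() for g in grads])
    scale = min(1.0, clip_norm / (flat.norm() + 1e-12))
    noised = []
    for g in grads:
        clipped = g * scale
        noise = torch.randn_like(clipped) * noise_multiplier * clip_norm
        noised.append(clipped + noise)
    return noised

# Toy usage: pretend these are gradients of a small on-vehicle model.
grads = [torch.randn(4, 4), torch.randn(4)]
safe_grads = privatize_update(grads)
```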

4.3. Future Perspectives

Given the immense potential demonstrated by World Models in the connected automated driving domain, and in light of the challenges currently faced, future research should focus on the following key directions to propel CAVs technology towards higher levels of intelligence. To better contextualize these future directions, it is valuable to consider them through the lens of the SAE Levels of Driving Automation. The role and challenges for World Models evolve as we progress towards higher autonomy:
For Level 3 (Conditional Automation), WMs are crucial for robustly predicting scenarios that may require human takeover, quantifying uncertainty, and ensuring safe transitions of control. The primary challenge lies in the model’s reliability and its ability to accurately assess its own confidence.
For Level 4 (High Automation), where the vehicle operates autonomously within a defined Operational Design Domain (ODD), WMs must achieve a profound and robust understanding of all possible scenarios within that domain. The challenges shift towards mastering long-tail edge cases and ensuring long-term operational safety without human oversight.
For Level 5 (Full Automation), WMs face the ultimate challenge of generalization. They must possess common sense reasoning and a deep understanding of physical laws to handle entirely novel situations, making the challenges of physical law mapping and long-term memory discussed previously paramount.
The following research directions are fundamental to addressing these evolving challenges and advancing through the levels of autonomy:
(A)
Deep Integration of World Models with Novel Roadside Sensing Devices: With the deployment of new roadside sensing devices such as high-precision millimeter-wave radar, high-resolution LiDAR, and hyperspectral imagers, research should explore how World Models can integrate deeply with these devices. This integration should span the data, feature, and decision layers, constructing more powerful and comprehensive cooperative perception systems [1]. WMs can leverage their multi-modal information fusion capabilities to achieve precise spatiotemporal alignment, efficient fusion, and deep semantic understanding of multi-source heterogeneous perceptual data (a minimal temporal-alignment sketch is given after this list), significantly enhancing the accuracy, range, and robustness of environmental perception. Furthermore, WMs are expected to enable intelligent control and optimized configuration of roadside sensing devices, leading to more efficient and intelligent cooperative perception. Crucially, this integration must also account for the computational load on roadside units, necessitating latency-aware WMs optimized for edge processing.
(B)
World Model-Driven Global Optimized Control for Connected Automated Driving: Future CAV systems will shift from single-vehicle intelligence towards globally optimized control, with the ultimate goal of significantly improving the operational efficiency and safety of the entire transportation system [100]. The powerful prediction and decision-making capabilities of WMs make them central to achieving this global optimization. Research should explore how WMs can predict future traffic-flow evolution (e.g., traffic volume, congestion status, and bottleneck locations). Based on such high-precision predictions, global optimization strategies such as adaptive traffic signal timing, dynamic lane assignment, and fine-grained highway ramp metering become feasible (a simple timing heuristic driven by predicted volumes is sketched after this list). Additionally, further research can investigate World Model-based Distributed Model Predictive Control (DMPC) to enable vehicles to consider the overall efficiency of the transportation system while satisfying their own demands.
(C)
Construction of World Model-Based CAVs Simulation Platforms and Standards: High-quality, standardized simulation platforms are the cornerstone for innovation in CAVs technology research, development, and application. The powerful capabilities of WMs in scenario generation, data simulation, and multi-agent interaction modeling are expected to drive the construction of a new generation of more realistic, diverse, and scalable CAV simulation platforms [101]. Concurrently, active participation in the formulation and refinement of relevant international and national standards is crucial to promote the standardization of simulation platforms and test data, thereby accelerating technology maturity and deployment.
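As a concrete illustration of the spatiotemporal alignment mentioned in direction (A), the following sketch resamples a roadside-unit (RSU) object track onto the vehicle’s fusion timestamps by linear interpolation. The function name, the shared map-frame assumption, and the 10 Hz/20 Hz rates in the toy usage are illustrative assumptions; real vehicle–infrastructure pipelines would additionally estimate clock offsets and calibrate sensor extrinsics.
```python
import numpy as np

def align_to_vehicle_clock(rsu_times, rsu_positions, vehicle_times):
    """Resample an RSU object track onto the vehicle's sensing timestamps by
    linear interpolation (a minimal temporal-alignment sketch).

    rsu_times: (N,) RSU timestamps in seconds.
    rsu_positions: (N, 2) tracked object positions in a shared map frame.
    vehicle_times: (M,) timestamps at which the vehicle fuses data.
    """
    x = np.interp(vehicle_times, rsu_times, rsu_positions[:, 0])
    y = np.interp(vehicle_times, rsu_times, rsu_positions[:, 1])
    return np.stack([x, y], axis=1)

# Toy usage: a 10 Hz RSU track resampled onto 20 Hz vehicle timestamps.
t_rsu = np.arange(0, 1.0, 0.1)
track = np.stack([t_rsu * 5.0, np.ones_like(t_rsu)], axis=1)
t_veh = np.arange(0, 0.9, 0.05)
print(align_to_vehicle_clock(t_rsu, track, t_veh).shape)  # (18, 2)
```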
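For direction (B), the sketch below shows how per-approach traffic volumes predicted by a world model could drive a simple proportional green-time allocation for adaptive signal timing. It is a toy heuristic under stated assumptions (fixed cycle length, fixed minimum green), not the DMPC formulation referenced above.
```python
def green_splits(predicted_volumes, cycle_s=90, min_green_s=10):
    """Allocate green time across approaches in proportion to the volumes a
    world model predicts for the next cycle (illustrative heuristic only).

    predicted_volumes: dict mapping approach -> predicted vehicles per cycle.
    """
    total = sum(predicted_volumes.values()) or 1
    usable = cycle_s - min_green_s * len(predicted_volumes)
    return {
        approach: min_green_s + usable * volume / total
        for approach, volume in predicted_volumes.items()
    }

# Toy usage with predicted volumes for a four-arm intersection.
print(green_splits({"N": 12, "S": 18, "E": 30, "W": 6}))
```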

5. Conclusions

This survey provides an in-depth analysis of the research trajectory, application scope, critical bottlenecks, and future directions of World Models in connected automated driving environments. It comprehensively demonstrates the immense potential of this emerging technical paradigm in building future intelligent transportation systems. Leveraging their exceptional capabilities in multi-modal data fusion, precise spatiotemporal dynamic prediction, and flexible scene generation, World Models have injected revolutionary impetus into core application domains within connected automated driving, including cooperative perception, cooperative prediction, cooperative decision making, cooperative control, human–machine collaboration, and real-world scene generation, significantly surpassing traditional methods.
Nevertheless, we also clearly recognize that the application of World Models in the CAVs domain is still in its nascent stage, facing numerous challenges that urgently need to be addressed. These include the deep fusion and unified representation of multi-source heterogeneous data, the profound understanding and explicit modeling of physical laws, the effective management and robust maintenance of long-term temporal context, and the clear definition of ethical responsibilities together with the effective mitigation of societal impacts.
Despite the challenging path ahead, we maintain a steadfast confidence in the future developmental prospects of World Models in the connected automated driving domain. As related research continues to deepen and key technologies undergo continuous iterative breakthroughs, we firmly believe that World Models will inevitably become the core driving engine for constructing safe, efficient, and sustainable connected automated driving systems, contributing immense power to achieving a smarter, more human-centric, greener, and safer future transportation.

Funding

This research received no external funding.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
CAVs: Connected Autonomous Vehicles
V2X: Vehicle-to-Everything
WM: World Model
ITS: Intelligent Transportation System
JEPA: Joint-Embedding Predictive Architecture
LSTM: Long Short-Term Memory
MLLM: Multimodal Large Language Model
CNN: Convolutional Neural Network
GNN: Graph Neural Network
RNN: Recurrent Neural Network
RSSM: Recurrent State Space Model
RL: Reinforcement Learning
VAE: Variational Autoencoder

References

  1. Yi, X.; Rui, Y.; Ran, B.; Luo, K.; Sun, H. Vehicle–Infrastructure Cooperative Sensing: Progress and Prospect. Strateg. Study Chin. Acad. Eng. 2024, 26, 178–189. [Google Scholar] [CrossRef]
  2. Li, X. Comparative analysis of LTE-V2X and 5G-V2X (NR). Inf. Commun. Technol. Policy 2020, 46, 93. [Google Scholar]
  3. Yurtsever, E.; Lambert, J.; Carballo, A.; Takeda, K. A survey of autonomous driving: Common practices and emerging technologies. IEEE Access 2020, 8, 58443–58469. [Google Scholar] [CrossRef]
  4. Wiseman, Y. Autonomous vehicles. In Encyclopedia of Information Science and Technology, 5th ed.; Khosrow-Pour, M., Ed.; IGI Global: Hershey, PA, USA, 2020; pp. 1–11. [Google Scholar]
  5. Wang, F.; Wang, Y. Digital scientists and parallel sciences: The origin and goal of AI for science and science for AI. Bull. Chin. Acad. Sci. 2024, 39, 27–33. [Google Scholar]
  6. Li, Y.; Li, M. Anomaly detection of wind turbines based on deep small-world neural network. Power Gener. Technol. 2021, 42, 313. [Google Scholar] [CrossRef]
  7. Feng, T.; Wang, W.; Yang, Y. A survey of world models for autonomous driving. arXiv 2025, arXiv:2501.11260. [Google Scholar]
  8. Tu, S.; Zhou, X.; Liang, D.; Jiang, X.; Zhang, Y.; Li, X.; Bai, X. The role of world models in shaping autonomous driving: A comprehensive survey. arXiv 2025, arXiv:2502.10498. [Google Scholar]
  9. Gao, S.; Yang, J.; Chen, L.; Chitta, K.; Qiu, Y.; Geiger, A.; Zhang, J.; Li, H. Vista: A generalizable driving world model with high fidelity and versatile controllability. arXiv 2024, arXiv:2405.17398. [Google Scholar] [CrossRef]
  10. Luo, J.; Zhang, T.; Hao, R.; Li, D.; Chen, C.; Na, Z.; Zhang, Q. Real-time cooperative vehicle coordination at unsignalized road intersections. IEEE Trans. Intell. Transp. Syst. 2023, 24, 5390–5405. [Google Scholar] [CrossRef]
  11. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  12. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef]
  13. Silver, D.; Huang, A.; Maddison, C.J.; Guez, A.; Sifre, L.; Van Den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M. Mastering the game of Go with deep neural networks and tree search. Nature 2016, 529, 484–489. [Google Scholar] [CrossRef]
  14. Arulkumaran, K.; Deisenroth, M.P.; Brundage, M.; Bharath, A.A. A brief survey of deep reinforcement learning. arXiv 2017, arXiv:1708.05866. [Google Scholar] [CrossRef]
  15. Cimini, G.; Squartini, T.; Saracco, F.; Garlaschelli, D.; Gabrielli, A.; Caldarelli, G. The statistical physics of real-world networks. Nat. Rev. Phys. 2019, 1, 58–71. [Google Scholar] [CrossRef]
  16. Bieker, L.; Krajzewicz, D.; Morra, A.; Michelacci, C.; Cartolano, F. Traffic simulation for all: A real world traffic scenario from the city of Bologna. In Modeling Mobility with Open Data: 2nd SUMO Conference 2014 Berlin, Germany, May 15–16, 2014; Springer International Publishing: Cham, Switzerland, 2015; pp. 47–60. [Google Scholar]
  17. Zhang, L.; Xie, Y.; Xidao, L.; Zhang, X. Multi-source heterogeneous data fusion. In Proceedings of the 2018 International Conference on Artificial Intelligence and Big Data (ICAIBD), Chengdu, China, 26–28 May 2018; IEEE: New York, NY, USA, 2018; pp. 47–51. [Google Scholar]
  18. Zhang, H.; Wang, Z.; Lyu, Q.; Zhang, Z.; Chen, S.; Shu, T.; Dariush, B.; Lee, K.; Du, Y.; Gan, C. COMBO: Compositional world models for embodied multi-agent cooperation. arXiv 2024, arXiv:2404.10775. [Google Scholar] [CrossRef]
  19. Bergies, S.; Aljohani, T.M.; Su, S.F.; Elsisi, M. An IoT-based deep-learning architecture to secure automated electric vehicles against cyberattacks and data loss. IEEE Trans. Syst. Man Cybern. Syst. 2024, 54, 5717–5732. [Google Scholar] [CrossRef]
  20. Samak, T.; Samak, C.; Kandhasamy, S.; Krovi, V.; Xie, M. AutoDRIVE: A comprehensive, flexible and integrated digital twin ecosystem for autonomous driving research & education. Robotics 2023, 12, 77. [Google Scholar] [CrossRef]
  21. Munn, Z.; Barker, T.H.; Moola, S.; Tufanaru, C.; Stern, C.; McArthur, A.; Stephenson, M.; Aromataris, E. Methodological quality of case series studies: An introduction to the JBI critical appraisal tool. JBI Evid. Synth. 2020, 18, 2127–2133. [Google Scholar] [CrossRef] [PubMed]
  22. Wang, F.; Miao, Q. Novel paradigm for AI-driven scientific research: From AI4S to intelligent science. Bull. Chin. Acad. Sci. 2023, 38, 536–540. [Google Scholar]
  23. Ding, J.; Zhang, Y.; Shang, Y.; Zhang, Y.H.; Zong, Z.; Feng, J.; Yuan, Y.; Su, H.; Li, N.; Sukiennik, N.; et al. Understanding World or Predicting Future? A Comprehensive Survey of World Models. arXiv 2024, arXiv:2411.14499. [Google Scholar] [CrossRef]
  24. Zhu, Z.; Wang, X.; Zhao, W.; Min, C.; Deng, N.; Dou, M.; Wang, Y.; Shi, B.; Wang, K.; Zhang, C.; et al. Is sora a world simulator? A comprehensive survey on general world models and beyond. arXiv 2024, arXiv:2405.03520. [Google Scholar] [CrossRef]
  25. Wan, Y.; Zhong, Y.; Ma, A.; Wang, J.; Feng, R. RSSM-Net: Remote sensing image scene classification based on multi-objective neural architecture search. In Proceedings of the IGARSS 2020-2020 IEEE International Geoscience and Remote Sensing Symposium, Waikoloa, HI, USA, 26 September–2 October 2020; IEEE: New York, NY, USA, 2021; pp. 1369–1372. [Google Scholar]
  26. Littwin, E.; Saremi, O.; Advani, M.; Thilak, V.; Nakkiran, P.; Huang, C.; Susskind, J. How jepa avoids noisy features: The implicit bias of deep linear self distillation networks. arXiv 2024, arXiv:2407.03475. [Google Scholar] [CrossRef]
  27. Imran, A.; Gopalakrishnan, K. Reinforcement Learning and Control. In AI for Robotics: Toward Embodied and General Intelligence in the Physical World; Apress: Berkeley, CA, USA, 2025; pp. 311–352. [Google Scholar]
  28. Sze, V.; Chen, Y.H.; Yang, T.J.; Emer, J.S. Efficient processing of deep neural networks: A tutorial and survey. Proc. IEEE 2017, 105, 2295–2329. [Google Scholar] [CrossRef]
  29. Hafner, D.; Lillicrap, T.; Norouzi, M.; Ba, J. Mastering atari with discrete world models. arXiv 2020, arXiv:2010.02193. [Google Scholar]
  30. Hafner, D.; Pasukonis, J.; Ba, J.; Lillicrap, T. Mastering diverse domains through world models. arXiv 2023, arXiv:2301.04104. [Google Scholar]
  31. Li, K.; Wu, L.; Qi, Q.; Liu, W.; Gao, X.; Zhou, L.; Song, D. Beyond single reference for training: Underwater image enhancement via comparative learning. IEEE Trans. Circuits Syst. Video Technol. 2022, 33, 2561–2576. [Google Scholar] [CrossRef]
  32. An, Y.; Yu, F.R.; He, Y.; Li, J.; Chen, J.; Leung, V.C. A Deep Learning System for Detecting IoT Web Attacks With a Joint Embedded Prediction Architecture (JEPA). IEEE Trans. Netw. Serv. Manag. 2024, 21, 6885–6898. [Google Scholar] [CrossRef]
  33. Wang, Y.; He, J.; Fan, L.; Li, H.; Chen, Y.; Zhang, Z. Driving into the future: Multiview visual forecasting and planning with world model for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 14749–14759. [Google Scholar]
  34. Li, Q.; Jia, X.; Wang, S.; Yan, J. Think2Drive: Efficient Reinforcement Learning by Thinking with Latent World Model for Autonomous Driving (in CARLA-V2). In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; Springer Nature Switzerland: Cham, Switzerland, 2024; pp. 142–158. [Google Scholar]
  35. Bogdoll, D.; Yang, Y.; Zöllner, J.M. Muvo: A multimodal generative world model for autonomous driving with geometric representations. arXiv 2023, arXiv:2311.11762. [Google Scholar] [CrossRef]
  36. Huang, W.; Ji, J.; Xia, C.; Zhang, B.; Yang, Y. Safedreamer: Safe reinforcement learning with world models. arXiv 2023, arXiv:2307.07176. [Google Scholar] [CrossRef]
  37. Wang, X.; Zhu, Z.; Huang, G.; Chen, X.; Zhu, J.; Lu, J. DriveDreamer: Towards Real-World-Drive World Models for Autonomous Driving. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; Springer Nature: Cham, Switzerland, 2024; pp. 55–72. [Google Scholar]
  38. Zhang, Z.; Liniger, A.; Dai, D.; Yu, F.; Van Gool, L. Trafficbots: Towards world models for autonomous driving simulation and motion prediction. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; IEEE: New York, NY, USA, 2023; pp. 1522–1529. [Google Scholar]
  39. Hu, A.; Russell, L.; Yeo, H.; Murez, Z.; Fedoseev, G.; Kendall, A.; Shotton, J.; Corrado, G. Gaia-1: A generative world model for autonomous driving. arXiv 2023, arXiv:2309.17080. [Google Scholar] [CrossRef]
  40. Jia, F.; Mao, W.; Liu, Y.; Zhao, Y.; Wen, Y.; Zhang, C.; Zhang, X.; Wang, T. Adriver-i: A general world model for autonomous driving. arXiv 2023, arXiv:2311.13549. [Google Scholar] [CrossRef]
  41. Bruce, J.; Dennis, M.D.; Edwards, A.; Parker-Holder, J.; Shi, Y.; Hughes, E.; Lai, M.; Mavalankar, A.; Steigerwald, R.; Apps, C.; et al. Genie: Generative interactive environments. In Proceedings of the Forty-First International Conference on Machine Learning, Vienna, Austria, 24–27 July 2024. [Google Scholar]
  42. Wang, X.; Zhu, Z.; Huang, G.; Wang, B.; Chen, X.; Lu, J. Worlddreamer: Towards general world models for video generation via predicting masked tokens. arXiv 2024, arXiv:2401.09985. [Google Scholar] [CrossRef]
  43. Song, K.; Chen, B.; Simchowitz, M.; Du, Y.; Tedrake, R.; Sitzmann, V. History-Guided Video Diffusion. arXiv 2025, arXiv:2502.06764. [Google Scholar] [CrossRef]
  44. Garrido, Q.; Ballas, N.; Assran, M.; Bardes, A.; Najman, L.; Rabbat, M.; Dupoux, E.; LeCun, Y. Intuitive physics understanding emerges from self-supervised pretraining on natural videos. arXiv 2025, arXiv:2502.11831. [Google Scholar]
  45. Ren, X.; Xu, L.; Xia, L.; Wang, S.; Yin, D.; Huang, C. VideoRAG: Retrieval-Augmented Generation with Extreme Long-Context Videos. arXiv 2025, arXiv:2502.01549. [Google Scholar]
  46. Zhang, X.; Weng, X.; Yue, Y.; Fan, Z.; Wu, W.; Huang, L. TinyLLaVA-Video: A Simple Framework of Small-scale Large Multimodal Models for Video Understanding. arXiv 2025, arXiv:2501.15513. [Google Scholar]
  47. Tanveer, M.; Zhou, Y.; Niklaus, S.; Amiri, A.M.; Zhang, H.; Singh, K.K.; Zhao, N. MotionBridge: Dynamic Video Inbetweening with Flexible Controls. arXiv 2024, arXiv:2412.13190. [Google Scholar] [CrossRef]
  48. Xie, H.; Chen, Z.; Hong, F.; Liu, Z. Citydreamer: Compositional generative model of unbounded 3d cities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 9666–9675. [Google Scholar]
  49. Huang, S.; Chen, L.; Zhou, P.; Chen, S.; Jiang, Z.; Hu, Y.; Liao, Y.; Gao, P.; Li, H.; Yao, M.; et al. Enerverse: Envisioning embodied future space for robotics manipulation. arXiv 2025, arXiv:2501.01895. [Google Scholar] [CrossRef]
  50. Zhai, S.; Ye, Z.; Liu, J.; Xie, W.; Hu, J.; Peng, Z.; Xue, H.; Chen, D.P.; Wang, X.M.; Yang, L.; et al. StarGen: A Spatiotemporal Autoregression Framework with Video Diffusion Model for Scalable and Controllable Scene Generation. arXiv 2025, arXiv:2501.05763. [Google Scholar] [CrossRef]
  51. Ge, J.; Chen, Z.; Lin, J.; Zhu, J.; Liu, X.; Dai, J.; Zhu, X. V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding. arXiv 2024, arXiv:2412.09616. [Google Scholar]
  52. Bahmani, S.; Skorokhodov, I.; Qian, G.; Siarohin, A.; Menapace, W.; Tagliasacchi, A.; Lindell, D.B.; Tulyakov, S. AC3D: Analyzing and Improving 3D Camera Control in Video Diffusion Transformers. arXiv 2024, arXiv:2411.18673. [Google Scholar] [CrossRef]
  53. Liang, H.; Cao, J.; Goel, V.; Qian, G.; Korolev, S.; Terzopoulos, D.; Plataniotis, K.N.; Tulyakov, S.; Ren, J. Wonderland: Navigating 3D Scenes from a Single Image. arXiv 2024, arXiv:2412.12091. [Google Scholar] [CrossRef]
  54. Xing, Y.; Fei, Y.; He, Y.; Chen, J.; Xie, J.; Chi, X.; Chen, Q. Large Motion Video Autoencoding with Cross-modal Video VAE. arXiv 2024, arXiv:2412.17805. [Google Scholar] [CrossRef]
  55. Chen, X.; Zhang, Z.; Zhang, H.; Zhou, Y.; Kim, S.Y.; Liu, Q.; Li, Y.J.; Zhang, J.M.; Zhao, N.X.; Wang, Y.L.; et al. UniReal: Universal Image Generation and Editing via Learning Real-world Dynamics. arXiv 2024, arXiv:2412.07774. [Google Scholar]
  56. Li, L.; Zhang, H.; Zhang, X.; Zhu, S.; Yu, Y.; Zhao, J.; Heng, P.A. Towards an information theoretic framework of context-based offline meta-reinforcement learning. arXiv 2024, arXiv:2402.02429. [Google Scholar]
  57. Huang, Z.; Guo, Y.C.; Wang, H.; Yi, R.; Ma, L.; Cao, Y.P.; Sheng, L. Mv-adapter: Multi-view consistent image generation made easy. arXiv 2024, arXiv:2412.03632. [Google Scholar]
  58. Ni, C.; Zhao, G.; Wang, X.; Zhu, Z.; Qin, W.; Huang, G.; Liu, C.; Chen, Y.; Wang, W.; Zhang, X.; et al. ReconDreamer: Crafting World Models for Driving Scene Reconstruction via Online Restoration. arXiv 2024, arXiv:2411.19548. [Google Scholar] [CrossRef]
  59. Lin, Z.; Liu, W.; Chen, C.; Lu, J.; Hu, W.; Fu, T.J.; Allardice, J.; Lai, Z.F.; Song, L.C.; Zhang, B.W.; et al. STIV: Scalable Text and Image Conditioned Video Generation. arXiv 2024, arXiv:2412.07730. [Google Scholar] [CrossRef]
  60. Wang, Q.; Fan, L.; Wang, Y.; Chen, Y.; Zhang, Z. Freevs: Generative view synthesis on free driving trajectory. arXiv 2024, arXiv:2410.18079. [Google Scholar] [CrossRef]
  61. Cai, S.; Wang, Z.; Lian, K.; Mu, Z.; Ma, X.; Liu, A.; Liang, Y. Rocket-1: Mastering open-world interaction with visual-temporal context prompting. arXiv 2024, arXiv:2410.17856. [Google Scholar]
  62. Fan, Y.; Ma, X.; Wu, R.; Du, Y.; Li, J.; Gao, Z.; Li, Q. Videoagent: A memory-augmented multimodal agent for video understanding. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; Springer Nature: Cham, Switzerland, 2024; pp. 75–92. [Google Scholar]
  63. Feng, J.; Ma, A.; Wang, J.; Cheng, B.; Liang, X.; Leng, D.; Yin, Y. Fancyvideo: Towards dynamic and consistent video generation via cross-frame textual guidance. arXiv 2024, arXiv:2408.08189. [Google Scholar]
  64. Yuan, Z.; Liu, Y.; Cao, Y.; Sun, W.; Jia, H.; Chen, R.; Li, Z.; Lin, B.; Yuan, L.; He, L.; et al. Mora: Enabling generalist video generation via a multi-agent framework. arXiv 2024, arXiv:2403.13248. [Google Scholar] [CrossRef]
  65. Song, Z.; Wang, C.; Sheng, J.; Zhang, C.; Yu, G.; Fan, J.; Chen, T. Moviellm: Enhancing long video understanding with ai-generated movies. arXiv 2024, arXiv:2403.01422. [Google Scholar]
  66. Deng, F.; Park, J.; Ahn, S. Facing off world model backbones: RNNs, Transformers, and S4. Adv. Neural Inf. Process. Syst. 2023, 36, 72904–72930. [Google Scholar]
  67. Ma, Y.; Fan, Y.; Ji, J.; Wang, H.; Sun, X.; Jiang, G.; Shu, A.; Ji, R. X-dreamer: Creating high-quality 3d content by bridging the domain gap between text-to-2d and text-to-3d generation. arXiv 2023, arXiv:2312.00085. [Google Scholar]
  68. Zheng, W.; Chen, W.; Huang, Y.; Zhang, B.; Duan, Y.; Lu, J. Occworld: Learning a 3d occupancy world model for autonomous driving. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; Springer Nature: Cham, Switzerland, 2024; pp. 55–72. [Google Scholar]
  69. Bardes, A.; Garrido, Q.; Ponce, J.; Chen, X.; Rabbat, M.; LeCun, Y.; Assran, M.; Ballas, N. Revisiting feature prediction for learning visual representations from video. arXiv 2024, arXiv:2404.08471. [Google Scholar] [CrossRef]
  70. Cai, S.; Chan, E.R.; Peng, S.; Shahbazi, M.; Obukhov, A.; Van Gool, L.; Wetzstein, G. Diffdreamer: Towards consistent unsupervised single-view scene extrapolation with conditional diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 4–6 October 2023; pp. 2139–2150. [Google Scholar]
  71. Lu, H.; Yang, G.; Fei, N.; Huo, Y.; Lu, Z.; Luo, P.; Ding, M. Vdt: General-purpose video diffusion transformers via mask modeling. arXiv 2023, arXiv:2305.13311. [Google Scholar] [CrossRef]
  72. Mendonca, R.; Bahl, S.; Pathak, D. Structured world models from human videos. arXiv 2023, arXiv:2308.10901. [Google Scholar] [CrossRef]
  73. Robine, J.; Höftmann, M.; Uelwer, T.; Harmeling, S. Transformer-based world models are happy with 100k interactions. arXiv 2023, arXiv:2303.07109. [Google Scholar] [CrossRef]
  74. Ma, H.; Wu, J.; Feng, N.; Xiao, C.; Li, D.; Hao, J.; Wang, J.; Long, M. Harmonydream: Task harmonization inside world models. arXiv 2023, arXiv:2310.00344. [Google Scholar]
  75. Zhang, W.; Wang, G.; Sun, J.; Yuan, Y.; Huang, G. Storm: Efficient stochastic transformer based world models for reinforcement learning. Adv. Neural Inf. Process. Syst. 2023, 36, 27147–27166. [Google Scholar]
  76. Min, C.; Zhao, D.; Xiao, L.; Yazgan, M.; Zollner, J.M. Uniworld: Autonomous driving pre-training via world models. arXiv 2023, arXiv:2308.07234. [Google Scholar]
  77. Wu, P.; Escontrela, A.; Hafner, D.; Abbeel, P.; Goldberg, K. Daydreamer: World models for physical robot learning. In Proceedings of the Conference on robot learning, Atlanta, GA, USA, 6–9 November 2023; PMLR: Cambridge, UK, 2023; pp. 2226–2240. [Google Scholar]
  78. Gu, J.; Wang, S.; Zhao, H.; Lu, T.; Zhang, X.; Wu, Z.; Xu, S.; Zhang, W.; Jiang, W.G.; Xu, H. Reuse and diffuse: Iterative denoising for text-to-video generation. arXiv 2023, arXiv:2309.03549. [Google Scholar]
  79. Bardes, A.; Ponce, J.; LeCun, Y. Mc-jepa: A joint-embedding predictive architecture for self-supervised learning of motion and content features. arXiv 2023, arXiv:2307.12698. [Google Scholar]
  80. Fei, Z.; Fan, M.; Huang, J. A-JEPA: Joint-Embedding Predictive Architecture Can Listen. arXiv 2024, arXiv:2311.15830. [Google Scholar]
  81. Assran, M.; Duval, Q.; Misra, I.; Bojanowski, P.; Vincent, P.; Rabbat, M.; LeCun, Y.; Ballas, N. Self-supervised learning from images with a joint-embedding predictive architecture. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 15619–15629. [Google Scholar]
  82. Okada, M.; Taniguchi, T. DreamingV2: Reinforcement learning with discrete world models without reconstruction. In Proceedings of the 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Kyoto, Japan, 23–27 October 2022; IEEE: New York, NY, USA, 2022; pp. 985–991. [Google Scholar]
  83. Deng, F.; Jang, I.; Ahn, S. Dreamerpro: Reconstruction-free model-based reinforcement learning with prototypical representations. In Proceedings of the International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022; PMLR: Cambridge, UK, 2022; pp. 4956–4975. [Google Scholar]
  84. Micheli, V.; Alonso, E.; Fleuret, F. Transformers are sample-efficient world models. arXiv 2022, arXiv:2209.00588. [Google Scholar]
  85. Chen, C.; Wu, Y.F.; Yoon, J.; Ahn, S. Transdreamer: Reinforcement learning with transformer world models. arXiv 2022, arXiv:2202.09481. [Google Scholar] [CrossRef]
  86. Hu, A.; Corrado, G.; Griffiths, N.; Murez, Z.; Gurau, C.; Yeo, H.; Kendall, A.; Cipolla, R.; Shotton, J. Model-based imitation learning for urban driving. Adv. Neural Inf. Process. Syst. 2022, 35, 20703–20716. [Google Scholar]
  87. Gao, Z.; Mu, Y.; Shen, R.; Duan, J.; Luo, P.; Lu, Y.; Li, S.E. Enhance sample efficiency and robustness of end-to-end urban autonomous driving via semantic masked world model. arXiv 2022, arXiv:2210.04017. [Google Scholar] [CrossRef]
  88. Ye, W.; Liu, S.; Kurutach, T.; Abbeel, P.; Gao, Y. Mastering atari games with limited data. Adv. Neural Inf. Process. Syst. 2021, 34, 25476–25488. [Google Scholar]
  89. Koh, J.Y.; Lee, H.; Yang, Y.; Baldridge, J.; Anderson, P. Pathdreamer: A world model for indoor navigation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 14738–14748. [Google Scholar]
  90. Kim, K.; Sano, M.; De Freitas, J.; Haber, N.; Yamins, D. Active world model learning with progress curiosity. In Proceedings of the International Conference on Machine Learning, Vienna, Austria, 13–18 July 2020; PMLR: Cambridge, UK, 2020; pp. 5306–5315. [Google Scholar]
  91. Hafner, D.; Lillicrap, T.; Fischer, I.; Villegas, R.; Ha, D.; Lee, H.; Davidson, J. Learning latent dynamics for planning from pixels. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; PMLR: Cambridge, UK, 2019; pp. 2555–2565. [Google Scholar]
  92. Wang, X.; Yang, X.; Jia, X.; Wang, S. Modeling and analysis of hybrid traffic flow considering actual behavior of platoon. J. Syst. Simul. 2024, 36, 929–940. [Google Scholar]
  93. Zhang, Y.; Zhang, L.; Liu, B.L.; Liang, Z.Z.; Zhang, X.F. Multi-spatial scale traffic prediction model based on spatio-temporal Transformer. Comput. Eng. Sci. 2024, 46, 1852. [Google Scholar]
  94. Huang, Y.; Zhang, S.; Lin, Y.; Zhen, F.; Zhao, S.S.; Li, L. Ideas and practices of city-level territorial spatial planning monitoring: A case study of Ningbo. J. Nat. Resour. 2024, 39, 823–841. [Google Scholar] [CrossRef]
  95. Zhao, J.; Ni, Q.; Li, Y. Exploration on Construction of Intelligent Water Conservancy Teaching Platform in Colleges and Universities Based on Digital Twin. J. Yellow River Conserv. Tech. Inst. 2024, 36, 76–80. [Google Scholar]
  96. Li, G. AI4R: The fifth scientific research paradigm. Bull. Chin. Acad. Sci. 2024, 39, 1–9. [Google Scholar]
  97. Ye, Y.; Xu, Y.; Zhang, Z.; Hu, L.; Xia, S. Recent Advances on Motion Control Policy Learning for Humanoid Characters. J. Comput.-Aided Des. Comput. Graph. 2025, 37, 185–206. [Google Scholar]
  98. Wang, X.; Tan, G. Research on Decision-making of Autonomous Driving in Highway Environment Based on Knowledge and Large Language Model. J. Syst. Simul. 2025, 37, 1246. [Google Scholar]
  99. Yang, L.; Yuan, T.; Duan, R.; Li, Z. Research on the optimization strategy of the content agenda setting of artificial intelligence in the WeChat official account of scientific journals. Chin. J. Sci. Tech. Period. 2025, 36, 611. [Google Scholar]
  100. Yuan, Z.; Gu, J.; Wang, G. Exploration and practice of urban intelligent pan-info surveying and mapping in Shanghai. Bull. Surv. Mapp. 2024, 4, 168. [Google Scholar]
  101. Ren, A.; Li, S. Design of DRL timing simulation system for signal intersection integrating VISSIM-Python and QT. Comput. Appl. Softw. 2024, 41, 53–59. [Google Scholar]
Figure 1. Publications related to World Models in IEEE Xplore (2015–2025 [May]).
Figure 2. Generic World Model architecture.
Figure 3. Schematic diagram of Recurrent State Space Model (RSSM) core architecture.
Figure 4. Schematic diagram of Joint-Embedding Predictive Architecture.
Figure 5. World Model integration framework in CAV systems.
Table 1. Application and development of World Models in vehicle–road collaboration environments in recent years.
| Year | Model Name | Input Modalities | Core Structure | Key Contributions for V2X Scenarios |
| --- | --- | --- | --- | --- |
| 2025 | DFoT [43] | Video | Diffusion Transformer | Generates high-quality, temporally consistent, and robust CAV simulation scenarios, including rare events. |
| | V-JEPA [44] | Natural Video, Pixel Data | JEPA | Provides a lightweight, data-driven paradigm for WM construction, enhancing representation learning and prediction efficiency. |
| | VideoRAG [45] | Video, Audio, and Textual | RAG | Introduces retrieval-augmented generation for ultra-long videos, crucial for long-term CAVs scene prediction. |
| | TinyLLaVA-Video [46] | Video, Text | Transformer | Enables deployment of high-performance video understanding models on resource-constrained in-vehicle edge platforms. |
| | MotionBridge [47] | Trajectory Stroke, Pixel, and Text | Diffusion Transformer | Offers a novel approach for high-fidelity traffic flow video simulation in CAVs. |
| | CityDreamer4D [48] | Image, Text | VQVAE, Transformer, GAN | Generates large-scale, high-fidelity 4D city scenes for complex urban CAV simulations. |
| | EnerVerse [49] | Image, Text, and Action | FAVs, Diffusion, and 4DGS | Proposes new insights for modeling complex, dynamic scenes and collaborative perception in multi-agent systems. |
| | StarGen [50] | Image, Text, and Pose Trajectory | Diffusion | Provides a spatiotemporal autoregressive generation framework with high-precision pose control for large-scale, high-fidelity traffic flow simulations. |
| 2024 | V2PE [51] | Image, Text | V2PE-enhanced | Enhances understanding of complex CAV scenarios and human–machine interaction through efficient multi-modal data processing. |
| | AC3D [52] | Text, Video | Diffusion Transformer | Significant for dynamic urban scene generation and sensor view manipulation in CAVs. |
| | Wonderland [53] | Image | Diffusion Transformer, Latent Transformer | Enables rapid construction of high-fidelity, large-scale 3D traffic simulation environments from single images. |
| | Cross-modal Video VAE [54] | Video, Text | VAE | Presents a novel cross-modal video VAE, supporting bandwidth-limited V2X communication and efficient collaborative perception. |
| | UniReal [55] | Images, Text | Diffusion Transformer | Learns and constructs real-world dynamics from video data, providing more realistic and diverse approaches for traffic scene simulations. |
| | UNICORN Framework [56] | Trajectory Segment | Information Theoretic Meta-RL Framework | Enhances V2X system adaptability and reliability in complex, uncertain environments through task representation learning. |
| | MV-Adapter [57] | Text, Image, and Camera Parameter | Multi-view Adapter Model | Generates multi-view consistent, high-quality images, supporting HMC and multi-sensor fusion perception. |
| | ReconDreamer [58] | Video | Diffusion | Reduces artifacts and maintains spatiotemporal consistency, improving accuracy and robustness of V2X environmental perception. |
| | STIV [59] | Text, Image | Diffusion Transformer, VAE | Enhances environmental understanding and prediction in complex traffic. |
| | FreeVS [60] | Pseudo-Image | Diffusion, U-Net | Develops more realistic, generalizable CAV simulation platforms, enhancing in-vehicle perception robustness in complex scenarios. |
| | ROCKET-1 [61] | Image | TransformerXL | Demonstrates decision-making potential in complex, dynamic environments, offering technical reference for future CAV systems. |
| | VideoAgent [62] | Video | VQVAE, Transformer | Offers insights into building modular and scalable CAV systems through its versatile toolkit approach. |
| | FancyVideo [63] | Text | CTGM, Transformer, and Latent Diffusion Model | Significantly improves temporal consistency and motion coherence in text-to-video generation. |
| | Genie [41] | Videos, Text, and Image | VQ-VAE, Transformer, and MaskGIT | Significantly advances real-world scene creation and human–machine co-driving R&D. |
| | MORA [64] | Text, Image | Diffusion | Provides valuable technical examples for CAV applications requiring diverse scenario generation or operating with limited real-world data. |
| | Think2Drive [34] | Image, Text | RSSM | Offers reference for efficient reinforcement learning decisions in CAV environments. |
| | MovieLLM [65] | Video | Diffusion | Provides new technical references and data augmentation strategies for CAV traffic scene simulations and human–machine co-driving. |
| | S4WM [66] | Image | S4/PSSM, VAE | Offers significant advantages in long-sequence traffic flow prediction and complex dynamic scene modeling for CAVs. |
| | WorldDreamer [42] | Image, Video, and Text | VQGAN, STPT Transformer | Provides technical paradigms for high-fidelity scene generation and multi-modal data fusion via its STPT architecture and masked token prediction. |
| | X-Dreamer [67] | Text | LoRA, Diffusion | Stimulates research into more robust and accurate virtual simulations for V2X systems. |
| 2023 | OccWorld [68] | Semantic Data | VQVAE, Transformer | Provides predictive data support for safer, more efficient collaborative decision-making and control in V2X systems. |
| | V-JEPA [69] | Video | JEPA | Offers theoretical support for V2X system deployment and application in data-constrained scenarios. |
| | MUVO [35] | Image, Text | Transformer, GRU | Greatly enhances perception, prediction, and decision-making capabilities in complex traffic scenarios. |
| | DiffDreamer [70] | Image | Diffusion | Delivers richer and more realistic virtual environments for the development, testing, and validation of V2X algorithms. |
| | VDT [71] | Image | Diffusion Transformer | Reduces model complexity and computational cost, facilitating deployment on resource-constrained in-vehicle and roadside units. |
| | SWIM [72] | Video | RSSM, VAE, and CNN | Provides insights into developing flexible and safe control strategies for V2X systems. |
| | SafeDreamer [36] | Image | RSSM | Offers technical support for dynamically adjusting V2X control strategies based on varying safety requirements. |
| | TWM [73] | Image | Transformer | Provides theoretical references for risk assessment and safety decision-making in V2X systems. |
| | HarmonyDream [74] | Image | RSSM | Enhances V2X system adaptability and collaborative decision making in complex traffic scenarios. |
| | STORM [75] | Image | Transformer | Reduces reliance on manual annotation through self-supervised learning, improving V2X system stability and reliability in data-scarce environments. |
| | UniWorld [76] | Image | BEV, Transformer | Offers technical insights into constructing multi-functional, integrated perception models for V2X systems. |
| | DayDreamer [77] | Image, Text | RSSM | Provides a reference architecture for building scalable and customizable World Models for V2X systems. |
| | TrafficBots [38] | Text, Image | Transformer, GRU | Delivers robust tools for the development, validation, and deployment of V2X algorithms. |
| | VidRD [78] | Text, Video | Diffusion | Accelerates R&D and validation of V2X collaborative perception, prediction, decision-making, and control algorithms. |
| | MC-JEPA [79] | Image | JEPA | Improves performance of V2X target detection, scene understanding, and behavior prediction tasks. |
| | DreamerV3 [30] | Image | RSSM | Enhances V2X system stability and reliability in complex environments. |
| | Drive-WM [33] | Images, Text, and Action | U-Net, Transformer | Boosts the generalization capability of V2X systems in complex environments. |
| | A-JEPA [80] | Audio Spectrogram | JEPA | Provides technical support for various human–machine collaboration solutions in CAVs, including voice interaction and driver state recognition. |
| | ADriver-I [40] | Action, Image | MLLM, Diffusion | Potentially reduces CAVs system reliance on roadside infrastructure, enhancing in-vehicle unit intelligence. |
| | I-JEPA [81] | Image | JEPA | Enables lightweight, high-performance edge computing vision perception models for CAV systems. |
| | GAIA-1 [39] | Video, Text, and Action | Diffusion Transformer | Offers technical support for more flexible and intelligent scene interactions and human–machine co-driving in CAV systems. |
| | DriveDreamer [37] | Image, Text, and Actions | Diffusion Transformer | Provides powerful tools for CAV simulations, data augmentation, and scene understanding, accelerating algorithm R&D and validation. |
| 2022 | DreamingV2 [82] | Image, Actions | RSSM | Holds significant potential for CAVs perception, prediction, and decision-making tasks. |
| | DreamerPro [83] | Image, Scalar Reward | RSSM | Enhances environmental perception robustness and decision reliability of CAV systems under adverse conditions. |
| | IRIS [84] | Image, Action | VAE, Transformer | Offers solutions for low-latency, high-reliability collaborative perception and prediction. |
| | TransDreamer [85] | Visual Observation | Transformer | Greatly enhances perception, prediction, and decision-making capabilities in complex traffic scenarios. |
| | MILE [86] | Images, Actions | RSSM | Provides robust solutions for multi-modal fusion perception and long-tail scenario handling in CAVs. |
| | SEM2 [87] | Images, LiDAR | RSSM | Paves new theoretical pathways for developing more practical CAV decision-making and control algorithms. |
| 2021 | EfficientZero [88] | Video, Image | MuZero | Provides theoretical basis for realistic scene simulation and scene-understanding-based human–machine co-driving. |
| | Pathdreamer [89] | Image, Semantic Segmentation, and Camera Poses | GAN | Holds potential for CAV systems to efficiently learn and predict intentions and trajectories of key agents in complex traffic flows. |
| 2020 | AWML [90] | Object-Oriented Features | LSTM | Offers technical references for complex decision making and control in resource-constrained in-vehicle platforms. |
| | DreamerV2 [29] | Pixels | RSSM | Offers valuable insights for robust perception, multi-agent interaction prediction, and collaborative planning. |