Article

A Dual Digital Twin Framework for Reinforcement Learning: Bridging Webots and MuJoCo with Generative AI and Alignment Strategies

The Faculty of Fundamental Sciences, Vilnius Gediminas Technical University, Saulėtekio al. 11, LT-10223 Vilnius, Lithuania
* Author to whom correspondence should be addressed.
Electronics 2025, 14(24), 4806; https://doi.org/10.3390/electronics14244806
Submission received: 26 October 2025 / Revised: 4 December 2025 / Accepted: 5 December 2025 / Published: 6 December 2025
(This article belongs to the Special Issue Generative AI and Its Transformative Potential, 2nd Edition)

Abstract

Deep reinforcement learning (DRL) has shown potential for robotic training in virtual environments; however, challenges remain in bridging simulation and real-world deployment. This paper introduces an extended reinforcement learning framework that advances beyond traditional single-environment approaches by proposing a dual digital twin concept. Specifically, we suggest creating a digital twin of the robot in Webots and a corresponding twin in MuJoCo, enabling policy training in MuJoCo’s optimized physics engine and subsequent transfer back to Webots for validation. To ensure consistency across environments, we introduce a digital twin alignment methodology, synchronizing sensors, actuators, and physical model characteristics between the two simulators. Furthermore, we propose a novel testing framework that conducts controlled experiments in both virtual environments to quantify and manage divergence, thereby improving robustness and transferability. To address the cost and complexity of maintaining two high-fidelity models, we leverage generative AI agents to automate the creation of the secondary digital twin, significantly reducing engineering overhead. The proposed framework enhances scalability, accelerates training, and improves the reliability of sim-to-real transfer, paving the way for more efficient and adaptive robotic systems.

Graphical Abstract

1. Introduction

Reinforcement learning (RL) has matured into a powerful paradigm for training autonomous robotic systems by allowing agents to discover task policies through interaction and reward-guided optimization. Virtual environments play a central role in this process by providing safe, repeatable, and cost-effective platforms for large-scale experimentation and policy refinement, thereby reducing the hazards and expense of physical trial-and-error [1]. Modern simulators and RL toolkits have enabled important results across domains—from industrial manipulation to mobile navigation—yet substantial gaps remain when moving policies between heterogeneous simulation platforms and, ultimately, to physical robots [2,3,4,5]. These gaps are driven by discrepancies in dynamics, sensing, and control abstractions between simulators and between simulation and reality, and they motivate the need for frameworks that manage cross-environment consistency and transferability.
This paper extends prior work [6] on a Webots-based RL framework by introducing a dual-digital-twin concept and a set of practical mechanisms to align, test, and automate twin creation. The core idea is illustrated in Figure 1: (left) a UR5e model instantiated in MuJoCo, (center) the UR5e environment modeled in Fusion 360, and (right) the corresponding simulated environment used for testing in Webots, connected by an LLM-based RL synchronization component. The LLM-based component encapsulates an AI agent whose role is to maintain and reconcile multiple digital twins of the same physical robot so that their physics parameters, sensor and actuator models, and runtime semantics remain mutually consistent. By placing a lightweight, learning-capable coordinator between simulators, the framework treats synchronization not as an ad hoc data translation task but as an adaptive, learnable service that responds to observed divergences during training and validation.
A distinguishing element of the proposed framework is the notion of creating a digital twin of a digital twin: specifically, using MuJoCo as a secondary training substrate for policies that are validated primarily in Webots. MuJoCo’s efficient, well-validated physics engine [7] can accelerate policy optimization and exploration, while Webots [8] is retained as the primary validation and deployment environment because of its integration with ROS and its fidelity to the target production scene. Under this workflow, policies are trained or fine-tuned in MuJoCo and then transferred back to Webots for cross-verification; the LLM-based synchronizer monitors and actively reduces discrepancies so that the transferred policies behave robustly across both simulators. This sim-to-sim pipeline leverages the complementary strengths of each platform—MuJoCo’s computational efficiency and Webots’ deployment realism—while explicitly addressing the mismatch that would otherwise impair transfer.
To make transfer predictable and repeatable, the framework formalizes an alignment methodology between digital twins. Alignment covers three primary axes: (i) sensor semantics (e.g., sampling rates, noise models, field-of-view and intrinsic calibration for cameras; resolution and latency for proprioceptive sensors); (ii) actuator and control interfaces (joint torque/velocity limits, command discretization, control loop frequency, low-level controller dynamics); and (iii) physical model characteristics (mass, inertia, friction coefficients, contact geometry and restitution). By parameterizing each axis and exposing a compact alignment state, the LLM-based agent can propose and evaluate adjustments (via optimization or learned correction maps) and thus reduce the effective distance between the two twins. Where exact matching is impossible or undesirable, the agent produces controlled stochastic perturbations—calibrated domain randomization—that make policies robust to residual mismatch. The alignment strategy therefore combines deterministic matching, probabilistic augmentation, and continuous monitoring to close the sim-to-sim (and ultimately sim-to-real) gap.
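To make the notion of a compact alignment state concrete, the following is a minimal Python sketch of our own (field names such as camera_rate_hz and friction_coeff are illustrative assumptions, not the framework's actual parameterization): each of the three axes contributes a few parameters, and a scale-invariant distance over them gives the synchronizer a single scalar to drive toward zero.

```python
from dataclasses import dataclass, asdict
import math

@dataclass
class AlignmentState:
    """Compact alignment state over the three axes described above.
    All field names are illustrative, not the paper's actual parameters."""
    # (i) sensor semantics
    camera_rate_hz: float
    sensor_noise_std: float
    # (ii) actuator and control interfaces
    control_freq_hz: float
    max_joint_torque: float
    # (iii) physical model characteristics
    link_mass_kg: float
    friction_coeff: float

def alignment_distance(a: AlignmentState, b: AlignmentState) -> float:
    """Normalized L2 distance between two twins' alignment states."""
    total = 0.0
    for va, vb in zip(asdict(a).values(), asdict(b).values()):
        scale = max(abs(va), abs(vb), 1e-9)   # scale-invariant comparison
        total += ((va - vb) / scale) ** 2
    return math.sqrt(total)

webots = AlignmentState(30.0, 0.01, 500.0, 150.0, 2.0, 0.8)
mujoco = AlignmentState(30.0, 0.02, 500.0, 150.0, 2.1, 0.7)
print(round(alignment_distance(webots, mujoco), 3))
```

The agent can then propose parameter adjustments and accept them whenever they reduce this distance, falling back to calibrated randomization of the residual terms.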
Complementing alignment, the framework introduces a dual-environment testing methodology. Instead of validating policies exclusively in a single simulator, we run paired experimental campaigns: equivalent experiments are executed in both MuJoCo and Webots (with matched initial states and aligned configuration parameters), and a divergence analysis quantifies discrepancies in trajectories, control signals, sensor streams, and performance metrics. The pipeline reports statistical measures of divergence (e.g., trajectory distance, cumulative reward gaps, sensor distribution shifts) and feeds these diagnostics to the synchronization agent, which recommends corrective actions—such as parameter rescaling, noise model adjustments, or selective retraining. This closed-loop testing and correction cycle converts what is typically an offline, manual transfer problem into an online, measurable, and iteratively solvable process.
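Two of the divergence measures named above (trajectory distance and cumulative-reward gap) can be sketched in a few lines; this is our own toy illustration over matched rollouts, not the framework's reporting code.

```python
import math

def trajectory_divergence(traj_a, traj_b):
    """Mean per-step Euclidean distance between matched state trajectories."""
    n = min(len(traj_a), len(traj_b))          # truncate to the common horizon
    return sum(math.dist(a, b) for a, b in zip(traj_a[:n], traj_b[:n])) / n

def reward_gap(rewards_a, rewards_b):
    """Absolute gap in cumulative episode reward between the two simulators."""
    return abs(sum(rewards_a) - sum(rewards_b))

# paired rollouts of the same policy in MuJoCo vs. Webots (toy 2D states)
mj = [(0.0, 0.0), (0.10, 0.20), (0.20, 0.40)]
wb = [(0.0, 0.0), (0.10, 0.25), (0.25, 0.50)]
print(trajectory_divergence(mj, wb))
print(reward_gap([1.0, 1.0, 0.5], [1.0, 0.9, 0.4]))
```

In the closed loop, such scalars are what the synchronization agent monitors step by step before recommending a corrective action.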
Finally, to reduce the engineering burden of maintaining two high-fidelity twins, the framework leverages generative AI agents to automate the creation and upkeep of the secondary digital twin. Given a canonical Webots model and a specification of target physics fidelity, a generative agent can synthesize MuJoCo model descriptions, initial parameter estimates (masses, inertias, collision primitives), and plausible sensor/actuator wrappers. The agent uses a combination of learned priors, CAD-to-physics heuristics, and iterative refinement driven by divergence measurements to produce an initial twin that is rapidly converged to alignment via the LLM-synchronizer. This automation substantially lowers the cost of dual-twin workflows and scales the approach to larger fleets of robots and environments.
The remainder of this paper is organized as follows. Section 2 reviews related work in digital twins, reinforcement learning, and generative AI. Section 3 presents the process for digital twin creation, contrasting the traditional workflow with the proposed AI-augmented approach. Section 4 details the design patterns for the dual-twin reinforcement learning framework, integrating Webots and MuJoCo. Section 5 describes the physics-based alignment and validation framework designed to synchronize the two simulation environments. Section 6 presents the experimental results from the LLM-based model generation and the physics-based alignment validation. Finally, Section 7 summarizes the findings and discusses potential future research.

2. Related Work

The concept of a digital twin (DT)—a virtual replica of a physical robot or system—has rapidly matured in robotics over the last decade. Early DT efforts focused on high-fidelity modeling and simulation of individual robots for design and testing. More recent work embeds artificial intelligence to make DTs “smarter” and more adaptive. For example, the authors of [9] develop a standardized Webots-based RL framework that explicitly incorporates a DT concept: their architecture uses distinct design patterns to structure agent–environment interactions and even includes a pattern aimed at improving sim-to-real transfer. Likewise, an “AI-enhanced” industrial DT has been described wherein a reinforcement-learning agent operates on real-time production data, closing the loop between the physical and virtual systems for tasks like maintenance prioritization [10]. A recent review [11] identifies DT-aided AI implementations as one of the most consistent and emerging trends across robotics domains, with applications spanning intelligent robot grasping, policy transfer from simulation to physical systems, and autonomous data processing for enhanced adaptability. At the same time, researchers are exploring generative DTs: for instance, RoboTwin [12] leverages 3D generative models and large language models (LLMs) to automatically construct varied object models and motion programs for dual-arm manipulation tasks.
Similarly, a Unity-based digital twin framework [13] demonstrated a robot arm trained via reinforcement learning in a virtual environment, successfully transferring the learned policy to a 3D-printed physical counterpart and outlining key strategies for mapping virtual learning to real-world control. A related study [14] presents a digital twin–enabled human–robot collaboration system that integrates 3D simulation and real-time data to design, validate, and control a collaborative assembly process, demonstrating how digital twins can manage the complexity and adaptability of cobot-based production environments. In each case, the DT is moving beyond a static simulator toward an intelligent, continuously updated virtual system. Our work builds on this trend by constructing two concurrent DTs (in Webots and MuJoCo) for the same robot and task, effectively coupling two simulation domains. Unlike prior DT systems that use a single simulator or focus on domain-specific tasks, we explicitly target the dual-simulator case: using generative methods to align models across Webots and MuJoCo and designing an AI agent to synchronize them in real time.
It is well known that no simulator perfectly matches reality, and different engines have different inductive biases in physics modeling. The robotics community has thus employed various strategies to bridge these gaps. Domain randomization perturbs simulation parameters to improve robustness [15], and system identification uses real data to calibrate simulators to observed physics [16]. For example, PolySim [15] addresses sim-to-sim divergence by training controllers simultaneously in multiple simulators: policies are trained across heterogeneous engines (e.g., IsaacGym, IsaacSim, Genesis) so that the learned dynamics better approximate the real world and are less overfit to any single simulator’s quirks. In PolySim’s experiments, mixing simulators dramatically reduces discrepancies—e.g., combining IsaacSim, IsaacGym, and Genesis improved success rates by 52.8% on a MuJoCo task compared to using IsaacSim alone—and even enabled zero-shot transfer to a real Unitree robot. Complementarily, the ASID system proposes an active-exploration-based system-identification method that interleaves limited real-world trials with simulator tuning. ASID first collects informative real trajectories to estimate unknown parameters (mass, friction, etc.), then updates the simulator and trains a policy on the refined model—yielding policies that can transfer to reality in a zero-shot fashion [16]. Similarly, Humanoid-Gym achieves zero-shot sim-to-real transfer for complex humanoid robots by combining domain randomization with a sim-to-sim validation pipeline, where policies trained in Isaac Gym are tested in a calibrated MuJoCo environment to ensure robustness before deployment [17].
These works underscore that sim-to-sim mismatch and the sim-to-real gap remain major challenges in RL. Our framework differs by using a dual-twin approach: instead of training on multiple engines as independent environments, we explicitly link Webots and MuJoCo via model alignment and cross-validation. Rather than mixing simulators or relying on domain randomization, our method continuously compares two digital twins and corrects their divergence, enabling stable mutual verification between a more compliant-contact engine (Webots/ODE) and a higher-precision engine (MuJoCo). In contrast to prior methods that report performance gains only after training on multiple engines or updating physics from real data, our results show that a single-step alignment loop can already reduce simulator divergence and substantially improve cross-simulator policy transfer. This demonstrates that the proposed dual-twin architecture provides a lightweight yet effective alternative to multi-simulator training pipelines, offering measurable benefits even before sim-to-real deployment.
Automating the construction and maintenance of simulation-ready robot models is an increasingly active research direction, driven by the labor intensity of hand-authoring model files (URDF, SDF, MuJoCo XML, Webots PROTO) and by growing demand to scale simulations across many robot variants and environments. Recent literature falls into several complementary classes.
First, CAD-to-simulator and geometry-processing pipelines provide deterministic methods to extract kinematic chains, collision primitives, and approximate inertial properties from CAD or mesh data. These toolchains (commonly validated in robotics and manufacturing venues) reduce manual geometry rework by producing URDF/SDF descriptions and coarse physical parameters that serve as a starting point for dynamic tuning. Work in this class is well-established and often forms the first step in automated twin generation; recent surveys emphasize integrating automated parameter estimation with downstream system-identification for higher fidelity. See [18] for a review of how generative methods accelerate DT adoption and [19] for an applied example in predictive maintenance pipelines where model generation reduces engineering overhead.
Second, LLM approaches extend this foundation by using large language or vision-language models to automatically generate simulator scaffolding—such as sensor wrappers, control interfaces, and generator code—from high-level descriptions or example snippets. These systems typically emit editable modules that can be compiled into URDF or MuJoCo artifacts and verified through lightweight tests, significantly reducing both development time and syntax errors. The study on kinematic structure mapping demonstrates that such generative pipelines can reliably convert mechanical assemblies into simulation-ready URDF and MuJoCo-XML models, confirming that LLM-assisted translation can maintain kinematic integrity while minimizing manual intervention [20]. Together, these results indicate that generative coding methods now form a viable middle layer between CAD-based modeling and physics calibration.
Third, iterative LLM-plus-physics refinement closes the loop by coupling generative output with automated calibration. In this process, an LLM proposes an initial structure and nominal dynamics; the resulting model is executed in simulation, divergence metrics are computed, and targeted corrections are applied either by the same agent or by an optimization routine. Frameworks such as G-Sim (2025) exemplify this pattern by combining LLM-based simulator construction with empirical calibration to align generated models with observed physics. This two-stage cycle—semantic scaffolding followed by quantitative refinement—produces simulators that are both syntactically correct and dynamically consistent across benchmarks ranging from simple pendula to complex multi-body systems [21].
Finally, the idea of using an intelligent agent to keep a digital twin in sync with reality (or with another simulator) is gaining traction. DDD-GenDT proposes an LLM-augmented DT framework wherein an LLM continuously ingests time-series observations and predicts future system behavior. The LLM is treated as a live, adaptive model: by feeding it recent sensor and state data, it generates zero-shot forecasts of the system state without retraining [22].
To address the challenge of synchronization, recent studies have explored the use of advanced AI techniques. Multi-agent deep reinforcement learning, for example, has been proposed for managing the synchronization and migration of digital twins in dynamic environments like multi-access edge computing (MEC) networks [23]. Similarly to our work, the integration of LLMs with digital twins has been investigated in [24]. LLMs can act as an intelligent layer for managing and aligning the parameters of multiple digital twins, interpreting unstructured data, and even facilitating a natural language-based interface for controlling and monitoring the robotic system [25,26]. This synergy allows for continuous, feedback-driven optimization, strengthening the alignment between the digital and physical systems. For example, LLMs can interpret natural language commands to adjust optimization constraints within the digital twin framework, enabling flexible human-in-the-loop adaptation without manual recoding [27].
What distinguishes our work is the use of an AI or LLM agent to monitor and correct divergence between two simulators. In our dual-twin framework, an agent watches both Webots and MuJoCo simulations; when it detects growing discrepancies in state or physics, it uses learned models or LLM reasoning to adjust parameters (e.g., friction coefficients, joint damping, or even scene geometry) in one or both simulators.

3. Process of Digital Twin Creation

In our earlier work [6], we described a structured but engineering-heavy workflow for creating digital twins. Figure 2 summarizes this traditional process, which we now present as a baseline before outlining how this process can be augmented with generative AI assistance.
At the core of the workflow are three main stakeholders: the Production Company, the Simulation Engineer, and the Reinforcement Learning Engineer, all interacting through a simulation environment (e.g., Webots, MuJoCo).
  • Requirement Definition. The Production Company specifies the robot’s tasks, constraints, and performance targets. These requirements include robot dynamics, environment layouts, operational data, task objectives, and communication interfaces (for an example, see https://github.com/aalgirdas/roboGen-LLM/tree/main/pdf_sources (accessed on 26 October 2025)).
  • Model Construction. The Simulation Engineer uses this information to build both the robot model and its environment inside the simulator. The result is a high-fidelity digital twin that reflects the robot’s geometry, sensors, actuators, and workspace.
  • Simulation and Analysis. The Production Company can then run simulations to evaluate layouts, identify bottlenecks, and collect performance data without disrupting physical production.
  • Training and Optimization. The RL Engineer defines reward functions and algorithms, and trains policies directly in the digital twin. Performance metrics are fed back to the company to assess readiness.
  • Deployment. Once trained, the robot model can be transferred to the physical robot, with optional real-time synchronization to refine behavior after deployment.
This sequence (illustrated in Figure 2) highlights the clear but rigid separation of responsibilities: requirements flow from company to engineer, models are built and tested in the simulator, and policies are trained before transfer to reality. While effective, this process is slow and costly to maintain across multiple simulators. In the following, we propose replacing parts of this workflow with generative AI–assisted dual digital twin creation and alignment, reducing manual effort and enabling faster, more robust sim-to-sim and sim-to-real transfer.
In contrast to this conventional workflow, our extended process introduces two generative AI–based components, as illustrated in Figure 3. First, a Generative AI Prompt Engineer acts as an intermediary between the Simulation Engineer, RL Engineer, and the simulation environment. It leverages previously created digital twins, selects the most relevant cases, and iteratively adapts them by querying both the Simulation Engineer and the simulator itself. Second, a Generative AI Assistant, built on large-scale language and reasoning models (e.g., GPT-5 or Gemini 2.5 Pro), transforms these prompts into executable models and training configurations.
In this revised workflow, the Simulation Engineer collaborates with the Prompt Engineer rather than directly modeling every detail, while the RL Engineer defines task parameters through the same AI layer. The Assistant then generates and refines the robot and environment models inside the simulation platform, which now serves mainly as an observation and validation environment rather than the primary design tool. By shifting model creation and configuration toward generative AI agents, the process reduces manual effort, accelerates iteration, and enables dynamic reuse of prior design knowledge (more information is available on the roboGen-LLM project page: https://github.com/aalgirdas/roboGen-LLM (accessed on 26 October 2025)).

4. Design Patterns for Dual-Twin Reinforcement Learning

The architecture of reinforcement learning systems within virtual environments relies on interaction between simulation tools, RL frameworks, and machine learning algorithms. In our previous work [6], we introduced a set of design patterns for applying RL in the Webots simulator. This paper extends and substantially refines those patterns to support a more robust and scalable dual digital twin methodology, integrating both Webots and MuJoCo and leveraging Generative AI to streamline development. The following three patterns represent a complete workflow, from accelerated training in parallel simulations to the deployment of a Supervisor-free model ready for real-world transfer.

4.1. Pattern 1: The Dual Digital Twin for Accelerated Training and Synchronization

The foundational pattern, illustrated on the left side of Figure 4, is designed for rapid RL experimentation and policy development. It moves beyond a single simulator to embrace a “digital twin of a digital twin” concept. This pattern is composed of a primary RLModel class, which interfaces with a generalized RLRobot class.
To leverage the unique strengths of different simulators, RLRobot now serves as a parent class to specific implementations:
  • WebotsRobot: A high-fidelity twin representing the robot in a visually and physically realistic environment like Webots, ideal for validation and fine-tuning.
  • MuJoCoRobot: A computationally efficient twin optimized for rapid simulation, allowing for massively parallel training on GPU-accelerated hardware.
In this updated context, the Webots Supervisor is no longer viewed as a limitation for real-world transfer but as a tool for training. It provides a ground truth for the environment’s state, enabling synchronization of the WebotsRobot and MuJoCoRobot. This ensures that behaviors learned in the fast MuJoCo environment are consistent with the high-fidelity Webots twin, facilitating effective policy convergence and robust sim-to-sim transfer.
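The class relationship described in this pattern can be sketched as follows. This is a minimal illustration of our own: the class names RLModel, RLRobot, WebotsRobot, and MuJoCoRobot come from the text, but the method names and stub bodies are assumptions, not the framework's actual interface.

```python
from abc import ABC, abstractmethod

class RLRobot(ABC):
    """Simulator-agnostic robot interface used by the RLModel (Pattern 1).
    Method names are illustrative; the actual interface may differ."""
    @abstractmethod
    def read_sensors(self) -> list[float]: ...
    @abstractmethod
    def apply_action(self, action: list[float]) -> None: ...

class MuJoCoRobot(RLRobot):
    """Computationally efficient twin used for rapid, parallel training."""
    def read_sensors(self): return [0.0, 0.0]   # stub: would read qpos/qvel
    def apply_action(self, action): pass        # stub: would set the ctrl vector

class WebotsRobot(RLRobot):
    """High-fidelity twin used for validation and fine-tuning."""
    def read_sensors(self): return [0.0, 0.0]   # stub: would read Webots devices
    def apply_action(self, action): pass        # stub: would command motors

class RLModel:
    """Policy wrapper that works with either twin through the shared interface."""
    def __init__(self, robot: RLRobot):
        self.robot = robot
    def step(self, action):
        self.robot.apply_action(action)
        return self.robot.read_sensors()

obs = RLModel(MuJoCoRobot()).step([0.5])
print(obs)
```

Because both twins satisfy the same RLRobot contract, the same RLModel can be trained against MuJoCoRobot and validated against WebotsRobot without code changes.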

4.2. Pattern 2: Supervised Data Collection for Learned Perception

The second design pattern, shown on the right side of Figure 4, re-purposes the supervisor-slave architecture as a specialized data collection framework. Its primary goal is to generate a dataset for training a supervised learning model that can later emulate the Supervisor’s omniscience using only on-board sensors.
In this pattern, a SupervisorRobot (which still inherits from Supervisor to access environmental data) is equipped with a virtual camera. During simulation runs, it performs two critical functions simultaneously:
  • It provides the RLModel with ground-truth information about the environment (e.g., object positions, velocities) that is inaccessible to the SlaveRobot’s sensors.
  • It captures video footage from its camera, time-stamping each frame and pairing it with the corresponding ground-truth data from the Supervisor.
This process creates a rich, labeled dataset where sensor data (video) is directly correlated with precise environmental state information. This pattern is intentionally designed for visually realistic simulators like Webots, as the goal is to train a model on imagery that closely mirrors the real world. This data collection pattern is not intended for simulators like MuJoCo, which prioritize computational efficiency and physics accuracy over high-fidelity visual rendering.
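The pairing of camera frames with privileged Supervisor state can be sketched as below. This is an illustrative stub of our own; the record fields and the collect_labeled_pair helper are hypothetical, and a real run would read the Webots camera device and Supervisor node fields instead of the placeholder values used here.

```python
import time

def collect_labeled_pair(frame_id, camera_frame, supervisor_state):
    """Pair one camera frame with Supervisor ground truth (Pattern 2 sketch).
    Field names are illustrative, not the framework's actual schema."""
    return {
        "frame_id": frame_id,
        "timestamp": time.time(),          # time-stamp each frame, as in the text
        "image": camera_frame,             # raw pixels from the virtual camera
        "label": supervisor_state,         # privileged state, e.g. object pose
    }

dataset = []
for i in range(3):
    # stub frame and stub ground truth; a real run queries the simulator here
    dataset.append(collect_labeled_pair(i, [[0] * 4] * 4,
                                        {"object_xy": (0.1 * i, 0.0)}))
print(len(dataset), dataset[0]["label"])
```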

4.3. Pattern 3: Supervisor-Free Deployment with an AI-Enhanced Robot Server

The third design pattern (Figure 5) represents the final, deployable architecture, designed to operate without any reliance on a simulator’s Supervisor. It introduces a RobotServer, an external entity that manages the RL policy. Critically, this pattern integrates the model created in the previous stage.
The RobotServer now uses a Supervised Learning Model, which was trained on the data collected in Pattern 2. This model takes real-time sensor inputs (e.g., a camera feed) from the Robot instance and predicts the necessary environmental states that were previously provided by the Supervisor. This enables the robot to perceive and react to its environment in a realistic manner, using learned perception to close the loop instead of relying on unrealistic, privileged information. This pattern bridges the final gap, creating a system that can be transferred from the simulation to a physical robot.
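The serving loop described above can be sketched as follows. RobotServer and the supervised perception model are named in the text, but everything else here (method names, the trivial policy, the stub prediction) is an assumption of ours for illustration only.

```python
class SupervisedPerceptionModel:
    """Stand-in for the model trained on Pattern 2 data: predicts the state
    the Supervisor used to provide, from on-board sensors only."""
    def predict(self, camera_frame):
        # stub: a trained network would infer the object pose from pixels
        return {"object_xy": (0.0, 0.0)}

class RobotServer:
    """Supervisor-free serving loop (Pattern 3 sketch); names are illustrative."""
    def __init__(self, perception, policy):
        self.perception = perception
        self.policy = policy
    def step(self, camera_frame, proprio):
        state = self.perception.predict(camera_frame)  # learned, not privileged
        return self.policy(state, proprio)

# trivial policy: steer toward the predicted object position
policy = lambda state, proprio: state["object_xy"][0] - proprio["x"]
server = RobotServer(SupervisedPerceptionModel(), policy)
print(server.step(camera_frame=None, proprio={"x": 0.5}))
```

The key design point is that the policy's inputs are identical whether the state comes from the Supervisor (in simulation) or from the perception model (on hardware), which is what makes the final transfer possible.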

4.4. Automating the Workflow with Generative AI

A key contribution of our new framework is the use of a Generative AI (GAI) Agent to automate and accelerate the creation of these patterns. As illustrated in Figure 5, the development workflow is significantly streamlined. An engineer first implements the relatively straightforward First Design Pattern. This implementation then serves as a prompt for the GAI Agent, which automatically generates the more complex code required for:
  • The Second Design Pattern: Including the camera setup, data logging, and synchronization logic.
  • The Third Design Pattern: Involving the creation of the RobotServer, API endpoints, and the integration of the Supervised Learning Model.
This GAI-assisted approach reduces the engineering cost and complexity associated with developing a full-fledged, sim-to-real RL framework. It allows developers to focus on the core RL problem while leveraging AI to handle the implementation details of data collection and deployment architectures.

5. Physics-Based Alignment and Validation Framework

A core challenge in leveraging a dual digital twin architecture is ensuring behavioral consistency across different physics simulators. Discrepancies in how engines like MuJoCo and Webots model forces, contacts, and friction can create a sim-to-sim gap, where a policy optimized in one environment fails to perform adequately in the other. To bridge this gap, we introduce a systematic, semi-automated framework for physics-based alignment and validation.
This process of synchronizing two digital environments is analogous to the well-established challenge of synchronizing a single digital twin with its physical counterpart. Research in sim-to-real transfer often focuses on updating a digital twin’s parameters by observing the physical system to account for factors like mechanical wear, sensor noise, and unmodeled dynamics. Just as those frameworks use real-world operational data to ground the digital model in reality, our framework uses cross-simulator data to ground the two digital twins in a shared, consistent physical behavior. This alignment is critical for ensuring that an RL policy is transferable and robust.

5.1. Framework Architecture and Workflow

The framework is built around three key components: a centralized test specification, an orchestrated data acquisition system, and an AI-assisted divergence analysis module. The entire process is designed to be iterative, progressively minimizing the behavioral delta between the Webots and MuJoCo twins.
1. Unified Test Scenario Definition. To ensure that both digital twins are subjected to identical experiments, we define a series of physics-based tests in a single JSON file (see test_scenarios.json on https://github.com/aalgirdas/roboGen-LLM/tree/main/test_orchestrator/test_scenarios (accessed on 26 October 2025)). This file acts as the single source of truth for the entire validation suite. It specifies:
  • Global Parameters: Simulation settings like timestep and failure conditions that are common across all tests.
  • Individual Scenarios: A list of discrete tests, each with a specific name, duration, and a set of initial conditions (e.g., initial_angle for the pole) and prescribed actions (e.g., wheel_velocity for the cart). This structured approach allows engineers—or a generative AI—to design targeted tests that probe specific aspects of the robot’s dynamics, such as step responses, impulse disturbances, or stability under constant actuation.
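A hypothetical fragment of such a scenario file is shown below, parsed in Python. The field names mirror those mentioned in the text (initial_angle, wheel_velocity), but the actual test_scenarios.json in the repository may be structured differently.

```python
import json

# Illustrative structure only; see the repository for the real file.
SCENARIOS = json.loads("""
{
  "global_settings": {"timestep_ms": 16, "failure_angle_rad": 0.8},
  "scenarios": [
    {"name": "step_response", "duration_s": 5.0,
     "initial_conditions": {"initial_angle": 0.05},
     "actions": {"wheel_velocity": 2.0}},
    {"name": "impulse_disturbance", "duration_s": 3.0,
     "initial_conditions": {"initial_angle": 0.2},
     "actions": {"wheel_velocity": 0.0}}
  ]
}
""")

for sc in SCENARIOS["scenarios"]:
    print(sc["name"], sc["duration_s"])
```

Because both twins load the same file, any behavioral difference observed later is attributable to the simulators rather than to the experimental setup.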
2. Orchestrated Data Acquisition. Execution and data logging are handled by two distinct Python components running concurrently within the simulator environment; the same model can be adapted to any supported simulator, such as MuJoCo or Webots:
  • TestRobot script is the robot’s main controller. It reads the test_scenarios.json file, applies the prescribed actuator commands (e.g., setting wheel velocity), and logs data from its on-board sensors, such as the pole’s position sensor.
  • TestOrchestrator script runs as a Webots Supervisor, a “privileged” entity that can observe and manipulate the entire simulation. Its role is to provide “ground truth” data that the robot cannot sense itself, such as its absolute position in the world. It also reads the scenario file to set the initial state of the environment (e.g., setting the pole’s starting angle) and logs its observations in a separate data file.
This dual-controller approach ensures a comprehensive dataset is collected, capturing both the robot’s internal state and its external, world-frame state. The same test scenarios are run independently in all simulation environments, generating a parallel dataset for comparison.
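The logging side of this dual-controller design can be sketched as follows. This is a simplified stand-in of our own: the DataLogger name comes from the text, while run_scenario, its stub dynamics, and the CSV schema are assumptions, with the actual Webots sensor and motor calls replaced by placeholders.

```python
import csv
import io

class DataLogger:
    """Minimal time-stamped logger; writes CSV rows to any file-like object."""
    def __init__(self, stream, fields):
        self.writer = csv.DictWriter(stream, fieldnames=fields)
        self.writer.writeheader()
    def log(self, **row):
        self.writer.writerow(row)

def run_scenario(duration_s, timestep_s, wheel_velocity):
    """One TestRobot-style run: apply the prescribed action at each step and
    log an on-board sensor reading (stubbed instead of a Webots device)."""
    out = io.StringIO()
    logger = DataLogger(out, ["t", "wheel_velocity", "pole_angle"])
    angle = 0.05
    for step in range(round(duration_s / timestep_s)):
        angle *= 0.99   # stub dynamics; the real value comes from the sensor
        logger.log(t=round(step * timestep_s, 3),
                   wheel_velocity=wheel_velocity,
                   pole_angle=round(angle, 5))
    return out.getvalue()

log_text = run_scenario(duration_s=0.05, timestep_s=0.01, wheel_velocity=2.0)
print(log_text.splitlines()[0])   # header row
```

Running the identical scenario in both simulators yields two logs with the same schema, which is what makes the row-by-row divergence comparison possible.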

5.2. Component Design and Process Flow

The static structure of the physics-based alignment framework is illustrated in the class diagram in Figure 6. The design is centered on a RobotController base class, which is responsible for running the test suite (run_test_suite). This controller uses two key components:
  • ScenarioConfiguration: This class loads the unified test definitions from the JSON file. It holds the global_settings and a list of individual Scenario objects, each detailing conditions and actions for a specific test.
  • DataLogger: A utility class used by the controllers to record time-stamped data to a log file during simulation.
Two specialized classes inherit from RobotController to manage the experiment:
  • TestRobot: This class acts as the robot’s main controller. It reads its position_sensor and applies prescribed actions by controlling its wheels (e.g., via set_robot_speed).
  • TestOrchestrator: This class represents a privileged “supervisor” entity. It has access to the entire simulation (robot_node, pole_node) and is responsible for setting the environment’s initial state (via set_initial_state) to match the scenario’s conditions.
After the test suite is executed in both simulators, the GenerativeAIAgent is designed to read the log files produced by the DataLogger. It then performs divergence analysis (analyze_divergence) and provides suggested parameter updates to the engineer.
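A minimal skeleton of this class structure, mirroring the Figure 6 diagram, is sketched below. Method bodies are placeholders standing in for the real simulator calls; only the class relationships are taken from the text.

```python
from dataclasses import dataclass, field

# Skeleton mirroring the Figure 6 class diagram; bodies are placeholders,
# not the repository implementation.
@dataclass
class Scenario:
    name: str
    duration_s: float
    initial_conditions: dict = field(default_factory=dict)
    actions: dict = field(default_factory=dict)

@dataclass
class ScenarioConfiguration:
    global_settings: dict
    scenarios: list  # list of Scenario objects loaded from the JSON file

class DataLogger:
    """Records time-stamped observations during a simulation run."""
    def __init__(self):
        self.rows = []
    def log(self, timestamp, **values):
        self.rows.append({"t": timestamp, **values})

class RobotController:
    def __init__(self, config: ScenarioConfiguration, logger: DataLogger):
        self.config, self.logger = config, logger
    def run_test_suite(self):
        for scenario in self.config.scenarios:
            self.run_scenario(scenario)
    def run_scenario(self, scenario):  # specialized by subclasses
        raise NotImplementedError

class TestRobot(RobotController):
    def run_scenario(self, scenario):
        # would read position_sensor and call set_robot_speed here
        angle = scenario.initial_conditions.get("initial_angle", 0.0)
        self.logger.log(0.0, pole_angle=angle)

class TestOrchestrator(RobotController):
    def run_scenario(self, scenario):
        # would call set_initial_state on robot_node / pole_node here
        self.logger.log(0.0, cart_x=0.0)
```

The same `RobotController.run_test_suite` loop drives both specializations, which is what guarantees that robot-side and supervisor-side logs cover identical scenarios.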
The activity diagram in Figure 7 illustrates the dynamic, two-stage workflow of the physics-based alignment process. The first stage, shown on the left side of Figure 7, details the data acquisition process for a single test run. The second stage, on the right, depicts the iterative, AI-assisted alignment loop that uses this data.
The data acquisition process begins with defining the unified test scenarios, which are loaded from a JSON file. For each test, the TestOrchestrator first sets the environment’s initial state (e.g., setting the pole’s starting angle) as specified in the scenario. The simulation then enters its main loop, running for a predetermined duration. In each step, the TestRobot executes the prescribed actions (e.g., setting wheel velocity). Simultaneously, both the TestRobot (logging on-board sensor data) and the TestOrchestrator (logging ground-truth data) record their observations via the DataLogger. Once the test duration is complete, the loop terminates, and the data log file is generated. This entire process is executed independently in both the Webots and MuJoCo environments to produce parallel datasets for comparison.
The second stage is the AI-assisted divergence analysis and alignment loop, shown on the right side of Figure 7. After data collection, the log files from both simulators are compared by the GenerativeAIAgent or a human engineer. This analysis involves:
  • Quantifying Divergence: The agent computes statistical measures of divergence, such as the Mean Squared Error (MSE) or Dynamic Time Warping (DTW) distance, between the time-series data from the two simulators.
  • Identifying Root Causes: By correlating divergence with specific test scenarios, the agent can hypothesize the cause of the mismatch (e.g., “divergence is highest in high-velocity tests, suggesting a discrepancy in friction coefficients”).
  • Suggesting Corrections: The agent proposes specific changes to the robot’s model files (e.g., URDF or Webots .proto files), such as adjusting mass, inertia, joint damping, or calculating proportional coefficients to scale actuator signals.
If the behaviors are not aligned, an engineer applies these suggested corrections to the twin models. The entire test suite is then executed again, starting a new cycle of the loop. This iterative process of testing, analysis, and correction systematically reduces the sim-to-sim gap. Once the behaviors are successfully aligned, the process concludes, and the digital twins are considered synchronized and ready for robust RL policy transfer.
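The two divergence measures named above can be implemented directly; the following is a plain-Python sketch (the repository may instead rely on library implementations such as NumPy or a dedicated DTW package).

```python
# Divergence metrics for comparing time series logged by the two
# simulators: mean squared error and a basic dynamic time warping distance.
def mse(a, b):
    """Mean squared error between two equal-length time series."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping distance."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # insertion
                                 cost[i][j - 1],      # deletion
                                 cost[i - 1][j - 1])  # match
    return cost[n][m]
```

MSE penalizes amplitude mismatch at each timestamp, while DTW tolerates time shifts, which makes it the more informative metric for lagged trajectories such as the delayed MuJoCo pole fall discussed in Section 6.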

6. Results and Discussion

Preliminary Results on LLM-Based Model Generation

The preliminary evaluation of the proposed framework focused on the automated generation of MuJoCo robot models using large language models (LLMs), based on existing Webots descriptions and textual technical specifications. The objective was to assess the capacity of generative AI to produce simulation-ready digital twins and to quantify the degree of human guidance required for convergence. Figure 8 illustrates two representative case studies—one successful Webots-to-MuJoCo conversion and one more challenging end-to-end synthesis task.
The upper row of Figure 8 presents the CartPole system, a classical benchmark in reinforcement learning. The left image shows the manually engineered Webots model used in the experiments, while the right image depicts its automatically generated MuJoCo counterpart, obtained through an iterative ChatGPT-based model conversion procedure.
This process required 14 iterative interactions, during which the LLM progressively corrected syntax errors, refined dynamic parameters, and achieved structural consistency with the reference model. At each step, the model was provided with short, valid MuJoCo examples to preserve syntactic and semantic grounding. The resulting XML description successfully compiled and reproduced the expected CartPole dynamics.
This experiment demonstrates that, when supplied with contextual examples and guided refinement, an LLM can perform cross-simulator model translation, significantly reducing manual engineering effort while maintaining physical plausibility.
The lower row of Figure 8 illustrates a more complex case involving the Pioneer 3-AT mobile robot. The left image shows the reference model derived from its technical specification, whereas the right image shows the corresponding MuJoCo model generated by ChatGPT using only primitive geometric elements.
Unlike the previous example, this case lacked a Webots reference model; the LLM was guided solely by textual descriptions of dimensions, mass distribution, and wheel geometry. After five refinement iterations, the model compiled successfully. However, it required further modification to achieve realistic kinematics and control fidelity.
These results indicate that while LLMs can infer approximate structural relationships from natural language, complex multi-body systems still demand subsequent alignment and parameter optimization, as proposed in the physics-based workflow described in Section 5.
The experimental results reveal both the promise and current limitations of LLM-assisted digital twin generation. Three main observations can be drawn:
  • Contextual grounding is crucial. The availability of valid MuJoCo syntax examples was decisive for the success of the CartPole case, confirming that carefully selected context substantially enhances reliability.
  • Iterative refinement is essential. Multiple feedback cycles were necessary to achieve convergence, supporting the principle that LLM-guided model synthesis benefits from structured, iterative alignment rather than one-shot generation.
  • Complexity scaling remains a challenge. For robots with multiple degrees of freedom or compound geometries, the LLM’s reasoning accuracy decreases when limited to textual cues. Nevertheless, the generation of syntactically valid models provides a valuable initialization for downstream physics-based optimization.
Overall, these findings validate the feasibility of integrating generative AI into the dual digital twin workflow. The combination of language-driven model synthesis and quantitative alignment offers a scalable pathway to automated cross-simulator consistency, reducing engineering overhead and accelerating reinforcement learning experiments across heterogeneous simulation environments.
The second set of results evaluates the physics-based alignment and validation framework described in Section 5, using the CartPole system as a benchmark. This framework executes a suite of controlled experiments defined in a unified configuration file (test_scenarios.json), which specifies parameters such as initial conditions, actuator commands, simulation durations, and failure criteria. The full suite, available for verification in the project’s GitHub repository, encompasses numerous scenarios probing various dynamics (e.g., step responses, impulse disturbances, and stability tests); here we highlight two experiments visualized in Figure 9. These focus on the dynamics of a pole fall, revealing inherent discrepancies between the Webots and MuJoCo physics engines.
Figure 9 compares trajectories from identical initial conditions in both simulators: a near-vertical pole starting at 0.001 radians with no active control applied, allowing it to fall under gravity until hitting a 1 radian barrier. The left plot shows pole angle versus time. In Webots (blue), the pole accelerates smoothly, reaching the barrier in approximately 1.5 s. In contrast, MuJoCo (orange) exhibits a delayed onset of motion, with the fall trajectory lagging by more than 0.5 s throughout. This delay suggests differences in how the engines model low-friction joints and gravitational forces at near-equilibrium states, where MuJoCo’s contact and constraint solvers may introduce subtle numerical damping.
The right plot depicts cart position versus time during the same fall. Both simulators show the cart drifting due to the unbalanced torque from the falling pole, but with notable divergences. In Webots, the cart displaces by about 1 mm before stabilizing, reflecting a tightly constrained response. MuJoCo, however, results in a larger excursion, ending approximately 1 cm further from the origin. This amplified displacement highlights variances in friction modeling, inertia propagation, and contact resolution between the engines.
These experiments underscore the value of the framework in quantifying sim-to-sim gaps. Without such validation, policies trained in MuJoCo might overcompensate for exaggerated dynamics, leading to instability when transferred to Webots or reality. By feeding these divergences into the generative AI agent for parameter suggestions (e.g., scaling friction coefficients or adjusting integrator tolerances), the framework enables iterative alignment, enhancing robustness for dual-twin RL workflows.
Additionally, we performed a cross-simulator transfer test to validate the dual digital-twin workflow. A policy trained in MuJoCo (continuous controller) was exported and executed in the Webots CartPole environment without retraining. Table 1 reports results over 10 evaluation episodes. The MuJoCo-trained policy obtained a mean reward of 991.9 ± 14.5 in MuJoCo and, when transferred directly, achieved 561.4 ± 72.7 in Webots (mean ± s.d.). Using the physics-based alignment procedure described in Section 5, we applied targeted parameter adjustments (joint damping +75%, friction scaling factor ×0.8) suggested by the divergence analysis. After alignment, the transferred policy’s performance in Webots increased to 869.7 ± 31.1, corresponding to a 55% improvement in mean episode reward compared to the unaligned transfer. The remaining gap to MuJoCo is primarily attributable to residual differences in contact handling and integrator behavior (wheel jitter and low-velocity numerical damping). These results demonstrate that (i) policies trained in MuJoCo can be executed in Webots with reasonable performance after alignment, and (ii) the proposed alignment loop materially reduces sim-to-sim divergence and improves transfer outcomes.
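The reported 55% gain follows directly from the Table 1 means; the short check below reproduces the arithmetic (the helper name is ours, introduced only for illustration).

```python
# Reproduce the transfer-improvement figure from the Table 1 mean rewards.
def relative_improvement(before, after):
    """Percent change relative to the unaligned-transfer baseline."""
    return (after - before) / before * 100.0

unaligned_webots = 561.4   # direct MuJoCo -> Webots transfer
aligned_webots = 869.7     # after physics-based alignment

gain = relative_improvement(unaligned_webots, aligned_webots)  # ~54.9%
```

Rounding 54.9% to the nearest integer yields the 55% improvement quoted in the text.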
To evaluate the generalizability of the proposed alignment methodology beyond simple benchmark systems, we performed an additional test on the Pioneer 3-AT mobile robot. Instead of training a full RL controller, we used a minimal dynamics-only validation scenario in which both simulators executed an identical constant wheel-velocity command for 3 s. In this baseline test, the MuJoCo model travelled 0.92 m while the Webots model travelled 0.84 m, yielding an 8.7% difference in forward displacement. A small but measurable sideways deviation also appeared in both simulators (1.8 cm in MuJoCo and 0.5 cm in Webots). Such lateral drift is expected in simulation because the wheels are represented as mesh geometries, making the resulting contact patch, micro-slip behavior, and friction distribution sensitive to each simulator’s physics solver, friction model, and numerical integration scheme. Using divergence cues from our alignment workflow, we modified the wheel-base parameters and reduced friction within the MuJoCo model. After these adjustments, the forward-displacement difference decreased to 3.2%, and sideways deviation was reduced to 0.9 cm in MuJoCo and 0.6 cm in Webots. Although this experiment does not involve policy learning, it demonstrates that the alignment procedure extends to a more complex multi-body system and effectively reduces cross-simulator divergence.
To substantiate the claim that generative AI reduces engineering overhead, we conducted a small-scale evaluation with a group of seven students who constructed robot models both manually and using our LLM-assisted workflow. The students replaced part of their coursework with this activity and were motivated to complete both modeling approaches under comparable conditions. Table 2 reports the averaged results. For the CartPole robot, manual modeling required 3.4 h on average, whereas the LLM-assisted workflow required 1.2 h, corresponding to a 64.7% reduction. For the more complex Pioneer 3-AT robot, manual construction took 7.4 h on average, while the LLM-assisted process required 2.9 h, yielding a 60.8% reduction. These results provide quantitative evidence that the proposed generative-AI approach significantly decreases modeling effort, especially for multi-body systems with nontrivial geometries.
To strictly quantify the alignment, we measured the trajectory divergence before and after applying the framework. In the baseline comparison, the unaligned MuJoCo model exhibited a significant lag, reaching the 1 radian threshold at 2.06 s compared to 1.52 s in Webots. This discrepancy produced a Dynamic Time Warping (DTW) distance of 0.184 and a Mean Squared Error (MSE) of 0.029. Following the optimization of joint damping and friction coefficients based on our divergence analysis, the sim-to-sim gap was substantially reduced. The post-alignment fall time in MuJoCo converged to 1.61 s, reducing the DTW distance to 0.061 and the MSE to 0.009. These metrics correspond to a quantitative reduction in simulation divergence of approximately 67% to 69%, statistically substantiating the effectiveness of the physics-based alignment loop.
The final set of results examines the reinforcement learning performance of the CartPole system across different configurations, as depicted in Figure 10.
To ensure full reproducibility of the reported results, the reinforcement learning experiments were conducted using a standardized PPO configuration across all trials. The hyperparameters were fixed as follows: a learning rate of 3 × 10⁻⁴, a discount factor γ = 0.99, and a batch size of 4096 environment steps. The policy and value networks shared a uniform architecture of two fully connected layers (size 64) with ReLU activations. Training was executed for 1–2 million steps using consistent random seeds to ensure valid cross-simulator comparisons. Detailed configuration files, including the specific random seeds and simulator integration scripts, are documented in the accompanying code repositories to facilitate independent verification.
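For reference, the fixed hyperparameters can be collected in a single configuration object, shown here together with the discounted return that the γ = 0.99 setting implies. The dictionary layout is our own sketch; the trainer wiring (e.g., a Stable-Baselines3 PPO instance) is library-specific and omitted.

```python
# Sketch of the fixed PPO configuration described in the text.
PPO_CONFIG = {
    "learning_rate": 3e-4,
    "gamma": 0.99,
    "batch_size": 4096,       # environment steps per update
    "net_arch": [64, 64],     # two fully connected layers, ReLU
    "total_steps": 2_000_000, # upper end of the 1-2M step budget
}

def discounted_return(rewards, gamma=PPO_CONFIG["gamma"]):
    """Compute sum_t gamma^t * r_t, the objective the policy maximizes."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

With γ = 0.99, a reward of 1 per balanced step gives a horizon of roughly 100 effective steps, which is why long CartPole episodes still produce informative gradient signals.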
The top left plot illustrates an experiment in Webots with a discrete action space, where the controller sets the cart’s speed to either −1.5 m/s or 1.5 m/s. The reward, defined as the number of steps until the pole tilts beyond 0.3 radians or reaches a 2000-step threshold, is plotted against episode number. The learning curve shows a rapid increase in average score, stabilizing around 1500 steps, indicating effective policy convergence in this environment.
The top right plot replicates the same experiment in MuJoCo with the same discrete actions. Here, learning progresses noticeably more slowly, with the rolling mean score plateauing around 400 steps. This discrepancy arises from a subtle “jumping effect” observed on the cart’s wheels in MuJoCo, which introduces additional instability that complicates stabilization and delays policy optimization. This highlights the impact of simulator-specific dynamics on RL performance and underscores the need for alignment strategies.
The bottom central plot presents a modified experiment in MuJoCo, where the action space is continuous, allowing the controller to set speeds within the interval (−1.5, 1.5) m/s. This adjustment results in a significantly improved learning rate, with the rolling mean score rising sharply toward 500 steps. The smoother action space mitigates the jumping effect, enabling more precise control and faster convergence, demonstrating the sensitivity of RL outcomes to action granularity and simulator characteristics.
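The two action-space variants compared above differ only in how a policy output is mapped to a wheel speed; a minimal sketch of both mappings follows (function names are ours).

```python
# Discrete vs. continuous action mapping for the CartPole experiments.
MAX_SPEED = 1.5  # m/s, the speed bound used in all three experiments

def discrete_action_to_speed(action):
    """Map a binary action {0, 1} to a fixed speed of -1.5 or +1.5 m/s."""
    return -MAX_SPEED if action == 0 else MAX_SPEED

def continuous_action_to_speed(command):
    """Clip a continuous command into the allowed interval (-1.5, 1.5)."""
    return max(-MAX_SPEED, min(MAX_SPEED, command))
```

The continuous mapping lets the policy command arbitrarily small corrections, which is precisely what damps the high-frequency wheel jitter that destabilizes the discrete controller in MuJoCo.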
The differences in training outcomes between Webots and MuJoCo arise from fundamental discrepancies in their physical modeling and numerical solvers. Webots relies on ODE-based contact dynamics with a friction–pyramid approximation, producing relatively stiff wheel–ground interactions and minimal micro-slip. MuJoCo, in contrast, employs a compliant-contact formulation and implicit constraint solver that allows small amounts of lateral slip and wheel jitter, particularly at low velocities. These differences affect the controllability of the cart: the discrete-action experiments in MuJoCo experience intermittent destabilization due to small, solver-induced perturbations, whereas Webots exhibits smoother, more deterministic motion. Action-space granularity further amplifies these effects, with continuous control reducing high-frequency instabilities in MuJoCo. Together, these simulator-specific dynamics lead to distinct learning curves even under identical RL hyperparameters and training conditions.
These results collectively validate the dual digital twin framework’s potential to enhance RL robustness and transferability. The LLM-assisted model generation reduces engineering overhead, while the physics-based alignment framework quantifies and mitigates sim-to-sim discrepancies, as evidenced by the CartPole dynamics analysis. Furthermore, the RL experiments reveal how environment-specific dynamics and action spaces influence learning efficiency, with continuous control offering a promising direction for optimization. Together, these findings support scalable, automated workflows that bridge heterogeneous simulators, setting a foundation for future research into sim-to-real transfer and adaptive robotic systems.

7. Conclusions

This paper introduced an extended reinforcement learning framework designed to address critical challenges in sim-to-sim and sim-to-real policy transfer for robotics. The core of our contribution is a dual digital twin concept, which leverages the complementary strengths of two distinct simulators: MuJoCo for its computationally efficient physics engine, ideal for rapid policy optimization, and Webots for its high-fidelity rendering and ROS integration, serving as the primary validation environment.
To make this dual-twin approach viable, we proposed two key methodologies. First, we introduced a physics-based alignment and validation framework to systematically quantify and minimize the “sim-to-sim gap”. By executing identical, controlled test scenarios in both environments, this framework successfully identified and measured specific behavioral divergences. Our experiments with the CartPole system, for example, revealed tangible discrepancies in low-friction dynamics, including a 0.5 s lag in MuJoCo’s pole fall trajectory and significant differences in cart displacement under gravity. These findings underscore the necessity of active alignment before policy transfer.
Second, to address the high engineering cost of maintaining two high-fidelity models, we leveraged generative AI in two capacities:
  • Automated Model Generation: We demonstrated that LLMs can automate the creation of a secondary digital twin, successfully translating a Webots CartPole model into a functional MuJoCo equivalent through an iterative, guided process.
  • Workflow Acceleration: We proposed a GAI-assisted workflow where an AI agent, given a simple implementation of our first design pattern, can automatically generate the more complex code for data collection and supervisor-free deployment patterns.
Our results demonstrate the feasibility of a GAI-augmented, dual-twin workflow, providing a scalable method for bridging heterogeneous simulators and accelerating the development of robust RL policies. While these findings are promising, the experiments highlight several current limitations that warrant further attention. LLM-driven generation of complex models, such as the Pioneer 3-AT, still required significant manual refinement to achieve physical accuracy. Furthermore, our RL experiments confirmed that policy convergence is highly sensitive to simulator-specific dynamics, such as the wheel instability observed in MuJoCo, and the granularity of the action space, emphasizing the continued need for rigorous alignment.
An important observation from this study is the complementary strength of the two simulators. MuJoCo’s constraint solver and contact modeling provided a more precise and numerically stable representation of joint and collision dynamics, allowing it to function as an effective reference model when diagnosing deviations in Webots. In practice, this means that the dual-twin workflow not only supports sim-to-sim transfer but also enables MuJoCo to serve as a quantitative validator for Webots models before deployment on physical hardware. This property strengthens the practical relevance of the framework, as it offers a systematic method for identifying and correcting modeling inconsistencies.
While this work validates the fundamental methodology, we acknowledge the preliminary nature of the current results. The validation, while providing strong quantitative evidence of reduced sim-to-sim error, is constrained to low-degree-of-freedom systems (CartPole and Pioneer 3-AT). To fully generalize the approach, future research must focus on enhancing the generative agent’s ability to reason about complex multi-body dynamics, fully automating the alignment feedback loop, and extending this statistical validation to high-degree-of-freedom manipulators to enable reliable sim-to-real transfer to physical robotic hardware.

Author Contributions

Conceptualization, A.L., A.Š. and D.M.; methodology, A.L., A.Š. and D.M.; software, A.L. and A.Š. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The project code and data are available online at (https://github.com/aalgirdas/WebotsRL), (https://github.com/aalgirdas/roboGen-LLM), and (https://github.com/aalgirdas/MuJoCoRL) (accessed on 26 October 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. Example of the development process used in this paper’s research project for the Universal Robots UR5e digital twins: (Left) the MuJoCo UR5e robot, (Center) the UR5e environment modeled in Fusion 360, and (Right) the simulated environment in Webots for robot testing.
Figure 2. Traditional workflow for digital twin creation and reinforcement learning training, showing the flow of requirements, models, and results between the Production Company, Simulation Engineer, RL Engineer, and the simulation environment.
Figure 3. AI-augmented workflow for digital twin creation. Generative AI agents support engineers by reusing prior cases, generating models, and configuring training, while the simulation environment is used mainly for observation and validation.
Figure 4. Design patterns for training and data collection in the dual-twin framework. (Left) The dual digital twin pattern for accelerated training, where different simulator implementations (WebotsRobot, MuJoCoRobot) inherit from RLRobot. (Right) The specialized data collection pattern, where a SupervisorRobot in a high-fidelity environment like Webots uses a camera to gather visual data paired with ground-truth states for training a perception model.
Figure 5. The Supervisor-free deployment pattern designed for real-world transfer. A RobotServer manages the policy and utilizes a Supervised Learning Model to infer environmental states from the robot’s sensors (e.g., camera). This replaces the need for privileged simulator information with learned perception. (Right) The proposed generative AI workflow, where the implementation of the first design pattern is used by a GAI Agent to automatically generate the code for the more complex second and third patterns.
Figure 5. The Supervisor-free deployment pattern designed for real-world transfer. A RobotServer manages the policy and utilizes a Supervised Learning Model to infer environmental states from the robot’s sensors (e.g., camera). This replaces the need for privileged simulator information with learned perception. (Right) The proposed generative AI workflow, where the implementation of the first design pattern is used by a GAI Agent to automatically generate the code for the more complex second and third patterns.
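The Supervisor-free pattern of Figure 5 amounts to composing a perception model with a policy. The class name RobotServer is taken from the figure; the callables `perception_model` and `policy` and the `step` method are illustrative stand-ins for trained networks, not code from the paper:

```python
class RobotServer:
    """Supervisor-free deployment: infer state from raw sensors instead of
    reading privileged ground truth from the simulator.

    perception_model: maps a camera image to an estimated state vector.
    policy: maps a state vector to an action.
    Both are assumed to be trained models; here they are plain callables.
    """

    def __init__(self, perception_model, policy):
        self.perception_model = perception_model
        self.policy = policy

    def step(self, camera_image):
        # Learned perception replaces the Supervisor's ground-truth state.
        state = self.perception_model(camera_image)
        return self.policy(state)
```

At deployment time the same `step` loop runs on the physical robot, since nothing in it references simulator internals.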
Figure 6. Class diagram of the physics-based alignment framework. The ScenarioConfiguration class loads test definitions from a JSON file, which are used by the TestRobot (robot controller) and TestOrchestrator (privileged supervisor). Both controllers use a DataLogger to record simulation data. The resulting log files are then consumed by the GenerativeAIAgent for divergence analysis and parameter suggestions.
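The ScenarioConfiguration class in Figure 6 loads test definitions from JSON. A minimal sketch, assuming a hypothetical schema with a scenario name, duration, and action list (the field names are invented for illustration):

```python
import json
from dataclasses import dataclass, field

@dataclass
class ScenarioConfiguration:
    """One physics test scenario parsed from the JSON test-definition file."""
    name: str
    duration_s: float
    actions: list = field(default_factory=list)

    @staticmethod
    def load_all(json_text: str) -> list:
        """Parse a JSON array of scenario definitions into configuration objects."""
        return [ScenarioConfiguration(**entry) for entry in json.loads(json_text)]

# Hypothetical test-definition file content.
EXAMPLE = '[{"name": "free_fall", "duration_s": 2.0, "actions": []}]'
scenarios = ScenarioConfiguration.load_all(EXAMPLE)
```

Both the TestRobot and the TestOrchestrator would consume the same parsed scenarios, which keeps the two controllers in lock-step over what is being tested.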
Figure 7. Activity diagram illustrating the Dual Digital Twin Alignment process. The workflow consists of two main stages: a simulation Data Acquisition stage (left) and an Iterative Alignment Loop (right) where the Generative AI Agent and human engineer analyze log file divergence and apply model corrections to synchronize the Webots and MuJoCo twins.
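The iterative alignment loop of Figure 7 can be sketched as a simple fixed-point iteration. The callables below (`run_webots`, `run_mujoco`, `suggest_correction`) are placeholders for the data-acquisition runs and the generative AI agent's parameter suggestions; the divergence metric (max pointwise trajectory gap) and tolerance are illustrative choices, not taken from the paper:

```python
def alignment_loop(run_webots, run_mujoco, suggest_correction,
                   tolerance=0.05, max_iters=10):
    """Iteratively reduce divergence between the two twins.

    run_webots / run_mujoco: callables returning a logged trajectory
    (list of floats) for the same test scenario.
    suggest_correction: plays the role of the Generative AI Agent; it
    receives the current divergence and updates the MuJoCo model parameters.
    Returns the final divergence reached.
    """
    divergence = float("inf")
    for _ in range(max_iters):
        reference = run_webots()                  # data acquisition, twin A
        candidate = run_mujoco()                  # data acquisition, twin B
        divergence = max(abs(a - b) for a, b in zip(reference, candidate))
        if divergence <= tolerance:
            break                                 # twins considered aligned
        suggest_correction(divergence)            # apply model correction
    return divergence
```

In the paper's workflow the correction step is proposed by the agent and reviewed by a human engineer; here it is collapsed into a single callable for brevity.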
Figure 8. Comparison between reference and ChatGPT-generated robot models: upper left—the CartPole model in Webots used for reinforcement-learning experiments; upper right—the automatically generated MuJoCo version of the same CartPole, obtained through an iterative ChatGPT-based model-conversion process (14 interactions). Lower left—the Pioneer 3-AT physical robot model from its technical specification; lower right—the MuJoCo simulation model of the Pioneer 3-AT created by ChatGPT using only geometric primitives, successfully compiled after 5 refinement iterations.
Figure 9. Comparison of dynamics during a CartPole fall: (Left) Pole angle (radians) versus time in Webots (blue) and MuJoCo (orange), showing a delay of over 0.5 s in MuJoCo; (Right) Cart position (meters) versus time, indicating a 1 cm displacement in MuJoCo versus 1 mm in Webots.
Figure 10. Reinforcement learning curves for CartPole: (Top Left) Webots discrete action space (−1.5, 1.5 m/s) with score up to 2000 steps; (Top Right) MuJoCo discrete actions with delayed convergence due to wheel instability; (Bottom Center) MuJoCo continuous action space (−1.5 to 1.5 m/s) with enhanced learning rate.
Table 1. Results of cross-simulator transfer test.
| Episode | MuJoCo Reward | Webots (Transferred, Before Alignment) | Webots (After Alignment) |
|---|---|---|---|
| 1 | 1000 | 673 | 914 |
| 2 | 1000 | 519 | 877 |
| 3 | 993 | 462 | 856 |
| 4 | 1000 | 593 | 848 |
| 5 | 987 | 577 | 879 |
| 6 | 1010 | 632 | 845 |
| 7 | 963 | 560 | 873 |
| 8 | 971 | 631 | 921 |
| 9 | 1000 | 472 | 866 |
| 10 | 995 | 495 | 818 |
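A quick sanity check on Table 1 quantifies the alignment gain. The snippet below simply recomputes episode means from the table's values; the "retained reward" ratio is an illustrative summary metric, not one defined in the paper:

```python
# Episode rewards copied from Table 1.
mujoco = [1000, 1000, 993, 1000, 987, 1010, 963, 971, 1000, 995]
before = [673, 519, 462, 593, 577, 632, 560, 631, 472, 495]
after  = [914, 877, 856, 848, 879, 845, 873, 921, 866, 818]

def mean(xs):
    return sum(xs) / len(xs)

# Fraction of the MuJoCo-side reward retained after transfer to Webots.
retained_before = mean(before) / mean(mujoco)
retained_after = mean(after) / mean(mujoco)
```

On these numbers the transferred policy retains roughly 57% of its MuJoCo reward before alignment and about 88% after, which is the divergence reduction the alignment loop is designed to deliver.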
Table 2. Average time expenditure per student for manual and LLM-assisted modeling workflows.
| Robot | Manual Modeling Time (h) | LLM-Assisted Modeling Time (h) | Reduction |
|---|---|---|---|
| CartPole | 3.4 | 1.2 | 0.647 |
| Pioneer 3-AT | 7.4 | 2.9 | 0.608 |
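The Reduction column in Table 2 is the fractional time saved relative to manual modeling, i.e. (manual − assisted) / manual. A one-liner reproduces both table values:

```python
# Times (hours) copied from Table 2.
times = {"CartPole": (3.4, 1.2), "Pioneer 3-AT": (7.4, 2.9)}

def reduction(manual_h: float, assisted_h: float) -> float:
    """Fractional reduction in modeling time, rounded to three decimals."""
    return round((manual_h - assisted_h) / manual_h, 3)
```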
Share and Cite

MDPI and ACS Style

Laukaitis, A.; Šareiko, A.; Mažeika, D. A Dual Digital Twin Framework for Reinforcement Learning: Bridging Webots and MuJoCo with Generative AI and Alignment Strategies. Electronics 2025, 14, 4806. https://doi.org/10.3390/electronics14244806
