Sim2Real Transfer of Imitation Learning of Motion Control for Car-like Mobile Robots Using Digital Twin Testbed

School of Engineering and Energy, College of STEM, Murdoch University, Perth, WA 6150, Australia
*
Author to whom correspondence should be addressed.
Robotics 2025, 14(12), 180; https://doi.org/10.3390/robotics14120180
Submission received: 13 October 2025 / Revised: 23 November 2025 / Accepted: 26 November 2025 / Published: 30 November 2025
(This article belongs to the Section AI in Robotics)

Abstract

Reliable transfer of control policies from simulation to real-world robotic systems remains a central challenge in robotics, particularly for car-like mobile robots. Digital Twin (DT) technology provides a robust framework for high-fidelity replication of physical platforms and bi-directional synchronization between virtual and real environments. In this study, a DT-based testbed is developed to train and evaluate an imitation learning (IL) control framework in which a neural network policy learns to replicate the behavior of a hybrid Model Predictive Control (MPC)–Backstepping expert controller. The DT framework ensures consistent benchmarking between simulated and physical execution, supporting a structured and safe process for policy validation and deployment. Experimental analysis demonstrates that the learned policy effectively reproduces expert behavior, achieving bounded trajectory-tracking errors and stable performance across simulation and real-world tests. The results confirm that DT-enabled IL provides a viable pathway for Sim2Real transfer, accelerating controller development and deployment in autonomous mobile robotics.

1. Introduction

Car-like Mobile Robots (CLMRs) have significantly enhanced efficiency in industry sectors such as home services, warehousing logistics, and intelligent transportation, among many others, and are increasingly supported by digital models throughout their entire lifecycle, from initial design through to operational deployment [1,2,3].
Owing to their high-fidelity modeling and bidirectional, real-time connectivity, Digital Twins (DTs) offer an accurate and dynamic representation of system behavior, enabling comprehensive insight into the operational state of physical entities and enhancing the adaptability and autonomy of robotic platforms [4,5,6,7,8].
In the domain of four-wheeled CLMRs, DT technology refers to an integrated framework that enables real-time, bi-directional synchronization between a physical robotic platform and its high-fidelity virtual counterpart [3,4,5,9]. According to recent work in this field, the DT architecture typically consists of four key elements: the physical entity, the digital entity, the interactive middleware layer, and the twin data and application services. The physical entity comprises components such as wheels, encoders, IMU, UWB modules, and embedded processors (e.g., STM32, Jetson NX) operating in real-world environments. Its digital twin is constructed using simulation platforms like Webots [4,5] or Unity3D [3], replicating the robot’s geometry, kinematics, and dynamics, and interfaced through the Robot Operating System (ROS). The interaction layer—enabled by ROS-TCP protocols [3] and Python-based nodes—facilitates continuous communication and command execution [4,5,9], maintaining synchronized motion via a virtual–physical mapping strategy. Meanwhile, the data layer logs and processes physical, virtual, and control datasets in structured databases to support functions such as remote control, simulation prediction, state monitoring, and environment reconstruction. Collectively, DT technology in this domain enables not only remote management and kinematic and dynamical simulation but also accelerates development and deployment cycles and supports data-driven autonomy.
Recent studies have highlighted simulation prediction as a critical function supported and enhanced by DT technology, particularly in scenarios where the physical instance of CLMR is temporarily offline. Within this context, the virtual representation of the robot facilitates the testing and optimization of advanced control algorithms—such as Extended State Observers (ESO) and Sliding Mode Control (SMC)—under realistic operating conditions. Owing to the high-fidelity replication of the physical system’s structural and dynamic characteristics, the simulation environment ensures that validated control strategies can be reliably transferred to the physical platform with minimal re-tuning [4,5].
This predictive capability not only reduces the frequency of physical trials but also accelerates controller development cycles, enhances system safety, and supports the deployment of robust, data-driven control logic.
Learning-based control methods, including Reinforcement Learning (RL) [1,10] and Imitation Learning (IL) [1,11,12], have become central to the development of autonomous robotic systems capable of performing complex tasks such as locomotion, manipulation, and navigation [13] and can enhance robustness to model parameter mismatches, achieve superior generalization on unseen tracks, and deliver significant computational benefits compared with model-based controllers [14]. These approaches learn control policies either through interaction with an environment guided by a reward signal (as in RL) or by mimicking expert demonstrations (as in IL). While training in simulated environments offers safety, speed, and flexibility, transferring the learned policies to real-world systems—a process known as Sim-to-Real (Sim2Real) transfer—remains a fundamental challenge due to the discrepancies in dynamics, sensing, and environmental variability between the virtual and physical domains [15].
To address this gap, several strategies have emerged. Domain randomization is widely used to expose agents to diverse simulated conditions, enhancing policy robustness and generalizability. Representation learning techniques, such as autoencoders, help the system extract relevant features from raw sensory input, while regularization strategies have been shown to improve the stability and transferability of policies. Additionally, digital twin technology has increasingly been adopted as a foundation for Sim2Real transfer [10,15,16,17,18]. By maintaining a high-fidelity, synchronized virtual replica of the physical system, digital twins enable accurate simulation of real-world conditions and iterative policy refinement. This integration allows control strategies to be tested, validated, and optimized virtually before deployment, thereby reducing real-world risk and accelerating development cycles. The convergence of digital twins with learning-based control thus represents a promising pathway for advancing the deployment of robust and adaptive robotic systems in dynamic real-world environments.
The key contributions of this paper are described as follows.
  • DT Framework for Learning-Based Control in CLMRs
This study presents a modular DT framework designed to support the development and deployment of learning-based control policies for CLMRs. The DT system integrates three core components: a physical instance (real robot with sensor–actuator suite), a high-fidelity virtual instance implemented in Webots, and an interactive communication layer built on ROS. Critically, the framework incorporates a data processing module that records pose and control signals from the expert controller, enabling structured data collection for training machine learning models. This architecture supports safe, scalable, and synchronized policy learning by providing a reliable interface for capturing expert trajectories, training neural networks offline, and evaluating model performance across both simulated and real-world domains.
  • Hybrid Expert Controller and Imitation Learning Pipeline
We develop a hybrid Model Predictive Control (MPC)–Backstepping expert controller for trajectory tracking, used to generate optimal control demonstrations. A neural policy is then trained offline via supervised IL to imitate the expert’s actions based on tracking errors. The learner replaces the classical controller with a lightweight, feedforward neural network that operates in real time without online optimization, enabling robust control policy distillation from structured expert behavior.
  • Sim2Real Transfer and Visualization via ROS-Webots Integration
The trained neural policy is deployed on the physical robot through a ROS Noetic-based Catkin workspace. Control commands are generated from real-time pose tracking errors and actuated on the robot, while the resulting odometry—captured via encoders and IMU—is streamed back to the Webots simulator. This feedback loop enables live visualization of real-world motion within the simulation environment, supporting consistent benchmarking, performance logging, and closed-loop Sim2Real evaluation through synchronized ROS topics and structured data management.

2. Digital Twin Framework

The digital twin testbed presented in Figure 1 has been tailored for a four-wheeled CLMR, specifically the Wheeltec "https://www.wheeltec.net/ (accessed on 25 August 2025)" platform. The DT system comprises three primary components: the physical instance, the virtual instance, and an interactive communication layer. The physical robot is equipped with diverse sensors, including an Inertial Measurement Unit (IMU), LiDAR, and Ultra-Wideband (UWB), alongside an STM32 controller and an NVIDIA Jetson Xavier NX, which facilitates real-time data acquisition and onboard processing for autonomous control. This physical instance is mirrored within Webots, where a high-fidelity simulation replicates the robot's kinematics, sensor behavior, and control mechanisms. The virtual instance serves not only as a testbed for controller development but also as a runtime environment for predictive modeling, diagnostics, and trajectory planning.
The interactive module, underpinned by a ROS-based communication interface, synchronizes sensor feedback and control commands between the simulated and physical domains in real time. ROS topics facilitate bi-directional data exchange, ensuring that both instances maintain consistent operational states during training and deployment phases. The simulation domain operates on Ubuntu 20.04 with ROS Noetic and Webots, while the physical domain leverages ROS Melodic on Jetson NX hardware. Platform-specific interoperability is achieved using communication layers and ROS topics. This modular, distributed system architecture supports real-time verification, remote monitoring, and robust Sim2Real deployment, making it suitable for iterative learning-based development in advanced robotics applications.
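As an illustration of the interactive layer, the following minimal rospy node sketches how odometry from the physical robot could be mirrored for the Webots supervisor while virtual-side commands are forwarded to the robot base. The topic names (/dt/physical_odom, /dt/virtual_cmd) are assumptions for illustration and not necessarily those used on the testbed.

```python
#!/usr/bin/env python
# Minimal sketch of a bridge node for the interactive layer (assumed topic
# names; the actual interface of the testbed may differ).
import rospy
from nav_msgs.msg import Odometry
from geometry_msgs.msg import Twist

class TwinBridge:
    def __init__(self):
        # Command computed in the virtual instance, forwarded to the robot base.
        self.cmd_pub = rospy.Publisher('/cmd_vel', Twist, queue_size=10)
        # Odometry from the physical robot, mirrored for the Webots supervisor.
        self.mirror_pub = rospy.Publisher('/dt/physical_odom', Odometry, queue_size=10)
        rospy.Subscriber('/odom', Odometry, self.on_odom)
        rospy.Subscriber('/dt/virtual_cmd', Twist, self.on_cmd)

    def on_odom(self, msg):
        self.mirror_pub.publish(msg)   # stream the real pose back to the simulator

    def on_cmd(self, msg):
        self.cmd_pub.publish(msg)      # actuate the physical robot

if __name__ == '__main__':
    rospy.init_node('dt_bridge')
    TwinBridge()
    rospy.spin()
```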

3. Methodology

In this research, we use the DT testbed for training and verifying our IL control policy. The verified policy is then transferred to the real world, measured against the performance criteria defined for the IL policy, and benchmarked against the simulation results. The expert controller guiding the learning process is a hybrid MPC–Backstepping architecture, designed to adhere to a reference speed profile while generating control inputs that enable the CLMR to accurately track a predefined trajectory.
Pivotal to this research is the implementation of a learning-based control framework, in which an IL policy is trained and verified within the digital twin environment before deployment to the physical robot. The expert controller used for generating training data comprises a hybrid MPC and Backstepping scheme. This expert adheres to a predefined speed profile and computes control inputs to enable accurate trajectory tracking in both simulation and reality. The IL agent, once trained on expert data within the virtual domain, is transferred to the physical platform and evaluated against performance metrics derived from the simulation. This enables a safe, efficient Sim2Real transfer by allowing controller validation under identical environmental and operational conditions.
This work addresses the challenge of deploying complex control logic in real-time robotic systems by replacing a hybrid Backstepping–MPC controller with a lightweight neural network policy, as depicted in Figure 2. We pose trajectory tracking as a supervised IL problem in which the robot learns to imitate expert control commands based on pose tracking errors. The goal is an efficient control policy that allows a wheeled mobile robot (WMR) to follow a time-varying reference trajectory in real time, replacing the model-based controller (Backstepping + MPC) with a learned neural policy that mimics expert behavior. The learner is trained on a fixed dataset of expert demonstrations collected prior to training. The state distribution during training therefore matches that of the expert, but any deviation during execution cannot be corrected, since no additional data is collected.
To close the Sim2Real loop, the trained neural controller is deployed to the physical robot using a ROS Melodic-based control stack within a Catkin workspace. The deployed expert and learner controllers are those tested and trained in Webots. Simultaneously, the robot's real-time odometry, obtained from wheel encoders and the IMU, is published over ROS topics and streamed back to the Webots virtual environment. This integration allows the physical instance's pose to be visualized within the simulation for live comparison against the reference trajectory. Logged data from both virtual and real domains is synchronized and archived for post-analysis using structured ROS bag and CSV formats, enabling quantitative evaluation of Sim2Real performance.

4. MPC-Backstepping Controller

In this work, the controller mechanism used as the expert controller combines MPC for longitudinal velocity planning and Backstepping Control for lateral and angular motion. The approach is implemented in Webots for a WMR virtual instance to track a reference trajectory.
The control architecture illustrated in Figure 3 begins with a trajectory planner that generates the desired reference pose (x_r, y_r, θ_r) with respect to the defined reference trajectory. This reference is compared against the actual robot pose (x, y, θ), producing the tracking errors e_x, e_y, and e_θ. These error signals, together with the predicted velocity output from the MPC (v_x, v_y, ω), form the input to the backstepping controller, which designs stabilizing virtual control laws to regulate the error dynamics. Specifically, the controller computes control signals u_x, u_y, u_θ that ensure that the robot remains aligned with the desired trajectory. The control signals are then transformed into individual feasible wheel speeds ω_i via an inverse kinematics module that accounts for the robot's geometry [4,5]. This hierarchical design ensures that the low-level wheel actuation remains consistent with high-level trajectory tracking objectives. By continuously feeding back the measured pose (x, y, θ), the structure closes the loop between system feedback and control, enabling the system to minimize the cumulative cost associated with deviations in the tracking errors e_x, e_y, e_θ. In this way, the diagram demonstrates how reference generation, error regulation, and actuator-level implementation are integrated into a coherent control framework for mobile robots. This hybrid design integrates the predictive foresight of MPC with the stabilizing structure of backstepping control, yielding a resilient closed-loop system capable of real-time trajectory tracking under varying speed profile mandates.

4.1. MPC Scheme

The MPC framework deployed in this study formulates the trajectory tracking problem as a finite-horizon optimization task, balancing tracking precision with control-effort smoothness. At each control step, the MPC minimizes a quadratic cost function defined over a sequence of future states and control inputs.
The utilized cost function penalizes deviations from the reference trajectory, both in position and orientation, while simultaneously discouraging abrupt or energetically inefficient control actions. The cumulative cost is computed as the sum of these weighted penalties over a finite prediction horizon, which enables the controller to consider both immediate tracking accuracy and long-term path feasibility. Several key parameters govern the behavior of this optimization, including the length of the prediction horizon, the discretization time step, and a set of tunable weighting coefficients. The prediction horizon determines how far ahead the controller looks to optimize future actions, while the time step controls the temporal granularity of the model. The weighting factors strike a trade-off between strict adherence to the trajectory and the practical feasibility of velocity commands. The optimization problem is constrained by the robot's kinematic limits and is solved at each control cycle using a nonlinear programming solver. The first control input in the optimal sequence is applied to the robot, after which the process repeats in a receding-horizon manner. To ensure stability and robust convergence, a backstepping controller is layered on top of the MPC output, shaping the system's error and control dynamics.
The cost function is evaluated over the robot's predicted future states across a horizon H, obtained from the discrete-time unicycle kinematic model. The optimal forward velocity profile \{v_x[k]\}_{k=0}^{H} is obtained by minimizing the quadratic cost:
J = \sum_{k=0}^{H} \left( Q \,\lVert p_k - p_r \rVert^2 + R \, v_x[k]^2 + G \,\bigl(v_x[k] - v_x[k-1]\bigr)^2 \right)
where Q is the weight on the position tracking error, R is the weight on the control effort, \lVert p_k - p_r \rVert^2 is the squared Euclidean norm of the position error between the predicted position p_k and the reference position p_r at step k, v_x[k] is the forward velocity applied at step k, and G > 0 is the agility weight.
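For concreteness, a minimal sketch of this receding-horizon velocity optimization is given below, using SciPy's nonlinear solver. The horizon length, time step, weights, velocity bound, and the simplified heading propagation (a constant reference yaw rate) are illustrative assumptions, not the tuned values of the deployed controller.

```python
# Sketch of the receding-horizon velocity optimization described above.
# Horizon, weights, and reference handling are illustrative assumptions.
import numpy as np
from scipy.optimize import minimize

H, dt = 10, 0.1            # prediction horizon and time step (assumed)
Q, R, G = 10.0, 0.1, 1.0   # tracking, effort, and agility weights (assumed)
v_max = 0.5                # kinematic limit on forward speed [m/s]

def rollout(v_seq, pose, omega_ref):
    """Propagate the discrete-time unicycle model for a candidate velocity profile."""
    x, y, theta = pose
    positions = []
    for k in range(H):
        x += v_seq[k] * np.cos(theta) * dt
        y += v_seq[k] * np.sin(theta) * dt
        theta += omega_ref * dt          # simplified: constant reference yaw rate
        positions.append((x, y))
    return np.array(positions)

def cost(v_seq, pose, p_ref, omega_ref, v_prev):
    # p_ref: (H, 2) array of reference positions along the horizon
    p = rollout(v_seq, pose, omega_ref)
    track = Q * np.sum((p - p_ref) ** 2)
    effort = R * np.sum(v_seq ** 2)
    agility = G * np.sum(np.diff(np.concatenate(([v_prev], v_seq))) ** 2)
    return track + effort + agility

def mpc_velocity(pose, p_ref, omega_ref, v_prev):
    """Return the first element of the optimal forward-velocity profile."""
    v0 = np.full(H, v_prev)
    res = minimize(cost, v0, args=(pose, p_ref, omega_ref, v_prev),
                   bounds=[(0.0, v_max)] * H, method='SLSQP')
    return res.x[0]
```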
Remark 1.
Within the proposed hybrid architecture, MPC provides the predictive layer that forecasts the robot’s future states and calculates an optimal forward velocity profile over the prediction horizon. Its nominated parameters—horizon length, step size, and weighting structure—act as settings that balance long-term feasibility with short-term responsiveness, creating the foundation upon which backstepping dynamics are later stabilized.

4.2. Backstepping Controller

Backstepping control is employed in this work to facilitate accurate trajectory tracking and enhance control mapping of virtual and physical instances [4].
The Backstepping controller computes the control inputs (u_x, u_y, u_θ) from the tracking errors (e_x, e_y, e_θ) expressed in the robot's local frame:
\begin{bmatrix} e_x \\ e_y \end{bmatrix} = R(\theta)^{\top} \begin{bmatrix} x_r - x \\ y_r - y \end{bmatrix}, \qquad e_\theta = \theta_r - \theta
where e_x is the longitudinal position tracking error in the robot frame, e_y is the lateral position tracking error in the robot frame, e_θ is the heading (orientation) tracking error, (x_r, y_r) are the reference position coordinates (global frame), (x, y) are the actual robot position coordinates (global frame), θ_r is the reference robot heading (global frame), θ is the actual robot heading (global frame), and R(θ) is the 2 × 2 rotation matrix of the robot heading, R(\theta) = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}, whose transpose R(\theta)^{\top} transforms global-frame quantities into the robot frame.
The control laws for the CLMR using the backstepping controller, as derived in [4,5], are:
u_x(t) = v_{xr}(t)\cos e_\theta(t) - v_{yr}(t)\sin e_\theta(t) + k_1 \arctan\bigl(k_2 e_x(t)\bigr)
u_y(t) = v_{xr}(t)\sin e_\theta(t) + v_{yr}(t)\cos e_\theta(t) + k_3 \arctan\bigl(k_4 e_y(t)\bigr)
u_\theta(t) = \omega_r(t) + k_5 \sin e_\theta(t)
where u_x(t) is the control input for longitudinal motion in the robot frame, u_y(t) is the control input for lateral motion in the robot frame, u_θ(t) is the control input for angular motion (yaw rate), v_{xr}(t) is the reference longitudinal velocity in the robot frame, v_{yr}(t) is the reference lateral velocity in the robot frame, ω_r(t) is the reference angular velocity (yaw rate), e_x(t) is the longitudinal position tracking error in the robot frame, e_y(t) is the lateral position tracking error in the robot frame, e_θ(t) is the heading error in the robot frame, and k_1, k_2, k_3, k_4, k_5 are the control gains (tuning parameters) of the backstepping controller.
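A compact sketch of the error transform and the backstepping laws above is shown next; the gains are placeholders rather than the tuned values used in the study.

```python
# Sketch of the error transform and backstepping laws given above; gains are
# placeholder values, not those tuned for the testbed.
import numpy as np

k1, k2, k3, k4, k5 = 1.0, 1.0, 1.0, 1.0, 1.5   # assumed gains

def tracking_errors(pose, ref_pose):
    """Map the global-frame pose error into the robot frame."""
    x, y, theta = pose
    xr, yr, thetar = ref_pose
    c, s = np.cos(theta), np.sin(theta)
    ex = c * (xr - x) + s * (yr - y)
    ey = -s * (xr - x) + c * (yr - y)
    etheta = np.arctan2(np.sin(thetar - theta), np.cos(thetar - theta))  # wrap to [-pi, pi]
    return ex, ey, etheta

def backstepping(errors, v_xr, v_yr, omega_r):
    """Control laws u_x, u_y, u_theta from robot-frame tracking errors."""
    ex, ey, etheta = errors
    ux = v_xr * np.cos(etheta) - v_yr * np.sin(etheta) + k1 * np.arctan(k2 * ex)
    uy = v_xr * np.sin(etheta) + v_yr * np.cos(etheta) + k3 * np.arctan(k4 * ey)
    utheta = omega_r + k5 * np.sin(etheta)
    return ux, uy, utheta
```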
Remark 2.
In our hybrid design, the backstepping mechanism serves as the stabilizing layer that computes corrective angular and lateral control laws based on pose tracking errors. Its abstraction lies in recursive gain-based shaping of these error dynamics, where the tuning parameters ensure that the forward velocity profile generated by MPC is realized through stable tracking of orientation and lateral position, thereby integrating prediction with robust nonlinear regulation.

5. Imitation Learning

In recent years, RL, a paradigm within machine learning (ML), has been recognized as an efficient approach to acquire a policy that replicates the behavior of an expert. However, the drawback of RL is that it requires extensive exploration of the state–action space, which becomes computationally prohibitive as the dimensionality of the environment increases. The complexity of mapping all possible action–state pairs grows exponentially, making naive exploration infeasible for real-world robotic applications. IL has emerged as a solution designed to mitigate this challenge. Instead of relying solely on exploration, IL provides the agent with expert demonstrations of high-reward behaviors, thereby constraining the search space and accelerating convergence [19,20].
Formally, RL is framed as a Markov Decision Process (MDP), in which the objective is to derive an optimal policy π* that maximizes the expected return over a trajectory [21]:
\pi^{*} = \arg\max_{\pi} J(\pi), \qquad J(\pi) = \mathbb{E}\bigl[ R_t \bigr]
where R_t denotes the cumulative reward at time t and E[R_t] its expected value. In autonomous driving and robotic control, however, purely reward-driven RL can be unsafe and sample-inefficient, particularly when bridging the simulation-to-reality (Sim2Real) gap.
To address this challenge and complementary to RL, IL provides a cost-based formulation within the MDP setting, where the objective is to learn a policy that replicates an expert's behavior [11]. The IL objective is:
\min_{\pi} J(\pi), \qquad J(\pi) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{T-1} c(s_t, a_t)\right]
where S is the state space, A is the action space, O is the observation space, s_t ∈ S is the system state at time t, o_t ∈ O is the observation (on-board measurements such as images or wheel speeds), a_t ∈ A is the control action (e.g., steering, throttle), π : O → A is a stationary reactive policy (e.g., a neural network controller), c(s_t, a_t) is the instantaneous cost (e.g., penalizing deviation, slip, or unsafe control), and J(π) is the accumulated cost over the horizon T.
By leveraging expert demonstrations from π_H, IL constrains the learner's policy to satisfy [11]:
J(\pi) \le J(\pi_H) + O(T\varepsilon)
where ε is the expected per-step imitation loss (measured as the Mean Squared Error, MSE), T is the time horizon, and J(π) and J(π_H) are the cumulative costs of the learned and expert policies, respectively.
The supervised imitation objective trains the parameterized policy π_θ to match the expert by minimizing the following loss function. For each expert datapoint (s, a_H), the loss measures the squared Euclidean distance between the policy's predicted action π_θ(s) and the expert action a_H; the outer expectation averages this error over the expert's data distribution D. Minimizing L(θ) yields parameters θ that make the learned policy's actions close to the expert's across the states the expert visits, providing a principled surrogate for achieving low imitation error over trajectories [11,19].
L(\theta) = \mathbb{E}_{(s, a_H) \sim D}\bigl[\lVert \pi_\theta(s) - a_H \rVert^2\bigr]
where L(θ) is the loss function, s is the input state (e.g., pose tracking errors), a_H is the expert control action (e.g., control signals from Backstepping + MPC), π_θ(s) is the neural policy output, and E_{(s, a_H) ~ D} denotes an expectation (average) taken over state–action pairs (s, a_H) drawn from the demonstration distribution D, i.e., the dataset collected from the expert policy π_H.
Based on IL for autonomous control of the CLMR, the control problem is reformulated as a policy learning problem. A neural network is trained in a supervised manner to imitate an expert hybrid controller (Backstepping + MPC), using expert-generated control commands as ground-truth labels and minimizing the discrepancy between its predictions and the expert's actions. A PyTorch (v1.13.1+cpu)-based neural policy network was developed to replace the Backstepping + MPC expert controller in Webots, as demonstrated in Figure 2. The network learns the mapping from tracking errors to control signals, i.e., (e_x, e_y, e_θ) → (u_x, u_y, u_θ).
Remark 3.
Within a DT-enabled workflow, a high-fidelity virtual replica of the physical instance enables validation and optimization, while the virtual instance's data processing layer supports robust Sim2Real deployment. Our implementation uses the batch IL setting, since training data is collected from the Backstepping–MPC expert controller in Webots and the neural policy is trained offline using PyTorch. While this reduces the supervision effort, it can be more sensitive to covariate shift if the learned policy deviates from the expert trajectory distribution.

5.1. Digital Twin-Based Learning Architecture

The proposed digital twin-based learning architecture presented in Figure 2 utilizes a DT framework for structured policy learning and safe Sim2Real transfer. It consists of two principal virtual modules: the Expert Model and the Learner Model. The Expert Model is implemented in the Webots simulation environment and mirrors the physical robot by modeling high-fidelity sensing and actuation characteristics. This includes real-time LiDAR perception, encoder-based wheel odometry, and precise ground-truth pose tracking. An MPC-Backstepping controller embedded within the expert model operates as the reference policy π H , producing control signals in response to tracking errors.
The expert generates control actions from the local-frame tracking error vector, in which each component quantifies deviation from the reference trajectory in the robot's frame of motion, using the MPC scheme and the backstepping controller mechanism with a set of tuned parameters. These signals, combined with the observed states, are logged as training data.
The learner model is trained offline using this dataset to imitate the expert policy. A deep neural network approximates the mapping from the local-frame tracking error vector to the control signals, with training performed in PyTorch using normalized features and supervised labels. The approach follows the surrogate analysis of imitation learning, offering theoretical guarantees that the learned policy's cumulative cost will remain within O(Tε) of the expert's performance. Once deployed in simulation mode, the learner model provides control commands based solely on error-frame inputs. This modular setup facilitates robust development and evaluation in simulation prior to real-world testing, while preserving sensor and feedback consistency with the Webots and physical environments.

5.2. Algorithmic Expert: Backstepping-MPC Policy

Imitation learning (IL) formulates policy acquisition as a supervised learning problem in which a model is trained on expert-provided state–action pairs to reproduce the expert's behavior. To ensure robust supervision, a hybrid algorithmic expert combining Model Predictive Control (MPC) and Backstepping control was developed for the wheeled mobile robot. In this structure, the MPC module computes a reference trajectory by forward projecting along a predefined sinusoidal path, accounting for curvature and desired heading changes, while the Backstepping controller stabilizes the system based on tracking errors expressed in the robot's body frame. This hybrid design enables the expert to generate control commands (u_x, u_y, u_θ) that accurately track the desired trajectory and compensate for orientation and position deviations. The resulting commands are transformed into individual wheel speeds using an analytically derived Jacobian model of the robot's kinematics. Leveraging full state information from the Webots Supervisor and known system dynamics, the expert achieves reliable control performance in trajectory tracking. During data collection, the expert actions and corresponding tracking errors (e_x, e_y, e_θ) are logged to form the training dataset. The expert policy, implemented as the hybrid MPC–Backstepping controller, generates control actions that minimize the cumulative cost of π_H as defined in Equation (6). All parameters and control data relevant to π_H are recorded through the digital twin's data processing module, providing the state–action pairs used to train the neural network policy within the DT-based framework.
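The data-collection step can be sketched as a small logger that appends one (e_x, e_y, e_θ, u_x, u_y, u_θ) row per control step; the CSV layout and file name are assumptions for illustration.

```python
# Sketch of expert demonstration logging within the DT data-processing module.
# The CSV layout (one row per control step) is an assumption for illustration.
import csv

class DemoLogger:
    def __init__(self, path='expert_demos.csv'):
        self.file = open(path, 'w', newline='')
        self.writer = csv.writer(self.file)
        self.writer.writerow(['t', 'e_x', 'e_y', 'e_theta', 'u_x', 'u_y', 'u_theta'])

    def log(self, t, errors, actions):
        # errors: (e_x, e_y, e_theta); actions: (u_x, u_y, u_theta)
        self.writer.writerow([t, *errors, *actions])

    def close(self):
        self.file.close()

# Usage inside the Webots control loop (expert_rollout() is a placeholder for
# the hybrid MPC-Backstepping computation at each step):
# logger = DemoLogger()
# for t, (errors, actions) in enumerate(expert_rollout()):
#     logger.log(t, errors, actions)
# logger.close()
```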

5.3. Learning a Neural Control Policy

To approximate the expert's control strategy with a lightweight, reactive policy suitable for real-time deployment, we trained a feedforward neural network that maps trajectory tracking errors (e_x, e_y, e_θ) to control commands (u_x, u_y, u_θ). The learned policy, implemented as a deep fully connected network, takes a 3-dimensional error vector as input and outputs three continuous control commands used to drive the robot. The neural network architecture, as presented in Figure 4, consists of hidden layers with widths [128, 64, 32] and ReLU activations. Dropout regularization (rate 0.1) is applied to prevent overfitting. The final output layer produces linear activations corresponding to v_x, v_y, and ω. The network was trained offline using the expert-generated dataset and optimized via stochastic gradient descent using the ADAM optimizer. The trained model was later deployed in the loop, replacing the expert, and its outputs were fed directly into the kinematic inverse model to drive the wheels.
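A minimal PyTorch sketch of this policy network is given below; the layer widths, ReLU activations, dropout rate, and the three-in/three-out interface follow the description above, while the remaining implementation details are assumptions.

```python
# Sketch of the feedforward policy described above (PyTorch); layer sizes and
# dropout rate follow the text, everything else is an implementation assumption.
import torch
import torch.nn as nn

class NeuralPolicy(nn.Module):
    def __init__(self, in_dim=3, out_dim=3, widths=(128, 64, 32), p_drop=0.1):
        super().__init__()
        layers, prev = [], in_dim
        for w in widths:
            layers += [nn.Linear(prev, w), nn.ReLU(), nn.Dropout(p_drop)]
            prev = w
        layers.append(nn.Linear(prev, out_dim))   # linear output: three control commands
        self.net = nn.Sequential(*layers)

    def forward(self, err):
        # err: (batch, 3) tensor of [e_x, e_y, e_theta]
        return self.net(err)

policy = NeuralPolicy()
u = policy(torch.zeros(1, 3))   # a single forward pass replaces online optimization
```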
This Deep Neural Network (DNN) policy enables real-time inference without reliance on costly trajectory re-planning or external state estimation. Empirically, the network learns to generalize across a wide range of error conditions [14], allowing successful deployment in Webots without online access to the MPC or backstepping modules.

Policy Evaluation and Deployment

To train a neural network policy that mimics expert control behavior, a weighted l_1 loss function is employed. This loss compares the control commands predicted by the network with those generated by the expert controller (the hybrid MPC–Backstepping controller), based on the same state input. Specifically, the state is represented by the local pose tracking error vector (e_x, e_y, e_θ), and the target action is the control vector (u_x, u_y, u_θ) from the expert. The neural policy output π_θ(s) aims to reproduce these expert actions with high fidelity across all three control dimensions.
Learning is posed as supervised imitation with a weighted l_1 objective to mitigate outliers and emphasize accuracy where it matters:
L(\theta) = \frac{1}{N} \sum_{i} \bigl\lVert w \odot \bigl( \pi_\theta(s_i) - a_i^{H} \bigr) \bigr\rVert_1, \qquad w = (w_x, w_y, w_\theta)
where θ denotes the trainable parameters of the neural network policy π_θ, N is the total number of training samples (time steps), s_i = (e_x, e_y, e_θ)^T is the input state vector at time step i, consisting of the tracking errors in the robot frame, π_θ(s_i) = (û_x, û_y, û_θ)^T is the control command predicted by the neural network for state s_i, a_i^H = (u_x, u_y, u_θ)^T is the control command generated by the expert controller (MPC + Backstepping) at time step i, ‖·‖_1 denotes the l_1 norm, which computes the sum of absolute differences, w = (w_x, w_y, w_θ) is a weight vector applied to emphasize certain control channels (tuned as required to prioritize lateral and angular control), and ⊙ denotes the Hadamard (element-wise) product between vectors.
The chosen loss function is defined as a weighted l_1 norm, which penalizes the absolute difference between the predicted and expert control commands. Importantly, a weight vector w = [1, 3, 5] is applied to the loss terms to place higher emphasis on lateral and angular control accuracy (u_y and u_θ, respectively). These components are generally more sensitive to noise and have a stronger impact on trajectory stability, especially when dealing with curved paths or rapid heading adjustments.
The optimization of this loss is carried out using the ADAM optimizer, a variant of stochastic gradient descent that adaptively scales the learning rate based on the first and second moments of the gradients. ADAM is well-suited for training in dynamic and noisy environments due to its fast convergence and robustness to poorly scaled gradients. During training, the network is updated in minibatches using a fixed learning rate until convergence, and the final set of parameters is saved for deployment. This setup enables reliable training of a compact neural controller capable of real-time deployment with minimal computational overhead [11].
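The weighted l_1 objective and the ADAM-based training loop can be sketched as follows; the learning rate follows the value reported later in the text, while the batch size and data-loading details are assumptions.

```python
# Sketch of the weighted l1 imitation objective and Adam training loop; batch
# size and normalization details are assumptions.
import torch
from torch.utils.data import DataLoader, TensorDataset

w = torch.tensor([1.0, 3.0, 5.0])          # channel weights [w_x, w_y, w_theta]

def weighted_l1(pred, target):
    # sum of weighted absolute errors per sample, averaged over the batch
    return (w * (pred - target).abs()).sum(dim=1).mean()

def train(policy, errors, actions, epochs=100, lr=1e-3, batch=64):
    # errors: (N, 3) tensor of tracking errors; actions: (N, 3) expert commands
    data = DataLoader(TensorDataset(errors, actions), batch_size=batch, shuffle=True)
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    for epoch in range(epochs):
        for e_batch, a_batch in data:
            loss = weighted_l1(policy(e_batch), a_batch)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return policy
```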
Training curves (loss vs. epoch) are logged for validation of the simulation results, and a held-out test split reports the same weighted l_1 metric to estimate the imitation gap before on-robot trials [22]. At runtime, the scaler must be applied to incoming errors, and the policy's outputs are fed to the wheel-kinematics layer that converts (u_x, u_y, u_θ) to wheel speeds, replacing online optimization with a single forward pass. This matches our aim to utilize a compact, reactive neural policy that imitates an algorithmic expert while remaining computationally lightweight for real-time control.

5.4. Track Generalization

To address the track generalization capability of the neural network controller, this work investigates whether a deep neural network (DNN) policy, trained via supervised IL on a single expert-guided trajectory, can maintain stable tracking performance when evaluated on structurally distinct but related paths. The goal is to assess whether the learned controller encodes transferable control behavior or merely overfits to the training scenario. The neural policy is trained using local-frame pose errors as inputs and control signals from a hybrid expert as targets, with a weighted loss function that emphasizes angular and lateral accuracy. Deployment on previously unseen trajectories of increasing frequency provides insight into the limits and capabilities of the policy’s generalization.
Across the reviewed literature, generalization performance is typically evaluated in terms of MSE or trajectory deviation norms, both of which tend to increase as the testing trajectory diverges from the training distribution. Empirical results are compared against prior literature on trajectory generalization. These studies demonstrate that DNN controllers, even when trained on selected path subsets, can retain effective performance on unseen and similar trajectories, with typical error growth bounded to 10–25% depending on trajectory complexity [23,24]. Our experiments mirror this behavior: across test paths with increasing curvature, the policy maintains low mean squared errors (MSE) in lateral and angular deviation, with degradation in accuracy aligned with expected bounds. These findings confirm that even minimal training on a representative trajectory can yield neural policies that generalize reliably within a task distribution, offering a lightweight alternative to online optimization for mobile robot control.

6. Zero-Shot Sim2Real Transfer Using Digital Twins

This section presents the zero-shot Sim2Real deployment pipeline, which enables the direct transfer of a neural network controller trained in Webots simulation to a real four-wheeled CLMR without additional real-world adaptation. The procedure begins with the construction of a high-fidelity digital twin that replicates the robot’s geometry, sensors, and actuation dynamics. The virtual environment mirrors the sensor fusion stack of the physical platform, including wheel encoders, inertial sensors, and a Webots supervisor for pose tracking.
In the first phase, the expert controller generates optimal control commands (u_x, u_y, u_θ) for a set of reference trajectories under varying parameters. Each trajectory is logged as a sequence of input-output pairs consisting of pose tracking errors (e_x, e_y, e_θ) and corresponding control actions. These tuples form the training dataset used to train a feedforward neural policy π_θ via supervised learning with a weighted l_1 loss. To enhance generalizability, dropout regularization and observation noise are injected during training.
Once training converges, the policy is validated in simulation on previously unseen trajectories. The digital twin ensures that both simulation and deployment share consistent timing, interface protocols (via ROS), and actuation bandwidth. Consequently, the neural policy can be deployed with minimum retraining or online fine-tuning, representing a zero-shot transfer scenario.
The deployment phase proceeds by launching the trained policy directly on the physical Wheeltec robot, where pose tracking errors are measured in real time and passed as inputs to the neural controller. The predicted control outputs are then mapped to wheel velocities using a Jacobian-based inverse kinematics model. Performance metrics, including average tracking error and policy stability, are recorded for comparison with simulation benchmarks.
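The wheel-speed mapping can be sketched as below, assuming a mecanum-style four-wheel layout consistent with the lateral velocity command; the wheel radius, half wheelbase, and half track width are placeholder values, and the robot's actual Jacobian may differ.

```python
# Sketch of the Jacobian-based mapping from body-frame commands to wheel speeds,
# assuming a mecanum-wheel layout (r, lx, ly are placeholder values; the
# testbed's actual geometry and Jacobian may differ).
import numpy as np

r, lx, ly = 0.03, 0.08, 0.085   # wheel radius and half wheelbase/track [m]

# Rows: front-left, front-right, rear-left, rear-right wheel angular speeds.
J_inv = (1.0 / r) * np.array([
    [1.0, -1.0, -(lx + ly)],
    [1.0,  1.0,  (lx + ly)],
    [1.0,  1.0, -(lx + ly)],
    [1.0, -1.0,  (lx + ly)],
])

def wheel_speeds(u):
    """Map (u_x, u_y, u_theta) to the four wheel angular velocities [rad/s]."""
    return J_inv @ np.asarray(u)
```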
Remark 4.
This pipeline aims to demonstrate that utilizing digital twin components can bridge the Sim2Real gap in imitation learning and minimize the requirement for policy adaptation in real-world deployment, provided that dynamics, control latency, and sensor feedback are accurately replicated.

7. Results

This section presents empirical findings from simulation and real-world experiments, validating the effectiveness of using DT technology to train a neural policy via an expert controller.

7.1. Supervised Learning of Neural Control Policy

Training of the learner was conducted over 100 epochs using the ADAM optimizer with a learning rate of 0.001. An epoch refers to one complete pass through the entire training dataset. The optimizer updates the neural network parameters iteratively using mini-batches sampled from the training set. As training progresses, the network becomes increasingly proficient at replicating expert control actions.
The effectiveness of the learned neural controller is supported by the progressive reduction in the weighted Mean Absolute Error (MAE) loss throughout training, as described in Equation (9). The network successfully approximated the expert control policy using a deep feedforward architecture trained with imitation learning, achieving a final loss of 0.8224 after 100 epochs. This consistent decrease in loss, visualized in Figure 5 and summarized in Table 1, confirms that the neural policy has learned to replicate expert actions with high fidelity, justifying its deployment as a computationally efficient replacement for the hybrid MPC–Backstepping controller.
The evolution of the training loss over time is shown in Figure 5. The weighted MAE loss decreased consistently from an initial value of 1.80 to a final value of 0.82. The loss at selected intervals is summarized in Table 1 below:
In imitation learning for low-dimensional control tasks, such as mapping pose tracking errors to velocity commands, the convergence of the learned policy is commonly evaluated using per-step loss metrics such as MAE described in Equation (9). A final loss value below 1.0 is widely regarded as a practical indicator of effective policy learning in these settings, particularly when the state and action spaces are compact and continuous. This threshold is not absolute but is supported by empirical results in prior work, where successful training runs typically show the loss steadily decreasing and plateauing below this value [11,19,21]. In this study, the neural policy achieved a final weighted MAE of 0.82 after 100 epochs of training, suggesting that the model has learned to approximate the expert policy with acceptable accuracy across the training distribution.

7.2. Imitation Learning and Cost Evaluation

To assess whether the learned policy is admissible, we compare its control outputs to those of the expert using a cumulative imitation cost. A policy is considered admissible if its deviation from expert actions remains bounded over the trajectory horizon, indicating reliable imitation and stable performance for deployment.
As explained in Sections 5.2 and 5.3, the performance of a learned policy π is formalized in terms of its expected deviation from the expert policy π_H.
Based on the recorded control signals from simulation for both the expert and learner policies, we obtained the following values: J(π_H) = 1.1701 and J(π) = 0.8793. These values confirm that the imitation loss remains bounded and within a close margin of expert-level performance, validating the neural controller's effectiveness in replicating expert behavior and supporting its deployment in real-world robotic systems.
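For reference, the cumulative costs can be computed from the logged rollouts along the lines of the sketch below; the per-step cost used here (a weighted absolute pose-error penalty) is an assumed instantiation of c(s_t, a_t) for illustration, not necessarily the exact cost used to obtain the reported values.

```python
# Sketch of how cumulative costs J(pi_H) and J(pi) could be computed from
# logged rollouts; the per-step cost is an assumed instantiation of c(s_t, a_t).
import numpy as np

def cumulative_cost(errors, weights=(1.0, 3.0, 5.0)):
    """errors: (T, 3) array of [e_x, e_y, e_theta] logged over one rollout."""
    w = np.asarray(weights)
    per_step = np.abs(errors) @ w      # assumed c(s_t, a_t) at each step
    return float(per_step.mean())      # normalized over the horizon

# expert_errors, learner_errors = rollouts loaded from the DT data layer
# J_expert  = cumulative_cost(expert_errors)
# J_learner = cumulative_cost(learner_errors)
```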
As shown in Figure 6, both the expert controller and the neural network policy closely follow the sinusoidal reference trajectory. However, the expert trajectory (blue line) tracks with higher fidelity and lower variability compared to the neural controller (orange), which slightly lags and deviates in high-curvature segments. To quantitatively compare tracking precision, we compute the MSE for lateral and heading deviations over a 30 s horizon, summarized in Table 2. The expert controller achieves MSE(e_y) = 0.0049 and MSE(e_θ) = 0.0742, confirming near-perfect path adherence. In contrast, the learner policy yields MSE(e_y) = 0.1349 and MSE(e_θ) = 0.0768, which, although higher, remain within a bounded range and support the learner's admissibility for deployment.

7.3. Generalization to Out-of-Distribution Reference Paths

To evaluate the robustness and generalization capability of the learned neural control policy, we conducted simulation experiments on a series of reference trajectories that were not present during training. These paths varied in spatial frequency, representing different movement profiles than the originally trained path.
These trajectories were generated procedurally in Webots and compared against the learner’s actual behavior over a 30 s tracking window.
To quantify the performance of the neural controller, as shown in Figure 7, we computed the MSE for both the lateral error e_y and the heading error e_θ. Table 3 summarizes the results:
These results show a bounded and gradual increase in error as the reference curvature becomes more complex, which is consistent with prior findings in the imitation learning literature. In this case, the lateral MSE remains at or below approximately 0.30 and the heading MSE below 0.16 even on the most complex path tested, suggesting that the policy has learned a transferable control strategy.
As per the demonstrated results, our learner policy tracks dynamic out-of-distribution references robustly without retraining, reinforcing its practical viability for embedded real-world deployment.

7.4. Zero-Shot Deployment on the Real Robot

To assess Sim2Real transferability, the controllers tested in the virtual instance were deployed to the physical Wheeltec CLMR platform using the ROS-bridged digital twin architecture, without any retraining or adaptation. The MPC–Backstepping controller is deployed to the physical instance using a ROS Noetic Catkin workspace. The neural network controller is also deployed to the Wheeltec robot, which receives pose tracking errors in real time, computes the appropriate control signals (u_x, u_y, u_θ), and issues them to the wheel actuators. The same test trajectories were replicated in the physical environment for both the expert (MPC–Backstepping controller) and the learner (neural network controller).
To evaluate the real-world performance of the learned neural controller, shown in Figure 8, we computed the MSE in both lateral deviation e_y and heading error e_θ across sinusoidal reference paths of increasing frequency. The results, summarized in Table 4, show that the controller maintains acceptable tracking accuracy within expected generalization bounds. For lower-frequency paths (Paths 1 and 2), the learner performs well within benchmarked limits reported in recent IL-based control literature. Even under moderate distributional shift (Paths 3 and 4), the error remains bounded and does not exhibit significant degradation, indicating effective knowledge transfer from simulation to real-world deployment.

8. Discussion

This study demonstrates that a lightweight neural control policy trained via supervised imitation learning can successfully replicate the behavior of an expert Backstepping–MPC controller and maintain trajectory tracking performance in both simulated and real-world scenarios. The consistent tracking performance across multiple reference paths—both within and outside the training distribution—indicates that the learned policy exhibits a high degree of generalizability under varied trajectory profiles. However, the performance is naturally more stable in simulation due to idealized conditions, while real-world results reveal modest increases in tracking errors, particularly in the lateral and angular domains.
To further improve robustness and address potential compounding errors from covariate shift, future iterations could incorporate DAgger-style data aggregation [11]. Integrating DAgger-style updates into the DT-based workflow would enable iterative refinement of the neural policy. During learner rollouts in the virtual or physical twin, corrective expert actions could be recorded and appended to the training dataset, allowing progressive retraining within the same framework.
Despite promising results, some limitations remain. While the reported MSE values demonstrate consistent trajectory-tracking behavior, future experiments will include repeated trials and the incorporation of statistical measures such as standard deviation and confidence intervals to provide a more rigorous quantitative assessment of model reliability. The current imitation learning framework does not explicitly account for environmental or physical effects such as friction, wheel slip, or surface irregularities, which can introduce disturbances to control performance. Moreover, the model assumes accurate state estimation from sensors, which may not always hold under real-world constraints such as sensor drift or latency. These factors present opportunities for future work that couples learned policies with adaptive or disturbance-aware modules.

9. Conclusions

This paper has presented a Sim2Real imitation learning framework that uses expert-guided control data to train a lightweight neural policy for trajectory tracking in car-like mobile robots (CLMRs). The approach integrates a hybrid MPC–Backstepping controller with a neural network trained on pose tracking errors, implemented within a digital twin architecture using ROS and Webots. The developed system enables supervised learning from simulated trajectories and zero-shot deployment to real-world scenarios, replacing costly online optimization with single-pass neural inference.
Quantitative evaluation demonstrates that the learned neural controller maintains bounded trajectory-tracking errors within the acceptable limits defined by the expert policy, achieving MSE(e_y) = 0.1349 and MSE(e_θ) = 0.0768 in simulation compared with the expert's MSE(e_y) = 0.0049 and MSE(e_θ) = 0.0742. Across unseen trajectories and physical deployment, the neural controller preserved consistent behavior without retraining, validating its generalization capability and practical viability for embedded real-time control.
The findings confirm that the primary goal of the imitation learning framework—to achieve bounded error within acceptable policy definition rather than outperform the expert—has been successfully met. Future work will explore adaptive data aggregation and online fine-tuning to further improve robustness under variable environmental conditions such as friction, wheel slip, and sensor drift.

Author Contributions

Conceptualization, N.M., H.W. and A.Y.; methodology, N.M.; software, N.M.; validation, N.M. and H.W.; formal analysis, N.M.; investigation, N.M.; resources, N.M.; data curation, N.M.; writing—original draft preparation, N.M.; writing—review and editing, A.Y. and H.W.; visualization, N.M.; supervision, H.W.; project administration, H.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhang, H.; Caldararu, S.; Young, A.; Ruiz, A.; Unjhawala, H.; Mahajan, I.; Ashokkumar, S.; Batagoda, N.; Zhou, Z.; Bakke, L.; et al. A Study on the Use of Simulation in Synthesizing Path-Following Control Policies for Autonomous Ground Robots. arXiv 2024, arXiv:2403.18021. [Google Scholar] [CrossRef]
  2. Wang, Z.; Li, P.; Li, Q.; Wang, Z.; Li, Z. Motion Planning Method for Car-Like Autonomous Mobile Robots in Dynamic Obstacle Environments. IEEE Access 2023, 11, 137387–137400. [Google Scholar] [CrossRef]
  3. Li, J.; Liu, M.; Wang, W.; Hu, C. Inspection Robot Based on Offline Digital Twin Synchronization Architecture. IEEE J. Radio Freq. Identif. 2022, 6, 943–947. [Google Scholar] [CrossRef]
  4. Zhao, L.; Nie, Z.; Xia, Y.; Li, H. Virtual-Physical Tracking Control for a Car-Like Mobile Robot Based on Digital Twin Technology. IEEE Trans. Ind. Electron. 2024, 71, 16348–16356. [Google Scholar] [CrossRef]
  5. Yang, H.; Cheng, F.; Li, H.; Zuo, Z. Design and Control of Digital Twin Systems based on a Unit Level Wheeled Mobile Robot. IEEE Trans. Veh. Technol. 2024, 73, 323–332. [Google Scholar] [CrossRef]
  6. Wang, Z.; OuYang, Y.; Kochan, O. Bidirectional Linkage Robot Digital Twin System Based on ROS. In Proceedings of the 2023 17th International Conference on the Experience of Designing and Application of CAD Systems (CADSM), Jaroslaw, Poland, 22–25 February 2023; IEEE: New York, NY, USA. [Google Scholar]
  7. Singh, M.; Kapukotuwa, J.; Gouveia, E.L.S.; Fuenmayor, E.; Qiao, Y.; Murry, N.; Devine, D. Unity and ROS as a Digital and Communication Layer for Digital Twin Application: Case Study of Robotic Arm in a Smart Manufacturing Cell. Sensors 2024, 24, 5680. [Google Scholar] [CrossRef] [PubMed]
  8. Negri, E.; Fumagalli, L.; Macchi, M. A Review of the Roles of Digital Twin in CPS-Based Production Systems; Procedia Manufacturing: Amsterdam, The Netherlands, 2017; Volume 11, pp. 939–948. [Google Scholar]
  9. Samak, T.; Samak, C.; Kandhasamy, S.; Krovi, V.; Xie, M. AutoDRIVE: A Comprehensive, Flexible and Integrated Digital Twin Ecosystem for Autonomous Driving Research & Education. Robotics 2023, 12, 77. [Google Scholar] [CrossRef]
  10. Han, T.; Shah, P.; Rajagopal, S.; Bao, Y.; Jung, S.; Talia, S.; Guo, G.; Xu, B.; Mehta, B.; Romig, E.; et al. Demonstrating Wheeled Lab: Modern Sim2Real for Low-cost, Open-source Wheeled Robotics. arXiv 2025, arXiv:2502.07380. [Google Scholar]
  11. Pan, Y.; Cheng, C.-A.; Saigol, K.; Lee, K.; Yan, X.; A Theodorou, E.; Boots, B. Imitation learning for agile autonomous driving. Int. J. Robot. Res. 2019, 39, 286–302. [Google Scholar] [CrossRef]
  12. Lin, W.-S.; Yang, P.-C. Adaptive critic motion control design of autonomous wheeled mobile robot by dual heuristic programming. Automatica 2008, 44, 2716–2723. [Google Scholar] [CrossRef]
  13. Farid, Y. Simultaneous locomotion and manipulation control of quadruped robots using reinforcement learning-based adaptive fractional-order sliding-mode control. Trans. Inst. Meas. Control 2023, 45, 2459–2476. [Google Scholar] [CrossRef]
  14. Ghignone, E.; Baumann, N.; Magno, M. TC-Driver: A Trajectory-Conditioned Reinforcement Learning Approach to Zero-Shot Autonomous Racing. IEEE Trans. Field Robot. 2024, 1, 527–536. [Google Scholar] [CrossRef]
  15. Vrabič, R.; Škulj, G.; Malus, A.; Kozjek, D.; Selak, L.; Bračun, D.; Podržaj, P. An architecture for sim-to-real and real-to-sim experimentation in robotic systems. Procedia CIRP 2021, 104, 336–341. [Google Scholar] [CrossRef]
  16. Samak, C.; Samak, T.; Krovi, V. Towards Sim2Real Transfer of Autonomy Algorithms using AutoDRIVE Ecosystem. IFAC-PapersOnLine 2023, 56, 277–282. [Google Scholar] [CrossRef]
  17. Liu, D.; Chen, Y.; Wu, Z. Digital Twin (DT)-CycleGAN: Enabling Zero-Shot Sim-to-Real Transfer of Visual Grasping Models. IEEE Robot. Autom. Lett. 2023, 8, 2421–2428. [Google Scholar] [CrossRef]
  18. Voogd, K.L.; Allamaa, J.P.; Alonso-Mora, J.; Son, T.D. Reinforcement Learning from Simulation to Real World Autonomous Driving using Digital Twin. IFAC-PapersOnLine 2023, 56, 1510–1515. [Google Scholar] [CrossRef]
  19. Han, D.; Mulyana, B.; Stankovic, V.; Cheng, S. A Survey on Deep Reinforcement Learning Algorithms for Robotic Manipulation. Sensors 2023, 23, 3762. [Google Scholar] [CrossRef]
  20. Delgado, J.M.D.; Oyedele, L. Robotics in construction: A critical review of the reinforcement learning and imitation learning paradigms. Adv. Eng. Inform. 2022, 54, 101787. [Google Scholar] [CrossRef]
  21. Sun, J.; Yang, H. Learning two-dimensional merging behaviour from vehicle trajectories with imitation learning. Transp. Res. Part C Emerg. Technol. 2024, 160, 104530. [Google Scholar] [CrossRef]
  22. Ren, A.Z.; Veer, S.; Majumdar, A. Generalization Guarantees for Imitation Learning. arXiv 2020, arXiv:2008.01913. [Google Scholar] [CrossRef]
  23. Li, Q.; Qian, J.; Zhu, Z.; Bao, X.; Helwa, M.K.; Schoellig, A.P. Deep Neural Networks for Improved, Impromptu Trajectory Tracking of Quadrotors. arXiv 2017, arXiv:1610.06283. [Google Scholar] [CrossRef]
  24. Chen, S.; Wen, J.T. Industrial Robot Trajectory Tracking Control Using Multi-Layer Neural Networks Trained by Iterative Learning Control. Robotics 2021, 10, 50. [Google Scholar] [CrossRef]
Figure 1. CLMR Digital Twin Framework.
Figure 2. DT architecture depicting the expert MPC–Backstepping model, data processing and neural network training, and the virtual deployment of the learned controller.
Figure 3. MPC-Backstepping Controller structure for CLMR.
Figure 4. Neural Network Architecture.
Figure 5. Training loss profile showing improvement of the neural controller performance during training.
Figure 6. Comparison of expert and learned trajectories against a reference path.
Figure 7. Trajectory Tracking via Neural Network Control Outside Training Distribution (Simulation). (a) Tracking on Path 1 with low curvature. (b) Tracking on Path 2 with moderate curvature. (c) Tracking on Path 3 with higher curvature. (d) Tracking on Path 4 with the highest curvature.
Figure 8. Comparison of trajectory tracking performance between the expert controller (MPC + Backstepping), the learned neural policy, and the reference sinusoidal trajectory across varying path frequencies.
Table 1. Weighted MAE training loss of the neural network policy across 100 epochs, demonstrating progressive convergence toward the expert controller.
Epoch     Loss
10        1.5446
20        1.2654
30        1.1533
40        1.0919
50        1.0489
60        1.0078
70        0.9642
80        0.9192
90        0.8708
100       0.8224
Table 2. MSE values of lateral (e_y) and heading (e_θ) tracking errors for the expert (MPC–Backstepping) and learned neural controllers in simulation.
Controller                      MSE(e_y)    MSE(e_θ)
Expert (MPC + Backstepping)     0.0049      0.0742
Learner (Neural Network)        0.1349      0.0768
Table 3. Generalization performance of the learned neural controller evaluated in simulation on unseen reference paths, with MSE values of (e_y) and (e_θ) demonstrating stable tracking as path curvature increases.
Path ID    Reference Path (ω)    MSE(e_y)    MSE(e_θ)
Path 1     2π × 0.1              0.1616      0.0826
Path 2     3π × 0.1              0.2228      0.1110
Path 3     4π × 0.1              0.3040      0.1581
Path 4     5π × 0.1              0.2957      0.0879
Table 4. Real-world tracking performance of the neural network controller deployed on the Wheeltec CLMR, showing MSE values of lateral (e_y) and heading (e_θ) tracking errors across multiple reference trajectories.
Path ID    Reference Path (ω)    MSE(e_y)    MSE(e_θ)
Path 1     2π × 0.1              0.231       0.115
Path 2     3π × 0.1              0.312       0.148
Path 3     4π × 0.1              0.391       0.207
Path 4     5π × 0.1              0.316       0.131
