1. Introduction
Autonomous vehicles aim to achieve full automation (Level 5), allowing occupants to engage in other activities while the vehicle drives itself. Beyond personal use, autonomous vehicles may impact the public transportation, logistics, healthcare, and military sectors, changing many aspects of daily life. Deep reinforcement learning (DRL) has proven effective in many areas, enabling intelligent agents to learn complex tasks through experience.
The successful implementation of autonomous vehicles relies on the integration of several core technologies, including accurate localization [
1], environmental perception, decision-making, and motion control. These components work in concert and are essential for ensuring the safety and stability of the overall system. Among them, the ability of a vehicle to make stable and accurate action decisions from real-time sensory input remains one of the key challenges to reliable autonomous driving. To address this challenge, this study adopts a DRL approach to train models that learn lane-keeping and navigation behaviors from visual inputs. The objective is to enable the vehicle to maintain stable driving within lane boundaries and to follow a predefined navigation path. To evaluate the practicality and transferability of the trained models, a two-stage experimental setup is designed, consisting of simulation and real-world environments. Additionally, Sim2Real techniques are incorporated to mitigate the impact of visual discrepancies between the virtual and physical domains on model performance.
This study uses images captured by an onboard camera as input and applies deep reinforcement learning algorithms, including DQN (Deep Q-Network), DDQN (Double DQN), A2C (Advantage Actor–Critic), and PPO (Proximal Policy Optimization), to enable an autonomous vehicle to learn lane-keeping capabilities through interactions with a simulated environment. In the first stage, the model focuses on learning to drive stably within lane boundaries. In the second stage, the best-performing algorithm from the lane-keeping task is selected and further combined with path planning information to train the vehicle to perform navigation tasks on a specific map. Although the model performs well in simulation, it often fails to complete navigation tasks in real-world environments due to physical differences such as lighting and color. To bridge this gap between simulation and reality, this study adopts Sim2Real techniques to improve the model’s adaptability and performance in real-world applications.
Based on the above, the main contributions of this paper are as follows:
- (1)
An end-to-end autonomous driving framework is developed using visual input and multiple deep reinforcement learning algorithms (DQN, DDQN, A2C, PPO), enabling the vehicle to perform lane-keeping and navigation tasks.
- (2)
Sim2Real transfer techniques, including domain randomization and CycleGAN-based style transfer, are adopted to reduce the simulation-to-reality gap and enhance real-world generalization.
- (3)
A physical testing environment is built using Duckiebot and a custom driving setup to validate the system’s real-world performance and practical applicability.
2. Related Work
This section introduces the implementation methods of autonomous vehicle systems. First, it elaborates on the deep reinforcement learning algorithms used for lane-keeping tasks, then explains the navigation training strategies combined with path planning, and finally introduces techniques and applications for simulation to reality (Sim2Real).
2.1. Deep Reinforcement Learning
Reinforcement learning (RL) is a type of machine learning in which an agent continuously interacts with a dynamic environment. During this process, the agent receives a representation of the environment’s state, which could be an image or a set of physical parameters. Based on the observed state, the agent makes decisions and takes corresponding actions. The environment then provides a reward, which is used to update the agent’s behavior, enabling it to learn how to perform tasks effectively. The goal of reinforcement learning is to learn an optimal policy that maximizes the expected cumulative reward. This interaction process is illustrated in
Figure 1.
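For reference, the interaction cycle described above can be sketched in a few lines of Python. The example below assumes the classic OpenAI Gym API (a 4-tuple return from step) and uses a stand-in environment and a placeholder select_action function; it only illustrates the state-action-reward loop, not the implementation used in this work.

```python
# Minimal sketch of the agent-environment interaction loop (classic Gym API assumed).
import gym

env = gym.make("CartPole-v1")       # any Gym environment serves to illustrate the loop
state = env.reset()

def select_action(state):
    # Placeholder policy: in DRL, the action would come from a neural network.
    return env.action_space.sample()

cumulative_reward = 0.0
done = False
while not done:
    action = select_action(state)                   # agent decides based on the observed state
    state, reward, done, info = env.step(action)    # environment returns next state and reward
    cumulative_reward += reward                     # the agent learns to maximize this quantity
```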
2.1.1. Deep Q-Network (DQN)
Zhang et al. proposed a Dueling Double Deep Q-Network (D3QN-DA) architecture enhanced with dual attention mechanisms for tactical decision-making in autonomous driving scenarios such as highway lane-changing and obstacle avoidance [
2]. By integrating spatial and channel attention modules, the model improves its ability to identify critical regions and features. Experimental results demonstrated significant improvements in decision safety and driving efficiency, highlighting the potential of deep reinforcement learning in complex driving environments. Wu et al. applied the Double DQN algorithm to a device-to-device (D2D) communication resource matching problem and introduced a distributed framework where each node independently performs action selection and resource allocation [
3]. This approach addresses the scalability limitations of centralized learning in high-dimensional spaces and enhances overall system throughput and stability, offering valuable insights for multi-agent decision-making systems. Additionally, Chen et al. designed an enhanced Dueling DDQN model incorporating a Convolutional Block Attention Module (CBAM) for optimizing traffic signal control [
4]. The model dynamically adjusts signal timings based on real-time traffic conditions, and experimental results showed reduced vehicle delay and wait times, demonstrating its effectiveness in sequential decision-making tasks. These studies illustrate the advantages of various DQN-based architectures across different application domains and provide valuable references for the design of visual-based reinforcement learning in this research.
2.1.2. Double DQN (DDQN)
Double Deep Q-Network (Double DQN) was proposed by van Hasselt et al. to address the overestimation bias observed in traditional DQN algorithms. The key idea is to decouple action selection and evaluation: the online network is used to select actions, while the target network is responsible for estimating Q-values. This mechanism reduces over-optimistic estimates and improves the stability and performance of the learned policy [
5]. Wang et al. (2023) further applied Double DQN to a hybrid flowshop scheduling problem with multiple job priorities, introducing a co-evolutionary learning framework to enhance resource allocation efficiency and task completion rate [
1,
6]. Their approach combines Q-learning with evolutionary strategies and demonstrates robust performance in high-dimensional scheduling scenarios, highlighting the potential of Double DQN in complex optimization tasks.
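To make the decoupling of action selection and evaluation concrete, the following is a minimal sketch of the Double DQN target computation, assuming PyTorch and pre-built online_net and target_net Q-networks. It illustrates the update target only and is not taken from the cited implementations.

```python
import torch

def double_dqn_target(online_net, target_net, rewards, next_states, dones, gamma=0.99):
    """Double DQN targets: the online network selects the next action,
    while the target network evaluates its value."""
    with torch.no_grad():
        # Action selection with the online network
        next_actions = online_net(next_states).argmax(dim=1, keepdim=True)
        # Action evaluation with the target network
        next_q = target_net(next_states).gather(1, next_actions).squeeze(1)
        # Bootstrapped target; terminal states contribute only the immediate reward
        return rewards + gamma * next_q * (1.0 - dones)
```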
2.1.3. Dueling DQN
Dueling DQN is an improved deep reinforcement learning architecture proposed by Wang et al. in 2016 [
7]. It decomposes the Q-value function into a state value function and an advantage function, enhancing the model’s ability to distinguish critical states. Even when certain actions have limited impact in specific states, this architecture can still effectively assess the importance of the state, accelerating learning and improving policy stability. Zhou et al. (2023) [
8] further applied a Branching Dueling Q-Network (BDQ) to the online scheduling of a microgrid with multiple distributed energy storage systems. They proposed a control framework capable of handling both discrete and continuous action spaces. Experimental results showed that the proposed method outperformed traditional DQN and DDPG in terms of energy dispatch efficiency and operational cost, demonstrating the practical potential of the dueling architecture in high-dimensional control tasks.
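The value/advantage decomposition can be written as a small output head. The sketch below assumes PyTorch, a 512-dimensional feature vector, and five discrete actions; these sizes are illustrative assumptions rather than values from the cited works.

```python
import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    """Dueling output head: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)."""
    def __init__(self, feature_dim=512, num_actions=5):
        super().__init__()
        self.value_stream = nn.Linear(feature_dim, 1)                 # state value V(s)
        self.advantage_stream = nn.Linear(feature_dim, num_actions)   # advantages A(s, a)

    def forward(self, features):
        value = self.value_stream(features)
        advantage = self.advantage_stream(features)
        # Subtracting the mean advantage keeps the decomposition identifiable
        return value + advantage - advantage.mean(dim=1, keepdim=True)
```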
2.1.4. Policy Gradient
Policy Gradient is a class of reinforcement learning methods that directly optimize parameterized policies through gradient ascent. It is particularly effective in handling continuous action spaces and generating stochastic policies for enhanced exploration. Wu et al. (2023) [
9] proposed a method combining faded experience mechanisms with Trust Region Policy Optimization (TRPO) for model-free power allocation in interference channels. Their approach introduced robust policy update constraints that mitigated training oscillations and improved stability. Sun et al. (2024) investigated the linear convergence properties of policy gradient methods in game-theoretic settings and proved that, under entropy regularization, independent policy updates can achieve stable convergence in non-cooperative environments [
10]. These studies demonstrate the solid theoretical foundation and practical potential of Policy Gradient approaches in continuous decision-making problems.
2.1.5. Advantage Actor–Critic (A2C)
A2C is a synchronous actor–critic architecture that improves training stability by using the advantage function. Mandal et al. [
11] proposed an enhanced version, Optimal L-AC-GAA, which reduces gradient variance by optimizing the number of critic updates. Their method combines random search with smoothed estimation techniques and demonstrates superior stability and performance across various reinforcement learning tasks.
2.1.6. Proximal Policy Optimization (PPO)
Proximal Policy Optimization (PPO) is a trust region-based policy optimization algorithm that stabilizes training by limiting the magnitude of policy updates. Lin et al. [
12] proposed Anti-Martingale PPO, which applies anti-martingale theory to dynamically adjust the learning rate and update frequency, thereby improving the stability of policy convergence and sample efficiency. Gu et al. [
13] introduced PPO with Policy Feedback (PPO-PF), which incorporates a policy feedback mechanism to construct a novel actor–critic architecture capable of jointly updating the policy and value function. Their approach demonstrates faster convergence and greater stability across multiple reinforcement learning tasks. Both methods outperform the original PPO in standard benchmarks, highlighting the effectiveness of their improvements.
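The way PPO limits the magnitude of policy updates can be summarized by its clipped surrogate objective. The sketch below assumes PyTorch tensors of per-sample log-probabilities and advantages; it shows the standard clipped loss, not the variants proposed in the cited papers.

```python
import torch

def ppo_clipped_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped surrogate objective of PPO: limits how far the new policy
    can move away from the policy that collected the data."""
    ratio = torch.exp(new_log_probs - old_log_probs)      # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Negative of the surrogate objective, so it can be minimized by gradient descent
    return -torch.min(unclipped, clipped).mean()
```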
2.2. Sim2Real
In robotic reinforcement learning, training directly in real-world environments often faces two major challenges. The first is efficiency: after completing each task, the robot must return to its initial position, making the process time-consuming and operationally complex. The second is safety: since deep reinforcement learning relies heavily on trial and error, unstable policies during training may lead to collisions or hardware damage. In contrast, simulation environments allow for parallel training with multiple agents, significantly improving data efficiency while avoiding physical risks. However, discrepancies still exist between simulation and reality—such as differences in color, lighting, or object geometry—which may affect the model’s performance when deployed in real-world scenarios. As a result, ensuring smooth transfer from simulation to reality has become a key issue for real-world deployment, commonly addressed through simulation-to-reality (Sim2Real) techniques.
To tackle this problem, Samak et al. [
14] proposed a robust control framework for a 1:14 scale autonomous vehicle named Nigel, which features all-wheel drive and steering capabilities. Their approach employs a multi-model, multi-objective control strategy based on polytopic Linear Parameter-Varying (LPV) models to account for real-world disturbances such as friction and wind. The framework demonstrates stable control and accurate tracking in both simulation and physical experiments. On the other hand, Zhang et al. [
15] presented a comprehensive survey of Sim2Real techniques in autonomous driving, covering methods such as domain randomization, digital twins, and parallel intelligence. Their work highlights how enhancing the realism and environmental consistency of simulation can effectively bridge the gap between virtual and real environments. Collectively, these studies emphasize that, for reinforcement learning to be successfully applied in physical systems, algorithm design must be complemented by robust control strategies and high-fidelity simulation environments.
2.2.1. Domain Randomization
Domain randomization has been widely adopted in sim-to-real (Sim2Real) tasks to bridge the gap between simulation and the real world. Peng et al. [
16] proposed a domain generalization approach for semantic segmentation that incorporates Global and Local Texture Randomization (GTR and LTR). By introducing texture styles from paintings, the model learns domain-invariant features rather than relying on fixed textures. A consistency regularization mechanism (CGL) is also introduced to harmonize the two randomization strategies. Their method demonstrated superior generalization performance across multiple datasets compared to existing approaches. On the other hand, Horváth et al. [
17] proposed a domain randomization pipeline for object detection, where synthetic images are randomized in terms of color, lighting, and camera parameters to train a YOLOv4 model. Their approach achieved 86.32% mAP50 in a zero-shot setting and 97.38% in a one-shot setting, highlighting its practicality even with minimal real-world annotations. These studies show that domain randomization effectively enhances cross-domain generalization and offers a robust solution to Sim2Real challenges.
2.2.2. CycleGAN
CycleGAN (Cycle-Consistent Generative Adversarial Network) is a deep learning technique used for image style transfer and is capable of handling unpaired data. Its architecture consists of two generators and two discriminators, trained through adversarial learning and cycle consistency loss. This allows the model to establish mappings between different image styles while preserving the structural information of the original image. Since it does not require paired training data, CycleGAN is particularly suitable for applications where collecting corresponding image pairs is difficult.
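The cycle-consistency constraint described above can be expressed compactly. The sketch below assumes PyTorch and that the two generators G_ab (style A to B) and G_ba (style B to A) are given as nn.Module instances; the adversarial terms and the weight value are omitted or assumed.

```python
import torch
import torch.nn as nn

l1 = nn.L1Loss()

def cycle_consistency_loss(G_ab, G_ba, real_a, real_b, weight=10.0):
    """Cycle-consistency term of CycleGAN: translating an image to the other
    domain and back should reconstruct the original image."""
    reconstructed_a = G_ba(G_ab(real_a))   # A -> B -> A
    reconstructed_b = G_ab(G_ba(real_b))   # B -> A -> B
    return weight * (l1(reconstructed_a, real_a) + l1(reconstructed_b, real_b))
```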
In prior studies, Zhang et al. [
18] enhanced glue image contrast using CycleGAN to improve edge clarity and defect recognition. Wang et al. [
19] applied CycleGAN to GPR images with attention and deformable convolutions for noise suppression and pipeline localization. Kazemi and Musilek [
20] transformed passive microwave sensor responses into active-like signals, improving Q-factor and sensitivity for detecting trace components in water.
This study also integrates CycleGAN as an image style transfer method for Sim2Real adaptation to reduce the visual gap between the simulation and real-world environments. By training a CycleGAN model to transform simulated images into realistic styles, the approach enables the reinforcement learning model to perceive simulation inputs with real-world characteristics. This enhances the Duckiebot’s adaptability and success rate in real-world navigation tasks. The proposed method combines CycleGAN with reinforcement learning algorithms such as Dueling DQN and PPO, and comparative experiments were conducted against models trained without Sim2Real techniques and with domain randomization only. The results demonstrate that CycleGAN is a critical technique for effective simulation-to-reality transfer, significantly improving the success rate and stability of autonomous navigation in physical environments.
2.3. Path Planning
Path planning is one of the core technologies in the navigation systems of self-driving cars and autonomous mobile robots. Navigation aims to guide the robot from its starting position to a target position safely and efficiently, and it mainly involves three components: localization, path planning, and motion control. Path planning selects the most suitable route through the given environment and is therefore a key step in the navigation process.
Depending on how much environmental information is available, path planning can be divided into two categories: static and dynamic. Static planning assumes that the environment does not change; common methods include the Dijkstra and A star algorithms. Dynamic planning reacts in real time to changing environments; for example, the D* algorithm continuously updates environmental information through sensors to plan the best path from the current location to the target.
2.4. Gym Duckietown
Duckietown is an open and scalable experimental platform that reproduces the environment required for autonomous driving technologies through pre-defined road configurations, traffic signals, and signs such as AprilTags [21]. The platform supports the development and testing of various tasks, including basic functions such as lane-following, signal recognition, localization, planning, and navigation, and can be further expanded into multi-robot systems for advanced applications such as communication and collaboration. The core of the system is a small autonomous vehicle named the Duckiebot, which uses a Raspberry Pi as its main computing unit, runs ROS (Robot Operating System) as its software framework, and drives two DC motors through a motor driver board. Environmental perception relies entirely on a fisheye camera mounted on the front of the vehicle, which is the Duckiebot's only sensor.
To improve the efficiency of reinforcement learning training, Duckietown also provides a simulation environment, Gym Duckietown, as a virtual alternative to real-world experiments. Written in Python 3.9 and built on top of the OpenAI Gym framework, the simulator uses OpenGL to render graphics that reproduce the perspective and behavior of a real-world Duckiebot. Gym Duckietown implements a two-wheel differential-drive model that matches the motion characteristics of the real vehicle and provides simulated camera images that reproduce the visual information seen by the real fisheye lens. In addition, the simulator includes a reward mechanism to evaluate the quality of the agent's decision-making: when the Duckiebot drives close to the middle of the road it receives a positive reward, whereas deviating from the lane or colliding yields a negative reward. To determine the vehicle's position relative to the road center, the system pre-computes the road centerline, which is constructed from third-order Bézier curves. Bézier curves are a common curve representation in computer graphics that can be reshaped through control points and are used in Duckietown to accurately model road geometry.
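Since the centerline is built from third-order (cubic) Bézier curves, the following NumPy sketch shows how such a curve is evaluated from four control points. The control-point values are arbitrary examples; this is not the simulator's own code.

```python
import numpy as np

def cubic_bezier(p0, p1, p2, p3, t):
    """Evaluate a third-order (cubic) Bezier curve at parameter t in [0, 1].
    p0..p3 are 2D control points; p0 and p3 are the curve endpoints."""
    p0, p1, p2, p3 = map(np.asarray, (p0, p1, p2, p3))
    return ((1 - t) ** 3) * p0 + 3 * ((1 - t) ** 2) * t * p1 \
           + 3 * (1 - t) * (t ** 2) * p2 + (t ** 3) * p3

# Sample the curve densely to approximate a road centerline segment
ts = np.linspace(0.0, 1.0, 50)
centerline = np.array([cubic_bezier([0, 0], [1, 0], [2, 1], [3, 1], t) for t in ts])
```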
3. Materials and Methods
3.1. System Architecture
The implementation flow of the system consists of the following stages: lane-keeping task training, algorithm evaluation and selection, navigation task training, simulation-to-reality (Sim2Real) transfer strategy development, real-world deployment of the autonomous driving system, and system performance evaluation. The first four stages are conducted within a simulation environment, while the final two stages are executed in the real-world setting, as shown in
Figure 2.
In terms of system construction, the training platform utilizes a desktop computer equipped with an NVIDIA RTX 3080 GPU and an Intel Core i7 processor to perform deep reinforcement learning model training and inference. The autonomous vehicle platform is based on a custom-built Duckiebot as the mobile base, with a Jetson Nano embedded system serving as the main processing unit. The platform is further equipped with an NoIR camera, a Stepper Motor HAT V2 motor expansion board, and HC-1-048 stepper motors to achieve vehicle control and perception functionalities. A fisheye camera combined with a Jetson Nano is used for real-time localization. All devices communicate through a Wi-Fi network based on a socket communication framework, enabling data exchange and remote control. All hardware components, including the NVIDIA RTX 3080 GPU, Intel Core i7 processor, Jetson Nano, NoIR camera, Stepper Motor HAT V2, and HC-1-048 stepper motors, were procured through authorized distributors in Taiwan.
3.2. Lane Keeping Training
In this phase, four algorithms are employed to accomplish the lane-keeping task: the value-based algorithms Double DQN (DDQN) and Dueling DQN, and the policy-based algorithms Advantage Actor–Critic (A2C) and Proximal Policy Optimization (PPO).
In the experimental setup, the Duckiebot (agent) is configured to drive along the outer lane in the simulator. Each episode terminates either when the driving steps exceed a predefined maximum number or when the Duckiebot deviates from the outer lane. The experimental environment is illustrated in
Figure 3.
In the original simulator settings, an episode ends only when the Duckiebot crosses the white line; behaviors such as crossing the single yellow line or driving against traffic are therefore allowed, which does not meet our task requirements.
To address this, we used images from the Duckiebot camera to detect crossings over the single yellow line. Since the yellow line consists of discontinuous segments, we processed the images by first converting them from RGB to HSV color space, which facilitates color extraction. We then isolated the yellow regions, applied a mask to segment the image, extracted the centroids of these regions, and used polynomial regression to fit a continuous curve.
The detection process is illustrated in
Figure 4, allowing for a more accurate determination of whether the Duckiebot crosses the single yellow line.
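A minimal OpenCV sketch of this detection pipeline is shown below, assuming OpenCV 4 and RGB simulator images. The HSV thresholds and the polynomial degree are assumptions that would need tuning; the sketch illustrates the sequence of steps (HSV conversion, yellow mask, segment centroids, polynomial fit) rather than the exact implementation.

```python
import cv2
import numpy as np

def fit_yellow_line(rgb_image, poly_degree=2):
    """Detect the dashed yellow lane segments and fit a continuous curve through them."""
    hsv = cv2.cvtColor(rgb_image, cv2.COLOR_RGB2HSV)
    # Approximate HSV range for yellow; exact thresholds must be tuned to the scene
    mask = cv2.inRange(hsv, (20, 100, 100), (35, 255, 255))
    # One centroid per detected yellow segment
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    centroids = []
    for c in contours:
        m = cv2.moments(c)
        if m["m00"] > 0:
            centroids.append((m["m10"] / m["m00"], m["m01"] / m["m00"]))
    if len(centroids) <= poly_degree:
        return None  # not enough segments visible to fit a curve
    xs, ys = zip(*centroids)
    # Fit x as a polynomial in y (image rows), giving a continuous yellow-line curve
    return np.polyfit(ys, xs, poly_degree)
```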
The images captured by the Duckiebot (agent) are used as inputs to the neural network, which subsequently predicts an action. In this experiment, the Duckiebot is configured to select from five discrete actions: turning left, slight left deviation, moving straight, slight right deviation, and turning right. These actions correspond to integer labels 0 through 4, respectively, as illustrated by the five blue trajectories in
Figure 5.
The environment transitions to the next frame based on the action taken by the Duckiebot (agent), and a corresponding reward is assigned. The reward is computed as

reward = −5 ⋅ |ΔE| + 5 ⋅ dot_dir, (1)

which consists of two components. The first term, |ΔE|, is the distance between the Duckiebot and the reference line (the optimal driving path), defined as the offset between the red line (reference line) and the green dashed line (Duckiebot position) in Figure 6. A greater distance results in a larger penalty and thus a lower reward, which can become negative. The second term, dot_dir, is the dot product of the x- and y-axis components of the reference-line direction (red line) and the Duckiebot's heading direction (blue dashed line). A higher dot product indicates better alignment between the Duckiebot and the reference path and therefore yields a higher reward. The coefficients −5 and 5 balance the influence of deviation and alignment in the overall reward.
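Equation (1) can be transcribed directly as a small function. The sketch below assumes that the signed distance and the two (unit) direction vectors are already available from the simulator; the function itself is only an illustrative restatement of the formula.

```python
import numpy as np

def lane_reward(signed_dist, lane_dir, heading_dir, k_dist=5.0, k_align=5.0):
    """Reward of Equation (1): penalize distance from the reference line and
    reward alignment between the Duckiebot heading and the reference direction."""
    lane_dir = np.asarray(lane_dir, dtype=float)
    heading_dir = np.asarray(heading_dir, dtype=float)
    # Dot product of the x/y direction components measures alignment
    dot_dir = float(np.dot(lane_dir, heading_dir))
    return -k_dist * abs(signed_dist) + k_align * dot_dir
```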
In the experiments, all four algorithms used the same neural network architecture, differing only in the final output layer. The network structure for the policy-based A2C and PPO algorithms is shown in Figure 7. For the value-based DDQN and Dueling DQN algorithms, the output layer consists of only five neurons: the value output shown in Figure 7 is removed, and the action probabilities are replaced with Q-values.
The network architecture consists of three convolutional layers followed by three fully connected layers, making it more lightweight compared to typical deep learning networks. This results in fewer computations and parameters, ensuring that the agent can make rapid decisions and interact with the environment in real time.
For the actor and critic networks, a shared architecture was adopted, where the three convolutional layers and one fully connected layer share the same weights. This design reduces the number of parameters and accelerates the network’s convergence. Moreover, the first two convolutional layers employ larger convolutional kernels and larger strides to extract features and reduce the data volume. This approach replaces traditional pooling layers, enabling the feature maps to be downsampled quickly without a significant loss of important features.
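A possible PyTorch sketch of this architecture is given below. The overall structure (three convolutional layers with larger kernels and strides in place of pooling, a shared trunk, and separate actor and critic heads) follows the description above, but the specific channel counts, kernel sizes, and hidden-layer widths are assumptions, since the exact values are not stated in the text.

```python
import torch
import torch.nn as nn

class SharedActorCritic(nn.Module):
    """Three convolutional layers (large kernels/strides instead of pooling) followed by
    fully connected layers; the actor and critic share the convolutional trunk and one
    fully connected layer. Layer sizes are illustrative assumptions."""
    def __init__(self, num_actions=5):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),   # large kernel/stride downsamples
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(512), nn.ReLU(),                          # shared fully connected layer
        )
        self.actor = nn.Sequential(nn.Linear(512, 128), nn.ReLU(), nn.Linear(128, num_actions))
        self.critic = nn.Sequential(nn.Linear(512, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, image):
        features = self.trunk(image)
        return self.actor(features), self.critic(features)   # action logits and state value
```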
3.2.1. DDQN, Dueling DQN Training Process
The training procedures and network updates for DDQN and Dueling DQN are identical, differing only in network architecture, where Dueling DQN separates the output into a value function and an advantage function. As defined in Table 1, training begins by obtaining an initial state s_t from the simulator, represented as a resized (160 × 120) color image.
The state s_t is input into the neural network to predict action values for the five possible actions. The agent selects the action with the highest value, receives a reward r_t, and transitions to the next state s_{t+1}. The tuple (s_t, a_t, r_t, s_{t+1}) is stored in replay memory. Once sufficient samples are collected, the current Q-network is trained, and its weights are periodically synchronized to the target network.
Training continues until the Duckiebot either leaves the outer lane or reaches the episode step limit. Upon episode termination, the environment is reset for the next episode, as illustrated in
Figure 8.
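The loop in Figure 8 can be compressed into the following sketch, assuming the classic Gym API and PyTorch. The helpers to_tensor (converts and batches an observation) and dqn_loss (computes the loss from sampled transitions, e.g., with Double DQN targets) are hypothetical placeholders, and the buffer size, epsilon, and synchronization interval are assumptions.

```python
import random
from collections import deque
import torch

replay_memory = deque(maxlen=100_000)   # capacity is an assumption
SYNC_EVERY = 1_000                      # steps between target-network synchronizations

def train_episode(env, online_net, target_net, optimizer,
                  epsilon=0.1, batch_size=32, gamma=0.99):
    state, done, step = env.reset(), False, 0
    while not done:
        # Epsilon-greedy action selection from the online network's Q-values
        if random.random() < epsilon:
            action = env.action_space.sample()
        else:
            with torch.no_grad():
                action = online_net(to_tensor(state)).argmax(dim=1).item()  # to_tensor: placeholder
        next_state, reward, done, _ = env.step(action)
        replay_memory.append((state, action, reward, next_state, float(done)))
        state, step = next_state, step + 1

        if len(replay_memory) >= batch_size:
            batch = random.sample(replay_memory, batch_size)
            loss = dqn_loss(online_net, target_net, batch, gamma)   # dqn_loss: placeholder
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        if step % SYNC_EVERY == 0:
            target_net.load_state_dict(online_net.state_dict())     # periodic synchronization
```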
3.2.2. A2C, PPO Training Process
The training processes and network architectures of A2C and PPO are similar, differing mainly in their network update mechanisms. Training starts by acquiring the initial state s_t from the environment. The state s_t is input into the actor network, which outputs a probability distribution over the five discrete actions. The agent samples an action a_t from the distribution and executes it, receiving a corresponding reward r_t.
Each experience (s_t, a_t, r_t) is stored in a temporary buffer. If the update condition is not yet met, the interaction continues. Once the specified number of steps is reached, the collected experiences are used to update the networks, and the buffer is cleared to store new data for the next update cycle. An episode is terminated either when the agent exceeds the maximum number of steps or deviates from the outer lane, as illustrated in Figure 9. The definitions of the variables used in the training process are summarized in Table 2.
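The collect-then-update cycle can be sketched as follows, assuming the classic Gym API and PyTorch. The rollout length is an assumption, to_tensor is a hypothetical helper that batches an observation, and update_fn stands in for the A2C or PPO update step.

```python
import torch

ROLLOUT_LENGTH = 128   # interaction steps collected before each update (assumption)

def collect_and_update(env, actor_critic, update_fn, state):
    """Collect a fixed number of on-policy transitions, then update the networks
    and clear the buffer, following the cycle described above."""
    buffer = []
    for _ in range(ROLLOUT_LENGTH):
        with torch.no_grad():
            logits, value = actor_critic(to_tensor(state))           # to_tensor: placeholder helper
            dist = torch.distributions.Categorical(logits=logits)
            action = dist.sample()
        next_state, reward, done, _ = env.step(action.item())
        buffer.append((state, action.item(), reward, done))
        state = env.reset() if done else next_state
    update_fn(actor_critic, buffer)   # A2C or PPO update on the collected rollout
    return state                      # buffer is discarded; a new one starts next cycle
```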
3.3. Navigation Function Training
The navigation function aims to guide the autonomous vehicle to safely reach its destination, while maintaining lane-keeping as a fundamental prerequisite. To achieve navigation, it is also necessary to measure the vehicle’s position through localization technologies, such as GPS. In the simulator, however, this parameter can be directly accessed.
The experimental setup involves training the Duckiebot on a map containing four T-junctions, as shown in
Figure 10. During training, the Duckiebot follows a path determined by the path planning algorithm, traveling from a randomly selected starting point to a destination within the drivable area. An episode terminates if the Duckiebot deviates significantly from the planned route or exits the drivable area. The red lines in
Figure 10 are used to visually highlight intersection points on the map and do not affect the navigation logic or training process.
To simplify implementation and validate the feasibility of the proposed method, the Duckiebot is restricted to only moving straight, turning left, or turning right, thereby preventing U-turns in the planned path. The map coordinates from the simulator are converted into a directed graph, as illustrated in
Figure 11, where the distance between adjacent nodes is set to one. This structure enables the use of the A star algorithm for path planning.
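A self-contained A* sketch over such a directed graph with unit edge costs is given below. The node representation and the Manhattan-distance heuristic are illustrative assumptions consistent with the grid-like map, not the exact implementation used in this work.

```python
import heapq

def a_star(graph, start, goal, heuristic):
    """A* search on a directed graph given as {node: [(neighbor, cost), ...]}.
    With unit edge costs (as in the converted map), each cost is simply 1."""
    open_set = [(heuristic(start, goal), 0, start, [start])]
    visited = set()
    while open_set:
        _, g, node, path = heapq.heappop(open_set)
        if node == goal:
            return path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, cost in graph.get(node, []):
            if neighbor not in visited:
                f = g + cost + heuristic(neighbor, goal)
                heapq.heappush(open_set, (f, g + cost, neighbor, path + [neighbor]))
    return None  # no path found

# Example heuristic: nodes are (row, col) tiles, Manhattan distance between them
manhattan = lambda a, b: abs(a[0] - b[0]) + abs(a[1] - b[1])
```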
Once the path is planned, it is mapped back onto the simulator environment, with the overall process illustrated in
Figure 12. The planned trajectory is then represented using Bézier curves, as shown in
Figure 13, and the white dashed lines serve as the reference trajectory for navigation. The red lines in
Figure 13 are used to visually highlight intersection points on the map and do not affect the navigation logic or training process.
When training for the navigation task, if only the front-facing image captured by the Duckiebot (agent) is fed into the actor network, the agent would be unable to determine whether to move forward, turn left, or turn right upon encountering a T-junction. As a result, it would fail to follow the path planned by the A star algorithm. To address this issue, both the image input (img_input) and the path planning guidance (path_input) are simultaneously fed into the actor network.
The path_input is constructed by selecting the ten points on the planned path closest to the Duckiebot's current position. Using the arctangent function, the angle between the Duckiebot and each of the ten points is calculated with the standard two-point angle formula, as illustrated in Figure 14 (where a single point is used for demonstration). The calculated angles fall within a range of −45° to 45° and are normalized before being input into the network. This design allows the Duckiebot to learn whether to move forward, turn left, or turn right based on both its visual input and the path guidance.
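A possible NumPy sketch of this computation is shown below. The choice of reference frame (angles measured relative to the Duckiebot heading, in degrees) and the normalization to [−1, 1] are assumptions made for illustration.

```python
import numpy as np

def build_path_input(duckiebot_pos, duckiebot_heading_deg, path_points, num_points=10):
    """Compute normalized angles from the Duckiebot to the ten nearest path points.
    Angles are measured relative to the heading and clipped to [-45, 45] degrees."""
    pos = np.asarray(duckiebot_pos, dtype=float)
    points = np.asarray(path_points, dtype=float)
    # Ten planned-path points closest to the current position
    nearest = points[np.argsort(np.linalg.norm(points - pos, axis=1))[:num_points]]
    angles = []
    for p in nearest:
        dx, dy = p - pos
        # Two-point angle formula via arctangent, expressed relative to the heading
        angle = np.degrees(np.arctan2(dy, dx)) - duckiebot_heading_deg
        angles.append(np.clip(angle, -45.0, 45.0) / 45.0)   # normalize to [-1, 1]
    return np.array(angles, dtype=np.float32)
```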
The available actions for the Duckiebot are the same as those described in the previous section on lane-keeping training, consisting of five discrete options: turning left, slight left deviation, moving straight, slight right deviation, and turning right.
The reward calculation method is the same as described in the previous section, except that the reference line for navigation is replaced by the planned path.
The training for the navigation task differs from that of the lane-keeping task in that it requires both image input and path planning guidance (path_input) as inputs to the actor network, rather than relying solely on images. Therefore, the network architecture is modified accordingly.
The image input branch remains consistent with the lane-keeping network structure, with the only change occurring at the first fully connected layer after the convolutional layers, where the number of neurons is reduced from 512 to 256. The path_input is processed through two fully connected layers with nonlinear activation to extract a 128-dimensional feature vector. This feature vector is then concatenated with the 256-dimensional feature vector from the image input. The purpose of this concatenation is to fuse the semantic information learned separately from the two inputs. After concatenation, the combined features are passed through several fully connected layers to produce the final outputs: the value prediction by the critic and the action probability prediction by the actor, as illustrated in
Figure 15.
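The fusion of the two input branches can be sketched as follows in PyTorch. The 256-dimensional image features and 128-dimensional path features follow the description above, while the convolutional layer settings, hidden widths, and the 10-dimensional path_input are assumptions for illustration.

```python
import torch
import torch.nn as nn

class NavigationActorCritic(nn.Module):
    """Two-branch network: a convolutional branch for img_input (256-d features) and a
    fully connected branch for path_input (128-d features), concatenated before the
    actor and critic heads. Layer details are illustrative assumptions."""
    def __init__(self, num_actions=5, path_dim=10):
        super().__init__()
        self.image_branch = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(256), nn.ReLU(),           # reduced from 512 to 256 neurons
        )
        self.path_branch = nn.Sequential(            # two FC layers with nonlinear activation
            nn.Linear(path_dim, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
        )
        self.actor = nn.Sequential(nn.Linear(256 + 128, 128), nn.ReLU(), nn.Linear(128, num_actions))
        self.critic = nn.Sequential(nn.Linear(256 + 128, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, img_input, path_input):
        fused = torch.cat([self.image_branch(img_input), self.path_branch(path_input)], dim=1)
        return self.actor(fused), self.critic(fused)
```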
In this phase, the Dueling DQN and PPO algorithms, which achieved better performance in the previous stage, are selected for further training. The overall training process remains similar to that of the lane-keeping task, with the primary differences being the use of a different map and a modified state input s_t, which now consists of both the image input (img_input) and the path planning guidance (path_input).
Taking the PPO algorithm as an example, the modified sections of the training process are highlighted with red boxes, as illustrated in
Figure 16.
Since the Duckiebot must follow the path planned by the A star algorithm, the episode termination condition is changed from deviating from the outer lane (as in lane-keeping training) to deviating from the planned path.
3.4. Sim2Real Method Training
Through simulation, the Duckiebot is able to achieve navigation functionality within a designed environment. However, the ultimate goal is to implement and apply the proposed method in a real-world setting. In this study, two approaches are adopted to realize Sim2Real transfer. The first approach employs CycleGAN to reduce the domain gap between simulation and reality, while the second approach applies domain randomization by treating the real environment as one of the variations within the simulated domain. These methods collectively enable the Duckiebot to successfully perform navigation tasks in the real world.
During training, the physical parameters of the simulation environment are randomized within predefined ranges, resulting in different visual scenes for the Duckiebot in each episode. The randomized parameters include sky color, ground color, ground brightness, light source position, fisheye camera angle, camera height above the ground, and camera frame skipping. By combining variations in these physical parameters, diverse simulation environments are generated, as illustrated in
Figure 17.
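A small sketch of per-episode parameter randomization is given below. The parameter names mirror those listed above, but the sampling ranges are purely illustrative assumptions, and applying the sampled values depends on the simulator's configuration interface.

```python
import random

def sample_randomized_params():
    """Draw one randomized configuration per episode; the ranges below are
    illustrative assumptions, not the values used in the experiments."""
    return {
        "sky_color":         [random.uniform(0.3, 1.0) for _ in range(3)],   # RGB in [0, 1]
        "ground_color":      [random.uniform(0.2, 0.8) for _ in range(3)],
        "ground_brightness":  random.uniform(0.6, 1.4),
        "light_position":    [random.uniform(-5, 5), random.uniform(2, 6), random.uniform(-5, 5)],
        "camera_angle_deg":   random.uniform(-5, 5),
        "camera_height_m":    random.uniform(0.08, 0.12),
        "frame_skip":         random.randint(1, 4),
    }

# At the start of each episode, the sampled values are written into the simulator
params = sample_randomized_params()
```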
Images from the simulated environment (Style A) and the real environment (Style B) were collected to train a CycleGAN network, enabling it to generate images closely resembling real-world styles, as shown in
Figure 18.
Using the same setup, Dueling DQN and PPO algorithms are applied for training. Taking PPO as an example, the training process is illustrated in
Figure 19.
First-person view images from the simulator are input into the trained CycleGAN, which outputs images in a realistic style. This allows the Duckiebot to perceive the environment as if navigating in the real world, enabling sim-to-real transfer for navigation training.
3.5. Real-World Environment
In this study, a Duckiebot is used to simulate an autonomous vehicle, as shown in
Figure 20. The real-world experimental environment is illustrated in
Figure 21,
Figure 22 and
Figure 23, where a foam puzzle mat is used to construct the driving area, measuring 220 cm in length and 270 cm in width. At the top of the environment, a fisheye camera and a Jetson Nano are installed to calculate the position of the Duckiebot within the driving area, providing the localization information required for navigation. All experimental devices are connected to the same Wi-Fi router, and data transmission is handled through a socket communication framework.
In this experiment, image-based localization was employed to simulate the GPS positioning system of a real autonomous vehicle. A blue corrugated board was mounted on the Duckiebot to enable fast and easy position detection.
The image localization process is illustrated in
Figure 24. The captured image is first rotated to a horizontal orientation. Due to the use of a fisheye lens, the field of view extends beyond the driving area, producing a convex distortion effect similar to the view of a fish underwater looking up. To correct this, a region of interest (ROI) covering the driving area was extracted, followed by an affine transformation to remove distortion and resize the map to match the simulation dimensions.
Chessboard calibration, although common for fisheye correction, was not used, as it would significantly reduce the field of view, making full coverage of the driving area impossible. After correction, the image was converted from RGB to HSV color space to isolate the blue region representing the Duckiebot. The centroid of the blue region was calculated as the Duckiebot position. Finally, the localization information was transmitted back to the Duckiebot to complete the image-based localization process.
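An OpenCV sketch of this localization pipeline is shown below. The ROI, the three reference points for the affine transform, and the HSV thresholds for blue are assumptions that would be calibrated for the actual setup; the sketch only illustrates the sequence of steps.

```python
import cv2
import numpy as np

def localize_duckiebot(frame, roi, src_pts, dst_size):
    """Estimate the Duckiebot position from the overhead fisheye image: crop the
    driving-area ROI, warp it to the simulation map size, and take the centroid
    of the blue marker. Thresholds and reference points are illustrative."""
    x, y, w, h = roi
    area = frame[y:y + h, x:x + w]
    # Affine transform from three reference points in the image to the map corners
    dst_pts = np.float32([[0, 0], [dst_size[0], 0], [0, dst_size[1]]])
    matrix = cv2.getAffineTransform(np.float32(src_pts), dst_pts)
    warp = cv2.warpAffine(area, matrix, dst_size)
    hsv = cv2.cvtColor(warp, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, (100, 120, 70), (130, 255, 255))   # approximate blue range
    m = cv2.moments(mask)
    if m["m00"] == 0:
        return None                                  # blue board not visible
    return (m["m10"] / m["m00"], m["m01"] / m["m00"])   # centroid = Duckiebot position
```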
In this study, a self-driving car system is constructed using a socket communication framework, as illustrated in
Figure 25. The system consists of a monitoring computer, a fisheye camera that captures the driving area, a Jetson Nano that calculates the Duckiebot's position from the captured images, and the Duckiebot itself, which simulates an autonomous vehicle. The interface displayed on the monitoring computer is shown in
Figure 26. Its primary function is to allow the user to select the starting point and destination, which serve as the input for the path planning algorithm.
When a navigation task begins, the monitoring computer displays the interface shown in
Figure 26. The user selects a starting point and a destination, and the coordinate information is transmitted via the socket communication framework to the localization module (camera and Jetson Nano). The localization module then forwards the coordinate information to the Duckiebot, enabling it to perform path planning and generate the shortest path to the destination.
At this stage, the localization module begins image-based localization, calculating the Duckiebot's current position and transmitting this information back to the Duckiebot. The Duckiebot uses the received localization information to perform navigation: the image input (img_input) and the path planning guidance (path_input) are fed into the agent's neural network to determine the next action.
Simultaneously, the localization information is also sent to the monitoring computer, allowing the simulated visualization window to display the Duckiebot's navigation process.
If the Duckiebot has neither reached the destination nor deviated from the planned path, the process repeats. The task is considered successfully completed if the Duckiebot reaches the destination; otherwise, deviation from the planned path results in a task failure, and the navigation mission is terminated.
Visualization Experiments
In addition, we conducted feature visualization experiments to observe whether the features attended to by the model in the real-world environment are consistent with those attended to in the simulation environment.
Feature visualization is primarily conducted by multiplying the feature maps output from each convolutional layer, fusing them into a single feature map, and overlaying it onto the original input image, as shown in
Figure 27. This allows us to understand which features the convolutional layers focus on within the neural network.
The detailed procedure is illustrated in
Figure 28. First, the feature maps from the last convolutional layer are averaged. To enable element-wise multiplication with the feature maps from the preceding layer, the averaged feature map must be upsampled to match the spatial dimensions of the preceding layer’s feature maps.
Subsequently, element-wise multiplication is performed between the two sets of feature maps. This process is repeated layer by layer until reaching the first convolutional layer. The resulting matrix is then normalized, and a colormap is applied to map the values to the color scale shown in
Figure 29, where more yellow regions indicate areas where the network’s convolutional layers focus more strongly.
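A possible PyTorch sketch of this fusion procedure is shown below, assuming the per-layer feature maps have been captured (e.g., with forward hooks). Averaging each layer's maps over channels before the element-wise multiplication is an interpretation of the description; the exact fusion details may differ from the authors' implementation.

```python
import torch
import torch.nn.functional as F

def fuse_feature_maps(feature_maps):
    """Fuse convolutional feature maps from the last layer back to the first: average the
    deepest map over channels, upsample it to the previous layer's spatial size, multiply
    element-wise, and repeat. `feature_maps` is a list of [1, C, H, W] tensors ordered
    from the first to the last convolutional layer."""
    fused = feature_maps[-1].mean(dim=1, keepdim=True)          # average over channels
    for fmap in reversed(feature_maps[:-1]):
        prev = fmap.mean(dim=1, keepdim=True)
        fused = F.interpolate(fused, size=prev.shape[-2:], mode="bilinear", align_corners=False)
        fused = fused * prev                                     # element-wise multiplication
    fused = fused.squeeze()
    fused = (fused - fused.min()) / (fused.max() - fused.min() + 1e-8)   # normalize to [0, 1]
    return fused   # apply a colormap (e.g., cv2.applyColorMap) and overlay on the input image
```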
4. Results
This section presents the results of lane-keeping training and navigation training, comparing and discussing the effectiveness and feasibility of each Sim2Real method in real-world scenarios.
Unlike deep learning, where most models and algorithms have standard evaluation metrics—such as accuracy, precision, and recall for classification tasks, or Intersection over Union (IoU) and mean Average Precision (mAP) for object detection algorithms—deep reinforcement learning lacks a universal evaluation standard.
In deep reinforcement learning, a common approach is to record the rewards obtained by the agent (Duckiebot) in each episode throughout the training process and plot them as a curve. The performance of the algorithm is then judged based on whether the agent is able to complete a given task or game by analyzing the reward trends over time.
In the experimental results presented in this study, the horizontal axis of each line plot represents the number of steps, while the vertical axis represents the cumulative reward obtained in each episode. A higher cumulative reward in an episode indicates that the agent successfully completed the assigned task.
Due to the relatively slow convergence of the network during training and the varying lengths of navigation paths, the cumulative rewards per episode fluctuate significantly, making it difficult to directly assess the agent’s performance and learning trends. To facilitate observation, we compute the average cumulative reward over every 50 episodes and plot it as a dark red curve. Meanwhile, the pink curve represents the cumulative reward obtained in each individual episode.
4.1. Lane-Keeping Training Results
For the lane-keeping training experiments, both the DDQN and Dueling DQN algorithms were able to keep the Duckiebot driving along the road within one million steps. Training was therefore terminated upon reaching two million steps, as shown in Figure 30 and Figure 31.
Compared to Dueling DQN, the DDQN algorithm demonstrated a faster convergence speed. From the plots, it can be observed that DDQN reached a cumulative reward of over 1200 per episode at approximately 10,000 steps, which corresponds to the Duckiebot successfully completing two laps along the outer lane. This indicates that DDQN was able to learn the lane-keeping functionality with fewer interactions.
Regarding training stability, the results shown in Figure 30 indicate that the DDQN algorithm maintained a cumulative reward of around 1200 during the mid-to-late training stages, suggesting that the Duckiebot could reliably complete two laps in most episodes. However, there were still several instances where only one lap or less was completed.
In contrast, the Dueling DQN algorithm exhibited greater stability during the mid-to-late training stages, with almost every episode successfully completing two laps along the outer lane.
The training results for the policy-based A2C algorithm are shown in
Figure 32. Due to the relatively slow convergence speed of A2C, the training duration was extended. However, even with prolonged training, the agent failed to effectively learn the lane-keeping task. Although the overall trend of the cumulative reward per episode showed a slight increase, the growth rate was extremely slow.
The most critical issue was that, even after achieving a cumulative reward of approximately 1200 in some episodes, the agent was unable to consistently maintain this performance. As observed from the plot, the actual rewards exhibited significant fluctuations, indicating that the agent could successfully complete two laps along the outer lane in one episode but might fail at the first turn and deviate from the lane in the subsequent episode, resulting in a cumulative reward of less than 200.
Similarly, for the policy-based PPO algorithm, the agent was able to achieve a cumulative reward exceeding 1000 after approximately 250,000 interaction steps. Furthermore, after 550,000 steps, the agent consistently maintained a cumulative reward above 1100 per episode, as shown in
Figure 33.
Although the PPO algorithm achieved stable convergence, the cumulative reward per episode remained slightly below 1200. This was primarily due to the agent tending to drift toward the inner edge of the lane when making turns, which resulted in slightly lower cumulative rewards compared to the other three algorithms.
The trained models of the four algorithms were saved and subsequently loaded into the simulation environment to perform the lane-keeping task. Among them, the agents trained with DDQN, Dueling DQN, and PPO algorithms were able to continuously navigate along the outer lane without crossing the boundary. These agents could execute the task indefinitely unless the program was manually interrupted. In contrast, the agent trained with the A2C algorithm exhibited more unstable behavior. During the mid-to-late stages of training, the agent’s performance fluctuated. If the model was saved during a period of good performance, the agent could also execute the task continuously without deviation. Otherwise, the agent would typically deviate from the outer lane within a single lap.
Based on the experimental results, the value-based Dueling DQN algorithm and the policy-based PPO algorithm were selected for the next stage of navigation training. This selection was made because Dueling DQN exhibited greater stability during the mid-to-late stages of training, while PPO also demonstrated favorable performance.
4.2. Navigation Training Results
The training hyperparameters for the Dueling DQN and PPO algorithms were the same as those used in the lane-keeping training. The training results are shown in
Figure 34 and
Figure 35.
Due to its faster convergence, the training of the Dueling DQN algorithm was terminated after approximately 1.5 million interaction steps. In contrast, based on the previous lane-keeping training results, the PPO algorithm required more interaction steps and update iterations; thus, training was stopped when the rewards became more stable after approximately 3 million steps.
Comparing the two algorithms, the Dueling DQN agent achieved an average cumulative reward of nearly 700 per episode after convergence. This is because the agent trained with Dueling DQN was able to navigate closer to the center of the road, both on straight paths and at turns.
In contrast, the PPO agent achieved an average cumulative reward of approximately 550, indicating that the agent often drifted near the edges of the road and had a lower success rate in completing navigation tasks, resulting in lower overall rewards compared to the Dueling DQN agent.
The trained models saved during the training process were subsequently loaded to perform navigation tests, aiming to verify the effectiveness of the algorithms and the experimental methodology. The testing method involved executing 1000 navigation tasks, where the starting point and destination were randomly selected for each trial within the simulator environment. The testing results are summarized in
Table 3. The Dueling DQN algorithm achieved a success rate of 96.4%, while the PPO algorithm achieved a success rate of 66%.
4.3. Sim2Real Training Results
To enable the Duckiebot to operate in real-world environments, rather than being limited to simulation, two Sim2Real methods—domain randomization and style transfer—were employed during training. Since the lane-keeping task is relatively simple, the Duckiebot could successfully navigate along the center of the road in the real environment even without Sim2Real techniques. However, navigation tasks are more complex; therefore, domain randomization and style transfer were specifically applied to the navigation training. The corresponding training results are shown in
Figure 36,
Figure 37,
Figure 38 and
Figure 39.
From the results, it can be observed that the PPO algorithm combined with domain randomization did not achieve satisfactory performance. The reward curve exhibited significant fluctuations and ultimately failed to converge, indicating that this model would likely be unable to complete the navigation task in a real-world environment. Further verification in the real-world environment is required for the other trained models to assess their feasibility.
4.4. Performance Analysis of the Duckiebot in the Real-World Environment
For the navigation task, a total of six models were trained, including models trained without Sim2Real methods and models trained with Sim2Real techniques.
The first stage of testing aimed to verify whether these six models could successfully complete navigation tasks in a real-world environment.
The testing method involved executing navigation tasks across all turnings and T-junction paths on the map, totaling 32 paths (4 left turns, 4 right turns, and 24 paths at T-junctions, with each T-junction comprising 6 possible paths).
A model was considered successful if it could complete more than 20 paths; if fewer than 10 paths were completed, the model was considered to have failed.
The testing results are summarized in
Table 4. Among the Sim2Real approaches, only models trained using CycleGAN were successful. Models trained with domain randomization or without any Sim2Real methods failed. Among the two successful models, the Dueling DQN model completed 27 paths, while the PPO model completed 24 paths. Notably, the failed paths for both models occurred primarily at T-junctions.
Subsequently, additional navigation tasks were conducted in the real-world environment by randomly selecting starting points and destinations. This allowed for a more precise calculation of the success rates for the Dueling DQN and PPO models identified as successful in the previous stage.
The testing results are summarized in
Table 5. After 50 trials, the model trained with the Dueling DQN algorithm successfully completed 31 navigation tasks, while the model trained with the PPO algorithm successfully completed 26 tasks.
In
Table 5, the models trained with domain randomization and those trained without Sim2Real methods exhibited poor performance in real-world navigation tasks. To investigate the reasons for this failure, two additional experiments were conducted.
The first experiment involved testing the lane-keeping functionality of the models in a real-world environment. The results showed that models trained with the Dueling DQN and PPO algorithms were able to navigate stably along the road without requiring Sim2Real techniques, consistently completing more than four laps in each test. This indicates that the navigation task is significantly more complex and challenging than the lane-keeping task, making it more difficult to achieve successful transfer to the real environment.
The second experiment involved the feature visualization of the model trained with the Dueling DQN algorithm, with the results shown in
Figure 40. The leftmost image depicts the feature visualization result of a model that successfully executed the navigation task in the simulation environment. Regardless of the Duckiebot position on the map, the attention regions of the model predominantly covered the yellow and white road lines. A similar pattern was observed in the real-world environment for models trained with the CycleGAN-based Sim2Real method. In contrast, models trained with domain randomization and without Sim2Real methods mainly focused on the white road lines, with very limited attention to the yellow lines.
Based on these observations, it is inferred that domain randomization and the absence of Sim2Real techniques were insufficient to bridge the gap between the simulated and real environments. As a result, the neural networks failed to extract features related to the yellow road lines, ultimately leading to the failure of the navigation tasks.
Compared to the simulation environment, the real-world success rates in
Table 5 are significantly lower than those in
Table 3, despite using the same algorithms. This performance gap can be attributed to several factors: (1) the increased complexity of the real-world driving environment, which includes more varied lighting conditions, non-uniform road textures, and localization errors, and (2) the models’ insufficient generalization ability despite domain adaptation, as confirmed by feature visualization results. These findings highlight the limitations of current Sim2Real techniques and the need for more robust domain transfer strategies.
5. Conclusions
In this paper, we developed an autonomous vehicle system based on deep reinforcement learning methods. The system enables the Duckiebot to perform lane-keeping tasks in both simulated and real environments. For navigation, the A star algorithm is used to calculate the shortest path from the start point to the destination, with both the planned path and camera images fed into a neural network. The simulation results demonstrate the effectiveness and feasibility of this combined approach.
However, due to the physical differences between the simulated and real environments, the Duckiebot trained in simulation faces challenges when deployed in the real world. Even with the application of Sim2Real techniques such as CycleGAN and domain randomization, the system could only complete about half of the navigation tasks in real-world testing. Notably, although the Dueling DQN and PPO algorithms achieved high success rates in simulation (96.4% and 66%, respectively), their performance in the real world dropped significantly to 62% and 52%. This discrepancy reveals the difficulty of real-world deployment, particularly for navigation tasks. The complexity of the physical environment—including tighter turns, visual distractions, and non-uniform road features—adds to the challenge of policy transfer. Feature visualization experiments further indicate that models without Sim2Real adaptation fail to focus on essential features such as yellow lane markings, contributing to the performance gap.
Among all tested algorithms, the Dueling DQN algorithm demonstrated the best overall performance, showing faster convergence and higher training rewards. While these results indicate promising progress, ensuring the safety, robustness, and reliability of autonomous driving systems remains a critical concern. Future work will focus on addressing the challenges posed by complex real-world environments, enhancing the system’s ability to generalize across varied conditions. This includes developing safety-aware policies, incorporating richer sensor modalities, expanding training diversity, and implementing more resilient Sim2Real transfer strategies to bridge the simulation-to-reality gap effectively.