1. Introduction
Autonomous vehicles aim to achieve full automation (Level 5), allowing occupants to engage in other activities while the vehicle drives itself. Beyond personal use, autonomous vehicles may impact the public transportation, logistics, healthcare, and military sectors, changing many aspects of daily life. Deep reinforcement learning (DRL) has proven effective in many areas, enabling intelligent agents to learn complex tasks through experience.
The successful implementation of autonomous vehicles relies on the integration of several core technologies, including accurate localization [
1], environmental perception, decision-making, and motion control. These components work in concert and are essential for ensuring the safety and stability of the overall system. Among them, the ability of a vehicle to make stable and accurate action decisions from real-time sensory input remains one of the key challenges to reliable autonomous driving. To address this challenge, this study adopts a DRL approach to train models that learn lane-keeping and navigation behaviors from visual inputs. The objective is to enable the vehicle to maintain stable driving within lane boundaries and to follow a predefined navigation path. To evaluate the practicality and transferability of the trained models, a two-stage experimental setup is designed, consisting of simulation and real-world environments. Additionally, Sim2Real techniques are incorporated to mitigate the impact of visual discrepancies between the virtual and physical domains on model performance.
This study uses images captured by an onboard camera as input and applies deep reinforcement learning algorithms, including DQN (Deep Q-Network), DDQN (Double DQN), A2C (Advantage Actor–Critic), and PPO (Proximal Policy Optimization), to enable an autonomous vehicle to learn lane-keeping capabilities through interactions with a simulated environment. In the first stage, the model focuses on learning to drive stably within lane boundaries. In the second stage, the best-performing algorithm from the lane-keeping task is selected and further combined with path planning information to train the vehicle to perform navigation tasks on a specific map. Although the model performs well in simulation, it often fails to complete navigation tasks in real-world environments due to physical differences such as lighting and color. To bridge this gap between simulation and reality, this study adopts Sim2Real techniques to improve the model’s adaptability and performance in real-world applications.
Based on the above, the main contributions of this paper are as follows:
- (1)
An end-to-end autonomous driving framework is developed using visual input and multiple deep reinforcement learning algorithms (DQN, DDQN, A2C, PPO), enabling the vehicle to perform lane-keeping and navigation tasks.
- (2)
Sim2Real transfer techniques, including domain randomization and CycleGAN-based style transfer, are adopted to reduce the simulation-to-reality gap and enhance real-world generalization.
- (3)
A physical testing environment is built using Duckiebot and a custom driving setup to validate the system’s real-world performance and practical applicability.
2. Related Work
This section introduces the implementation methods of autonomous vehicle systems. First, it elaborates on the deep reinforcement learning algorithms used for lane-keeping tasks, then explains the navigation training strategies combined with path planning, and finally introduces techniques and applications for simulation to reality (Sim2Real).
2.1. Deep Reinforcement Learning
Reinforcement learning (RL) is a type of machine learning in which an agent continuously interacts with a dynamic environment. During this process, the agent receives a representation of the environment’s state, which could be an image or a set of physical parameters. Based on the observed state, the agent makes decisions and takes corresponding actions. The environment then provides a reward, which is used to update the agent’s behavior, enabling it to learn how to perform tasks effectively. The goal of reinforcement learning is to learn an optimal policy that maximizes the expected cumulative reward. This interaction process is illustrated in
Figure 1.
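For reference, the interaction cycle described above can be sketched in a few lines of Python. The example below assumes the classic OpenAI Gym API (a 4-tuple return from step) and uses a stand-in environment and a placeholder select_action function; it only illustrates the state-action-reward loop, not the implementation used in this work.

```python
# Minimal sketch of the agent-environment interaction loop (classic Gym API assumed).
import gym

env = gym.make("CartPole-v1")       # any Gym environment serves to illustrate the loop
state = env.reset()

def select_action(state):
    # Placeholder policy: in DRL, the action would come from a neural network.
    return env.action_space.sample()

cumulative_reward = 0.0
done = False
while not done:
    action = select_action(state)                   # agent decides based on the observed state
    state, reward, done, info = env.step(action)    # environment returns next state and reward
    cumulative_reward += reward                     # the agent learns to maximize this quantity
```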
2.1.1. Deep Q-Network (DQN)
Zhang et al. proposed a Dueling Double Deep Q-Network (D3QN-DA) architecture enhanced with dual attention mechanisms for tactical decision-making in autonomous driving scenarios such as highway lane-changing and obstacle avoidance [
2]. By integrating spatial and channel attention modules, the model improves its ability to identify critical regions and features. Experimental results demonstrated significant improvements in decision safety and driving efficiency, highlighting the potential of deep reinforcement learning in complex driving environments. Wu et al. applied the Double DQN algorithm to a device-to-device (D2D) communication resource matching problem and introduced a distributed framework where each node independently performs action selection and resource allocation [
3]. This approach addresses the scalability limitations of centralized learning in high-dimensional spaces and enhances overall system throughput and stability, offering valuable insights for multi-agent decision-making systems. Additionally, Chen et al. designed an enhanced Dueling DDQN model incorporating a Convolutional Block Attention Module (CBAM) for optimizing traffic signal control [
4]. The model dynamically adjusts signal timings based on real-time traffic conditions, and experimental results showed reduced vehicle delay and wait times, demonstrating its effectiveness in sequential decision-making tasks. These studies illustrate the advantages of various DQN-based architectures across different application domains and provide valuable references for the design of visual-based reinforcement learning in this research.
2.1.2. Double DQN (DDQN)
Double Deep Q-Network (Double DQN) was proposed by van Hasselt et al. to address the overestimation bias observed in traditional DQN algorithms. The key idea is to decouple action selection and evaluation: the online network is used to select actions, while the target network is responsible for estimating Q-values. This mechanism reduces over-optimistic estimates and improves the stability and performance of the learned policy [
5]. Wang et al. (2023) further applied Double DQN to a hybrid flowshop scheduling problem with multiple job priorities, introducing a co-evolutionary learning framework to enhance resource allocation efficiency and task completion rate [
1,
6]. Their approach combines Q-learning with evolutionary strategies and demonstrates robust performance in high-dimensional scheduling scenarios, highlighting the potential of Double DQN in complex optimization tasks.
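To make the decoupling of action selection and evaluation concrete, the following is a minimal sketch of the Double DQN target computation, assuming PyTorch and pre-built online_net and target_net Q-networks. It illustrates the update target only and is not taken from the cited implementations.

```python
import torch

def double_dqn_target(online_net, target_net, rewards, next_states, dones, gamma=0.99):
    """Double DQN targets: the online network selects the next action,
    while the target network evaluates its value."""
    with torch.no_grad():
        # Action selection with the online network
        next_actions = online_net(next_states).argmax(dim=1, keepdim=True)
        # Action evaluation with the target network
        next_q = target_net(next_states).gather(1, next_actions).squeeze(1)
        # Bootstrapped target; terminal states contribute only the immediate reward
        return rewards + gamma * next_q * (1.0 - dones)
```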
2.1.3. Dueling DQN
Dueling DQN is an improved deep reinforcement learning architecture proposed by Wang et al. in 2016 [
7]. It decomposes the Q-value function into a state value function and an advantage function, enhancing the model’s ability to distinguish critical states. Even when certain actions have limited impact in specific states, this architecture can still effectively assess the importance of the state, accelerating learning and improving policy stability. Zhou et al. (2023) [
8] further applied a Branching Dueling Q-Network (BDQ) to the online scheduling of a microgrid with multiple distributed energy storage systems. They proposed a control framework capable of handling both discrete and continuous action spaces. Experimental results showed that the proposed method outperformed traditional DQN and DDPG in terms of energy dispatch efficiency and operational cost, demonstrating the practical potential of the dueling architecture in high-dimensional control tasks.
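The value/advantage decomposition can be written as a small output head. The sketch below assumes PyTorch, a 512-dimensional feature vector, and five discrete actions; these sizes are illustrative assumptions rather than values from the cited works.

```python
import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    """Dueling output head: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)."""
    def __init__(self, feature_dim=512, num_actions=5):
        super().__init__()
        self.value_stream = nn.Linear(feature_dim, 1)                 # state value V(s)
        self.advantage_stream = nn.Linear(feature_dim, num_actions)   # advantages A(s, a)

    def forward(self, features):
        value = self.value_stream(features)
        advantage = self.advantage_stream(features)
        # Subtracting the mean advantage keeps the decomposition identifiable
        return value + advantage - advantage.mean(dim=1, keepdim=True)
```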
2.1.4. Policy Gradient
Policy Gradient is a class of reinforcement learning methods that directly optimize parameterized policies through gradient ascent. It is particularly effective in handling continuous action spaces and generating stochastic policies for enhanced exploration. Wu et al. (2023) [
9] proposed a method combining faded experience mechanisms with Trust Region Policy Optimization (TRPO) for model-free power allocation in interference channels. Their approach introduced robust policy update constraints that mitigated training oscillations and improved stability. Sun et al. (2024) investigated the linear convergence properties of policy gradient methods in game-theoretic settings and proved that, under entropy regularization, independent policy updates can achieve stable convergence in non-cooperative environments [
10]. These studies demonstrate the solid theoretical foundation and practical potential of Policy Gradient approaches in continuous decision-making problems.
2.1.5. Advantage Actor–Critic (A2C)
A2C is a synchronous actor–critic architecture that improves training stability by using the advantage function. Mandal et al. [
11] proposed an enhanced version, Optimal L-AC-GAA, which reduces gradient variance by optimizing the number of critic updates. Their method combines random search with smoothed estimation techniques and demonstrates superior stability and performance across various reinforcement learning tasks.
2.1.6. Proximal Policy Optimization (PPO)
Proximal Policy Optimization (PPO) is a trust region-based policy optimization algorithm that stabilizes training by limiting the magnitude of policy updates. Lin et al. [
12] proposed Anti-Martingale PPO, which applies anti-martingale theory to dynamically adjust the learning rate and update frequency, thereby improving the stability of policy convergence and sample efficiency. Gu et al. [
13] introduced PPO with Policy Feedback (PPO-PF), which incorporates a policy feedback mechanism to construct a novel actor–critic architecture capable of jointly updating the policy and value function. Their approach demonstrates faster convergence and greater stability across multiple reinforcement learning tasks. Both methods outperform the original PPO in standard benchmarks, highlighting the effectiveness of their improvements.
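The way PPO limits the magnitude of policy updates can be summarized by its clipped surrogate objective. The sketch below assumes PyTorch tensors of per-sample log-probabilities and advantages; it shows the standard clipped loss, not the variants proposed in the cited papers.

```python
import torch

def ppo_clipped_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped surrogate objective of PPO: limits how far the new policy
    can move away from the policy that collected the data."""
    ratio = torch.exp(new_log_probs - old_log_probs)      # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Negative of the surrogate objective, so it can be minimized by gradient descent
    return -torch.min(unclipped, clipped).mean()
```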
2.2. Sim2Real
In robotic reinforcement learning, training directly in real-world environments often faces two major challenges. The first is efficiency: after completing each task, the robot must return to its initial position, making the process time-consuming and operationally complex. The second is safety: since deep reinforcement learning relies heavily on trial and error, unstable policies during training may lead to collisions or hardware damage. In contrast, simulation environments allow for parallel training with multiple agents, significantly improving data efficiency while avoiding physical risks. However, discrepancies still exist between simulation and reality—such as differences in color, lighting, or object geometry—which may affect the model’s performance when deployed in real-world scenarios. As a result, ensuring smooth transfer from simulation to reality has become a key issue for real-world deployment, commonly addressed through simulation-to-reality (Sim2Real) techniques.
To tackle this problem, Samak et al. [
14] proposed a robust control framework for a 1:14 scale autonomous vehicle named Nigel, which features all-wheel drive and steering capabilities. Their approach employs a multi-model, multi-objective control strategy based on polytopic Linear Parameter-Varying (LPV) models to account for real-world disturbances such as friction and wind. The framework demonstrates stable control and accurate tracking in both simulation and physical experiments. On the other hand, Zhang et al. [
15] presented a comprehensive survey of Sim2Real techniques in autonomous driving, covering methods such as domain randomization, digital twins, and parallel intelligence. Their work highlights how enhancing the realism and environmental consistency of simulation can effectively bridge the gap between virtual and real environments. Collectively, these studies emphasize that, for reinforcement learning to be successfully applied in physical systems, algorithm design must be complemented by robust control strategies and high-fidelity simulation environments.
2.2.1. Domain Randomization
Domain randomization has been widely adopted in sim-to-real (Sim2Real) tasks to bridge the gap between simulation and the real world. Peng et al. [
16] proposed a domain generalization approach for semantic segmentation that incorporates Global and Local Texture Randomization (GTR and LTR). By introducing texture styles from paintings, the model learns domain-invariant features rather than relying on fixed textures. A consistency regularization mechanism (CGL) is also introduced to harmonize the two randomization strategies. Their method demonstrated superior generalization performance across multiple datasets compared to existing approaches. On the other hand, Horváth et al. [
17] proposed a domain randomization pipeline for object detection, where synthetic images are randomized in terms of color, lighting, and camera parameters to train a YOLOv4 model. Their approach achieved 86.32% mAP50 in a zero-shot setting and 97.38% in a one-shot setting, highlighting its practicality even with minimal real-world annotations. These studies show that domain randomization effectively enhances cross-domain generalization and offers a robust solution to Sim2Real challenges.
2.2.2. CycleGAN
CycleGAN (Cycle-Consistent Generative Adversarial Network) is a deep learning technique used for image style transfer and is capable of handling unpaired data. Its architecture consists of two generators and two discriminators, trained through adversarial learning and cycle consistency loss. This allows the model to establish mappings between different image styles while preserving the structural information of the original image. Since it does not require paired training data, CycleGAN is particularly suitable for applications where collecting corresponding image pairs is difficult.
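The cycle-consistency constraint described above can be expressed compactly. The sketch below assumes PyTorch and that the two generators G_ab (style A to B) and G_ba (style B to A) are given as nn.Module instances; the adversarial terms and the weight value are omitted or assumed.

```python
import torch
import torch.nn as nn

l1 = nn.L1Loss()

def cycle_consistency_loss(G_ab, G_ba, real_a, real_b, weight=10.0):
    """Cycle-consistency term of CycleGAN: translating an image to the other
    domain and back should reconstruct the original image."""
    reconstructed_a = G_ba(G_ab(real_a))   # A -> B -> A
    reconstructed_b = G_ab(G_ba(real_b))   # B -> A -> B
    return weight * (l1(reconstructed_a, real_a) + l1(reconstructed_b, real_b))
```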
In prior studies, Zhang et al. [
18] enhanced glue image contrast using CycleGAN to improve edge clarity and defect recognition. Wang et al. [
19] applied CycleGAN to GPR images with attention and deformable convolutions for noise suppression and pipeline localization. Kazemi and Musilek [
20] transformed passive microwave sensor responses into active-like signals, improving Q-factor and sensitivity for detecting trace components in water.
This study also integrates CycleGAN as an image style transfer method for Sim2Real adaptation to reduce the visual gap between the simulation and real-world environments. By training a CycleGAN model to transform simulated images into realistic styles, the approach enables the reinforcement learning model to perceive simulation inputs with real-world characteristics. This enhances the Duckiebot’s adaptability and success rate in real-world navigation tasks. The proposed method combines CycleGAN with reinforcement learning algorithms such as Dueling DQN and PPO, and comparative experiments were conducted against models trained without Sim2Real techniques and with domain randomization only. The results demonstrate that CycleGAN is a critical technique for effective simulation-to-reality transfer, significantly improving the success rate and stability of autonomous navigation in physical environments.
2.3. Path Planning
Path planning is one of the core technologies in the navigation systems of self-driving cars and autonomous mobile robots. Navigation aims to guide the robot from its starting position to a target position safely and efficiently, and it mainly involves three components: localization, path planning, and motion control. Path planning selects the most suitable route through the given environment and is therefore a key step in the navigation process.
Depending on how much environmental information is available, path planning can be divided into two categories: static and dynamic. Static planning assumes that the environment does not change; common methods include the Dijkstra and A star algorithms. Dynamic planning reacts in real time to changing environments; for example, the D* algorithm continuously updates environmental information through sensors to plan the best path from the current location to the target.
2.4. Gym Duckietown
Duckietown is an open and scalable experimental platform that reproduces the environment required for autonomous driving technologies through pre-defined road configurations, traffic signals, and signs such as AprilTags [21]. The platform supports the development and testing of various tasks, including basic functions such as lane-following, signal recognition, localization, planning, and navigation, and can be further expanded into multi-robot systems for advanced applications such as communication and collaboration. The core of the system is a small autonomous vehicle named the Duckiebot, which uses a Raspberry Pi as its main computing unit, runs ROS (Robot Operating System) as its software framework, and drives two DC motors through a motor driver board. Environmental perception relies entirely on a fisheye camera mounted on the front of the vehicle, which is the Duckiebot's only sensor.
To improve the efficiency of reinforcement learning training, Duckietown also provides a simulation environment, Gym Duckietown, as a virtual alternative to real-world experiments. Written in Python 3.9 and built on top of the OpenAI Gym framework, the simulator uses OpenGL to render graphics that reproduce the perspective and behavior of a real-world Duckiebot. Gym Duckietown implements a two-wheel differential-drive model that matches the motion characteristics of the real vehicle and provides simulated camera images that reproduce the visual information seen by the real fisheye lens. In addition, the simulator includes a reward mechanism to evaluate the quality of the agent's decision-making: when the Duckiebot drives close to the middle of the road it receives a positive reward, whereas deviating from the lane or colliding yields a negative reward. To determine the vehicle's position relative to the road center, the system pre-computes the road centerline, which is constructed from third-order Bézier curves. Bézier curves are a common curve representation in computer graphics that can be reshaped through control points and are used in Duckietown to accurately model road geometry.
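Since the centerline is built from third-order (cubic) Bézier curves, the following NumPy sketch shows how such a curve is evaluated from four control points. The control-point values are arbitrary examples; this is not the simulator's own code.

```python
import numpy as np

def cubic_bezier(p0, p1, p2, p3, t):
    """Evaluate a third-order (cubic) Bezier curve at parameter t in [0, 1].
    p0..p3 are 2D control points; p0 and p3 are the curve endpoints."""
    p0, p1, p2, p3 = map(np.asarray, (p0, p1, p2, p3))
    return ((1 - t) ** 3) * p0 + 3 * ((1 - t) ** 2) * t * p1 \
           + 3 * (1 - t) * (t ** 2) * p2 + (t ** 3) * p3

# Sample the curve densely to approximate a road centerline segment
ts = np.linspace(0.0, 1.0, 50)
centerline = np.array([cubic_bezier([0, 0], [1, 0], [2, 1], [3, 1], t) for t in ts])
```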
3. Materials and Methods
3.1. System Architecture
The implementation flow of the system consists of the following stages: lane-keeping task training, algorithm evaluation and selection, navigation task training, simulation-to-reality (Sim2Real) transfer strategy development, real-world deployment of the autonomous driving system, and system performance evaluation. The first four stages are conducted within a simulation environment, while the final two stages are executed in the real-world setting, as shown in
Figure 2.
In terms of system construction, the training platform utilizes a desktop computer equipped with an NVIDIA RTX 3080 GPU and an Intel Core i7 processor to perform deep reinforcement learning model training and inference. The autonomous vehicle platform is based on a custom-built Duckiebot as the mobile base, with a Jetson Nano embedded system serving as the main processing unit. The platform is further equipped with an NoIR camera, a Stepper Motor HAT V2 motor expansion board, and HC-1-048 stepper motors to achieve vehicle control and perception functionalities. A fisheye camera combined with a Jetson Nano is used for real-time localization. All devices communicate through a Wi-Fi network based on a socket communication framework, enabling data exchange and remote control. All hardware components, including the NVIDIA RTX 3080 GPU, Intel Core i7 processor, Jetson Nano, NoIR camera, Stepper Motor HAT V2, and HC-1-048 stepper motors, were procured through authorized distributors in Taiwan.
3.2. Lane Keeping Training
In this phase, four algorithms are employed to accomplish the lane-keeping task: the value-based algorithms Double DQN (DDQN) and Dueling DQN, and the policy-based algorithms Advantage Actor–Critic (A2C) and Proximal Policy Optimization (PPO).
In the experimental setup, the Duckiebot (agent) is configured to drive along the outer lane in the simulator. Each episode terminates either when the driving steps exceed a predefined maximum number or when the Duckiebot deviates from the outer lane. The experimental environment is illustrated in
Figure 3.
In the original simulator settings, an episode ends only when the Duckiebot crosses the white line; behaviors such as crossing the single yellow line or driving against traffic are therefore allowed, which does not meet our task requirements.
To address this, we used images from the Duckiebot camera to detect crossings over the single yellow line. Since the yellow line consists of discontinuous segments, we processed the images by first converting them from RGB to HSV color space, which facilitates color extraction. We then isolated the yellow regions, applied a mask to segment the image, extracted the centroids of these regions, and used polynomial regression to fit a continuous curve.
The detection process is illustrated in
Figure 4, allowing for a more accurate determination of whether the Duckiebot crosses the single yellow line.
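A minimal OpenCV sketch of this detection pipeline is shown below, assuming OpenCV 4 and RGB simulator images. The HSV thresholds and the polynomial degree are assumptions that would need tuning; the sketch illustrates the sequence of steps (HSV conversion, yellow mask, segment centroids, polynomial fit) rather than the exact implementation.

```python
import cv2
import numpy as np

def fit_yellow_line(rgb_image, poly_degree=2):
    """Detect the dashed yellow lane segments and fit a continuous curve through them."""
    hsv = cv2.cvtColor(rgb_image, cv2.COLOR_RGB2HSV)
    # Approximate HSV range for yellow; exact thresholds must be tuned to the scene
    mask = cv2.inRange(hsv, (20, 100, 100), (35, 255, 255))
    # One centroid per detected yellow segment
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    centroids = []
    for c in contours:
        m = cv2.moments(c)
        if m["m00"] > 0:
            centroids.append((m["m10"] / m["m00"], m["m01"] / m["m00"]))
    if len(centroids) <= poly_degree:
        return None  # not enough segments visible to fit a curve
    xs, ys = zip(*centroids)
    # Fit x as a polynomial in y (image rows), giving a continuous yellow-line curve
    return np.polyfit(ys, xs, poly_degree)
```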
The images captured by the Duckiebot (agent) are used as inputs to the neural network, which subsequently predicts an action. In this experiment, the Duckiebot is configured to select from five discrete actions: turning left, slight left deviation, moving straight, slight right deviation, and turning right. These actions correspond to integer labels 0 through 4, respectively, as illustrated by the five blue trajectories in
Figure 5.
The environment transitions to the next frame based on the action taken by the Duckiebot (agent), and a corresponding reward is assigned. The reward is computed as

reward = −5 ⋅ |ΔE| + 5 ⋅ dot_dir, (1)

which consists of two components. The first term, |ΔE|, is the distance between the Duckiebot and the reference line (the optimal driving path), defined as the offset between the red line (reference line) and the green dashed line (Duckiebot position) in Figure 6. A greater distance results in a larger penalty and thus a lower reward, which can become negative. The second term, dot_dir, is the dot product of the x- and y-axis components of the reference-line direction (red line) and the Duckiebot's heading direction (blue dashed line). A higher dot product indicates better alignment between the Duckiebot and the reference path and therefore yields a higher reward. The coefficients −5 and 5 balance the influence of deviation and alignment in the overall reward.
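Equation (1) can be transcribed directly as a small function. The sketch below assumes that the signed distance and the two (unit) direction vectors are already available from the simulator; the function itself is only an illustrative restatement of the formula.

```python
import numpy as np

def lane_reward(signed_dist, lane_dir, heading_dir, k_dist=5.0, k_align=5.0):
    """Reward of Equation (1): penalize distance from the reference line and
    reward alignment between the Duckiebot heading and the reference direction."""
    lane_dir = np.asarray(lane_dir, dtype=float)
    heading_dir = np.asarray(heading_dir, dtype=float)
    # Dot product of the x/y direction components measures alignment
    dot_dir = float(np.dot(lane_dir, heading_dir))
    return -k_dist * abs(signed_dist) + k_align * dot_dir
```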
In the experiments, all four algorithms used the same neural network architecture, differing only in the final output layer. The network structure for the policy-based A2C and PPO algorithms is shown in Figure 7. For the value-based DDQN and Dueling DQN algorithms, the output layer consists of only five neurons: the value output shown in Figure 7 is removed, and the action probabilities are replaced with Q-values.
The network architecture consists of three convolutional layers followed by three fully connected layers, making it more lightweight compared to typical deep learning networks. This results in fewer computations and parameters, ensuring that the agent can make rapid decisions and interact with the environment in real time.
For the actor and critic networks, a shared architecture was adopted, where the three convolutional layers and one fully connected layer share the same weights. This design reduces the number of parameters and accelerates the network’s convergence. Moreover, the first two convolutional layers employ larger convolutional kernels and larger strides to extract features and reduce the data volume. This approach replaces traditional pooling layers, enabling the feature maps to be downsampled quickly without a significant loss of important features.
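A possible PyTorch sketch of this architecture is given below. The overall structure (three convolutional layers with larger kernels and strides in place of pooling, a shared trunk, and separate actor and critic heads) follows the description above, but the specific channel counts, kernel sizes, and hidden-layer widths are assumptions, since the exact values are not stated in the text.

```python
import torch
import torch.nn as nn

class SharedActorCritic(nn.Module):
    """Three convolutional layers (large kernels/strides instead of pooling) followed by
    fully connected layers; the actor and critic share the convolutional trunk and one
    fully connected layer. Layer sizes are illustrative assumptions."""
    def __init__(self, num_actions=5):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),   # large kernel/stride downsamples
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(512), nn.ReLU(),                          # shared fully connected layer
        )
        self.actor = nn.Sequential(nn.Linear(512, 128), nn.ReLU(), nn.Linear(128, num_actions))
        self.critic = nn.Sequential(nn.Linear(512, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, image):
        features = self.trunk(image)
        return self.actor(features), self.critic(features)   # action logits and state value
```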
3.2.1. DDQN, Dueling DQN Training Process
The training procedures and network updates for DDQN and Dueling DQN are identical, differing only in network architecture, where Dueling DQN separates the output into a value function and an advantage function. As defined in Table 1, training begins by obtaining an initial state s_t from the simulator, represented as a resized (160 × 120) color image.
The state s_t is input into the neural network to predict action values for the five possible actions. The agent selects the action with the highest value, receives a reward r_t, and transitions to the next state s_{t+1}. The tuple (s_t, a_t, r_t, s_{t+1}) is stored in replay memory. Once sufficient samples are collected, the current Q-network is trained, and its weights are periodically synchronized to the target network.
Training continues until the Duckiebot either leaves the outer lane or reaches the episode step limit. Upon episode termination, the environment is reset for the next episode, as illustrated in
Figure 8.
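The loop in Figure 8 can be compressed into the following sketch, assuming the classic Gym API and PyTorch. The helpers to_tensor (converts and batches an observation) and dqn_loss (computes the loss from sampled transitions, e.g., with Double DQN targets) are hypothetical placeholders, and the buffer size, epsilon, and synchronization interval are assumptions.

```python
import random
from collections import deque
import torch

replay_memory = deque(maxlen=100_000)   # capacity is an assumption
SYNC_EVERY = 1_000                      # steps between target-network synchronizations

def train_episode(env, online_net, target_net, optimizer,
                  epsilon=0.1, batch_size=32, gamma=0.99):
    state, done, step = env.reset(), False, 0
    while not done:
        # Epsilon-greedy action selection from the online network's Q-values
        if random.random() < epsilon:
            action = env.action_space.sample()
        else:
            with torch.no_grad():
                action = online_net(to_tensor(state)).argmax(dim=1).item()  # to_tensor: placeholder
        next_state, reward, done, _ = env.step(action)
        replay_memory.append((state, action, reward, next_state, float(done)))
        state, step = next_state, step + 1

        if len(replay_memory) >= batch_size:
            batch = random.sample(replay_memory, batch_size)
            loss = dqn_loss(online_net, target_net, batch, gamma)   # dqn_loss: placeholder
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        if step % SYNC_EVERY == 0:
            target_net.load_state_dict(online_net.state_dict())     # periodic synchronization
```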
3.2.2. A2C, PPO Training Process
The training processes and network architectures of A2C and PPO are similar, differing mainly in their network update mechanisms. Training starts by acquiring the initial state s_t from the environment. The state s_t is input into the actor network, which outputs a probability distribution over the five discrete actions. The agent samples an action a_t from the distribution and executes it, receiving a corresponding reward r_t.
Each experience (s_t, a_t, r_t) is stored in a temporary buffer. If the update condition is not yet met, the interaction continues. Once the specified number of steps is reached, the collected experiences are used to update the networks, and the buffer is cleared to store new data for the next update cycle. An episode is terminated either when the agent exceeds the maximum number of steps or deviates from the outer lane, as illustrated in Figure 9. The definitions of the variables used in the training process are summarized in Table 2.
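The collect-then-update cycle can be sketched as follows, assuming the classic Gym API and PyTorch. The rollout length is an assumption, to_tensor is a hypothetical helper that batches an observation, and update_fn stands in for the A2C or PPO update step.

```python
import torch

ROLLOUT_LENGTH = 128   # interaction steps collected before each update (assumption)

def collect_and_update(env, actor_critic, update_fn, state):
    """Collect a fixed number of on-policy transitions, then update the networks
    and clear the buffer, following the cycle described above."""
    buffer = []
    for _ in range(ROLLOUT_LENGTH):
        with torch.no_grad():
            logits, value = actor_critic(to_tensor(state))           # to_tensor: placeholder helper
            dist = torch.distributions.Categorical(logits=logits)
            action = dist.sample()
        next_state, reward, done, _ = env.step(action.item())
        buffer.append((state, action.item(), reward, done))
        state = env.reset() if done else next_state
    update_fn(actor_critic, buffer)   # A2C or PPO update on the collected rollout
    return state                      # buffer is discarded; a new one starts next cycle
```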
3.3. Navigation Function Training
The navigation function aims to guide the autonomous vehicle to safely reach its destination, while maintaining lane-keeping as a fundamental prerequisite. To achieve navigation, it is also necessary to measure the vehicle’s position through localization technologies, such as GPS. In the simulator, however, this parameter can be directly accessed.
The experimental setup involves training the Duckiebot on a map containing four T-junctions, as shown in
Figure 10. During training, the Duckiebot follows a path determined by the path planning algorithm, traveling from a randomly selected starting point to a destination within the drivable area. An episode terminates if the Duckiebot deviates significantly from the planned route or exits the drivable area. The red lines in
Figure 10 are used to visually highlight intersection points on the map and do not affect the navigation logic or training process.
To simplify implementation and validate the feasibility of the proposed method, the Duckiebot is restricted to only moving straight, turning left, or turning right, thereby preventing U-turns in the planned path. The map coordinates from the simulator are converted into a directed graph, as illustrated in
Figure 11, where the distance between adjacent nodes is set to one. This structure enables the use of the A star algorithm for path planning.
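A self-contained A* sketch over such a directed graph with unit edge costs is given below. The node representation and the Manhattan-distance heuristic are illustrative assumptions consistent with the grid-like map, not the exact implementation used in this work.

```python
import heapq

def a_star(graph, start, goal, heuristic):
    """A* search on a directed graph given as {node: [(neighbor, cost), ...]}.
    With unit edge costs (as in the converted map), each cost is simply 1."""
    open_set = [(heuristic(start, goal), 0, start, [start])]
    visited = set()
    while open_set:
        _, g, node, path = heapq.heappop(open_set)
        if node == goal:
            return path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, cost in graph.get(node, []):
            if neighbor not in visited:
                f = g + cost + heuristic(neighbor, goal)
                heapq.heappush(open_set, (f, g + cost, neighbor, path + [neighbor]))
    return None  # no path found

# Example heuristic: nodes are (row, col) tiles, Manhattan distance between them
manhattan = lambda a, b: abs(a[0] - b[0]) + abs(a[1] - b[1])
```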
Once the path is planned, it is mapped back onto the simulator environment, with the overall process illustrated in
Figure 12. The planned trajectory is then represented using Bézier curves, as shown in
Figure 13, and the white dashed lines serve as the reference trajectory for navigation. The red lines in
Figure 13 are used to visually highlight intersection points on the map and do not affect the navigation logic or training process.
When training for the navigation task, if only the front-facing image captured by the Duckiebot (agent) is fed into the actor network, the agent would be unable to determine whether to move forward, turn left, or turn right upon encountering a T-junction. As a result, it would fail to follow the path planned by the A star algorithm. To address this issue, both the image input (img_input) and the path planning guidance (path_input) are simultaneously fed into the actor network.
The path_input is constructed by selecting the ten points on the planned path closest to the Duckiebot's current position. Using the arctangent function, the angle between the Duckiebot and each of the ten points is calculated with the standard two-point angle formula, as illustrated in Figure 14 (where a single point is used for demonstration). The calculated angles fall within a range of −45° to 45° and are normalized before being input into the network. This design allows the Duckiebot to learn whether to move forward, turn left, or turn right based on both its visual input and the path guidance.
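A possible NumPy sketch of this computation is shown below. The choice of reference frame (angles measured relative to the Duckiebot heading, in degrees) and the normalization to [−1, 1] are assumptions made for illustration.

```python
import numpy as np

def build_path_input(duckiebot_pos, duckiebot_heading_deg, path_points, num_points=10):
    """Compute normalized angles from the Duckiebot to the ten nearest path points.
    Angles are measured relative to the heading and clipped to [-45, 45] degrees."""
    pos = np.asarray(duckiebot_pos, dtype=float)
    points = np.asarray(path_points, dtype=float)
    # Ten planned-path points closest to the current position
    nearest = points[np.argsort(np.linalg.norm(points - pos, axis=1))[:num_points]]
    angles = []
    for p in nearest:
        dx, dy = p - pos
        # Two-point angle formula via arctangent, expressed relative to the heading
        angle = np.degrees(np.arctan2(dy, dx)) - duckiebot_heading_deg
        angles.append(np.clip(angle, -45.0, 45.0) / 45.0)   # normalize to [-1, 1]
    return np.array(angles, dtype=np.float32)
```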
The available actions for the Duckiebot are the same as those described in the previous section on lane-keeping training, consisting of five discrete options: turning left, slight left deviation, moving straight, slight right deviation, and turning right.
The reward calculation method is the same as described in the previous section, except that the reference line for navigation is replaced by the planned path.
The training for the navigation task differs from that of the lane-keeping task in that it requires both image input and path planning guidance (path_input) as inputs to the actor network, rather than relying solely on images. Therefore, the network architecture is modified accordingly.
The image input branch remains consistent with the lane-keeping network structure, with the only change occurring at the first fully connected layer after the convolutional layers, where the number of neurons is reduced from 512 to 256. The path_input is processed through two fully connected layers with nonlinear activation to extract a 128-dimensional feature vector. This feature vector is then concatenated with the 256-dimensional feature vector from the image input. The purpose of this concatenation is to fuse the semantic information learned separately from the two inputs. After concatenation, the combined features are passed through several fully connected layers to produce the final outputs: the value prediction by the critic and the action probability prediction by the actor, as illustrated in
Figure 15.
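The fusion of the two input branches can be sketched as follows in PyTorch. The 256-dimensional image features and 128-dimensional path features follow the description above, while the convolutional layer settings, hidden widths, and the 10-dimensional path_input are assumptions for illustration.

```python
import torch
import torch.nn as nn

class NavigationActorCritic(nn.Module):
    """Two-branch network: a convolutional branch for img_input (256-d features) and a
    fully connected branch for path_input (128-d features), concatenated before the
    actor and critic heads. Layer details are illustrative assumptions."""
    def __init__(self, num_actions=5, path_dim=10):
        super().__init__()
        self.image_branch = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(256), nn.ReLU(),           # reduced from 512 to 256 neurons
        )
        self.path_branch = nn.Sequential(            # two FC layers with nonlinear activation
            nn.Linear(path_dim, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
        )
        self.actor = nn.Sequential(nn.Linear(256 + 128, 128), nn.ReLU(), nn.Linear(128, num_actions))
        self.critic = nn.Sequential(nn.Linear(256 + 128, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, img_input, path_input):
        fused = torch.cat([self.image_branch(img_input), self.path_branch(path_input)], dim=1)
        return self.actor(fused), self.critic(fused)
```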
In this phase, the Dueling DQN and PPO algorithms, which achieved better performance in the previous stage, are selected for further training. The overall training process remains similar to that of the lane-keeping task, with the primary differences being the use of a different map and a modified state input s_t, which now consists of both the image input (img_input) and the path planning guidance (path_input).
Taking the PPO algorithm as an example, the modified sections of the training process are highlighted with red boxes, as illustrated in
Figure 16.
Since the Duckiebot must follow the path planned by the A star algorithm, the episode termination condition is changed from deviating from the outer lane (as in lane-keeping training) to deviating from the planned path.
3.4. Sim2Real Method Training
Through simulation, the Duckiebot is able to achieve navigation functionality within a designed environment. However, the ultimate goal is to implement and apply the proposed method in a real-world setting. In this study, two approaches are adopted to realize Sim2Real transfer. The first approach employs CycleGAN to reduce the domain gap between simulation and reality, while the second approach applies domain randomization by treating the real environment as one of the variations within the simulated domain. These methods collectively enable the Duckiebot to successfully perform navigation tasks in the real world.
During training, the physical parameters of the simulation environment are randomized within predefined ranges, resulting in different visual scenes for the Duckiebot in each episode. The randomized parameters include sky color, ground color, ground brightness, light source position, fisheye camera angle, camera height above the ground, and camera frame skipping. By combining variations in these physical parameters, diverse simulation environments are generated, as illustrated in
Figure 17.
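A small sketch of per-episode parameter randomization is given below. The parameter names mirror those listed above, but the sampling ranges are purely illustrative assumptions, and applying the sampled values depends on the simulator's configuration interface.

```python
import random

def sample_randomized_params():
    """Draw one randomized configuration per episode; the ranges below are
    illustrative assumptions, not the values used in the experiments."""
    return {
        "sky_color":         [random.uniform(0.3, 1.0) for _ in range(3)],   # RGB in [0, 1]
        "ground_color":      [random.uniform(0.2, 0.8) for _ in range(3)],
        "ground_brightness":  random.uniform(0.6, 1.4),
        "light_position":    [random.uniform(-5, 5), random.uniform(2, 6), random.uniform(-5, 5)],
        "camera_angle_deg":   random.uniform(-5, 5),
        "camera_height_m":    random.uniform(0.08, 0.12),
        "frame_skip":         random.randint(1, 4),
    }

# At the start of each episode, the sampled values are written into the simulator
params = sample_randomized_params()
```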
Images from the simulated environment (Style A) and the real environment (Style B) were collected to train a CycleGAN network, enabling it to generate images closely resembling real-world styles, as shown in
Figure 18.
Using the same setup, Dueling DQN and PPO algorithms are applied for training. Taking PPO as an example, the training process is illustrated in
Figure 19.
First-person view images from the simulator are input into the trained CycleGAN, which outputs images in a realistic style. This allows the Duckiebot to perceive the environment as if navigating in the real world, enabling sim-to-real transfer for navigation training.
3.5. Real-World Environment
In this study, a Duckiebot is used to simulate an autonomous vehicle, as shown in
Figure 20. The real-world experimental environment is illustrated in
Figure 21,
Figure 22 and
Figure 23, where a foam puzzle mat is used to construct the driving area, measuring 220 cm in length and 270 cm in width. At the top of the environment, a fisheye camera and a Jetson Nano are installed to calculate the position of the Duckiebot within the driving area, providing the localization information required for navigation. All experimental devices are connected to the same Wi-Fi router, and data transmission is handled through a socket communication framework.
In this experiment, image-based localization was employed to simulate the GPS positioning system of a real autonomous vehicle. A blue corrugated board was mounted on the Duckiebot to enable fast and easy position detection.
The image localization process is illustrated in
Figure 24. The captured image is first rotated to a horizontal orientation. Due to the use of a fisheye lens, the field of view extends beyond the driving area, producing a convex distortion effect similar to the view of a fish underwater looking up. To correct this, a region of interest (ROI) covering the driving area was extracted, followed by an affine transformation to remove distortion and resize the map to match the simulation dimensions.
Chessboard calibration, although common for fisheye correction, was not used, as it would significantly reduce the field of view, making full coverage of the driving area impossible. After correction, the image was converted from RGB to HSV color space to isolate the blue region representing the Duckiebot. The centroid of the blue region was calculated as the Duckiebot position. Finally, the localization information was transmitted back to the Duckiebot to complete the image-based localization process.
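An OpenCV sketch of this localization pipeline is shown below. The ROI, the three reference points for the affine transform, and the HSV thresholds for blue are assumptions that would be calibrated for the actual setup; the sketch only illustrates the sequence of steps.

```python
import cv2
import numpy as np

def localize_duckiebot(frame, roi, src_pts, dst_size):
    """Estimate the Duckiebot position from the overhead fisheye image: crop the
    driving-area ROI, warp it to the simulation map size, and take the centroid
    of the blue marker. Thresholds and reference points are illustrative."""
    x, y, w, h = roi
    area = frame[y:y + h, x:x + w]
    # Affine transform from three reference points in the image to the map corners
    dst_pts = np.float32([[0, 0], [dst_size[0], 0], [0, dst_size[1]]])
    matrix = cv2.getAffineTransform(np.float32(src_pts), dst_pts)
    warp = cv2.warpAffine(area, matrix, dst_size)
    hsv = cv2.cvtColor(warp, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, (100, 120, 70), (130, 255, 255))   # approximate blue range
    m = cv2.moments(mask)
    if m["m00"] == 0:
        return None                                  # blue board not visible
    return (m["m10"] / m["m00"], m["m01"] / m["m00"])   # centroid = Duckiebot position
```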
In this study, a self-driving car system is constructed using a socket communication framework, as illustrated in
Figure 25. The system consists of a monitoring computer, a fisheye camera that captures the driving area, a Jetson Nano that calculates the Duckiebot's position from the captured images, and the Duckiebot itself, which simulates an autonomous vehicle. The interface displayed on the monitoring computer is shown in
Figure 26. Its primary function is to allow the user to select the starting point and destination, which serve as the input for the path planning algorithm.
When a navigation task begins, the monitoring computer displays the interface shown in
Figure 26. The user selects a starting point and a destination, and the coordinate information is transmitted via the socket communication framework to the localization module (camera and Jetson Nano). The localization module then forwards the coordinate information to the Duckiebot, enabling it to perform path planning and generate the shortest path to the destination.
At this stage, the localization module begins image-based localization, calculating the Duckiebot's current position and transmitting this information back to the Duckiebot. The Duckiebot uses the received localization information to perform navigation: the image input (img_input) and the path planning guidance (path_input) are fed into the agent's neural network to determine the next action.
Simultaneously, the localization information is also sent to the monitoring computer, allowing the simulated visualization window to display the Duckiebot's navigation process.
If the Duckiebot has neither reached the destination nor deviated from the planned path, the process repeats. The task is considered successfully completed if the Duckiebot reaches the destination; otherwise, deviation from the planned path results in a task failure, and the navigation mission is terminated.
Visualization Experiments
In addition, we conducted feature visualization experiments to observe whether the features attended to by the model in the real-world environment are consistent with those attended to in the simulation environment.
Feature visualization is primarily conducted by multiplying the feature maps output from each convolutional layer, fusing them into a single feature map, and overlaying it onto the original input image, as shown in
Figure 27. This allows us to understand which features the convolutional layers focus on within the neural network.
The detailed procedure is illustrated in
Figure 28. First, the feature maps from the last convolutional layer are averaged. To enable element-wise multiplication with the feature maps from the preceding layer, the averaged feature map must be upsampled to match the spatial dimensions of the preceding layer’s feature maps.
Subsequently, element-wise multiplication is performed between the two sets of feature maps. This process is repeated layer by layer until reaching the first convolutional layer. The resulting matrix is then normalized, and a colormap is applied to map the values to the color scale shown in
Figure 29, where more yellow regions indicate areas where the network’s convolutional layers focus more strongly.
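A possible PyTorch sketch of this fusion procedure is shown below, assuming the per-layer feature maps have been captured (e.g., with forward hooks). Averaging each layer's maps over channels before the element-wise multiplication is an interpretation of the description; the exact fusion details may differ from the authors' implementation.

```python
import torch
import torch.nn.functional as F

def fuse_feature_maps(feature_maps):
    """Fuse convolutional feature maps from the last layer back to the first: average the
    deepest map over channels, upsample it to the previous layer's spatial size, multiply
    element-wise, and repeat. `feature_maps` is a list of [1, C, H, W] tensors ordered
    from the first to the last convolutional layer."""
    fused = feature_maps[-1].mean(dim=1, keepdim=True)          # average over channels
    for fmap in reversed(feature_maps[:-1]):
        prev = fmap.mean(dim=1, keepdim=True)
        fused = F.interpolate(fused, size=prev.shape[-2:], mode="bilinear", align_corners=False)
        fused = fused * prev                                     # element-wise multiplication
    fused = fused.squeeze()
    fused = (fused - fused.min()) / (fused.max() - fused.min() + 1e-8)   # normalize to [0, 1]
    return fused   # apply a colormap (e.g., cv2.applyColorMap) and overlay on the input image
```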
4. Results
This section presents the results of lane-keeping training and navigation training, comparing and discussing the effectiveness and feasibility of each Sim2Real method in real-world scenarios.
Unlike deep learning, where most models and algorithms have standard evaluation metrics—such as accuracy, precision, and recall for classification tasks, or Intersection over Union (IoU) and mean Average Precision (mAP) for object detection algorithms—deep reinforcement learning lacks a universal evaluation standard.
In deep reinforcement learning, a common approach is to record the rewards obtained by the agent (Duckiebot) in each episode throughout the training process and plot them as a curve. The performance of the algorithm is then judged based on whether the agent is able to complete a given task or game by analyzing the reward trends over time.
In the experimental results presented in this study, the horizontal axis of each line plot represents the number of steps, while the vertical axis represents the cumulative reward obtained in each episode. A higher cumulative reward in an episode indicates that the agent successfully completed the assigned task.
Due to the relatively slow convergence of the network during training and the varying lengths of navigation paths, the cumulative rewards per episode fluctuate significantly, making it difficult to directly assess the agent’s performance and learning trends. To facilitate observation, we compute the average cumulative reward over every 50 episodes and plot it as a dark red curve. Meanwhile, the pink curve represents the cumulative reward obtained in each individual episode.
4.1. Lane-Keeping Training Results
For the lane-keeping training experiments, both the DDQN and Dueling DQN algorithms were able to keep the Duckiebot driving along the road within one million steps. Training was therefore terminated upon reaching two million steps, as shown in Figure 30 and Figure 31.
Compared to Dueling DQN, the DDQN algorithm demonstrated a faster convergence speed. From the plots, it can be observed that DDQN reached a cumulative reward of over 1200 per episode at approximately 10,000 steps, which corresponds to the Duckiebot successfully completing two laps along the outer lane. This indicates that DDQN was able to learn the lane-keeping functionality with fewer interactions.
Regarding training stability, the results shown in Figure 30 indicate that the DDQN algorithm maintained a cumulative reward of around 1200 during the mid-to-late training stages, suggesting that the Duckiebot could reliably complete two laps in most episodes. However, there were still several instances where only one lap or less was completed.
In contrast, the Dueling DQN algorithm exhibited greater stability during the mid-to-late training stages, with almost every episode successfully completing two laps along the outer lane.
The training results for the policy-based A2C algorithm are shown in
Figure 32. Due to the relatively slow convergence speed of A2C, the training duration was extended. However, even with prolonged training, the agent failed to effectively learn the lane-keeping task. Although the overall trend of the cumulative reward per episode showed a slight increase, the growth rate was extremely slow.
The most critical issue was that, even after achieving a cumulative reward of approximately 1200 in some episodes, the agent was unable to consistently maintain this performance. As observed from the plot, the actual rewards exhibited significant fluctuations, indicating that the agent could successfully complete two laps along the outer lane in one episode but might fail at the first turn and deviate from the lane in the subsequent episode, resulting in a cumulative reward of less than 200.
Similarly, for the policy-based PPO algorithm, the agent was able to achieve a cumulative reward exceeding 1000 after approximately 250,000 interaction steps. Furthermore, after 550,000 steps, the agent consistently maintained a cumulative reward above 1100 per episode, as shown in
Figure 33.
Although the PPO algorithm achieved stable convergence, the cumulative reward per episode remained slightly below 1200. This was primarily due to the agent tending to drift toward the inner edge of the lane when making turns, which resulted in slightly lower cumulative rewards compared to the other three algorithms.
The trained models of the four algorithms were saved and subsequently loaded into the simulation environment to perform the lane-keeping task. Among them, the agents trained with DDQN, Dueling DQN, and PPO algorithms were able to continuously navigate along the outer lane without crossing the boundary. These agents could execute the task indefinitely unless the program was manually interrupted. In contrast, the agent trained with the A2C algorithm exhibited more unstable behavior. During the mid-to-late stages of training, the agent’s performance fluctuated. If the model was saved during a period of good performance, the agent could also execute the task continuously without deviation. Otherwise, the agent would typically deviate from the outer lane within a single lap.
Based on the experimental results, the value-based Dueling DQN algorithm and the policy-based PPO algorithm were selected for the next stage of navigation training. This selection was made because Dueling DQN exhibited greater stability during the mid-to-late stages of training, while PPO also demonstrated favorable performance.
4.2. Navigation Training Results
The training hyperparameters for the Dueling DQN and PPO algorithms were the same as those used in the lane-keeping training. The training results are shown in
Figure 34 and
Figure 35.
Due to its faster convergence, the training of the Dueling DQN algorithm was terminated after approximately 1.5 million interaction steps. In contrast, based on the previous lane-keeping training results, the PPO algorithm required more interaction steps and update iterations; thus, training was stopped when the rewards became more stable after approximately 3 million steps.
Comparing the two algorithms, the Dueling DQN agent achieved an average cumulative reward of nearly 700 per episode after convergence. This is because the agent trained with Dueling DQN was able to navigate closer to the center of the road, both on straight paths and at turns.
In contrast, the PPO agent achieved an average cumulative reward of approximately 550, indicating that the agent often drifted near the edges of the road and had a lower success rate in completing navigation tasks, resulting in lower overall rewards compared to the Dueling DQN agent.
The trained models saved during the training process were subsequently loaded to perform navigation tests, aiming to verify the effectiveness of the algorithms and the experimental methodology. The testing method involved executing 1000 navigation tasks, where the starting point and destination were randomly selected for each trial within the simulator environment. The testing results are summarized in
Table 3. The Dueling DQN algorithm achieved a success rate of 96.4%, while the PPO algorithm achieved a success rate of 66%.
4.3. Sim2Real Training Results
To enable the Duckiebot to operate in real-world environments, rather than being limited to simulation, two Sim2Real methods—domain randomization and style transfer—were employed during training. Since the lane-keeping task is relatively simple, the Duckiebot could successfully navigate along the center of the road in the real environment even without Sim2Real techniques. However, navigation tasks are more complex; therefore, domain randomization and style transfer were specifically applied to the navigation training. The corresponding training results are shown in
Figure 36,
Figure 37,
Figure 38 and
Figure 39.
From the results, it can be observed that the PPO algorithm combined with domain randomization did not achieve satisfactory performance. The reward curve exhibited significant fluctuations and ultimately failed to converge, indicating that this model would likely be unable to complete the navigation task in a real-world environment. Further verification in the real-world environment is required for the other trained models to assess their feasibility.
4.4. Performance Analysis of the Duckiebot in the Real-World Environment
For the navigation task, a total of six models were trained, including models trained without Sim2Real methods and models trained with Sim2Real techniques.
The first stage of testing aimed to verify whether these six models could successfully complete navigation tasks in a real-world environment.
The testing method involved executing navigation tasks across all turnings and T-junction paths on the map, totaling 32 paths (4 left turns, 4 right turns, and 24 paths at T-junctions, with each T-junction comprising 6 possible paths).
A model was considered successful if it could complete more than 20 paths; if fewer than 10 paths were completed, the model was considered to have failed.
The testing results are summarized in
Table 4. Among the Sim2Real approaches, only models trained using CycleGAN were successful. Models trained with domain randomization or without any Sim2Real methods failed. Among the two successful models, the Dueling DQN model completed 27 paths, while the PPO model completed 24 paths. Notably, the failed paths for both models occurred primarily at T-junctions.
Subsequently, additional navigation tasks were conducted in the real-world environment by randomly selecting starting points and destinations. This allowed for a more precise calculation of the success rates for the Dueling DQN and PPO models identified as successful in the previous stage.
The testing results are summarized in
Table 5. After 50 trials, the model trained with the Dueling DQN algorithm successfully completed 31 navigation tasks, while the model trained with the PPO algorithm successfully completed 26 tasks.
In
Table 5, the models trained with domain randomization and those trained without Sim2Real methods exhibited poor performance in real-world navigation tasks. To investigate the reasons for this failure, two additional experiments were conducted.
The first experiment involved testing the lane-keeping functionality of the models in a real-world environment. The results showed that models trained with the Dueling DQN and PPO algorithms were able to navigate stably along the road without requiring Sim2Real techniques, consistently completing more than four laps in each test. This indicates that the navigation task is significantly more complex and challenging than the lane-keeping task, making it more difficult to achieve successful transfer to the real environment.
The second experiment involved the feature visualization of the model trained with the Dueling DQN algorithm, with the results shown in
Figure 40. The leftmost image depicts the feature visualization result of a model that successfully executed the navigation task in the simulation environment. Regardless of the Duckiebot position on the map, the attention regions of the model predominantly covered the yellow and white road lines. A similar pattern was observed in the real-world environment for models trained with the CycleGAN-based Sim2Real method. In contrast, models trained with domain randomization and without Sim2Real methods mainly focused on the white road lines, with very limited attention to the yellow lines.
Based on these observations, it is inferred that domain randomization and the absence of Sim2Real techniques were insufficient to bridge the gap between the simulated and real environments. As a result, the neural networks failed to extract features related to the yellow road lines, ultimately leading to the failure of the navigation tasks.
Compared to the simulation environment, the real-world success rates in
Table 5 are significantly lower than those in
Table 3, despite using the same algorithms. This performance gap can be attributed to several factors: (1) the increased complexity of the real-world driving environment, which includes more varied lighting conditions, non-uniform road textures, and localization errors, and (2) the models’ insufficient generalization ability despite domain adaptation, as confirmed by feature visualization results. These findings highlight the limitations of current Sim2Real techniques and the need for more robust domain transfer strategies.
5. Conclusions
In this paper, we developed an autonomous vehicle system based on deep reinforcement learning methods. The system enables the Duckiebot to perform lane-keeping tasks in both simulated and real environments. For navigation, the A star algorithm is used to calculate the shortest path from the start point to the destination, with both the planned path and camera images fed into a neural network. The simulation results demonstrate the effectiveness and feasibility of this combined approach.
However, due to the physical differences between the simulated and real environments, the Duckiebot trained in simulation faces challenges when deployed in the real world. Even with the application of Sim2Real techniques such as CycleGAN and domain randomization, the system could only complete about half of the navigation tasks in real-world testing. Notably, although the Dueling DQN and PPO algorithms achieved high success rates in simulation (96.4% and 66%, respectively), their performance in the real world dropped significantly to 62% and 52%. This discrepancy reveals the difficulty of real-world deployment, particularly for navigation tasks. The complexity of the physical environment—including tighter turns, visual distractions, and non-uniform road features—adds to the challenge of policy transfer. Feature visualization experiments further indicate that models without Sim2Real adaptation fail to focus on essential features such as yellow lane markings, contributing to the performance gap.
Among all tested algorithms, the Dueling DQN algorithm demonstrated the best overall performance, showing faster convergence and higher training rewards. While these results indicate promising progress, ensuring the safety, robustness, and reliability of autonomous driving systems remains a critical concern. Future work will focus on addressing the challenges posed by complex real-world environments, enhancing the system’s ability to generalize across varied conditions. This includes developing safety-aware policies, incorporating richer sensor modalities, expanding training diversity, and implementing more resilient Sim2Real transfer strategies to bridge the simulation-to-reality gap effectively.