Adaptive Robot Navigation Using Randomized Goal Selection with Twin Delayed Deep Deterministic Policy Gradient
Abstract
1. Introduction
2. Background
2.1. TD3 Architecture and Principles
- Policy Network (Actor Network): The actor network, denoted π_φ, is responsible for selecting actions given the current state. It approximates the policy function and is parameterized by φ. The actor network outputs the action the agent should take in a given state to maximize the expected return.
- Critic Networks (Q_θ1 and Q_θ2): TD3 employs two critic networks, Q_θ1 and Q_θ2, to estimate the Q-values of state–action pairs (s, a). Each network outputs a scalar value representing the expected return for a specific state–action pair. Using two critics reduces the overestimation bias that can arise when function approximation is used in reinforcement learning. By employing clipped double Q-learning, TD3 obtains more accurate value estimates, enhancing the stability of policy updates.
- Critic Target Networks: The critic target networks, Q_θ1′ and Q_θ2′, are delayed copies of the primary critic networks and are updated at a slower rate. They provide stable Q-value targets for the critic updates, which reduces variance and prevents divergence during training. By maintaining more conservative Q-value estimates, the critic target networks further mitigate overestimation bias.
- Actor Target Network: The actor target network, π_φ′, is a delayed copy of the actor network and generates stable action targets for the critic updates. Because it changes only gradually, the target actions used in the critic updates remain consistent, which reduces the variance of the Q-value targets provided by the critic target networks and leads to more reliable critic training. These slow updates keep the learning process steady and prevent rapid, unstable policy changes (a minimal architecture sketch follows this list).
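To make the architecture above concrete, the following is a minimal PyTorch sketch of an actor and a pair of twin critics. The layer sizes (400/300 units) follow common TD3 practice but are illustrative assumptions, not the network dimensions used in this work.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Deterministic policy pi_phi(s): maps a state to a bounded continuous action."""
    def __init__(self, state_dim, action_dim, max_action):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 400), nn.ReLU(),
            nn.Linear(400, 300), nn.ReLU(),
            nn.Linear(300, action_dim), nn.Tanh(),
        )
        self.max_action = max_action

    def forward(self, state):
        # Scale the tanh output to the robot's action bounds.
        return self.max_action * self.net(state)

class Critic(nn.Module):
    """Twin critics Q_theta1 and Q_theta2: each maps (s, a) to a scalar Q-value."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.q1 = nn.Sequential(
            nn.Linear(state_dim + action_dim, 400), nn.ReLU(),
            nn.Linear(400, 300), nn.ReLU(),
            nn.Linear(300, 1),
        )
        self.q2 = nn.Sequential(
            nn.Linear(state_dim + action_dim, 400), nn.ReLU(),
            nn.Linear(400, 300), nn.ReLU(),
            nn.Linear(300, 1),
        )

    def forward(self, state, action):
        sa = torch.cat([state, action], dim=1)
        return self.q1(sa), self.q2(sa)
```

The target networks are simply delayed copies of these two modules, updated with the soft update rule described in Section 2.2.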
2.2. TD3 Networks and Updates
Symbol Definitions
- π_φ: Actor network responsible for selecting actions, parameterized by φ.
- Q_θ1, Q_θ2: Critic networks estimating Q-values for state–action pairs.
- Q_θ1′, Q_θ2′, π_φ′: Target networks providing stable targets for updates.
- J(φ): Expected return, optimized by the actor network.
- r: Immediate reward received after taking an action.
- γ: Discount factor for future rewards.
- ã: Noise-regularized action used in the target Q-value calculation.
- τ: Soft update rate for the target network parameters.
- ε: Regularization noise added to target actions.
- θ1, θ2: Parameters of the primary critic networks.
- φ: Parameters of the actor network.
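These quantities enter the standard TD3 update rules of Fujimoto et al. [17], summarized below for reference; this restates the well-known formulation rather than reproducing equations from this article.

```latex
\begin{align}
  \tilde{a} &= \pi_{\phi'}(s') + \epsilon, \qquad
    \epsilon \sim \operatorname{clip}\bigl(\mathcal{N}(0,\sigma), -c, c\bigr) \\
  y &= r + \gamma \min_{i=1,2} Q_{\theta_i'}(s', \tilde{a}) \\
  \theta_i &\leftarrow \arg\min_{\theta_i} \tfrac{1}{N}\sum \bigl(y - Q_{\theta_i}(s,a)\bigr)^2 \\
  \nabla_{\phi} J(\phi) &= \tfrac{1}{N}\sum \nabla_{a} Q_{\theta_1}(s,a)\big|_{a=\pi_{\phi}(s)} \nabla_{\phi}\pi_{\phi}(s) \\
  \theta_i' &\leftarrow \tau\theta_i + (1-\tau)\theta_i', \qquad
  \phi' \leftarrow \tau\phi + (1-\tau)\phi'
\end{align}
```

The actor and target updates in the last two lines are performed at a delayed, lower frequency than the critic updates, which is the "delayed" element of TD3.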
3. Methodology and Experimental Setup
3.1. Jackal Robot Dynamics and Sensors
- Laser Sensor: The LiDAR sensor provides essential data for navigation by scanning the environment and detecting obstacles. It generates a 720-dimensional laser scan that is fed directly to the DRL algorithm to inform the robot’s decision-making. These LiDAR data allow the robot to avoid collisions and select efficient paths to its goal. In the Gazebo simulation, the LiDAR sensor replicates real-world sensor dynamics, ensuring that the robot can adapt to complex environments during training.
- Velocity Sensor: The velocity sensor monitors the robot’s movement by measuring its linear and angular velocities and comparing them with the commands issued by the DRL node. This feedback confirms that the actual motion matches the intended commands, helps maintain stable movement, and highlights deviations, such as unexpectedly low velocity, that may degrade navigation performance (a sensor-processing sketch follows this list).
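The following rospy sketch shows how the 720-beam laser scan and the measured velocities described above might be gathered into a single state vector for the DRL node. The topic names (/front/scan, /odometry/filtered) follow Clearpath Jackal defaults and, like the 10 m range cap, are assumptions made for illustration rather than the exact configuration used in this work.

```python
#!/usr/bin/env python
import numpy as np
import rospy
from sensor_msgs.msg import LaserScan
from nav_msgs.msg import Odometry

class JackalState:
    """Collects laser and velocity feedback and exposes them as one state vector."""
    def __init__(self, max_range=10.0):
        self.max_range = max_range
        self.scan = np.full(720, max_range, dtype=np.float32)
        self.lin_vel = 0.0
        self.ang_vel = 0.0
        rospy.Subscriber("/front/scan", LaserScan, self.scan_cb)        # assumed topic name
        rospy.Subscriber("/odometry/filtered", Odometry, self.odom_cb)  # assumed topic name

    def scan_cb(self, msg):
        # Replace inf/NaN readings with the maximum usable range and clip the rest.
        ranges = np.asarray(msg.ranges, dtype=np.float32)
        ranges[~np.isfinite(ranges)] = self.max_range
        self.scan = np.clip(ranges, 0.0, self.max_range)

    def odom_cb(self, msg):
        # Measured linear and angular velocities, used as feedback on the issued commands.
        self.lin_vel = msg.twist.twist.linear.x
        self.ang_vel = msg.twist.twist.angular.z

    def state(self):
        # 720 laser ranges plus the measured linear and angular velocities.
        return np.concatenate([self.scan, [self.lin_vel, self.ang_vel]])

if __name__ == "__main__":
    rospy.init_node("jackal_state_monitor")
    js = JackalState()
    rate = rospy.Rate(10)
    while not rospy.is_shutdown():
        rospy.loginfo("state dim: %d", js.state().shape[0])
        rate.sleep()
```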
3.2. Training and Evaluation Scenarios
3.2.1. Static Box Environments
3.2.2. Custom Static Environments with Randomized Start and Goal Points
3.3. Graphical Monitoring of Training Process
4. Experimental Comparison of TD3 Models
4.1. Training and Evaluation Metrics
- Success Rate: The percentage of episodes where the robot successfully reaches the goal without collisions.
- Collision Rate: The percentage of episodes where the robot collides with obstacles.
- Episode Length: The average duration (in terms of steps) taken by the robot to complete a task.
- Average Return: The cumulative reward earned by the robot over an episode, averaged over all episodes, to evaluate the learning efficiency of the algorithm.
- Time-Averaged Number of Steps: The average number of steps the robot takes to complete each episode, providing insight into the efficiency of the navigation strategy.
- Total Distance Traveled: This metric, calculated during the test phase, measures the total distance (in meters) the robot travels to complete a path. It provides an additional layer of analysis by assessing the efficiency of the robot’s movements in terms of path length.
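The metrics listed above can be aggregated from per-episode logs; the sketch below shows one straightforward way to do so. The record fields (reached_goal, collided, steps, return, distance_m) are hypothetical names introduced only for illustration.

```python
import numpy as np

def summarize(episodes):
    """Aggregate evaluation metrics from a list of per-episode records.

    Each record is assumed to be a dict with the keys:
      'reached_goal' (bool), 'collided' (bool), 'steps' (int),
      'return' (float), 'distance_m' (float).
    """
    n = len(episodes)
    return {
        "success_rate_pct": 100.0 * sum(e["reached_goal"] and not e["collided"] for e in episodes) / n,
        "collision_rate_pct": 100.0 * sum(e["collided"] for e in episodes) / n,
        "avg_episode_length": np.mean([e["steps"] for e in episodes]),
        "avg_return": np.mean([e["return"] for e in episodes]),
        "avg_distance_m": np.mean([e["distance_m"] for e in episodes]),
    }

# Example with two dummy episodes:
print(summarize([
    {"reached_goal": True,  "collided": False, "steps": 120, "return": 95.0,  "distance_m": 18.7},
    {"reached_goal": False, "collided": True,  "steps": 300, "return": -40.0, "distance_m": 26.4},
]))
```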
4.2. Baseline Comparison in Static Environments
5. Training Results and Comparison
5.1. Reward Calculation and Success Rate Comparison
- Slack penalty: applied at each step.
- Progress reward: awarded for each meter moved closer to the goal.
- Collision penalty: applied for each collision.
- Success reward: awarded for reaching the goal.
- Failure penalty: applied if the episode ends unsuccessfully (the combination of these terms is sketched below).
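The reward terms above can be combined into a single step reward as in the following sketch. The coefficient values are placeholders only, since the magnitudes are not reproduced in this section; the structure (per-step slack, per-meter progress, and terminal bonuses or penalties) mirrors the list above.

```python
def step_reward(prev_dist, curr_dist, collided, reached_goal, done,
                slack=-0.01, progress=1.0, collision=-10.0,
                success=100.0, failure=-50.0):
    """Compose the reward terms listed above.

    The coefficient values are placeholders for illustration only; the
    actual magnitudes used in the paper are not reproduced here.
    """
    reward = slack                                # slack penalty every step
    reward += progress * (prev_dist - curr_dist)  # progress toward the goal (per meter)
    if collided:
        reward += collision                       # collision penalty
    if reached_goal:
        reward += success                         # terminal success reward
    elif done:
        reward += failure                         # terminal failure penalty
    return reward
```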
5.2. Collision Rate Analysis
5.3. Training Efficiency and Convergence
6. Transfer to Test Environments Using TD3 and Global Path Planning
6.1. Evaluation Results
6.2. Analysis of Results
Path | Distance (m) | Collisions | Goal Reached | Time (s) |
---|---|---|---|---|
Path 1 | 18.72 | 0 | Yes | 9.77 |
Path 2 | 47.16 | 1 | Yes | 31.66 |
Path 3 | 13.81 | 0 | Yes | 8.28 |
Path 4 | 26.41 | 4 | No | 21.79 |
Path 5 | 15.51 | 0 | Yes | 9.26 |
Path 6 | 8.12 | 0 | Yes | 4.19 |
Path 7 | 15.04 | 3 | Yes | 9.87 |
Path 8 | 37.74 | 0 | No | 35.49 |
Path 9 | 15.66 | 1 | Yes | 9.98 |
Path 10 | 30.54 | 0 | Yes | 23.33 |

Path | Distance (m) | Collisions | Goal Reached | Time (s) |
---|---|---|---|---|
Path 1 | 33.53 | 4 | No | 33.79 |
Path 2 | 41.97 | 0 | No | 80.00 |
Path 3 | 10.65 | 4 | No | 14.88 |
Path 4 | 39.40 | 3 | No | 67.59 |
Path 5 | 32.09 | 0 | Yes | 42.20 |
Path 6 | 8.12 | 0 | Yes | 4.24 |
Path 7 | 15.22 | 3 | Yes | 9.55 |
Path 8 | 37.75 | 0 | No | 39.67 |
Path 9 | 44.95 | 0 | No | 56.42 |
Path 10 | 58.16 | 1 | No | 68.19 |
6.3. Custom Navigation Performance Assessment Score Comparison
7. Conclusions and Recommendations
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
Abbreviation | Definition |
---|---|
TD3 | Twin Delayed Deep Deterministic Policy Gradient |
RL | Reinforcement Learning |
DRL | Deep Reinforcement Learning |
PPO | Proximal Policy Optimization |
SAC | Soft Actor-Critic |
DDPG | Deep Deterministic Policy Gradient |
DQN | Deep Q-Network |
ROS | Robot Operating System |
XAI | Explainable Artificial Intelligence |
CP | Collision Probability |
NeurIPS | Conference on Neural Information Processing Systems |
GDAE | Goal-Driven Autonomous Exploration |
Tentabot FC | Fully Connected Neural Network Model in Tentabot Framework |
Tentabot 1DCNN FC | 1D Convolutional Neural Network Model with Fully Connected Layers in Tentabot Framework |
OPAC | Opportunistic Actor-Critic |
References
1. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction, 2nd ed.; MIT Press: Cambridge, MA, USA, 2018; ISBN 978-0-262-03924-6. Available online: https://mitpress.mit.edu/9780262039246/reinforcement-learning/ (accessed on 24 March 2025).
2. Dogru, S.; Marques, L. An improved kinematic model for skid-steered wheeled platforms. Auton. Robot. 2021, 45, 229–243.
3. Chen, Y.; Rastogi, C.; Norris, W.R. A CNN-Based Vision-Proprioception Fusion Method for Robust UGV Terrain Classification. IEEE Robot. Autom. Lett. 2021, 6, 7965–7972.
4. Lapan, M. Deep Reinforcement Learning Hands-On: Apply Modern RL Methods, with Deep Q-Networks, Value Iteration, Policy Gradients, TRPO, AlphaGo Zero and More; Packt Publishing Ltd.: Birmingham, UK, 2018; ISBN 978-1-78883-930-2.
5. Hafner, D.; Lillicrap, T.; Norouzi, M.; Ba, J. Mastering Atari with Discrete World Models. arXiv 2020, arXiv:2010.02193.
6. Roy, S.; Bakshi, S.; Maharaj, T. OPAC: Opportunistic Actor-Critic. arXiv 2020, arXiv:2012.06555.
7. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; Riedmiller, M. Playing Atari with Deep Reinforcement Learning. arXiv 2013, arXiv:1312.5602.
8. Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous Control with Deep Reinforcement Learning. In Proceedings of the International Conference on Learning Representations (ICLR), San Juan, Puerto Rico, 2–4 May 2016.
9. Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. In Proceedings of the International Conference on Machine Learning (ICML), Stockholm, Sweden, 10–15 July 2018; pp. 1861–1870.
10. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal Policy Optimization Algorithms. arXiv 2017, arXiv:1707.06347.
11. Roth, A.M. JackalCrowdEnv. GitHub Repository. 2019. Available online: https://github.com/AMR-/JackalCrowdEnv (accessed on 24 March 2025).
12. Roth, A.M.; Liang, J.; Manocha, D. XAI-N: Sensor-Based Robot Navigation Using Expert Policies and Decision Trees. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Prague, Czech Republic, 27 September–1 October 2021; pp. 2053–2060.
13. Akmandor, N.Ü.; Li, H.; Lvov, G.; Dusel, E.; Padir, T. Deep Reinforcement Learning Based Robot Navigation in Dynamic Environments Using Occupancy Values of Motion Primitives. In Proceedings of the 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Kyoto, Japan, 23–27 October 2022; pp. 11687–11694.
14. Akmandor, N.U.; Dusel, E. Tentabot: Deep Reinforcement Learning-Based Navigation. GitHub Repository. 2022. Available online: https://github.com/RIVeR-Lab/tentabot/tree/master (accessed on 24 March 2025).
15. Ali, R. Robot exploration and navigation in unseen environments using deep reinforcement learning. World Acad. Sci. Eng. Technol. Int. J. Comput. Syst. Eng. 2024, 18, 619–625.
16. Xu, Z.; Liu, B.; Xiao, X.; Nair, A.; Stone, P. Benchmarking Reinforcement Learning Techniques for Autonomous Navigation. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; pp. 9224–9230.
17. Fujimoto, S.; van Hoof, H.; Meger, D. Addressing Function Approximation Error in Actor-Critic Methods. In Proceedings of the 35th International Conference on Machine Learning (ICML), Stockholm, Sweden, 10–15 July 2018; Volume 80, pp. 1582–1591.
18. Cimurs, R.; Suh, I.H.; Lee, J.H. Goal-Driven Autonomous Exploration Through Deep Reinforcement Learning. IEEE Robot. Autom. Lett. 2022, 7, 730–737.
19. Cimurs, R. DRL-Robot-Navigation. GitHub Repository. 2024. Available online: https://github.com/reiniscimurs/DRL-robot-navigation (accessed on 24 March 2025).
20. Anas, H.; Hong, O.W.; Malik, O.A. Deep Reinforcement Learning-Based Mapless Crowd Navigation with Perceived Risk of the Moving Crowd for Mobile Robots. arXiv 2023, arXiv:2304.03593.
21. Zerosansan. TD3, DDPG, SAC, DQN, Q-Learning, SARSA Mobile Robot Navigation. GitHub Repository. 2024. Available online: https://github.com/zerosansan/td3_ddpg_sac_dqn_qlearning_sarsa_mobile_robot_navigation (accessed on 24 March 2025).
22. Wong, C.-C.; Chien, S.-Y.; Feng, H.-M.; Aoyama, H. Motion Planning for Dual-Arm Robot Based on Soft Actor-Critic. IEEE Access 2021, 9, 26871–26885.
23. Sylabs. Installing SingularityCE. 2024. Available online: https://docs.sylabs.io/guides/latest/admin-guide/installation.html#installation-on-linux (accessed on 24 March 2025).
24. Daffan, F. ros_jackal. GitHub Repository. 2021. Available online: https://github.com/Daffan/ros_jackal (accessed on 24 March 2025).
25. Ali, R. Extended-ROS-Jackal-Environment. GitHub Repository. 2024. Available online: https://github.com/Romisaa-Ali/Extended-ROS-Jackal-Environment (accessed on 24 March 2025).
26. Xu, Z.; Liu, B.; Xiao, X.; Nair, A.; Stone, P. Benchmarking Reinforcement Learning Techniques for Autonomous Navigation. Available online: https://cs.gmu.edu/~xiao/Research/RLNavBenchmark/ (accessed on 24 March 2025).
27. Daffan, F. ROS Jackal: Competition Package. GitHub Repository. 2023. Available online: https://github.com/Daffan/ros_jackal/tree/competition (accessed on 24 March 2025).
28. Clearpath Robotics. Simulating Jackal in Gazebo. 2024. Available online: https://docs.clearpathrobotics.com/docs/ros1noetic/robots/outdoor_robots/jackal/tutorials_jackal/#simulating-jackal (accessed on 24 March 2025).
29. Open Source Robotics Foundation. move_base. ROS Wiki. n.d. Available online: https://wiki.ros.org/move_base (accessed on 24 March 2025).
30. Zenodo. Robot Navigation Using TD3 with MoveBase Integration in ENV2. February 2025. Available online: https://zenodo.org/records/14881795 (accessed on 24 March 2025).

Author | Method Used | Obstacle Type and Difficulty | Success Rate (%) | Time (s) | Distance (m) |
---|---|---|---|---|---|
Cimurs et al. [18] | TD3 (With Global Strategy) | A static environment with smooth walls and multiple local optima | 100 | 88.03 | 41.42 |
Akmandor et al. [13] | PPO (Without Global Strategy) | Static environments with distinct structured obstacles | 55 | - | - |
Proposed Method | TD3 (With Global Strategy) | Maze-like static environment with long walls | 80.0 | 16.36 | 22.87 |

Path | Score (%) | Description |
---|---|---|
Path 1 | 100.00 | No collisions, within max allowed distance |
Path 2 | 57.60 | 1 collision, exceeded optimal distance with penalty |
Path 3 | 100.00 | No collisions, optimal path followed |
Path 4 | 0.00 | 4 collisions, exceeded max allowed distance |
Path 5 | 100.00 | No collisions, within optimal range |
Path 6 | 100.00 | No collisions, optimal path |
Path 7 | 67.00 | 3 collisions, distance exceeded slightly |
Path 8 | 0.00 | Exceeded max allowed distance |
Path 9 | 70.00 | 1 collision, slight penalty |
Path 10 | 98.33 | No collisions, distance exceeded slightly |

Path | Score (%) | Description |
---|---|---|
Path 1 | 0.00 | 4 collisions, exceeded max allowed distance |
Path 2 | 0.00 | Timeout, exceeded max distance |
Path 3 | 0.00 | 4 collisions, failed navigation |
Path 4 | 0.00 | Exceeded max allowed distance with 3 collisions |
Path 5 | 85.77 | No collisions, slight distance and time penalties |
Path 6 | 100.00 | No collisions, optimal path followed |
Path 7 | 10.00 | 3 collisions, slight penalty |
Path 8 | 0.00 | Exceeded max allowed distance |
Path 9 | 0.00 | Exceeded max allowed distance, timeout |
Path 10 | 0.00 | Exceeded max allowed distance with 1 collision |