An A*-Distance-Guided Exploration Strategy for Multi-AGV Path Planning
Abstract
1. Introduction
- (1) We propose an A*-distance-based improvement strategy from two complementary perspectives: reward shaping and action selection. On the reward side, A*-distance replaces Manhattan distance in computing reward and penalty terms, so that the reward signal accurately reflects the true traversal cost under obstacle constraints. On the action-selection side, the A*-distance guiding function is embedded in the ε-greedy exploration mechanism, guiding agents to prioritize directionally rational actions, reducing wasteful exploration, and improving training quality.
- (2) We conduct comparative experiments in four environments (simple obstacles, complex obstacles, large-scale, and congested), using final reward, global average reward, and episodes to 90% success rate as evaluation metrics. The results show that the proposed method matches or outperforms standard QMIX on every evaluation metric in every environment, with the advantage growing as environment scale and obstacle density increase. Compared with other MARL baselines such as VDN and MAPPO, the proposed method is also clearly superior, particularly in the large-scale and congested environments, where the competing methods largely fail. Notably, the proposed method is the only learning-based method that reliably solves the two hardest tasks, achieving path lengths that closely match or, in certain cases, are even shorter than those of dedicated search-based solvers such as CBS and LaCAM.
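
The reward-side idea in contribution (1) can be illustrated with a minimal sketch. It assumes a 4-connected, unit-cost grid and a simple difference-of-distances shaping term; the function names (`distance_field`, `shaping_reward`), the toy grid, and the `scale` parameter are illustrative assumptions, not the paper's reward definition. For brevity the sketch swaps in a breadth-first search from the goal, which on such a grid yields the same path lengths as A* and can be cached once per goal.

```python
from collections import deque

def distance_field(grid, goal):
    """Obstacle-aware distance from every reachable free cell to `goal`.

    Uses BFS from the goal; on a unit-cost 4-connected grid this equals
    the A* path length (assumption: 0 = free cell, 1 = obstacle)."""
    rows, cols = len(grid), len(grid[0])
    dist = {goal: 0}
    queue = deque([goal])
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] == 0 and (nr, nc) not in dist:
                dist[(nr, nc)] = dist[(r, c)] + 1
                queue.append((nr, nc))
    return dist

def manhattan(a, b):
    """Obstacle-blind distance between two cells."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def shaping_reward(d_before, d_after, scale=1.0):
    """Hypothetical shaping term: positive if the move shortens the
    remaining distance, negative if it lengthens it."""
    return scale * (d_before - d_after)

# Toy map: the goal is behind a wall, so the agent must first move away from it.
grid = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 1, 0],
]
goal, pos, nxt = (2, 4), (2, 2), (2, 1)  # the only useful move is "left"
field = distance_field(grid, goal)

r_manhattan = shaping_reward(manhattan(pos, goal), manhattan(nxt, goal))
r_path = shaping_reward(field[pos], field[nxt])
print(r_manhattan, r_path)
```

On this map the Manhattan-based term penalizes the only move that makes progress, while the path-length-based term rewards it, which is the kind of mismatch the obstacle-aware reward is meant to remove.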
2. Environment Modeling and Problem Definition
2.1. Environment Modeling
2.2. Dec-POMDP Formulation for Multi-AGV Path Planning
2.3. A*-QMIX Framework
3. A*-Distance-Guided Exploration Strategy
3.1. Agent Design
3.1.1. Agent
3.1.2. Observation Space
3.1.3. Action Space
3.1.4. Reward Function
3.1.5. Conflict Resolution Strategy
3.2. A*-Distance-Based Reward Function
3.3. A*-Distance-Guided Action Selection Strategy
4. Experiments and Evaluations
4.1. Ablation Experiments Across Different Environments
4.1.1. Simple Obstacle Environment
4.1.2. Complex Obstacle Environment
4.1.3. Large-Scale Environment
4.1.4. Congested Environment
4.1.5. MAPF Performance in Ablation Experiments
4.2. Comparison with Existing Methods
5. Conclusions
- (1) We incorporated A*-distance into the ε-greedy exploration strategy within the QMIX framework to design the A*-Distance-Guided Action Selection Strategy for multi-AGV path planning. By rewarding actions that bring agents closer to their goals and penalizing those that move them farther away, the strategy effectively reduces unnecessary exploration and improves learning efficiency. Replacing Manhattan distance with A*-distance in reward computation further provides a more accurate, obstacle-aware feedback signal.
- (2) Simulation results show the following. In the simple obstacle environment, the proposed method achieves a global average reward of 61.3 vs. 55.8 for standard QMIX (+9.9%) and reaches the 90% success rate approximately 318 episodes earlier; both algorithms converge to a comparable final reward of approximately 93.4 with comparable path quality. In the complex obstacle environment, the proposed method achieves a global average reward of 50.5 vs. 37.7 for standard QMIX (+34.0%), reaches the 90% success rate approximately 203 episodes earlier, and achieves a notably higher converged final reward (76.3 vs. 68.3). In the large-scale environment, the proposed method achieves a global average reward of 107.0 vs. 57.7 for standard QMIX (+85.5%) and reaches a 90% success rate within 2000 episodes, whereas standard QMIX does not. In the congested environment, the proposed method raises the global average reward from −33.9 for standard QMIX to 28.2 and reaches a 90% success rate within 2000 episodes, whereas standard QMIX does not. Ablation comparisons indicate that the A*-distance-guided action selection is the primary source of these improvements and is decisive for convergence in the large-scale and congested environments, whereas the A*-distance reward plays a supporting role by providing a more accurate reward signal and raising the converged final reward. These results validate the effectiveness of the A*-distance-guided improvement strategy in enhancing exploration efficiency and policy quality for multi-AGV path planning, with improvements becoming more pronounced as environment scale and obstacle density increase. Compared with other algorithms, A*-QMIX is the only learning-based method that reliably solves the large-scale and congested tasks (100% and 80% test success rates, respectively), whereas QMIX, VDN, and MAPPO largely fail (≤20%). Its path lengths (19.0 and 14.0) remain close to those of the search-based solvers CBS (16.0 and 13.0) and LaCAM (18.0 and 24.0).
- (3) By combining A*-distance guidance with multi-agent reinforcement learning, this work offers a solution for multi-AGV cooperative path planning in unmanned warehouses that balances convergence speed and planning quality. Future work may explore dynamic task assignment, even larger AGV fleets, and the impact of uncertainty in real-world warehouse settings, with the aim of extending the practical applicability of the proposed approach.
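
The action-selection idea summarized in conclusion (1) can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's guiding function: the action set, the helper names (`astar_distance`, `step`, `guided_epsilon_greedy`), and the rule "explore only among actions that shorten the A* distance" are hypothetical, and the guiding coefficient listed among the hyperparameters is not modeled here.

```python
import heapq
import random

# Hypothetical action set for a 4-connected grid, plus waiting in place.
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1),
           "right": (0, 1), "stay": (0, 0)}

def astar_distance(grid, start, goal):
    """A* path length on a 0/1 grid (1 = obstacle); None if unreachable."""
    def h(cell):  # Manhattan heuristic, admissible on a unit-cost grid
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap = [(h(start), 0, start)]
    best_g = {start: 0}
    while open_heap:
        _, g, cell = heapq.heappop(open_heap)
        if cell == goal:
            return g
        if g > best_g[cell]:
            continue  # stale heap entry
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nxt = (cell[0] + dr, cell[1] + dc)
            if not (0 <= nxt[0] < len(grid) and 0 <= nxt[1] < len(grid[0])):
                continue
            if grid[nxt[0]][nxt[1]] == 1:
                continue
            if g + 1 < best_g.get(nxt, float("inf")):
                best_g[nxt] = g + 1
                heapq.heappush(open_heap, (g + 1 + h(nxt), g + 1, nxt))
    return None

def step(grid, pos, action):
    """Cell reached by `action`; blocked or out-of-bounds moves leave pos unchanged."""
    dr, dc = ACTIONS[action]
    nxt = (pos[0] + dr, pos[1] + dc)
    inside = 0 <= nxt[0] < len(grid) and 0 <= nxt[1] < len(grid[0])
    return nxt if inside and grid[nxt[0]][nxt[1]] == 0 else pos

def guided_epsilon_greedy(q_values, grid, pos, goal, epsilon, rng=random):
    """With probability 1 - epsilon exploit the Q-values; otherwise explore,
    preferring actions that reduce the A* distance to the goal (assumed rule)."""
    if rng.random() >= epsilon:
        return max(q_values, key=q_values.get)
    d_now = astar_distance(grid, pos, goal)
    closer = []
    for action in ACTIONS:
        d_next = astar_distance(grid, step(grid, pos, action), goal)
        if d_now is not None and d_next is not None and d_next < d_now:
            closer.append(action)
    # Fall back to uniform exploration when no action makes progress.
    return rng.choice(closer or list(ACTIONS))

grid = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 1, 0],
]
flat_q = {action: 0.0 for action in ACTIONS}
print(guided_epsilon_greedy(flat_q, grid, (2, 2), (2, 4), epsilon=1.0))
```

With `epsilon=1.0` the agent in this toy map always explores "left", the only move that shortens the obstacle-aware path, instead of spending exploratory steps on blocked or neutral moves. Each exploratory decision costs several A* queries in this sketch, so an implementation would likely cache distances per goal.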
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest

| Parameter | Value |
|---|---|
| Discount factor | 0.95 |
| Initial ε value | 1 |
| ε decay rate | 0.5 × 10⁻³ |
| Mini-batch size | 64 |
| Learning rate | 1 × 10⁻⁴ / 1 × 10⁻⁵ |
| Target network update frequency | 100 episodes |
| Total training episodes | 2000 |
| Guiding coefficient | 0.1 |
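
The table gives an initial exploration rate and a decay rate but not the functional form of the schedule. The sketch below evaluates two common readings, a linear and a multiplicative decay, purely as assumptions; the function names and the `floor` parameter are hypothetical.

```python
EPSILON_START = 1.0    # initial exploration rate from the table
DECAY_RATE = 0.5e-3    # decay rate from the table
TOTAL_EPISODES = 2000  # total training episodes from the table

def epsilon_linear(episode, floor=0.0):
    """Assumed linear schedule: subtract DECAY_RATE per episode, clipped at `floor`."""
    return max(floor, EPSILON_START - DECAY_RATE * episode)

def epsilon_exponential(episode):
    """Assumed multiplicative schedule: shrink by (1 - DECAY_RATE) each episode."""
    return EPSILON_START * (1.0 - DECAY_RATE) ** episode

for episode in (0, 500, 1000, 2000):
    print(episode,
          round(epsilon_linear(episode), 3),
          round(epsilon_exponential(episode), 3))
```

Under the linear reading, exploration reaches zero at episode 2000, exactly the end of training; under the multiplicative reading, roughly 0.37 of the initial exploration rate remains at that point. The table alone does not say which applies.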

| Environment | Method | Success Rate | Arrival Rate | Collision Rate | Deadlock Rate | Path Length | Training Time (s) | A* Calls (Training) |
|---|---|---|---|---|---|---|---|---|
| Simple (20 × 20, 5 AGVs) | QMIX | 100.0% | 100.0% | 0.0% | 0.0% | 25.2 ± 6.6 | 6017 | 0 |
| Simple (20 × 20, 5 AGVs) | A*-Distance Reward only | 100.0% | 100.0% | 0.0% | 0.0% | 22.2 ± 0.4 | 6026 | 1,449,654 |
| Simple (20 × 20, 5 AGVs) | A*-Guided Selection only | 100.0% | 100.0% | 0.0% | 0.0% | 22.0 ± 0.0 | 5464 | 3,231,204 |
| Simple (20 × 20, 5 AGVs) | A*-QMIX (Full) | 100.0% | 100.0% | 0.0% | 0.0% | 22.0 ± 0.0 | 5589 | 4,529,552 |
| Complex (20 × 20, 5 AGVs) | QMIX | 100.0% | 100.0% | 0.0% | 0.0% | 20.4 ± 0.9 | 5476 | 0 |
| Complex (20 × 20, 5 AGVs) | A*-Distance Reward only | 100.0% | 100.0% | 0.0% | 0.0% | 20.2 ± 0.4 | 5583 | 1,322,456 |
| Complex (20 × 20, 5 AGVs) | A*-Guided Selection only | 100.0% | 100.0% | 0.0% | 0.0% | 20.0 ± 0.0 | 5106 | 2,718,567 |
| Complex (20 × 20, 5 AGVs) | A*-QMIX (Full) | 100.0% | 100.0% | 0.0% | 0.0% | 20.0 ± 0.0 | 5130 | 3,892,883 |
| Large-scale (30 × 30, 16 AGVs) | QMIX | 20.0% | 90.0% | 5.0% | 10.0% | 30.0 | 17,129 | 0 |
| Large-scale (30 × 30, 16 AGVs) | A*-Distance Reward only | 0.0% | 80.0% | 2.5% | 20.0% | N/A | 18,529 | 5,828,229 |
| Large-scale (30 × 30, 16 AGVs) | A*-Guided Selection only | 100.0% | 100.0% | 6.2% | 0.0% | 17.6 ± 2.6 | 10,823 | 7,644,380 |
| Large-scale (30 × 30, 16 AGVs) | A*-QMIX (Full) | 100.0% | 100.0% | 5.0% | 0.0% | 19.0 ± 2.2 | 11,317 | 11,269,181 |
| Congested (31 × 31, 16 AGVs) | QMIX | 0.0% | 75.0% | 0.0% | 25.0% | N/A | 15,465 | 0 |
| Congested (31 × 31, 16 AGVs) | A*-Distance Reward only | 0.0% | 90.0% | 0.0% | 10.0% | N/A | 15,233 | 5,285,939 |
| Congested (31 × 31, 16 AGVs) | A*-Guided Selection only | 100.0% | 100.0% | 0.0% | 0.0% | 14.0 ± 0.0 | 9104 | 5,104,937 |
| Congested (31 × 31, 16 AGVs) | A*-QMIX (Full) | 80.0% | 97.5% | 0.0% | 2.5% | 14.0 ± 0.0 | 9462 | 8,359,040 |

| Method | Simple (20 × 20, 5 AGVs): Success Rate | Simple (20 × 20, 5 AGVs): Path Length | Complex (20 × 20, 5 AGVs): Success Rate | Complex (20 × 20, 5 AGVs): Path Length | Large-Scale (30 × 30, 16 AGVs): Success Rate | Large-Scale (30 × 30, 16 AGVs): Path Length | Congested (31 × 31, 16 AGVs): Success Rate | Congested (31 × 31, 16 AGVs): Path Length |
|---|---|---|---|---|---|---|---|---|
| A*-QMIX | 100.0% | 22.0 ± 0.0 | 100.0% | 20.0 ± 0.0 | 100.0% | 19.0 ± 2.2 | 80.0% | 14.0 ± 0.0 |
| QMIX | 100.0% | 25.2 ± 6.6 | 100.0% | 20.4 ± 0.9 | 20.0% | 30.0 | 0.0% | N/A |
| VDN | 100.0% | 22.4 ± 0.5 | 100.0% | 20.8 ± 1.8 | 0.0% | N/A | 0.0% | N/A |
| MAPPO | 100.0% | 24.6 ± 5.3 | 100.0% | 20.2 ± 0.4 | 0.0% | N/A | 0.0% | N/A |
| CBS | 100.0% | 22.0 ± 0.0 | 100.0% | 20.0 ± 0.0 | 100.0% | 16.0 ± 0.0 | 100.0% | 13.0 ± 0.0 |
| LaCAM | 100.0% | 22.0 ± 0.0 | 100.0% | 20.0 ± 0.0 | 100.0% | 18.0 ± 0.0 | 100.0% | 24.0 ± 0.0 |
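
To make the path-length comparison with the search-based solvers concrete, the short calculation below derives relative gaps from the mean path lengths in the table above for the two hardest environments. The dictionary layout and the `relative_gap` helper are illustrative; only the numbers come from the table.

```python
# Mean path lengths copied from the comparison table (two hardest environments).
PATH_LENGTH = {
    "A*-QMIX": {"large-scale": 19.0, "congested": 14.0},
    "CBS": {"large-scale": 16.0, "congested": 13.0},
    "LaCAM": {"large-scale": 18.0, "congested": 24.0},
}

def relative_gap(method, baseline, env):
    """Percent by which `method`'s path is longer (+) or shorter (-) than `baseline`'s."""
    ours = PATH_LENGTH[method][env]
    ref = PATH_LENGTH[baseline][env]
    return 100.0 * (ours - ref) / ref

for baseline in ("CBS", "LaCAM"):
    for env in ("large-scale", "congested"):
        gap = relative_gap("A*-QMIX", baseline, env)
        print(f"A*-QMIX vs {baseline} ({env}): {gap:+.1f}%")
```

By this measure A*-QMIX paths are roughly 19% and 8% longer than those of CBS, about 6% longer than LaCAM's in the large-scale environment, and about 42% shorter than LaCAM's in the congested one. The congested figure should be read alongside the success rates, since A*-QMIX solves 80% of test runs there while both solvers solve 100%.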
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Zhou, Y.; Feng, Y.; Mao, P.; Wang, P. An A*-Distance-Guided Exploration Strategy for Multi-AGV Path Planning. Automation 2026, 7, 100. https://doi.org/10.3390/automation7040100
