Online Trajectory Planning Method for Midcourse Guidance Phase Based on Deep Reinforcement Learning
Abstract
1. Introduction
- (1) The Markov decision process (MDP) is designed according to the characteristics of the midcourse guidance trajectory planning model. The reward function is built on negative feedback rewards, in contrast to the traditional positive rewards, which makes the agent's learning more consistent with the interceptor guidance mechanism (a minimal illustrative sketch of this idea follows this list);
- (2) A training strategy for the trajectory planning task is set up based on the idea of curriculum learning (CL) to deal with the poor convergence of training the DDPG algorithm directly. Convergence is significantly improved by progressing the training task from easy to difficult, as verified by simulation experiments;
- (3) The trajectory planning of midcourse guidance is investigated using the DDPG algorithm. The optimality of the trajectory it solves is comparable to that of traditional optimization algorithms, while the method is more efficient and better suited to online trajectory planning, as verified by simulation experiments.
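To make the negative-feedback idea in contribution (1) concrete, the following is a minimal Python sketch of a penalty-based reward of the kind described above; the arguments, weights, and penalty value are illustrative assumptions and not the reward function actually defined in Section 3.3.

```python
import numpy as np

def negative_feedback_reward(remaining_distance, velocity_lead_angle,
                             constraint_violated,
                             w_dist=1e-3, w_angle=1.0, violation_penalty=100.0):
    """Illustrative negative-feedback reward: the agent is penalized in proportion
    to how far it still is from the desired terminal condition, so the best it can
    do is drive the penalty toward zero. All weights are assumed values."""
    reward = -(w_dist * remaining_distance            # penalize the remaining range to the target point
               + w_angle * abs(velocity_lead_angle))  # penalize misalignment of the velocity lead angle
    if constraint_violated:                           # e.g., heat flux, dynamic pressure, or overload limit exceeded
        reward -= violation_penalty
    return reward

# Example: far from the target, small lead-angle error, no constraint violation
print(negative_feedback_reward(remaining_distance=2.5e5,
                               velocity_lead_angle=np.deg2rad(3.0),
                               constraint_violated=False))
```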
2. Online Trajectory Planning Problem Formulation
2.1. Model Build
2.2. Problem Formulation
3. MDP Design
3.1. State Set
3.2. Action Set
3.3. Reward Function
3.4. Discount Coefficient
4. DDPG Algorithm Model
5. Online Trajectory Planning Method for Midcourse Guidance
5.1. Training Strategy Based on CL
5.2. Online Trajectory Planning Method
- (1) Transform the interceptor midcourse guidance trajectory planning model into the corresponding MDP of the DDPG algorithm;
- (2) Based on the DDPG algorithm model (detailed in Section 4), set up the offline training process. The pseudocode is detailed in Algorithm 1, and an illustrative implementation sketch follows it.
Algorithm 1. Offline DDPG algorithm training.
Initialize the critic network and the actor network with random weights
Initialize the target networks with the same weights
Initialize the replay memory and its capacity
for episode = 1, M do
  Initialize the random process (as exploration noise)
  Set the initial state of the interceptor
  Randomly select the target point position
  for t = 1, T do
    Select the action of the interceptor according to the current policy and exploration noise
    Execute At, calculate the current reward Rt, and integrate the next state St+1
    Store the state transition sequence (St, At, Rt, St+1) in the memory
    Randomly sample a small batch of N training sequences (St, At, Rt, St+1) from the memory
    Update the networks according to the DDPG network model (Equations (16)–(22))
  end for
end for
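For reference, the following is a minimal Python (PyTorch) sketch of how the training loop in Algorithm 1 might be organized, including the curriculum handover used in steps (3) and (4) below, where the weights trained on the relaxed, constraint-free task initialize the constrained task. The ToyInterceptorEnv environment, the network sizes, the hyperparameters, and the saved file name are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of the Algorithm 1 training loop in PyTorch. ToyInterceptorEnv is a
# stand-in with the interface an interceptor dynamics environment would need; it is
# NOT the paper's model. Network sizes and hyperparameters are assumptions.
import copy
import random
import numpy as np
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, ACTION_BOUND = 6, 2, 1.0   # e.g., position/velocity states; normalized [alpha, sigma]

class ToyInterceptorEnv:
    """Placeholder environment: replace step() with the interceptor dynamics model."""
    def reset(self):
        # Stands in for setting the initial interceptor state and a random target point.
        self.state = np.random.uniform(-1.0, 1.0, STATE_DIM).astype(np.float32)
        return self.state

    def step(self, action):
        self.state = np.clip(self.state + 0.05 * np.resize(action, STATE_DIM), -1.0, 1.0).astype(np.float32)
        reward = -float(np.linalg.norm(self.state))   # negative-feedback style reward
        return self.state, reward

def mlp(in_dim, out_dim, out_act=None):
    layers = [nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, out_dim)]
    if out_act is not None:
        layers.append(out_act)
    return nn.Sequential(*layers)

actor = mlp(STATE_DIM, ACTION_DIM, nn.Tanh())                      # actor network (bounded actions)
critic = mlp(STATE_DIM + ACTION_DIM, 1)                            # critic network Q(s, a)
actor_t, critic_t = copy.deepcopy(actor), copy.deepcopy(critic)    # target networks
opt_a = torch.optim.Adam(actor.parameters(), lr=1e-4)
opt_c = torch.optim.Adam(critic.parameters(), lr=1e-3)
memory, GAMMA, TAU, BATCH = [], 0.99, 0.005, 64

def soft_update(target, source):
    for t_p, s_p in zip(target.parameters(), source.parameters()):
        t_p.data.mul_(1.0 - TAU).add_(TAU * s_p.data)

env = ToyInterceptorEnv()
for episode in range(200):                                         # episode = 1, ..., M
    state = env.reset()
    for t in range(50):                                            # t = 1, ..., T
        with torch.no_grad():
            action = actor(torch.as_tensor(state)).numpy()
        action = np.clip(action + 0.1 * np.random.randn(ACTION_DIM), -ACTION_BOUND, ACTION_BOUND)
        next_state, reward = env.step(action)
        memory.append((state, action.astype(np.float32), reward, next_state))   # (St, At, Rt, St+1)
        state = next_state
        if len(memory) >= BATCH:
            s, a, r, s2 = map(np.array, zip(*random.sample(memory, BATCH)))
            s, a, s2 = map(torch.as_tensor, (s, a, s2))
            r = torch.as_tensor(r, dtype=torch.float32).unsqueeze(1)
            with torch.no_grad():
                y = r + GAMMA * critic_t(torch.cat([s2, actor_t(s2)], dim=1))    # target value
            critic_loss = nn.functional.mse_loss(critic(torch.cat([s, a], dim=1)), y)
            opt_c.zero_grad(); critic_loss.backward(); opt_c.step()
            actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()         # deterministic policy gradient
            opt_a.zero_grad(); actor_loss.backward(); opt_a.step()
            soft_update(actor_t, actor); soft_update(critic_t, critic)

# Curriculum handover (see steps (3) and (4) below): save the weights learned on the
# easier, constraint-free task and reload them to initialize the constrained training stage.
torch.save(actor.state_dict(), "actor_offline.pth")
```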
- (3) Set up a low-precision training network without considering the constraints, substitute the interceptor into the offline training process of step (2), and obtain the optimal network parameters;
- (4) Set up a high-precision training network that considers the constraints and substitute the interceptor into the offline training process of step (2); note that the optimal parameters trained in step (3) are taken as the initial parameters of this network, and obtain the optimal network parameters;
- (5) Based on the offline-trained network parameters, the midcourse guidance planning trajectory is quickly generated online. The pseudocode is detailed in Algorithm 2, and an illustrative sketch follows it.
Algorithm 2. Online trajectory planning.
Input the initial state of the interceptor and the target point position
Import the optimal network parameters into the network
Select the actions of the interceptor according to the optimal policy
Integrate to obtain the state sequence of the interceptor
Output the interceptor midcourse guidance planning trajectory
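As a companion to Algorithm 2, the following is a minimal sketch, under the same assumptions as the training sketch above, of how the offline-trained actor could be used to generate the planned trajectory online; the dynamics stub, step count, and file names are illustrative placeholders.

```python
# Online trajectory generation sketch: load the offline-trained actor and roll out
# the planned trajectory by integrating a (placeholder) dynamics model.
import numpy as np
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 6, 2

def mlp(in_dim, out_dim, out_act=None):
    layers = [nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, out_dim)]
    if out_act is not None:
        layers.append(out_act)
    return nn.Sequential(*layers)

def toy_dynamics(state, action, dt=0.5):
    """Stand-in for the interceptor equations of motion; replace with the real model."""
    return state + dt * 0.05 * np.resize(action, STATE_DIM)

actor = mlp(STATE_DIM, ACTION_DIM, nn.Tanh())
actor.load_state_dict(torch.load("actor_offline.pth"))       # assumed file from offline CL training
actor.eval()

state = np.zeros(STATE_DIM, dtype=np.float32)                 # initial interceptor state (placeholder values)
trajectory = [state.copy()]
for _ in range(240):                                          # propagate until the planning horizon is reached
    with torch.no_grad():
        action = actor(torch.as_tensor(state)).numpy()        # optimal policy, no exploration noise
    state = toy_dynamics(state, action).astype(np.float32)
    trajectory.append(state.copy())

np.savetxt("planned_trajectory.csv", np.array(trajectory), delimiter=",")
```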
6. Simulation Verification
6.1. Experimental Design
6.2. Algorithm Convergence Verification
6.3. Algorithm Effectiveness Verification
6.4. Algorithm Optimization Performance Verification
6.5. Algorithm Anti-Interference Verification
7. Conclusions
- (1) An MDP designed according to the characteristics of the midcourse guidance trajectory planning model enables the interceptor to use the DDPG algorithm for offline learning and training;
- (2) The proposed training strategy based on CL greatly improves the training convergence of the DDPG algorithm;
- (3) The simulation results show that the proposed deep reinforcement learning-based online trajectory planning method has good optimization performance, strong anti-interference capability, and outstanding real-time performance.
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Nomenclature
Symbol | Meaning | Units |
---|---|---|
 | Coordinate position of the interceptor | m |
 | Velocity of the interceptor | m/s |
 | Trajectory inclination angle | rad |
 | Trajectory deflection angle | rad |
 | Radius of the Earth | m |
 | Lift and drag forces of the interceptor | N |
 | Gravity acceleration | m/s² |
 | Gravity acceleration at sea level | m/s² |
 | Air density | kg/m³ |
 | Air density at sea level | kg/m³ |
 | Reference height | m |
 | Lift coefficient and drag coefficient | |
S | State set of the MDP | |
A | Action set of the MDP | |
R | Reward function of the MDP | |
 | Discount coefficient of the MDP | |
 | Remaining distance of the interceptor | m |
 | Longitudinal and transverse plane components of the velocity lead angle | rad |
 | Heat flux density | W/m² |
 | Dynamic pressure | Pa |
 | Overload | |
 | Attack angle | rad |
 | Bank angle | rad |
 | Objective function | |
 | Network parameters to be updated | |
 | Network parameters to be updated | |
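The quantities above correspond to a three-degree-of-freedom point-mass model with an exponential atmosphere and an altitude-dependent gravity model. The sketch below shows how such dynamics could be propagated in Python; the equation forms, constants, mass, reference area, and aerodynamic coefficients are common textbook assumptions and not necessarily the exact model of Section 2.1.

```python
# Sketch of a 3-DOF point-mass dynamics step using the quantities listed in the
# nomenclature. Equation forms and numerical values are standard-textbook assumptions.
import numpy as np

R_E, G0 = 6.371e6, 9.81          # Earth radius [m], sea-level gravity [m/s^2]
RHO0, H_S = 1.225, 7200.0        # sea-level air density [kg/m^3], reference height [m]
MASS, S_REF = 1000.0, 0.5        # assumed interceptor mass [kg] and reference area [m^2]

def derivatives(state, alpha, sigma, cl=0.8, cd=0.3):
    """Time derivatives of [x, h, z, v, theta, psi_v] for a non-thrusting 3-DOF model.

    alpha (attack angle) is assumed to enter only through cl/cd here; sigma is the bank angle.
    """
    x, h, z, v, theta, psi_v = state
    g = G0 * (R_E / (R_E + h)) ** 2              # gravity decreasing with altitude
    rho = RHO0 * np.exp(-h / H_S)                # exponential atmosphere
    q_dyn = 0.5 * rho * v ** 2                   # dynamic pressure
    lift, drag = q_dyn * S_REF * cl, q_dyn * S_REF * cd
    return np.array([
        v * np.cos(theta) * np.cos(psi_v),                           # dx/dt
        v * np.sin(theta),                                            # dh/dt
        -v * np.cos(theta) * np.sin(psi_v),                           # dz/dt
        -drag / MASS - g * np.sin(theta),                             # dv/dt
        (lift * np.cos(sigma) / MASS - g * np.cos(theta)) / v,        # dtheta/dt
        -lift * np.sin(sigma) / (MASS * v * np.cos(theta)),           # dpsi_v/dt
    ])

def rk4_step(state, alpha, sigma, dt=0.5):
    """One fourth-order Runge-Kutta integration step."""
    k1 = derivatives(state, alpha, sigma)
    k2 = derivatives(state + 0.5 * dt * k1, alpha, sigma)
    k3 = derivatives(state + 0.5 * dt * k2, alpha, sigma)
    k4 = derivatives(state + dt * k3, alpha, sigma)
    return state + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

# Example: propagate one step from the initial condition listed in the experiment tables
state = np.array([0.0, 70e3, 0.0, 3000.0, np.deg2rad(-5.0), 0.0])
print(rk4_step(state, alpha=np.deg2rad(5.0), sigma=np.deg2rad(0.0)))
```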
x0/(km) | h0/(km) | z0/(km) | v0/(m·s⁻¹) | θ0/(°) | ψv0/(°) |
---|---|---|---|---|---|
0 | 70 | 0 | 3000 | −5 | 0 |
nmax/(g) | qmax/(W·m⁻²) | pmax/(Pa) | |α|max/(°) | |σ|max/(°) |
---|---|---|---|---|
10 | 1 × 10⁶ | 1 × 10⁵ | 30 | 85
xf/(km) | hf/(km) | zf/(km) |
---|---|---|
338 | [20, 40] | [−60, 60] |
Training Strategy | Success Rate/(%) | Average Reward (Maximum, Minimum) | Standard Deviation |
---|---|---|---|
CL training | 90.16 | 852.67 (872.90, −52.77) | 214.73 |
Direct training | 40.96 | 394.84 (854.65, −132.59) | 536.65 |
Method | Flight Time of Simulated Trajectory/(s) | Time Required for Trajectory Simulation/(s) |
---|---|---|
CLDDPG | 118 | 0.15 |
PSC | 119.09 | 1.62 |
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).