Policy-Guided Model Predictive Path Integral for Safe Manipulator Trajectory Planning
Highlights
- The proposed PG-MPPI framework integrates CD-SAC offline learning, MPPI online planning, and CBF safety filtering into a three-level safety system, addressing two persistent difficulties in manipulator trajectory planning: enforcing hard constraints and generalizing across environments.
- The algorithm achieves a 100% collision-free target-reaching success rate in multi-scenario simulations on the SIASUN T12B manipulator, with trajectories that satisfy the physical constraints and demonstrate strong adaptability.
- The algorithm offers a new paradigm for manipulator trajectory planning that integrates global prior learning with local real-time optimization, overcoming the limitations of traditional methods in sampling efficiency and constraint handling.
- Its multi-level safety assurance mechanism transfers to practical industrial settings, supporting the deployment of industrial robots in complex scenarios such as flexible manufacturing and human–robot collaboration.
Abstract
1. Introduction
- A dense safety-guided policy learning method based on the configuration space distance field (CDF) is proposed. To overcome the sparse collision penalties of traditional workspace signed distance functions, the CDF is integrated into the design of the SAC reward function, and its continuously differentiable gradient is exploited to provide global, dense guidance for obstacle avoidance behavior;
- A policy-guided MPPI online planning architecture is designed to deeply integrate the global prior with local optimization. The offline-trained Constraint-Discounted SAC (CD-SAC) policy generates the nominal control sequence for MPPI, providing a high-quality warm start that concentrates the sampling distribution near the global optimum. This design alleviates the sampling inefficiency of standard MPPI, while MPPI's online receding-horizon optimization compensates for the generalization limits of the offline policy;
- A multi-level safety system featuring "policy soft guidance + optimization hard constraint + filter final guarantee" is constructed. At the offline learning layer, the safety preference is internalized into the policy through CDF and TD-CD; at the online planning layer, trajectory feasibility is enhanced through the explicit cost penalty of MPPI; at the execution layer, a safety filter based on a first-order control barrier function (CBF) projects control commands back into the safe set in real time (see the sketch after this list).
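As a rough illustration of the execution-layer filter (the paper describes it only as a first-order-CBF projection), the sketch below handles a single constraint under kinematic dynamics $\dot{x} = u$, where the CBF-QP admits a closed-form projection. The function name, the gain `alpha`, and the single-constraint setting are our assumptions, not the paper's.

```python
import numpy as np

def cbf_safety_filter(u_des, h, grad_h, alpha=5.0):
    """Project a desired control onto the CBF half-space
    grad_h(x)^T u >= -alpha * h(x), which keeps the safe set
    {x : h(x) >= 0} forward-invariant for kinematics x_dot = u.
    With a single constraint, the CBF-QP reduces to this
    closed-form minimal-norm projection."""
    a = np.asarray(grad_h, dtype=float)   # constraint normal
    b = -alpha * h                        # required lower bound on a @ u
    slack = a @ u_des - b                 # margin of the desired control
    if slack >= 0.0:
        return u_des                      # already safe: pass through unchanged
    # Minimal correction along the constraint normal to restore feasibility
    return u_des + (-slack) * a / (a @ a + 1e-12)
```

With several simultaneous constraints, the projection becomes a small quadratic program, as in the CBF-QP formulation of Ames et al.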
2. Preliminaries
2.1. Problem Formulation
- Obstacle avoidance constraint: define the minimum safety distance between the manipulator's links and joints and the obstacles in the workspace, and require the actual minimum distance to stay above this threshold throughout the motion so that no geometric collision occurs;
- State constraint: set upper and lower bounds on the joint angles and joint velocities, i.e., $q_{\min} \le q \le q_{\max}$ and $\dot{q}_{\min} \le \dot{q} \le \dot{q}_{\max}$, to prevent the system from entering states that are physically infeasible for the mechanical structure and drive system;
- Control constraint: set an amplitude upper bound on the joint accelerations, i.e., $\|\ddot{q}\|_\infty \le \ddot{q}_{\max}$, to match the actual output capability of the actuators and to prevent mechanical vibration, impact, or hardware damage caused by abrupt changes in the control signal. (A minimal check of these bounds is sketched after this list.)
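As a concrete reading of these bounds, a minimal feasibility check might look as follows. The velocity and acceleration limits echo the experimental settings (1 rad/s and 2 rad/s²), while the joint-angle limits and the function itself are illustrative.

```python
import numpy as np

def satisfies_constraints(q, q_dot, q_ddot, q_min, q_max,
                          v_max=1.0, a_max=2.0):
    """Return True iff the state (q, q_dot) and control q_ddot respect
    the joint-angle, joint-velocity, and joint-acceleration bounds."""
    in_position = np.all((q_min <= q) & (q <= q_max))
    in_velocity = np.all(np.abs(q_dot) <= v_max)
    in_control  = np.all(np.abs(q_ddot) <= a_max)
    return bool(in_position and in_velocity and in_control)
```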
2.2. Model Predictive Path Integral
2.3. Soft Actor–Critic
3. Policy-Guided MPPI
3.1. Algorithm Framework
- Offline Learning: A prior controller is obtained through repeated reinforcement-learning interactions in a simulation environment. The objectives of target reaching and safety constraint satisfaction are encoded into the reward function and the constraint-discounting mechanism to train a policy $\pi_\theta$. Such a policy provides action guidance over a long planning horizon and shifts a large share of the computational load from online execution to offline training;
- Online Planning: Real-time optimization is performed for the current actual state and environment to ensure safe execution of the manipulator. The output of the offline policy is adopted as the nominal control sequence of MPPI, and random sampling and rollouts are conducted around this sequence. The resulting desired control signal is then revised by a safety filter so that all safety constraints hold. This approach leverages the RL prior to improve sampling efficiency while retaining the explicit cost function and receding-horizon optimization of MPPI; the core MPPI update is sketched below.
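For reference, here is a minimal NumPy version of the standard information-theoretic MPPI update used in this loop. It is a sketch rather than the paper's implementation: `dynamics` and `cost` are placeholder callables, and the default parameters mirror the reported experimental settings (K = 200 samples, λ = 0.6, σ = 0.6).

```python
import numpy as np

def mppi_update(x0, U_nom, dynamics, cost, K=200, lam=0.6, sigma=0.6, rng=None):
    """One MPPI iteration: sample K control sequences around the nominal
    sequence U_nom (shape H x m), score each rollout with the stage cost,
    and return the importance-weighted perturbation update."""
    rng = rng or np.random.default_rng()
    H, m = U_nom.shape
    eps = rng.normal(0.0, sigma, size=(K, H, m))  # Gaussian perturbations
    S = np.zeros(K)                               # total rollout costs
    for k in range(K):
        x = x0
        for t in range(H):
            u = U_nom[t] + eps[k, t]
            S[k] += cost(x, u)
            x = dynamics(x, u)
    w = np.exp(-(S - S.min()) / lam)              # softmin weights, shifted for stability
    w /= w.sum()
    return U_nom + np.einsum('k,ktm->tm', w, eps) # weighted mean perturbation
```

Warm-starting `U_nom` from the policy rollout, rather than from the previous solution, is what concentrates the samples near a good solution from the first iteration.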
3.2. Offline Policy Learning
3.2.1. Reward Design
3.2.2. Constraints Based on Discount Factor
- Constraint violation no longer relies on manually designed penalty functions. Instead, the degree of violation is converted directly into a discount weight on future rewards: the more severe the violation, the closer the weight $(1-\delta_t)$ is to 0, and the larger the proportion of future rewards that is discounted away;
- A high termination probability is triggered as soon as any single constraint is severely violated, which prevents safety risks or motion failure caused by the accumulation of multiple minor violations and preserves the stringency of the constraints;
- In the absence of violations, $\delta_t = 0$ and $1-\delta_t = 1$, so future rewards are not discounted and normal exploration is encouraged. Under minor violations, $0 < \delta_t < 1$ and $0 < 1-\delta_t < 1$, so future rewards are partially discounted, which both signals the violation and allows the robot to learn recovery strategies. Under severe violations, $\delta_t = 1$ and $1-\delta_t = 0$, which directly terminates all future rewards and strictly prohibits severe violations. (A sketch of this mechanism follows this list.)
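The exact violation-to-probability mapping is given by Equations (19) and (20) and is not reproduced here; the sketch below only illustrates the three regimes described above, using a hypothetical linear mapping from normalized violation magnitude to the termination probability δ.

```python
import numpy as np

def termination_probability(violation, max_violation=1.0):
    """Hypothetical stand-in for Equation (19): map a normalized
    constraint-violation magnitude to a termination probability
    delta in [0, 1]."""
    return float(np.clip(violation / max_violation, 0.0, 1.0))

def constraint_discounted_target(r, v_next, delta, gamma=0.99):
    """Constraint-discounted TD target: the factor (1 - delta) scales the
    bootstrapped future value, so delta = 0 leaves future rewards
    undiscounted, 0 < delta < 1 partially discounts them, and
    delta = 1 cuts off all future reward."""
    return r + gamma * (1.0 - delta) * v_next
```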
3.2.3. Constraint-Discounted SAC Learning Strategy
3.3. Online Trajectory Planning
3.3.1. CBF-Based Safety Filter
3.3.2. Algorithm Implementation
**Algorithm 1** PG-MPPI pseudocode

Require: dynamics model $f$, learned policy $\pi_\theta$, horizon $H$
Ensure: control $u_t$ applied to the robot at each time step

1: Initialize nominal control sequence $U \leftarrow (u_0, \dots, u_{H-1})$
2: for $t = 0, 1, 2, \dots$ do
3:  Observe current state $x_t$
    ▷ Policy-guided nominal (warm start)
4:  Predict a nominal rollout under the learned policy:
5:  $\hat{x}_0 \leftarrow x_t$
6:  for $k = 0$ to $H-1$ do
7:   $\hat{u}_k \leftarrow \pi_\theta(\hat{x}_k)$
8:   $\hat{x}_{k+1} \leftarrow f(\hat{x}_k, \hat{u}_k)$
9:  end for
10: Set the nominal sequence $U \leftarrow (\hat{u}_0, \dots, \hat{u}_{H-1})$
    ▷ MPPI update
11: $U^{\star} \leftarrow \mathrm{MPPI}(x_t, U)$
    ▷ Safety filtering and execution
12: $u_t^{\mathrm{des}} \leftarrow u_0^{\star}$
13: $u_t \leftarrow \mathrm{SafetyFilter}(x_t, u_t^{\mathrm{des}})$
14: Apply $u_t$ to the robot, obtain next state $x_{t+1}$
    ▷ Receding horizon shift
15: $U \leftarrow (u_1^{\star}, \dots, u_{H-1}^{\star}, u_{H-1}^{\star})$
16: end for
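For readers who prefer code, the loop transcribes directly. Everything below is a sketch under the same assumptions as the earlier snippets: `policy`, `dynamics`, `cost`, and `safety_filter` are placeholder callables (with `safety_filter` wrapping the CBF projection), and the environment interface is assumed.

```python
import numpy as np

def pg_mppi_loop(env, policy, dynamics, cost, safety_filter,
                 H=25, n_steps=1000, action_dim=6):
    """Algorithm 1 in code: policy-guided warm start, MPPI refinement,
    CBF safety filtering, execution, and receding-horizon shift."""
    x = env.reset()                        # assumed to return the state
    U = np.zeros((H, action_dim))          # nominal control sequence
    for _ in range(n_steps):
        # Policy-guided nominal: roll the learned policy through the model
        x_hat = x
        for t in range(H):
            U[t] = policy(x_hat)
            x_hat = dynamics(x_hat, U[t])
        # MPPI update around the policy rollout (see mppi_update above)
        U = mppi_update(x, U, dynamics, cost)
        # Filter the first control for safety, then apply it
        u_safe = safety_filter(x, U[0])
        x = env.step(u_safe)               # assumed to return the next state
        # Receding-horizon shift: reuse the optimized tail as the next warm start
        U = np.vstack([U[1:], U[-1:]])
```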
- Guided sampling and variance reduction. The nominal control in standard MPPI is usually the time-shifted solution from the previous step or a simple heuristic, which is prone to local minima and can fail under complex constraints. PG-MPPI instead uses the RL policy to provide a high-quality warm-start sequence, concentrating the sampling distribution near the global optimum and greatly improving sampling effectiveness and convergence speed;
- Global planning capability with fast local correction. Through offline training, the RL policy learns a long-horizon value function, enabling it to handle tasks that require global information. MPPI, in turn, performs high-frequency, real-time local correction of the policy outputs, responding effectively to obstacles or model errors not encountered during offline training;
- Complementary safety and constraint handling. CDF-based RL training lets the policy internalize a soft safety preference, so the initial trajectory it outputs, even if imperfect, stays close to the safe region. On this basis, MPPI further guarantees trajectory feasibility and safety at the execution level through hard-constraint projection and collision cost penalties, forming a dual assurance mechanism of policy soft guidance plus optimization hard constraints.
4. Experiments
4.1. Offline Policy Learning
4.2. Obstacle Avoidance Trajectory
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Liu, J.; Yap, H.J.; Khairuddin, A.S.M. Review on motion planning of robotic manipulator in dynamic environments. J. Sens. 2024, 1, 5969512.
- Koptev, M.; Figueroa, N.; Billard, A. Reactive collision-free motion generation in joint space via dynamical systems and sampling-based MPC. Int. J. Robot. Res. 2024, 43, 2049–2069.
- Luo, S.; Zhang, M.; Zhuang, Y.; Ma, C.; Li, Q. A survey of path planning of industrial robots based on rapidly exploring random trees. Front. Neurorobot. 2023, 17, 1268447.
- Schulman, J.; Ho, J.; Lee, A.X.; Awwal, I.; Bradlow, H.; Abbeel, P. Finding locally optimal, collision-free trajectories with sequential convex optimization. In Proceedings of the Robotics: Science and Systems 2013, Berlin, Germany, 24 June 2013.
- Mayne, D.Q. Model predictive control: Recent developments and future promise. Automatica 2014, 50, 2967–2986.
- Williams, G.; Drews, P.; Goldfain, B.; Rehg, J.M.; Theodorou, E.A. Information-theoretic model predictive control: Theory and applications to autonomous driving. IEEE Trans. Robot. 2018, 34, 1603–1622.
- Belvedere, T.; Ziegltrum, M.; Turrisi, G.; Modugno, V. Feedback-MPPI: Fast sampling-based MPC via rollout differentiation–adios low-level controllers. IEEE Robot. Autom. Lett. 2025, 11, 1–8.
- Qu, Y.; Chu, H.; Gao, S.; Guan, J.; Yan, H.; Xiao, L.; Li, S.E.; Duan, J. RL-driven MPPI: Accelerating online control laws calculation with offline policy. IEEE Trans. Intell. Veh. 2023, 9, 3605–3616.
- Ezeji, O.; Ziegltrum, M.; Turrisi, G.; Belvedere, T.; Modugno, V. BC-MPPI: A Probabilistic Constraint Layer for Safe Model-Predictive Path-Integral Control. In Proceedings of the Agents and Robots for Reliable Engineered Autonomy 2025, Bologna, Italy, 25 October 2025.
- Tamizi, M.G.; Yaghoubi, M.; Najjaran, H. A review of recent trend in motion planning of industrial robots. Int. J. Intell. Robot. Appl. 2023, 7, 253–274.
- Stan, L.; Nicolescu, A.F.; Pupăză, C. Reinforcement learning for assembly robots: A review. Proc. Manuf. Syst. 2020, 15, 135–146.
- Romero, A.; Aljalbout, E.; Song, Y.; Scaramuzza, D. Actor–Critic Model Predictive Control: Differentiable Optimization Meets Reinforcement Learning for Agile Flight. IEEE Trans. Robot. 2025, 42, 673–692.
- Baltussen, T.M.J.T.; Orrico, C.A.; Katriniok, A.; Heemels, W.P.M.H.; Krishnamoorthy, D. Value Function Approximation for Nonlinear MPC: Learning a Terminal Cost Function with a Descent Property. In Proceedings of the 2025 IEEE 64th Conference on Decision and Control, Rio De Janeiro, Brazil, 10 December 2025.
- Hansen, N.; Wang, X.; Su, H. Temporal difference learning for model predictive control. arXiv 2022, arXiv:2203.04955.
- Wang, P.; Li, C.; Weaver, C.; Kawamoto, K.; Tomizuka, M.; Tang, C.; Zhan, W. Residual-MPPI: Online policy customization for continuous control. arXiv 2024, arXiv:2407.00898.
- Dergachev, S.; Pshenitsyn, A.; Panov, A.; Skrynnik, A.; Yakovlev, K. CoRL-MPPI: Enhancing MPPI with Learnable Behaviours for Efficient and Provably-Safe Multi-Robot Collision Avoidance. arXiv 2025, arXiv:2511.09331.
- Liu, P.; Zhang, Y.; Wang, H.; Yip, M.K.; Liu, E.S.; Jin, X. Real-time collision detection between general SDFs. Comput. Aided Geom. Des. 2024, 111, 102305.
- Li, Y.; Miyazaki, T.; Kawashima, K. One-Step Model Predictive Path Integral for Manipulator Motion Planning Using Configuration Space Distance Fields. arXiv 2025, arXiv:2509.00836.
- Li, Y.; Chi, X.; Razmjoo, A.; Calinon, S. Configuration space distance fields for manipulation planning. arXiv 2024, arXiv:2406.01137.
- Ames, A.D.; Coogan, S.; Egerstedt, M.; Notomista, G.; Sreenath, K.; Tabuada, P. Control barrier functions: Theory and applications. In Proceedings of the 2019 18th European Control Conference, Naples, Italy, 15 June 2019.
- Almubarak, H.; Sadegh, N.; Theodorou, E.A. Safety embedded control of nonlinear systems via barrier states. IEEE Control Syst. Lett. 2021, 6, 1328–1333.
- Crestaz, P.N.; De Matteis, L.; Chane-Sane, E.; Mansard, N.; Del Prete, A. TD-CD-MPPI: Temporal-Difference Constraint-Discounted Model Predictive Path Integral Control. IEEE Robot. Autom. Lett. 2025, 11, 498–505.
- Chane-Sane, E.; Leziart, P.A.; Flayols, T.; Stasse, O.; Souères, P.; Mansard, N. CaT: Constraints as terminations for legged locomotion reinforcement learning. In Proceedings of the International Conference on Intelligent Robots and Systems (IROS), Abu Dhabi, United Arab Emirates, 14 October 2024.
- Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10 July 2018.
- Ames, A.D.; Xu, X.; Grizzle, J.W.; Tabuada, P. Control barrier function based quadratic programs for safety critical systems. IEEE Trans. Autom. Control 2016, 62, 3861–3876.
- Todorov, E. Convex and analytically-invertible dynamics with contacts and constraints: Theory and implementation in MuJoCo. In Proceedings of the 2014 IEEE International Conference on Robotics and Automation, Hong Kong, China, 31 May 2014.
| Type | Parameter | Value | Function |
|---|---|---|---|
| Constraint Discount | velocity bound | 1 rad/s | Joint velocity constraint |
| Constraint Discount | acceleration bound | 2 rad/s² | Joint acceleration constraint |
| Constraint Discount | | 1 | Termination probability calculation in Equation (19) |
| Constraint Discount | | 0.99 | Exponential moving average decay rate in Equation (20) |
| SAC Network | hidden dim | 256 | Neurons per layer of the actor/critic MLPs |
| SAC Network | learning rate | 3 × 10⁻⁴ | Learning rate of the Adam optimizer |
| SAC Network | $\gamma$ | 0.99 | Baseline discount factor of SAC in Equation (21) |
| SAC Network | batch size | 256 | Batch size for each network update |
| SAC Network | action repeat | 5 | Physics simulation steps per RL step |
| Environment Reward | reach tolerance | 0.03 | Position-error threshold for task success |
| Environment Reward | max episode steps | 2500 | Maximum steps per training episode |
| Environment Reward | success bonus | 10.0 | Extra reward on task success |
| Environment Reward | total steps | 200,000 | Total environment interaction steps |
| Parameter | Value | Function |
|---|---|---|
| policy update period | 0.1 s | Update cycle of the global prior policy |
| horizon $H$ | 25 | Length of the predictive receding horizon |
| number of samples $K$ | 200 | Number of sampled trajectories for MPPI |
| temperature $\lambda$ | 0.6 | Temperature parameter of the MPPI weighting |
| noise std $\sigma$ | 0.6 | Standard deviation of the Gaussian sampling noise |
| Method | Success Rate, Normal (%) | Success Rate, Complex (%) | Planning Time (ms) |
|---|---|---|---|
| SF-MPPI | 20 | 20 | 1.47 ± 0.4 |
| SF-SAC | 100 | 60 | 1.14 ± 0.12 |
| PG-MPPI | 100 | 100 | 3.03 ± 0.82 |