PHER: A Method for Solving the Sparse Reward Problem of a Manipulator Grasping Task
Abstract
1. Introduction
2. Related Works
2.1. Off-Policy Reinforcement Learning
1. DDPG: It combines the deterministic policy gradient of Silver et al. [31] with deep learning and is used to solve continuous action space problems. It is built on the actor–critic framework, and each network has a corresponding target network, so DDPG comprises four networks: the actor network, the critic network, the target actor network, and the target critic network. (A minimal sketch of this target-network machinery appears after this list.)
2. TD3: It is an improvement of the DDPG algorithm. To DDPG's four networks it adds a second critic and its target network (six networks in total) and bootstraps from the smaller of the two target Q estimates, which reduces the impact of Q-value overestimation [32].
3. SAC: It introduces the concept of entropy [33]: an information entropy term is added to the objective so that the policy remains stochastic, spreading probability mass over actions rather than concentrating on a single one. This encourages exploration and also lets the agent learn more behaviors that come close to the goal.
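To make the shared machinery concrete, the following PyTorch sketch shows the target networks that all three algorithms rely on, plus TD3's clipped double-Q backup. This is an illustration, not the paper's implementation: the observation and action dimensions and the discount factor gamma = 0.98 are assumptions, while hidden = 256, layers = 3, and polyak = 0.95 are taken from the training settings (Table 1).

```python
import copy
import torch
import torch.nn as nn

class Critic(nn.Module):
    """Q(s, a) approximator: a 3-layer MLP; shapes here are illustrative."""
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1))

critic = Critic(obs_dim=10, act_dim=4)
target_critic = copy.deepcopy(critic)  # DDPG's target critic network

def soft_update(net, target, polyak=0.95):
    """Polyak averaging: target <- polyak * target + (1 - polyak) * net."""
    with torch.no_grad():
        for p, p_targ in zip(net.parameters(), target.parameters()):
            p_targ.mul_(polyak).add_((1.0 - polyak) * p)

# TD3's central change: keep a second critic and bootstrap from the smaller
# of the two target estimates, which counteracts Q-value overestimation.
critic2 = Critic(obs_dim=10, act_dim=4)
target_critic2 = copy.deepcopy(critic2)

def td3_backup(reward, next_obs, next_act, gamma=0.98):
    q1 = target_critic(next_obs, next_act)
    q2 = target_critic2(next_obs, next_act)
    return reward + gamma * torch.min(q1, q2)  # clipped double-Q target
```

The soft update stabilizes bootstrapped targets in all three methods; TD3 additionally takes the minimum over two critics so that approximation noise is less likely to be amplified into overestimated values.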
2.2. Hindsight Experience Replay
3. Methodology
3.1. The Problem of Uniform Sampling in HER
3.2. Prioritized Hindsight Experience Replay
Algorithm 1. Prioritized Hindsight Experience Replay (PHER).
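Only the header of the Algorithm 1 listing is reproduced above, so the following Python sketch illustrates the sampling scheme it describes under standard assumptions: hindsight-relabeled transitions are stored in a replay buffer whose priorities follow the proportional scheme of prioritized experience replay [30], with the exponents alpha = 0.6 and beta = 0.4 listed in Table 1. Class and method names are hypothetical, and buffer eviction and beta annealing are omitted for brevity.

```python
import numpy as np

class PrioritizedHindsightBuffer:
    """Hindsight relabeling + proportional prioritization (sketch only)."""

    def __init__(self, alpha=0.6, beta=0.4, eps=1e-6):
        self.alpha, self.beta, self.eps = alpha, beta, eps
        self.storage, self.priorities = [], []

    def add(self, obs, act, next_obs, achieved_goal, hindsight_goal):
        """Relabel with an achieved goal, recompute the sparse reward, and
        store at maximal priority so new experience is replayed soon."""
        reward = 0.0 if np.linalg.norm(achieved_goal - hindsight_goal) < 0.05 else -1.0
        self.storage.append((obs, act, reward, next_obs, hindsight_goal))
        self.priorities.append(max(self.priorities, default=1.0))

    def sample(self, batch_size, rng=np.random):
        p = np.asarray(self.priorities) ** self.alpha
        p /= p.sum()                      # P(i) = p_i^alpha / sum_k p_k^alpha
        idx = rng.choice(len(self.storage), batch_size, p=p)
        # Importance-sampling weights correct the bias of non-uniform replay.
        w = (len(self.storage) * p[idx]) ** (-self.beta)
        w /= w.max()
        return [self.storage[i] for i in idx], idx, w

    def update_priorities(self, idx, td_errors):
        """After each gradient step: p_i = |delta_i| + eps."""
        for i, delta in zip(idx, td_errors):
            self.priorities[i] = abs(delta) + self.eps
```

Giving new transitions the current maximum priority guarantees that every hindsight experience is replayed at least once before its priority is set from its actual TD error.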
4. Experiment and Performance Analysis
4.1. Environments
1. Fetch-Reach (Reach): The end-effector of the manipulator moves to the target position. If the distance between the end-effector and the target position is less than 0.05 m, the task of this episode is completed and the trial counts as a success.
2. Fetch-Push (Push): The end-effector pushes the block to the position of the ball; the manipulator achieves its goal by pushing rather than grasping. If the distance between the block and the target position is less than 0.05 m, the task of this episode is completed and the trial counts as a success.
3. Fetch-PickAndPlace (PickAndPlace): The gripper picks up the block and moves it to the position of the ball; the target location can be on the tabletop or in the air. If the distance between the block and the target position is less than 0.05 m, the task of this episode is completed and the trial counts as a success. (All three tasks share the sparse reward sketched after this list.)
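All three tasks reward the agent only at success, which is what makes them sparse-reward problems. Below is a minimal sketch of this reward in the style of the Gym Fetch environments; the goal coordinates in the usage comment are made up for illustration.

```python
import numpy as np

def compute_reward(achieved_goal, desired_goal, threshold=0.05):
    """Sparse, goal-conditioned reward: 0 within 0.05 m of the goal, -1 otherwise."""
    distance = np.linalg.norm(achieved_goal - desired_goal, axis=-1)
    return np.where(distance < threshold, 0.0, -1.0)

# e.g. a block 4 cm from the goal counts as success:
# compute_reward(np.array([1.30, 0.75, 0.42]), np.array([1.30, 0.75, 0.46]))  # -> 0.0
```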

4.2. Baselines
1. DDPG (TD3, SAC) + HER/RHER, which samples the hindsight experiences uniformly at random.
2. DDPG + CHER, which uses a curriculum learning method built on HER to improve training efficiency.

4.3. Training Setting
4.4. Benchmark Result
4.5. On the Real Robot
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
1. Mannion, P.; Devlin, S.; Mason, K.; Duggan, J.; Howley, E. Policy invariance under reward transformations for multi-objective reinforcement learning. Neurocomputing 2017, 263, 60–73.
2. Shi, C. Statistical inference in reinforcement learning: A selective survey. arXiv 2025, arXiv:2502.16195.
3. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533.
4. Zheng, K.; Ganapati, V. Hyperparameter optimization for deep reinforcement learning: An Atari Breakout case study. J. Stud. Res. 2025, 14.
5. Liu, Y.; Liu, X. Adventurer: Exploration with BiGAN for deep reinforcement learning. Appl. Intell. 2025, 55, 726.
6. Moreno-Vera, F. Performing deep recurrent double Q-learning for Atari games. In Proceedings of the 2019 IEEE Latin American Conference on Computational Intelligence (LA-CCI); IEEE: New York, NY, USA, 2019; pp. 1–6.
7. Silver, D.; Schrittwieser, J.; Simonyan, K.; Antonoglou, I.; Huang, A.; Guez, A.; Hubert, T.; Baker, L.; Lai, M.; Bolton, A.; et al. Mastering the game of Go without human knowledge. Nature 2017, 550, 354–359.
8. Zheng, X. The advancements and applications of deep reinforcement learning in Go. In ITM Web of Conferences; EDP Sciences: Les Ulis, France, 2025.
9. Jang, S.; Kim, H.-I. Efficient deep reinforcement learning under task variations via knowledge transfer for drone control. ICT Express 2024, 10, 576–582.
10. Kaufmann, E.; Bauersfeld, L.; Loquercio, A.; Müller, M.; Koltun, V.; Scaramuzza, D. Champion-level drone racing using deep reinforcement learning. Nature 2023, 620, 982–987.
11. Kamil, Z.; Abdulazeez, A. A review on deep reinforcement learning for autonomous driving. Indones. J. Comput. Sci. 2024.
12. Kiran, B.R.; Sobh, I.; Talpaert, V.; Mannion, P.; Sallab, A.A.A.; Yogamani, S.; Pérez, P. Deep reinforcement learning for autonomous driving: A survey. IEEE Trans. Intell. Transp. Syst. 2022, 23, 4909–4926.
13. Yu, J.; Feng, X.; Gong, D.; Gong, Y. SAR-PPO (Segmented Adaptive Reward): Robotic arm open door motion control with reinforcement learning based on segmented adaptive reward. In Proceedings of the 2024 43rd Chinese Control Conference (CCC); IEEE: New York, NY, USA, 2024; pp. 2970–2975.
14. Kwon, G.; Kim, B.; Kwon, N. Reinforcement learning with task decomposition and task-specific reward system for automation of high-level tasks. Biomimetics 2024, 9, 196.
15. Hong, C.; Lee, T.-E. Multi-agent reinforcement learning approach for scheduling cluster tools with condition based chamber cleaning operations. In Proceedings of the 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA); IEEE: New York, NY, USA, 2018; pp. 885–890.
16. Pavlichenko, D.; Behnke, S. Dexterous pre-grasp manipulation for human-like functional categorical grasping: Deep reinforcement learning and grasp representations. IEEE Trans. Autom. Sci. Eng. 2026, 23, 2231–2244.
17. Wang, Y.; Yu, W.; Wu, H.; Guo, H.; Dong, H. SA-DEM: Dexterous extrinsic robotic manipulation of non-graspable objects via stiffness-aware dual-stage reinforcement learning. IEEE Trans. Autom. Sci. Eng. 2026, 23, 347–362.
18. Liao, J.; Xiong, P.; Zhou, M.; Liu, P.X.; Song, A. Adversarial subgraph contrastive learning for predicting grasp stability of robotic hands with multimodal signals. IEEE Trans. Autom. Sci. Eng. 2025, 22, 17720–17733.
19. Sodhani, S.; Zhang, A.; Pineau, J. Multi-task reinforcement learning with context-based representations. In Proceedings of the 38th International Conference on Machine Learning, PMLR, Virtual Event, 18–24 July 2021.
20. Yoganathan, V.; Osburg, V.-S.; Kunz, W.H.; Toporowski, W. Check-in at the Robo-desk: Effects of automated social presence on social cognition and service implications. Tour. Manag. 2021, 85, 104309.
21. Zhang, M.; Cai, W.; Pang, L. Predator-prey reward based Q-learning coverage path planning for mobile robot. IEEE Access 2023, 11, 29673–29683.
22. Karalakou, A.; Troullinos, D.; Chalkiadakis, G.; Papageorgiou, M. Deep reinforcement learning reward function design for autonomous driving in lane-free traffic. Systems 2023, 11, 134.
23. Cao, J.; Dong, L.; Yuan, X.; Wang, Y.; Sun, C. Hierarchical multi-agent reinforcement learning for cooperative tasks with sparse rewards in continuous domain. Neural Comput. Appl. 2023, 36, 273–287.
24. Devidze, R.; Kamalaruban, P.; Singla, A. Exploration-guided reward shaping for reinforcement learning under sparse rewards. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2022; Volume 35, pp. 5829–5842.
25. Christiano, P.F.; Leike, J.; Brown, T.B.; Martic, M.; Legg, S.; Amodei, D. Deep reinforcement learning from human preferences. In Proceedings of the 31st International Conference on Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2017; pp. 4302–4310.
26. Andrychowicz, M.; Wolski, F.; Ray, A.; Schneider, J.; Fong, R.; Welinder, P.; McGrew, B.; Tobin, J.; Abbeel, P.; Zaremba, W. Hindsight experience replay. In Proceedings of the 31st International Conference on Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2017; pp. 5055–5065.
27. Fang, M.; Zhou, T.; Du, Y.; Han, L.; Zhang, Z. Curriculum-guided hindsight experience replay. Adv. Neural Inf. Process. Syst. 2019, 32, 1131.
28. Fang, M.; Zhou, C.; Shi, B.; Gong, B.; Xu, J.; Zhang, T. DHER: Hindsight experience replay for dynamic goals. In Proceedings of the International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 6–9 May 2019.
29. Luo, Y.; Wang, Y.; Dong, K.; Zhang, Q.; Cheng, E.; Sun, Z.; Song, B. Relay hindsight experience replay: Self-guided continual reinforcement learning for sequential object manipulation tasks with sparse rewards. Neurocomputing 2023, 557, 126620.
30. Schaul, T.; Quan, J.; Antonoglou, I.; Silver, D. Prioritized experience replay. In Proceedings of the International Conference on Learning Representations (ICLR), San Juan, Puerto Rico, 2–4 May 2016.
31. Silver, D.; Lever, G.; Heess, N.; Degris, T.; Wierstra, D.; Riedmiller, M. Deterministic policy gradient algorithms. In Proceedings of the 31st International Conference on Machine Learning, PMLR, Beijing, China, 21–26 June 2014.
32. Fujimoto, S.; van Hoof, H.; Meger, D. Addressing function approximation error in actor-critic methods. In Proceedings of the 35th International Conference on Machine Learning, PMLR, Stockholm, Sweden, 10–15 July 2018.
33. Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning, PMLR, Stockholm, Sweden, 10–15 July 2018.
34. Akinola, I.; Xu, J.; Song, S.; Allen, P.K. Dynamic grasping with reachability and motion awareness. In Proceedings of the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS); IEEE: New York, NY, USA, 2021; pp. 9422–9429.
35. Tobin, J.; Fong, R.; Ray, A.; Schneider, J.; Zaremba, W.; Abbeel, P. Domain randomization for transferring deep neural networks from simulation to the real world. In Proceedings of the 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS); IEEE: New York, NY, USA, 2017; pp. 23–30.
Table 1. Hyperparameters used for training.
| Parameter | Value | Parameter | Value | Parameter | Value |
|---|---|---|---|---|---|
| max_u | 1 | n_test_rollouts | 10 | replay_strategy | future |
| layers | 3 | test_with_polyak | False | replay_k | 4 |
| hidden | 256 | random_eps | 0.3 | norm_eps | 0.01 |
| network_class | ActorCritic | noise_eps | 0.2 | norm_clip | 5 |
| Q_lr | 0.001 | buffer_size | 1,000,000 | alpha | 0.6 |
| pi_lr | 0.001 | polyak | 0.95 | beta | 0.4 |
| action_l2 | 1 | clip_obs | 200.0 | eps | |
| relative_goals | False | n_cycles | 50 | w_potential | 1.0 |
| batch_size | 256 | n_batches | 40 | w_linear | 1.0 |
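Two of these entries, replay_strategy = future and replay_k = 4, govern how hindsight goals are drawn. Assuming the convention of the OpenAI Baselines HER sampler (the function and argument names below are illustrative, not the authors' code), each sampled transition is relabeled with probability 1 - 1/(1 + replay_k) = 0.8, and the substitute goal is one actually achieved at a later timestep of the same episode:

```python
import numpy as np

replay_k = 4                             # from the table above
future_p = 1.0 - 1.0 / (1.0 + replay_k)  # = 0.8: four relabeled goals per original

def her_sample_goals(t_samples, T, achieved_goals, desired_goals, rng=np.random):
    """'future' strategy: with probability future_p, swap the desired goal
    for one achieved at a random later timestep t' in (t, T] of the same
    episode. achieved_goals has shape (batch, T + 1, goal_dim); t_samples
    is an integer array of sampled timesteps."""
    batch = len(t_samples)
    goals = desired_goals.copy()
    her_idx = np.where(rng.uniform(size=batch) < future_p)[0]
    offset = (rng.uniform(size=batch) * (T - t_samples)).astype(int)
    future_t = t_samples + 1 + offset
    goals[her_idx] = achieved_goals[her_idx, future_t[her_idx]]
    return goals
```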
Table 2. Convergence epoch on the Push and PickAndPlace tasks for each method and backbone algorithm (DDPG, TD3, SAC) under three random seeds.

| Environment | Method | Seed | DDPG | TD3 | SAC |
|---|---|---|---|---|---|
| Push | HER | 100 | 185 | 192 | 146 |
| Push | HER | 200 | 182 | 231 | 121 |
| Push | HER | 300 | 196 | 255 | 102 |
| Push | RHER | 100 | 182 | 209 | 89 |
| Push | RHER | 200 | 127 | 181 | 64 |
| Push | RHER | 300 | 151 | 146 | 74 |
| Push | PHER | 100 | 51 | 61 | 62 |
| Push | PHER | 200 | 55 | 93 | 63 |
| Push | PHER | 300 | 53 | 53 | 63 |
| PickAndPlace | HER | 100 | 233 | 262 | 298 |
| PickAndPlace | HER | 200 | 292 | 303 | 456 |
| PickAndPlace | HER | 300 | 272 | 229 | 342 |
| PickAndPlace | RHER | 100 | 225 | 261 | 253 |
| PickAndPlace | RHER | 200 | 275 | 262 | 273 |
| PickAndPlace | RHER | 300 | 262 | 211 | 248 |
| PickAndPlace | PHER | 100 | 81 | 55 | 76 |
| PickAndPlace | PHER | 200 | 75 | 52 | 83 |
| PickAndPlace | PHER | 300 | 61 | 48 | 90 |
Table 3. Convergence epoch on the Reach task for each method and backbone algorithm.

| Method | Environment | DDPG | TD3 | SAC |
|---|---|---|---|---|
| HER | Reach | 4 | 16 | 1 |
| RHER | Reach | 2 | 3 | 1 |
| PHER | Reach | 5 | 5 | 7 |
Table 4. Success counts and rates on the real robot over 20 trials per task.

| Task | Trials | PHER Success | HER Success | PHER Rate | HER Rate |
|---|---|---|---|---|---|
| Reach | 20 | 20 | 19 | 100% | 95% |
| Push | 20 | 17 | 13 | 85% | 65% |
| PickAndPlace | 20 | 15 | 10 | 75% | 50% |