Abstract
Off-policy reinforcement learning is commonly used to train manipulator grasping tasks. During training, however, it is difficult to collect enough successful experiences and rewards for learning; that is, the rewards are sparse. Hindsight experience replay (HER) allows the agent to relabel the states it has achieved in failed episodes. However, not all failed experiences contribute equally to learning. Given the many transitions generated during interaction with the environment, sampling uniformly at random from the experience replay buffer results in low data utilization and slow convergence. This paper proposes a prioritized sampling method for the relabeled transitions and combines it with several off-policy reinforcement learning algorithms for training in simulated environments. Prioritized sampling allows the agent to access the more important transitions earlier, accelerating the convergence of training. The results demonstrate that prioritized hindsight experience replay (PHER) converges significantly faster than the compared methods.
1. Introduction
Reinforcement learning (RL) is a learning paradigm in which an agent improves its behavior through interactions with an environment. At each decision step, the agent observes the current state, selects an action according to a policy, and then receives a reward while the environment transitions to a new state. The objective of RL is to learn a policy that maximizes the expected long-term cumulative reward, rather than optimizing immediate feedback only [1,2].
Deep reinforcement learning (DRL) combines RL with deep neural networks to handle high-dimensional observations and complex decision-making. Depending on the algorithm, deep networks can be used to approximate a policy, a value function, or both, enabling successful applications in simulated games and, increasingly, in complex control and robotics tasks, for example, learning to play various Atari games [3,4,5,6] and the game of Go [7,8], even surpassing human performance. In recent years, DRL has also become an effective method for solving a range of complex control problems: classic balance-car control, unmanned aerial vehicle training [9,10], and autonomous driving [11,12] have all seen exciting breakthroughs. Similarly, in the field of robotics, DRL has attracted widespread attention for complex manipulator control tasks, such as controlling a manipulator to open a door [13] or perform manipulation operations [14,15]. In more dexterous manipulation scenarios, Pavlichenko et al. [16] proposed a pre-grasp manipulation framework that enables human-like functional categorical grasping through deep reinforcement learning and structured grasp representations. Wang et al. [17] introduced a stiffness-aware dual-stage reinforcement learning framework (SA-DEM) for dexterous extrinsic manipulation of non-graspable objects. In addition, learning-based approaches have been applied to grasp stability analysis: Liao et al. [18] proposed an adversarial subgraph contrastive learning framework to predict grasp stability using multimodal sensory signals. DRL has achieved breakthroughs in a variety of sequential decision problems, and in simulated environments it can train agents to execute challenging tasks. Sodhani et al. proposed a framework that leverages natural language task descriptions to learn interpretable and composable representations for multitask reinforcement learning [19,20].
DRL enhances traditional RL by using deep neural networks as powerful function approximators. To train a reliable agent, we need to design a reward function that guides the agent toward completing the task [21]. However, reward function design is often limited to specific tasks in specific environments, so it requires a detailed understanding of the task and a certain level of domain expertise. Karalakou et al. designed an effective reward function for a lane-free autonomous driving agent and proposed different reward components tied to the environment at various levels of information [22]. Moreover, for some engineering problems where only the outcome is known but the process is indescribable, it is difficult to design a corresponding reward function, for example, solving a Rubik's cube with a manipulator [23]. This reflects a difficult problem in current reward function design for reinforcement learning: sparse rewards. When the state space is large and only a few specific outcomes yield a reward, the sparse reward problem arises. Several algorithms have been proposed to address it: Devidze et al. proposed a novel framework called Exploration-Guided Reward Shaping (ExploRS), which speeds up learning in sparse reward environments in a fully self-supervised way. The core idea of ExploRS is to combine exploratory rewards to learn an intrinsic reward function that maximizes the agent's utility relative to the extrinsic reward [24]. Christiano et al. proposed deep reinforcement learning from human preferences, which learns the reward function from human preferences.
The motivation of this method is to enable the agent to exhibit the behavior that humans expect in complex and ambiguous tasks, without requiring humans to provide explicit and frequent feedback [25]. Andrychowicz et al. proposed a method called HER, which solves the reinforcement learning problem under sparse rewards [26]. They propose using the achieved goals of failed experiences to replace the desired goals of training trajectories. This method relabels the quadruples in the experience replay buffer and uses the failed experiences to enrich the repository. Through this modification, any failed experience can get a non-negative reward, and the agent can learn something even in the case of task failure. This method performs well in solving problems related to manipulator control. Later, researchers proposed a series of improvements to HER, focusing on how to select more informative replay samples and relabel goals more effectively. Representative methods include curriculum-guided hindsight experience replay (CHER) [27], hindsight experience replay for dynamic goals (DHER) [28], and relay hindsight experience replay (RHER) [29]. DHER is designed for scenarios where goals vary over time or are dynamically updated; it incorporates goal dynamics into the replay procedure so that relabeled hindsight goals remain consistent with the evolving task, thereby improving training stability and sample efficiency. RHER introduces a “relay” mechanism that constructs a sequence of intermediate states (or subgoals) from a failed trajectory as stepwise hindsight goals, effectively decomposing long-horizon sparse reward tasks into shorter learning stages, which eases credit assignment and accelerates policy learning. 
In contrast, CHER combines HER with curriculum learning and proposes an adaptive strategy to select failed experiences for replay: its selection criterion jointly considers goal proximity to the true goal and diversity curiosity toward diverse fictitious goals, and gradually adjusts their relative weights during training to form a human-like learning curriculum.
For HER, the number of failed experiences accumulated over many training episodes is large, with varying degrees of proximity to the desired goal, yet they are treated equally when sampled. However, not all failed experiences contribute equally to training, and some offer only limited help. To use experience efficiently, we introduce the prioritized experience replay (PER) [30] method into HER, which ranks experiences according to their priorities and makes full use of them. The main contributions of this paper are as follows: (1) the prioritized experience replay technique is introduced into HER to prioritize the transitions that are sampled and relabeled during training, and the results show that prioritized sampling makes training of the robotic arm's manipulation tasks more targeted and efficient; (2) the proposed prioritized hindsight experience replay (PHER) is combined and evaluated with the twin-delayed deep deterministic policy gradient (TD3) and soft actor-critic (SAC) algorithms in addition to the deep deterministic policy gradient (DDPG).
The rest of this article is organized as follows: Section 2 discusses the main off-policy reinforcement learning algorithms and the HER/RHER method for solving sparse reward problems that are combined in this study. Section 3 discusses the problems of HER/RHER and the improvement scheme of this study. The PHER method is proposed, and the pseudo-code is given. Section 4 mainly shows the experimental results of this study and introduces the advantages of the PHER method. Section 5 is the conclusion and outlook of this study.
2. Related Works
This section mainly introduces two aspects closely related to this research, namely off-policy reinforcement learning algorithms and HER. This paper mainly uses three off-policy reinforcement learning algorithms: DDPG, TD3, and SAC.
2.1. Off-Policy Reinforcement Learning
Reinforcement learning algorithms can be broadly categorized into on-policy and off-policy methods, depending on the relationship between the policy used to collect data and the policy being optimized. In on-policy learning, the agent updates its value function or policy using trajectories generated by the same policy that is currently being improved. Consequently, the learning target is tightly coupled with the data collection policy, and using data that deviates substantially from the current policy may introduce bias or instability. In contrast, off-policy methods are designed to learn a target policy from data generated by a different behavior policy, which enables learning from previously collected experiences, demonstrations, or exploratory behaviors.
This distinction is particularly important when experience replay and hindsight experience replay (HER) are employed. Experience replay reuses transitions stored in a buffer, meaning that the samples used for updates are typically generated by earlier policies rather than the current one. More critically, HER relabels the goal (and thus the reward signal and learning target) of past transitions to create additional training signals. After relabeling, the training objective no longer corresponds to the original data generation process, making the update inherently off-policy. Therefore, off-policy reinforcement learning is essential for effectively leveraging replayed and relabeled transitions in HER-based frameworks. In this work, we adopt three off-policy algorithms in combination with PHER.
- 1. DDPG: It combines deterministic policy gradients [31] with deep learning and is used to solve continuous action space problems. It is based on the actor-critic framework, and each network has a corresponding target network, so DDPG comprises four networks: the actor network, the critic network, the target actor network, and the target critic network.
- 2. TD3: It is an improvement of the DDPG algorithm. To the four networks of DDPG it adds a second critic network and its target network, reducing the impact of Q-value overestimation [32].
- 3. SAC: It introduces the concept of entropy [33]: by adding an information entropy term, it makes the policy stochastic, spreading probability mass over actions rather than concentrating on a single action. This encourages exploration and allows the agent to learn multiple behaviors that come close to the goal.
2.2. Hindsight Experience Replay
This paper mainly conducts experiments in manipulator environments, where a common challenge is designing the reward function [34]: the manipulator receives a positive reward when it completes the goal task and a negative reward when it fails. However, the probability of completing the task by chance is very small, so the manipulator mostly receives negative rewards during training. In this case, the manipulator can hardly learn anything, which leads to the sparse reward problem. HER provides a promising way to improve the efficiency of goal-oriented tasks under sparse rewards.
HER studies an agent with sparse rewards operating in a multi-goal environment. At each time step t, the agent receives an observation (or state) s_t from the environment, selects an action a_t based on its policy, and receives a reward r_t as the environment transitions to the next state. The agent can then generate a trajectory:

τ = (s_0, a_0, r_0, s_1, a_1, r_1, ..., s_T), (1)
where T is the episode length and each step t is associated with a transition (s_t, a_t, r_t, s_{t+1}). In many reinforcement learning tasks, the agent only receives a reward when the trajectory reaches the final desired goal g, and only successful trajectories receive a non-negative return. However, since the policy is not fully trained and has a low success rate, the collected successful trajectories are often insufficient for training, resulting in the sparse reward problem. HER solves this by treating failed experiences as successful experiences of achieving substitute goals and learning from them. For any off-policy reinforcement learning algorithm, HER modifies the desired goal g in a transition to some achieved goal g' sampled from the states of failed experiences and uses it for training. The desired goal is the actual goal that the agent wants to achieve; the achieved goal is the state that the agent has already reached, such as the Cartesian position of each fingertip in the manipulator environment. If g is replaced by g', the corresponding failed experience is assigned a non-negative reward, which helps to learn the policy. The relabeled transition is (s_t||g', a_t, r'_t, s_{t+1}||g').
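The goal-substitution step above can be sketched as follows. This is a minimal illustration using the "future" relabeling strategy from the HER paper; the dictionary keys and the helper name her_relabel are chosen for the example rather than taken from this work:

```python
import numpy as np

def her_relabel(trajectory, reward_fn, k=4, rng=None):
    """HER-style relabeling sketch with the 'future' strategy.

    trajectory: list of dicts with keys 'obs', 'action', 'next_obs',
                'achieved_goal', 'desired_goal'.
    reward_fn(achieved, goal) -> 0.0 on success, -1.0 otherwise.
    Returns the original transitions plus k relabeled copies per step.
    """
    rng = rng or np.random.default_rng()
    out = []
    T = len(trajectory)
    for t, tr in enumerate(trajectory):
        # store the original transition with the desired goal g
        out.append({**tr, 'reward': reward_fn(tr['achieved_goal'], tr['desired_goal'])})
        # sample k achieved goals g' from later steps of the same episode
        for _ in range(k):
            future = rng.integers(t, T)
            g_prime = trajectory[future]['achieved_goal']
            out.append({**tr,
                        'desired_goal': g_prime,
                        'reward': reward_fn(tr['achieved_goal'], g_prime)})
    return out
```

With a sparse binary reward, the relabeled copies are the only transitions in a failed episode that can carry a reward of 0, which is what makes the failed experience informative.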
3. Methodology
3.1. The Problem of Uniform Sampling in HER
HER cleverly solves the sparse reward problem when training manipulator grasping task environments. However, this method uses random uniform sampling to extract mini-batches from the transitions. This results in low sampling efficiency and is not conducive to the subsequent training process.
In the early stage of experience collection, transitions are generated and stored by initializing the state s_0 and executing randomly generated actions a_t. Some of the stored transitions are then relabeled. However, the number of transitions after relabeling is large, there are duplicates, and the importance of each transition differs. Simply sampling all relabeled transitions uniformly at random to train the network ignores the importance of individual experiences; mixing a large number of transitions and sampling them randomly is not conducive to the subsequent training process. Therefore, this paper introduces a prioritized sampling strategy, which relabels and samples transitions according to priority.
3.2. Prioritized Hindsight Experience Replay
To introduce prioritized sampling for HER, the priority of each transition needs to be calculated. The priority of a transition is computed from its TD error. The TD error is the difference between the Q-value of the current state and the expected Q-value in reinforcement learning. The goal of training the critic network is to bring the current Q-value close to the expected Q-value, that is, to minimize the TD error. Therefore, when sampling for critic training, transitions with larger TD errors are more helpful and can be ranked higher. In this way, the TD error determines the sampling order for training the critic network, which is prioritized sampling. Replacing the random uniform sampling in HER with prioritized sampling sorts experiences according to their priorities and improves the training efficiency of the critic network. The TD error is calculated as in Equation (2),

δ_t = r_t + γ Q^π(s_{t+1}, π(s_{t+1})) − Q^π(s_t, a_t), (2)

where δ_t is the TD error, γ is the discount factor, and Q^π(s_t, a_t) is the estimated value of executing action a_t under policy π in state s_t at time step t.
In HER, the first step is to store the transitions without relabeling. Substituting the elements s_t||g, a_t, r_t, s_{t+1}||g of such a transition into Equation (2) gives Equation (3):

δ_t = r_t + γ Q^π(s_{t+1}||g, π(s_{t+1}||g)) − Q^π(s_t||g, a_t). (3)

This equation shows that a transition with a large TD error is more important and should be learned from more frequently. Hence, a sampling probability must be calculated for each transition, and sampling should be performed according to this probability.
When calculating the priority of a transition after relabeling, substituting the elements s_t||g', a_t, r'_t, s_{t+1}||g' of the relabeled transition into Equation (2) gives Equation (4):

δ'_t = r'_t + γ Q^π(s_{t+1}||g', π(s_{t+1}||g')) − Q^π(s_t||g', a_t). (4)
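The goal-conditioned TD errors of Equations (3) and (4) can be computed in a loop over a sampled batch; in this sketch, q_fn and policy_fn are placeholder callables standing in for the critic and actor networks (their names are illustrative, not from the paper):

```python
import numpy as np

def td_errors(batch, q_fn, policy_fn, gamma=0.98):
    """Per-transition TD-error magnitudes for goal-conditioned transitions.

    batch: iterable of (s, a, r, s_next, g) tuples, where g is the (possibly
    relabeled) goal. s||g denotes the state concatenated with the goal.
    """
    deltas = []
    for (s, a, r, s_next, g) in batch:
        s_g = np.concatenate([s, g])
        s_next_g = np.concatenate([s_next, g])
        # delta = r + gamma * Q(s'||g, pi(s'||g)) - Q(s||g, a)
        target = r + gamma * q_fn(s_next_g, policy_fn(s_next_g))
        deltas.append(target - q_fn(s_g, a))
    return np.abs(np.array(deltas))  # magnitudes used as priorities
```

The absolute values returned here are the raw priorities from which the sampling probabilities of Equations (5)-(7) are derived.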
The sampling probability is given by Equation (5),

P(i) = p_i^α / Σ_k p_k^α, (5)

where p_i is the priority of transition i. The exponent α determines how much prioritization is used, and α = 0 corresponds to uniform sampling.
There are two ways to determine the priority of each transition, namely proportional prioritization and rank-based prioritization. The first determines the priority directly from the TD-error magnitude, as shown in Equation (6),

p_i = |δ_i| + ε, (6)

where ε is a small constant greater than 0 that ensures p_i > 0 even when the TD error is zero.
The other way determines the priority from the rank of |δ_i|: the TD-error magnitudes are sorted in descending order, with larger values in front and smaller ones behind, which is equivalent to sorting by priority. In Equation (7),

p_i = 1 / rank(i), (7)

rank(i) is the rank of transition i in this ordering, so the sampling probability is inversely proportional to the rank: when |δ_i| is larger, the rank is smaller and the sampling probability is higher. To reduce repetition of experiences during training, new experiences are stored in the experience replay buffer without calculating their TD error; their priority is instead set to the maximum priority currently in the buffer, ensuring that each new transition is sampled at least once. Rank-based prioritization is more robust than proportional prioritization because it is not affected by outliers or the magnitude of the errors, and its heavy-tailed distribution guarantees sample diversity. Stratified sampling keeps the minibatch gradients stable in magnitude throughout training. Therefore, this experiment mainly uses the rank-based method.
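Both prioritization schemes of Equations (5)-(7) can be sketched in a single helper; the function name sampling_probs and the method argument are illustrative:

```python
import numpy as np

def sampling_probs(td_errors, alpha=0.6, eps=1e-6, method="rank"):
    """Sampling probabilities from TD-error magnitudes.

    method="proportional": p_i = |delta_i| + eps     (Eq. (6))
    method="rank":         p_i = 1 / rank(i)         (Eq. (7))
    Either way, P(i) = p_i^alpha / sum_k p_k^alpha   (Eq. (5)).
    """
    td = np.abs(np.asarray(td_errors, dtype=float))
    if method == "proportional":
        p = td + eps
    else:  # rank-based
        ranks = np.empty_like(td)
        order = np.argsort(-td)            # descending by |delta|
        ranks[order] = np.arange(1, len(td) + 1)
        p = 1.0 / ranks
    p = p ** alpha
    return p / p.sum()
```

Note how the rank-based variant depends only on the ordering of the TD errors, which is why it is insensitive to outliers in their magnitudes.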
Since the sampling is no longer uniform, transitions with different sampling probabilities effectively receive different learning rates, which introduces bias into the Q-network estimates. We therefore need to adjust the update magnitude according to the sampling probability and modify the loss function accordingly: if a transition has a larger sampling probability, its contribution should be scaled down. Equation (8) is the original loss function,

L = (1/k) Σ_i δ_i², (8)

and Equation (9) is the improved loss function,

L = (1/k) Σ_i w_i δ_i², (9)

where the change from Equation (8) to Equation (9) is the addition of the importance sampling weight w_i.
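The correction weights used in Equation (9) can be sketched as follows (the helper name is_weights is illustrative); normalizing by the maximum weight ensures that updates are only ever scaled down, never up:

```python
import numpy as np

def is_weights(probs, idx, beta=0.4):
    """Importance-sampling weights for the sampled indices.

    probs: sampling probabilities P(i) over the whole buffer.
    idx:   indices of the transitions drawn for the minibatch.
    w_i = (N * P(i))^(-beta), normalized by the maximum weight.
    """
    N = len(probs)
    w = (N * probs[idx]) ** (-beta)
    return w / w.max()
```

With uniform probabilities the weights are all 1, recovering the unweighted loss of Equation (8); the more a transition is over-sampled, the smaller its weight.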
Equation (10) defines the correction weight,

w_i = (N · P(i))^(−β) / max_j w_j, (10)

where β is between 0 and 1 and is tuned as a hyperparameter to obtain the most suitable amount of correction. In PHER, to prevent the computational cost of priority-based sampling from growing sharply with the buffer size, a sum-tree is used to store priorities and perform sampling, reducing the computational complexity. A sum-tree is a binary tree in which each internal node stores the sum of its children, the root stores the total priority, and the leaf nodes store the priorities of the individual transitions. A buffer of size N corresponds to a sum-tree with 2N − 1 nodes.
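A minimal array-backed sum-tree along these lines might look as follows (class and method names are illustrative, not from the paper); both updating a priority and drawing a sample cost O(log N):

```python
import numpy as np

class SumTree:
    """Binary sum-tree over N leaves: leaf i stores priority p_i, each
    internal node stores the sum of its children, and the root stores the
    total priority. A buffer of size N needs 2N - 1 nodes."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.tree = np.zeros(2 * capacity - 1)

    def update(self, leaf, priority):
        idx = leaf + self.capacity - 1           # leaf position in the array
        change = priority - self.tree[idx]
        while idx >= 0:                          # propagate the change upward
            self.tree[idx] += change
            idx = (idx - 1) // 2 if idx > 0 else -1

    def sample(self, value):
        """Descend from the root; value in [0, total) selects a leaf with
        probability proportional to its priority."""
        idx = 0
        while 2 * idx + 1 < len(self.tree):      # until a leaf is reached
            left = 2 * idx + 1
            if value < self.tree[left]:
                idx = left
            else:
                value -= self.tree[left]
                idx = left + 1
        return idx - (self.capacity - 1)         # leaf index in [0, N)

    @property
    def total(self):
        return self.tree[0]
```

Drawing a minibatch then amounts to splitting [0, total) into k strata and calling sample once per stratum, which is the stratified sampling mentioned above.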
The acceleration of convergence observed with PHER can be theoretically attributed to the optimization of gradient estimation quality. In sparse reward environments, traditional uniform sampling often leads to an abundance of “zero-reward” transitions entering the training loop, which increases the variance of stochastic gradient estimates and hampers learning progress. By introducing a priority-based weighting mechanism, PHER concentrates computational resources on transitions with high temporal difference (TD) error. From a theoretical perspective, this approach functions as a form of importance sampling, which effectively reduces the variance of the policy gradient estimator and enhances the signal-to-noise ratio of the learning update [30]. Furthermore, by prioritizing successful trajectories that offer meaningful credit assignment, PHER facilitates a more rapid propagation of rewards throughout the value network, enabling the agent to reach the optimal policy with significantly fewer environment interactions. The complete algorithm of PHER is shown in Algorithm 1.
Algorithm 1: Prioritized Hindsight Experience Replay (PHER)
4. Experiment and Performance Analysis
In this experiment, the success rate of PHER during training was evaluated. We combined PHER with DDPG, TD3, and SAC, respectively, and trained them in three environments from OpenAI Gym. After training in these three simulated MuJoCo environments, we compared how fast the success rate of PHER+DDPG/TD3/SAC converges against HER+DDPG/TD3/SAC and RHER+DDPG/TD3/SAC under identical conditions in the Spinning Up framework. In this work, convergence is defined as the first epoch at which the success rate exceeds 95% and remains above this threshold for subsequent evaluation episodes.
4.1. Environments
Figure 1 shows three environments that use a 7-DoF Fetch robot arm with a parallel gripper: Fetch-Pick and Place, Fetch-Push, Fetch-Reach. The agent controls the gripper by sending three-dimensional vectors that indicate its desired motion in Cartesian coordinates. The gripper has different goals in each environment. The reward function settings for this task are sparse and binary: if the agent accomplishes the goal (within a certain margin of error), it gets 0; otherwise, it gets −1. The specific tasks are as follows:
- 1. Fetch-Reach (Reach): The end of the manipulator moves to the target position. If the distance between the end of the manipulator and the target position is less than 0.05 m, the task of this episode is completed and the episode counts as a success.
- 2. Fetch-Push (Push): The end of the manipulator pushes the block to the position of the ball marker; the manipulator achieves its goal by pushing the box. If the distance between the box and the target position is less than 0.05 m, the task of this episode is completed and the episode counts as a success.
- 3. Fetch-Pick And Place (PickAndPlace): The manipulator gripper picks up the block and moves it to the ball position. The target location can be on the desktop or in the air. If the distance between the block and the target position is less than 0.05 m, the task of this episode is completed and the episode counts as a success.
Figure 1.
Illustration of three tasks considered in experiments: (a) Reach, (b) Push, and (c) PickAndPlace.
4.2. Baselines
This study is based on three algorithms, DDPG, TD3, and SAC, and uses a prioritized sampling method to sort and sample the transitions in the experience replay buffer, in order to replay and train the policies for environments with sparse rewards. The framework of PHER combined with the three off-policy algorithms and the comparison are shown in Figure 2. The left side of Figure 2 is the process of PHER from experience collection to storage. The right side of Figure 2a is the DDPG, the right side of Figure 2b is the TD3, and the right side of Figure 2c is the SAC. All three algorithms are off-policy algorithms that can be combined with experience replay algorithms for training. All three algorithms show the process from experience collection to training of each network. The baselines for this experiment are:
- 1. DDPG (TD3, SAC) + HER/RHER, which randomly and uniformly samples the hindsight experiences.
- 2. DDPG + CHER, which uses a curriculum learning method to improve the training efficiency based on HER.
Figure 2.
The framework for combining the method in this paper with three off-policy reinforcement learning algorithms, respectively. The main function of PHER is to sort the priority of the partially stored transitions and store the sorted experiences in the experience replay buffer: (a) PHER+DDPG, (b) PHER+TD3, and (c) PHER+SAC.
4.3. Training Setting
For the three environments mentioned above, each epoch contains 50 episodes, and each episode contains 50 steps. The hyperparameters during training were the same as those for HER. In Algorithm 1, we set the minibatch size k to 256; the remaining hyperparameter settings used in the experiments, including R, are listed in Table 1. Finally, the success rate is computed as the ratio of successful episodes to the total number of episodes.
Table 1.
Hyperparameters used in the experiments.
4.4. Benchmark Result
Figure 3 shows the results of the Reach task. This paper proposes PHER and compares it with HER/RHER. Figure 3a shows the results of combining the three methods with the DDPG. Figure 3b shows the results of combining the three methods with the TD3. Figure 3c shows the results of combining the three methods with the SAC. Comparing these three off-policy reinforcement learning algorithms, the SAC performs best in this task.
Figure 3.
Comparison of the learning curve of Reach tasks obtained by the proposed method and the HER/RHER method. The curve represents the change in success rate: (a) PHER and HER/RHER combined with DDPG, (b) PHER and HER/RHER combined with TD3, and (c) PHER and HER/RHER combined with SAC.
Figure 4 shows the results of the Push task. This paper proposes PHER and compares it with HER/RHER. Figure 4a shows the results of combining the three methods with the DDPG. Figure 4b shows the results of combining the three methods with the TD3. Figure 4c shows the results of combining the three methods with the SAC. The results show that adding priority sorting to HER improves the convergence speed of three off-policy reinforcement learning algorithms. PHER combined with the three algorithms achieves faster success rate convergence than the other methods. In addition, comparing these three off-policy reinforcement learning algorithms, the SAC performs best in this task.
Figure 4.
Comparison of the learning curve of Push task obtained by the proposed method and the HER/RHER method. The curve represents the change in the median success rate, and the shaded area represents the range of the success rate estimated by the 3 random number seeds: (a) PHER and HER/RHER combined with DDPG, (b) PHER and HER/RHER combined with TD3, and (c) PHER, HER/RHER combined with SAC.
Figure 5 shows the results of the PickAndPlace task. This paper proposes PHER and compares it with HER/RHER. Figure 5a shows the results of combining the three methods with the DDPG. The results show that adding priority sorting to HER improves the convergence speed of the DDPG, and the median success rate exceeds HER and quickly converges at 72 epochs. Figure 5b shows the results of combining the three methods with the TD3. The results show that PHER has a median success rate higher than HER and quickly converges at 52 epochs. Figure 5c shows the results of combining the three methods with the SAC. The results show that PHER has a median success rate higher than HER and quickly converges at 83 epochs.
Figure 5.
Comparison diagram of PickAndPlace task learning curve obtained by the proposed method and the HER/RHER method. The curve represents the change in the median success rate, and the shaded area represents the range of the success rate estimated by the three random number seeds: (a) PHER and HER/RHER combined with DDPG, (b) PHER and HER/RHER combined with TD3, and (c) PHER and HER/RHER combined with SAC.
We conducted comparative experiments across three environments and three deep reinforcement learning algorithms. In the Fetch-Reach environment, PHER also demonstrated fast convergence, though with noticeable instability. However, as the environmental complexity increases, PHER exhibits markedly faster convergence compared to HER and RHER. Additionally, PHER's instability is more pronounced under the DDPG algorithm, whereas under SAC, the differences in stability between PHER, HER, and RHER become minimal. In the case of TD3, PHER achieves dominant performance. Based on these findings, we conclude that PHER combined with the TD3 algorithm is better suited to precise operations in complex environments, offering better stability and faster convergence.
We compared the number of convergence epochs when PHER was combined with the DDPG, TD3, and SAC algorithms in different environments and trained with different random seeds. Table 2 presents the comparative results of convergence epochs for different combinations in the Push and PickAndPlace environments. The results indicate that, under the same random-seed settings, PHER achieves the fastest convergence across the three algorithms (DDPG, TD3, and SAC). In the Push environment, the combination of PHER with DDPG converges the fastest, with an average of 53 epochs across different random seeds while maintaining a success rate of 0.9-1.0. In the PickAndPlace environment, the combination of PHER with TD3 shows the fastest convergence, averaging 52 epochs across different random seeds, and the success rate likewise remains at 0.9-1.0. Table 3 presents the comparative results of convergence epochs for different combinations in the Reach environment. The results show that in the Reach environment, PHER and HER exhibit similar convergence epochs during training, both maintaining a success rate of 1.0. Bold values indicate fewer epochs, representing faster convergence.
Table 2.
Comparison of the number of convergence epochs starting in the Push and PickAndPlace environment.
Table 3.
Comparison of convergence epochs in the Reach environment.
From the above experiments, it can be observed that the proposed method demonstrates significant advantages in complex tasks. To further analyze its behavior, we conduct a sensitivity analysis on the TD3 in the PickAndPlace environment. As shown in Figure 6, when the buffer size is fixed, the curve corresponding to α = 0.6 and β = 0.4 achieves the fastest convergence and exhibits higher training stability.
Figure 6.
Hyperparameter sensitivity analysis experiments: (a) sensitivity analysis with respect to α; and (b) sensitivity analysis with respect to β.
This paper trains CHER+DDPG in three environments, Reach, Push, and PickAndPlace, and compares the training results with PHER+DDPG in these three environments. The results show that in the Reach and Push environments, PHER+DDPG can converge quickly, but CHER+DDPG cannot. In the PickAndPlace environment, PHER+DDPG converges at around 250 epochs, while CHER+DDPG’s success rate gradually increases but fails to converge in the end. Figure 7 shows the comparison of the experimental results.
Figure 7.
Comparison of the learning curves of the three tasks obtained by the proposed method and CHER method; the curves represent the changes in success rate. Both methods are combined with the DDPG. (a) PHER and CHER trained in the Reach environment, (b) PHER and CHER trained in the Push environment, and (c) PHER and CHER trained in the PickAndPlace environment.
4.5. On the Real Robot
To validate the effectiveness of the proposed PHER algorithm in real-world scenarios, we deployed the trained policy on a 6-DoF Z-Arm S622 robotic manipulator (Huiling Technology, Shenzhen, China) to perform three tasks: Reach, Push, and PickAndPlace. The successful deployment relied on the following three key techniques. First, we adopted a CNN-based model that fuses RGB and depth information to predict the positions of the target and objects [35]. During the training phase, approximately 1000 real-world images were collected and split into training and validation sets at a ratio of 8:2. The model achieved a mean Average Precision (mAP) of 95.2% on the validation set, providing high-accuracy state inputs for the reinforcement learning policy. Second, during real-world execution, the control loop consisted of image acquisition, CNN inference, simulation stepping, and TCP/IP communication. The measured end-to-end average latency was approximately 120 ms. Since the tasks primarily employed “MoveL” commands for quasi-static point-to-point position control, this latency remained within an acceptable range and did not lead to significant state–action mismatch. Third, to address the sim-to-real gap, we adopted a state decoupling strategy together with high-fidelity modeling. Unlike end-to-end “pixels-to-actions” approaches, the reinforcement learning agent receives precise object coordinates estimated by the CNN. By mapping the calibrated camera coordinates to the robot base coordinate frame and constructing a 1:1 digital twin model in the MuJoCo environment, we effectively mitigated discrepancies in visual perception and dynamics between simulation and reality.
The experimental process was as follows. At the beginning of each episode, a simulation environment was initialized using the object position predicted by the CNN as the initial state of the manipulator environment. The trained policy was then run in the simulation environment, and the resulting position commands were sent to the real robot. Figure 8 shows the three tasks performed on the 6-DoF manipulator. Figure 8a shows the complete process of the Reach task, in which the manipulator end moved to the red marker point while keeping the gripper closed throughout. Figure 8b shows the complete process of the Push task, in which the manipulator end moved the blue cube to the red marker point. Figure 8c shows the complete process of the PickAndPlace task, in which the manipulator end moved the blue block onto the gray block. Each task was tested 20 times, with a 5 cm tolerance used as the success threshold to maintain consistency with the simulation environment.
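The per-episode deployment loop described above can be sketched as follows. All of the callables here (`capture_rgbd`, `cnn_predict`, `sim`, `send_movel`) are hypothetical stand-ins for the camera driver, the CNN detector, the MuJoCo digital twin, and the TCP/IP “MoveL” interface; this is an illustration of the data flow, not the authors' deployment code.

```python
def run_episode(policy, capture_rgbd, cnn_predict, sim, send_movel, max_steps=50):
    """One deployment episode: perceive, initialize the digital twin, then
    step the policy in simulation while mirroring poses to the real robot."""
    rgb, depth = capture_rgbd()               # 1. image acquisition
    obj_pos = cnn_predict(rgb, depth)         # 2. CNN inference (object position,
                                              #    mapped to the robot base frame)
    obs = sim.reset(object_position=obj_pos)  # 3. initialize the 1:1 digital twin
    for _ in range(max_steps):
        action = policy(obs)                  # trained PHER policy
        obs, done = sim.step(action)          # 4. simulation stepping
        send_movel(sim.end_effector_pose())   # 5. TCP/IP "MoveL" position command
        if done:
            break
```

Because each “MoveL” command is quasi-static point-to-point motion, the roughly 120 ms loop latency does not accumulate into a state–action mismatch: the simulation, not the delayed real robot, provides the state fed to the policy.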
Figure 8.
Demonstration of three tasks’ policy deployed on a real 6-DoF manipulator: (a) Reach task, (b) Push task, and (c) PickAndPlace task.
Finally, as shown in Table 4, in the real-world experiments PHER achieved a success rate of 100% on the Reach task, 85% on the Push task, and 75% on the PickAndPlace task. In comparison, HER achieved 95%, 65%, and 50%, respectively. Overall, PHER significantly outperformed HER in terms of success rate.
Table 4.
Comparison of success rates between PHER and HER in real-world tasks.
5. Conclusions
This study introduces a method (PHER) to solve the sparse reward problem of a manipulator grasping task. The method is combined with off-policy reinforcement learning algorithms and applied to three manipulation tasks. It improves upon HER by adding two steps: sorting experiences by priority and sampling according to that priority, which alleviates the low experience utilization caused by HER's random uniform sampling. By combining PHER with three off-policy reinforcement learning algorithms, experiments are conducted in three manipulator environments. Compared with HER and CHER, the median success rate converges significantly faster. However, PHER exhibits relatively lower stability than the other algorithms, and although it completes more complex tasks faster, its accuracy still requires further improvement. Future work will pursue better methods for the sparse reward problem, further increase the utilization of experience, and enable the manipulator to converge in complex environments.
Author Contributions
Conceptualization, S.C. and M.Y.; methodology, D.Z.; software, M.Y.; validation, Y.W. and Y.D.; formal analysis, M.Y.; resources, D.Z.; writing—original draft preparation, M.Y. and D.Z.; writing—review and editing, Y.D.; visualization, Y.W. and K.Z.; funding acquisition, S.C. All authors have read and agreed to the published version of the manuscript.
Funding
This work was funded by the National Natural Science Foundation of China (52275035), Central Government Guides Local Science and Technology Development Foundation, China (246Z1815G), Science Research Project of Hebei Education Department, China (CXZX2025003), and, in part, Industry–University Research Cooperation Project of Universities in Shijiazhuang City, stationed in Hebei Province, China, under grant (2510800401A).
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
Data are contained within the article.
Conflicts of Interest
Author Mutian Yang was employed by the company Zoomlion Heavy Industry Science & Technology Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
References
- Mannion, P.; Devlin, S.; Mason, K.; Duggan, J.; Howley, E. Policy invariance under reward transformations for multi-objective reinforcement learning. Neurocomputing 2017, 263, 60–73. [Google Scholar] [CrossRef]
- Shi, C. Statistical inference in reinforcement learning: A selective survey. arXiv 2025, arXiv:2502.16195. [Google Scholar] [CrossRef]
- Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef]
- Zheng, K.; Ganapati, V. Hyperparameter optimization for deep reinforcement learning: An Atari Breakout case study. J. Stud. Res. 2025, 14. [Google Scholar] [CrossRef]
- Liu, Y.; Liu, X. Adventurer: Exploration with BiGAN for deep reinforcement learning. Appl. Intell. 2025, 55, 726. [Google Scholar] [CrossRef]
- Moreno-Vera, F. Performing deep recurrent double Q-learning for Atari games. In Proceedings of the 2019 IEEE Latin American Conference on Computational Intelligence (LA-CCI); IEEE: New York, NY, USA, 2019; pp. 1–6. [Google Scholar]
- Silver, D.; Schrittwieser, J.; Simonyan, K.; Antonoglou, I.; Huang, A.; Guez, A.; Hubert, T.; Baker, L.; Lai, M.; Bolton, A.; et al. Mastering the game of Go without human knowledge. Nature 2017, 550, 354–359. [Google Scholar] [CrossRef] [PubMed]
- Zheng, X. The advancements and applications of deep reinforcement learning in Go. In ITM Web of Conferences; EDP Sciences: Les Ulis, France, 2025. [Google Scholar] [CrossRef]
- Jang, S.; Kim, H.-I. Efficient deep reinforcement learning under task variations via knowledge transfer for drone control. ICT Express 2024, 10, 576–582. [Google Scholar] [CrossRef]
- Kaufmann, E.; Bauersfeld, L.; Loquercio, A.; Müller, M.; Koltun, V.; Scaramuzza, D. Champion-level drone racing using deep reinforcement learning. Nature 2023, 620, 982–987. [Google Scholar] [CrossRef] [PubMed]
- Kamil, Z.; Abdulazeez, A. A review on deep reinforcement learning for autonomous driving. Indones. J. Comput. Sci. 2024. [Google Scholar] [CrossRef]
- Kiran, B.R.; Sobh, I.; Talpaert, V.; Mannion, P.; Sallab, A.A.A.; Yogamani, S.; Pérez, P. Deep reinforcement learning for autonomous driving: A survey. IEEE Trans. Intell. Transp. Syst. 2022, 23, 4909–4926. [Google Scholar] [CrossRef]
- Yu, J.; Feng, X.; Gong, D.; Gong, Y. SAR-PPO (Segmented Adaptive Reward): Robotic arm open door motion control with reinforcement learning based on segmented adaptive reward. In Proceedings of the 2024 43rd Chinese Control Conference (CCC); IEEE: New York, NY, USA, 2024; pp. 2970–2975. [Google Scholar] [CrossRef]
- Kwon, G.; Kim, B.; Kwon, N. Reinforcement learning with task decomposition and task-specific reward system for automation of high-level tasks. Biomimetics 2024, 9, 196. [Google Scholar] [CrossRef] [PubMed]
- Hong, C.; Lee, T.-E. Multi-agent reinforcement learning approach for scheduling cluster tools with condition based chamber cleaning operations. In Proceedings of the 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA); IEEE: New York, NY, USA, 2018; pp. 885–890. [Google Scholar]
- Pavlichenko, D.; Behnke, S. Dexterous Pre-Grasp Manipulation for Human-Like Functional Categorical Grasping: Deep Reinforcement Learning and Grasp Representations. IEEE Trans. Autom. Sci. Eng. 2026, 23, 2231–2244. [Google Scholar] [CrossRef]
- Wang, Y.; Yu, W.; Wu, H.; Guo, H.; Dong, H. SA-DEM: Dexterous Extrinsic Robotic Manipulation of Non-Graspable Objects via Stiffness-Aware Dual-Stage Reinforcement Learning. IEEE Trans. Autom. Sci. Eng. 2026, 23, 347–362. [Google Scholar] [CrossRef]
- Liao, J.; Xiong, P.; Zhou, M.; Liu, P.X.; Song, A. Adversarial Subgraph Contrastive Learning for Predicting Grasp Stability of Robotic Hands With Multimodal Signals. IEEE Trans. Autom. Sci. Eng. 2025, 22, 17720–17733. [Google Scholar] [CrossRef]
- Sodhani, S.; Zhang, A.; Pineau, J. Multi-task reinforcement learning with context-based representations. In Proceedings of the 38th International Conference on Machine Learning, PMLR, Virtual Event, 18–24 July 2021. [Google Scholar]
- Yoganathan, V.; Osburg, V.-S.; Kunz, W.H.; Toporowski, W. Check-in at the Robo-desk: Effects of automated social presence on social cognition and service implications. Tour. Manag. 2021, 85, 104309. [Google Scholar] [CrossRef]
- Zhang, M.; Cai, W.; Pang, L. Predator-prey reward based Q-learning coverage path planning for mobile robot. IEEE Access 2023, 11, 29673–29683. [Google Scholar] [CrossRef]
- Karalakou, A.; Troullinos, D.; Chalkiadakis, G.; Papageorgiou, M. Deep reinforcement learning reward function design for autonomous driving in lane-free traffic. Systems 2023, 11, 134. [Google Scholar] [CrossRef]
- Cao, J.; Dong, L.; Yuan, X.; Wang, Y.; Sun, C. Hierarchical multi-agent reinforcement learning for cooperative tasks with sparse rewards in continuous domain. Neural Comput. Appl. 2023, 36, 273–287. [Google Scholar] [CrossRef]
- Devidze, R.; Kamalaruban, P.; Singla, A. Exploration-guided reward shaping for reinforcement learning under sparse rewards. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2022; Volume 35, pp. 5829–5842. [Google Scholar]
- Christiano, P.F.; Leike, J.; Brown, T.B.; Martic, M.; Legg, S.; Amodei, D. Deep reinforcement learning from human preferences. In Proceedings of the 31st International Conference on Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2017; pp. 4302–4310. [Google Scholar]
- Andrychowicz, M.; Wolski, F.; Ray, A.; Schneider, J.; Fong, R.; Welinder, P.; McGrew, B.; Tobin, J.; Abbeel, P.; Zaremba, W. Hindsight experience replay. In Proceedings of the 31st International Conference on Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2017; pp. 5055–5065. [Google Scholar]
- Fang, M.; Zhou, T.; Du, Y.; Han, L.; Zhang, Z. Curriculum-guided hindsight experience replay. Adv. Neural Inf. Process. Syst. 2019, 32, 1131. [Google Scholar]
- Fang, M.; Zhou, C.; Shi, B.; Gong, B.; Xu, J.; Zhang, T. DHER: Hindsight experience replay for dynamic goals. In Proceedings of the International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
- Luo, Y.; Wang, Y.; Dong, K.; Zhang, Q.; Cheng, E.; Sun, Z.; Song, B. Relay hindsight experience replay: Self-guided continual reinforcement learning for sequential object manipulation tasks with sparse rewards. Neurocomputing 2023, 557, 126620. [Google Scholar] [CrossRef]
- Schaul, T.; Quan, J.; Antonoglou, I.; Silver, D. Prioritized experience replay. In Proceedings of the International Conference on Learning Representations (ICLR), San Juan, Puerto Rico, 2–4 May 2016. [Google Scholar]
- Silver, D.; Lever, G.; Heess, N.; Degris, T.; Wierstra, D.; Riedmiller, M. Deterministic policy gradient algorithms. In Proceedings of the 31st International Conference on Machine Learning, PMLR, Beijing, China, 21–26 June 2014. [Google Scholar]
- Fujimoto, S.; van Hoof, H.; Meger, D. Addressing function approximation error in actor-critic methods. In Proceedings of the 35th International Conference on Machine Learning, PMLR, Stockholm, Sweden, 10–15 July 2018. [Google Scholar]
- Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning, PMLR, Stockholm, Sweden, 10–15 July 2018. [Google Scholar]
- Akinola, I.; Xu, J.; Song, S.; Allen, P.K. Dynamic grasping with reachability and motion awareness. In Proceedings of the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS); IEEE: New York, NY, USA, 2021; pp. 9422–9429. [Google Scholar]
- Tobin, J.; Fong, R.; Ray, A.; Schneider, J.; Zaremba, W.; Abbeel, P. Domain randomization for transferring deep neural networks from simulation to the real world. In Proceedings of the 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS); IEEE: New York, NY, USA, 2017; pp. 23–30. [Google Scholar]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.