Episodic Self-Imitation Learning with Hindsight

Episodic self-imitation learning, a novel self-imitation algorithm with a trajectory selection module and an adaptive loss function, is proposed to speed up reinforcement learning. Compared to the original self-imitation learning algorithm, which samples good state-action pairs from the experience replay buffer, our agent leverages entire episodes with hindsight to aid self-imitation learning. A selection module is introduced to filter uninformative samples from each episode of the update. The proposed method overcomes the limitations of the standard self-imitation learning algorithm, a transitions-based method which performs poorly in handling continuous control environments with sparse rewards. From the experiments, episodic self-imitation learning is shown to perform better than baseline on-policy algorithms, achieving comparable performance to state-of-the-art off-policy algorithms in several simulated robot control tasks. The trajectory selection module is shown to prevent the agent learning undesirable hindsight experiences. With the capability of solving sparse reward problems in continuous control settings, episodic self-imitation learning has the potential to be applied to real-world problems that have continuous action spaces, such as robot guidance and manipulation.


Introduction
Reinforcement learning (RL) has been shown to be very effective in training agents within gaming environments [1,2], particularly when combined with deep neural networks [3,2,4]. In most task settings that are solved by RL algorithms, reward shaping is an essential requirement for guiding the learning of the agent. Reward shaping, however, often requires significant quantities of domain knowledge that are highly task-specific [5] and, even with careful design, can lead to undesired policies. Moreover, for complex robotic manipulation tasks, manually designing reward shaping functions to guide the learning agent becomes intractable [6,7] if even minor variations to the task are introduced. For such settings, the application of deep reinforcement learning requires algorithms that can learn from unshaped, and usually sparse, reward signals. The complicated dynamics of robot manipulation exacerbate the difficulty posed by sparse rewards, especially for on-policy RL algorithms. For example, achieving goals that require successfully executing multiple steps over a long horizon involves high-dimensional control that must also generalise to work across variations in the environment for each step. These aspects of robot control result in a situation where a naive RL agent so rarely receives a reward at the start of training that it is not able to learn at all. A common solution in the robotics community is to collect a sufficient quantity of expert demonstrations, then use imitation learning to train the agent. However, in some scenarios, demonstrations are expensive to collect and the achievable performance of a trained agent is restricted by their quantity. One solution is to use the valuable past experiences of the agent to enhance training, and this is particularly useful in sparse reward environments.
arXiv:2011.13467v1 [cs.AI] 26 Nov 2020
To alleviate the problems associated with having sparse rewards, there are two kinds of approaches: imitation learning and hindsight experience replay (HER). First, the standard approach of imitation learning is to use supervised learning algorithms and minimise a surrogate loss with respect to an oracle. The most common form is learning from demonstrations [8,9]. Similar techniques are applied to robot manipulation tasks [10,11,12,13]. When demonstrations are not attainable, self-imitation learning (SIL) [14], which uses past good experiences (episodes in which the goal is achieved), can be used to enhance exploration or speed up the training of the agent. Self-imitation learning works well in discrete control environments, such as Atari games. Whilst SIL is able to learn policies for continuous control tasks with dense or delayed rewards [14], the present experiments suggest that it struggles when rewards are sparse. Recently, hindsight experience replay has been proposed to solve such goal-conditional, sparse reward problems. The main idea of HER [15] is that during replay, the selected transitions are sampled from state-action pairs derived from achieved goals that are substituted for the real goals of the task; this increases the frequency of positive rewards. Hindsight experience replay is used with off-policy RL algorithms, such as DQN [1] and DDPG [16], for experience replay and has several extensions [17,18]. The present experiments show that simply applying HER with SIL does not lead to an agent capable of performing tasks from the Fetch robot environment. In summary, self-imitation learning with on-policy algorithms for tasks that require continuous control, and for which rewards are sparse, remains unsolved.
In this paper, episodic self-imitation learning (ESIL) for goal-oriented problems that provide only sparse rewards is proposed and combined with a state-of-the-art on-policy RL algorithm: proximal policy optimization (PPO). In contrast to standard SIL, which samples past good transitions from the replay buffer for imitation learning, the proposed ESIL adopts entire current episodes (successful or not), and modifies them into "expert" trajectories based on HER. An extra trajectory selection module is also introduced to relieve the effects of sample correlation [19] in updating the network. Figure 1 shows the difference between naive SIL+HER and ESIL. During training by SIL+HER, a batch of transitions is sampled from the replay buffer; these are modified into "hindsight experiences" and used directly in self-imitation learning. In contrast, ESIL utilises entire current collected episodes and converts them into hindsight episodes. The trajectory selection module removes undesired transitions in the hindsight episodes. Using tasks from the OpenAI Fetch environment, this paper demonstrates that the proposed ESIL approach is effective in training agents which are required to solve continuous control problems, and shows that it achieves state-of-the-art results on several tasks. The primary contribution of this paper is a novel episodic self-imitation learning (ESIL) algorithm that can solve continuous control problems in environments providing only sparse rewards; in doing so, it also empirically answers an open question posed by [20]. The proposed ESIL approach also provides a more efficient way to perform exploration in goal-conditional settings than the standard self-imitation learning algorithm. Finally, this approach achieves, to our knowledge, the best results for four moderately complex robot control tasks in simulation. The paper is organised as follows: Sections 2 and 3 provide an introduction to related work and corresponding background. Section 4 describes the methodology of the proposed ESIL approach. Section 5 introduces the settings and results of the experiments. Finally, Section 6 provides concluding remarks and suggestions for future research.

Related Work
Imitation learning (IL) can be divided into two main categories: behavioural cloning and inverse reinforcement learning [21]. Behavioural cloning involves the learning of behaviours from demonstrations [22,23,24]. Other extensions have an expert in the loop, such as DAgger [25], or use an adversarial paradigm for the behavioural cloning method [26,27]. Inverse reinforcement learning estimates a reward model from expert trajectories [28,29,30]. Learning from demonstrations is powerful for complex robotic manipulation tasks [31,32,33,10,34]. [26] propose generative adversarial imitation learning (GAIL), which employs generative adversarial training to match the distribution of state-action pairs of demonstrations. Compared with behavioural cloning, the GAIL framework shows strong improvements in continuous control tasks. In the work of [35], goalGAIL is proposed to speed up the training in goal-conditional environments; goalGAIL was also shown to be able to learn from demonstrations without action information. Prior work has used demonstrations to accelerate learning [10,11,12]. Demonstrations are often collected from an expert policy or from human actions. In contrast to these approaches, episodic self-imitation learning (ESIL) does not need demonstrations.
Self-imitation learning (SIL) [14] is used for exploiting past experiences for parametric policies. It has a similar flavor to [36,37], in that the agent learns from imperfect demonstrations. During training, past good experiences are stored in the replay buffer. When SIL starts, transitions are sampled from the replay buffer according to the advantage values. In the work of [38], generalised SIL was proposed as an extension of SIL. It uses an n-bound Q-learning approach to generalise the original SIL technique, and shows robustness to a wide range of continuous control tasks. Generalised SIL can also be combined with both deterministic and stochastic RL algorithms. [39] points out that using imitation learning with past good experience could lead to a sub-optimal policy. Instead of imitating past good trajectories, a trajectory-conditioned policy [39] is proposed to imitate trajectories in diverse directions, encouraging exploration in environments where exploration is otherwise difficult. Unlike SIL, episodic self-imitation learning (ESIL) applies HER to the current episodes to create "imperfect" demonstrations for imitation learning; this also requires introducing a trajectory-selection module to reject undesired samples from the hindsight experiences. In the work of [19], it was shown that the agent benefits from using whole episodes in updates, rather than uniformly sampling the sparse or delayed reward environments. The present experiments suggest that episodic self-imitation learning achieves better performance in an agent that must learn to perform continuous control in environments delivering sparse rewards.
Recently, the technique known as hindsight learning was developed. Hindsight experience replay (HER) [15] is an algorithm that can overcome the exploration problems in multi-goal environments that deliver sparse rewards. Hindsight policy gradient (HPG) [40] introduces techniques that enable the learning of goal-conditional policies using hindsight experiences. However, the current implementation of HPG has only been evaluated for agents that need to perform discrete actions, and one drawback of hindsight policy gradient estimators is the computational cost of the goal-oriented sampling. An extension of HER, called dynamic hindsight experience replay (DHER) [41], was proposed to deal with dynamic goals. [42] uses the GAIL framework [26] to generate trajectories that are similar to hindsight experiences; it then applies imitation learning, using these trajectories. Competitive Experience Replay (CER) complements HER by introducing a competition between two agents for exploration [18]. [43] point out that the hindsight trajectories which contain higher energy are more valuable during training, leading to a more efficient learning system. [44] proposed curriculum-guided HER, which incorporates curriculum learning in the work. During training, the agent focuses on the closest goals in the initial stage, then focuses on expanding the diversity of goals. This approach accelerates training compared with other baseline methods. Unlike these works, episodic self-imitation learning (ESIL) combines episodic hindsight experiences with imitation learning, which aids learning at the start of training. Furthermore, ESIL can be applied to continuous control, making it more suitable for control problems that demand greater precision.

Reinforcement Learning
Reinforcement Learning (RL) can be formulated under the framework of a Markov Decision Process (MDP); it is used to learn an optimal policy to solve sequential decision-making problems. In each time step $t$, the state $s_t$ is received by the agent from the environment. An action $a_t$ is sampled by the agent according to its policy $\pi_\theta(a_t \mid s_t)$, parameterised by $\theta$, which, in deep reinforcement learning, represents the weights of an artificial neural network. Then, the state $s_{t+1}$ and reward $r_{t+1}$ are provided by the environment to the agent. The goal is to have the agent learn a policy that maximises the expected return
$$\mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1}\right],$$
where $\gamma$ is the discount factor. In a robot control setting, the state $s_t$ can be the velocity and position of each joint of the robotic arm. The action $a_t$ can be the velocities of actuators (control signals) and the reward $r_t$ might be calculated based on the distance between the gripper of the robot arm and the target position.

Proximal Policy Optimization
In this work, proximal policy optimization (PPO) [46] is selected as our base RL algorithm. This is a state-of-the-art, on-policy actor-critic approach to training. The actor-critic architecture is common in deep RL; it is composed of an actor network which is used to output a policy, and a critic network which outputs a value to evaluate the current state, $s_t$. PPO has been widely tested in robot control [47] and video games [48]. In contrast with the "vanilla" policy gradient algorithms, PPO learns the policy using a surrogate objective function
$$L^{CLIP}(\theta) = \mathbb{E}_t\left[\min\left(\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)} A_t,\ \mathrm{clip}\left(\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)},\ 1-\epsilon,\ 1+\epsilon\right) A_t\right)\right],$$
where $\pi_\theta(a_t \mid s_t)$ is the current policy and $\pi_{\theta_{old}}(a_t \mid s_t)$ is the old policy; $\epsilon$ is a clipping ratio which limits the change between the updated and the previous policy during the training process. $A_t$ is the advantage value, which can be estimated as $R(s_t, a_t) - V(s_t)$, with $R(s_t, a_t)$ being the return value, and $V(s_t)$ the state value predicted by the critic network.
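To make the effect of the clipping ratio concrete, the following is a minimal sketch of the clipped surrogate objective for a small batch, written in plain Python. It is an illustration under stated assumptions rather than the implementation used in the experiments: the function name `clipped_surrogate` and its arguments are hypothetical, the policy is represented only through per-sample log-probabilities, and the expectation is approximated by a batch mean.

```python
import math

def clipped_surrogate(logp_new, logp_old, advantages, clip_ratio=0.2):
    """Batch mean of the clipped surrogate objective (a quantity to maximise).

    logp_new / logp_old: log-probabilities of the taken actions under the
    current and the old policy (hypothetical inputs for this sketch).
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)  # pi_theta / pi_theta_old
        clipped = min(max(ratio, 1.0 - clip_ratio), 1.0 + clip_ratio)
        # Taking the minimum removes any incentive to move the ratio
        # outside the interval [1 - clip_ratio, 1 + clip_ratio].
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

With a positive advantage, a probability ratio of 1.5 contributes only 1.2 times the advantage, so the update gains nothing from moving further from the old policy; with a negative advantage the unclipped (more pessimistic) term is kept.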

Hindsight Experiences and Goals
The experiments follow the terminology suggested by OpenAI [20], in which the possible goals are drawn from $G$, and the goal being pursued does not influence the environment dynamics. In ESIL, two types of goal are recognised. One is the desired goal $g \in G$, which is the target position or state, and may be different for different episodes. Within a single episode, $g$ is constant. The second type of goal is the achieved goal $g^{ac}$, which is the achieved state in the environment, and this is considered to be different at each time step in an episode. In an episode, each transition can be represented as $(s_t \mid g, g^{ac}_t, a_t, r_t, s_{t+1} \mid g, g^{ac}_{t+1})$, where $s_t$ indicates a state, $a_t$ indicates an action and $r_t$ indicates a reward; the symbol $\mid$ is simply used to represent grouping of goals.
In sparse reward settings, an agent will only get positive rewards when the desired goal $g$ is achieved. The sparse reward function $r_t = r(s_t, a_t, g)$ (Equation 3) is defined using a threshold value $\epsilon$, which is used to identify whether the agent has achieved the goal. However, the desired goal, $g$, might be difficult to reach during training. Thus, hindsight experiences are created by replacing the original desired goal $g$ with the current achieved goal $g^{ac}_t$ to augment the successful samples, and then the reward $r'_t$ can be recomputed according to Equation 3. The modification of the desired goal can be denoted as $g'$, and transitions from hindsight experiences can be represented as $(s_t \mid g', g^{ac}_t, a_t, r'_t, s_{t+1} \mid g', g^{ac}_{t+1})$. Intuitively, introducing $g'$ serves a useful purpose in the early stages of training; taking, for example, a robot reaching task, the agent has no prior concept of how to move its effector to a specific location in space. Thus, even these original failed episodes contain valuable information for ultimately learning a useful control policy for the original, desired goal $g$.
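The relabelling step described above can be sketched as follows. This is a hedged illustration, not the authors' code: the dictionary keys (`goal`, `achieved_goal_next`, `reward`) and both function names are hypothetical, the reward follows the Fetch-style convention of 0 on success and -1 otherwise with a threshold of 0.05, the reward of a transition is assumed to be computed from the goal achieved after the action, and the hindsight goal is taken to be the goal achieved at the end of the episode.

```python
import math

def sparse_reward(achieved_goal, goal, threshold=0.05):
    """Fetch-style sparse reward: 0 within `threshold` of the goal, else -1."""
    return 0.0 if math.dist(achieved_goal, goal) <= threshold else -1.0

def relabel_episode(episode, threshold=0.05):
    """Return a hindsight copy of `episode` (a list of transition dicts).

    The desired goal of every transition is replaced by the goal achieved
    at the end of the episode, and rewards are recomputed for that goal.
    The original episode is left unchanged.
    """
    hindsight_goal = episode[-1]["achieved_goal_next"]
    relabelled = []
    for transition in episode:
        new_transition = dict(transition)  # shallow copy of the transition
        new_transition["goal"] = hindsight_goal
        new_transition["reward"] = sparse_reward(
            transition["achieved_goal_next"], hindsight_goal, threshold
        )
        relabelled.append(new_transition)
    return relabelled
```

Under these assumptions, a failed episode whose rewards are all -1 becomes a hindsight episode whose final reward is 0, since the last achieved goal trivially matches the substituted goal.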

Methodology
The proposed method combines PPO and episodic self-imitation learning to maximally use hindsight experiences for exploration to improve learning. Recent advances in episodic backward update [19] and hindsight experiences [15] are also leveraged to guide exploration for on-policy RL.

Episodic Self-Imitation Learning
The present method aims to use episodic hindsight experiences to guide the exploration of the PPO algorithm. To this end, hindsight experiences are created from current episodes.
For an episode $i$, let there be $T$ time steps; after $T$ steps, a series of transitions $\tau_i = \{(s_t \mid g, g^{ac}_t, a_t, r_t, s_{t+1} \mid g, g^{ac}_{t+1})\}_{t=0:T-1}$ is collected. When the final achieved goal $g^{ac}_T$ does not coincide with the desired goal $g$, it implies that in this episode, the agent failed to achieve the original goal. To create hindsight experiences, the achieved goal $g^{ac}_T$ in the last state $s_T$ is simply selected and considered as the modified desired goal $g'$, i.e., $g' = g^{ac}_T$. Next, a new reward $r'_t$ is computed under the new goal $g'$. Then, a new "imagined" episode is obtained, and a new series of transitions $\tau'_i = \{(s_t \mid g', g^{ac}_t, a_t, r'_t, s_{t+1} \mid g', g^{ac}_{t+1})\}_{t=0:T-1}$ is collected.
Then, an approach to self-imitation learning based on episodic hindsight experiences is proposed, which applies the policy updates to both hindsight and in-environment episodes. PPO, a state-of-the-art on-policy algorithm, is used as the base RL algorithm. With current and corresponding hindsight experiences, a new objective function is introduced and defined as
$$L = \alpha \cdot L_{PPO} + \beta \cdot L_{ESIL}, \tag{4}$$
where $\alpha$ is the weight coefficient of $L_{PPO}$. In the experiments, we set $\alpha = 1$ as default to balance the contribution of $L_{PPO}$ and $L_{ESIL}$. $L_{PPO}$ is the loss of PPO, which can be written as
$$L_{PPO} = L_{policy}(\theta) - c \cdot L_{value}(\eta), \tag{5}$$
where $L_{policy}$ is the policy loss, which is parameterised by $\theta$, $L_{value}$ is the value loss, which is parameterised by $\eta$, and $c$ is the weight coefficient of $L_{value}$, which is set to 1 to match the default PPO setting [46]. The policy loss, $L_{policy}$, can be represented as
$$L_{policy} = \mathbb{E}_{s_t, a_t, g \in \mathcal{T}}\left[\min\left(\frac{\pi_\theta(a_t \mid s_t, g)}{\pi_{\theta_{old}}(a_t \mid s_t, g)} A_t,\ \mathrm{clip}\left(\frac{\pi_\theta(a_t \mid s_t, g)}{\pi_{\theta_{old}}(a_t \mid s_t, g)},\ 1-\epsilon,\ 1+\epsilon\right) A_t\right)\right].$$
Here, $A_t$ is the advantage value, and can be computed as $R_t - V_\eta(s_t, g)$. $V_\eta(s_t, g)$ is the state value at time step $t$, which is predicted by the critic network. $R_t$ is the return at time step $t$, $\epsilon$ is the clip ratio, and $\mathcal{T}$ indicates original trajectories. The value loss is a squared error loss
$$L_{value} = \mathbb{E}_{s_t, g \in \mathcal{T}}\left[\left(V_\eta(s_t, g) - R_t\right)^2\right].$$
For the $L_{ESIL}$ term, $\beta$ is an adaptive weight coefficient of $L_{ESIL}$; it can be defined as the ratio of samples which are selected for self-imitation learning
$$\beta = \frac{N_{ESIL}}{N_{Total}},$$
where $N_{ESIL}$ is the number of samples used for self-imitation learning and $N_{Total}$ is the total number of collected samples. The episodic self-imitation learning loss $L_{ESIL}$ can be written as
$$L_{ESIL} = \mathbb{E}_{s_t, a_t, g' \in \mathcal{T}'}\left[\log \pi_\theta(a_t \mid s_t, g') \cdot F_t\right], \tag{8}$$
where $\mathcal{T}'$ indicates hindsight trajectories and $F_t$ is the trajectory selection module, which is based on the returns of the current episodes, $R$, and the returns of the corresponding hindsight experiences, $R'$.
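As a small numerical illustration of how the adaptive weight and the self-imitation term combine with the PPO term, consider the sketch below. It assumes that the PPO term has already been computed as a scalar, that the policy's log-probabilities under the hindsight goal are available per transition, and that the expectation is approximated by a mean over all hindsight transitions; the function name `esil_objective` and its arguments are hypothetical.

```python
def esil_objective(ppo_term, hindsight_logps, selected, alpha=1.0):
    """Combine a precomputed PPO term with the self-imitation term.

    hindsight_logps: log pi(a_t | s_t, hindsight goal), one per transition
    selected: 0/1 selection flags, one per hindsight transition
    Returns (combined objective, beta).
    """
    n_total = len(selected)
    n_esil = sum(selected)
    beta = n_esil / n_total  # fraction of samples kept for self-imitation
    # Masked mean of log-probabilities: unselected samples contribute zero.
    l_esil = sum(lp * f for lp, f in zip(hindsight_logps, selected)) / n_total
    return alpha * ppo_term + beta * l_esil, beta
```

For example, `esil_objective(0.5, [-1.0, -2.0, -3.0, -4.0], [1, 0, 1, 0])` selects half of the samples, so the weight is 0.5 and the self-imitation term is the masked mean -1.0. When no hindsight sample is selected, the weight is zero and only the PPO term remains.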

Episodic Update with Hindsight
Two important issues of ESIL are: (1) hindsight experiences are sub-optimal, and (2) updating networks with correlated trajectories has a detrimental effect. Although episodic self-imitation learning makes exploration more effective, hindsight experiences are not from experts and are not "perfect" demonstrations. As training continues, if the agent always learns from these imperfect demonstrations, the policy will become stuck at a sub-optimal solution, or will overfit.
To prevent the agent learning from imperfect hindsight experiences, hindsight experiences are actively selected based on returns. With the same action, different goals may lead to different results. The proposed method only selects hindsight experiences that can achieve higher returns. The trajectory selection module is illustrated in Figure 2. For an episodic experience and its hindsight experience, the returns can be calculated separately. In a trajectory, at time step $t$, the return $R_t$ can be calculated by
$$R_t = \sum_{k=t}^{T-1} \gamma^{k-t} r_k.$$
For the hindsight experiences, similarly, the return $R'_t$ for each time step, $t$, with respect to the hindsight goals $g'$, can be calculated. Based on the modified trajectory $\tau'_i$, which has the same length as $\tau_i$, we therefore have the returns $R'^{i}_0, R'^{i}_1, R'^{i}_2, \ldots, R'^{i}_{T-1}$. During training, the hindsight experiences with higher returns are used for self-imitation learning. The rest of the hindsight experiences are regarded as worthless samples and ignored. Then, Equation (8) can be rewritten as
$$L_{ESIL} = \mathbb{E}_{s_t, a_t, g' \in \mathcal{T}', g \in \mathcal{T}}\left[\log \pi_\theta(a_t \mid s_t, g') \cdot F(s_t, g, g')\right],$$
where $F(s_t, g, g')$ is the trajectory selection module. The selection function can be expressed as
$$F(s_t, g, g') = \mathbb{1}\left[R(s_t, g') > R(s_t, g)\right], \tag{10}$$
where $\mathbb{1}(\cdot)$ is the unit step function. Consider the OpenAI FetchReach environment as an example. For a failed trajectory, the rewards $r_t$ are $\{-1, -1, \cdots, -1\}$. The desired goal is modified to construct a new hindsight trajectory and the new rewards $r'_t$ become $\{-1, -1, \cdots, 0\}$. Then, $R$ and $R'$ can be calculated separately.
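The FetchReach example can be worked through numerically. The sketch below assumes that the return at each time step is the discounted sum of the remaining rewards in the episode, uses the discount factor of 0.98 reported for the experiments, and shortens the episode to five steps for readability; the function names are hypothetical.

```python
def discounted_returns(rewards, gamma=0.98):
    """Return-to-go for each time step: R_t = r_t + gamma * R_{t+1}."""
    returns, running = [], 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    return returns[::-1]

def select_hindsight(rewards, hindsight_rewards, gamma=0.98):
    """Flag 1 where the hindsight return strictly exceeds the original return."""
    original = discounted_returns(rewards, gamma)
    hindsight = discounted_returns(hindsight_rewards, gamma)
    return [1 if h > o else 0 for o, h in zip(original, hindsight)]

# Failed episode: every original reward is -1; the hindsight episode ends in 0.
failed_flags = select_hindsight([-1.0] * 5, [-1.0, -1.0, -1.0, -1.0, 0.0])

# Successful episode: relabelling changes nothing, so nothing is selected.
success_flags = select_hindsight([-1.0, 0.0, 0.0], [-1.0, 0.0, 0.0])
```

In the failed episode every hindsight return is strictly higher than the original one, so all five transitions are selected; in the successful episode the two returns coincide and the strict inequality rejects every hindsight sample.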
From a goal perspective, episodic self-imitation learning (ESIL) tries to explore (desired) goals to get positive returns. It can be viewed as a form of multi-task learning, because ESIL has two objective functions to be optimised jointly. It is also related to self-imitation learning (SIL) [14]. However, the difference is that SIL uses $(R - V_\theta(s))_+$ on past experiences to learn to choose the action chosen in the past in a given state, rather than goals. The full description of ESIL can be found in Algorithm 1.

Experiments and Results
The proposed method is evaluated on several multi-goal environments, including the Empty Room environment and the OpenAI Fetch environments (see Figure 3). The Empty Room environment is a toy example, and has discrete action spaces. In the Fetch environments, there are four robot tasks with continuous action spaces. To obtain a comprehensive comparison between the proposed method and other baseline approaches, suitable baseline approaches are selected for different environments. Ablation studies of the trajectory selection module are also performed.

Algorithm 1 Proximal policy optimization (PPO) with Episodic Self-Imitation Learning (ESIL)
Require: an actor network $\pi(s, g \mid \theta)$, a critic network $V(s, g \mid \eta)$, the maximum steps $T$ of an episode, a reward function $r$
1: for iteration = 1, 2, ... do
5:         for t = 0, 1, ..., T − 1 do
6:             Sample an action $a_t$ using the actor network $\pi(s_t, g \mid \theta)$
7:             Execute the action $a_t$ and observe a new state $s_{t+1}$
8:             Store the transition $(s_t \mid g, g^{ac}_t, a_t, r_t, s_{t+1} \mid g, g^{ac}_{t+1})$ in $\tau$
9:         end for
10:        for each transition $(s_t, a_t, r_t, g, g^{ac}_t)$ in $\tau$ do
11:            Clone the transition and replace $g$ with $g'$, where $g' = g^{ac}_T$
12:            $r'_t := r(s_t, a_t, g')$
13:            Store the transition $(s_t, a_t, r'_t, g', g^{ac}_t)$ in $\tau'$
14:        end for
15:        Store the trajectory $\tau$ and the hindsight trajectory $\tau'$ in $\mathcal{T}$ and $\mathcal{T}'$, respectively
16:    end for
17:    Calculate the returns $R$ and $R'$ for all transitions in $\mathcal{T}$ and $\mathcal{T}'$, respectively
18:    Calculate the PPO loss: $L_{PPO} = L_{policy}(\theta) - c \cdot L_{value}(\eta)$ using $\mathcal{T}$ (Equation 5)
19:    Calculate the ESIL loss: $L_{ESIL}(\theta)$ using $\mathcal{T}'$, $R$ and $R'$ (Equation 8)
20:    Update the parameters $\theta$ and $\eta$ using the loss $L = \alpha \cdot L_{PPO} + \beta \cdot L_{ESIL}$ (Equation 4)
21: end for

Setup
Empty Room (grid-world) environment: The Empty Room environment is a simple grid-world environment. The agent is placed in an 11 × 11 grid, representing the room. The goal of the agent is to reach a target position in the room. The start position of the agent is at the upper-left corner of the room, and the target position is randomly selected within the room. When the agent chooses an action that would lead it to fall outside the grid area, the agent stays at the current position. The length of each episode is 32. The desired goal, $g$, is a two-dimensional grid coordinate which represents the target position. The achieved goal, $g^{ac}_t$, is also a two-dimensional coordinate which represents the current position of the agent at time step $t$, and finally, the observation is a two-dimensional coordinate which represents the current position of the agent. The agent has five actions: left, right, up, down and stay; the agent executes a random action with probability 0.2. The agent gets a reward of +1 only when $g^{ac}_t = g$; otherwise, it gets a reward of 0. The agent is trained with 1 CPU core. In each epoch, 100 episodes are collected for training. After each epoch, the agent is evaluated for 10 episodes. During training, the actions are sampled from the categorical distribution. During evaluation, the action with the highest probability is chosen.
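A minimal sketch of an environment matching this description is given below. It is not the environment used in the experiments: the class name `EmptyRoom`, the action encoding, and the interpretation of "executes a random action with probability 0.2" as replacing the chosen action with a uniformly random one are all assumptions made for illustration.

```python
import random

# Hypothetical action encoding: left, right, up, down, stay (as dx, dy).
ACTIONS = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1), 4: (0, 0)}

class EmptyRoom:
    def __init__(self, size=11, horizon=32, slip=0.2, seed=None):
        self.size, self.horizon, self.slip = size, horizon, slip
        self.rng = random.Random(seed)

    def reset(self):
        self.pos = (0, 0)  # upper-left corner of the room
        self.goal = (self.rng.randrange(self.size), self.rng.randrange(self.size))
        self.t = 0
        return self.pos, self.goal

    def step(self, action):
        # With probability `slip`, a uniformly random action is executed instead.
        if self.rng.random() < self.slip:
            action = self.rng.randrange(len(ACTIONS))
        dx, dy = ACTIONS[action]
        x, y = self.pos[0] + dx, self.pos[1] + dy
        # Moves that would leave the grid keep the agent where it is.
        if 0 <= x < self.size and 0 <= y < self.size:
            self.pos = (x, y)
        self.t += 1
        reward = 1.0 if self.pos == self.goal else 0.0  # achieved goal == desired goal
        done = self.t >= self.horizon
        return self.pos, reward, done
```

In this sketch the achieved goal is simply the current position, so the sparse reward reduces to an equality test against the target coordinate.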
Fetch robotic (continuous) environments [20]: The Fetch robotic environments are physically plausible simulations based on the real Fetch robot. The purpose of these environments is to provide a platform to tackle problems which are close to practical, challenging robot manipulation tasks. Fetch is a 7-DoF robot arm with a two-finger gripper. The Fetch environments include four tasks: FetchReach, FetchPush, FetchPickAndPlace and FetchSlide. For all Fetch tasks, the length of each episode is 50. The desired goal, $g$, is a three-dimensional coordinate which represents the target position. If a task has an object, the achieved goal $g^{ac}_t$ is a three-dimensional coordinate which represents the position of the object. Otherwise, $g^{ac}_t$ is a three-dimensional coordinate which represents the position of the gripper. Observations include the following information: position, velocity and state of the gripper. If a task has an object, the position, velocity and rotation information of the object is included. Therefore, the observation of FetchReach is a 10-dimensional vector. The observation of the other tasks is a 25-dimensional vector. The action is a four-dimensional vector. The first three dimensions represent the relative position that the gripper needs to move to in the next step. The last dimension indicates the distance between the fingers of the gripper. The reward function can be written as $r_t = -\mathbb{1}\left(\lVert g^{ac}_t - g \rVert > \epsilon\right)$, where $\epsilon = 0.05$. In the Fetch environments, for the FetchReach, FetchPush and FetchPickAndPlace tasks, the agent is trained using 16 CPU cores. In each epoch, 50 episodes are collected for training. The FetchSlide task is more complex, so 32 CPU cores are used. In each epoch, 100 episodes are collected for training. The Message Passing Interface (MPI) framework is used to perform synchronization when updating the network. After each epoch, the agent is evaluated for 10 episodes by each MPI worker. Finally, the success rate of each MPI worker is averaged.
During training, actions are sampled from multivariate normal distributions. In the evaluation phase, the mean vector of the distribution is used as an action.
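The difference between action selection in training and in evaluation can be sketched as follows, assuming that the actor outputs a mean and a standard deviation for each of the four independent action dimensions. The function name `select_action` is hypothetical, and any clipping of actions to the valid range of the environment is omitted.

```python
import random

def select_action(mean, std, evaluate=False, rng=random):
    """Sample each action dimension from an independent normal distribution
    during training; return the mean vector during evaluation."""
    if evaluate:
        return list(mean)
    return [rng.gauss(m, s) for m, s in zip(mean, std)]
```

Using the mean at evaluation time removes the exploration noise, so the reported success rates reflect the deterministic behaviour of the learned policy.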
The proposed method, termed PPO+ESIL, is compared with different baselines on different environments. All experiments are plotted based on five runs with different seeds. The solid line is the median value. The upper bound is the 75th percentile and the lower bound is the 25th percentile.
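The aggregation used for the plots can be sketched as below, assuming each run is stored as a list of per-epoch success rates and that percentiles are computed with linear interpolation between data points (the "inclusive" method of Python's `statistics.quantiles`). The function name `summarise_runs` is hypothetical, and the exact percentile convention used for the figures is not stated in the text.

```python
import statistics

def summarise_runs(curves):
    """curves: one list of per-epoch success rates per seed, equal lengths.

    Returns a list of (25th percentile, median, 75th percentile) per epoch.
    """
    summary = []
    for values in zip(*curves):  # values of all seeds at one epoch
        q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
        summary.append((q1, median, q3))
    return summary
```

With five seeds and this convention, the three values are simply the second, third and fourth smallest success rates at each epoch.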

Network Structure and Hyperparameters
Network structure: Both the actor network and the critic network have three hidden layers with 256 neurons. ReLU is selected as the activation function for the hidden layers. In the grid-world environment, the actor network builds a categorical distribution. In the Fetch environment, the actor network builds normal distributions by producing mean vectors and the standard deviations of the independent variables.
Hyperparameters: For all experiments, the learning rate is 0.0003 for both the actor and critic networks. The discount factor $\gamma$ is 0.98. Adam is chosen as the optimiser with $\epsilon = 0.00001$. For each epoch, the actor network and critic network are updated 10 times. The clip ratio of the PPO algorithm is 0.2. For the grid-world environment, the networks are trained for 100 epochs with a batch size of 160. Each epoch consists of 100 episodes. For the Fetch environments, the networks are trained for 100 epochs in the FetchReach task and for 1000 epochs in the other tasks, with a batch size of 125. For the FetchReach, FetchPush and FetchPickAndPlace tasks, each epoch consists of 50 episodes. For the FetchSlide task, each epoch consists of 100 episodes. In designing the experiments, the number of episodes within an epoch is a balance between being able to train, the length of time required to run experiments and the maximum number of time steps that would be required to achieve a goal. All environments have a fixed maximum number of time steps, but this maximum differs depending on the problem or environment. This means that the number of state-action pairs can differ between two environments that have the same number of episodes and the same number of epochs. We arrange the episodes to try to compensate for the number of state-action pairs collected during training to make experiments easier to compare. The models are trained on a machine with an Intel i7-5960X CPU and 64GB RAM.
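The point about differing numbers of state-action pairs can be illustrated with simple arithmetic on the reported settings. The sketch below assumes that the stated number of episodes per epoch is the total across all workers and that every episode runs for the full episode length; neither assumption is confirmed by the text, and the dictionary layout and function names are hypothetical.

```python
# Episodes per epoch, steps per episode and training epochs, as reported above.
SETTINGS = {
    "EmptyRoom":         {"episodes": 100, "horizon": 32, "epochs": 100},
    "FetchReach":        {"episodes": 50,  "horizon": 50, "epochs": 100},
    "FetchPush":         {"episodes": 50,  "horizon": 50, "epochs": 1000},
    "FetchPickAndPlace": {"episodes": 50,  "horizon": 50, "epochs": 1000},
    "FetchSlide":        {"episodes": 100, "horizon": 50, "epochs": 1000},
}

def transitions_per_epoch(cfg):
    """State-action pairs collected in one epoch."""
    return cfg["episodes"] * cfg["horizon"]

def total_transitions(cfg):
    """State-action pairs collected over the whole training run."""
    return transitions_per_epoch(cfg) * cfg["epochs"]

for name, cfg in SETTINGS.items():
    print(name, transitions_per_epoch(cfg), total_transitions(cfg))
```

Under these assumptions, Empty Room and FetchSlide both use 100 episodes per epoch but collect 3200 and 5000 state-action pairs per epoch, respectively, because their episode lengths differ.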

Grid-World Environments
To understand the basic properties of the proposed method, the toy Empty Room environment is used to evaluate ESIL. The following baselines are considered:
• PPO: vanilla PPO [46] for discrete action spaces;
• PPO+SIL/PPO+SIL+HER: Self-imitation learning (SIL) is used with PPO to solve hard exploration environments by imitating past good experiences [14]. In order to solve sparse rewards tasks, hindsight experience replay (HER) is applied to sampled transitions;
• DQN+HER: Hindsight experience replay (HER), designed for sparse reward problems, is combined with a deep Q-learning network (DQN) [15]; this is an off-policy algorithm;
• Hindsight Policy Gradients (HPG): the vanilla implementation of HPG that is only suitable for discrete action spaces [40].
More specifically, PPO+ESIL is compared with the above baseline methods in Figure 4a. This shows that PPO+ESIL converges faster than the other four baselines, and PPO+SIL converges faster than vanilla PPO, because PPO+SIL reuses past good experiences to help exploration and training. Hindsight Policy Gradient (HPG) is slower than the others because its goal sampling is inefficient and unstable.
Further, the performance of the trajectory selection module is evaluated in Figure 4b. This shows that the selection strategy helps improve the performance. Hindsight experiences are not always perfect; the trajectory selection module filters some undesirable, modified experiences. Through adopting this selection strategy, the chance of agents learning from poor trajectories is reduced. The adaptive weight coefficient $\beta$ is also investigated in these experiments. In Figure 4c, it can be seen that at the initial stages of training, $\beta$ is high. This is because at this stage, the agent very seldom achieves the original goals. The hindsight experiences can yield higher returns than the original experiences. Therefore, a large proportion of hindsight experiences are selected to conduct self-imitation learning, helping the agent learn a policy for moving through the room. In the later stages of training, the agent can achieve success frequently, and the hindsight experiences might be redundant (e.g., $R(s_t, g) \geq R(s_t, g')$). In this case, undesired hindsight experiences are removed by using the trajectory selection module and $L_{PPO}$ leads the training. However, when the trajectory selection module is not employed, all hindsight experiences are used throughout the entire training process, including the redundant ones. This leads to overfitting and makes training unstable. Thus, $L_{ESIL}$ can provide the agent with a better initial policy, and the adaptive weight coefficient $\beta$ can balance the contributions of $L_{PPO}$ and $L_{ESIL}$ properly during training. Finally, the combination of PPO+ESIL is also compared with DQN+HER, which is an off-policy RL algorithm, in Figure 4d. This shows that DQN+HER works a little better than ESIL at the start of training. However, the proposed method achieves similar results to DQN+HER later in training.

Continuous Environments
Continuous control problems are generally more challenging for reinforcement learning. In the experiments of this section, the aim is to investigate how useful the proposed method is for several hard exploration OpenAI Gym Fetch tasks. These environments are commonly used to assess the performance of RL methods for continuous control. The following baselines are considered:
• PPO: the vanilla PPO [46] for continuous action spaces;
• PPO+SIL/PPO+SIL+HER: Self-imitation learning is used with PPO to solve hard exploration environments by imitating past good experiences [14]. For sparse rewards tasks, hindsight experience replay (HER) is applied to sampled transitions;
• DDPG+HER: this is the state-of-the-art off-policy RL algorithm for the Fetch tasks. Deep deterministic policy gradient (DDPG) is trained with HER to deal with the sparse reward problem [15].
cannot solve the other two manipulation tasks. In contrast, the proposed PPO+ESIL, through utilizing episodic hindsight experiences from failed trajectories, can achieve positive rewards quickly at the start of training.

Ablation Study of Trajectory Selection Module
In order to investigate the effect of trajectory selection, ablation studies are performed to validate the selection strategy of our approach. In Figure 6, when the trajectory selection module is not used, the performance of the agent increases at first and then starts to decrease. This suggests that the agent starts to converge to a sub-optimal solution. However, in Figure 6d, for the FetchSlide task, the agent converges faster without the trajectory selection module and has better performance. This is likely to be because FetchSlide is the most difficult of the Fetch environments. During training, the agent is very unlikely to achieve positive rewards. Figure 7 also indicates that the value of β in FetchSlide is higher than the values in other environments, which means the majority of hindsight experiences have higher returns than the original experiences. Thus, using more hindsight experiences (without filtering) accelerates training at this stage. Nonetheless, the trajectory selection module prevents the agent from overfitting to the hindsight experiences in the other three tasks. Figure 7 shows the adaptive weight coefficient β on all Fetch environments. When the trajectory selection module is used, the value of β decreases as the number of training epochs increases. This implies that the agent can achieve a greater proportion of the original goals in the later stages of training, and fewer hindsight experiences are required for self-imitation learning.

Comparison to Off-Policy Baselines
Finally, the proposed method is also compared with a state-of-the-art off-policy algorithm: DDPG+HER. From Figure 8, it may be seen that DDPG+HER converges faster than PPO+ESIL in all tasks, although PPO+ESIL eventually obtains similar performance. The faster convergence arises because DDPG+HER is an off-policy algorithm that uses a large number of hindsight experiences and employs a replay buffer to store samples collected in the past, which gives it better sample efficiency than on-policy algorithms such as PPO. Even so, Figure 8c shows that PPO+ESIL still outperforms DDPG+HER in the FetchPickAndPlace task, with a success rate close to 1. This suggests that PPO+ESIL shares the characteristics of on-policy algorithms, which have low sample efficiency, but is able to obtain performance comparable to off-policy algorithms in continuous control tasks [46].
Table 1 shows the average success rate over the last 10 epochs of training for the baseline methods and PPO+ESIL. The proposed ESIL achieves the best performance in four out of five tasks. However, PPO and PPO+SIL only obtain reasonable results for the Empty Room and FetchReach tasks. With the assistance of HER, PPO+SIL+HER obtains better performance in the FetchSlide task. The off-policy method DDPG+HER learns all five tasks, but outperforms PPO+ESIL only in the FetchPush task.
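The Table 1 metric, the average success rate over the last 10 training epochs, can be computed as in the short sketch below. The function name `final_success_rate` and the example curves are hypothetical and are not taken from the reported results.

```python
from statistics import mean
from typing import Sequence


def final_success_rate(success_per_epoch: Sequence[float],
                       last_n: int = 10) -> float:
    """Average the per-epoch success rate over the last `last_n` epochs.

    If fewer than `last_n` epochs are available, all of them are used.
    """
    if not success_per_epoch:
        raise ValueError("at least one epoch is required")
    return mean(success_per_epoch[-last_n:])


# Hypothetical learning curves (one success rate per epoch).
converged_curve = [0.0] * 5 + [0.5] * 5 + [1.0] * 10
still_improving_curve = [0.0] * 5 + [0.5] * 5 + [1.0] * 5

converged_score = final_success_rate(converged_curve)
improving_score = final_success_rate(still_improving_curve)
```

Averaging over a window of final epochs, rather than reporting the last epoch alone, reduces the influence of epoch-to-epoch noise on the reported number.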

Conclusions
This paper proposed a novel method for self-imitation learning (SIL), in which an on-policy RL algorithm uses episodic modified past trajectories, i.e., hindsight experiences, to update policies. Compared with standard self-imitation learning, episodic self-imitation learning (ESIL) has a better performance in continuous control tasks where rewards are sparse. As far as we know, it is also the first time that hindsight experiences have been combined with state-of-the-art on-policy RL algorithms, such as PPO, to solve relatively hard exploration environments in continuous action spaces.
The experiments that we have conducted suggest that simply using self-imitation learning with the PPO algorithm, even with hindsight experience, leads to disappointing performance in continuous control Fetch tasks. In contrast, the episodic approach we take with ESIL is able to learn in these sparse reward settings. The auxiliary trajectory selection module and the adaptive weight β help the training process to remove undesired experiences and balance the contributions to learning between the PPO term and the ESIL term automatically, and also increase the stability of training.
Our experiments suggest that the selection module is useful to prevent overfitting to sub-optimal hindsight experiences, but also that it does not always lead to learning a better policy faster. Despite this, selection filtering appears to support learning a useful policy in challenging environments. The experiments we have conducted to date have utilised relatively small networks, and it would be appropriate to extend them to more complex observation spaces and to the more elaborate actor/critic networks that these would require.
Future work includes extending the proposed method to support hierarchical reinforcement learning (HRL) algorithms for more complex manipulation control tasks, such as in-hand manipulation. Episodic self-imitation learning (ESIL) can also be applied to simultaneously learn sub-goal policies.