Episodic Self-Imitation Learning with Hindsight
Abstract
1. Introduction
2. Related Work
3. Background
3.1. Reinforcement Learning
3.2. Proximal Policy Optimization
3.3. Hindsight Experiences and Goals
4. Methodology
4.1. Episodic Self-Imitation Learning
4.2. Episodic Update with Hindsight
Algorithm 1 Proximal policy optimization (PPO) with Episodic Self-Imitation Learning (ESIL)
Require: an actor network $\pi(s, g \mid \theta)$, a critic network $V(s, g \mid \eta)$, the maximum steps $T$ of an episode, a reward function $r$
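The Require line names two goal-conditioned networks, each taking a state s and a goal g, together with a reward function r. As a minimal sketch of that interface only, and not the architecture used in the paper, the following assumes a linear softmax actor and a linear critic acting on the concatenated input [s, g], and a sparse reward that is 0 when the achieved goal lies within a tolerance of the desired goal and -1 otherwise. The function names, the layer shapes, and the tolerance value are all hypothetical.

```python
import numpy as np

def actor(s, g, theta):
    """Hypothetical goal-conditioned policy: action probabilities from [s, g]."""
    x = np.concatenate([s, g])
    logits = theta @ x
    logits = logits - logits.max()  # shift for numerical stability
    p = np.exp(logits)
    return p / p.sum()

def critic(s, g, eta):
    """Hypothetical goal-conditioned value estimate from [s, g]."""
    x = np.concatenate([s, g])
    return float(eta @ x)

def sparse_reward(achieved_goal, goal, tol=0.05):
    """Assumed sparse reward: 0 on success, -1 otherwise (tol is illustrative)."""
    return 0.0 if np.linalg.norm(achieved_goal - goal) <= tol else -1.0

rng = np.random.default_rng(0)
s, g = rng.normal(size=4), rng.normal(size=2)   # 4-dim state, 2-dim goal
theta = rng.normal(size=(3, 6))                 # 3 discrete actions
eta = rng.normal(size=6)

probs = actor(s, g, theta)   # distribution over the 3 actions
value = critic(s, g, eta)    # scalar value estimate
```

Conditioning both networks on g is what allows the same parameters to be reused when a trajectory is later evaluated against a different goal.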

5. Experiments and Results
5.1. Setup
5.2. Network Structure and Hyperparameters
5.3. GridWorld Environments
 PPO: vanilla PPO [46] for discrete action spaces;
 PPO+SIL/PPO+SIL+HER: Self-imitation learning (SIL) is used with PPO to solve hard-exploration environments by imitating past good experiences [14]. To handle sparse-reward tasks, hindsight experience replay (HER) is applied to the sampled transitions;
 DQN+HER: Hindsight experience replay (HER), designed for sparse-reward problems, is combined with a deep Q-learning network (DQN) [15]; this is an off-policy algorithm;
 Hindsight Policy Gradients (HPG): the vanilla implementation of HPG that is only suitable for discrete action spaces [40].
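Several of the baselines above apply hindsight relabeling to stored transitions. As a rough illustration of the general idea, and not the exact procedure of any listed baseline, the sketch below replaces the desired goal of every transition in an episode with the goal achieved at the final step and recomputes the sparse reward. The dictionary keys, the helper names, and the 0/-1 reward convention are assumptions made for this example.

```python
import numpy as np

def relabel_with_hindsight(trajectory, reward_fn):
    """Relabel an episode with the goal it actually reached at its last step."""
    hindsight_goal = trajectory[-1]["achieved_goal"]
    relabeled = []
    for transition in trajectory:
        new_transition = dict(transition)  # shallow copy; original is kept
        new_transition["goal"] = hindsight_goal
        new_transition["reward"] = reward_fn(
            transition["achieved_goal"], hindsight_goal
        )
        relabeled.append(new_transition)
    return relabeled

def grid_reward(achieved_goal, goal):
    """Assumed sparse grid reward: 0 when the cell matches the goal, else -1."""
    return 0.0 if np.array_equal(achieved_goal, goal) else -1.0

# A failed episode: the agent never reaches the desired goal cell (4, 4).
goal = np.array([4, 4])
positions = [np.array([0, 1]), np.array([1, 1]), np.array([1, 2])]
trajectory = [
    {"achieved_goal": p, "goal": goal, "reward": grid_reward(p, goal)}
    for p in positions
]

relabeled = relabel_with_hindsight(trajectory, grid_reward)
```

Under the original goal every reward in this episode is -1, whereas after relabeling the final transition is a success, which gives the learner a non-trivial signal from an otherwise failed episode.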
5.4. Continuous Environments
 PPO: the vanilla PPO [46] for continuous action spaces;
 PPO+SIL/PPO+SIL+HER: Self-imitation learning (SIL) is used with PPO to solve hard-exploration environments by imitating past good experiences [14]. For sparse-reward tasks, hindsight experience replay (HER) is applied to the sampled transitions;
 DDPG+HER: this is the state-of-the-art off-policy RL algorithm for the Fetch tasks. Deep deterministic policy gradient (DDPG) is trained with HER to deal with the sparse-reward problem [15].
5.4.1. Comparison to On-Policy Baselines
5.4.2. Ablation Study of Trajectory Selection Module
5.4.3. Comparison to Off-Policy Baselines
5.5. Overall Performance
6. Conclusions
Author Contributions
Funding
Conflicts of Interest
References
Method           Empty Room     Reach          Push           Pick           Slide
PPO              1.000 ± 0.000  1.000 ± 0.000  0.070 ± 0.001  0.033 ± 0.001  0.077 ± 0.001
PPO+SIL          0.998 ± 0.002  0.225 ± 0.016  0.071 ± 0.001  0.036 ± 0.002  0.011 ± 0.001
PPO+SIL+HER      0.996 ± 0.013  1.000 ± 0.000  0.066 ± 0.011  0.035 ± 0.004  0.276 ± 0.011
DQN+HER          1.000 ± 0.000  -              -              -              -
DDPG+HER         -              1.000 ± 0.000  0.996 ± 0.001  0.888 ± 0.008  0.733 ± 0.013
HPG              0.964 ± 0.012  -              -              -              -
PPO+ESIL (Ours)  1.000 ± 0.000  1.000 ± 0.000  0.984 ± 0.003  0.986 ± 0.002  0.812 ± 0.015
