Variational Information Bottleneck Regularized Deep Reinforcement Learning for Efficient Robotic Skill Adaptation

Deep Reinforcement Learning (DRL) algorithms have been widely studied for sequential decision-making problems, and substantial progress has been achieved, especially in autonomous robotic skill learning. However, deploying DRL methods on practical safety-critical robot systems remains difficult, since a gap between the training and deployment environments always exists, and this issue becomes increasingly crucial in ever-changing environments. Aiming at efficient robotic skill transfer in a dynamic environment, we present a meta-reinforcement learning algorithm based on a variational information bottleneck. More specifically, during the meta-training stage, the variational information bottleneck is first applied to infer a complete set of basic tasks covering the whole task space, and the maximum-entropy-regularized reinforcement learning framework is then used to learn the basic skills corresponding to those basic tasks. Once the training stage is completed, every task in the task space can be expressed as a nonlinear combination of the basic tasks, so the skill to accomplish a task can likewise be obtained by combining the basic skills. Empirical results on several highly nonlinear, high-dimensional robotic locomotion tasks show that the proposed variational information bottleneck regularized deep reinforcement learning algorithm can improve sample efficiency on new tasks by 200–5000 times. Furthermore, the proposed algorithm achieves substantial asymptotic performance improvement. The results indicate that the proposed meta-reinforcement learning framework makes a significant step toward deploying DRL-based algorithms on practical robot systems.

Despite the remarkable progress in applying DRL to various robotic skill learning tasks, the underlying assumption that the training and deployment environments are identical means that agents trained with DRL algorithms often over-specialize to the training environment, which is still prohibitive for adapting to novel, unseen circumstances [21,22]. However, in modern application scenarios, a robot must work in an ever-changing and even unknown environment [23,24].
Intuitively, there are two ways to address this problem. The first is re-training in the new environment, which requires a huge number of samples. The second is deploying the well-trained policy directly, which leads to severe performance degradation and may damage the physical robot system. Therefore, the problem of how to make a DRL agent learn and adapt efficiently in a dynamic environment is of great significance [25–27].
On the other hand, human beings can learn new skills for new tasks in new environments from very limited samples [28]. The reason is that we can draw inferences about other cases from a single instance: when a new task emerges, we find related or similar cases in our experience, abstract the relevant skills, and transfer them to the new task, thereby acquiring the new skill very efficiently. This kind of human intelligence offers significant inspiration for improving robot intelligence, i.e., endowing robots with the capability of transfer learning. Several practical observations make this idea feasible. For example, the majority of robots are not designed for a specific task; a universal robot manipulator can accomplish versatile tasks while sharing the same configuration and dynamics. Moreover, different tasks share the same skills: the door-opening task and the cap-unscrewing task, for instance, can both be decomposed into holding and rotating an object. Therefore, if we embed the mechanism of transfer learning into the DRL framework, the learning efficiency for new skills can be improved substantially.
Recently, transfer learning methods have been widely studied in the pattern recognition field [29]. The main idea is to pre-train a model with huge amounts of previously acquired data and then fine-tune it with a few samples from the new task. Taylor et al. pre-train a deep network and then fine-tune it with a few task-specific images [30]. Ramachandran and Howard et al. train a large long short-term memory model on a natural language processing corpus and then update the model with a fine-tuning technique [31]. However, this simple pre-training technique can only deal with a very limited range of tasks, and its performance is bounded by the number of new-task samples. More severely, we cannot build a robot learning database like ImageNet or a text corpus, so this method has hardly been applied to robotic skill transfer [32,33].
The meta-learning scheme, or learning to learn, aims to extract similar or common structures shared across various tasks [34]. Once these latent common structures are obtained, new skills can be acquired very efficiently by combining them with very few samples from the current task. Following this idea, meta-reinforcement learning [35], i.e., incorporating the meta-learning mechanism into the reinforcement learning framework, can be categorized into two classes. The first encodes the various tasks and the corresponding skills with memory-enabled neural networks, such as the neural Turing machine [36] and the long short-term memory model, and the memorized information is then used to assist skill learning for new tasks [37,38]. In general, the encoded information can serve as the initialization parameters for new networks or the learning parameters for new tasks. This kind of meta-learning can realize structure extraction and transfer to some extent; however, the memory-based approach cannot learn the intrinsic structure explicitly, and its learning process is a black box.
Another kind of meta-learning approach embeds structure learning into policy learning. The most famous algorithm is model-agnostic meta-learning (MAML), which is based on the policy gradient method and can be used in nearly all policy-gradient-based reinforcement learning frameworks [39,40]. The main idea of MAML is to obtain a set of initial parameters that are sensitive to the gradients of different tasks by formulating parameter optimization as a bi-level optimization problem. Once the initial parameters are obtained from the source training tasks, only one or a few gradient steps on the new task are needed to obtain the new policy. In comparison with memory-based meta-learning, MAML needs no memory-enabled neural networks and introduces no additional model parameters. From the feature learning perspective, MAML learns a set of task features that are widely useful across tasks, so a high-performance policy can be obtained through only a few policy gradient steps. From the dynamical system perspective, MAML seeks a set of parameters with the greatest sensitivity to the performance of all tasks. MAML has been successfully applied to practical robot learning tasks, such as the domain adaptation issue in observing-then-imitating robot skill learning [41]. Since MAML is built on policy gradient algorithms, some issues remain in the computation of the policy gradient, such as the bias–variance trade-off in gradient estimation and unstable gradient computation. Liu et al. proposed the Taming MAML algorithm to address the bias–variance issue by reconstructing a surrogate objective function with a zero-expectation baseline function [42]. Rothfuss et al. proposed the ProMP algorithm, which analyzes the credit assignment issue to improve gradient estimation and algorithmic stability [43]. Gupta et al. studied the exploration issue and built structured noise for better exploration, enabling better performance [44]. However, some issues remain. MAML optimizes policy parameters that are most sensitive to all tasks, which means the obtained policy cannot by itself accomplish any task, including the source tasks; and once the policy adapts to one specific task, the network loses the ability for further improvement. During adaptation, all parameters of the MAML policy are updated, and second-order optimization is involved, which aggravates the computational burden. Most of all, there is still a lack of effective methods for sub-task extraction and representation.
Other works are inspired by how human beings adapt obtained skills to new tasks. Pastor et al. generalize relevant features among different skills with the help of the associative skill memory technique [45,46]. Rueckert et al. use the motion primitive technique to extract a group of low-dimensional control variables that can be reused [47]. Sutton et al. represent the process of task realization as a hierarchical motion sequence [48–50]. However, these works still cannot address how to extract and represent the basic tasks and basic policies from the source task space, nor how to infer the spatial-temporal combination of the basic tasks and basic policies automatically.
Therefore, we take advantage of latent space learning techniques and deep neural networks to automatically extract the basic tasks and basic policies, together with the way they are combined. Latent space learning techniques are widely used to compress high-dimensional data into low-dimensional information structures. Lenz et al. map the high-dimensional state space into a low-dimensional latent space and learn skills in that latent space [51]. Du et al. apply a latent space to learn the dynamics of the robot and then realize online skill transfer [52]. Principal component analysis is the most classic latent space learning technique, but it struggles with high-dimensional, highly nonlinear robotic skill learning problems [53]. Tishby et al. proposed the information bottleneck technique, which extracts low-dimensional structures from the perspective of information theory [54]. Saxe et al. then extended it to the deep learning domain by combining it with deep neural networks [55]. Alemi et al. further proposed the variational information bottleneck to quantify inference uncertainties via variational inference theory [56]. Recent studies [57] also show that the information bottleneck principle advocates learning minimal sufficient representations, i.e., representations that contain only the information needed for the downstream task; optimal representations retain the relevant information between input and output in a form that is parsimonious for learning the task. Peng et al. [58] showed that the variational information bottleneck can be applied to imitation learning, inverse reinforcement learning, and adversarial learning for more robust performance.
Therefore, we incorporate the variational information bottleneck into the maximum entropy off-policy reinforcement learning framework to develop a novel meta-reinforcement learning scheme. The main contributions of this article are summarized as follows:
• We develop a novel meta-reinforcement learning framework based on a variational information bottleneck. The framework consists of two stages: the meta-training stage, which extracts the basic tasks and the corresponding basic policies, and the meta-testing stage, which efficiently infers the policy for a new task by taking advantage of the basic tasks and basic policies.
• The meta-training and meta-testing algorithms are presented in detail, so that the meta-reinforcement learning framework enables efficient robotic skill transfer in a dynamic environment.
• Empirical experiments based on MuJoCo demonstrate the effectiveness of the proposed scheme.
The remainder of this paper is organized as follows: Section 2 formally describes the problem and introduces variational information bottleneck theory. The novel variational information bottleneck regularized DRL framework is proposed in Section 3. In Section 4, we formulate the transfer learning tasks based on MuJoCo and present empirical results. Section 5 gives discussions and conclusions.

Problem Formulation
Meta-reinforcement learning involves two stages, meta-training and meta-testing. During the meta-training stage, a batch of tasks is first sampled from the source task space as the training set, and an algorithm is then employed for model training. During the meta-testing stage, a small batch of tasks is sampled from the target task space as a testing set, and the well-trained model is used to evaluate transfer performance through only a few interactions on each task. In general, we assume that the training and testing task sets are sampled from the same distribution p(T), and that the task space consists of a state space S, an action space A, a transition function space P, and a bounded reward function space R, so that every task can be characterized by a Markov Decision Process (MDP). We further assume that the transition and reward functions are unknown to the agent; only samples can be gathered. A stochastically sampled task is T = (p(s_0), p(s_{t+1}|s_t, a_t), r(s_t, a_t)), where p(s_0) denotes the initial state distribution, p(s_{t+1}|s_t, a_t) denotes the (unknown) transition probability, and r(s_t, a_t) denotes the reward function. Under this definition, p(T) can describe both tasks with different transition functions, i.e., different robot dynamics, and tasks with different reward functions, i.e., different objectives, under the assumption that all robots and tasks share the same state and action spaces. Given a task distribution p(T), we first sample M tasks as the meta-training set {T_i} ~ p(T), i = 1, ..., M, and then optimize the latent-space-based policy π(a|s, z), where z denotes the latent variable, on these training tasks. Denoting by e^T_k = (s_k, a_k, r_k, s'_k) a sample for task T, e^T_{1:K} denotes a trajectory. Thus, we obtain the meta-training data sets D_{T_i}, i = 1, ..., M.
Once meta-training is accomplished, we sample another N tasks from the task distribution p(T) as the meta-testing set, T_j ~ p(T), j = 1, ..., N. For any test task, only a few interactions are needed for the latent space to infer the spatial-temporal combination of basic tasks, so the agent can adapt to the test task efficiently. The detailed variable definitions are shown in Table 1.
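As a concrete sketch of this setup, the following Python snippet samples a hypothetical family of velocity-tracking tasks from p(T) for meta-training and meta-testing. The function names, the velocity parameterization, and the reward are illustrative assumptions, not the paper's actual implementation:

```python
import random

def sample_tasks(num_tasks, vel_range=(0.0, 2.0), seed=0):
    """Sample tasks from p(T); here each task is a target forward
    velocity (a hypothetical reward-parameterized task family)."""
    rng = random.Random(seed)
    lo, hi = vel_range
    return [{"target_vel": lo + (hi - lo) * rng.random()} for _ in range(num_tasks)]

# Meta-training set of M tasks and a separate meta-testing draw of N tasks,
# both from the same distribution p(T).
train_tasks = sample_tasks(100, seed=1)   # {T_i}, i = 1..M
test_tasks = sample_tasks(30, seed=2)     # {T_j}, j = 1..N

def reward(state_vel, task):
    """Task-conditioned reward: penalize deviation from the target velocity."""
    return -abs(state_vel - task["target_vel"])
```

A task here only differs in its reward function; varying dynamics parameters instead would cover the other task class discussed above.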
Latent-space-based meta-reinforcement learning simultaneously learns the latent space and its corresponding policy. The latent space is obtained in a self-supervised manner and used to differentiate tasks. It should therefore satisfy the following principles: (1) sufficiency, the latent variable z should sufficiently characterize the task space p(T), i.e., the mapping from task space to latent space should be bijective; (2) parsimony, the latent space should not over-fit to detailed states; (3) identifiability, the latent space should identify the specific task from trajectories. We tailor the variational information bottleneck technique to extract such a latent space. Furthermore, to achieve better training efficiency, we adopt the state-of-the-art maximum entropy off-policy actor-critic algorithm, soft actor-critic (SAC). The following subsections describe the mathematical formulation of reinforcement learning and variational information bottleneck theory.

Markov Decision Process (MDP)
In general, an MDP is described by the tuple M = (S, A, P, r, γ, ρ_0), where S, A, and P : S × A × S → [0, ∞) are defined as before; r : S × A → R denotes the reward function, bounded by [r_min, r_max]; ρ_0 : S → R denotes the initial state distribution; and γ ∈ (0, 1) denotes the discount factor.
In this work, we consider a stochastic policy π : S × A → [0, 1], and let J(π) denote the expected discounted cumulative reward:

J(π) = E_{τ∼π} [ Σ_{t=0}^∞ γ^t r(s_t, a_t) ],

where the expectation E is taken over a bunch of trajectories τ generated by following the policy π.
The objective of RL is to optimize the policy by maximizing the objective function:

π* = arg max_π J(π),

where s_0 ∼ ρ_0(s_0), s_{t+1} ∼ P(s_{t+1}|a_t, s_t), and a_t ∼ π(a_t|s_t). Let Q^π denote the state-action value function, V^π the state value function, and A^π the advantage function:

Q^π(s_t, a_t) = E [ Σ_{l=0}^∞ γ^l r(s_{t+l}, a_{t+l}) ],
V^π(s_t) = E_{a_t∼π(·|s_t)} [ Q^π(s_t, a_t) ],
A^π(s_t, a_t) = Q^π(s_t, a_t) − V^π(s_t),

where a_t ∼ π(a_t|s_t) and s_{t+1} ∼ P(s_{t+1}|a_t, s_t) for all t ≥ 0.
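These definitions translate directly into a few lines of numpy. The snippet below estimates a discounted return for one trajectory and computes V^π and A^π from a tabular Q^π; the particular Q and π values are made-up examples:

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Estimate of J for one trajectory: sum_t gamma^t * r_t."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Advantage from tabular Q and V: A^pi(s, a) = Q^pi(s, a) - V^pi(s),
# with V^pi(s) = E_{a~pi(.|s)}[Q^pi(s, a)].
Q = np.array([[1.0, 3.0],      # Q^pi(s, a) for 2 states x 2 actions
              [0.0, 2.0]])
pi = np.array([[0.5, 0.5],     # pi(a|s), rows sum to 1
               [0.25, 0.75]])
V = (pi * Q).sum(axis=1)       # V^pi(s)
A = Q - V[:, None]             # A^pi(s, a)
```

Note that A^π has zero expectation under π at every state, which is what makes it a useful baseline-corrected signal for policy gradients.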

Maximum Entropy Actor-Critic
To balance exploration and exploitation in standard RL, a per-step entropy term is usually added to the per-step reward as r(s_t, a_t) + αH(π(·|s_t)), where H(π(·|s_t)) = −∫_A π(a_t|s_t) log π(a_t|s_t) da_t denotes the entropy of the policy at s_t, and α ∈ (0, 1) is a weighting term. The optimal policy is then obtained by maximizing the expected discounted reward plus entropy:

π* = arg max_π E [ Σ_{t=0}^∞ γ^t ( r(s_t, a_t) + αH(π(·|s_t)) ) ].

In this way, the agent maximizes the cumulative reward while keeping the policy as stochastic as possible. Several previous works have studied this framework carefully and shown conceptual and practical advantages [9–11]: the policy is incentivized to explore more diversely while abandoning clearly hopeless directions, yielding improved sample efficiency and performance. In comparison with standard RL, the action-state value function is modified as

Q^π(s_t, a_t) = r(s_t, a_t) + E [ Σ_{k=1}^∞ γ^k ( r(s_{t+k}, a_{t+k}) + αH(π(·|s_{t+k})) ) ],

and the state value function as

V^π(s_t) = E_{a_t∼π(·|s_t)} [ Q^π(s_t, a_t) ] + αH(π(·|s_t)),

where ρ^π(s, a) = π(a|s) Σ_{t=0}^∞ γ^t P(s_t = s|π) and ρ^π(s) = E_{a∼π(·|s)}[ρ^π(s, a)] denote the state-action occupancy measure and state occupancy measure, respectively. Then, according to the Bellman optimality principle, we obtain the following results.
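As a small illustration of the entropy-augmented reward, the snippet below computes r(s_t, a_t) + αH(π(·|s_t)) for a 1-D Gaussian policy, whose differential entropy has the closed form 0.5·log(2πe·σ²). This is a generic sketch, not the paper's implementation:

```python
import math

def gaussian_entropy(sigma):
    """Differential entropy of a 1-D Gaussian policy N(mu, sigma^2):
    H = 0.5 * log(2 * pi * e * sigma^2)."""
    return 0.5 * math.log(2.0 * math.pi * math.e * sigma ** 2)

def augmented_reward(r, sigma, alpha=0.2):
    """Per-step maximum-entropy reward r(s_t, a_t) + alpha * H(pi(.|s_t))."""
    return r + alpha * gaussian_entropy(sigma)
```

A wider (more stochastic) policy earns a larger entropy bonus, which is exactly the incentive to keep exploring.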

Lemma 1.
With the modified action-state value function and the modified state value function defined in Equations (7) and (8), respectively, they satisfy the Bellman optimality equation, with the optimal policy given by

π*(a_t|s_t) = exp( (1/α) ( Q*(s_t, a_t) − V*(s_t) ) ).

Moreover, the modified value iteration

Q(s_t, a_t) ← r(s_t, a_t) + γ E_{s_{t+1}} [ V(s_{t+1}) ],
V(s_t) ← α log ∫_A exp( Q(s_t, a_t)/α ) da_t,

converges to the fixed points Q*(s_t, a_t) and V*(s_t), respectively.
The modified Bellman optimality equation in Equation (9) can be regarded as a generalization of the conventional Bellman equation, which is recovered as α → 0. The SQL (Soft Q-Learning) and SAC (Soft Actor-Critic) algorithms, which are built on the above value iteration, achieve state-of-the-art performance on continuous control tasks and dexterous hand manipulation tasks by using deep neural networks to approximate the modified action-value function, the modified state value function, and the policy [10].
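The soft value iteration of Lemma 1 can be run directly on a small tabular MDP. The sketch below implements the two updates (Q backup and soft-max V), using a numerically stable log-sum-exp; the toy reward and transition tensors are arbitrary examples:

```python
import numpy as np

def soft_value_iteration(R, P, gamma=0.9, alpha=0.1, iters=500):
    """Tabular soft value iteration (Lemma 1):
       Q(s,a) <- r(s,a) + gamma * E_{s'}[V(s')]
       V(s)   <- alpha * log sum_a exp(Q(s,a) / alpha)   (soft max over actions)
    R: (S, A) rewards; P: (S, A, S) transition probabilities."""
    V = np.zeros(R.shape[0])
    for _ in range(iters):
        Q = R + gamma * P.dot(V)                     # backup, shape (S, A)
        m = Q.max(axis=1)                            # stabilize log-sum-exp
        V = m + alpha * np.log(np.exp((Q - m[:, None]) / alpha).sum(axis=1))
    pi = np.exp((Q - V[:, None]) / alpha)            # pi* = exp((Q* - V*) / alpha)
    return Q, V, pi

# Toy 2-state, 2-action MDP with uniform transitions.
R = np.array([[1.0, 0.0], [0.0, 1.0]])
P = np.full((2, 2, 2), 0.5)
Q, V, pi = soft_value_iteration(R, P)
```

As α shrinks, the soft max approaches the hard max and π* concentrates on the greedy action, recovering conventional value iteration.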

Variational Information Bottleneck (VIB)
The information bottleneck theory was first proposed by Tishby et al., and Alemi et al. proposed the variational information bottleneck by incorporating variational inference and deep learning techniques. Consider a general supervised learning task with a data set {(x_i, y_i)}, where x_i denotes the input and y_i the label. The information bottleneck encodes the inputs into a latent space as E(z|x) and constrains the information flow between x and z using the mutual information I(X; Z), defined by

I(X; Z) = E_x [ D_KL[ E(Z|x) || p(Z) ] ],

where D_KL denotes the Kullback–Leibler (KL) divergence. Thus, we obtain the constrained optimization problem

max_{E, q} I(Z; Y)  s.t.  I(X; Z) ≤ I_c,

where I_c denotes the user-defined information constraint, E(z|x) denotes the latent space encoder, and q(y|z) denotes the mapping from the latent space to the label space. Solving this constrained optimization problem yields a latent space that is sufficient for the final task and parsimonious, without redundant information.
In practice, one can take samples of D_KL[E(Z|X) || p(Z)] to estimate the mutual information. While E(Z|X) is straightforward to compute, calculating p(Z) requires marginalization across the entire input space, which is intractable in most non-trivial settings. Instead, we introduce an approximator q(Z) = N(0, I) to replace p(Z). This yields an upper bound on I(Z, X):

I(X; Z) ≤ E_x [ D_KL[ E(Z|x) || q(Z) ] ],

where the inequality follows from the non-negativity of the KL divergence, i.e., D_KL[p(z) || q(z)] ≥ 0. Substituting Equation (15) into Equation (14) gives Ĵ(q, E) ≥ J(q, E), where Ĵ(q, E) denotes the upper bound of the objective function. By introducing a Lagrangian multiplier β, we convert the constrained optimization problem into an unconstrained one:

min_{E, q} −I(Z; Y) + β ( I(X; Z) − I_c ).

Solving this problem yields a sufficient and parsimonious representation of the original data distribution. Alemi et al. showed that the variational information bottleneck suppresses parameter over-fitting and that the obtained model is robust to adversarial attacks.
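The tractable part of this bound is the per-sample KL between a diagonal-Gaussian encoder and q(Z) = N(0, I), which has the well-known closed form 0.5·Σ(σ² + μ² − 1 − log σ²). A minimal numpy sketch (function names are illustrative):

```python
import numpy as np

def kl_to_standard_normal(mu, logvar):
    """Closed-form D_KL[N(mu, diag(exp(logvar))) || N(0, I)] per sample:
    0.5 * sum(sigma^2 + mu^2 - 1 - log sigma^2)."""
    return 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar, axis=-1)

def vib_penalty(mu_batch, logvar_batch):
    """Monte-Carlo upper bound on I(Z; X): batch mean of
    D_KL[E(z|x) || q(z)], with q(z) = N(0, I) replacing the intractable p(z)."""
    return float(np.mean(kl_to_standard_normal(mu_batch, logvar_batch)))
```

The penalty vanishes exactly when every encoder output already equals the standard normal, i.e., when the code carries no information about x.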

Overview
The latent-space-based robotic skill transfer learning framework is shown in Figure 1 and includes two stages. During the meta-training stage, we sample M training tasks from the source task space; for each task, the RL agent interacts with the environment and obtains several trajectories e^T_{1:K}, and we then iteratively optimize the latent space encoder E_ω(z|e^{T_m}) and the latent-conditioned policy π_θ(a|s, z^{T_m}), both parameterized by deep neural networks with parameters ω and θ, respectively. Once training converges, the two networks are fixed and reused for test tasks. During the meta-testing stage, we sample another N tasks; the latent space encoder infers the specific task from very few interaction samples, and the latent vector then informs the policy to synthesize the final policy. Because there is no gradient update at test time, the agent reuses the obtained skills very efficiently. Moreover, since the latent space informs the agent how to synthesize a new policy according to the real-time data sequence of the specific task, improved performance is achieved. Following the maximum entropy actor-critic framework, the latent-based constrained optimization problem is formulated as

max_{π, E} E [ R(e^T, z) ]  s.t.  I(e^T; Z) ≤ I_c,

where R(e^T, z) denotes the latent-space-informed discounted return of trajectory e^T.
According to variational information bottleneck theory, introducing the variational distribution q(z) yields

Ĵ(π, E) ≥ J(π, E),

where Ĵ(π, E) denotes the upper bound of J(π, E).
According to the maximum entropy Bellman optimality principle, the optimal action-state value function for problem (19) is

Q*(s_t, a_t, z_t) = r(s_t, a_t, z_t) + E [ Σ_{k=1}^∞ γ^k ( r(s_{t+k}, a_{t+k}, z_{t+k}) + αH(π*(·|s_{t+k}, z_{t+k})) ) ], (20)

with the optimal state value function

V*(s_t, z_t) = E_{a_t∼π*} [ Q*(s_t, a_t, z_t) ] + αH(π*(·|s_t, z_t)). (21)

According to Lemma 1, the optimal Bellman equation is

Q*(s_t, a_t, z_t) = r(s_t, a_t, z_t) + γ E_{s_{t+1}} [ V*(s_{t+1}, z_{t+1}) ], (22)

followed by the optimal policy

π*(a_t|s_t, z_t) = exp( (1/α) ( Q*(s_t, a_t, z_t) − V*(s_t, z_t) ) ). (23)

For practical robotic skill learning problems, the state and action spaces are very high-dimensional, so we use deep neural networks to approximate the action-state value function Q_ϕ, state value function V_ψ, policy π_θ, and latent space encoder E_ω, with network parameters ϕ, ψ, θ, and ω, respectively.

Latent Space Learning
From Equation (19), the latent space model should simultaneously satisfy two requirements: maximizing the entropy-augmented return for task accomplishment, and respecting the information bottleneck constraint. The latent learning algorithm should therefore simultaneously minimize the Bellman residual ||B^π Q − Q|| and the KL divergence D_KL[E_ω(z|e^T) || p(z)], giving the cost function

J(E) = E [ ( Q_ϕ(s_t, a_t, z_t) − ( r_t + γ V_ψ̄(s_{t+1}, z_t) ) )² ] + β E [ D_KL[ E_ω(z|e^T_{1:K}) || p(z) ] ],

where V_ψ̄ denotes the target network used for algorithmic stability. For computational feasibility, we implement p(z) as a Gaussian distribution, p(z) = N(0, I). Thus, E_ω can be modeled as a multivariate Gaussian,

E_ω(z|e^T_k) = N( μ_ω(e^T_k), σ_ω(e^T_k) ),

where μ_ω(e^T_k) and σ_ω(e^T_k) denote the mean and variance. The latent space encoder network is updated as

ω_{t+1} ← ω_t − η_ω ∇̂_ω J(E),

where η_ω denotes the update rate and ∇̂_ω denotes the first-order stochastic gradient.
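The encoder objective can be sketched numerically as a Bellman-residual term plus the β-weighted KL penalty, with latent samples drawn via the reparameterization trick so the sample stays differentiable in (μ, log σ²). The function names and the scalar residual input are illustrative stand-ins for the real networks:

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, logvar):
    """z = mu + sigma * eps, eps ~ N(0, I): the reparameterization trick,
    keeping the sample differentiable in (mu, logvar)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def encoder_loss(bellman_residual_sq, mu, logvar, beta=0.1):
    """Sketch of J(E): squared soft-Bellman residual plus the beta-weighted
    information penalty D_KL[E_omega(z|e) || N(0, I)] (networks omitted)."""
    kl = 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar)
    return bellman_residual_sq + beta * kl
```

In a full implementation the residual would be computed from Q_ϕ and the target network V_ψ̄ on sampled transitions, and ω would be updated by backpropagating through both terms.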
Using the aforementioned Bayesian inference procedure, we can realize skill transfer from the source task space to new tasks by posterior inference. Furthermore, the intrinsic uncertainty estimation property can be used to explore better ways of accomplishing a task.

VIB-Based Meta-Reinforcement Learning Algorithm
According to Lemma 1, the Bellman operator in Equation (22) is a contraction mapping, i.e., the optimal action-state value function is its fixed point. The cost function for the action-state value function is therefore the squared soft-Bellman residual,

J(Q) = E [ ( Q_ϕ(s_t, a_t, z_t) − ( r_t + γ E_{s_{t+1}} [ V_ψ̄(s_{t+1}, z_t) ] ) )² ],

and the action-state value network parameters are updated by gradient descent,

ϕ_{t+1} ← ϕ_t − η_ϕ ∇̂_ϕ J(Q),

where η_ϕ denotes the learning rate and ∇̂_ϕ denotes an unbiased gradient estimate. According to Equation (21), the cost function for the state value function is

J(V) = E [ ( V_ψ(s_t, z_t) − E_{a_t∼π_θ} [ Q_ϕ(s_t, a_t, z_t) − α log π_θ(a_t|s_t, z_t) ] )² ],

where the action-state value function is evaluated with actions from the current policy. The state value network is updated as

ψ_{t+1} ← ψ_t − η_ψ ∇̂_ψ J(V),

where η_ψ denotes the learning rate and ∇̂_ψ denotes an unbiased gradient estimate. According to Equation (23), the cost function for the policy network is

J(π) = E [ D_KL[ π_θ(·|s_t, z_t) || exp( Q_ϕ(s_t, ·, z_t)/α ) / Z_ϕ(s_t, z_t) ] ],

where Z_ϕ(s_t, z_t) denotes the normalizing function, which is irrelevant to the computation of the policy gradient. The policy network is updated as

θ_{t+1} ← θ_t − η_θ ∇̂_θ J(π),

where η_θ denotes the learning rate and ∇̂_θ denotes an unbiased gradient estimate. Based on the above analysis, we formulate the VIB-based meta-reinforcement learning algorithm. The meta-training protocol is presented in Algorithm 1 and the meta-testing procedure in Algorithm 2; the deep neural networks are trained with the ADAM optimizer.

Algorithm 1 VIB-based meta-reinforcement learning training algorithm.
1: Input: training data set D; learning rates η_θ, η_ϕ, η_ψ, η_ω; meta-training task set {T_m}, m = 1, ..., M, sampled from p(T).
2: Output: policy network π_θ and latent space encoder E_ω.
3: Initialize the target network ψ̄ ← ψ and the sample set D_m for each task.
4: for each epoch do
5:   for each task T_m do
6:     Initialize the trajectory buffer: e^{T_m} = {}
7:     for k = 1, ..., N do
8:       Interact with the task using the current policy π_θ(a|s, z) and add the samples to D_m
9:       Update e^{T_m} = {(s_j, a_j, s'_j, r_j)}_{j=1,...,K} ∼ D_m
10:    end for
11:  end for
12:  for each gradient step do
13:    for each task T_m do
14:      Sample from the training data set: e^{T_m}, d_m ∼ D_m
15:      Update the action-state value network: ϕ_{t+1} ← ϕ_t − η_ϕ ∇̂_ϕ Σ_m J_m(Q)
16:      Update the state value network: ψ_{t+1} ← ψ_t − η_ψ ∇̂_ψ Σ_m J_m(V)
17:      Update the policy network: θ_{t+1} ← θ_t − η_θ ∇̂_θ Σ_m J_m(π)
18:      Update the latent space encoder network: ω_{t+1} ← ω_t − η_ω ∇̂_ω Σ_m J_m(E)
19:    end for
20:    Update the target network ψ̄ ← ψ
21:  end for
22: end for

Algorithm 2 VIB-based meta-reinforcement learning testing algorithm.
1: Input: meta-testing task set; trained networks π_θ and E_ω.
2: for each test task do
3:   Collect a few transitions e^T
4:   Latent space inference: z ∼ E_ω(z|e^T)
5:   Use the current policy π_θ(a|s, z) to interact with the task and obtain D_k
6:   Evaluate the empirical discounted return for the task.
7: end for
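The meta-testing procedure can be sketched as a short gradient-free loop: collect a few context transitions, infer the task latent with the frozen encoder, then act with the frozen latent-conditioned policy. `encoder`, `policy`, and `env_step` below are hypothetical stand-ins for the trained networks and the environment:

```python
import numpy as np

def meta_test_adapt(encoder, policy, env_step, num_context=10, state_dim=4):
    """Algorithm-2-style adaptation sketch: collect a few transitions with the
    prior latent, infer the task latent, then reuse the frozen policy.
    No gradient updates take place at test time."""
    z = np.zeros(2)                      # prior mean of N(0, I)
    context, s = [], np.zeros(state_dim)
    for _ in range(num_context):         # few interactions with the test task
        a = policy(s, z)
        s_next, r = env_step(s, a)
        context.append((s, a, r, s_next))
        s = s_next
    z = encoder(context)                 # posterior inference of the task latent
    return z
```

Because only the cheap posterior inference runs at test time, adaptation costs a handful of environment steps rather than a full training run.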

Experiments
Our experiments aim to investigate the following questions:
1. Does the VIB-based meta-reinforcement learning algorithm realize efficient skill transfer?
2. How do the learning efficiency and asymptotic performance of the VIB-based meta-reinforcement learning algorithm compare with those of other meta-learning approaches, such as MAML and ProMP?
3. Can the VIB-based meta-reinforcement learning algorithm improve learning performance during the training stage in comparison with other algorithms?

Experimental Configuration
To investigate these questions, we implement the proposed algorithm on several challenging robotic locomotion tasks from the OpenAI Gym benchmark [21] and the rllab suite [59], both built on MuJoCo [60], a 3D physics simulator with accurate contact modeling. More specifically, we use four high-dimensional, highly nonlinear robots, Walker2d, Half-Cheetah, Ant, and Humanoid, as shown in Figure 2. Walker2d is a planar robot with 7 links; the reward is r(s, a) = v_x − 0.005||a||²; the episode terminates when z_body < 0.8, z_body > 2.0, or |θ| > 1.0; the state and action dimensions are 17 and 6, respectively. Half-Cheetah is a planar robot with 9 links; the reward is r(s, a) = v_x − 0.05||a||²; the state and action dimensions are 17 and 6, respectively. Ant is a quadruped robot with 13 links; the reward is r(s, a) = v_x − 0.005||a||² − C_contact + 0.05, where the contact penalty C_contact = 5 × 10⁻⁴||F_contact||² penalizes contacts with the ground, with F_contact the contact force; the episode terminates when z_body < 0.2 or z_body > 1.0; the state and action dimensions are 111 and 8, respectively. Humanoid is a human-like robot with 17 joints, including the head, torso, two arms, and two legs; the reward follows the benchmark's default definition; the episode terminates when z_body < 0.8 or z_body > 2.0; the state and action dimensions are 376 and 17, respectively. In building the source task space, we consider two classes of tasks.
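The Walker2d reward and termination rule stated above can be expressed in a few lines; this is a sketch of the paper's stated formulas (the upper height bound of 2.0 is our assumption, matching the standard benchmark):

```python
import numpy as np

def walker2d_reward(v_x, action):
    """r(s, a) = v_x - 0.005 * ||a||_2^2, the Walker2d reward stated above."""
    return v_x - 0.005 * float(np.sum(np.square(action)))

def walker2d_done(z_body, theta):
    """Episode terminates when the torso height leaves [0.8, 2.0]
    or the torso angle satisfies |theta| > 1.0 (2.0 assumed as the upper bound)."""
    return z_body < 0.8 or z_body > 2.0 or abs(theta) > 1.0
```

The other robots differ only in the control-cost coefficient, the contact penalty (Ant), and the healthy-height range.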
The first class keeps the robot's configuration and physical parameters fixed but varies the task, which is realized by setting different reward functions: in Walker2d-Diff-Velocity and HalfCheetah-Diff-Velocity, the robot moves forward at different target velocities, with 100 training tasks and 30 testing tasks with different velocity settings; in Ant-Random-Goal and Humanoid-Random-Dir, the robot moves in different directions in the 2D plane, with 100 training tasks and 30 testing tasks with different directions. The second class keeps the task fixed but varies the robot's physical parameters: in Walker2d-Diff-Params and Ant-Diff-Params, the robots have different parameters, with 40 training tasks and 10 testing tasks with different physical parameters. To investigate the advantages of the proposed algorithm, we compare the VIB-based meta-reinforcement learning algorithm with MAML and ProMP, with learning parameters taken from the original papers.
Owing to the universal approximation capability of DNNs, we parameterize the action-value function, state-value function, policy, and latent space encoder as feed-forward DNNs with three hidden layers of 300 neurons each. Table 2 lists the parameters used in the algorithm for comparative evaluation.

Comparative Study
To show the efficiency and effectiveness of the presented VIB-based off-policy actor-critic algorithm, we compare it with MAML and ProMP from two aspects: sample efficiency during training, and asymptotic performance after convergence. To increase the statistical reliability of the results, all results are obtained by taking the empirical mean and variance over 10 random seeds. The results are shown in Figures 3–5, where the solid lines represent the mean, the shaded regions around them denote one standard deviation, and the dotted lines denote the asymptotic performance of the different algorithms in matching colors.
From the results, the proposed algorithm simultaneously achieves large sample-efficiency gains during training and improved asymptotic performance. We compute the sample-efficiency improvement factor as

N_improve = t^{MAML|ProMP}_ReachingAsymptotic / t^{VIB}_ReachingAsymptotic,

where t^{VIB}_ReachingAsymptotic denotes the time at which the proposed VIB-based meta-RL algorithm reaches the asymptotic performance of the baseline MAML or ProMP, and t^{MAML|ProMP}_ReachingAsymptotic denotes the time at which the baseline reaches its asymptotic performance. As shown in Table 3, sample efficiency improves by at least 200 times; in particular, a 5000-fold improvement is achieved on Walker2d-Diff-Velocity and Humanoid-Random-Dir. The Humanoid task is especially challenging due to its extremely high-dimensional state and action spaces and its highly nonlinear dynamics. Therefore, the proposed VIB-based meta-reinforcement learning algorithm realizes efficient skill adaptation.
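The improvement factor is a simple ratio of environment-interaction budgets; the step counts below are illustrative numbers, not figures from Table 3:

```python
def efficiency_gain(t_baseline_asymptotic, t_vib_reach_baseline):
    """N_improve = t^{MAML|ProMP} / t^{VIB}: how many times fewer environment
    steps VIB needs to match the baseline's asymptotic return."""
    return t_baseline_asymptotic / t_vib_reach_baseline

# e.g. a baseline converging after 1e8 steps, matched by VIB after 2e4 steps,
# corresponds to a 5000-fold sample-efficiency improvement.
gain = efficiency_gain(1e8, 2e4)
```

Both timestamps are read off the same learning-curve plot at the baseline's asymptotic return level.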

Conclusions
In this paper, we revisited the problem of efficiently adapting skills obtained with DRL methods to novel, unseen environments or tasks. Human beings learn new skills very efficiently because we can draw inferences about other cases from a single instance, adapting acquired skills into new ones. Inspired by this observation, we argued that if a robot can extract the basic tasks and the corresponding basic skills from the task space, then when a new task is encountered, these basic tasks and skills can be efficiently transferred and composed into new skills. We therefore took advantage of variational information bottleneck techniques and developed a latent-space-based meta-reinforcement learning algorithm. Empirical results on MuJoCo benchmark robotic locomotion tasks show that the variational information bottleneck-based meta-reinforcement learning algorithm realizes efficient skill learning and transfer. This work thus takes a substantial step toward implementing learning-based algorithms for practical robotic skill learning. In the future, we will apply our algorithm to more complicated tasks, such as quadrotor aerobatic flight with rapidly changing aerodynamics, and to higher-dimensional tasks, such as tasks with image input.