Intrinsic Motivation Based Hierarchical Exploration for Model and Skill Learning

Hierarchical skill learning is an important research direction in the study of both human and machine intelligence. However, many real-world problems have sparse rewards and long time horizons, which pose challenges for hierarchical skill learning and lead to the poor performance of naive exploration. In this work, we propose an algorithmic framework called surprise-based hierarchical exploration for model and skill learning (Surprise-HEL). The framework leverages surprise-based intrinsic motivation to improve sampling efficiency and drive exploration, and it combines this intrinsic motivation with hierarchical exploration to speed up model learning and skill learning. Moreover, the framework incorporates reward-independent incremental learning rules and a technique of alternating model learning and policy updates to handle the changing intrinsic rewards and changing models. Together, these components enable the framework to implement incremental and developmental learning of models and hierarchical skills. We tested Surprise-HEL on a common benchmark domain, Household Robot Pickup and Place. The evaluation results show that the Surprise-HEL framework can significantly improve the agent's efficiency in model and skill learning in a typical complex domain.


Introduction
In the fields of psychology and neuroscience, one of the core objectives is to understand the formal structure of behavior [1,2]. It has been shown that behavior displays a hierarchical structure in humans and animals: simple actions are combined into coherent subtasks, and these subtasks are further combined to achieve higher-level goals [3]. Hierarchical structures are readily observed in daily life: opening a faucet is the first step of the task 'cleaning vegetables', which is in turn part of the bigger task 'cooking a dish'. A detailed formal analysis of the hierarchical structure of behavior can be found in Whiten's paper [4].
Hierarchical structure-based learning and planning constitute one of the most important research fields in artificial intelligence. Compared with a flat representation, a hierarchical representation is more efficient and compact, and is computationally simpler for encoding complex behaviors at the neural level [5]. Moreover, hierarchical representations can accelerate the discovery of new hierarchical behaviors through planning and learning [6-9].
Exploration is fundamental to hierarchical learning and planning. However, exploration in environments with sparse rewards is challenging, as exploration over the space of primitive action sequences is unlikely to produce a reward signal. In humans, the problem of sparse rewards is solved by intrinsic motivation based exploration [10-12]. Intrinsic motivation is defined as the doing of an activity for its own sake, rather than to directly solve some specific problem [10]. When people are intrinsically motivated, they explore and practice for the inherent satisfaction of the activity itself, which naturally exposes them to novel and informative situations.

Related work
Intrinsic motivation takes many forms, including innate preferences, empowerment [15-17], novelty [18-20], surprise [21,22], predictive confidence, habituation [23], and so on [13,14]. Mohamed et al. [17] used mutual information to define an internal drive called empowerment. Christopher et al. [12] formulated intrinsic rewards from a function of the agent's confidence. Burda et al. [24] studied purely curiosity-driven learning across 54 standard benchmark environments. Singh et al. [25] adopted an evolutionary perspective to optimize the reward framework and designed a primary reward function by capturing evolutionary pressures, which leads to a notion of intrinsic and extrinsic motivation.
In this work, we use intrinsic motivation to drive exploration and skill learning. Substantial work has been done in this field.
Kulkarni et al. [26] presented a hierarchical-DQN (h-DQN) framework. The framework integrated hierarchical action-value functions and intrinsic motivation. The focus of their work is to make a decision at different temporal scales, but the intrinsic motivation is independent of the current performance of the agent.
McGovern et al. [27] and Menache [28] searched for states that act as bottlenecks to generate skills. Tomar et al. [29] used successor representation to generalize the bottleneck approach to continuous state space. These works mainly focus on building options, rather than improving skill learning performance from intrinsic motivation based exploration.
Achiam et al. [22], Alemi et al. [30], and Klissarov et al. [31] used approximations of the KL-divergence to form intrinsic rewards. Frank et al. [32] empirically demonstrated that artificial curiosity enables humanoid robots to explore the environment through information gain maximization. Fu et al. [33] proposed a discriminator that differentiates states from each other in order to judge whether a state has been sufficiently visited. Still et al. [34] defined a curiosity-based, weighted-random exploration mechanism that enables agents to investigate unvisited regions iteratively. Emigh et al. [35] proposed a framework called Divergence-to-Go to quantify the uncertainty of each state-action pair. These approaches are inefficient in problems with large state and action spaces because their underlying models operate at the level of primitive actions.
In this work, our framework combines intrinsic motivation-driven exploration with hierarchical exploration to accelerate model learning and hierarchical skill learning. Moreover, the framework motivates the agent to reuse learned skills to improve its capabilities.

Semi-Markov Decision Process (SMDP)
The semi-Markov decision process (SMDP) is the mathematical model underlying hierarchical methods, and provides the theoretical foundation for hierarchical reinforcement learning.
The SMDP is a generalization of the Markov decision process (MDP) in which actions are temporally extended: an action may take a variable, possibly random, amount of time to complete, and state transitions and rewards depend on that duration.

Option
There are three general approaches to formalizing hierarchical skills: options [38], MAXQ [36], and HAMs [39,40]. In this paper, we work within the option framework (temporally extended actions) for skill learning.
An option is a short-term skill that consists of a policy over a specified region of the state space and a termination function for leaving that region. Formally, an option $o$ is defined as a three-tuple $\langle I, \pi, \beta \rangle$, where $I \subseteq S$ is the initiation set, $\beta : S \to [0,1]$ is a termination condition function, and $\pi$ is an intra-policy. The action space of the intra-policy contains primitive actions and low-level options. Once an option is selected, it is executed according to $\pi$ until termination.
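To make the tuple concrete, an option can be rendered as a small data structure. The following Python sketch is illustrative only; the names and the toy executor are our own, not the paper's implementation:

```python
import random
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Option:
    """A temporally extended action (skill)."""
    initiation: Callable[[Any], bool]    # I: states where the option may start
    policy: Callable[[Any], Any]         # pi: intra-policy over actions/low-level options
    termination: Callable[[Any], float]  # beta: probability of terminating in a state

def run_option(option: Option, state: Any, step: Callable[[Any, Any], Any],
               max_steps: int = 100) -> Any:
    """Execute the option's intra-policy until its termination condition fires."""
    assert option.initiation(state), "option not applicable in this state"
    for _ in range(max_steps):
        action = option.policy(state)
        state = step(state, action)
        if random.random() < option.termination(state):
            break
    return state
```

In a 1-D corridor where `step(s, a) = s + a`, an option whose policy always moves right and whose termination fires at position 5 carries the agent from any start left of 5 directly to 5.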
Low-level options can be regarded as primitive actions embedded within a high-level option. Moreover, an option can also be regarded as a sub-MDP embedded in a larger MDP. Therefore, all the mechanisms associated with MDP learning also apply to learning options [41].

Information Theory
Entropy (Information Entropy). Entropy measures the uncertainty of a random variable. The entropy $H(X)$ of a discrete random variable $X$ is defined as
$$H(X) = -\sum_{x \in \mathcal{X}} q(x) \log q(x),$$
where $\mathcal{X}$ is the value space of the random variable $X$ and $q(x) = \Pr\{X = x\}$ is its probability mass function [42]. Moreover, the entropy of the random variable $X$ can also be interpreted as the expected value of the random variable $\log \frac{1}{q(X)}$:
$$H(X) = \mathbb{E}\left[\log \frac{1}{q(X)}\right].$$
The entropy is always non-negative: $H(X) \ge 0$.
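As a quick check of the definition, a few lines of Python compute the entropy of a discrete distribution (base-2 logarithm, so the result is in bits):

```python
import math

def entropy(dist):
    """Shannon entropy H(X) = -sum q(x) log2 q(x) of a probability mass
    function given as a list of probabilities; terms with q(x) = 0
    contribute nothing, following the usual 0 log 0 = 0 convention."""
    return -sum(q * math.log2(q) for q in dist if q > 0)
```

A fair coin has 1 bit of entropy, a deterministic variable 0 bits, and a uniform distribution over four outcomes 2 bits, the maximum for that support size.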

Relative-Entropy
The relative entropy $D(q \,\|\, p)$ is a measure of the inefficiency of assuming that the distribution is $p$ when the true distribution is $q$ [42]. The relative entropy (Kullback-Leibler divergence) between two probability mass functions $q(x)$ and $p(x)$ is defined as
$$D(q \,\|\, p) = \sum_{x \in \mathcal{X}} q(x) \log \frac{q(x)}{p(x)}.$$
The definition uses the conventions $0 \log \frac{0}{0} = 0$, $0 \log \frac{0}{p} = 0$, and $q \log \frac{q}{0} = \infty$.
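The same can be done for the relative entropy; this sketch applies the 0 log 0 = 0 convention and assumes p(x) > 0 wherever q(x) > 0:

```python
import math

def kl_divergence(q, p):
    """Relative entropy D(q||p) = sum q(x) log2(q(x)/p(x)) between two
    probability mass functions given as equal-length lists; assumes
    p(x) > 0 wherever q(x) > 0."""
    return sum(qx * math.log2(qx / px) for qx, px in zip(q, p) if qx > 0)
```

Note that D(q||p) is not symmetric in q and p, which is why the direction (true model against learned model) matters in the surprise formalization used in this paper.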

Surprise-Based Intrinsic Motivation
Surprise is an intrinsic motivation that arouses interest and drives learning. In this section, we formalize surprise-based intrinsic motivation with information theory and use it to weigh the tradeoff between exploration and exploitation.
Surprise is used to measure the uncertainty of a state-action pair: let surprise denote a measure of the exploration and learning progress when visiting a state-action pair. The surprise changes with new experience. That is, as the agent continues to explore the environment, the true transition model $Q$ and the learned transition model $P$ in the context of $(s, a)$ get closer, and the surprise gets smaller.
Exploration and exploitation are two different modes of acting, and balancing them is an inherent challenge in reinforcement learning [43]. Exploitation means using current information to make the best decision, while exploration means collecting more information. The tradeoff is that the best long-term policy may include short-term sacrifices made to gather enough information for the best overall decision.
The update step of the surprise-based exploration policy makes progress on an approximation to the optimization problem
$$\max_{\pi} \; V^{\pi} + \eta \, \mathbb{E}_{(s,a)}\!\left[ D_{KL}\!\big(Q(\cdot \mid s,a) \,\|\, P(\cdot \mid s,a)\big) \right],$$
where $V^{\pi}$ is the performance measure and $\eta > 0$ is an exploration-exploitation trade-off coefficient. The left-hand term maximizes exploitation, and the right-hand term drives exploration. In the right-hand term, we formalize the surprise in the context of $(s, a)$ as the divergence between the true transition model $Q$ and the learned transition model $P$, in other words, the relative entropy of $Q$ with respect to $P$.
Equation (5) shows that the more often a region of the transition state space is visited, the closer $P$ is to $Q$ in that region. In other words, the surprise between $P$ and $Q$ is large in unfamiliar regions and tends to 0 in familiar regions.
The exploration incentive in Equations (4) and (5) is intended to minimize the agent's surprise about its experience. If the surprise is large, the agent's learned model of the environment is insufficient to predict the true environment dynamics and state transitions, and the region should be investigated repeatedly until the surprise decreases. If the surprise is low, the learned transition model approximates the true transition model well; the agent understands that part of the environment and can move on to explore more uncertain parts to acquire more knowledge and improve its performance.
According to the surprise incentive, we propose a reshaped intrinsic reward in which the intrinsic reward of $(s,a)$ is its surprise, $D_{KL}\big(Q(\cdot \mid s,a) \,\|\, P(\cdot \mid s,a)\big)$, added to the original reward $r(s,a)$ and updated with a positive step-size parameter $\alpha \in (0,1)$; the intrinsic rewards of all state-action pairs are initialized to a constant (e.g., 1). The intrinsic rewards gradually decrease during learning, and their rate of decline also decreases. As mentioned above, the intrinsic reward in unfamiliar regions is higher than that in familiar regions. Moreover, after thorough exploration, the divergence between the true model and the learned model is so small that the intrinsic reward decreases to zero. Essentially, this encourages the agent to select less-frequently-selected actions in the current state, so as to go where it is unfamiliar. However, the intrinsic reward (Equation (6)) cannot be computed directly, because the true transition model $Q$ is unknown. To solve this problem, we introduce the following theorems.
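The exact update in Equation (6) depends on the model divergence, but its qualitative behavior (a constant initial value that decays monotonically toward zero with a decreasing rate as (s, a) is revisited) can be sketched with a hypothetical geometric-decay surrogate; this is not the paper's exact rule:

```python
class IntrinsicReward:
    """Surprise-style intrinsic reward table: every state-action pair starts
    at a constant r0 and decays toward zero each time it is visited.
    A sketch of the qualitative behaviour only, not the paper's Equation (6)."""
    def __init__(self, r0=1.0, alpha=0.5):
        self.r0, self.alpha = r0, alpha
        self.table = {}
    def reward(self, s, a):
        # Unvisited pairs keep the initial constant, so unfamiliar regions
        # look more rewarding than familiar ones.
        return self.table.get((s, a), self.r0)
    def visit(self, s, a):
        # Geometric decay: successive decrements shrink, mirroring the
        # decreasing rate of decline described in the text.
        self.table[(s, a)] = (1.0 - self.alpha) * self.reward(s, a)
        return self.table[(s, a)]
```

Under this surrogate, a pair visited twice with `alpha=0.5` drops from 1.0 to 0.5 to 0.25: the first decrement (0.5) is larger than the second (0.25), and the reward tends to zero in the limit of many visits.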
Theorem 1 (Jensen's inequality [42]). If $f$ is a convex function and $X$ is a random variable, then
$$\mathbb{E}\,f(X) \ge f(\mathbb{E}X).$$
Moreover, if $f$ is strictly convex, equality implies that $X = \mathbb{E}X$ with probability 1 (i.e., $X$ is a constant).
Theorem 2 (Convexity of relative entropy [42]). $D_{KL}(q \,\|\, p)$ is convex in the pair $(q, p)$; that is, if $(q_1, p_1)$ and $(q_2, p_2)$ are two pairs of probability mass functions, then
$$D_{KL}\big(\lambda q_1 + (1-\lambda) q_2 \,\|\, \lambda p_1 + (1-\lambda) p_2\big) \le \lambda D_{KL}(q_1 \,\|\, p_1) + (1-\lambda) D_{KL}(q_2 \,\|\, p_2)$$
for all $0 \le \lambda \le 1$.
According to these theorems, the update rule of the intrinsic reward can be approximated from sampled transitions without access to the true model $Q$. Ideally, we would like the intrinsic rewards to approach zero in the limit $P \to Q$, because in that case the agent has sufficiently explored the state space. Moreover, with continued exploration and learning, the value of the step-size $\alpha$ gradually decreases. Based on these constraints, the admissible range of $\alpha$ is bounded in terms of the initial intrinsic reward in the context of $(s, a)$, and the update of the intrinsic reward is a monotonically decreasing process.

Reward Independent Hierarchical Skill Model
In this work, we use the option framework to represent skills. However, the intrinsic rewards change continually during exploration and learning, so the traditional option model is not applicable in this case; we need a reward-independent option model. Yao et al. [44] proposed the linear universal option model (UOM) to deal with such problems. In this section, based on the incremental learning rule [41] and the UOM [44], we propose reward-independent incremental learning rules for hierarchical skill learning. Moreover, we describe the formal expression of the reward-independent option model.
Formally, an option model of the long-term effect of executing option $o$ is defined as $\langle R^o, P^o \rangle$. This model comprises two parts: the reward model $R^o$ and the transition model $P^o$, which are analogous to the reward function and the transition function of an MDP, respectively.
The reward model gives the expected discounted reward received from executing option $o$ in state $s$ at time $t$ until its termination:
$$R^o(s) = \mathbb{E}\big[ r_{t+1} + \gamma\, r_{t+2} + \cdots + \gamma^{k-1} r_{t+k} \big],$$
where $k$ is the random execution time of option $o$ from start to termination, $r_{t+k}$ is the reward at time $t+k$, and $\gamma \in (0,1]$ is a discount factor, which determines how much weight future rewards receive in the reward model.
The transition model gives the discounted termination distribution of the option:
$$P^o(s' \mid s) = \sum_{k=1}^{\infty} \gamma^{k}\, p(s', k),$$
where $p(s', k)$ is the probability that option $o$ terminates at state $s'$ after $k$ time steps.
The option model satisfies Bellman equations analogous to those of the value function. For a Markov option $o = \langle I, \pi, \beta \rangle$, the model is, for all $s, s', x \in S$:
$$R^o(s) = \sum_{a} \pi(s,a)\Big[ r(s,a) + \gamma \sum_{s'} p(s' \mid s,a)\,\big(1-\beta(s')\big)\,R^o(s') \Big],$$
$$P^o(x \mid s) = \sum_{a} \pi(s,a)\,\gamma \sum_{s'} p(s' \mid s,a)\Big[ \big(1-\beta(s')\big)\,P^o(x \mid s') + \beta(s')\,\mathbb{1}_{s'=x} \Big],$$
where $\mathbb{1}_{s'=x}$ is an indicator function that equals 1 if its condition is satisfied and 0 otherwise. According to Equations (14) and (15), the temporal-difference update rules are
$$R^o(s) \leftarrow R^o(s) + \alpha\big[ r + \gamma\,(1-\beta(s'))\,R^o(s') - R^o(s) \big],$$
$$P^o(x \mid s) \leftarrow P^o(x \mid s) + \alpha\big[ \gamma\,(1-\beta(s'))\,P^o(x \mid s') + \gamma\,\beta(s')\,\mathbb{1}_{s'=x} - P^o(x \mid s) \big],$$
where $\alpha$ is a positive step-size parameter.
From the perspective of a linear setting, the temporal-difference approximation of $R^o$ is defined as
$$R^o(s) \approx f^{\top} U^o \phi(s),$$
where $f$ is the least-squares approximation of the expected one-step reward under option $o$, $U^o$ is the discounted state occupancy function (a reward-independent function), and $\phi(s)$ is the feature vector that maps any state $s \in S$ to its $n$-dimensional feature representation.
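Assuming a tabular representation, the temporal-difference update rules for the option model above can be sketched as follows (our own rendering; `beta` is the option's termination function, `R[s]` the reward model, and `P[s]` the discounted termination distribution over states):

```python
import numpy as np

def td_update_option_model(R, P, s, r, s_next, beta, gamma=0.95, alpha=0.1):
    """One temporal-difference update of a tabular option model (R, P)
    after observing transition (s, r, s_next) while the option executes."""
    b = beta(s_next)  # termination probability at the next state
    # Reward model: immediate reward plus discounted continuation value.
    target_R = r + gamma * (1.0 - b) * R[s_next]
    R[s] += alpha * (target_R - R[s])
    # Transition model: either terminate now at s_next (one-hot row),
    # or continue and inherit the next state's termination distribution.
    one_hot = np.eye(P.shape[1])[s_next]
    target_P = gamma * ((1.0 - b) * P[s_next] + b * one_hot)
    P[s] += alpha * (target_P - P[s])
```

With a termination function that always fires (beta = 1), a single update from state 0 with reward 1 moves `R[0]` a step toward 1 and `P[0]` a step toward the discounted one-hot vector of the next state, as the Bellman targets prescribe.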
Combining the Bellman equations and the linear setting of the option model, we propose reward-independent incremental learning rules. The learning of the option model is divided into two parts: learning the reward-independent occupancy function and learning the transition model. In high-level options, action selection is performed over options rather than primitive actions, and the modeling of option selection and the update of option models are the same as for primitive actions. Thus, we can implement incremental learning of hierarchical skill models in lifelong learning. The following section details the implementation of the incremental learning framework for hierarchical skills.
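Under linear function approximation, one plausible rendering of the two learning parts (the occupancy matrix and the one-step reward weights) is the following UOM-style sketch; the update rules here are our own assumption, not the paper's exact equations:

```python
import numpy as np

def uom_td_update(U, f, phi_s, phi_next, r, beta_next, gamma=0.95, alpha=0.1):
    """One TD update of a linear, reward-independent option model (sketch).
    U approximates the discounted state occupancy, with the fixed point
    U phi(s) ~ phi(s) + gamma (1 - beta(s')) U phi(s').  f approximates the
    expected one-step reward, so the reward model reads out as f . (U phi(s))."""
    # Occupancy-model TD error and outer-product update.
    delta = phi_s + gamma * (1.0 - beta_next) * (U @ phi_next) - U @ phi_s
    U += alpha * np.outer(delta, phi_s)
    # Least-squares-style update of the one-step reward weights.
    f += alpha * (r - f @ phi_s) * phi_s

def option_reward(U, f, phi_s):
    """Reward-independent read-out: R^o(s) ~ f^T U phi(s)."""
    return f @ (U @ phi_s)
```

Because U is learned without reference to any particular reward, the same occupancy matrix can be combined with a new f whenever the (intrinsic) rewards change, which is the point of the reward-independent decomposition.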

Learning Algorithm
In this work, our framework allows agents to explore the environment and learn skills in a developmental way, and the exploring trajectories can be multi-step. Furthermore, the learned model can be reused to perform complex specific tasks later.
Algorithm 1 gives an overview of the algorithm framework. The algorithm contains two parts. The first part is surprise-based hierarchical exploration and learning (Algorithm 1, lines 2-11). The second part is regularly updating the basic surprise-based exploration policy (Algorithm 1, lines 12-13) as well as the model and the intra-policy of the options (Algorithm 1, lines 14-16).
In the first part, the agent starts with the basic surprise-based hierarchical exploration policy (Algorithm 1, line 2), which is described in detail in Algorithm 2, lines 1-9. In the learning process, the primitive action models have the same representation and learning approach as the option models; the difference is that primitive action models are modeled over a single time step, and their termination functions return one in all next states. The framework then executes the option $o_t$ and observes the next state (Algorithm 1, line 3). After a round of action selection and execution, the framework updates the intrinsic rewards and the models of primitive actions with the current experience (Algorithm 1, line 4); the update methods are described in Sections 4.1 and 4.2, respectively. Next, the framework determines whether to create an option by judging whether the current state is a goal state visited for the first time (Algorithm 1, lines 5-8). The new option is made up of a valid goal state, a pseudo-reward function, and a termination function. In the second part, every K time steps the framework updates the basic surprise-based exploration policy with the current experience (Algorithm 1, lines 12-13; detailed in Algorithm 2, lines 11-18). Every T time steps, the framework updates the model and the intra-policy of the learned options with the current experience (Algorithm 1, lines 14-16). The update method of the intra-policy is described in lines 20 to 29 of Algorithm 2, and the update method of the option model is described in Section 4.2. The advantage of this alternating learning and updating is to accelerate the learning of the option policies while avoiding the harmful effects of learning from option policies that are not yet accurate.
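The control flow of Algorithm 1 can be summarized in a runnable skeleton; all routine names here are placeholders for the paper's components, which we stub out:

```python
def surprise_hel(env, explore_policy, update_models, update_policy,
                 update_options, is_new_goal, make_option,
                 total_steps=1000, K=50, T=200):
    """Skeleton of the Surprise-HEL loop: explore hierarchically, update
    intrinsic rewards and primitive-action models every step, create an
    option on the first visit to a goal state, refresh the exploration
    policy every K steps, and refine option models/intra-policies every
    T steps."""
    options, state = [], env.reset()
    for t in range(1, total_steps + 1):
        o = explore_policy(state, options)       # surprise-based hierarchical choice
        state_next, reward = env.step(o)
        update_models(state, o, reward, state_next)
        if is_new_goal(state_next):              # first visit to a goal state
            options.append(make_option(state_next))
        if t % K == 0:
            update_policy()                      # refresh exploration policy
        if t % T == 0:
            update_options(options)              # refine option models/intra-policies
        state = state_next
    return options
```

The two periodic hooks make the alternating schedule explicit: policy refreshes happen total_steps/K times and option refinements total_steps/T times, independently of when options are created.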
Algorithm 2 contains specific implementations of some functions in Algorithm 1. The Surprise-based Hierarchical Exploration function encourages the agent to go where it is unfamiliar, so as to speed up the discovery of new goal states; the agent chooses primitive actions and options according to the surprise-reshaped rewards.

Working Example
In this section, we describe a simple working example to make each definition clearer. Figure 1 shows a simplified version of our experimental environment. The robot can be anywhere in the room. We assume that the robot starts at position $s$, and $s_1$ and $s_2$ are the goal positions. The purpose of the robot is to explore the entire environment aimlessly based on intrinsic motivation, and to learn the skills for reaching the goal positions from any position in the room. The working example proceeds as follows. Step 1: The robot explores the environment according to the surprise-based hierarchical exploration policy (Algorithm 1, line 2; Algorithm 2, lines 1-9), and continuously updates the transition models and intrinsic rewards of primitive actions (Algorithm 1, line 4).

Step 2: When the robot first reaches position $s_1$, it finds that $s_1$ is a goal position whose associated option has not yet been created, so it creates a new option $o_{s_1}$. Step 3: During the learning and exploration process, the framework updates the surprise-based exploration policy every K time steps as follows (Section 4.3; Algorithm 2, line 17).
where $r_{int}(s,a)$ is the intrinsic reward, reshaped by surprise as described in Equation (10) (Section 4.1), and $a$ is a primitive action or an option.
Step 4: Every T time steps, the framework optimizes the models and intra-policies of the learned options through simulation, updating the option models according to Equations (19) and (20). Step 5: The robot uses the intra-policies of the learned options to reach unfamiliar areas more quickly; this is hierarchical exploration. When the robot first encounters position $s_2$, it constructs the new option $o_{s_2}$, and it can use the intra-policy of option $o_{s_1}$ to learn the intra-policy of option $o_{s_2}$ faster. For example, the intra-policy of option $o_{s_2}$ from $s$ to $s_2$ only needs to be learned from $s_1$ to $s_2$, because it can reuse the existing intra-policy of option $o_{s_1}$ from $s$ to $s_1$.
Figure: (a) incomplete intra-policy; (b) optimal intra-policy. Step 6: After many time steps, the learned model gets closer to the true model, and the intrinsic rewards tend to zero in the limit $P \to Q$. This means that the robot has sufficiently explored the state space. Moreover, the learned options underlie both the ability to choose to spend effort and time specializing at particular tasks, and the ability to collect and exploit previous experience so as to solve harder problems over time with less cognitive effort.

Household Robot Pickup and Place Domain
We tested the performance of our framework in the Household Robot Pickup and Place domain, in which the robot picks up an object and places it at a designated position in a set of discrete rooms (Figure 3). The test domain is a variant of the hierarchical reinforcement learning household robot [45]. The experiments were designed with increasing state sizes to demonstrate that the proposed methods scale to large problems. We preset ten goal positions, g1~g10. The destination position and the initial position of the object were randomly selected from the ten positions and were always distinct; for example, if the destination is g1, then the initial position of the object is selected from g2~g10. There are also many other notable positions (D1~D10). The goal positions and notable positions were used as sub-goals to create corresponding options (e.g., option $o_{D_1}$ is defined as the transition from room 4 to room 1, and option $o_{g_1}$ is defined as the transition from any position in room 1 to position g1).
In this work, we split the experiment into separate phases. In the first phase, the robot first explores the environment without a specific task, and it learns the skills in a developmental way. In the course of exploration and learning, the surprise-based hierarchical exploration can improve the exploration efficiency and speed up the model learning and skill learning. The robot also continues to refine the models and intra-policies of skills through simulation. In the second phase, the robot leverages the learned skills to perform the specific task (Robot Pickup and Place) in the framework of hierarchical MCTS (Monte Carlo Tree Search).

Performance Evaluation
Surprise-based intrinsic motivation is defined as the doing of an activity for its own sake rather than to directly solve some specific problem. Therefore, evaluating the benefits of surprise-based hierarchical exploration is not as simple as evaluating the performance of standard reinforcement learning on a particular task. We therefore evaluate our framework with model accuracy, loss value, the number of visited states, and the performance on a specific task using the learned model and options.
A typical loss is the divergence between the learned model and the true model:
$$\mathcal{L} = \mathbb{E}_{(s,a)}\!\left[ D_{KL}\!\big(Q(\cdot \mid s,a) \,\|\, P(\cdot \mid s,a)\big) \right],$$
where $P$ denotes the learned transition model, which approximates the true model $Q$. To compare the performance of our framework with that of other methods, we used three different exploration policies to guide behavior as the agent learns skills in the domain. The comparison helps verify whether the proposed framework can accelerate exploration and skill learning.
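Taking the loss to be the divergence between learned and true transition models, consistent with the surprise definition used throughout, an average-over-pairs computation could look like this illustrative sketch (the dict-based model layout is our own assumption):

```python
import math

def model_loss(Q, P):
    """Average KL divergence D(Q(.|s,a) || P(.|s,a)) over state-action pairs.
    Q and P map each (s, a) key to a list of next-state probabilities;
    assumes P assigns positive mass wherever Q does."""
    total = 0.0
    for sa in Q:
        total += sum(q * math.log2(q / p)
                     for q, p in zip(Q[sa], P[sa]) if q > 0)
    return total / len(Q)
```

The loss is zero exactly when the learned model matches the true model on every pair, and grows with the number and severity of mismatched pairs.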
Random exploration. A random exploration chooses a random action from the sets of primitive actions and options. It executes each action or option to completion before choosing another one. It is the baseline approach.
Exploration with Exemplar Models (EX²) [33]. We use the heuristic bonus employed in their experiments to reshape the intrinsic rewards and drive exploration.
Surprise-HEL. This method employs the surprise-based hierarchical exploration policy. The combination of surprise and hierarchical exploration allows the agent to reach areas with more information faster.

Results
In this section, we present the evaluation results for the Household Robot Pickup and Place domain. We first show the performance of model learning and skill learning when there is no external reward, and then apply the learned model to a specific task to evaluate task performance. While learning relatively accurate models within a finite number of steps, it is more important for the algorithm to explore unfamiliar and useful parts of the domain. Figures 4e,f show the average number of visited states for the three methods in both the 10 × 10 and 20 × 20 domains, which reflects the skill-learning performance of the different algorithms. Surprise-HEL performed better than the other two algorithms, and random exploration failed to converge in both domains. The reason is that Surprise-HEL leverages the learned options and the surprise-based intrinsic motivation to speed up the exploration of unfamiliar areas. As a result, the more states the robot visits, the more options are discovered, which further accelerates skill learning.
For the same number of steps in the 10 × 10 domain, in terms of model learning, the maximum differences in model accuracy compared with EX² and random exploration were 4.73 and 19.64, and the maximum improvements were 15.95% and 37.00%, respectively. Similarly, in the 20 × 20 domain, the maximum differences were 7.38 and 23.45, and the maximum improvements were 18.80% and 69.17%, respectively.
In terms of skill learning, for the same number of steps, the maximum differences in visited states compared with EX² and random exploration were 85.46 and 193.53, and the maximum improvements were 38.40% and 105.37%, respectively, in the 10 × 10 domain. Similarly, in the 20 × 20 domain, the maximum differences were 361.90 and 790.83, and the maximum improvements were 42.57% and 150.25%, respectively. The specific analysis results are shown in Table 1. These results show that Surprise-HEL consistently outperformed the other methods, and the performance improvements in both model learning and skill learning were much more dramatic in the larger domain, further demonstrating the efficiency of our framework.

Performance of Specific Task
Next, we applied the learned model and skills to a specific task: the robot picks up the object at position g5 and places it at position g9. The initial position of the robot was (1, 1). There were three state variables: the position of the robot, the position of the object, and the position of the destination. There were four primitive actions: move up, move down, move left, and move right. Each primitive action had a reward of −1, and reaching the object's initial position and the destination position each gave a reward of 10. The maximal planning horizon was 400.
The performance was evaluated using the average cumulative return as a function of the number of simulations; each data point was averaged over 100 runs. Figure 5 shows the cumulative reward received by option-based hierarchical MCTS and flat MCTS over $2^{18}$ simulations of the task. The option-based hierarchical MCTS significantly outperformed flat MCTS under limited computational resources and found the optimal policy faster. The reason is that option-based hierarchical search significantly reduces computational cost while speeding up learning and planning. Moreover, the rollouts of option-based hierarchical MCTS sometimes terminate early, since the learned options help the rollouts reach more useful parts of the state space; therefore, option-based MCTS can find a strategy to complete the task faster. Intuitively, the benefit of option-based hierarchical MCTS should become more significant in larger domains with longer planning horizons. Additionally, a primary goal of our work is to apply this framework to robot urban search and rescue (USAR) in cluttered environments. However, the state space of the real world is continuous, dynamic, and partially observable, with many uncertain factors; applying our framework to this domain is a complex project that we leave for future work.

Conclusions
In this work, we proposed the framework of surprise-based hierarchical exploration for model and skill learning (Surprise-HEL). This framework has three main features: surprise-based hierarchical exploration, reward-independent incremental learning rules, and a technique of alternating model learning and policy updates. Together, these features implement the incremental and developmental learning of models and hierarchical skills. In the experiments in the household robot domain, we empirically showed that Surprise-HEL performed much better than the other algorithms in both model learning and skill learning, and that its advantage grew in large-scale problems. Moreover, we applied the learned model to a specific task to evaluate task performance, and the results showed that skill-based hierarchical planning significantly outperformed flat planning under limited computational resources. In future work, we plan to apply this method to complex real-world application scenarios such as rescue robots in cluttered and unknown urban search and rescue (USAR) environments. We expect that our method can accelerate a robot's exploration and learning in complex and unknown environments, and improve its search and rescue capabilities based on the learned skills.