Future Internet
  • Article
  • Open Access

13 February 2020

Latent Structure Matching for Knowledge Transfer in Reinforcement Learning

School of Computer Engineering and Science, Shanghai University, Shanghai 200444, China
* Author to whom correspondence should be addressed.

Abstract

Reinforcement learning algorithms usually require a large number of empirical samples and converge slowly in practical applications. One solution is to introduce transfer learning: Knowledge from well-learned source tasks can be reused to reduce the sample requirement and accelerate the learning of target tasks. However, if an unmatched source task is selected, it will slow down or even disrupt the learning procedure. Therefore, it is very important for knowledge transfer to select appropriate source tasks that have a high degree of matching with the target tasks. In this paper, a novel task matching algorithm is proposed to derive the latent structures of the value functions of tasks, and to align these structures for similarity estimation. Through latent structure matching, highly-matched source tasks are selected effectively, from which knowledge is then transferred to give action advice and improve the exploration strategies of the target tasks. Experiments are conducted on a simulated navigation environment and the mountain car environment. The results illustrate the significant performance gain of the improved exploration strategy compared with the traditional ϵ-greedy exploration strategy. A theoretical proof is also given to verify the improvement of the exploration strategy based on latent structure matching.

1. Introduction

Reinforcement learning (RL) is a paradigm in which an agent guides its actions based on the rewards obtained from trial-and-error interaction with the environment [1,2]. The ultimate goal of RL is to learn a mapping from states to actions with maximum return [3]. In recent years, reinforcement learning has been successfully applied in areas such as medical care [4], finance [5], and transportation [6]. However, in RL, a small change in the problem configuration usually requires a completely new round of training, which inevitably brings slow convergence and a growing requirement for large amounts of empirical samples. The introduction of transfer learning [7] into RL provides a promising solution to this problem.
Transfer learning guides the transfer of useful knowledge from well-learned source tasks to target tasks [8]. In RL, the knowledge obtained from previous situations can be reused as heuristics to achieve effective knowledge transfer, thus speeding up the learning procedure in new situations and reducing the sample requirement [3]; moreover, knowledge transfer can largely mitigate the issue caused by a change in the problem configuration mentioned above.
However, if an inappropriate source task is selected in knowledge transfer, it will lead to negative transfer [9], which will slow down or even disrupt the learning of a new target task. Therefore, it is very important for knowledge transfer to select appropriate source tasks that have a high degree of matching with target tasks.
The similarity estimation of tasks is the main way to select matched source tasks in existing works on knowledge transfer for RL. The earliest works are semi-automatic, because they require human intervention for similarity estimation. A typical work [10] defined a one-to-one mapping between states and actions to estimate similarity. In subsequent work [11], a many-to-one mapping of relevant states and actions was defined. These semi-automatic works require manual parameterization of the defined mappings; however, the optimal values of the parameters are not easy to determine by experience or intuition [12]. To overcome this problem, mechanisms to select optimal mappings were developed [13,14], but human intervention is still needed in these works [12].
Later, in order to solve the problems of the earlier works mentioned above, automatic similarity estimation methods [15,16,17,18,19,20,21] were presented. These works make use of data-driven similarity metrics, including the Markov decision process (MDP) similarity metric [20], the Hausdorff metric [21], and the Kantorovich metric [21]. Such metrics are highly interpretable, but they are computationally expensive and have trouble handling situations involving a large number of tasks [19].
Recently, clustering algorithms [22] have been used in some works to tackle large numbers of tasks. In these works, the clustering of policies, value functions, rewards, and dynamics of tasks is modeled as a random process to estimate similarity [23,24]. However, these clustering-based works suffer from the bias introduced by the clustering model, and are sensitive to noise and variation in the policies, dynamics, and value functions of tasks, resulting in inaccurate estimation of task similarity.
The classification and characteristics of existing similarity estimation methods mentioned above are summarized in Table 1.
Table 1. Classification of similarity metrics.
To address the limitations of existing works, we propose a novel task matching algorithm, namely latent structure matching (LSM), which uses low-dimensional embedding to extract the latent essential factors and relationships of tasks. Furthermore, through low-dimensional alignment, LSM effectively matches large numbers of tasks and estimates their similarity, thus finding highly-matched source tasks. Based on the matched source tasks, useful exploratory advice is formed to improve the exploration strategy and to accelerate the learning of target tasks. Experiments are conducted on a simulated maze navigation environment and the mountain car environment. The maze navigation environment has discrete state and action spaces, and the mountain car environment has a continuous state space. The experimental results illustrate the significant performance improvement of RL agents when taking the improved exploration strategy with action advice from the matched source tasks selected effectively by LSM.
Our main contributions in this paper can be summed up as follows:
(i)
A novel task matching algorithm, namely LSM, is proposed for RL to derive latent structures of the value functions of tasks, and to align these structures for similarity estimation. Structural alignment permits efficient matching of large numbers of tasks, and locates the correspondences between tasks.
(ii)
Based on LSM, we present an improved exploration strategy that is built on the knowledge obtained from the highly-matched source task. This improved strategy reduces random exploration in the value function space of tasks, thus effectively improving the performance of RL agents.
(iii)
A theoretical proof is given to verify the improvement of the exploration strategy with latent structure matching-based knowledge transfer (see Appendix B).
In the remainder of this paper, related work is first discussed in Section 2. The proposed method is introduced in Section 3. The detailed experimental procedures and experimental results are presented in Section 4. Finally, a brief conclusion is drawn in Section 5.

3. Method

The proposed method is divided into two stages. Figure 1 shows an overview of the two-staged method composed of latent structure matching and value function transfer.
Figure 1. Overview of the two-staged method composed of latent structure matching and value function transfer. In the first stage, latent structure matching (LSM) is used to measure the similarity between tasks, so as to find the highly-matched source task for the target task. In the second stage, value function transfer is used to improve the agent’s exploration strategy, thereby improving the agent’s performance.
As shown in Figure 1, in the first stage, latent structures are derived to estimate the similarity between tasks and find the highly-matched source task for a new target task. In the second stage, value function transfer is conducted to improve the exploration strategy of agents by using action advice extracted from the highly-matched source task. The details of each stage will be given in the following subsections.

3.1. Latent Structure Matching (LSM)

The algorithmic basis of LSM is manifold alignment [32], through which LSM derives the latent structures of tasks and finds a matched source task for a new target task. LSM is in fact a variant of low rank embedding (LRE). Like standard LRE, it first calculates a reconstruction coefficient matrix for the source and target tasks, and then minimizes the reconstruction error in a low-dimensional space. However, compared with standard LRE, several modifications are made to the reconstruction error function and the reconstruction coefficient matrix for efficient estimation of the similarity between source and target tasks.
Given the action value functions learned for S source tasks, we arrange all of these functions into a vector Q = [Q_l(s)]_{1×L}, where Q_l(s) is the l-th action value function for the s-th source task. Similarly, all of the action value functions learned for T target tasks are arranged into the vector Q′ = [Q′_γ(t)]_{1×Γ}, where Q′_γ(t) is the γ-th action value function for the t-th target task.
Equation (2) can be used to derive the latent structures of the source and target tasks, namely the low-dimensional embeddings of the action value functions Q and Q′. To match the embeddings in the low-dimensional space, we add an alignment term and obtain our error function:
$$ S(\Phi) = (1-\mu)\,\|\Phi - B\Phi\|_F^2 + \mu \sum_{i,j=1}^{m} \|\Phi_i - \Phi_j\|^2\, C_{i,j}, \qquad (3) $$
where μ ∈ [0, 1] is a weight parameter. The first term of (3) represents the reconstruction error and is consistent with the right-hand side of (2); the second term represents the alignment error. Φ ∈ ℝ^{m×d}, B ∈ ℝ^{m×m}, and C ∈ ℝ^{m×m} are all block matrices:
$$ \Phi = \begin{bmatrix} E \\ E' \end{bmatrix}, \qquad B = \begin{bmatrix} \hat{Q} & 0 \\ 0 & \hat{Q}' \end{bmatrix}, \qquad C = \begin{bmatrix} 0 & C(Q, Q') \\ C(Q', Q) & 0 \end{bmatrix}, $$
where m = L + Γ and d is the dimension of the low-dimensional space. E and E′ are the embeddings of Q and Q′, respectively. C(Q, Q′) is the proximity matrix, which can be obtained using the cosine similarity metric. Q̂ and Q̂′ are the reconstruction coefficient matrices of Q and Q′, respectively, which can be obtained by minimizing (1). However, the minimization of (1) in LRE requires frequent iterative calculation. We therefore use the closed-form solution of subspace estimation [33] to calculate the reconstruction coefficient matrices, thereby reducing the computation. The specific procedure is described in Appendix A.
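As a concrete illustration of how the proximity matrix C(Q, Q′) can be computed with the cosine similarity metric, the following sketch assumes that the source and target action value functions have been flattened into the rows of two NumPy arrays; the function name `cosine_proximity` and this array layout are illustrative choices, not part of the original implementation.

```python
import numpy as np

def cosine_proximity(Q_src, Q_tgt):
    """Pairwise cosine similarities between flattened action value
    functions: rows of Q_src (L x N) and Q_tgt (Gamma x N)."""
    # Normalize each row; the small constant guards against all-zero rows.
    src = Q_src / (np.linalg.norm(Q_src, axis=1, keepdims=True) + 1e-12)
    tgt = Q_tgt / (np.linalg.norm(Q_tgt, axis=1, keepdims=True) + 1e-12)
    return src @ tgt.T          # shape (L, Gamma), entries in [-1, 1]
```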
The error function in (3) can be reduced to a sum of matrix traces:
$$ \begin{aligned} S(\Phi) &= (1-\mu)\,\mathrm{tr}\!\big((\Phi - B\Phi)^{T}(\Phi - B\Phi)\big) + \mu \sum_{z=1}^{d}\sum_{i,j=1}^{m} \|\Phi_{i,z} - \Phi_{j,z}\|_2^2\, C_{i,j} \\ &= (1-\mu)\,\mathrm{tr}\!\big(((I-B)\Phi)^{T}(I-B)\Phi\big) + 2\mu \sum_{z=1}^{d} \Phi_{\cdot,z}^{T} L\, \Phi_{\cdot,z} \\ &= (1-\mu)\,\mathrm{tr}\!\big(\Phi^{T}(I-B)^{T}(I-B)\Phi\big) + 2\mu\,\mathrm{tr}\!\big(\Phi^{T} L \Phi\big). \end{aligned} \qquad (4) $$
Similar to locally linear embedding [34] and LRE, we introduce the constraint Φ^TΦ = I to ensure that the minimization of the error function S is a well-posed problem. Thus, we have
$$ \operatorname*{arg\,min}_{\Phi:\,\Phi^{T}\Phi = I} S \;=\; \operatorname*{arg\,min}_{\Phi:\,\Phi^{T}\Phi = I} \;(1-\mu)\,\mathrm{tr}(\Phi^{T} A \Phi) + 2\mu\,\mathrm{tr}(\Phi^{T} L \Phi), \qquad (5) $$
where A = (I − B)^T(I − B), and L is the combinatorial graph Laplacian L = D − C, with D the diagonal matrix of row sums D_{i,i} = Σ_j C_{i,j}. To construct a loss function from (5), a Lagrange multiplier δ is introduced on the right-hand side of (5):
$$ \mathcal{L}(\Phi, \delta) = (1-\mu)\,\mathrm{tr}(\Phi^{T} A \Phi) + 2\mu\,\mathrm{tr}(\Phi^{T} L \Phi) + \langle \delta,\, \Phi^{T}\Phi - I \rangle. \qquad (6) $$
To minimize the loss function (6), the partial derivatives of 𝓛 with respect to Φ and δ are set to zero:
$$ \frac{\partial \mathcal{L}}{\partial \Phi} = (1-\mu)A\Phi + 2\mu L\Phi - \delta\Phi = 0, \qquad \frac{\partial \mathcal{L}}{\partial \delta} = \Phi^{T}\Phi - I = 0. \qquad (7) $$
Therefore, to optimize (5) and obtain the embedding Φ, we compute the eigenvectors of the matrix (1 − μ)A + 2μL corresponding to its d smallest non-zero eigenvalues. Once Φ is obtained, the distance between E and E′, which represents the similarity between the source and target task sets, can be estimated, and the highly-matched source task for a target task can be found. The whole procedure of LSM is described in Algorithm 1.
Algorithm 1 Latent structure matching
Input: source action value functions Q; target action value functions Q′; embedding dimension d; weight parameter μ
Output: embedding matrix E of the source tasks; embedding matrix E′ of the target tasks
1: Compute the reconstruction coefficient matrices Q̂ and Q̂′ as described in Appendix A.
2: Build the block matrices Φ, B, and C.
3: Construct the error function (3): S(Φ) = (1 − μ)‖Φ − BΦ‖²_F + μ Σ_{i,j=1}^{m} ‖Φ_i − Φ_j‖² C_{i,j}.
4: Minimize the loss function (6): 𝓛 = (1 − μ)tr(Φ^T A Φ) + 2μ tr(Φ^T L Φ) + ⟨δ, Φ^TΦ − I⟩.
5: Obtain the embedding matrices E and E′ from the eigenvectors of (1 − μ)A + 2μL corresponding to its d smallest non-zero eigenvalues.
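A minimal NumPy sketch of Algorithm 1 is given below. It assumes the helpers `cosine_proximity` (sketched after Equation (3)) and `reconstruction_matrix` (sketched in Appendix A), plus a hypothetical `match_source_task` selection step based on Euclidean distances between embeddings; none of these names come from the paper's released code.

```python
import numpy as np

def latent_structure_matching(Q_src, Q_tgt, d, mu):
    """Sketch of Algorithm 1: returns the embeddings E (source) and E' (target)."""
    L_n, G_n = Q_src.shape[0], Q_tgt.shape[0]
    m = L_n + G_n

    # Step 1: reconstruction coefficient matrices (Appendix A closed form).
    Q_hat_src = reconstruction_matrix(Q_src)          # (L, L), assumed helper
    Q_hat_tgt = reconstruction_matrix(Q_tgt)          # (Gamma, Gamma)

    # Step 2: block matrices B and C of the error function (3).
    B = np.block([[Q_hat_src, np.zeros((L_n, G_n))],
                  [np.zeros((G_n, L_n)), Q_hat_tgt]])
    P = cosine_proximity(Q_src, Q_tgt)                # (L, Gamma), assumed helper
    C = np.block([[np.zeros((L_n, L_n)), P],
                  [P.T, np.zeros((G_n, G_n))]])

    # Steps 3-4: A = (I - B)^T (I - B) and graph Laplacian L = D - C.
    I = np.eye(m)
    A = (I - B).T @ (I - B)
    Lap = np.diag(C.sum(axis=1)) - C

    # Step 5: eigenvectors of (1 - mu) A + 2 mu L with the d smallest
    # non-zero eigenvalues give the aligned embedding Phi = [E; E'].
    vals, vecs = np.linalg.eigh((1 - mu) * A + 2 * mu * Lap)
    keep = vals > 1e-10
    Phi = vecs[:, keep][:, :d]
    return Phi[:L_n], Phi[L_n:]

def match_source_task(E_src, E_tgt, t):
    """Pick the source task whose embedding lies closest (Euclidean distance)
    to the embedding of target task t; this selection rule is an assumption."""
    return int(np.argmin(np.linalg.norm(E_src - E_tgt[t], axis=1)))
```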

3.2. Value Function Transfer

After finding the highly-matched source task, the learnt action value functions of the matched source task could be transferred directly to the target task. However, if the source and target tasks are not sufficiently similar, direct transfer carries the risk of negative transfer. Using the action value functions of the matched source task to guide the agent's action exploration is therefore a safer approach.
The most commonly used exploration strategy is the ϵ-greedy exploration strategy [35], whose purpose is to strike a balance between exploration and exploitation. This balance determines whether the agent should explore the unseen parts of the space or use the knowledge that has already been acquired during learning. Specifically, the ϵ-greedy exploration strategy performs exploitation with probability ϵ by selecting the known action that maximizes the value function of the target task, and performs exploration with probability 1 − ϵ by randomly selecting an action.
By using the knowledge from the source tasks, the ϵ-greedy exploration strategy is modified as shown in Algorithm 2. The modified strategy, which we call the LSM-based exploration strategy, performs exploitation in the same way as the normal ϵ-greedy exploration strategy, but performs exploration with probability 1 − ϵ by selecting the learnt action that maximizes the value function of the matched source task.
Algorithm 2 LSM-based exploration strategy
Input: targetQ; sourceQ; ϵ
Output: Action
Action ← argmax_a[targetQ(s, a)] with probability ϵ; argmax_a[sourceQ(s, a)] with probability (1 − ϵ)
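A minimal sketch of Algorithm 2 for the tabular case is shown below, assuming `target_Q` and `source_Q` are arrays indexed by [state, action]; the function name and array layout are illustrative. In the continuous case, the table lookups would be replaced by calls to the value function approximator of Section 4.2.

```python
import numpy as np

def lsm_epsilon_greedy(target_Q, source_Q, state, epsilon, rng=np.random):
    """Sketch of Algorithm 2. With probability epsilon, exploit the target
    task's own value function; otherwise, instead of a uniformly random
    action, take the greedy action advised by the matched source task."""
    if rng.random() < epsilon:
        return int(np.argmax(target_Q[state]))   # exploitation on the target task
    return int(np.argmax(source_Q[state]))       # advice from the matched source task
```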
The exploratory action advice from the LSM-based exploration strategy reduces the instability caused by random exploration. Moreover, the modified exploration strategy can greatly speed up the learning of a target task, while achieving performance at least as good as the ϵ-greedy exploration strategy. Equation (8) formalizes this assertion, and the detailed proof of (8) is given in Appendix B:
$$ Q_i^{\pi}(s,a) - Q_i^{\pi'}(s,a) \;\le\; \varrho\, \min_{j} \sum_{s'} p \sum_{a'} \omega \,\big|\, Q_i^{\pi}(s',a') - Q_j^{\pi}(s',a') \,\big|, \qquad (8) $$
where p is shorthand for the transition probability p(s′|s, a), and ω denotes the average probability ϵ/α over the total number of actions α. ϱ is defined as γ/[1 − γ(1 − ϵ)], with discount factor 0 < γ < 1. Q denotes the action value function, with its superscript denoting the exploration strategy used and its subscript denoting the task index. π and π′ denote the ϵ-greedy exploration strategy and the LSM-based exploration strategy, respectively.
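For a rough sense of the bound constant, the following illustrative evaluation assumes a discount factor γ = 0.9 (not stated in the paper) together with ϵ = 0.9, the value used in the maze experiments:

```latex
\varrho = \frac{\gamma}{1-\gamma(1-\epsilon)}
        = \frac{0.9}{1 - 0.9 \times 0.1}
        = \frac{0.9}{0.91}
        \approx 0.989
```

Under these assumed values, the performance loss of the LSM-based strategy is bounded by roughly the ω-weighted mismatch between the target value function and the closest source value function.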

4. Experiments

In this section, we conduct several experiments on the maze navigation problem and the mountain car problem to verify the effectiveness of the proposed method. The experiments on the maze navigation problem show how knowledge is transferred between similar environments with a tabular representation of the discrete state and action spaces and the action value functions. The experiments on the mountain car problem illustrate a more complex transfer procedure under continuous control. In that experiment, the radial basis function (RBF) network [36,37], an effective feed-forward neural network with the best approximation property and globally optimal performance, is utilized to estimate the action value functions. Both experiments use the Q-Learning framework [38], and the source code for the experiments is available at https://github.com/yizhouzhouyi/LSM.git. The experimental results with the modified ϵ-greedy exploration strategy are compared against the results from the traditional ϵ-greedy exploration strategy.

4.1. Experiments on Maze Navigation Problem

These experiments are conducted on a 10×10 maze navigation environment in two-dimensional space, as shown in Figure 2. An agent is required to walk from the start point to the target point, while avoiding all obstacles in the path and maximizing the total rewards obtained.
Figure 2. The maze navigation environments of two different settings. The red, yellow and gray grids in the figure denote respectively the starting positions, the target positions, and the obstacles. The goal of the agent is to walk from the start point (the red grids) to the target point (the yellow grids).
The detailed settings of the maze navigation environments are listed below; a minimal code sketch of these settings follows the list.
  • Action: The action space is one-dimensional and discrete. The agent can perform four actions: Upward, downward, leftward, and rightward.
  • State: The state space is also one-dimensional and discrete. Since each grid in the maze represents one state, there are a total of 100 states.
  • Reward: When the agent reaches the target state, it will receive a reward of +10; when the agent encounters an obstacle, it will get a penalty of −10; and when the agent is in any other state, it will get a living penalty of −0.1.
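A minimal tabular sketch of these settings is given below; the grid indexing, the clipping at the maze border, the treatment of obstacle hits as terminal, and the placeholder goal/obstacle positions are all assumptions made for illustration, since the exact layouts are those of Figure 2.

```python
import numpy as np

GRID = 10                                      # 10 x 10 maze: 100 discrete states
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # upward, downward, leftward, rightward

def step(pos, action, goal, obstacles):
    """One transition under the reward scheme above. `goal` and `obstacles`
    are placeholders for the layouts of Figure 2; hitting an obstacle is
    treated as terminal here, which is an assumption."""
    r, c = pos
    dr, dc = ACTIONS[action]
    nr = min(max(r + dr, 0), GRID - 1)
    nc = min(max(c + dc, 0), GRID - 1)
    if (nr, nc) == goal:
        return (nr, nc), +10.0, True           # reward for reaching the target
    if (nr, nc) in obstacles:
        return (nr, nc), -10.0, True           # penalty for hitting an obstacle
    return (nr, nc), -0.1, False               # living penalty

# Tabular action value function: one row per state, one column per action.
Q = np.zeros((GRID * GRID, len(ACTIONS)))
```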
We design two maze environments, as shown in Figure 2. In the two maze environments, the action value functions obtained from the source tasks (the grids marked in blue in Figure 2a) and the action value functions from the target tasks (the grids marked in yellow in Figure 2a,b) are used as the input of LSM.
The experimental results are shown in Figure 3; the learning curves are smoothed with a window of 100 episodes, with an error bar every 100 episodes. The left curve shows the effect of knowledge transfer under the same environment settings, and the right curve shows the effect of knowledge transfer under different environment settings.
Figure 3. The learning curves of the LSM-based exploration strategy and the ϵ-greedy exploration strategy on the maze navigation problem. The curves are smoothed with a window of 100 episodes, with an error bar every 100 episodes.
It can be seen from Figure 3 that the LSM-based modified ϵ-greedy exploration strategy performs significantly better than the normal ϵ-greedy exploration strategy. For both exploration strategies, the value of ϵ is set to 0.9.
To gain insight into the effect of ϵ, we vary its value in the range [0.5, 0.9] and observe the performance of the two exploration strategies; the results are given in Figure 4 and Figure 5. The upper part of Figure 4 shows the average return over 2000 episodes for different values of ϵ under the same environment settings, and the lower part of Figure 4 shows the average return over 2000 episodes for different values of ϵ under different environment settings. As can be seen, compared with the ϵ-greedy exploration strategy, the LSM-based exploration strategy always achieves a higher average return regardless of the value of ϵ.
Figure 4. Comparison of the average return over 2000 episodes using the LSM-based and the ϵ-greedy exploration strategies, with ϵ set to 0.9, 0.7, and 0.5 from left to right. The upper part shows the results under the same environment settings and the lower part shows the results under different environment settings.
Figure 5. The learning curves of the LSM-based and ϵ-greedy exploration strategies with different values of ϵ (0.9, 0.7, and 0.5 from left to right). The curves are smoothed with a window of 100 episodes, with an error bar every 100 episodes.
Figure 5 shows the variation in the average return of each episode for different values of ϵ under the same environment settings and different environment settings, respectively. When ϵ is set to 0.7 or 0.5, the Q-Learning algorithm using the ϵ-greedy exploration strategy almost stops learning; however, the Q-Learning algorithm using the LSM-based exploration strategy continues to perform well under both the same and different environment settings. The reason may be that the increased randomness in action selection greatly reduces the possibility of taking the sequential actions needed to find valid paths to the target state; the good performance of the modified strategy is attributed to knowledge transfer, which reduces this randomness to some extent.

4.2. Experiments on Mountain Car Problem

These experiments are conducted on the mountain car [39,40] environment from the classic control problems of the OpenAI Gym environment library. The mountain car problem is a control problem with a continuous state space. As shown in Figure 6, in the experimental environment, a car is located on a two-dimensional track between two mountains; it is required to slide down from the left side of the mountain and rush to the target position on the right side of the mountain.
Figure 6. The mountain car environment. The purpose of the agent is to start from the starting position marked "Start", and use the energy obtained by sliding down from the left side of the mountain to rush to the position marked "Goal" on the right side of the mountain.
The detailed settings about the mountain car environment are listed below:
  • Action: The action space is one-dimensional and discrete. The car can perform three actions: leftward, neutral, and rightward, namely:
    Action = (leftward, neutral, rightward).
  • Reward: Each time the car arrives at a state, it obtains a reward of −1.0.
  • State: The state space is two-dimensional and continuous. A state consists of velocity and position, defined as:
    Velocity = (−0.007, 0.007), Position = (−1.2, 0.6).
  • Initial state: The position marked "Start" in Figure 6 denotes the initial state of the car. The initial state is set as:
    Position = −0.5, Velocity = 0.0.
  • Goal state: The position marked "Goal" in Figure 6 is the target state of the car. The target state is set as: Position ≥ 0.5.
  • State update rule: A state is updated according to the following rule (a minimal code sketch follows this list):
    Velocity ← Velocity + 0.001 × (Action − 1) − 0.0025 × cos(3 × Position), Position ← Position + Velocity.
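A minimal sketch of this update rule is given below; the clipping of velocity and position to the ranges listed above and the 0/1/2 action encoding follow the usual Gym implementation and are assumptions here.

```python
import math

def mountain_car_step(position, velocity, action):
    """One transition of the listed update rule; action is 0 (leftward),
    1 (neutral), or 2 (rightward)."""
    velocity += 0.001 * (action - 1) - 0.0025 * math.cos(3 * position)
    velocity = max(-0.007, min(0.007, velocity))   # clip to the listed velocity range
    position += velocity
    position = max(-1.2, min(0.6, position))       # clip to the listed position range
    done = position >= 0.5                         # goal state reached
    return position, velocity, -1.0, done          # every step yields a reward of -1.0
```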
As the state values in these experiments are continuous, we use the RBF network to approximate the action value functions instead of using a tabular representation. The parameters of the estimated action value functions of the source and target tasks are the input to LSM, which is used to find the matched source tasks, from which knowledge is transferred to improve the ϵ-greedy exploration strategy.
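The following sketch shows one common way to realize such an approximator: Gaussian RBF features with a linear output layer per action, updated with a semi-gradient Q-learning rule. The class name, the choice of centers and width, and the learning rate are illustrative assumptions; the paper's exact network configuration is not reproduced here.

```python
import numpy as np

class RBFQApproximator:
    """Q(s, a) = w_a^T phi(s) with Gaussian RBF features phi(s)."""

    def __init__(self, centers, width, n_actions, lr=0.05):
        self.centers = np.asarray(centers)    # (n_features, state_dim)
        self.width = width
        self.lr = lr
        self.w = np.zeros((n_actions, len(self.centers)))

    def features(self, state):
        d2 = np.sum((self.centers - np.asarray(state)) ** 2, axis=1)
        return np.exp(-d2 / (2.0 * self.width ** 2))

    def q_values(self, state):
        return self.w @ self.features(state)  # one value per action

    def update(self, state, action, td_target):
        # Semi-gradient step toward the Q-learning target.
        phi = self.features(state)
        self.w[action] += self.lr * (td_target - self.w[action] @ phi) * phi
```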
We compare the traditional ϵ-greedy exploration strategy with our strategy, namely the LSM-based exploration strategy; the experimental results are given in Figure 7. The learning curves are smoothed with a window of five episodes, with an error bar every five episodes. According to the evaluation criteria for knowledge transfer methods in reinforcement learning proposed by Taylor [3], our strategy improves both the initial performance, called Jumpstart [3], and the final performance, called Asymptotic Performance [3], over the ϵ-greedy exploration strategy. In the mountain car environment, the agent using the ϵ-greedy exploration strategy explores randomly, so it needs many attempts to learn what speed it should reach on the left side of the mountain in order to use the energy of gliding to reach the right side. In contrast, thanks to the experience gained in previous tasks, the agent using the LSM-based exploration strategy can find a good speed at the beginning, and then uses the feedback from the environment to quickly find the optimal speed. Therefore, our method achieves better performance both at the beginning and at the end of learning.
Figure 7. Learning curves for the mountain car problem. The curves are smoothed with a window of five episodes with an error bar at every five episodes.

5. Conclusions

In this paper, we propose a novel task matching algorithm to derive the latent structures of tasks, and to align these structures to select highly-matched source tasks. We develop an improved exploration strategy that draws knowledge from the matched source tasks, in order to speed up learning and improve the performance of RL agents. We also give a theoretical proof to verify the improvement of the exploration strategy based on latent structure matching. Our method achieves significant performance improvements compared with the traditional ϵ-greedy exploration strategy on the maze navigation problem and the mountain car problem. Such a method can prove useful for real-world applications such as autonomous robots, where sample efficiency is a critical factor. In future work, causal inference may be introduced to guide knowledge transfer in reinforcement learning by incorporating causal factors.

Author Contributions

Y.Z. and F.Y. conceived and designed the theoretical calculations and experiments. Y.Z. and F.Y. carried out all the theoretical calculations. Y.Z. and F.Y. performed and analyzed the experiments. Both authors discussed the results and prepared and revised the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded in part by the National Natural Science Foundation of China (NSFC) under Grant 61371149.

Acknowledgments

This work is supported in part by the National Natural Science Foundation of China (NSFC) under Grant 61371149.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

In this supplementary material, we describe the computation of the reconstruction coefficient matrices mentioned in Section 3.
First, singular value decomposition (SVD) [41] is used to decompose Q into the form Q = UKV^T, where U is the left singular matrix (having only a single element), K = [k_i]_{1×L} is the singular value matrix, and V = [v_{ij}]_{L×L} is the right singular matrix. According to the values of the elements of K, we divide K into two submatrices, K₁ and K₂. Correspondingly, V is also divided into two submatrices V₁ and V₂. The detailed division rule is defined as follows:
(a)
for k_i ∈ K, if k_i > 1, it is put in the submatrix K₁; correspondingly, the i-th column vector of V is put in the submatrix V₁;
(b)
if k_i ≤ 1, it is put in the submatrix K₂; correspondingly, the i-th column vector of V is put in the submatrix V₂.
Next, according to [33], the reconstruction coefficient matrix Q̂ ∈ ℝ^{L×L} of the vector Q is computed by the following formula:
$$ \hat{Q} = V_1 \big( I - K_1^{-2} \big) V_1^{T}, $$
where I is the identity matrix. In the same way, the reconstruction coefficient matrix Q̂′ ∈ ℝ^{Γ×Γ} of Q′ can be obtained.
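A minimal NumPy sketch of this closed-form computation is given below. It assumes the action value functions are stacked as the rows of a 2-D array (so they become the columns of the transposed matrix handed to the SVD), and it follows the reconstruction formula as reconstructed above; the function name matches the helper assumed in the Algorithm 1 sketch.

```python
import numpy as np

def reconstruction_matrix(Q):
    """Closed-form reconstruction coefficient matrix (Appendix A).
    Q is a 2-D array whose rows are flattened action value functions, so the
    result is square with one row/column per value function."""
    # SVD of the matrix whose columns are the value functions.
    _, k, Vt = np.linalg.svd(Q.T, full_matrices=False)
    V = Vt.T
    keep = k > 1.0                   # singular values assigned to K_1
    V1, k1 = V[:, keep], k[keep]
    # Q_hat = V_1 (I - K_1^{-2}) V_1^T
    return V1 @ np.diag(1.0 - 1.0 / k1 ** 2) @ V1.T
```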

Appendix B

In this supplementary material, we give the detailed proof of (8) mentioned in Section 3. In the previous literature, two important results about policy improvement have been proposed. One is Bellman's [42] policy improvement theorem, which states that selecting actions greedily with respect to a policy's value function gives rise to another policy whose performance is no worse than the former's. The other is the generalized policy improvement (GPI) of Barreto et al. [43], which extends Bellman's policy improvement theorem to the scenario where the new policy is computed based on the value functions of a set of source tasks.
Here, following GPI, we extend the notation to make it more convenient to describe the quantities involved in transfer learning. In this supplementary material, j ∈ {1, 2, …, L} and i ∈ {1, 2, …, Γ} denote a task from the source task set Y and the target task set T, respectively. Moreover, we use π to denote the ϵ-greedy exploration strategy, and π′ to denote the modified exploration strategy with knowledge transfer, namely the LSM-based exploration strategy. Q_j^π denotes the action value function under policy π on task j.
Suppose we have learnt the action value functions Q_i^π and Q_i^{π′}. To examine the performance improvement of the LSM-based exploration strategy, we compare it with the traditional ϵ-greedy exploration strategy by calculating their difference. Based on the Bellman equation [42], we have
$$ \begin{aligned} Q_i^{\pi}(s,a) - Q_i^{\pi'}(s,a) &= R_i(s,a) + \gamma \sum_{s'} p(s'|s,a) \sum_{a'} \pi(a'|s')\, Q_i^{\pi}(s',a') \\ &\quad - R_i(s,a) - \gamma \sum_{s'} p(s'|s,a) \sum_{a'} \pi'(a'|s')\, Q_i^{\pi'}(s',a') \\ &= \gamma \sum_{s'} p(s'|s,a) \sum_{a'} \pi(a'|s')\, Q_i^{\pi}(s',a') - \gamma \sum_{s'} p(s'|s,a) \sum_{a'} \pi'(a'|s')\, Q_i^{\pi'}(s',a') \\ &= \gamma \sum_{s'} p(s'|s,a) \Big[ \sum_{a'} \pi(a'|s')\, Q_i^{\pi}(s',a') - \sum_{a'} \pi'(a'|s')\, Q_i^{\pi'}(s',a') \Big]. \end{aligned} $$
Then, according to the exploration rules of the ϵ -greedy exploration strategy and the LSM-based exploration strategy as mentioned above, the above equation is rewritten as
$$ \begin{aligned} Q_i^{\pi}(s,a) - Q_i^{\pi'}(s,a) &= \gamma \sum_{s'} p(s'|s,a) \Big[ (1-\epsilon)\max_{a'} Q_i^{\pi}(s',a') + \tfrac{\epsilon}{w} \sum_{a'} Q_i^{\pi}(s',a') \\ &\qquad\qquad - (1-\epsilon)\max_{a'} Q_i^{\pi'}(s',a') - \epsilon \max_{a'} Q_j^{\pi}(s',a') \Big] \\ &= \gamma \sum_{s'} p(s'|s,a) \Big[ (1-\epsilon)\max_{a'} Q_i^{\pi}(s',a') - (1-\epsilon)\max_{a'} Q_i^{\pi'}(s',a') \Big] \\ &\quad + \gamma \sum_{s'} p(s'|s,a) \Big[ \tfrac{\epsilon}{w} \sum_{a'} Q_i^{\pi}(s',a') - \epsilon \max_{a'} Q_j^{\pi}(s',a') \Big] \\ &\le \gamma \sum_{s'} p(s'|s,a) \Big[ (1-\epsilon)\max_{a'} Q_i^{\pi}(s',a') - (1-\epsilon)\max_{a'} Q_i^{\pi'}(s',a') \Big] \\ &\quad + \gamma \sum_{s'} p(s'|s,a) \Big[ \tfrac{\epsilon}{w} \sum_{a'} Q_i^{\pi}(s',a') - \tfrac{\epsilon}{w} \sum_{a'} Q_j^{\pi}(s',a') \Big] \\ &\le \gamma \sum_{s'} p(s'|s,a)\, (1-\epsilon)\max_{a'} \big[ Q_i^{\pi}(s',a') - Q_i^{\pi'}(s',a') \big] \\ &\quad + \gamma \sum_{s'} p(s'|s,a) \sum_{a'} \tfrac{\epsilon}{w} \big[ Q_i^{\pi}(s',a') - Q_j^{\pi}(s',a') \big] \\ &\le \gamma \sum_{s'} p(s'|s,a)\, (1-\epsilon)\max_{a'} \big[ Q_i^{\pi}(s',a') - Q_i^{\pi'}(s',a') \big] \\ &\quad + \gamma \sum_{s'} p(s'|s,a) \sum_{a'} \tfrac{\epsilon}{w} \big|\, Q_i^{\pi}(s',a') - Q_j^{\pi}(s',a') \,\big| \\ &\le \gamma (1-\epsilon)\Delta + \gamma \sum_{s'} p(s'|s,a) \sum_{a'} \tfrac{\epsilon}{w} \big|\, Q_i^{\pi}(s',a') - Q_j^{\pi}(s',a') \,\big|, \end{aligned} $$
where Δ = max[Q_i^π(s′, a′) − Q_i^{π′}(s′, a′)], and w represents the total number of actions.
Since all values of [Q_i^π(s, a) − Q_i^{π′}(s, a)] satisfy the above inequality, the maximum value Δ also satisfies it. Thus, we obtain
$$ \begin{aligned} \Delta &\le \gamma(1-\epsilon)\Delta + \gamma \sum_{s'} p(s'|s,a) \sum_{a'} \tfrac{\epsilon}{w} \big|\, Q_i^{\pi}(s',a') - Q_j^{\pi}(s',a') \,\big| \\ \Delta &\le P \sum_{s'} p(s'|s,a) \sum_{a'} \tfrac{\epsilon}{w} \big|\, Q_i^{\pi}(s',a') - Q_j^{\pi}(s',a') \,\big| \\ Q_i^{\pi}(s,a) - Q_i^{\pi'}(s,a) \le \Delta &\le P \sum_{s'} p(s'|s,a) \sum_{a'} \tfrac{\epsilon}{w} \big|\, Q_i^{\pi}(s',a') - Q_j^{\pi}(s',a') \,\big|, \end{aligned} $$
where P = γ/[1 − γ(1 − ϵ)]. Since the inequality holds for every source task j, it holds in particular for the minimizing j, which yields the minimum over j in (8). At this point, (8) is proved.
As can be seen from (8), the loss [Q_i^π(s, a) − Q_i^{π′}(s, a)] is upper-bounded by a single term. This term can be seen as a multiple of the difference between Q_i^π (the value function learned on the target task) and the closest Q_j^π (the value function obtained from the source task). This formalizes the intuition that an agent should perform well on target task i if it has previously solved a similar task. In short, if we can match a source task to the target task with sufficient similarity, and transfer the value function of the source task to guide a new exploration strategy built on the ϵ-greedy exploration strategy, then the new strategy will perform no worse than the ϵ-greedy exploration strategy.

References

  1. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction. IEEE Trans. Neural Netw. 1998, 9. [Google Scholar] [CrossRef]
  2. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 1998. [Google Scholar]
  3. Reinaldo, A.C.B.; Luiz, A.C., Jr.; Paulo, E.S.; Jackson, P.M.; de Mantaras, R. Transferring knowledge as heuristics in reinforcement learning: A case-based approach. Artif. Intell. 2015, 226, 102–121. [Google Scholar]
  4. Komorowski, M.; Celi, L.A.; Badawi, O.; Gordon, A.C.; Faisal, A.A. The Artificial Intelligence Clinician learns optimal treatment strategies for sepsis in intensive care. Nat. Med. 2018, 24, 1716–1720. [Google Scholar] [CrossRef]
  5. Deng, Y.; Bao, F.; Kong, Y. Deep Direct Reinforcement Learning for Financial Signal Representation and Trading. IEEE Trans. Neural Netw. Learn. Syst. 2016, 28, 1–12. [Google Scholar] [CrossRef]
  6. Belletti, F.; Haziza, D.; Gomes, G. Expert Level control of Ramp Metering based on Multi-task Deep Reinforcement Learning. IEEE Trans. Intell. Transp. Syst. 2017, 99, 1–10. [Google Scholar] [CrossRef]
  7. Abel, D.; Jinnai, Y.; Guo, Y.; Konidaris, G.; Littman, M.L. Policy and Value Transfer in Lifelong Reinforcement Learning. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 20–29. [Google Scholar]
  8. Pan, S.J.; Yang, Q. A survey on transfer learning. IEEE Trans. Knowl. Data Eng. 2010, 22, 1345–1359. [Google Scholar] [CrossRef]
  9. Lazaric, A. Transfer in reinforcement learning: a framework and a survey. Reinf. Learn. 2012, 12, 143–173. [Google Scholar]
  10. Taylor, M.E.; Stone, P.; Liu, Y. Transfer learning via inter-task mappings for temporal difference learning. J. Mach. Learn. Res. 2007, 8, 2125–2167. [Google Scholar]
  11. Cheng, Q.; Wang, X.; Shen, L. Transfer learning via linear multi-variable mapping under reinforcement learning framework. In Proceedings of the 36th Chinese Control Conference, Liaoning, China, 26–28 July 2017; pp. 8795–8799. [Google Scholar]
  12. Cheng, Q.; Wang, X.; Shen, L. An Autonomous Inter-task Mapping Learning Method via Artificial Neural Network for Transfer Learning. In Proceedings of the 2017 IEEE International Conference on Robotics and Biomimetics, Macau, China, 5–8 December 2017. [Google Scholar]
  13. Fachantidis, A.; Partalas, I.; Taylor, M.E.; Vlahavas, I. Transfer learning via multiple inter-task mappings. In Proceedings of the 9th European conference on Recent Advances in Reinforcement Learning, Athens, Greece, 9–11 September 2011; pp. 225–236. [Google Scholar]
  14. Fachantidis, A.; Partalas, I.; Taylor, M.E.; Vlahavas, I. Transfer learning with probabilistic mapping selection. Adapt. Behav. 2015, 23, 3–19. [Google Scholar] [CrossRef]
  15. Ferns, N.; Panangaden, P.; Precup, D. Metrics for finite Markov decision processes. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, Banff, AB, Canada, 7–11 July 2004; pp. 162–169. [Google Scholar]
  16. Taylor, M.E.; Kuhlmann, G.; Stone, P. Autonomous transfer for reinforcement learning. In Proceedings of the Seventh International Joint Conference on Autonomous Agents and Multiagent Systems, Estoril, Portugal, 12–16 May 2008. [Google Scholar]
  17. Celiberto, L.A., Jr.; Matsuura, J.P.; De Mantaras, R.L.; Bianchi, R.A. Using cases as heuristics in reinforcement learning: a transfer learning application. In Proceedings of the 22nd International Joint Conference on Artificial Intelligence, Menlo Park, CA, USA, 16–22 July 2011; pp. 1211–1217. [Google Scholar]
  18. Carroll, J.L.; Seppi, K. Task similarity measures for transfer in reinforcement learning task libraries. In Proceedings of the IEEE International Joint Conference on Neural Networks, Montreal, QC, Canada, 31 July 2005; pp. 803–808. [Google Scholar]
  19. Teng, T.H.; Tan, A.H.; Zurada, J.M. Self-Organizing Neural Networks Integrating Domain Knowledge and Reinforcement Learning. IEEE Trans. Neural Netw. Learn. Syst. 2015, 26, 889–902. [Google Scholar] [CrossRef]
  20. Ammar, H.B.; Eaton, E.; Taylor, M.E.; Mocanu, D.C.; Driessens, K.; Weiss, G.; Tuyls, K. An automated measure of mdp similarity for transfer in reinforcement learnin. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, Quebec, QC, Canada, 27–31 July 2014. [Google Scholar]
  21. Song, J.H.; Gao, Y.; Wang, H. Measuring the distance between finite Markov decision processes. In Proceedings of the 2016 International Conference on Autonomous Agents and Multiagent Systems, Singapore, 9–13 May 2016. [Google Scholar]
  22. Jain, A.K.; Murty, M.N.; Flynn, P.J. Data clustering: a review. ACM Comput. Surv. 1999, 31, 264–323. [Google Scholar] [CrossRef]
  23. Li, H.; Liao, X.; Carin, L. Multi-task reinforcement learning in partially observable stochatic environments. J. Mach. Learn. Res. 2009, 10, 1131–1186. [Google Scholar]
  24. Liu, M.; Chowdhary, G.; How, J.; Carin, L. Transfer learning for reinforcement learning with dependent Dirichlet process and Gaussian process. In Proceedings of the Twenty-sixth Annual Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–8 December 2012. [Google Scholar]
  25. Karimpanal, T.G.; Bouffanais, R. Self-Organizing Maps as a Storage and Transfer Mechanism in Reinforcement Learning. arXiv 2018, arXiv:1807.07530. Available online: https://arxiv.org/abs/1807.07530 (accessed on 30 September 2019).
  26. Taylor, M.E.; Whiteson, S.; Stone, P. Transfer via Inter-Task Mappings in Policy Search Reinforcement Learning. In Proceedings of the International Joint Conference on Autonomous Agents and Multiagent Systems, Honolulu, HI, USA, 14–18 May 2007. [Google Scholar]
  27. Lazaric, A.; Ghavamzadeh, M. Bayesian multi-task reinforcement learning. In Proceedings of the International Conference on Machine Learning, Haifa, Israel, 21–24 June 2010; pp. 599–606. [Google Scholar]
  28. Wilson, A.; Fern, A.; Tadepalli, P. Transfer learning in sequential decision problems: A hierarchical bayesian approach. In Proceedings of the International Conference on Unsupervised and Transfer Learning Workshop, Bellevue, WA, USA, 1–16 January 2012. [Google Scholar]
  29. Liu, R.; Hao, R.; Su, Z. Mixture of manifolds clustering via low rank embedding. J. Inf. Comput. Sci. 2011, 8, 725–737. [Google Scholar]
  30. Boyd, S.; Parikh, N.; Chu, E. Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers. Found. Trends Mach. Learn. 2010, 3, 1–122. [Google Scholar] [CrossRef]
  31. Candés, E.J.; Tao, T. The Power of Convex Relaxation: Near-Optimal Matrix Completion. IEEE Trans. Inf. Theory 2010, 56, 2053–2080. [Google Scholar] [CrossRef]
  32. Devis, T.; Gustau, C.V.; Zhao, H.D. Kernel Manifold Alignment for Domain Adaptation. PLoS ONE 2016, 11, e0148655. [Google Scholar]
  33. Favaro, P.; Vidal, R.; Ravichandran, A. A closed form solution to robust subspace estimation and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Colorado Springs, CO, USA, 20–25 June 2011. [Google Scholar]
  34. Roweis, S.T.; Saul, L.K. Nonlinear dimensionality reduction by locally linear embedding. Science 2000, 290, 2323–2326. [Google Scholar] [CrossRef]
  35. Tokic, M.; Gunther, P. Value-Difference Based Exploration: Adaptive Control between Epsilon-Greedy and Softmax. In Proceedings of the Annual Conference on Artificial Intelligence, Berlin, Heidelberg, 16–22 July 2011. [Google Scholar]
  36. Lei, L.; Ding, L.; Zhang, W. Generalization Performance of Radial Basis Function Networks. IEEE Trans. Neural Netw. Learn. Syst. 2015, 26, 551–564. [Google Scholar]
  37. Soo, C.B.; Byunghan, L.; Sungroh, Y. Biometric Authentication Using Noisy Electrocardiograms Acquired by Mobile Sensors. IEEE Access 2016, 4, 1266–1273. [Google Scholar]
  38. Watkins, C.J.C.H.; Dayan, P. Q-learning. Mach. Learn. 1992, 8, 279–292. [Google Scholar] [CrossRef]
  39. Ammar, H.B.; Mocanu, D.C.; Taylor, M.E. Automatically mapped transfer between reinforcement learning tasks via three-way restricted boltzmann machines. In Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Prague, Czech Republic, 22–26 September 2013; pp. 449–464. [Google Scholar]
  40. Melnikov, A.A.; Makmal, A.; Briegel, H.J. Benchmarking projective simulation in navigation problems. IEEE Access 2018, 6, 64639–64648. [Google Scholar] [CrossRef]
  41. Yanai, H.; Takeuchi, K.; Takane, Y. Singular Value Decomposition (SVD). Stat. Soc. Behav. Sci. 2011. [Google Scholar]
  42. Bellman, R. Dynamic programming. Science 1966, 8, 34–37. [Google Scholar] [CrossRef]
  43. Barreto, A.; Dabney, W.; Munos, R.; Hunt, J.J.; Schaul, T.; Hasselt, H.V.; Silver, D. Successor features for transfer in reinforcement learning. In Proceedings of the 31st Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 4055–4065. [Google Scholar]
