Supervised Reinforcement Learning via Value Function

Abstract: Using expert samples to improve the performance of reinforcement learning (RL) algorithms has become one of the focuses of research nowadays. However, in different application scenarios, it is hard to guarantee both the quantity and quality of expert samples, which prohibits the practical application and performance of such algorithms. In this paper, a novel RL decision optimization method is proposed. The proposed method is capable of reducing the dependence on expert samples via incorporating the decision-making evaluation mechanism. By introducing supervised learning (SL), our method optimizes the decision making of the RL algorithm by using demonstrations or expert samples. Experiments are conducted in Pendulum and Puckworld scenarios to test the proposed method, and we use representative algorithms such as deep Q-network (DQN) and Double DQN (DDQN) as benchmarks. The results demonstrate that the method adopted in this paper can effectively improve the decision-making performance of agents even when the expert samples are not available.


Introduction
In recent years, great achievements have been made in solving sequential decision-making problems modelled by Markov decision processes (MDPs) through deep reinforcement learning (DRL), which was selected as one of the MIT Technology Review 10 Breakthrough Technologies in 2017. The range of applied research on DRL is extensive. The breakthroughs mainly include the deep Q-network (DQN) for Atari games [1,2] and strategic policies combined with tree search for the game of Go [3,4]. Other notable examples of utilising DRL include learning robot control strategies from video [5], playing video games [6], indoor navigation [7], managing power consumption [8], and building machine translation models [9]. DRL has also been used to meta-learn ("learn to learn") to obtain even more powerful agents that can generalise to completely unfamiliar environments [10].
In many scenarios, we have previously accumulated experience (not necessarily optimal) that we call expert samples. Combining expert samples with reinforcement learning (RL) to improve decision-making performance is a current research direction. Expert samples combined with Deep Deterministic Policy Gradient (DDPG) have been used to guide exploration in RL for tasks where exploration is difficult [11]. Expert trajectories have been used to accelerate DQN [12]. A framework was proposed to use expert samples to pre-train actor-critic RL algorithms [13]. Expert samples were combined with value-based DRL through a separate replay buffer for expert samples [14]. Human demonstrations were also used to pre-train a deep neural network (DNN) by supervised learning (SL) [15]. Methods combining policy gradient algorithms with expert samples have been applied to complex robotic manipulation [16].
In real applications, however, it is difficult to guarantee the quantity and quality of expert samples across different scenarios, and over-dependence on expert samples limits the applicability of an algorithm. The algorithm named Deep Q-Learning from Demonstrations (DQfD) [14] is designed for scenarios with only a few expert samples to reduce this dependence. In this work, we propose a method called supervised reinforcement learning via value function (SRLVF). When expert samples are available, the method can combine expert samples with RL through SL. If the expert samples are very poor or even unavailable, the method can still use the data generated by the interaction between the agent and the environment to optimize the agent's decisions. We tested the SRLVF-based DQN (SDQN) algorithm and the SRLVF-based double DQN (SDDQN) algorithm in both the Pendulum and Puckworld scenarios. The results show that the performance of the SDQN and SDDQN algorithms is significantly improved over the DQN and DDQN benchmarks.
The rest of the paper is organized as follows. Related work is discussed in Section 2, followed by background knowledge in Section 3. We present our method in Section 4 and report experimental results in Section 5. Section 6 concludes the paper.

RL Based on Value Function
Both the Q-learning [17] and SARSA [18] algorithms evolve the strategy through the estimation and evaluation of value functions. The difference between the two is that Q-learning is off-policy, so the strategy used for exploration differs from the greedy strategy being evaluated, while SARSA is on-policy and uses the same strategy for both. DQN [2] combines a deep neural network with the Q-learning algorithm, realizes "end-to-end" control of agents, and achieves better results than human beings in many kinds of Atari games. Double-DQN [19] solves the problem of overestimation in DQN. Other value-based algorithms include Duelling-DQN [20] and asynchronous advantage actor-critic (AC) [21], among others. In this paper, we use DQN and Double-DQN (DDQN) as benchmark algorithms to verify the effectiveness of SRLVF.
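The difference between Q-learning and SARSA described above can be illustrated with a minimal tabular sketch. The function names, the step size `alpha`, and the toy Q-table are assumptions made for illustration; terminal states are ignored for brevity.

```python
def q_learning_update(q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Off-policy update: bootstrap from the greedy action in s_next."""
    target = r + gamma * max(q[s_next])
    q[s][a] += alpha * (target - q[s][a])


def sarsa_update(q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy update: bootstrap from the action actually chosen in s_next."""
    target = r + gamma * q[s_next][a_next]
    q[s][a] += alpha * (target - q[s][a])


# Toy Q-tables with 2 states and 2 actions (illustrative numbers only).
q_table_q = [[0.0, 0.0], [1.0, 2.0]]
q_table_s = [[0.0, 0.0], [1.0, 2.0]]
q_learning_update(q_table_q, s=0, a=0, r=1.0, s_next=1)
sarsa_update(q_table_s, s=0, a=0, r=1.0, s_next=1, a_next=0)
```

With the same transition, the two updates differ only in which value of the next state is used as the bootstrap term.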

Combining Expert Samples and RL Based on Value Function
In the framework of combining expert samples with the AC algorithm, expert samples are used to pre-train the agents [13]. In the Replay Buffer Spiking (RBS) algorithm [22], expert samples are combined with DQN by initializing the experience replay buffer of DQN. Like SRLVF, the Accelerated DQN with Expert Trajectories (ADET) [12] and DQfD [14] algorithms aim to reduce dependence on expert samples and are designed for scenarios with only a few expert samples. To reduce this dependence, ADET and DQfD fully exploit the potential of the expert samples through agent pre-training, loss function design, and similar techniques. For the same purpose, the SRLVF algorithm instead explores the potential of the data generated by the interaction between the agent and the environment, by introducing SL and constructing a decision evaluation mechanism.

Reinforcement Learning via Value Function
RL is about an agent interacting with the environment and learning, by trial and error, an optimal policy for sequential decision-making problems [23]. RL algorithms mainly fall into two groups: those based on value functions and those based on direct policy search. RL via value function evaluates the value function and uses it to improve the policy, as in Q-learning and DQN. A value function is a prediction of the expected accumulative discounted future reward, measuring how good each state or state-action pair is [23]. MDPs have become the standard formalism for learning sequential decision making [24]. The goal of RL is to find the optimal strategy for a given MDP [18].
The standard MDP is defined as the tuple ⟨S, A, R, T, γ⟩, where S is the state set, A is the action set, R is the return function, T is the transfer function, and γ is the discount factor. When the agent is in state s ∈ S, it takes action a ∈ A to reach the new state s′ ∈ S and gets the return r = R(s, a). The state transition function is T = P(s′ | s, a). Since we are using a model-free MDP, the transfer function is unknown. The state value function v_π(s) is an estimate of the future reward in state s based on strategy π:
$$v_\pi(s) = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \,\middle|\, s_t = s\right] \qquad (1)$$
The state-action value function Q_π(s, a) is an estimate of the future return when action a is taken in state s based on policy π:
$$Q_\pi(s, a) = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \,\middle|\, s_t = s,\ a_t = a\right] \qquad (2)$$
The optimal Q*(s, a) is determined by solving the Bellman equation,
$$Q^{*}(s, a) = \mathbb{E}_{s'}\left[r + \gamma \max_{a'} Q^{*}(s', a') \,\middle|\, s, a\right]$$
The state value function v_π(s) and the state-action value function Q_π(s, a) are collectively called the value function.
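As a small worked illustration of the discounted future reward that the value functions estimate, and of a single Bellman optimality backup, consider the following sketch. The function names and the numbers are assumptions chosen for the example only.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * r_k over one sampled reward sequence."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def bellman_backup(r, gamma, q_next):
    """One-sample Bellman optimality target: r + gamma * max_a' Q(s', a')."""
    return r + gamma * max(q_next)


# Three steps of reward 1 with gamma = 0.9: 1 + 0.9 + 0.81
example_return = discounted_return([1.0, 1.0, 1.0], gamma=0.9)
example_target = bellman_backup(r=1.0, gamma=0.9, q_next=[0.5, 2.0])
```

A value function is the expectation of such returns over the trajectories produced by the policy, so a single sampled return is only a noisy estimate of it.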
In recent years, DQN [2] has been a breakthrough in RL via value function; it extends the state space from a finite discrete space to an infinite continuous space by combining a deep neural network with the Q-learning [17] algorithm. The principle of DQN is shown in Figure 1.
The data (s, a, r, s′) generated by the interaction between the agent and the environment are stored in the replay buffer, from which batches are randomly extracted and provided to the main network and the target network. The policy evolves by minimizing the loss function and updating the main network parameters by gradient descent. The parameters of the target network are updated by copying the main network parameters after a certain interval. The loss function of DQN is
$$L_{DQN} = \mathbb{E}\left[\left(r + \gamma \max_{a'} Q_{target}(s', a') - Q_{main}(s, a)\right)^{2}\right]$$
where r + γ max_{a′} Q_target(s′, a′) is generated by the target network, and Q_main(s, a) is generated by the main network.
The DQN algorithm uses the experience replay mechanism [25] and a target network whose parameters are updated asynchronously to break the correlation among the training data and thereby guarantee the stability of the neural network.
Van Hasselt et al. proposed the Double-DQN algorithm (DDQN) to solve the problem of overestimation in the DQN algorithm [19]. DDQN does not directly select the max Q value in the target network but uses the action corresponding to the max Q value in the main network to determine the target Q value in the target network. The loss function of DDQN is
$$L_{DDQN} = \mathbb{E}\left[\left(r + \gamma\, Q_{target}\left(s', \arg\max_{a'} Q_{main}(s', a')\right) - Q_{main}(s, a)\right)^{2}\right]$$
DQN and DDQN are the most representative RL algorithms via value function, and they are also the benchmarks in our paper.
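The difference between the DQN and DDQN targets can be sketched for a single transition as follows. This is a minimal illustration, assuming the Q values of the next state are given as plain lists; the function names are hypothetical, and terminal-state handling and batching are omitted.

```python
def dqn_target(r, gamma, q_target_next):
    """DQN: the target network both selects and evaluates the next action."""
    return r + gamma * max(q_target_next)


def ddqn_target(r, gamma, q_main_next, q_target_next):
    """DDQN: the main network selects the action, the target network evaluates it."""
    best_action = max(range(len(q_main_next)), key=q_main_next.__getitem__)
    return r + gamma * q_target_next[best_action]


def squared_td_loss(target, q_main_sa):
    """Squared error between the target and Q_main(s, a) for one sample."""
    return (target - q_main_sa) ** 2


# Illustrative Q values for the next state s' (3 actions).
q_main_next = [1.0, 3.0, 2.0]
q_target_next = [2.5, 2.0, 4.0]
y_dqn = dqn_target(1.0, 0.9, q_target_next)
y_ddqn = ddqn_target(1.0, 0.9, q_main_next, q_target_next)
```

In this example the DDQN target is lower than the DQN target, because the action is chosen by the main network rather than by taking the maximum of the target network's own estimates.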

Supervised Learning
The goal of SL is to build a concise model of the distribution of class labels in terms of predictor features [26]. In this paper, classification techniques from SL are used to encourage the agent to make favourable decisions in states with certain characteristics. The demonstration database for SL consists of pairs (s, a), where s represents the state and a represents the action. Cross-entropy is used as the loss for SL.
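The cross-entropy loss for one demonstration pair (s, a) can be sketched as follows, treating the action as the class label. It is assumed here that the SL network outputs one logit per action; the function names are illustrative only.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, action):
    """Negative log-probability assigned to the demonstrated action."""
    return -math.log(softmax(logits)[action])


# Logits for 3 actions in some state s; the demonstrated action is index 2.
loss = cross_entropy([0.1, 0.2, 2.0], action=2)
```

The loss is small when the network already assigns high probability to the demonstrated action and grows as that probability shrinks.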

Supervised Reinforcement Learning via Value Function
SRLVF is mainly based on an SL network and an RL network, and constructs the corresponding training databases. The demonstrations can improve the decision-making performance of RL algorithms through the SL network.
By introducing a decision evaluation mechanism, SRLVF constructs demonstration sets from the data generated during the interaction between the agent and the environment, which greatly reduces the dependence on expert samples.
Figure 2 shows the framework of our SRLVF approach. In the training phase, the data (s, a, r, s′) generated by the interaction between the agent and the environment are stored in the experience replay buffer, and batches are randomly selected from it to train the RL network. The decision-making evaluation mechanism is used to select the superior decisions from the experience replay buffer as the demonstrations. The SL network is trained by randomly extracting data from the training data buffer. In the testing phase, the RL network and the SL network jointly make decisions according to different weights. The pseudo-code is shown in Appendix A.
The SRLVF method proposed in this paper is less dependent on expert samples. When expert samples are available, the method can combine expert samples with RL to improve the agent's decision performance. When the expert samples are very poor or even unavailable, the method can still use the data generated by the interaction between the agent and the environment to optimize the agent's decisions. The purpose of RL is to optimize the overall strategy of the task; it has no evaluation and correction mechanism for decision-making under specific states. SL can fit the mapping between states and agent actions in expert samples very well. By introducing SL, we can use expert samples to optimize the agent's specific action decisions driven by RL, and thereby realize the combination of expert samples and RL. When expert samples are difficult to guarantee, we select better state-action pairs from the data generated by the interaction between the agent and the environment through the decision evaluation mechanism, and optimize the decision-making under specific states through SL.
Most current algorithms focus on how to combine expert samples with RL, and pay less attention to the objective reality that the availability of expert samples is difficult to guarantee in different scenarios. The dependence on expert samples greatly limits the application scenarios of an algorithm. Hester et al. proposed the DQfD algorithm [14] for scenarios with only a few expert samples to reduce the dependence on expert samples. Compared with DQfD, the SRLVF method relies less on expert samples and is better suited to scenarios with different availability of expert samples.
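The testing-phase rule, in which the RL network and the SL network jointly make decisions according to different weights, could look like the sketch below. The source does not give the exact combination rule, so the softmax over Q values, the linear mixing, and the names `w_rl` and `w_sl` are all assumptions made for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of numbers."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def joint_action(q_values, sl_probs, w_rl=0.5, w_sl=0.5):
    """Pick the action with the highest weighted score (assumed mixing rule).

    q_values: Q(s, a) from the RL network, one entry per action.
    sl_probs: action probabilities from the SL network for the same state.
    """
    rl_probs = softmax(q_values)  # put Q values on a comparable scale
    scores = [w_rl * p + w_sl * q for p, q in zip(rl_probs, sl_probs)]
    return max(range(len(scores)), key=scores.__getitem__)


# The RL network slightly prefers action 1, the SL network strongly prefers action 0.
action = joint_action([1.0, 1.2], [0.9, 0.1])
```

Setting `w_sl=0` recovers the greedy decision of the RL network alone, which makes the contribution of the SL network easy to isolate in an experiment.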

Demonstration Sets for the SL Network
The process of constructing the training data buffer of the SL network mainly includes the evaluation and storage of data. In state s, the agent takes action a, which moves the environment to state s′. v(s) is an estimate of the future reward in state s; if v(s′) > v(s), we can believe that a is a better decision in state s and then store (s, a) in the training data buffer. There are three different ways to calculate v(s) in this paper. The first is
$$v_{sum}(s) = \sum_{a \in A} \pi(a \mid s)\, q_\pi(s, a)$$
where q_π(s, a) is the value function of action a in state s and the probability π(a | s) is obtained through a softmax function. The second is
$$v_{max}(s) = \max_{a \in A} q_\pi(s, a)$$
The third is
$$v_{adaption}(s) = (1 - \varepsilon) \max_{a \in A} q_\pi(s, a) + \frac{\varepsilon}{|A|} \sum_{a \in A} q_\pi(s, a)$$
where ε is the exploration probability of the ε-greedy strategy adopted by RL in the training process. v_adaption adds an adaptive adjustment based on the exploration probability ε, which makes the calculation of the state value function v(s) more objective.
v_sum, v_max and v_adaption are obtained on the basis of Equations (1) and (2), taking into account the influence of different factors. The difference between v_sum and v_adaption is that the probability used to compute v_sum is obtained through a softmax function, whereas the probability used to compute v_adaption is defined by the ε-greedy policy. v_max and v_adaption are identical when ε = 0.
In the SRLVF method, the state value function is the criterion for selecting demonstrations, so the way the state value function is calculated affects the quality of the demonstrations. v_adaption is the most accurate computing mode, but ε gradually decreases during training, so the criterion for selecting demonstrations eventually becomes v_max. Because the state value function and the state-action value function are both estimates of the future total return under the current strategy rather than exact values, we simulate and validate the performance of the three calculation methods in the experimental part in order to find the best selection criterion. The introduction of the decision evaluation mechanism expands the source of demonstrations and forms a powerful supplement to the expert samples, which can effectively reduce the dependence on expert samples.
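The three ways of computing v(s) and the selection rule v(s′) > v(s) can be sketched as below. This is an illustrative sketch under stated assumptions: the softmax is taken over the Q values themselves, the ε-greedy policy explores uniformly over all actions, and `q_fn` is a hypothetical stand-in for the RL network that returns the list of Q values for a state.

```python
import math


def v_max(q):
    """Value of the greedy action."""
    return max(q)


def v_sum(q):
    """Expectation of Q under softmax action probabilities (assumed over Q)."""
    m = max(q)
    exps = [math.exp(x - m) for x in q]
    total = sum(exps)
    return sum((e / total) * x for e, x in zip(exps, q))


def v_adaption(q, eps):
    """Expectation of Q under an epsilon-greedy policy with uniform exploration."""
    return (1.0 - eps) * max(q) + eps * sum(q) / len(q)


def select_demonstrations(transitions, q_fn, value_fn):
    """Keep the (s, a) pairs whose action led to a state of higher value."""
    return [(s, a) for (s, a, r, s_next) in transitions
            if value_fn(q_fn(s_next)) > value_fn(q_fn(s))]


# Toy Q-table standing in for the RL network (3 states, 2 actions).
q_table = {0: [0.0, 1.0], 1: [2.0, 3.0], 2: [0.0, 0.5]}
transitions = [(0, 1, 0.0, 1), (1, 0, 0.0, 2)]
demos = select_demonstrations(transitions, q_table.__getitem__, v_max)
```

Passing `lambda q: v_adaption(q, eps)` as `value_fn` switches the selection criterion without touching the selection logic, which mirrors how the three variants are compared in the experiments.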

Generalization of SRLVF
In the SRLVF method, demonstrations are selected, through the decision-making evaluation mechanism, from the data generated by the RL-driven interaction between the agent and the environment; the demonstrations and the existing expert data are then used through SL to optimize the performance of the RL algorithm. The data that can optimize decision-making therefore include both the demonstrations generated by the interaction between the agent and the environment and the existing expert samples in different application scenarios.
It is difficult to guarantee the availability of expert samples in different scenarios. When expert samples are poorly available or cannot contribute to decision-making optimization, the effective data come mainly from the interaction between the agent and the environment. These data optimize decision-making through the SL network, so there is no generalization problem for the SL network. In this case, the generalization of SRLVF is mainly influenced by the RL algorithm, and the performance of different RL algorithms differs across scenarios.
When the quality and quantity of expert samples can be guaranteed, the performance of the SRLVF method is affected by the expert samples. The expert samples optimize the decision-making of the RL algorithm through the SL network, which has a beneficial impact on the generalization performance of the SRLVF method.

Experiments
For sequential decision-making tasks modelled by MDPs, each individual task instance needs a sequence of decisions to bring the agent from a starting state to a goal state. According to the length of the decision-making sequence from the starting state to the goal state, decision tasks can be divided into three categories: finite, fixed horizon tasks; indefinite horizon tasks; and infinite horizon tasks. In finite, fixed horizon tasks, the length of the decision-making sequence is fixed. In indefinite horizon tasks, the decision-making sequence can have arbitrary length and ends at the goal state. In infinite horizon tasks, the decision-making sequence does not end at the goal state. The application conditions of the first type are strict, so its scope of application is limited, for example to tutoring students for exams or handling customer service requests. The other two types of tasks are the focus of our attention.
In the Puckworld scenario, the decision-making process ends in the goal state and the length of the decision-making sequence is not fixed, so Puckworld is an indefinite horizon task. In the Pendulum scenario, the decision-making process cannot end because the agent must be kept in the goal state, so Pendulum is an infinite horizon task. The Puckworld and Pendulum scenarios thus represent two types of sequential decision-making tasks modelled by MDPs, and we choose these two scenarios as our agent environments.

Experimental Setup
The performance of the SRLVF method is verified in both the Pendulum and Puckworld scenarios. Figure 3a shows the Pendulum scenario, an experimental RL scenario provided by Gym. In this scenario, the pendulum is kept inverted by continuously applying a force F_c of varying direction and magnitude. To suit the characteristics of RL algorithms based on value function, the continuous force F_c is discretized.
The Puckworld scenario, shown in Figure 3b, is implemented with reference to the code written by Qiang Ye [27]. It mainly includes a predator and a prey. The predator can move freely in four directions, and the prey's position changes randomly at a fixed time interval. The predator, acting as the agent, tries to capture the prey.
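Discretizing the continuous force so that a value-based algorithm can be applied might look like the following sketch. The bound of 2.0 matches the torque limit of the Gym Pendulum environment; the number of levels (`n_levels=11`) and the function names are assumptions, since the discretization actually used is not recoverable from the text.

```python
def make_discrete_forces(f_max=2.0, n_levels=11):
    """Evenly spaced force values in [-f_max, f_max] (n_levels is an assumption)."""
    step = 2.0 * f_max / (n_levels - 1)
    return [-f_max + i * step for i in range(n_levels)]


def action_to_force(action_index, forces):
    """Map the discrete action chosen by the agent to a continuous force."""
    return forces[action_index]


forces = make_discrete_forces()
middle_force = action_to_force(5, forces)  # index 5 of 11 is the zero force
```

The output layer of the RL network then needs one unit per discrete force level.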

Algorithm and Hyperparameter
SRLVF combines RL with demonstrations through SL. In the experiments, we use DQN and DDQN as benchmark algorithms to verify the effectiveness of the SRLVF method. The RL part of the SRLVF method adopts the DQN and DDQN algorithms, which we call SRLVF-based DQN (SDQN) and SRLVF-based DDQN (SDDQN), respectively. The SL part of the SRLVF method uses demonstrations to train the neural network.
The principles of DQN and DDQN have been analyzed in Section 3.2. Both DQN and DDQN have two neural networks: the main network and the target network. Both networks are two-layer fully connected networks with 25 units per layer. In the RL network, the discount factor γ is 0.9 and the learning rate l is 0.005. The SL network is a two-layer fully connected network with 128 units per layer. In the SL network, the learning rate l is 0.01 and the loss function is the cross-entropy.
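The hyperparameters above can be collected in a small configuration sketch. The class and function names are hypothetical, "two layers" is read here as two hidden layers, and the input and output sizes in the example (3 observations, 11 actions) are assumptions for illustration; the activation functions and optimizer are not stated in the text and are therefore left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NetConfig:
    hidden_layers: int
    units_per_layer: int
    learning_rate: float


# Values taken from the text; the field names are illustrative.
RL_NET = NetConfig(hidden_layers=2, units_per_layer=25, learning_rate=0.005)
SL_NET = NetConfig(hidden_layers=2, units_per_layer=128, learning_rate=0.01)
GAMMA = 0.9  # discount factor of the RL network


def layer_sizes(cfg, n_inputs, n_outputs):
    """Layer widths of a fully connected network, input to output."""
    return [n_inputs] + [cfg.units_per_layer] * cfg.hidden_layers + [n_outputs]


def n_parameters(sizes):
    """Number of weights and biases in a fully connected network."""
    return sum((a + 1) * b for a, b in zip(sizes, sizes[1:]))


rl_sizes = layer_sizes(RL_NET, n_inputs=3, n_outputs=11)
rl_params = n_parameters(rl_sizes)
```

Under these assumptions the RL network is far smaller than the SL network, which has 128 units per layer.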

Result and Analysis
To evaluate the performance of SRLVF, we tested SDQN and SDDQN in the Pendulum and Puckworld scenarios, as well as the influence of the different ways of computing v(s). The code is implemented in Python 3.5 with TensorFlow 1.12.0 and Gym 0.10.9.
As illustrated in Figures 4-6, each reward curve is averaged over 7 runs and shown with a continuous error band.
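Averaging reward curves over repeated runs and deriving an error band can be sketched as follows. The text does not say how the error bar is defined, so the use of the population standard deviation here is an assumption, as are the function name and the toy data.

```python
import statistics


def mean_and_band(runs):
    """Per-step mean and standard deviation across runs.

    runs: list of reward curves of equal length, one curve per run.
    """
    means, stds = [], []
    for values in zip(*runs):  # iterate over time steps
        means.append(statistics.mean(values))
        stds.append(statistics.pstdev(values))
    return means, stds


# Two toy runs of two evaluation points each (the paper uses 7 runs).
means, stds = mean_and_band([[1.0, 2.0], [3.0, 4.0]])
```

The band plotted around each curve would then span `means[i] - stds[i]` to `means[i] + stds[i]` at every step.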

Testing the Performance of the SRLVF Method
Figures 4 and 5 show that SDQN and SDDQN outperform DQN and DDQN, respectively, in both the Pendulum and Puckworld scenarios after convergence, although both perform worse before convergence.
We know that the purpose of RL is to approximate the overall strategy, while the decision-making under specific states is unstable. However, SL can optimize the decision-making performance by training with demonstrations. SRLVF can therefore improve the overall performance of the algorithm by combining RL with SL.
In the SRLVF method, we use demonstrations to optimize the decision-making of RL algorithms through the SL network in order to improve performance. Unlike RL algorithms, whose convergence process involves only the training of the RL network, the convergence process of the SRLVF method includes the training of both the RL network and the SL network. Because the demonstrations for training the SL network come from the interaction between the agent and the environment driven by the RL algorithm, and the characteristics of the demonstrations also change while the RL network is being trained, the convergence of the SL network lags behind that of the RL network. During the convergence process of the SRLVF method, the SL network cannot yet optimize the performance of the RL algorithm because it has not converged, and it may even have adverse effects. So in the convergence stage, the performance of the SRLVF method is worse than that of the benchmark algorithms.

Figure 5. (a) Average rewards of DQN and SDQN in Pendulum. (b) Average rewards of DDQN and SDDQN in Pendulum.

Evaluating the Impact of Different v(s) Calculation Methods on the Performance of SRLVF
In the Pendulum and Puckworld scenarios, we test the performance of the SDQN and SDDQN algorithms using the three computational methods v_max, v_adaption and v_sum, respectively. As shown in Figure 6, the performance of the SDQN and SDDQN algorithms with different v(s) computing methods does not differ greatly in either scenario; in the Puckworld scenario in particular, the performance of the algorithms is much the same. In the Pendulum scenario, the performance of SDDQN with v_adaption is slightly worse, but it is the best in the Puckworld scenario. The performance of SDQN with v_adaption is not the best in the Puckworld scenario. These results show that the impact of the different v(s) on SRLVF performance is related to the RL algorithm and the scenario, so the v(s) computing method needs to be selected according to the specific situation.
The value function is an estimated value rather than a deterministic value. In our method, the value function is approximated by a neural network, which is affected by many factors, such as the network parameters, the reward function, the reward delay length, and the task characteristics. Using the state value function obtained by any of the three calculation methods as the selection criterion, we can indeed select high-quality demonstrations that improve performance. However, because many factors affect the estimation of the value function, it is difficult to derive a general rule for choosing among the three calculation methods.

Symmetry 2019, 21, x FOR PEER REVIEW