Learning to Teach Reinforcement Learning Agents

In this article we study the transfer learning model of action advice under a budget. We focus on reinforcement learning teachers providing action advice to heterogeneous students playing the game of Pac-Man under a limited advice budget. First, we examine several critical factors affecting advice quality in this setting, such as the average performance of the teacher, its variance and the importance of reward discounting in advising. The experiments show the non-trivial importance of the coefficient of variation (CV) as a statistic for choosing policies that generate advice. The CV statistic relates variance to the corresponding mean. Second, the article studies policy learning for distributing advice under a budget. Whereas most methods in the relevant literature rely on heuristics for advice distribution, we formulate the problem as a learning one and propose a novel RL algorithm capable of learning when to advise, adapting to the student and the task at hand. Furthermore, we argue that learning to advise under a budget is an instance of a more generic learning problem: Constrained Exploitation Reinforcement Learning.


INTRODUCTION
In the reinforcement learning framework [10], data efficient approaches are especially important for real world and commercial applications, such as robotics. In such domains extensive interaction with the environment needs time and can be costly.
One data efficient approach for RL is transfer learning (TL) [11]. Typically, when an RL agent leverages TL, it uses knowledge acquired in one or more (source) tasks to speed up its learning in a more complex (target) task. Most realistic TL settings require transfer of knowledge between different tasks or heterogeneous agents that can be vastly different from each other (e.g., humans and software agents).
Transferring between heterogeneous agents is often challenging since most methodologies involve exploiting the agents' structural similarity to transfer knowledge between tasks. As an example, TL can be applied between two similar RL agents, which both use the same function approximation method, by transferring their learned parameters. In such a case, a Q-Value transfer solution could be used, combined with an algorithm constructing mappings between the state variables of the two tasks.
Whereas solutions for extracting similarity between tasks have been extensively studied in the past [3], [11], the main problem of transferring between very dissimilar agents (e.g., humans and software agents) remains.
Consider, for example, a game hint system for human players. The game hint system cannot directly transfer its internal knowledge to the human player. Moreover, it should transfer knowledge in a limited and prioritized way since the attention span of humans is limited.
The only prominent knowledge transfer unit between all agents (software, physical or biological) is action. Action suggestion (advice) can be understood by very different agents. However, even when transferring using advice, four problems arise:
1) Decide what to advise (production of advice)
2) Decide when to advise (distribution of advice), especially when using a limited advice budget
3) Determine a common action language in order to appropriately express the advice between heterogeneous agents
4) Communicate the advice effectively, ensuring its timely and noiseless reception
This article focuses on the first two problems, those of deciding when and what to advise under a budget. Moreover, we use the game of Pac-Man to test our methods' effectiveness in a complex domain.
Whereas works such as [16] provide a formal understanding of RL students receiving advice and the implications on the student's learning process (e.g. convergence properties) and papers like [17] and [13] provide practical methods for a teacher to advise agents, this work attempts a new learning formulation of the problem and proposes a novel learning algorithm based on it. We identify and exploit the similarities of the advising under a budget (AuB) problem to the classic exploration-exploitation problem in RL and identify a sub-class of reinforcement learning problems: Constrained Exploitation Reinforcement Learning.
Most successful methodologies for AuB require students to inform their teacher of their intended action. This is not a realistic requirement in many real-world TL problems, since it assumes one more communication channel between the student and the teacher and thus requires some form of structural compliance from the student. An example of how restrictive this requirement is for real-world applications comes from the game hint system example. The system advises the human player on the next action in real time, but the human player could never be expected to announce an intended action beforehand. Part of this work's goal is to alleviate this prerequisite and propose methods that can also work without such knowledge.
arXiv:1707.09079v1 [cs.AI] 28 Jul 2017
Specifically, the contributions of this article are:
• An empirical study on determining an appropriate advising policy in the game of Pac-Man
• A novel application of average reward reinforcement learning to produce advice
• A novel formulation of the learning to advise under budget (AuB) problem as a problem of constrained exploitation RL
• A novel RL algorithm for learning a teaching policy to distribute advice, able to train faster (lower data complexity) than previous learning approaches and advise even when not having knowledge of the student's intended action

BACKGROUND
This section presents the necessary background to understand the methods proposed in this article. Brief introductions are provided to reinforcement learning and transfer learning, which are then followed by a more detailed discussion of the current advising methodologies.

Reinforcement Learning
Reinforcement Learning addresses the problem of how an agent can learn a behaviour through trial-and-error interactions with a dynamic environment [10]. In an RL task the agent, at each time step, senses the environment's state, s ∈ S, where S is the finite set of possible states, and selects an action a ∈ A(s) to execute, where A(s) is the finite set of possible actions in state s. The agent receives a reward, r ∈ R, and moves to a new state s′ ∈ S according to a transition function, T, of the task with $T(s, a, s') = P(s' \mid s, a)$. The general goal of the agent is to maximize the expected return, where the return, G, is defined as some specific function of the reward sequence given also a discounting parameter, γ. The γ parameter, where 0 ≤ γ < 1, controls the importance of short-term rewards over long-term ones, discounting the latter by powers of γ. The outcome will be an action-value function $Q^\pi(s, a)$, which expresses the expected return starting from s, taking action a, and thereafter following policy π : S → A, which dictates how the agent acts in a certain situation in order to maximize the reward received over time.

Transfer Learning and Advising under a Budget
Transfer Learning [11] refers to the process of using knowledge that has been acquired in a previously learned task, the source task, in order to enhance the learning procedure in a new and more complex task, the target task. The more similar these two tasks are, the easier it is to transfer knowledge between them. By similarity, we mean the similarity of their underlying Markov Decision Processes (MDP) that is, the transition and reward functions of the two tasks and also their state and action spaces.
The type of knowledge that can be transferred between tasks varies among different TL methods, including value functions, entire policies, actions (policy advice) or a set of samples from a source task which can be used by a model-based RL algorithm in a target task.
Focusing specifically on policy advice under an advice budget constraint, we identify two aspects of the problem: a) learning a policy to produce advice and b) distributing the advice in the most appropriate way, while respecting the advice budget constraint. Most methods in the literature produce advice by greedily using a learned policy for the task at hand [13], [16], [17]. For advice distribution, most methods rely on some form of heuristic function (and not learning) based on which the teacher decides when to advise. Examples of such methods are Importance Advising and Mistake Correcting [13].
The Importance Advising method produces advice by repeatedly querying a learned policy's value function, on each state the student faces, to obtain the best action for that state. Distribution of advice, that is, deciding when to advise or not, is determined by a heuristic logical expression of the form $\max_a Q(s, a) - \min_a Q(s, a) > t$, where t is a threshold parameter on the state-action value gap between the best and the worst action for that state. If this value gap exceeds the threshold value, t, the state is considered critical and advice is given. The algorithm continues until the advice budget is exhausted.
Mistake Correcting (MC) [13] differs from Importance Advising only in presuming knowledge of the student's intended action. Consequently, it applies the Importance Advising criterion only if the student's intended action is wrong, not wasting advice when the student does not need it.
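As a minimal sketch of the two heuristics described above, assuming the teacher's action values for the current state are available as a plain dictionary, the advising decisions can be written as follows. The function names and the dictionary representation are illustrative assumptions, not the implementation of [13].

```python
from typing import Dict, Hashable, Optional


def importance(q_values: Dict[Hashable, float]) -> float:
    """Value gap between the best and the worst action in one state."""
    return max(q_values.values()) - min(q_values.values())


def importance_advice(q_values: Dict[Hashable, float],
                      threshold: float,
                      budget: int) -> Optional[Hashable]:
    """Importance Advising: advise the greedy action in critical states.

    Returns None when no advice is given (budget spent or gap too small).
    """
    if budget <= 0 or importance(q_values) <= threshold:
        return None
    return max(q_values, key=q_values.get)


def mistake_correcting_advice(q_values: Dict[Hashable, float],
                              threshold: float,
                              budget: int,
                              intended_action: Hashable) -> Optional[Hashable]:
    """Mistake Correcting: same criterion, but only if the student errs.

    Assumes the student announces its intended action beforehand.
    """
    advice = importance_advice(q_values, threshold, budget)
    if advice is None or advice == intended_action:
        return None
    return advice
```

The caller is expected to decrement the budget whenever a non-None action is returned.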
The method presented in [17] (Zimmer's method) formulates the teaching problem as an RL one in order to learn an advice distribution policy. The teacher agent has an action set with two actions, A = {advice, no advice}. The teacher's state space is an augmented version of the student's one and is of the form $s_{teacher} = (s_{student}, a_{student}, b, n_e)$, where $s_{student}$ is the current state vector of the student, $a_{student}$ is the intended action of the student (it is assumed that the student announces the intended action on every step), b is the remaining advice budget and $n_e$ is the student's training episode number. Moreover, the reward signal is a transformed version of the student's reward with an extra positive reward for the teacher when the student reaches its goal in a small number of steps. We note that this method is tested only on the Mountain Car domain and the reward signal proposed for the teacher is domain-dependent.
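The augmented teacher state described above could be represented, for illustration, as a small immutable record. This is only a sketch: the class name, field names and the flattening helper are assumptions made here and do not come from [17].

```python
from dataclasses import dataclass
from typing import List, Tuple

# The two teacher actions named in the text.
TEACHER_ACTIONS = ("advice", "no advice")


@dataclass(frozen=True)
class TeacherState:
    """Teacher state (s_student, a_student, b, n_e) as described in the text."""
    student_state: Tuple[float, ...]  # current state vector of the student
    student_action: int               # intended action announced by the student
    budget_left: int                  # remaining advice budget b
    episode: int                      # student's training episode number n_e


def as_vector(state: TeacherState) -> List[float]:
    """Flatten the record into one feature vector (hypothetical helper)."""
    return list(state.student_state) + [
        float(state.student_action),
        float(state.budget_left),
        float(state.episode),
    ]
```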
The policy advice methods [13], [16], [17] presented in this section will also be used for comparisons in the experiments presented later in this article.

Pac-Man
The experimental domain for the teaching methods presented in this article is the game of Pac-Man. Pac-Man is a famous 1980s arcade game in which the player navigates a maze like the one in Figure 1, trying to earn points by touching edible items and trying to avoid being caught by the four ghosts. In our experiments, we use a Java implementation of the game provided by the Ms. Pac-Man vs. Ghosts League [5], which conducts annual competitions. Ghosts in our setting chase the player 80% of the time and choose actions randomly the remaining 20% of the time.
The player and all ghosts have four actions (move up, down, left, and right), but some actions are occasionally unavailable due to the restrictions in the maze. Four moves are required to travel between the small dots on the grid, which represent food pellets and are worth 10 points each. The larger dots are power pellets, which are worth 50 points each. When the player eats one of the larger ones, the ghosts become edible for a short time, during which they slow down and flee the player. Eating a ghost is worth 200 points (which doubles every time for the duration of a single power pill). Then the ghost respawns in the lair at the center of the maze. The episode ends if any ghost catches Pac-Man, or after 2000 steps.
This domain is discrete but has a very large state space. There are 1293 distinct locations in the maze, and a complete state consists of the locations of Pac-Man, the ghosts, the food pellets, and the power pills, along with each ghost's previous move and whether or not it is edible. The combinatorial explosion of possible states makes it essential to approach this domain through high-level feature construction and Q-function approximation.
In this article, we follow previous work [13] that adopted a high-level feature set (high-asymptote feature set) comprised of action-specific features. When using action-specific features, a feature set is really a set of functions $\{f_1(s, a), f_2(s, a), \ldots\}$. All actions share one Q-function, which associates a weight with each feature. A Q-value is $Q(s, a) = w_0 + \sum_i w_i f_i(s, a)$. To achieve gradient-descent convergence, it is important to have the extra bias weight $w_0$ and also to normalize the features to the range [0, 1].
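The shared linear Q-function can be sketched as below. The feature functions are placeholders supplied by the caller, and the update shown is the generic gradient step for a linear approximator driven by a given TD error. It is an illustrative assumption, not the exact learner used in [13].

```python
from typing import Callable, List, Sequence, Tuple

# An action-specific feature maps a (state, action) pair to a value in [0, 1].
Feature = Callable[[object, object], float]


def q_value(weights: Sequence[float], bias: float,
            features: Sequence[Feature], state, action) -> float:
    """Q(s, a) = w_0 + sum_i w_i * f_i(s, a), shared by all actions."""
    return bias + sum(w * f(state, action) for w, f in zip(weights, features))


def gradient_step(weights: Sequence[float], bias: float,
                  features: Sequence[Feature], state, action,
                  td_error: float, alpha: float) -> Tuple[List[float], float]:
    """One gradient-descent step: each weight moves along its own feature.

    The bias behaves like a weight on a constant feature equal to 1.
    """
    new_weights = [w + alpha * td_error * f(state, action)
                   for w, f in zip(weights, features)]
    new_bias = bias + alpha * td_error
    return new_weights, new_bias
```

Because every feature is evaluated on the (state, action) pair, one weight vector is enough to rank all actions in a state.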
For the state representation, we define a feature set consisting of 7 features that count objects at a range of distances from Pac-Man in the maze, as used (and defined) in previous work [13].
A perfect score in an episode is 5600 points, but this is quite difficult to achieve (for both human and agent players). An agent executing random actions earns an average of 250 points. The 7-feature set allows an agent to learn to catch some edible ghosts and achieve a per-episode average of 3800 points.
Notes to Table 1:
† A teacher agent may not have a teaching value function $Q_T$, relying on a hand-coded or heuristic teaching policy.
* If the teacher has learned to act in the same MDP as the student, $M_\Sigma = M$.
• In this work we assume $R_\Sigma = R$ and $L_\Sigma = L$. All agents acting in the task have the same rewards and goals.

THE TEACHING TASK
In this section we attempt a more formal understanding of a teaching task that is based on action advice. The necessary notation is presented in Table 1.

Definitions
Definition 3.1. Student
A student agent is an agent acting in an environment and capable of accepting advice from another agent¹.

Definition 3.2. Teacher
A teacher agent is an agent capable of executing and informing a teaching policy (see Definition 3.7) to provide action advice to a student agent acting in a specific task.

Definition 3.3. Acting Task
The acting task is the task for which the teacher gives advice and can be defined as an MDP of the form $M = \langle S, A, T, R, \gamma \rangle$ on an environment E.

Definition 3.4. Teaching Task
The teaching task is the task of providing action advice to a student agent to assist it in learning the acting task faster or better. Any teaching task is accompanied by a finite advice budget, B.
Definition 3.5. Teaching Action Space
Given the action space A of the acting task, the action space of the teacher in time-step t is:
$$A_T = \begin{cases} A \cup \{\bot\} & \text{if } b_t > 0 \\ \{\bot\} & \text{if } b_t = 0 \end{cases}$$
where a ∈ A is an action of the acting task given as advice and ⊥ is the no advice action, meaning that the teacher will not give advice in this step, allowing the student to act on its own. $b_t$ is the advice budget left in time-step t.
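One way to read this definition in code is sketched below, under the assumption that only the no advice action remains available once the budget is spent. The sentinel `NO_ADVICE` and the function name are hypothetical.

```python
from typing import Hashable, List, Optional, Sequence

# Sentinel standing in for the no-advice action (written ⊥ in the text).
NO_ADVICE = None


def teaching_actions(acting_actions: Sequence[Hashable],
                     budget_left: int) -> List[Optional[Hashable]]:
    """Actions available to the teacher given the remaining budget b_t."""
    if budget_left > 0:
        # Any acting-task action can be advised, or the teacher can abstain.
        return list(acting_actions) + [NO_ADVICE]
    # Assumption: with no budget left the teacher can only abstain.
    return [NO_ADVICE]
```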

Definition 3.6. Teaching State Space
The teacher agent's state space in time-step t, $S_T$, is defined in terms of $\Theta_t$, a tuple containing any knowledge we can have about the student and its MDP in time-step t. If the student's MDP is $M_\Sigma = \langle S, A, T, R, \gamma \rangle$ and the teacher observes the current state of the student, $s_t \in S$, its reward, $r_t \in R$, and its action, $a_t \in A$, then $\Theta_t = \langle s_t, r_t, a_t \rangle$.
Definition 3.7. Teaching Policy
A teaching policy, $\pi_T$, is a deterministic policy of the form:
$$\pi_T : S_T \rightarrow A_T$$
where $S_T$ and $A_T$ are the teaching state and action spaces respectively (see Definitions 3.6 and 3.5).

1. In this work we assume that a student agent always follows the given advice.
The teaching policy actually transforms the acting policy, $\pi_\Sigma(s, a)$, of an actor agent (expressed through its respective state-action value function, $Q_\Sigma$), to a policy producing advice under budget. We should also note that a teaching policy will usually ([13], [16], [17]) set $a = \arg\max_a Q_\Sigma(s, a)$, which means that the teaching policy is greedy with respect to the acting value function, $Q_\Sigma$.
As a minimal example of the proposed formulation, the Importance Advising method [13], which uses the state importance criterion (see Section 2.2), can be said to use a teaching state space $S_T = \langle \Theta = \{s\}, B, Q_\Sigma \rangle$, as it requires knowledge only about the current state, s, of the student, the advice budget, B, and an acting policy, $Q_\Sigma$, from which it produces advice.

Learning to Teach
The definitions presented in subsection 3.1 apply to any teacher agent even if it advises based on a heuristic function. In the following, we focus on teachers that use RL to learn a teaching policy (i.e., advice distribution policy).
In its most simplified version, the learning to teach task employs two agents: the teacher and the student. In the first learning phase a teacher agent has the role of the actor: it learns the acting task alone. It observes a state space $S_\Sigma$ and has an action set $A_\Sigma$. Based on a reward signal $R_\Sigma$ received from the environment, it learns a policy $\pi_\Sigma$ to achieve the acting task goal $L_\Sigma$. In our context, this first learning phase can be seen as the advice production phase since the teacher learns the policy that will be used to advise a student later on.
At any time-step t the teacher agent may have to stop acting, and a new agent, the student, enters the acting task and the corresponding environment.
Consequently, the teacher agent now has to learn and use a teaching policy for the specific task to achieve the teaching goal, $L_T$. In addition to the definitions given in Section 3, this second learning phase (learning a teaching policy) requires realizing and formulating the following:
• Return Horizon. Even if the teaching task is formulated as an episodic one, the teaching episode, also referred to as a session, does not necessarily match the student's learning episode. The teacher's episode scope is greater and could track several learning episodes of the student.
• Reward signal. A different return horizon implies a different task goal and, consequently, the teacher's reward signal can be different from the student's (e.g., encouraging the learning progress of the student more than its absolute learning performance).
Moreover, defining the teacher's state space as a superset of the student's state space (see Definition 3.6) indicates one more difficulty of the learning to teach task. From the teacher's point of view the student can be considered a time-inhomogeneous Markov Chain (MC) [9], $X = (X_t : t \ge 0)$. This is because the transition matrix $P_t$ of the student's MC depends on time, since the student is learning and constantly changing its policy. The time inhomogeneity of this MC poses significant difficulties in handling the problem theoretically. Homogenizing this MC by defining it as a space-time MC, $(X_t, t)$, can make practical solutions feasible, but theoretical treatment is still difficult (e.g., no stationary distributions exist in this case).
In general, every learning task can have its corresponding teaching task, which could be thought of as its dual. As learning to act in a specific task and teaching that task can be considered different tasks, they have their own goals and, consequently, are "described" by different reward signals.
As an example, in [17] a teacher agent for the mountain car domain has a different reward signal to that of the student, encouraging teaching policies that help the student reach its goal sooner.
Learning a teaching policy, as described above, could be modelled by many different types of Markov Processes. However, none of the classic MDP formulations completely models the specific learning problem as a whole, either because it does not handle the non-stationarity of the problem or because it does not handle the specific budget constraint imposed on the advising action. This fact is the main motivation of Section 5, where we present our proposed method for learning teaching policies.

LEARNING TO PRODUCE ADVICE
In this section, we focus on the advice itself and its production (not its distribution). The main challenge in producing advice based on the Q-Values of an RL value function is that these values are valid only if the policy they represent is fully followed, not when this policy is sparingly sampled to produce advice.
Based on previous methods in the literature (see Section 6), the most common teacher's criterion for selecting which action to advise is $\pi_\Sigma(s) = \arg\max_a Q_\Sigma(s, a)$, that is, greedy selection of the best action based on the teacher's acting value function. However, the value of $Q_\Sigma(s, a)$ is not correct under the advising scenario since it is accurate only if the student continues following the teacher's acting policy, $\pi_\Sigma$, thereafter. Unfortunately, this is usually not the case in our context since the student, after receiving advice, will often continue for a long period using its own policy exclusively. Even worse, in the early training phases, when advice is needed the most, the student's policy will be vastly different from the teacher's.
This realization is even more important if we consider how different the teacher and the student agents are allowed to be in our context. Consider a human student receiving advice in the game of Pac-Man. Human players often play fast-paced action games in a myopic and reactive manner, seeking short-term survival and not a long-term strategic advantage.
In that case, a human student infrequently advised by a policy learned using a high γ value close to 1 will often be misled to locally sub-optimal actions because these actions may be highly valued under the teacher's far-sighted policy. The human player will probably not follow such a policy thereafter and has therefore been misled to an action that would be useful only if the rest of the teacher's acting policy were followed too.
Ideally, we would like to use a teacher's acting policy that would be mostly invariant to the student's particularities. Such a teacher's policy would advise actions that are good on average, whatever policy is followed thereafter by the student and whatever its internals and parameters are (e.g., its γ value).
In this article, we propose that the above considerations should affect the way we learn policies intended for teachers. Selecting a specific policy for advising, the RL algorithm producing it and its parameters, form a model selection problem for RL teachers.

Model Selection for Teachers
In this section, we want to investigate how factors such as the teacher's γ value (see Section 2.1) influence advice quality for students that can possibly have very different characteristics (e.g., a myopic student and a far-sighted teacher). This is important in order to understand which teacher-agent differences affect the teaching performance the most.
To assess the influence of the γ value in the teaching process, we experiment using an RL algorithm like R-Learning [7], [10], which does not use a γ value for the calculation of state-action values and relies on estimating the average reward received by the student, using its policy from any state and thereafter.
Specifically, R-Learning is an infinite-horizon RL algorithm where a different optimality criterion is used such that the value Q(s, a) given action a and state s under policy π is defined as the expectation:
$$Q^\pi(s, a) = \sum_{k=1}^{\infty} E_\pi\left[ r_{t+k} - \rho^\pi \mid s_t = s, a_t = a \right]$$
where $\rho^\pi$ is the average expected reward per time step under policy π. The intuition behind R-Learning is that in the long run the average reward obtained by a specific policy is the same, but some state-action pairs receive better-than-average rewards for a while, while others may receive worse-than-average rewards. This transient, the difference to the average reward received, $\rho^\pi$, is what defines the state-action value. To keep a running estimate of the average reward, R-Learning uses a second update rule and one more parameter, β, for the learning rate of that update. Using R-Learning to learn a teacher's acting policy along with the rest of the experiments presented in Section 4.2, we can assess the importance of the γ value and of a γ value mismatch between student and teacher. Moreover, we assess other factors that possibly influence the quality of advice, such as the performance of the teacher in the acting task, its performance variance and a possible relation of its average td-error [10] with the quality of advising.
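The two update rules of R-Learning can be sketched in tabular form as below, following the common textbook presentation of the algorithm. This is an illustrative assumption: the experiments in this article use function approximation over Pac-Man features, and the class and method names here are hypothetical.

```python
from collections import defaultdict


class RLearner:
    """Tabular R-Learning sketch: relative values plus an average-reward estimate."""

    def __init__(self, actions, alpha=0.1, beta=0.01):
        self.actions = list(actions)
        self.alpha = alpha            # learning rate for the action values
        self.beta = beta              # learning rate for the average reward
        self.rho = 0.0                # running estimate of the average reward
        self.q = defaultdict(float)   # action values, default 0

    def best_value(self, state):
        return max(self.q[(state, a)] for a in self.actions)

    def update(self, state, action, reward, next_state):
        # First rule: move Q(s, a) toward the average-adjusted target.
        target = reward - self.rho + self.best_value(next_state)
        self.q[(state, action)] += self.alpha * (target - self.q[(state, action)])
        # Second rule: refresh rho only when the taken action is greedy.
        if self.q[(state, action)] == self.best_value(state):
            self.rho += self.beta * (reward - self.rho
                                     + self.best_value(next_state)
                                     - self.best_value(state))
```

Note that no discount factor appears anywhere: rewards are compared with the running average `rho` instead of being discounted.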
As defined in [10], the td-error, $\delta_t$, represents the value estimation error of a value function for a specific state s and action a at time t. For the Q-Learning [14] algorithm that is:
$$\delta_t = r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t) \qquad (4)$$
This is also part of the Q-Learning update rule. Furthermore, by dividing (4) by the previous value estimation, $Q(s_t, a_t)$, we get the percentage of error in relation to it, which we can call the td-error percentage:
$$\delta^{\%}_t = \frac{\delta_t}{Q(s_t, a_t)}$$
where $Q(s_t, a_t) \neq 0$. In our context, when the teacher uses an acting policy to produce advice it can still compute, for each of the student's experiences, its own td-error just as it would do if it were actually making a learning update. In the same context, we can intuitively say that $\delta^{\%}_t$ represents the teacher's surprise² at its new estimation of a state-action value.
Consequently, a teacher with a high average td-error percentage, $\delta^{\%}$, is a teacher with a more unreliable value estimation and can therefore be less suitable as a teacher, since its action suggestions are based on a non-converged value function.
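As a small worked sketch of the two quantities above, assuming the teacher can evaluate its own value function on each student transition, the td-error and its percentage can be computed as follows. The function names are illustrative.

```python
from typing import Optional


def td_error(reward: float, gamma: float,
             q_next_max: float, q_current: float) -> float:
    """Q-Learning td-error: r + gamma * max_a Q(s', a) - Q(s, a)."""
    return reward + gamma * q_next_max - q_current


def td_error_percentage(reward: float, gamma: float,
                        q_next_max: float, q_current: float) -> Optional[float]:
    """td-error relative to the previous estimate Q(s, a).

    Undefined when Q(s, a) is zero, in which case None is returned.
    """
    if q_current == 0:
        return None
    return td_error(reward, gamma, q_next_max, q_current) / q_current
```

For example, with a reward of 1, γ = 0.5, a next-state maximum of 4 and a current estimate of 2, the td-error is 1 and the td-error percentage is 0.5.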

Experiments and Results
Based on the discussion in the previous section (Section 4.1), the main goal of the following experiments is to find the teacher's policy parameters (such as γ) that affect the quality of advice most for different student parameters. The experimental design is as follows. In the first phase, we created γ-specific teachers by training five Q-Learning agents and one R-Learning agent for 1000 episodes. The Q-Learning agents all had the same parameters, except γ, which took values in {0.05, 0.2, 0.6, 0.9, 1.0}. The rest of their parameters were the same and fixed, specifically ε = 0.05 and α = 0.001 (the same as in previous work [13]). The λ parameter accounting for eligibility traces was set to zero so that the effect of experimentally controlling the γ parameter is isolated. Finally, the parameter β of R-Learning was set to 0.0001 (preliminary experiments found that it produced good results in Pac-Man).
After training for 1000 episodes, the γ-specific Q-Learning teachers and the R-Learning teacher were evaluated on 500 episodes of acting alone in the environment. We calculated their average episode score and the coefficient of variation of these scores, as both are possible determining factors of advice quality. The coefficient of variation was used as a measure of score discrepancy as it shows the extent of variability in relation to the mean of the score, allowing a clearer comparison of variance between methods with different average performance. It is a unit-less measure calculated as $c_v = \sigma / \mu$. In Table 2 we can see their average episode score on 500 episodes along with the coefficient of variation of that score. R-Learning had significantly worse average acting performance than all versions of Q-Learning. Interestingly, episodic Q-Learning (with γ close to 1) did not perform as well as expected. Moreover, a very low γ value (0.05) came second, showing that a myopic RL agent can perform well in Pac-Man. This result indicates the highly stochastic nature of the game, where reactive short-sighted strategies, based more on survival, can perform better than far-sighted strategies.
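The statistic itself is simple to compute from a list of episode scores, as the sketch below shows. The use of the population standard deviation is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, pstdev
from typing import Sequence


def coefficient_of_variation(scores: Sequence[float]) -> float:
    """Unit-less spread of the scores: standard deviation over mean."""
    mu = mean(scores)
    if mu == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    # Assumption: population standard deviation (pstdev), not the sample one.
    return pstdev(scores) / mu


# Two hypothetical teachers with the same mean score but different spread.
steady = [90, 100, 110]
erratic = [10, 100, 190]
print(coefficient_of_variation(steady) < coefficient_of_variation(erratic))
```

Both toy teachers average 100 points, so a mean-based criterion cannot separate them, while the CV does.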
2. Note that this definition of surprise, although similar, is different from that presented in [15], which normalizes for different learners and not for different state-action pairs.

Figure 2: Negative bars indicate negative transfer (average score decrease). Error bars indicate 95% confidence intervals (CI) of the means. Non-overlapping CIs indicate statistically significant differences of the means, whereas overlapping CIs are inconclusive³.

After the initial training and the evaluation of the acting policies they learned, these agents could be used as teachers for tabula-rasa student agents. In these experiments we used a simple fixed teaching-advising policy called Every-4-Steps for all teachers, since we focus only on the quality of the advice itself and not on the quality of its distribution to the student (teaching policy).
In the Every-4-Steps teaching policy, the teacher gives one piece of advice to the student every four steps. Using this fixed advising policy we can test and compare the efficacy of the advice when this is not given consecutively, thus testing how useful the advice is when the student does not take a complete policy trajectory from the teacher, but has to use its own policy in between.
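A minimal sketch of such a fixed teaching policy is given below. Whether the first piece of advice falls on step 0 or on step 4 is not specified in the text, so advising on steps 0, 4, 8, ... is an assumption, as are the function names.

```python
from typing import List


def should_advise(step: int, budget_left: int, n: int = 4) -> bool:
    """Advise on every n-th step while any budget remains."""
    return budget_left > 0 and step % n == 0


def advised_steps(num_steps: int, budget: int, n: int = 4) -> List[int]:
    """Return the steps at which advice is given during one session."""
    steps = []
    for step in range(num_steps):
        if should_advise(step, budget, n):
            steps.append(step)
            budget -= 1  # each piece of advice consumes one unit of budget
    return steps


print(advised_steps(20, budget=3))
```

Between two advised steps the student acts with its own policy, which is what makes this policy a test of advice that is not followed by a complete teacher trajectory.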
Using the teaching policy Every-4-Steps and a budget of B = 1000 advice we ran 30 trials of advising learning students for each γ-specific teacher-student pair. Specifically, the γ parameters of these teacher-student pairs come from the Cartesian product {0.05, 0.2, 0.6, 0.9, 0.999, −} × {0.05, 0.2, 0.6, 0.9, 0.999} (30 pairs), where the R-Learning teacher in the first set is denoted with a "-" since it does not have a γ value.
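The pairing can be enumerated directly, as in the short sketch below, where `None` is used as a placeholder for the R-Learning teacher that has no γ value. The variable names are illustrative.

```python
from itertools import product

# None stands for the R-Learning teacher, denoted "-" in the text.
teacher_gammas = [0.05, 0.2, 0.6, 0.9, 0.999, None]
student_gammas = [0.05, 0.2, 0.6, 0.9, 0.999]

# Every teacher is paired with every student: 6 x 5 combinations.
pairs = list(product(teacher_gammas, student_gammas))
print(len(pairs))
```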
In Figure 2 we can see the average performance of each teacher-student pair compared to the same student not receiving advice at all. Combining these results with Table 2, which shows the teachers' performance when acting alone, we can see that the best performer, defined as the agent with the best average score when acting alone in the task, is not the best teacher. The best example of this is R-Learning, whose average score was worse than that of any γ-specific Q-Learning agent; however, as we can see in Figure 2, it is almost as good a teacher as the γ = 0.999 Q-Learning teacher. R-Learning advising improved the score of all students whatever their γ value, without resulting in negative transfer for any of them.
Moreover, we can see a pattern where the lower the coefficient of variation (CV) of the acting performance is, the better the teacher, indicating that CV can be an important criterion in model selection for teachers. This is non-trivial since average agent performance (and not its variance) is the dominant model selection criterion adopted in most of the relevant literature in RL. Performance variance expressed by CV seems especially important in our context, that of sparse advising, where the advice should be good whatever the next actions of the student will be.

3. For the non-conclusiveness of overlapping confidence intervals, a simple and intuitive explanation can be found in https://www.cscu.cornell.edu/news/statnews/stnews73.pdf
Based on the results presented here, we cannot observe any particular pattern relating teaching performance to the γ values of a teacher-student pair. Interestingly, though, a γ = 0.999 teacher is not the most helpful for a γ = 0.999 student. Moreover, a γ = 0.2 teacher for a γ = 0.2 student results in significant negative transfer. The teacher with the episodic γ value, γ = 0.999, and the non-discounting R-Learning one were the most helpful to all students, showing that R-Learning can perform well in settings where the student's γ is unknown or varying, such as in the case of human students.
Having identified the possible use of R-Learning for producing acting policies suitable for advising and the importance of performance CV to model selection, we conducted one more experiment between identical teachers.
Specifically, we independently trained 30 Q-Learning teachers with the same parameters, feature sets and characteristics for 1000 episodes. Due to their different experiences and the stochasticity of the game they naturally learned different policies (i.e., final feature weights in their function approximators). Then, the trained teachers played alone for 500 episodes and we recorded their average performance, average performance variance and average TD-error percentage, $\Delta^{pct.}_t$, as defined in Section 4.1. We then used the Every-4-Steps teaching policy with each one of them advising a standard Sarsa [6] student that learned the task for 1000 episodes. Finally, we recorded the student's average score.
In Figure 3 we can see a correlation plot of the factors mentioned above using a one-tailed non-parametric Spearman correlation test at p < 0.05. Confirming the previous results, we can see the negative and statistically significant relation of CV to teaching performance with r = −0.3. Acting performance also has a medium and positive correlation of r = 0.3 with teaching performance (student's score), but it falls marginally short of statistical significance. By weighing average performance in its calculation, CV has a stronger relation to teaching performance than plain statistical variance. Moreover, we see that the teacher's surprise, $\Delta^{pct.}_t$, relates strongly (r = −0.66) and negatively to the acting performance of the teacher and not to its teaching performance (r < −0.05).
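For reference, the Spearman coefficient used in this analysis is the Pearson correlation of the ranks of the two samples. The sketch below computes only the coefficient, using average ranks for ties. It does not compute the one-tailed p-value, and it is a generic illustration rather than the analysis code behind Figure 3.

```python
from math import sqrt
from statistics import mean
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation of two equally long samples."""
    return pearson(average_ranks(x), average_ranks(y))
```

Being rank-based, the coefficient only measures whether one factor tends to increase or decrease with the other, not whether the relation is linear.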

LEARNING TO DISTRIBUTE ADVICE
In this section we change focus from advice production to advice distribution, learning a teaching policy in order to most effectively distribute the advice budget.

Constrained Exploitation Reinforcement Learning
We attempt a more natural formulation of the AuB learning problem described in Section 3.2 by identifying it as an instance of a more generic reinforcement learning problem. This RL problem can be simply described as learning control with constraints imposed on the exploitation ability of the learning agent. These constraints can be a finite number of times the agent can exploit using its policy, states where it is only allowed to explore, or even a task where access to an optimal policy is costly and we are allowed to use it only a limited number of times. How does this RL problem relate to the learning to teach problem? The first insight is that the advise/no-advise decision problem has a striking resemblance to the core exploration-exploitation problem of RL agents. Consider the learning to teach problem. We can view it as follows: when the teacher agent is advising, it is actually acting on the environment, because an obedient student agent will always apply its advice, thus becoming a deterministic actuator for the teacher. In the case of a non-obedient student, the teacher could be said to be using a stochastic actuator.
Consequently, we can view the teacher agent as an acting agent using a student agent as its actuator for the environment. Moreover, the teacher is acting greedily by advising its best action; thus, it exploits. Under this perspective, with advice seen as action, how could we view the no-advice action of a teacher? The no-advice action can be seen as "trusting" the student to control the environment autonomously. Thus, choosing not to advise in a specific state can be seen as denoting that state to be non-critical with respect to the remaining advice budget and the student's learning progress, or as denoting the teacher's lack of knowledge about that state. From the teacher's point of view, not advising can be seen as an exploration action, so controlling when not to advise can be seen as a directed exploration problem in MDPs. Imposing a budget constraint, that is, a constraint on the number of times a teacher agent can advise (i.e., exploit), yields a problem of constrained and directed exploitation.
We will consider a simple and motivating example of such a domain. In a 10 × 10 grid world, a robot learns an optimal path towards a rewarding goal state while it should keep away from a specific damaging state. The robot is semiautonomous: it can either control itself using its own policy or be teleoperated a limited number of times. For the robot's operator, what is an optimal use of this finite number of control interventions? In which states would it be best to control the robot directly, leaving control of the rest to the robot?
Similarly to the previous example, learning and executing advising policies in a game is another instance of the constrained exploitation problem, and it is the main focus of this article. For example, in a video game like Pac-Man, a game hints system plays the role of the external optimal controller with a limited intervention budget. Such a hint system could suggest actions to human players when these are most necessary, depending also on the player's policy.
In the rest of this section, we use the term exploitation where one can think of advising and the term exploration when not-advising, focusing on the broader learning problem.

Learning Constrained Exploitation policies
Formulating the constrained exploitation task as a reinforcement learning problem itself first requires defining a horizon for the returns. This horizon should be different from that of the actual underlying task (e.g., Pac-Man) because (a) if the underlying task is episodic, then the scope of an exploration-exploitation policy is naturally greater than a single episode and spans many episodes of the learning agent, and (b) if the underlying task is continuing or requires several training episodes for the student, the exploration-exploitation policy may have to be evaluated in a shorter (finite) horizon (e.g., for the first x training episodes). The importance of exploration is usually limited in the late episodes, where the student may have already converged to a policy. A teaching policy should be primarily evaluated for a training period where advice still matters.
Concerning the return horizon of a constrained exploitation task (and similarly to [16] but from a different perspective), we propose algorithmic convergence [16] as a suitable stopping criterion for an exploration-exploitation policy. This defines a meaningful horizon for exploration-exploitation tasks since their goal is completed exactly then, not at the end of an episode and not in the continued execution of an RL algorithm after convergence, where exploration may not affect the underlying policy anymore. We proceed by defining the Convergence Horizon Return.

Definition 5.1. Convergence Horizon Return
Let G be the return of the rewards $r_t$ received by an exploration-exploitation policy, Q the value function of the underlying MDP and $\epsilon \in \mathbb{R}$ a small constant. Then:

$$G = \sum_{t=0}^{T} r_t$$

where for the time step T the following applies:

$$|Q_{T+1} - Q_T| < \epsilon$$

Given a small constant $\epsilon$ and the algorithmic convergence of the RL algorithm learning in the underlying MDP, the quantity $\Delta Q = |Q_{t+1} - Q_t| \to 0$ as $t \to \infty$. The algorithmic convergence will be realized either if the learning rate α is discounted or if some temporal difference $\Delta_t$ of the underlying algorithm tends to $\epsilon$.
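The following Python sketch illustrates how a convergence horizon could be detected in a tabular setting and how the return up to that horizon could be accumulated. It assumes that snapshots of the value function are stored as dictionaries with identical keys, that the change in Q is measured with the maximum absolute difference, and that the return is an undiscounted sum; all names and numbers are hypothetical.

```python
def convergence_horizon(q_snapshots, epsilon):
    """Return the first T with max |Q_{T+1} - Q_T| < epsilon, or None."""
    for t in range(len(q_snapshots) - 1):
        current, following = q_snapshots[t], q_snapshots[t + 1]
        delta = max(abs(following[key] - current[key]) for key in current)
        if delta < epsilon:
            return t
    return None


def convergence_horizon_return(rewards, horizon):
    """Sum the rewards r_0 .. r_T received up to the horizon T."""
    return sum(rewards[: horizon + 1])


# Hypothetical value-function snapshots for a single state-action pair.
snapshots = [
    {("s", "a"): 0.0},
    {("s", "a"): 0.5},
    {("s", "a"): 0.75},
    {("s", "a"): 0.755},
]
rewards = [1.0, 0.0, 2.0, 5.0]

T = convergence_horizon(snapshots, epsilon=0.01)
G = convergence_horizon_return(rewards, T)
```

Here the change in Q first drops below the threshold between the third and fourth snapshot, so T = 2 and the last reward is excluded from the return.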
With the convergence horizon also defining the return of a teaching task, the next question is which rewards $r_t$ constitute that return.
One possible goal for any teacher advising with a finite amount of advice would be to help minimize the student's regret with respect to the reward obtained by an optimal policy. However, since we do not assume such knowledge, and because there is a finite amount of advice, a better goal could be to advise based on the state-action value of the advised action and not its immediate reward. If the student were able to follow the rest of the teacher's policy after receiving advice, then the action $a = \arg\max_a Q_\Sigma(s, a)$ for the current state s would be the best possible. Consequently, we define the notion of value regret.

Definition 5.2. Value Regret
In a convergence horizon T, the value regret $R_V$ of an exploration-exploitation policy (i.e., teaching policy) with respect to both an acting policy $\pi^*$ obtained after the T period and an acting policy (i.e., student's policy) $\pi_t$ in time step t is:

$$R_V = \sum_{t=0}^{T} \left[ \max_a Q^*(s_t, a) - Q^*(s_t, \pi_t(s_t)) \right]$$

where $Q^*$ denotes the corresponding value function of $\pi^*$. The intuition behind this definition of regret in our context (where the acting agent is the student) is that the best teacher for any specific student would ideally be the student itself, once it has reached convergence or its near-optimal policy.
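A small worked example may make the value regret concrete. The Python sketch below assumes a tabular value function for the converged policy and accumulates, over the visited states, the gap between the best action value and the value of the action the student actually took; the function name, the summation over steps and the toy values are assumptions made for illustration.

```python
def value_regret(q_star, visited_states, student_actions):
    """Accumulate max_a Q*(s_t, a) - Q*(s_t, pi_t(s_t)) over the steps."""
    total = 0.0
    for state, action in zip(visited_states, student_actions):
        best_value = max(q_star[state].values())
        total += best_value - q_star[state][action]
    return total


# Hypothetical converged action values for two states.
q_star = {
    "A": {"left": 1.0, "right": 3.0},
    "B": {"left": 2.0, "right": 0.5},
}
states = ["A", "B", "A"]
actions = ["left", "left", "right"]

regret = value_regret(q_star, states, actions)
```

Only the first step contributes here: the student picks "left" in state A, losing a value of 2.0, and then acts greedily in the remaining two steps.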
The important thing to note here is that because a student agent receives a finite amount of advice, it cannot improve its asymptotic performance [16]. Consequently, the evaluation of a teaching policy should ideally be based on the student's optimal policy, which is its sustainable optimality, and not on that of some probably very different teacher.
For example, consider two states in a teacher's acting MDP, A and B. A student agent learning with a very simplistic state representation may observe these states as just one, C, and not differentiate between them. Then, the student's optimal action in state C will have a different expected return than that obtained by the teacher from either A or B. The student's sustainable optimality is defined by what is optimal given its simplistic internal representation. Any advice based on a finer representation may not be supported consistently by the student in the long run. A teaching policy should ideally be evaluated on how much it speeds up the student's convergence to its own optimal policy.
In the next section we propose a reward signal for teachers based on Value Regret.

The Q-Teaching algorithm
The Q-Teaching algorithm described and proposed in this section is an RL advising (teaching) algorithm that learns a teaching policy. For this, we propose a novel reward scheme for the teacher based on the value regret (see Definition 5.2).

Algorithm 1 Q-Teaching
    ...
 5: repeat (for each step)
 6:     a* ← argmax_a Q_Σ(s, a)
 7:     if (Off-Student's policy Q-Teaching) then
 8:         â ← argmin_a Q_Σ(s, a)
 9:     else
10:         â ← a, where a is the action announced by the student
11:     end if
12:     Choose a_T from s_T using policy derived from Q_T (e.g. ε-greedy)
13:     if a_T = {advice} then
14:         Advise the student with the action a*
    ...
27: until advice budget finishes OR reached the estimated convergence horizon episode of the student
28: until end of teaching episodes

The key insight of the method is that of rewarding a teaching policy with quantities of the form $\max_a Q^*(s_t, a) - Q^*(s_t, \pi_t(s_t))$, where $\pi_t(s_t)$ is an estimation of the student's action in $s_t$ and $\max_a Q^*(s_t, a)$ is the value of the teacher's greedy action in $s_t$ (i.e., the action used for advice). This reward has a high value when the value of the greedy action is significantly higher than the value of the action that the student would take. This means that the teacher is encouraged to advise when the advised action is significantly better than the action the student would take.
For efficiency, and to emphasize the value impact of the advising action, Q-Teaching rewards all no-advice actions with zero. The advantage of such a scheme is that the teacher's cumulative reward is based only on the value gain produced when advising, and a teaching episode can finish when the budget finishes, without having to observe all the student's episodes after that point. In preliminary experiments, rewarding no-advice actions too (which occur significantly more often than the maximum of B advice actions) overpowered the advice actions, resulting in an imbalanced expression of the two actions in the teaching value function.
Still, when advising, the teacher should estimate $Q^*(s_t, \pi_t(s_t))$ in order to compute its reward. Since we do not have access to the value function of the student or its internals, the simplest solution is to use the acting value function $Q_\Sigma$ of the teacher as an approximation of the optimal value function of the student, $Q^*$. To estimate $\pi_t(s_t)$ the teacher has several options. If the teacher is notified of the intended action of the student beforehand, it can use that to compute the reward. If we assume no knowledge of the student's intended action, then some other estimation method should be used. An example of such an estimation method is used in the Predictive Advice method [13].
While predicting the actual student's action ($\pi_t(s_t)$) is possible, there are other, simpler choices for this estimation too. For example, Importance Advising (see Section 2.2) uses a very similar quantity for the advising threshold, of the form $\max_a Q^*(s_t, a) - \min_a Q^*(s_t, a)$. For Importance Advising, we can say that $\pi_t(s_t) = \arg\min_a Q^*(s_t, a)$: it pessimistically assumes the student will take the worst action, representing the risk of the state. The advantage of such an assignment is that it is based on a well-tested criterion [13] and that it does not need knowledge of the student's intended action (desirable for most realistic settings). The disadvantage is that the reward is less detailed and adapts mostly to the domain's characteristics rather than to the student's specific needs.
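The importance criterion described above can be sketched in a few lines of Python. The class below is a hypothetical illustration of a threshold-based advisor with a budget: it treats the gap between the best and worst action values as the state's importance and advises the greedy action while budget remains. The class name, the dictionary-based value representation and the example values are assumptions.

```python
def state_importance(q_values):
    """Gap between the best and the worst action value in a state."""
    return max(q_values.values()) - min(q_values.values())


class ImportanceAdvisor:
    """Threshold-based advisor with a finite advice budget (a sketch)."""

    def __init__(self, budget, threshold):
        self.budget = budget
        self.threshold = threshold

    def advise(self, q_values):
        """Return the greedy action as advice, or None to stay silent."""
        if self.budget <= 0 or state_importance(q_values) < self.threshold:
            return None
        self.budget -= 1
        return max(q_values, key=q_values.get)


# Hypothetical action values of the teacher in three visited states.
advisor = ImportanceAdvisor(budget=1, threshold=200)
first = advisor.advise({"north": 150.0, "south": 100.0})   # gap too small
second = advisor.advise({"north": 500.0, "south": 100.0})  # important state
third = advisor.advise({"north": 900.0, "south": 100.0})   # budget exhausted
```

The third call shows the weakness of a fixed threshold under a budget: the most important state arrives after the advice has already been spent.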
Based on this dichotomy, we propose two versions of Q-Teaching (see Algorithm 1): the off-student's policy Q-Teaching and the on-student's policy Q-Teaching. The on-student's policy Q-Teaching uses the value of the actual student's action to compute the reward (thus it is directly influenced by the student's policy). We can intuitively say that on-student's policy Q-Teaching will advise when the student is mostly expected to act sub-optimally with respect to the acting value function of the teacher, $Q_\Sigma$. On the other hand, the off-student's policy Q-Teaching uses the criterion discussed above and the teaching policy is not directly influenced by the policy of the student. Specifically, it rewards its teaching policy, $\pi_T$, at time step t + 1 with the q-value difference between the best action a* and the worst action, as these were found at time t.
The Q-Teaching algorithm proceeds as follows (see Algorithm 1). A teacher agent enters an RL acting task to learn an acting policy. It initializes two action-value functions, $Q_\Sigma$ and $Q_T$, the acting value function and the teaching value function respectively (lines 1-2). Of course, it can also use an existing acting value function.
Being in time step t and state s, the teacher queries its acting value function for the greedy action in that state (line 6). Depending on whether we use the off-student's policy or the on-student's policy Q-Teaching, the teacher sets a baseline action, $\hat{a}$, to either the worst possible action for that state or to the action just executed by the student (lines 8-12).
Then, the teacher chooses an action from $A_T$ = {advice, not advice} based on $Q_T$ and its exploration strategy. If the teacher chooses to advise (line 13), it gives the action a* as advice to the student agent. If the teacher chooses not to advise, the student will proceed with its own policy.
In line 19 the teacher observes the student's actual action a and its new state and reward, $s'$ and r. Once again, the student may be the teacher itself; in this case, it observes its own action, which was taken based on $Q_\Sigma$ and its exploration strategy.
In line 20, the first Q-Learning update takes place for the acting value function $Q_\Sigma$ based on the environment's reward. For the teaching value function update, the teacher's reward, $r_T$, is calculated first, based on the freshly updated values of the best and baseline actions, a* and $\hat{a}$ respectively (lines 21-25).
Finally, a Q-Learning update for the teaching value function takes place based on the reward $r_T$, and the algorithm continues in the same way until whichever of the following two events comes first: either the advice budget finishes or the student reaches a learning episode which we have predetermined as its convergence horizon. These complete one learning episode or session for the teacher.
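The update steps described above can be sketched as a small tabular Python class. This is a simplified illustration under stated assumptions, not the exact algorithm listing: the class and method names are hypothetical, both value functions are plain tables, the teaching state is an arbitrary hashable value (here a pair of acting state and remaining budget), and falling back to the worst action when no action is announced is an assumption of the sketch.

```python
import random
from collections import defaultdict

ADVISE, NO_ADVICE = "advise", "no_advice"


class QTeacher:
    """Tabular sketch of a teacher with acting and teaching value tables."""

    def __init__(self, actions, budget, alpha=0.1, gamma=0.9, epsilon=0.1,
                 off_student=True, seed=0):
        self.actions = list(actions)
        self.budget = budget
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.off_student = off_student
        self.q_acting = defaultdict(float)    # Q_Sigma[(s, a)]
        self.q_teaching = defaultdict(float)  # Q_T[(s_T, a_T)]
        self.rng = random.Random(seed)

    def greedy_action(self, s):
        return max(self.actions, key=lambda a: self.q_acting[(s, a)])

    def baseline_action(self, s, announced=None):
        # Off-student: worst action. On-student: the announced action.
        # Using the worst action when nothing is announced is an assumption.
        if self.off_student or announced is None:
            return min(self.actions, key=lambda a: self.q_acting[(s, a)])
        return announced

    def choose(self, s_t):
        """Epsilon-greedy choice between advising and staying silent."""
        if self.budget <= 0:
            return NO_ADVICE
        if self.rng.random() < self.epsilon:
            return self.rng.choice([ADVISE, NO_ADVICE])
        return max([ADVISE, NO_ADVICE],
                   key=lambda a: self.q_teaching[(s_t, a)])

    def update(self, s, s_t, a_t, best, baseline, taken, reward,
               s_next, s_t_next):
        # Q-Learning update of the acting values from the environment reward.
        target = reward + self.gamma * max(
            self.q_acting[(s_next, a)] for a in self.actions)
        self.q_acting[(s, taken)] += self.alpha * (
            target - self.q_acting[(s, taken)])
        # Teaching reward: value gap of best vs. baseline; zero if no advice.
        r_t = 0.0
        if a_t == ADVISE:
            r_t = self.q_acting[(s, best)] - self.q_acting[(s, baseline)]
            self.budget -= 1
        # Q-Learning update of the teaching values from the teaching reward.
        t_target = r_t + self.gamma * max(
            self.q_teaching[(s_t_next, a)] for a in (ADVISE, NO_ADVICE))
        self.q_teaching[(s_t, a_t)] += self.alpha * (
            t_target - self.q_teaching[(s_t, a_t)])
        return r_t


teacher = QTeacher(actions=["left", "right"], budget=2)
teacher.q_acting[("s0", "left")] = 1.0
teacher.q_acting[("s0", "right")] = 5.0
best = teacher.greedy_action("s0")
baseline = teacher.baseline_action("s0")
r_teach = teacher.update("s0", ("s0", 2), ADVISE, best, baseline,
                         taken="right", reward=0.0,
                         s_next="s1", s_t_next=("s1", 1))
```

In the usage example the acting value of the advised action is first updated from 5.0 to 4.5, so the teaching reward is computed from the freshly updated values, as in the description above.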
In this version, the Q-Teaching algorithm is based on the Q-Learning algorithm, although in principle any RL algorithm could be used for the underlying learning updates of Q-Teaching. However, if an off-policy RL algorithm such as Q-Learning is chosen for the updates of both the acting and the teaching value function, then the point of transition from acting to teaching is irrelevant to the learning progress of the two policies. Reducing the impact of the exploration policy on the learning updates allows for smoother interaction between the two policies and ensures that we continue to learn the same policies. In principle, a Q-Teaching agent is able to update both its acting and teaching value functions continually and refine not only when it should advise but also what it should advise.
Since our goal is to introduce Q-Teaching as a method flexible and generic enough to be applied to multiple domains, we propose a series of state features for the teaching policy that we think are necessary. In our experiments, Q-Teaching works best with an augmented version of the acting task state space (see Table 3), similar to that of [17] (Zimmer's method). Also in Table 3, note the role of the student's progress feature ($f_3$): it homogenises the student's Markov chain by inducing a state feature for time (see Section 3.2).

Experiments and Results
In this section, we present results from using Q-Teaching in the Pac-Man Domain. We evaluate both on-student's policy Q-Teaching and off-student's policy Q-Teaching, in two variations each: known or unknown student's intended action. Note that methods like Zimmer's and Mistake Correcting require knowledge of the student's intended action.
We use two versions of students for the experiments: a low-asymptote and a high-asymptote Sarsa student. Referring to [13] and Section 2.3, the low-asymptote students receive a state vector of 16 primitive features related to the current game state, while the high-asymptote students receive a state vector of 7 highly engineered features providing more information. The low-asymptote students have significantly worse performance than the high-asymptote ones.
Additionally, we choose to bootstrap all compared teaching methods with the same acting policy in order to compare only their advice distribution performance and not the quality of their advice. The acting policy used for producing advice comes from a high-asymptote Q-Teaching agent after 1000 episodes of learning. Moreover, we use Sarsa students in order to emphasize the ability to advise students that are different from the teacher. All learning methods (Zimmer's and Q-Teaching) were trained for 500 teaching episodes (sessions) so that their learning efficiency can also be compared on equal terms.
The evaluation was based on the student performance (game score) and using the Total Reward TL metric [11] divided by the fixed number of training episodes. The student performance is evaluated every 10 advising episodes (learning) for 30 episodes of acting alone (and not learning). For the comparisons between average score performances we used pairwise t-tests with Bonferroni correction. Statistically significant results are denoted with their significance level and they always refer to paired comparisons.
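As a rough illustration of this evaluation protocol, the Python sketch below computes the total reward divided by the number of training episodes and applies a Bonferroni correction to a set of raw p-values. The raw p-values are assumed to come from paired t-tests computed elsewhere; the function names, method labels and all numbers are hypothetical.

```python
def average_total_reward(episode_scores):
    """Total reward over training divided by the number of episodes."""
    return sum(episode_scores) / len(episode_scores)


def bonferroni_adjust(p_values, alpha=0.05):
    """Multiply each raw p-value by the number of comparisons (capped at 1)."""
    m = len(p_values)
    adjusted = {pair: min(1.0, p * m) for pair, p in p_values.items()}
    return {pair: (p, p < alpha) for pair, p in adjusted.items()}


# Hypothetical student scores and raw p-values from paired t-tests.
scores = [1200, 1500, 1800, 2100]
raw_p = {("A", "B"): 0.01, ("A", "C"): 0.03, ("B", "C"): 0.2}

metric = average_total_reward(scores)
decisions = bonferroni_adjust(raw_p)
```

With three pairwise comparisons, a raw p-value of 0.03 is adjusted to 0.09 and no longer counts as significant at the 0.05 level, which is the conservative behaviour the correction is meant to provide.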
In Figure 4a, teacher agents advise a low-asymptote Sarsa student who always announces its intended action. We can see that Zimmer's method performs best and off-student's policy Q-Teaching comes second, with a statistically significant difference (p < 0.05). The heuristic-based method Mistake Correcting, with a tuned threshold value of t = 100, comes third. On-student's policy Q-Teaching performed worse than the previous three methods by a small margin, not having found as good an advice distribution policy (non-significant difference to Mistake Correcting). Finally, all methods performed statistically significantly better (p < 0.05) than not advising, effectively speeding up the learning progress of the student. In Figure 4b, the teachers advise a high-asymptote Sarsa student. Here, the tuned version of Mistake Correcting (t = 200) performed statistically significantly better (p < 0.05) than all other methods, with the Q-Teaching methods coming second and third and Zimmer's method coming next (with non-significant differences between them).
For the case when the teacher agent is not aware of the student's intended action, in Figure 4a the off-student's policy Q-Teaching performs best, while Importance Advising (t = 200) follows with a small performance difference (n.s.). Early Advising (giving all B advice in the first B steps) performs statistically significantly worse (at p < 0.05) than both Q-Teaching and Importance Advising. In these experiments, we did not use on-student's policy Q-Teaching since it requires knowing the student's intended action to compute the reward.
In Figure 4b, advising a high-asymptote Sarsa student, Q-Teaching had the second best performance, with the heuristic-based method Importance Advising (t = 200) performing better (non-significant). For high-performing students a poorly distributed advice budget can be much less effective. For example, if the teacher knows the student's intended action, it does not spend advice in states where the student would choose the correct action anyway. This fact is emphasized in this specific case, since no advising did not perform significantly worse than the rest of the methods.
Finally, in Table 5 we can see the average total reward in 1000 training episodes for all the teaching methods. All methods knowing the student's intention performed better than those not, taking advantage of that knowledge.
It is important to note that Q-Teaching, the only learning AuB method allowing students to not announce their intended action, performed relatively well compared to methods that know the student's intended action, which is an advantage of the proposed method.
Another advantage is that off-student's policy Q-Teaching can use the same teaching policy for very different students, since it is not directly influenced by the student's policy or by the rewards received by the student when not advising (as is the case in Zimmer's method). This is a significant advantage in terms of learning speed and versatility, since heuristic methods have to be manually tuned for each student separately to find the optimum threshold, t.
Moreover, while Zimmer's method and Q-Teaching were both trained for 500 episodes (sessions), Q-Teaching training completed significantly faster, since Zimmer's method has to observe all 1000 episodes of each student session to complete just one of its own, whereas Q-Teaching has an upper bound for its episode completion. This upper bound is the algorithmic convergence of the student (e.g., the low-asymptote student requires only 500 episodes to converge), and in most cases an episode will complete much faster, when the budget finishes (around the 30th episode for the low-asymptote student). More specifically, in Table 5 we can see the average training time needed for each teacher in terms of the average observed student episodes in each of the 500 teacher episodes. In general, our proposed methods need at least 25 times less training time than Zimmer's method. We should also note here that although non-learning methods do not need training time, they require a significant and variable amount of manual parameter tuning to achieve the reported performance.
On-student's policy Q-Teaching did not perform as well as expected, the main problem being the non-stationary reward, which depends on the student's changing policy. We believe that this method needs significantly more training time than the off-student's policy Q-Teaching because of its non-stationary reward, and it probably needs more informative features for the student's current status. In our case, this was only its training episode, which is the most basic information available for the student. Moreover, the training episode feature is student-dependent since its meaning varies among students: some students learn faster than others.

Figure 4: Average student score in 1000 training episodes with teachers knowing the student's intended actions. The curves are averaged over 30 trials and the legend is ordered by score. The error bars represent the 95% confidence intervals (CI) of the means. Non-overlapping CIs indicate statistically significant differences of the means whereas overlapping CIs are inconclusive. (a) Low-asymptote Sarsa student.

RELATED WORK
There are several types of related work in the area of helping to learn. Some of this work focuses on teaching in non-RL settings [1], [8].
In the field of transfer learning in RL [12], an agent uses knowledge from a source task to aid its learning in a target task. However, agents there transfer knowledge from one task to another, and in an off-line manner. Other differences between this typical TL setting and Agent Advising are described in Section 2.2 of this article.
More closely related work has one RL agent teach another without a direct knowledge transfer. Examples of such works include imitation learning [4] and apprentice learning [2]. In these approaches an expert provides demonstrations of the task to a student, and the student then has to extract a policy by either learning directly from them or building a model to generate mental experience. In our setting, the teacher does not provide a full-policy trajectory and has a limitation on the number of interventions (advice budget). Moreover, we do not require a student with special processing abilities except that of being able to receive advice.
In [13] a non-learning teaching framework for RL tasks is proposed based on action advice. The methods presented there are described in more detail in Section 2.2. One drawback of these methods is that, since they are based solely on the teacher's q-values, they are not able to handle non-stationarity in the student's learning task, and they also have to be given a threshold of q-value differences above which a state is considered important. This parameter needs to be manually tuned for each student, in contrast to off-student's policy Q-Teaching, which can learn a more generic teaching policy focusing on the criticalities of the state space.
Also, since the methods presented in [13] are heuristic-based and not based on adaptive learning, the agent may spend all of its advising budget on early learning steps of the student that satisfy the importance threshold, while it may later experience even more important states that further exceed the given threshold.
The only other learning method for advising is introduced in [17] (Zimmer's method). The method proposed there is described in more detail in Section 2.2. One significant difference is that the method is based on the same reward received by the student, needing ad-hoc modifications for each task to encourage the teacher towards a better advising policy. Our method uses a domain-independent reward signal based on the acting task q-values and can be directly used in any task. Moreover, their method has greater data complexity, since a complete batch of student training episodes is required for just one training episode of the teacher. As discussed in the previous section, our method may finish one teaching episode as soon as the budget finishes, completing an episode many times faster. Finally, and most importantly, Q-Teaching can be used in the more realistic setting where there is no knowledge of the student's intended action.
Concerning the model selection criteria proposed in Section 4.1 for the teacher's acting policy, to the best of the authors' knowledge there is no other work in the relevant literature examining these criteria and, furthermore, proposing performance variance, and specifically CV, as an important one. Most relevant works choose models based on their average performance, which, as discussed previously, is not enough to evaluate the teaching effectiveness of a policy sampled infrequently and in parts.

CONCLUSIONS AND FUTURE WORK
In this article, we discussed and proposed criteria, considerations and methods for the problem of learning teaching policies to produce and distribute advice.
Concerning advice production, we identify a model selection problem for the teacher: selecting the appropriate acting policy from which to advise. The experiments showed the significant relation of CV to teaching performance, promoting CV as an important criterion, among others tested, for selecting acting policies for advising. Moreover, average-reward RL was found to produce effective policies for sparse advising under budget, although these policies may under-perform when used as acting ones.
Concerning advice distribution (i.e., teaching policy), we proposed a novel representation of the learning to teach problem as a constrained exploitation reinforcement learning problem. Based on this representation we proposed a novel RL algorithm for learning a teaching policy, Q-Teaching, able to advise even without knowledge of the student's intended action. Q-Teaching was found to perform at least as well as the other compared methods while needing significantly less training time.
Advice distribution under budget is a challenging problem, both theoretically and practically, posing a series of problems such as the non-stationarity of the teaching task, as a result of having a learning student as part of the environment. Efficient and principled handling of the budget constraint is another challenge.
From our experiments, Q-Teaching can be considered a promising method based on a more formal understanding of the problem. It is significantly more efficient in terms of data complexity than Zimmer's method, and it can learn teaching policies without assuming knowledge of the student's intended action.
There are several future directions. Q-Teaching could be adapted to student agents with specific "disabilities" and could also be tested under different budget constraints to examine how the budget affects its teaching policies. Also, off-student's policy Q-Teaching could be tested in multi-student scenarios, since not fitting to a particular student could prove effective when teaching multiple different students. Moreover, the theoretical properties of the algorithms should be studied, especially the case of learning a teaching and an acting policy at the same time, e.g., under which specific assumptions a teaching policy converges.
The general usefulness of CV as a criterion for selecting teachers should be studied; specifically, how teacher selection criteria such as CV capture the robustness of a policy when that policy is used sparingly for advising.
Finally, other teaching architectures and representations should be studied, allowing, for example, a teacher to use only one value function for both advising under a budget and acting. Such a hybrid agent transitions smoothly from its actor role to the teacher's one. A unified architecture and knowledge representation would further reveal the deep connection between acting and teaching, one we strongly believe exists.