Nondominated Policy-Guided Learning in Multi-Objective Reinforcement Learning

Control intelligence is a typical field with trade-offs between target objectives, and researchers in this field have long sought artificial intelligence that achieves those objectives. Multi-objective deep reinforcement learning has emerged to satisfy this need. In particular, multi-objective deep reinforcement learning methods based on policy optimization are leading the optimization of control intelligence. However, multi-objective reinforcement learning has difficulty finding the various Pareto-optimal solutions of a multi-objective problem due to the greedy nature of reinforcement learning. We propose a method of policy assimilation to solve this problem. The method was applied to MO-V-MPO, a preference-based multi-objective reinforcement learning algorithm, to increase diversity. Its performance has been verified through experiments in a continuous control environment.


Introduction
An intelligent system refers to a type of system that, imitating the human capability to solve complex problems by means of information processing, learns to adapt itself to indeterminate and uncertain environments. The main features of an intelligent system include uncertainty management, self-adaptation and self-training, and a proclivity for optimization. The technique that exploits intelligent systems to prompt machines to produce results optimized for the given environment is called intelligent control. Reinforcement learning has been prevailing in the field of intelligent control [1,2]. While conventional methods require experts to search for optimized models, reinforcement learning resolves this issue by designing an effective reward function [3][4][5]. Nonetheless, in spite of its excellence at finding local optima, it suffers from the difficulty of searching for multi-objective solutions that balance trade-offs, for three reasons.
First, designing a reward function for problems with trade-offs is highly complicated [6][7][8]. In general, the reward function of reinforcement learning is designed by experts with domain knowledge. Even an expert with excellent knowledge of the field finds it nearly impossible to design a reward function [9][10][11] that adapts to the growing complexity as the number of objectives increases without spending considerable resources and time. This calls for a multi-objective reinforcement learning algorithm that can solve the multi-objective problem.
Second, the greedy characteristics of reinforcement learning make it difficult to find diverse solutions [12][13][14]. Reinforcement learning agents generally keep learning in whatever way makes rewards easier to obtain, so in the case of multiple objectives, they become biased toward the specific objectives that tend to return better rewards. As a result, even if the reward function weights the objectives evenly, the solutions become aligned to a specific objective rather than spread evenly over the Pareto front.
Third, reinforcement learning suffers from low stability [15][16][17]. Reinforcement learning collects data by interacting with the environment instead of relying on pre-built data. Therefore, even if learning is performed with the same reward function and parameters, the resulting agents differ in performance. Consequently, even with appropriate parameters and reward functions, if agents are trained concurrently to find multi-objective Pareto solutions, some agents may lag behind their expected performance. This makes it difficult to establish a smooth Pareto front.
We propose a method of policy assimilation to solve the addressed problems. This method divides the learning into four sections, evaluates the performance of each agent at the end of every section, and replaces the policies of dominated agents with those of non-dominated agents. Each dominated agent finds the closest non-dominated agent with a lower objective preference and copies its policy, securing both diversity and stable performance in the multi-objective problem. To validate its performance, our method was applied to MO-V-MPO, the latest multi-objective reinforcement learning algorithm proposed by Google DeepMind, which currently exhibits excellent performance [18]. We solved the previously addressed problems by combining the proposed method and MO-V-MPO.

Reinforcement Learning
Reinforcement learning is one sort of machine learning whose main objective is to find the optimal policy for the agent [15,17,19]. Reinforcement learning is the only technique among machine-learning methodologies that collects and learns data by interacting with the environment both directly and indirectly. It is thus suited to uncertain environments where securing a sufficient amount of data is not guaranteed, which is a serious obstacle for algorithms that require a pre-built dataset. Reinforcement learning is formulated as a Markov decision process (MDP), which is crucial to its ability to adapt. The MDP consists of four core elements: S, A, P, and R [16,20,21]. At each timestep t, an agent observes a state s_t ∈ S and takes an action a_t ∈ A, which determines the reward r_t ∼ R(s_t, a_t) and the subsequent state s_{t+1} ∼ P(s_t, a_t). Unlike supervised learning algorithms, an MDP-defined reinforcement learning algorithm is not explicitly corrected for undesirable behaviors; its behavioral tendency is only indirectly influenced by its reward function.
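To make the S, A, P, R notation concrete, here is a minimal, self-contained sketch of an agent stepping through a toy MDP; the states, actions, transition probabilities, and rewards are invented purely for illustration:

```python
import random

# A tiny, hypothetical MDP with two states and two actions, defined by
# the four elements (S, A, P, R) described above.
S = ["s0", "s1"]
A = ["a0", "a1"]
P = {("s0", "a0"): {"s0": 0.8, "s1": 0.2},      # P(s' | s, a)
     ("s0", "a1"): {"s0": 0.1, "s1": 0.9},
     ("s1", "a0"): {"s0": 0.5, "s1": 0.5},
     ("s1", "a1"): {"s0": 0.3, "s1": 0.7}}
R = {("s0", "a0"): 0.0, ("s0", "a1"): 1.0,       # R(s, a)
     ("s1", "a0"): 2.0, ("s1", "a1"): 0.5}

state = "s0"
for t in range(10):
    action = random.choice(A)                                    # a_t (random policy for illustration)
    reward = R[(state, action)]                                  # r_t ~ R(s_t, a_t)
    next_states, probs = zip(*P[(state, action)].items())
    state = random.choices(next_states, weights=probs, k=1)[0]   # s_{t+1} ~ P(s_t, a_t)
```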
Reinforcement learning performs training via the two-fold process of exploration and exploitation [22,23]. During exploration, in the early stage of training, the agent attempts actions randomly. As mentioned earlier, since the data must be collected by the algorithm itself, which actions produce high expected rewards is unknown at the beginning. The random actions taken by the agent therefore provide diversity for the policy convergence, helping to discover the global optimum. Exploitation, on the other hand, induces the agent to choose the actions with the highest expected rewards. It is mostly conducted in the later stage, once sufficient data have been collected, to help the agent converge to the optimal policy. Reinforcement learning, overall, is a proper combination of exploration and exploitation [15,24].
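As one common way to balance the two phases, independent of the algorithms discussed later in this paper, an ε-greedy rule explores with probability ε and exploits otherwise, decaying ε as data accumulate; the values below are placeholders:

```python
import numpy as np

def epsilon_greedy(q_values: np.ndarray, epsilon: float, rng: np.random.Generator) -> int:
    """Explore with probability epsilon, otherwise exploit the best-known action."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # exploration: random action
    return int(np.argmax(q_values))               # exploitation: greedy action

rng = np.random.default_rng(0)
q_values = np.array([0.1, 0.5, 0.2])              # placeholder action-value estimates
for step in range(1000):
    epsilon = max(0.05, 1.0 - step / 500)         # decay from mostly exploration to mostly exploitation
    action = epsilon_greedy(q_values, epsilon, rng)
```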

Multi-Objective Reinforcement Learning (MORL)
MORL is a technique to handle multi-objective (MO) problems using reinforcement learning (RL). Since many real-world problems have multiple goals to achieve and constraints to maintain, it is useful to regard them as MO optimization problems, but in the field of RL, simple reward scalarization (RS) is dominant. MORL is either single-policy or multiple-policy based. The single-policy version finds the optimal policy under a given RS, in general by scalarizing the rewards in the form of a weighted sum. The multiple-policy version, on the other hand, approximates the Pareto front to find diverse policy sets. While a simple RS technique tends to be useful in single-policy MORL, it is often difficult to find policies diverse enough to sufficiently approximate the Pareto front with it. Carefully adjusting each reward's weight does not contribute significantly to finding varied policies. However, the recently published studies on multi-objective MPO (MO-MPO) and MO-V-MPO show that these techniques overcome the shortcomings of RS and can thus be trained with diverse policies [18].
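For concreteness, the weighted-sum scalarization mentioned above can be written as follows (notation ours, with w_k denoting the weight of the k-th reward component):

```latex
r_t \;=\; \sum_{k=1}^{K} w_k \, r_t^{(k)}, \qquad w_k \ge 0, \quad \sum_{k=1}^{K} w_k = 1
```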

Maximum a Posteriori Policy Optimization (MPO) and V-MPO
MPO [25] is one of the recently released off-policy RL algorithms of the "RL as inference" family. The purpose of general RL algorithms is to find which action to choose to maximize future rewards, whereas MPO changes the perspective and estimates which action is likely to be selected when the future reward is maximized. MPO has the advantage of using expectation-maximization (EM) to increase sample efficiency and lower the variance of expected returns. The E-step estimates the distribution of the action, and the M-step fits the policy.
V-MPO [26] is the on-policy variant of MPO, which was originally an off-policy RL algorithm. It uses a state-value function instead of the action-value function that MPO uses. Due to the low variance of the expected return, it shows relatively stable performance compared to preceding algorithms such as IMPALA [27].
Equation (1) shows the policy loss of V-MPO. It is similar to the conventional policy loss of policy-gradient algorithms such as PPO and IMPALA, but there are two differences. First, it uses only the top 50% of the dataset (D) in terms of the advantage. Second, it augments the advantage (A^target) by a temperature η: η is a trainable parameter, and Equation (2) shows its loss. η augments A^target, and ε_η regulates it so that A^target stays within a certain range. Even when A^target is extremely low, often because of rare rewards, the regulation of η adjusts the value of A^target with respect to ε_η to a degree sufficient to improve the policy. It is comparable to advantage standardization (normalization by mean and SD), mostly used in algorithms such as PPO [28], but the fact that it does not presuppose a certain distribution makes it more flexible.
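The two equations are not reproduced in this extract. For reference, a reconstruction following the original V-MPO paper [26] is given below (our notation; the authors' exact form may differ slightly):

```latex
% Eq. (1): V-MPO policy loss over the top-half of samples \tilde{D}
\mathcal{L}_{\pi}(\theta)
  = -\sum_{(s,a)\in\tilde{D}} \psi(s,a)\,\log \pi_{\theta}(a \mid s),
\qquad
\psi(s,a) = \frac{\exp\!\big(A^{\mathrm{target}}(s,a)/\eta\big)}
                 {\sum_{(s',a')\in\tilde{D}} \exp\!\big(A^{\mathrm{target}}(s',a')/\eta\big)}

% Eq. (2): temperature loss with constraint parameter \epsilon_\eta
\mathcal{L}_{\eta}
  = \eta\,\epsilon_{\eta}
  + \eta \log\!\left[\frac{1}{|\tilde{D}|}\sum_{(s,a)\in\tilde{D}}
      \exp\!\big(A^{\mathrm{target}}(s,a)/\eta\big)\right]
```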

Multi-Objective MPO (MO-MPO) and MO-V-MPO
MO-MPO and MO-V-MPO are the variants of MPO and V-MPO applied to multi-objective problems. (V-)MPO trains a parameter η to augment the Q-value for MPO (the advantage for V-MPO) and uses ε_η to regulate it. If ε_η is increased, the augmented values will be increased, and vice versa. MO-(V-)MPO extends this concept to multiple objectives by letting each objective have its own η and corresponding ε_η. In this setting, each ε_η acts as a preference: we can assign a higher ε_η to an important objective and a lower ε_η to a less important objective or penalty.
In usual MO environments, simple reward scalarization (RS) can only consider the magnitude of the reward; it is difficult for it to consider the distributions or frequencies of Q (action value) and A (advantage). Since it is difficult to account for changes in the reward distributions throughout the learning progress, it is difficult to train diverse policies using RS. On the other hand, MO-(V-)MPO can adjust each η with regard to the training environment, allowing the augmented Q and A to adjust themselves according to each objective's ε_η. Therefore, it can train more diverse policies than RS.
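The following is a minimal numerical sketch of the per-objective temperature mechanism, written by us for illustration rather than taken from the MO-(V-)MPO implementation; the array names, shapes, and values are assumptions:

```python
import numpy as np

def per_objective_weights(advantages: np.ndarray, etas: np.ndarray) -> np.ndarray:
    """advantages: (num_objectives, batch) per-objective advantages A^target_k.
    etas: (num_objectives,) per-objective temperatures eta_k.
    Returns softmax weights per objective: each objective is scaled by its own eta_k."""
    scaled = advantages / etas[:, None]
    scaled -= scaled.max(axis=1, keepdims=True)   # numerical stability
    w = np.exp(scaled)
    return w / w.sum(axis=1, keepdims=True)

def temperature_losses(advantages: np.ndarray, etas: np.ndarray, epsilons: np.ndarray) -> np.ndarray:
    """Per-objective temperature loss: eta_k * eps_k + eta_k * log mean exp(A_k / eta_k).
    A larger eps_k (preference) lets that objective exert more influence."""
    log_mean_exp = np.log(np.mean(np.exp(advantages / etas[:, None]), axis=1))
    return etas * epsilons + etas * log_mean_exp

etas = np.array([1.0, 1.0])
epsilons = np.array([0.1, 0.01])   # higher preference on objective 0
advantages = np.random.default_rng(0).normal(size=(2, 256))
weights = per_objective_weights(advantages, etas)
losses = temperature_losses(advantages, etas, epsilons)
```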

Proposed Method
The main idea of the proposed policy assimilation is to replace the policy of an agent with dominated performance with that of an agent with non-dominated performance. To this end, our method proceeds sequentially in three major steps (Figure 1, Algorithm 1).

The first step is policy evaluation, which is performed on all the agents at every quarter of the total learning time. Based on its result, it is determined whether each agent is dominated.
The second step is policy transfer, which is applied only to dominated agents. If an agent is dominated, it adopts the policy of the nearest non-dominated agent, where the distance is given by the difference in preference between the two agents. For example, assuming that the preference is two dimensional, suppose there is a dominated agent with the preference pair [10, 90] and non-dominated agents with [20, 80], [70, 30], and [50, 50], respectively. The dominated agent will then adopt the policy of the agent possessing the preference [20, 80], which has the smallest difference from [10, 90]. If two non-dominated agents are at the same preference distance (e.g., [5, 95] and [15, 85]), the one whose preference is closer to the center, [50, 50], is chosen. If all the agents are non-dominated, no agent's policy changes. These procedures raise the stability of reinforcement learning performance in our method: even if a particular agent has shown poor results at the beginning of learning, it can partially recover. It is worth mentioning that since MO-V-MPO is a kind of reinforcement learning, it can also be trained in a direction biased toward satisfying certain objectives; however, the rule of moving toward the center when preference distances are identical attenuates this tendency (Algorithm 2).

(Algorithm 2, policy transfer, summarized from the listing: P denotes the non-dominated agents and Q the dominated agents; for each q ∈ Q and each p ∈ P, the distance d ← DISTANCE(q.preference, p.preference) is computed, q.policy is replaced with the policy of the selected non-dominated agent p_o, and the new agent set Π_new ← P ∪ Q is returned.)

Lastly, the retraining step is the process of training again after the policies are transferred. Each agent keeps its existing preference, regardless of whether its policy has been transferred, and continues training. The learning direction implied by the preference is therefore maintained, so each agent keeps learning toward its own objectives, resulting in more stable performance while increasing the agents' diversity.
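The transfer rule above can be sketched as follows. This is an illustrative reading of Algorithm 2 rather than the authors' code; the Agent container, the L1 preference distance, and the tie-break toward [50, 50] are our assumptions based on the description:

```python
from dataclasses import dataclass
from typing import List

@dataclass(eq=False)
class Agent:
    preference: List[float]   # e.g., [10, 90]
    returns: List[float]      # evaluated per-objective returns
    policy: object            # policy parameters (opaque here)

def dominates(a: List[float], b: List[float]) -> bool:
    """a Pareto-dominates b: no worse on every objective, better on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def policy_transfer(agents: List[Agent], center=(50.0, 50.0)) -> None:
    non_dominated = [p for p in agents
                     if not any(dominates(q.returns, p.returns) for q in agents)]
    dominated = [q for q in agents if all(p is not q for p in non_dominated)]
    for q in dominated:
        # Nearest non-dominated agent by preference distance,
        # ties broken toward the center preference [50, 50].
        def key(p: Agent):
            dist = sum(abs(x - y) for x, y in zip(q.preference, p.preference))
            to_center = sum(abs(x - c) for x, c in zip(p.preference, center))
            return (dist, to_center)
        q.policy = min(non_dominated, key=key).policy
```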

Experiment and Results
We used a continuous control environment, Lunar Lander, to evaluate the performance of the proposed method. The main goal of an agent in this environment is to land quickly and safely on the destination point of the moon. If the agent is alive until the end of the episode, it receives 100 points, and if it perishes, it receives −100. When the agent arrives at the destination with a speed of zero, it receives an additional 100 to 140 points, with an extra 10 points given per leg that makes contact with the ground. In addition, a penalty of −0.3 points per frame is given when the center engine runs at 50% of the maximum thrust, and −0.03 points for the left and right engines. The engines are only active when they run above 50% of the maximum thrust, and the penalty points increase exponentially with the engine thrust up to 100%, where they double. The hyperparameters of MO-V-MPO for our experiments are described in Table 1.

The configuration of Lunar Lander promotes the strategy of starting with a slow initial speed and arriving at the destination as safely as possible in order to earn the highest score. However, we added a time penalty to convert it into a multi-objective problem. The time penalty was −0.1 per step, and the maximum number of steps was 300. The agent therefore had to consider energy and time efficiency at the same time. We set the preference to vary from [0, 1] to [1, 0] in steps of 0.02, and we rounded all the resulting rewards up to whole numbers to show the outcome more clearly. Since the environment is a continuous control problem, we grouped proximal values into single values to determine the sectional densities. This processing allowed the result of each agent to be mapped to a specific discrete value to demonstrate how well the agents were distributed (Figure 2).

Figure 2 shows that our method produced more evenly distributed results than the conventional method. In particular, it outperformed the preceding reinforcement-learning-based multi-objective algorithm in securing diversity. This difference is particularly observable in the range of −50 to −40 for the time penalty and −30 to −25 for the energy penalty. As with other reinforcement learning algorithms, training did not proceed well in the sectors where the preferences of the two objectives overlapped, although MO-V-MPO found a few solutions in that case. While solutions biased toward energy and time appeared mostly for both algorithms, ours produced more evenly distributed results. For fairness, we repeated the same experiment 10, 50, and 100 times and compared the numbers of sets that reached a time penalty in the range of −50 to −40 and an energy penalty in the range of −30 to −25. The overall results are in Table 2.
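As a rough illustration of the two-objective setup described above (not the authors' code; the Gymnasium environment id, the wrapper API, and the reward split are assumptions), the scalar Lunar Lander reward can be paired with the −0.1-per-step time penalty as a reward vector:

```python
import gymnasium as gym
import numpy as np

class TwoObjectiveLunarLander(gym.Wrapper):
    """Wraps Lunar Lander so each step returns a reward vector:
    objective 0 = the original (landing/energy) reward,
    objective 1 = a time penalty of -0.1 per step, as in the experiment."""
    def __init__(self, env, time_penalty=-0.1, max_steps=300):
        super().__init__(env)
        self.time_penalty = time_penalty
        self.max_steps = max_steps
        self._t = 0

    def reset(self, **kwargs):
        self._t = 0
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        self._t += 1
        truncated = truncated or self._t >= self.max_steps
        return obs, np.array([reward, self.time_penalty]), terminated, truncated, info

# Example (requires the box2d extra to be installed):
# env = TwoObjectiveLunarLander(gym.make("LunarLander-v2", continuous=True))
```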

Conclusions
In this paper, we proposed a method of policy assimilation and showed that it resolves the problem of learning biased toward particular objectives in multi-objective reinforcement learning and trains appropriately for various objectives. Our method is exceptional in that, unlike most previous reinforcement learning studies that focus on improving overall performance, it concentrates mainly on the diversity of the agents. The performance of the proposed method was validated in a single, limited environment, Lunar Lander, so we acknowledge that the experiment did not consider diverse environments. However, we would also like to emphasize that we could find virtually no other continuous control problem that manifests the contrast between energy and time as well. Furthermore, the environment in which we conducted the experiment can be regarded as an abstract version of the real-world trade-offs encountered in problems such as heating and cooling systems, which are among the most challenging control problems. We therefore anticipate that our study can serve as a cornerstone for future work on real-world problems with complex energy and time trade-offs. Lastly, the diversity provided by our method makes it possible to supply policies adjusted for the many users of reinforcement learning AIs who need to solve trade-off problems, thereby better reflecting their intended objectives.

Figure 1. Graphical representation of the main processes in policy assimilation: (a) policy evaluation, (b) policy transfer, (c) retraining.

Algorithm 1 (policy assimilation) requires the set of agents Π, the maximum training step t_max, and the policy transfer rate α = 0.25.

Figure 2. The black dots form the Pareto front located throughout the experiment, and the size of each blue circle indicates the number of solutions in its proximity (larger circles contain more solutions). (a) The conventional MO-V-MPO distribution; (b) the proposed method's distribution.

Table 2. Comparison of the number of agents that reached the contact point.