In this section, applications of the proposed adaptive discount factor are presented to validate its performance. To test both the on-policy and the off-policy algorithms, a Tetris game agent and motion planning for a robot manipulator are considered.
5.1. Tetris
Tetris has long been used as a challenge problem in the field of artificial intelligence, and there have been efforts to solve it with various machine learning algorithms [27,28,29]. However, Tetris is known to be NP-hard, and it remains a challenging problem even for deep reinforcement learning [28]. Additionally, since Tetris is an environment with inherent uncertainty, it is not easy to estimate the maximum reward sum by approximating the value function [3]. Tetris is a game in which randomly given blocks, called tetrominoes, are stacked at the bottom of the board, and whenever rows are filled without empty spaces, a score proportional to the number of filled rows is awarded. Tetris is essentially a continuing task, but if the player stacks blocks up to the top without clearing rows, the game ends there. Thus, overestimating rewards that lie far in the future can mislead the game agent, because the associated uncertainty is not accounted for. For this reason, the proposed adaptive discount factor is a suitable method for developing an RL-based game agent for Tetris.
Tetris is a discrete-action task: at every step, the agent can move the block left, right, or down, or rotate it. There are two kinds of drop actions: soft and hard drops. A soft drop moves the block down by one space, and a hard drop moves it to the bottom.
In this paper, we redefine the agent’s action space as a compound action, as in [30]. The compound action space consists of left/right moves (9), block rotations (4), and a hard drop (1). The compound actions can speed up learning by eliminating meaningless actions by the agent. The size of the Tetris game screen is fixed in this paper.
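One plausible reading of this compound action space is sketched below, in which each compound action fixes a horizontal placement and a rotation and then performs a hard drop. The 9-element shift range, the helper names (rotate, move_horizontally, hard_drop), and the encoding are illustrative assumptions, not the paper's implementation.

```python
from itertools import product

# Hypothetical enumeration of the compound action space: 9 horizontal
# placements x 4 rotations, each followed by a hard drop.
HORIZONTAL_SHIFTS = range(-4, 5)      # 9 options: up to 4 columns left or right
ROTATIONS = range(4)                  # 4 options: 0, 90, 180, 270 degrees

COMPOUND_ACTIONS = list(product(HORIZONTAL_SHIFTS, ROTATIONS))

def apply_compound_action(env, action_id):
    """Translate one compound action index into primitive Tetris moves."""
    shift, rotation = COMPOUND_ACTIONS[action_id]
    for _ in range(rotation):
        env.rotate()                  # rotate the falling block (placeholder API)
    env.move_horizontally(shift)      # negative: move left, positive: move right
    env.hard_drop()                   # drop the block to the bottom
```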
The state of the MDP is composed of two images of the screen. One is the image of the currently stacked blocks without the currently controllable block, and the other is the image of the controllable block without the currently stacked blocks (see Figure 6).
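For illustration, the sketch below stacks the two screen images into a single two-channel state tensor; the function and argument names, as well as the example screen size, are assumptions made here rather than the paper's code.

```python
import numpy as np

def build_state(board: np.ndarray, piece_mask: np.ndarray) -> np.ndarray:
    """Stack the two screen images into one MDP state.

    board      -- binary image of the stacked blocks, without the falling block
    piece_mask -- binary image containing only the currently controllable block
    Both images are assumed to have the same height and width as the game screen.
    """
    # Channel 0 holds the stack, channel 1 holds the controllable block.
    return np.stack([board, piece_mask], axis=0).astype(np.float32)

# Example with a hypothetical 20 x 10 screen: the state has shape (2, 20, 10).
state = build_state(np.zeros((20, 10)), np.zeros((20, 10)))
```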
The reward is determined mainly based on the number of lines cleared by the agent, and Table 1 defines the reward. In Table 1, one column lists the number of lines cleared by the current action, and r denotes the corresponding reward. The entry in the second column corresponding to the blocks reaching the top represents the case in which the agent loses the game.
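A minimal sketch of such a reward function is given below; the numerical values are placeholders, and the actual values are those defined in Table 1.

```python
def tetris_reward(lines_cleared: int, topped_out: bool) -> float:
    """Illustrative reward in the spirit of Table 1 (placeholder values)."""
    if topped_out:
        return -1.0                # the blocks reached the top: the agent loses
    # The reward grows with the number of lines cleared by the current action.
    return float(lines_cleared)
```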
Figure 6 depicts the structure of the MDP for Tetris.
For efficient learning, the PPO algorithm is implemented with a parallel-agent scheme, as in A3C (Asynchronous Advantage Actor-Critic), and is applied to develop a Tetris game agent with fast performance improvement. The parallel scheme samples data from multiple agents simultaneously and uses them for training. When a specific agent is indexed by n, the data obtained from the nth agent at time step t is the corresponding transition of state, action, and reward. Each agent decides its policy through the same neural network with shared parameters, and all of the networks are synchronized whenever policy learning is performed. Figure 7 describes how the proposed adaptive discount factor is used to devise a PPO-based game agent for Tetris.
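As a rough illustration of where the current discount factor enters this training scheme, the sketch below computes the discounted reward sum of one rollout; the commented outer loop uses placeholder names (collect_rollout, adjust_discount_factor, ppo_update) that stand in for the components of Figure 7 and are not the paper's actual code.

```python
import numpy as np

def discounted_returns(rewards, gamma: float) -> np.ndarray:
    """Discounted reward sums of one rollout, computed with the current gamma."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running   # G_t = r_t + gamma * G_{t+1}
        returns[t] = running
    return returns

# Sketch of the outer loop (placeholder functions, for illustration only):
# gamma = GAMMA_MIN                                  # lower limit of the adaptive factor
# for update in range(num_updates):
#     rollouts = [collect_rollout(agent) for agent in agents]   # parallel sampling
#     gamma = adjust_discount_factor(gamma, rollouts)           # adaptive update
#     returns = [discounted_returns(r.rewards, gamma) for r in rollouts]
#     ppo_update(shared_policy, rollouts, returns)              # synchronize all agents
```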
The performance in Tetris is evaluated by the number of lines cleared by the agent during one episode. The learning performance with the adaptive discount factor is compared with that obtained with fixed discount factors.
Figure 8 shows the training performance obtained with four different fixed discount factors, the proposed adaptive discount factor, and the progressively increasing discount factor. The discount factor of 0.99 is a value commonly used in deep reinforcement learning, and the lowest of the fixed values is the lower limit of the adaptive discount factor. The horizontal axis of the graph indicates the number of episodes learned by the agent, and the vertical axis indicates the number of lines cleared by the agent in each episode. The highest discount factor of 0.99 shows a fast performance improvement at the beginning of learning, but the performance converges to a certain level as the episodes progress. The lowest discount factor yields the lowest performance among the fixed values. One of the intermediate discount factors shows a faster performance improvement than 0.99 at the beginning of training, and it is confirmed that an intermediate discount factor reaches the highest performance, although its improvement is slow at the beginning of training. The performance of the progressively increasing discount factor improves quickly at the beginning, but in the end it converges to a level similar to that of the fixed discount factor of 0.99. The agent trained with the adaptive discount factor performs worse than those trained with the fixed discount factors at the beginning of training, but its final performance exceeds that of the agent using a discount factor of 0.99.
Figure 9 shows the fixed discount factors used in the experiment, the resulting adaptive discount factor, and the increasing discount factor. In the case of the adaptive discount factor, the adjustment is stopped once the stopping criterion is met, and the discount factor remains fixed at that value afterward. At the beginning of the adjustment, the discount factor does not change because the increase condition is not satisfied; once the condition holds, the discount factor is increased, as shown in Figure 9. Note that the adaptive discount factor stops adjusting at a value near the fixed discount factor that yields a high performance in Figure 8.
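The adjustment behaviour observed in Figure 9 can be summarized by the following sketch, in which the increase and stopping criteria of the proposed method are abstracted as boolean flags and the step size is a placeholder; it illustrates the schedule rather than the paper's implementation.

```python
def update_gamma(gamma: float, increase: bool, stop: bool,
                 step: float = 0.01, frozen: bool = False):
    """One adjustment step of the discount factor (illustrative sketch)."""
    if frozen or stop:
        return gamma, True        # adjustment stopped: gamma stays fixed from now on
    if increase:
        gamma = gamma + step      # gamma grows only while the increase criterion holds
    return gamma, frozen          # at the start, gamma is unchanged until the criterion is met
```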
Table 2 compares the final performance of the fixed discount factors and the adaptive discount factor. In Table 2, the scores in bold are the highest maximum or the highest average scores across the discount factors and algorithms, and the discount factors in bold are the values corresponding to the maximum score.
5.2. Motion Planning
This section presents an application of the proposed adaptive discount factor to the path planning of a robot manipulator in order to validate the proposed method for off-policy RL [31,32]. The joint values of the robot arm are expressed in the configuration space, and the current state $s_t$, representing the joint values of the robot arm at time step t, is defined in this space [33,34]. The action $a_t$ is the amount of change in the joint values. Since the agent's action is continuous, the SAC algorithm is employed for motion planning. In general, path planning problems are solved in simulation, which means that there are no uncertainties in the problem; in order to account for real environments, two uncertainties are added to the path planning problem in this paper. First, noise is added to the state evolution; in other words, the next state is determined by $s_{t+1} = s_t + \alpha a_t + w_t$, where $\alpha$ is a constant, $a_t$ is the amount of change in the joint values, and $w_t$ is Gaussian noise. Second, the reward signal is transmitted probabilistically. In general, in the path planning task, the reward signal takes a positive value when the agent arrives at a goal point and a negative value when it collides [35]. In this paper, when the agent arrives at the goal point, one of several reward values is transmitted at random. Therefore, the agent cannot easily approximate the reward sum at the current point. Additionally, since there is noise in the environment, the agent cannot easily decide whether to prioritize collision avoidance or goal arrival. The effect of this uncertainty is more pronounced in a continuing task: in the path planning problem, an episodic task ends when the goal is reached, whereas in a continuing task a new goal is given to the agent [36]. This makes it difficult to predict distant future rewards in the path planning environment, and overestimation has to be prevented. Therefore, the proposed adaptive discount factor can handle this difficulty appropriately.
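The sketch below illustrates one noisy transition of this environment under the stated assumptions; the constant, the noise level, the candidate goal rewards, and the collision and goal tests are placeholders chosen here for illustration, not the paper's simulator.

```python
import numpy as np

rng = np.random.default_rng(0)

def collision(q: np.ndarray) -> bool:
    """Placeholder collision check (the simulator provides the real one)."""
    return False

def reached(q: np.ndarray, goal: np.ndarray, eps: float = 0.05) -> bool:
    """Goal test: the goal is reached when the joint values are close enough."""
    return bool(np.linalg.norm(q - goal) < eps)

def step(q: np.ndarray, a: np.ndarray, goal: np.ndarray,
         alpha: float = 1.0, noise_std: float = 0.01):
    """One noisy transition: next state = q + alpha * a + Gaussian noise."""
    q_next = q + alpha * a + rng.normal(0.0, noise_std, size=q.shape)
    if collision(q_next):
        return q_next, -1.0, True                 # collision: negative reward, episode ends
    if reached(q_next, goal):
        reward = float(rng.choice([0.5, 1.0]))    # reward transmitted probabilistically
        return q_next, reward, False              # continuing task: the next goal follows
    return q_next, 0.0, False
```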
Figure 10 describes the MDP for path planning based on reinforcement learning. In Figure 10, the goal position given to the agent is indicated, and the goal is considered reached when the distance between the current state and the goal position falls below a threshold. Compared with PPO, from the viewpoint of applying the adaptive discount factor, the difference lies in how the advantage function is calculated: the parameters of all the neural networks are needed because SAC computes the advantage function based on Equation (22). Since SAC is an off-policy algorithm, it implements an experience replay memory D, randomly samples data from it during training, and applies the samples to the training of each neural network.
Figure 11 describes how SAC with the adaptive discount factor is applied to the path planning of the robot manipulator.
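The experience replay memory D can be pictured with the following minimal sketch, which stores transitions and samples them uniformly at random; it is a generic illustration rather than the paper's implementation.

```python
import random
from collections import deque

class ReplayMemory:
    """Minimal sketch of the experience replay memory D used by SAC."""

    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)      # oldest transitions are discarded first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int):
        # Uniform random sampling, independent of the current behaviour policy,
        # which is what makes SAC an off-policy algorithm.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```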
To validate the path planning performance of the proposed method, two environments are considered. One is path planning as an episodic task, in which an episode terminates when the agent reaches the goal. The other is a continuing task, in which a new goal is given when the agent arrives at the previous goal, and this is repeated until the agent collides. As explained before, the effect of environmental uncertainty is more critical in the continuing task: in the episodic task the episode ends when the agent reaches the goal, whereas in the continuing task the next goal is given at a random location when the agent reaches the goal point. Therefore, in the continuing task it is not easy to estimate the expected reward sum. In this section, we test the learning performance of both the adaptive and the fixed discount factors. In path planning, performance can be evaluated by the success ratio of the agent's arrival at the goal point, that is, the success ratio of path creation. This success ratio is defined as the fraction of the last 10 episodes during training in which the goal point is reached.
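For clarity, this success ratio can be computed as in the following short sketch, where each entry of the episode history is 1 if the goal was reached and 0 otherwise.

```python
def success_ratio(episode_results, window: int = 10) -> float:
    """Success ratio over the last `window` episodes (1 = goal reached, 0 = failure)."""
    recent = episode_results[-window:]
    return sum(recent) / len(recent)

# Example: success_ratio([1, 0, 1, 1, 1, 0, 1, 1, 1, 1]) == 0.8
```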
Figure 12 shows the learning performance of the fixed discount factors, the adaptive discount factor, and the increasing discount factor in the episodic task. The horizontal axis is the number of episodes, and the vertical axis is the success ratio of path generation. The lowest discount factor results in the lowest success rate, and as the discount factor is increased, the success ratio also increases. When the adaptive discount factor is applied, the performance is somewhat low at the beginning, but as learning progresses, it becomes closest to that of the fixed discount factor of 0.99. The performance of the increasing discount factor method is also similar and reaches a high level. Since the uncertainty is low in the episodic task and the reward sum is easy to predict, the fixed discount factor of 0.99 shows a higher performance.
Figure 13 compares the evolution of the adaptive discount factor with the fixed discount factors in the episodic task. It can be seen that the adaptive discount factor converges close to the highest discount factor of 0.99 as training goes on. A high discount factor is advantageous for learning path planning in episodic tasks, since a reward signal is given and the episode ends when the final goal is reached. Furthermore, since the path planning problem is generally regarded as a sparse-reward problem, a high discount factor is helpful.
Table 3 shows the success ratio of path creation for each discount factor. In Table 3, the success ratio is the average over the last 10 test episodes, so path generation succeeding in all 10 episodes corresponds to the maximum value of 1. The success ratios in bold are the highest values, and the discount factors in bold are the optimal values in the episodic environment.
In the following, the learning performance of the adaptive discount factor, the increasing discount factor, and the fixed discount factors in path planning is compared in a continuing task environment. When the agent reaches the current goal point, a new goal point is given, and this is repeated; in other words, in a continuing task, the episode continues until the agent collides. In this case, the learning performance is evaluated by the number of paths generated by the agent before a collision occurs.
Figure 14 shows the learning performance of the fixed discount factors of 0.5, 0.75, 0.85, and 0.99, the increasing discount factor, and the adaptive discount factor for path planning in the continuing task setting. Three observations can be made. First, among the fixed discount factors, the resulting performance is not monotone; that is, a higher discount factor does not necessarily yield a better performance. Second, except for the discount factor of 0.99, all of the other cases, including the adaptive and increasing discount factors, show similar performance during the transient phase, and the performance of 0.99 is not high even at the end of training. Third, the adaptive discount factor shows the best performance in the end.
Figure 15 shows the evolution of the adaptive discount factor and the fixed discount factors. Note that the fixed discount factor of 0.75 results in the best performance and that the adaptive discount factor converges to a value near 0.75.
Table 4 summarizes the performance for each discount factor. The numbers of created paths in bold are the highest maximum or the highest average numbers of paths across the algorithms and discount factors, and the discount factors in bold correspond to the highest performance.
Hence, it is verified that the proposed adaptive discount factor leads to a good performance even though the environment contains inherent uncertainties and the reward is sparse.
In view of these case studies, it is confirmed that deep RL with the proposed adaptive discount factor achieves a learning performance comparable to that of the best fixed discount factor, which would otherwise have to be determined by exhaustive simulations.
5.3. Analysis and Discussion
In view of the previous case studies, it is confirmed that the adaptive discount factor is indeed effective. In many existing results on reinforcement learning, a large discount factor such as 0.99 is usually used. However, as seen from the case studies, a high discount factor does not always guarantee a high performance. In the previous learning experiments, when discount factors were selected from values between 0.5 and 0.99 and training was performed, there were cases in which a higher performance was obtained with a value lower than 0.99, as shown in Figure 8 and Figure 14. On the other hand, in Figure 12, the discount factor of 0.99 showed the highest performance. The difference between these tasks is the existence of an uncertain risk or an unpredictable negative reward among distant future rewards. In the case of Tetris, the agent might receive negative rewards due to future mistakes, but it is impossible to predict them because the blocks are randomly determined. In the case of path planning as a continuing task, it is impossible to predict the failure or collision of path generation when creating a path to a new, randomly generated goal point. These cases correspond to the example in Figure 3. On the other hand, in the path planning problem posed as an episodic task, it is relatively easy to predict the success or failure of path creation for a fixed goal point. This case matches only Figure 4, which is a sparse-reward environment; therefore, the discount factor of 0.99 showed the highest performance in this task. The adaptive discount factor algorithm closely finds the discount factor that gives the best performance in all cases, which saves the effort of searching for a suitable discount factor. Additionally, as shown in Figure 9, Figure 13, and Figure 15, the adjustment of the discount factor terminates quickly at the beginning of training. Since the adjusted discount factor is then fixed and learning proceeds with the fixed value, the method is not computationally expensive. Another discount factor adjustment algorithm, the progressively increasing discount factor [15], gradually increases the discount factor to 0.99, and it has been suggested that this method can achieve a higher performance than a fixed discount factor. However, the proposed adaptive discount factor outperforms it in environments that require a low discount factor. This also suggests that a somewhat lower discount factor may be appropriate, depending on the environment.