Entropy-Aware Model Initialization for Effective Exploration in Deep Reinforcement Learning

Effective exploration is one of the critical factors affecting performance in deep reinforcement learning. Agents acquire the data to learn the optimal policy through exploration, and if exploration is not guaranteed, the data quality deteriorates, leading to performance degradation. This study investigates the effect of the initial entropy, which significantly influences exploration, especially in the early learning stage. The results of this study on tasks with a discrete action space show that (1) low initial entropy increases the probability of learning failure, (2) the distributions of the initial entropy for various tasks are biased towards low values that inhibit exploration, and (3) the initial entropy for a discrete action space varies with both the initial weights and the task, making it hard to control. We then devise a simple yet powerful learning strategy to deal with these limitations, namely, entropy-aware model initialization. The proposed algorithm provides a model with high initial entropy to a deep reinforcement learning algorithm for effective exploration. Our experiments show that the devised learning strategy significantly reduces learning failures and enhances performance, stability, and learning speed.


Introduction
Reinforcement learning is a commonly used optimization technique for solving sequential decision-making problems [1]. The adoption of deep learning in reinforcement learning (so-called deep reinforcement learning (DRL)) has shown successful performance even with high-dimensional observation and action spaces in fields such as robotic control [2][3][4][5][6], gaming [7][8][9], medical [10,11], and financial [12,13] applications. In such a DRL framework, the exploration-exploitation trade-off is a crucial issue that affects the performance of the DRL algorithm [14]. Through exploitation, the agent tries to maximize the immediate expected reward, whereas exploration is required to maximize the long-term reward during training [15]. In other words, even if exploitation makes the best decision given the current information, the solution obtained by DRL would not be optimal without sufficient exploration. Therefore, several approaches to encourage exploration have been studied. Incorporating an entropy term in the reinforcement learning (RL) optimization problem is a representative approach to encourage exploration. The entropy term in the DRL framework represents the stochasticity of the action selection and is calculated from the output of the policy, i.e., the action selection probabilities. An evenly distributed output yields high entropy; conversely, a biased output yields low entropy. With a biased output, i.e., low entropy, there is a high probability that the agent cannot perform various actions and instead repeats only certain actions, inhibiting exploration. Therefore, various studies encourage high entropy [16][17][18][19][20][21].
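As a concrete illustration of the entropy described above, the following minimal sketch computes the Shannon entropy of an action-selection distribution (the policy's softmax output). The two example distributions are hypothetical, not taken from this paper's experiments.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of an action-selection distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# An evenly distributed output yields high entropy (here ln(4) ≈ 1.386) ...
uniform = [0.25, 0.25, 0.25, 0.25]
# ... whereas a biased output, which makes the agent repeat one action
# with high probability, yields low entropy and inhibits exploration.
biased = [0.97, 0.01, 0.01, 0.01]

print(f"uniform: {entropy(uniform):.3f}")   # 1.386
print(f"biased:  {entropy(biased):.3f}")    # 0.168
```

Entropy is measured in nats (natural logarithm) here, matching the ln(|A|) maxima used later in the paper.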
In [16], a proximal policy optimization (PPO) algorithm was proposed, in which an entropy bonus term was added to ensure sufficient exploration, motivated by [22,23]. A soft actor-critic (SAC) DRL algorithm based on the maximum entropy RL framework was proposed in [17], where the entropy term was incorporated into the objective along with the expected reward to improve exploration by acquiring diverse behaviors. Ref. [21] also adopted the maximum entropy RL framework, as it shows better performance and greater robustness. In addition, the authors in [24] proposed a maximum entropy-regularized multi-goal RL, where the entropy was combined with the multi-goal RL objective to encourage the agent to traverse diverse goal states. In [25], maximum entropy was introduced in a multi-agent RL algorithm to improve training efficiency and guarantee a stronger exploration capability. In addition, a soft policy gradient under the maximum entropy RL framework [26] was devised, and maximum entropy diverse exploration [27] was proposed for learning diverse behaviors. However, these approaches, which consider entropy along with other factors (e.g., reward) in the objective, make handling low entropy at model initialization difficult. In [20], the impact of entropy on policy optimization was extensively studied. The authors observed that a more stochastic policy (i.e., a policy with high entropy) improved the performance of the DRL. The authors in [28] analyzed the effect of experimental factors in the DRL framework, where the offset in the standard deviation of actions was reported as an important factor affecting the performance of the DRL. These studies dealt with continuous control tasks, where the initial entropy can be easily controlled by adjusting the standard deviation. To the best of our knowledge, for discrete control tasks, there is neither research reporting the effect of the initial entropy nor a learning strategy exploiting it.
One of the reasons for this may be the difficulty in controlling the entropy of discrete control tasks. The entropy in a discrete control task is determined by the action selection probability obtained through the rollout procedure, whereas, in a continuous control task, the standard deviation determines the entropy.
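This contrast can be made concrete with a minimal sketch (hypothetical numbers): the entropy of a 1-D Gaussian policy is a closed-form function of the standard deviation alone, 0.5·ln(2πeσ²), so it can be set directly, whereas the entropy of a softmax (discrete) policy depends on logits that are only observed through rollouts.

```python
import math

def gaussian_entropy(sigma):
    """Entropy of a 1-D Gaussian policy: fixed by sigma alone, so the
    initial entropy of a continuous-control policy is easy to control."""
    return 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)

def discrete_entropy(logits):
    """Entropy of a softmax policy: depends on the network outputs, which
    in turn depend on the initial weights and the observations seen during
    the rollout, so it cannot be set in advance."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return -sum((e / z) * math.log(e / z) for e in exps)

print(f"{gaussian_entropy(1.0):.3f}")              # same for every task
print(f"{discrete_entropy([0.0, 0.0, 0.0]):.3f}")  # ln(3) only if logits are flat
```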
To address the abovementioned concerns, we have conducted experimental studies to investigate the effect of initial entropy, focusing on tasks with a discrete action space. Furthermore, based on the experimental observations, we have devised a learning strategy for DRL algorithms, namely entropy-aware model initialization. The contributions of this study can be summarized as follows:
• We reveal a cause of frequent learning failures despite the ease of the tasks. Our investigations show that a model with low initial entropy significantly increases the probability of learning failures, and that the initial entropy is biased towards low values for various tasks. Moreover, we observe that the initial entropy varies depending on the task and the initial weights of the model. These dependencies make it difficult to control the initial entropy of discrete control tasks;
• We devise entropy-aware model initialization, a simple yet powerful learning strategy that exploits the effect of the initial entropy that we have analyzed. The devised learning strategy repeats model initialization and entropy measurement until the initial entropy exceeds an entropy threshold. It can be used with any reinforcement learning algorithm because the proposed strategy just provides a well-initialized model to a DRL algorithm. The experimental results show that entropy-aware model initialization significantly reduces learning failures and improves performance, stability, and learning speed.
In Section 2, we present the results of the experimental study on the effect of the initial entropy on DRL performance with discrete control tasks. In Section 3, we describe the devised learning strategy, and discuss the experimental results in Section 4. Finally, we detail the conclusions in Section 5.

Effect of Initial Entropy in DRL
To investigate the effect of the initial entropy in the DRL framework, we adopted the policy gradient method (PPO [16]) implementation in RLlib [29]. The network architecture was set to be the same as in [16]. To initialize the network, we adopted the Glorot uniform [30], which is the default initializer for Tensorflow [31] and for representative RL frameworks such as RLlib, TF-Agents [32], and OpenAI Baselines [33]. Unless otherwise stated, PPO and the Glorot uniform are the default settings for the analyses. For this experimental study, we considered eight tasks (please refer to Figure 1) with a discrete action space from the OpenAI Gym [34]. Note that the eight tasks (Freeway, Breakout, Pong, Qbert, Enduro, KungFuMaster, Alien, and Boxing) were selected to cover various action space sizes and task difficulties (easy and hard exploration), referring to [35]. Freeway is a game that moves a chicken across a freeway while avoiding oncoming traffic, with an action space size of 3. The Breakout game moves a paddle to hit a moving ball to destroy a brick wall, with an action space size of 4. Like Breakout, Pong, with an action space size of 6, competes against the computer (left paddle) by controlling the right paddle to rally the ball, where the paddles move only vertically. Qbert, also with an action space size of 6, is a game in which the player moves around a cube pyramid, changing the color of the cube tops. Next, Enduro is a racing game with an action space size of 9, aiming to pass an assigned number of cars each day. KungFuMaster, with an action space size of 14, is a game in which the player fights the enemies met on the way to rescue the princess. The games with the largest action space size of 18 are Alien, in which the player destroys aliens' eggs while avoiding the aliens, and Boxing, in which the player is rewarded for defeating the opponent in a boxing ring. As seen in Figure 1 and the descriptions above, the goals and rules of the eight tasks differ.
The agent receives rewards according to the task's rules while achieving the goals; therefore, the reward range differs for each task. For example, the range of rewards that an agent can acquire in Pong is from −21 to 21, whereas in Qbert it can receive from 0 to more than 15,000. Please refer to [36] for detailed explanations (e.g., description, action types, rewards, and observations) of each game. First, we investigated the effect of the initial entropy on performance (i.e., reward). We generated 50 differently initialized models for the experiment and measured the rewards after 3000 training iterations for Freeway, Pong, KungFuMaster, and Boxing, and 5000 training iterations for Breakout, Qbert, Enduro, and Alien. For each iteration, 2048 experiences were collected with 16 workers, and six stochastic gradient descent (SGD) epochs were performed with a learning rate of 2.5 × 10^−4. Figure 2 shows the reward versus the initial entropy. We can see that the lower the initial entropy, the more frequent the learning failures (e.g., a reward of −21 for Pong, 0 for Breakout, and −100 for Boxing). Low initial entropy leads to learning failures by inhibiting exploration. Recall that the entropy represents the stochasticity of the action selection probability, and low entropy means the probability is biased towards a specific action. This causes the agent to perform that specific action at every step of the episode with high probability, and repeating the same action makes exploration difficult. This underlines the importance of exploration, particularly during the earlier training stage. We then investigated the distribution of the initial entropy. For this, we generated 1000 models with different random seeds for each of the eight tasks and measured the initial entropy values.
Note that the maximum value of the initial entropy is determined by the action space size of the task, for example, 1.099, 1.386, 1.792, 2.197, and 2.890 for action space sizes of 3, 4, 6, 9, and 18, respectively, which are shown in parentheses in Figure 3. From Figure 3, we can see that the initial entropy is biased towards low values, even if the maximum initial entropy value is high, owing to the large action space size. The average initial entropy values were 0.114, 0.246, 0.189, 0.342, 0.636, 0.345, 0.694, and 0.273 for Freeway, Breakout, Pong, Qbert, Enduro, KungFuMaster, Alien, and Boxing, respectively. We performed additional experiments to analyze this tendency on a different network initializer. Specifically, Figure 4 presents the results with an orthogonal initialization technique [37] instead of the Glorot uniform. Nevertheless, we can observe similar trends as in Figure 3. Our experimental findings (i.e., the high probability of learning failures for low initial entropy, and the low biased initial entropy) explain why DRL often fails for tasks with discrete action spaces and why the performance drastically varies for each experiment.
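The ln(|A|) maxima quoted above can be checked directly, since a uniform distribution over |A| actions attains the maximum entropy ln(|A|):

```python
import math

# Maximum initial entropy per action space size: attained only by a uniform
# action-selection distribution, H_max = ln(|A|).
max_entropy = {n: math.log(n) for n in (3, 4, 6, 9, 18)}
for n, h in sorted(max_entropy.items()):
    print(f"|A| = {n:2d} -> H_max = {h:.3f}")
```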
Finally, we investigated the factors affecting the initial entropy. Tables 1 and 2 show that both the tasks and the initial weights significantly affect the initial entropy. In Tables 1 and 2, the meaning of seed is a random seed for initializing the neural network. For example, in the first row, Seed 01, of Table 1, the same network, i.e., the same initial weights, are used for measuring the values of Pong and Qbert. The same is true for Alien and Boxing. However, the initial weights of Qbert and Alien differ as the neural network structures differ. Note that the network structure varies according to the size of the action space. For example, for the action space sizes of 6 and 18, the network's output nodes are 6 and 18, respectively. We can see that the initial entropy varies with the task, even with the same initial weights (e.g., Seed 02's Alien and Boxing cases in Table 1). In addition, the initial entropy differs according to the initial weights of the model, even with the same task (e.g., Seeds 03 and 04 in the Alien cases in Table 1). This is because the input image, which is the observation, differs significantly for each task. These task and model initialization dependencies of the initial entropy make it difficult to control the initial entropy.
Figure 4. The histograms of the initial entropy for eight tasks. For each task, 1000 models were generated using an orthogonal initializer with different random seeds.
Table 1. Initial entropy of the (Pong, Qbert) pair with action space size 6 and the (Alien, Boxing) pair with action space size 18 under different random seeds, where "STD" denotes the standard deviation of the initial entropy values for 10 different random seeds.
From the above observations, we conclude that DRL algorithms require models with high initial entropy for successful training, and we need a strategy to generate such models.

Entropy-Aware Model Initialization
In the previous section, we observed that (1) learning failure frequently occurs with the model with low initial entropy, (2) the initial entropy is biased towards a low value, and (3) even with the same network architecture, the initial entropy greatly varies based on the task and the initial weight of the models. Inspired by the above experimental observations, we propose an entropy-aware model initialization strategy. The learning strategy repeatedly initializes the model until its initial entropy value exceeds the entropy threshold. In other words, the proposed learning strategy encourages DRL algorithms such as PPO [16] to collect a variety of experiences at the initial stage by providing a model with high initial entropy.
Suppose that a task (E), the number of actors (N), an entropy threshold (h_th), an initializer (K), and a horizon (T) are given. First, we initialize the model (π_i) with K. Then, for each n-th actor, we perform a rollout with the initialized model (π_i) for each time step t ∈ {1, · · · , T}.
Rollout here means that the agent interacts with the environment; through the rollout, the agent obtains transitions (i.e., current state, action, reward, and next state) for training.
During the rollout, we store the action selection probabilities (p^(n,t)_{π_i}) for the entropy calculation. Note that the action selection probability over the set of actions in the action space A (e.g., A = {NOOP, FIRE, UP} in the case of Freeway with the action space size of 3) is the softmax of the outputs of π_i. Then, we compute the entropy of the model (π_i) for each actor and time step as

h^(n,t)_{π_i} = −∑_{a∈A} p^(n,t)_{π_i}(a) log p^(n,t)_{π_i}(a).    (1)

Next, the mean entropy (ĥ_{π_i}) of the total action selection probabilities collected from the N actors over the horizon T is computed, which is defined by

ĥ_{π_i} = (1/(N·T)) ∑_{n=1}^{N} ∑_{t=1}^{T} h^(n,t)_{π_i}.

The mean entropy is compared with the predefined entropy threshold (h_th). If the mean entropy ĥ_{π_i} is larger than h_th, then we terminate the entropy-aware model initialization and output the initialized model (π_init) for the DRL algorithm, such as PPO. Otherwise, we set the random seed to a different value and repeat the initialization process until ĥ_{π_i} exceeds h_th. The entire entropy-aware model initialization process is summarized in Algorithm 1. Through this learning strategy, the DRL algorithm reduces the probability of learning failure and achieves improved performance and faster convergence to a higher reward (refer to Section 4).
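The procedure above can be sketched as follows. Here `toy_init` and `toy_rollout` are hypothetical stand-ins for the initializer K and the actual environment rollout (a real implementation would run rollout workers of the DRL framework), so this is an illustration of the control flow, not the paper's code.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_aware_init(init_model, rollout, h_th, n_actors, horizon, seed=0):
    """Repeat initialization until the mean rollout entropy exceeds h_th."""
    while True:
        pi = init_model(seed)                   # initialize pi_i with K
        probs = [rollout(pi, n, t)              # p^(n,t)_pi from N actors
                 for n in range(n_actors)       # over the horizon T
                 for t in range(horizon)]
        h_mean = sum(entropy(p) for p in probs) / len(probs)
        if h_mean > h_th:                       # compare with the threshold
            return pi, h_mean                   # pi_init for the DRL algorithm
        seed += 1                               # new random seed; retry

# Hypothetical stand-ins: a "model" is just output logits for 6 actions,
# and the rollout ignores the state for simplicity.
def toy_init(seed):
    rng = random.Random(seed)
    return [rng.gauss(0.0, 2.0) for _ in range(6)]

def toy_rollout(pi, n, t):
    return softmax(pi)

pi_init, h_mean = entropy_aware_init(toy_init, toy_rollout,
                                     h_th=0.5, n_actors=4, horizon=8)
print(f"accepted model with mean initial entropy {h_mean:.3f}")
```

Because the accepted model is just a network with favorable initial weights, it can be handed to any DRL algorithm (e.g., PPO) unchanged.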

Experimental Results
In this section, we validate the effectiveness of the proposed learning strategy. For this, we used the experimental settings and tasks described in Section 2. In this experiment, we set the entropy threshold (h th ) to 0.5.
To validate the effect of the proposed entropy-aware model initialization, we considered 50 models initialized by different random seeds for each task. Figure 5 shows the rewards over the training iterations for the eight tasks. In this figure, the red line represents the result for the conventional DRL (without the entropy-aware model initialization), denoted as "Default", and the blue line denotes the result for the proposed entropy-aware model initialization, denoted as "Proposed". We observed that the DRL with the proposed learning strategy outperformed the conventional DRL across all tasks in four aspects. (1) It restrains learning failures: the numbers of learning failures for "Proposed" are 6, 0, 10, 0, 25, 2, 0, and 0, whereas those for "Default" are 25, 15, 35, 9, 29, 28, 4, and 0, for Freeway, Breakout, Pong, Qbert, Enduro, KungFuMaster, Alien, and Boxing, respectively.
(2) It enhances the performance (i.e., the average reward in Table 3). (3) It improves stability, i.e., the variation in performance across runs. (4) It enhances the learning speed, as can be seen from the slope of the graphs in Figure 5. Figure 6 shows 50 individual learning curves for the above experiments. From the figure, we can easily observe that, by applying the proposed method, more learning curves are biased towards high rewards, and fewer learning failures occur compared to the default. Furthermore, we conducted the experiments with the advantage actor-critic (A2C) [23] instead of PPO for thorough analyses. The A2C results corresponding to Figures 5 and 6 and Table 3 of the PPO results are shown in Figures 7 and 8 and Table 4. We can observe the same phenomena and therefore infer that the proposed algorithm can benefit other DRL algorithms. Table 5 shows the overhead of the entropy-aware model initialization in terms of the average number of repetitions and the time for repetitive initialization, which repeats until the initial entropy becomes larger than the entropy threshold. For the 3000 and 5000 training iterations, the average training times were measured as 4792.75 and 8145.01 s, respectively. We can observe that the time overhead of the proposed strategy is negligible compared with the training times. Moreover, the overhead ratio of repetitive initialization in the proposed strategy decreased because the training time increased as the task became more complex. This is mainly because the overhead of the proposed method is primarily affected by the action space size and the initial entropy distribution, not by the complexity of the task. Figure 9 presents the number (solid line) and time taken (dashed line) for repetitive initialization along different entropy thresholds (h_th). The vertical line in the graph corresponds to h_th = 0.5.
From Figure 9, we can observe that the time overhead increases with the entropy threshold; however, the extent of the increase differs for each task, the reasons being that (1) tasks with different action space sizes have different maximum initial entropy values, and (2) different tasks have different initial entropy distributions, as shown in Figure 3 in Section 2. In other words, the maximum initial entropy value determines the maximum value of h_th, and the lower the average initial entropy, the faster the overhead increases. For example, the average initial entropy values of KungFuMaster and Boxing were 0.345 and 0.273, respectively, whereas those of Enduro and Alien were 0.636 and 0.694, respectively. According to Figure 9, the tasks (e.g., KungFuMaster) with a low average initial entropy incurred a large overhead as the threshold increased. Based on the results in Figures 2 and 9, we set the entropy threshold to 0.5, since the primary purpose of this study is to analyze the effect of the initial entropy in DRL and to propose a task-independent solution, that is, entropy-aware model initialization. This value effectively restrains learning failures for tasks with large action space sizes or relatively high initial entropy distributions (e.g., Alien and Boxing) but does not incur much overhead for tasks with small action space sizes or low initial entropy distributions (e.g., Freeway and KungFuMaster).
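The overhead trend can also be reasoned about directly: the expected number of re-initializations is 1/P(ĥ > h_th), so it grows as the threshold rises relative to a task's initial entropy distribution. The following Monte-Carlo sketch illustrates this under a toy assumption (i.i.d. Gaussian initial logits), which is not the paper's actual initialization scheme.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def expected_tries(n_actions, sigma, h_th, trials=20000, seed=0):
    """Estimate 1 / P(initial entropy > h_th) for a toy policy whose initial
    logits are i.i.d. N(0, sigma^2) -- an assumption for illustration only."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        logits = [rng.gauss(0.0, sigma) for _ in range(n_actions)]
        if entropy(softmax(logits)) > h_th:
            hits += 1
    return trials / max(hits, 1)

# A higher threshold (relative to the initial entropy distribution) means
# more repetitions, mirroring the growth of the solid lines in Figure 9.
print(expected_tries(3, 2.0, 0.3), expected_tries(3, 2.0, 0.9))
```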

Conclusions
In this study, we conducted experiments to investigate the effect of the initial entropy in the DRL framework, focusing on tasks with discrete action spaces. The critical observation is that models with low initial entropy lead to frequent learning failures, even on easy tasks, and that the initial entropy values are biased towards low values. Moreover, through experiments on various tasks, we observed that the initial entropy varies significantly depending on the task and the initial model weights. Inspired by these observations, we devised a learning strategy called entropy-aware model initialization, which repeatedly initializes the model and measures its entropy until the initial entropy exceeds a certain threshold. Its purpose is to reduce the learning failures and performance variation of a DRL algorithm and to improve its performance and learning speed by providing a well-initialized model. Furthermore, it is practical because it is easy to implement and can be applied along with various DRL algorithms without modifying them.
We believe this research can benefit various fields, since many applications involve discrete control; examples include drone control [5], recommender systems [38], and medical CT scans [10]. Moreover, Ref. [39] suggested that discretizing continuous control tasks may improve performance.
Proposing a neural network initialization technique for deep reinforcement learning with discrete action spaces may be a good research direction. Although many studies have proposed initialization techniques for effective deep learning, such as the Glorot uniform and orthogonal initializers, there are few studies on initialization techniques for effective deep reinforcement learning. As can be observed in this paper, the network's initial state greatly impacts the algorithm's performance.