# Entropy-Aware Model Initialization for Effective Exploration in Deep Reinforcement Learning

^{1}

^{2}

^{*}

## Abstract

**:**

## 1. Introduction

- We reveal a cause of frequent learning failures despite the ease of the tasks. Our investigations show that the model with low initial entropy significantly increases the probability of learning failures, and that the initial entropy is biased towards a low value for various tasks. Moreover, we observe that the initial entropy varies depending on the task and initial weight of the model. These dependencies make it difficult to control the initial entropy of the discrete control tasks;
- We devise entropy-aware model initialization, a simple yet powerful learning strategy that exploits the effect of the initial entropy that we have analyzed. The devised learning strategy repeats the model initialization and entropy measurements until the initial entropy exceeds an entropy threshold. It can be used with any reinforcement learning algorithm because the proposed strategy just provides a well-initialized model to a DRL algorithm. The experimental results show that entropy-aware model initialization significantly reduces learning failures and improves performance, stability, and learning speed.

## 2. Effect of Initial Entropy in DRL

## 3. Entropy-Aware Model Initialization

Algorithm 1: Entropy-aware model initialization. |

## 4. Experimental Results

## 5. Conclusions

## Author Contributions

## Funding

## Data Availability Statement

## Conflicts of Interest

## References

- Arulkumaran, K.; Deisenroth, M.P.; Brundage, M.; Bharath, A.A. Deep Reinforcement Learning: A Brief Survey. IEEE Signal Process. Mag.
**2017**, 34, 26–38. [Google Scholar] [CrossRef][Green Version] - Yang, Z.; Merrick, K.; Jin, L.; Abbass, H.A. Hierarchical Deep Reinforcement Learning for Continuous Action Control. IEEE Trans. Neural Networks Learn. Syst.
**2018**, 29, 5174–5184. [Google Scholar] [CrossRef] [PubMed] - Haarnoja, T.; Pong, V.; Zhou, A.; Dalal, M.; Abbeel, P.; Levine, S. Composable Deep Reinforcement Learning for Robotic Manipulation. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Brisbane, QLD, Australia, 21–25 May 2018; pp. 6244–6251. [Google Scholar]
- Lathuilière, S.; Massé, B.; Mesejo, P.; Horaud, R. Neural network based reinforcement learning for audio–visual gaze control in human–robot interaction. Pattern Recognit. Lett.
**2019**, 118, 61–71. [Google Scholar] [CrossRef][Green Version] - Jang, S.; Choi, C. Prioritized Environment Configuration for Drone Control with Deep Reinforcement Learning. Hum. Centric Comput. Inf. Sci.
**2022**, 12, 1–16. [Google Scholar] - Zhang, Q.; Ma, X.; Yang, Y.; Li, C.; Yang, J.; Liu, Y.; Liang, B. Learning to Discover Task-Relevant Features for Interpretable Reinforcement Learning. IEEE Robot. Autom. Lett.
**2021**, 6, 6601–6607. [Google Scholar] [CrossRef] - Silver, D.; Schrittwieser, J.; Simonyan, K.; Antonoglou, I.; Huang, A.; Guez, A.; Hubert, T.; Baker, L.; Lai, M.; Bolton, A.; et al. Mastering the Game of Go without Human Knowledge. Nature
**2017**, 550, 354–359. [Google Scholar] [CrossRef] [PubMed] - Patel, D.; Hazan, H.; Saunders, D.J.; Siegelmann, H.T.; Kozma, R. Improved Robustness of Reinforcement Learning Policies upon Conversion to Spiking Neuronal Network Platforms Applied to Atari Breakout Game. Neural Netw.
**2019**, 120, 108–115. [Google Scholar] [CrossRef] [PubMed] - Nicholaus, I.T.; Kang, D.K. Robust experience replay sampling for multi-agent reinforcement learning. Pattern Recognit. Lett.
**2021**, 155, 135–142. [Google Scholar] [CrossRef] - Ghesu, F.C.; Georgescu, B.; Zheng, Y.; Grbic, S.; Maier, A.; Hornegger, J.; Comaniciu, D. Multi-scale Deep Reinforcement Learning for Real-time 3D-landmark Detection in CT Scans. IEEE Trans. Pattern Anal. Mach. Intell.
**2017**, 41, 176–189. [Google Scholar] [CrossRef] [PubMed] - Raghu, A.; Komorowski, M.; Celi, L.A.; Szolovits, P.; Ghassemi, M. Continuous state-space models for optimal sepsis treatment: A deep reinforcement learning approach. In Proceedings of the Machine Learning for Healthcare Conference, Boston, MA, USA, 18–19 August 2017; pp. 147–163. [Google Scholar]
- Zarkias, K.S.; Passalis, N.; Tsantekidis, A.; Tefas, A. Deep Reinforcement Learning for Financial Trading using Price Trailing. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 3067–3071. [Google Scholar]
- Tsantekidis, A.; Passalis, N.; Tefas, A. Diversity-driven Knowledge Distillation for Financial Trading using Deep Reinforcement Learning. Neural Netw.
**2021**, 140, 193–202. [Google Scholar] [CrossRef] - Ishii, S.; Yoshida, W.; Yoshimoto, J. Control of Exploitation–Exploration Meta-parameter in Reinforcement Learning. Neural Netw.
**2002**, 15, 665–687. [Google Scholar] [CrossRef][Green Version] - Sun, S.; Wang, H.; Zhang, H.; Li, M.; Xiang, M.; Luo, C.; Ren, P. Underwater Image Enhancement with Reinforcement Learning. IEEE J. Ocean. Eng.
**2022**, 1–13. [Google Scholar] [CrossRef] - Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal Policy Optimization Algorithms. arXiv
**2017**, arXiv:1707.06347. [Google Scholar] - Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft Actor-critic: Off-policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. In Proceedings of the International Conference on Machine Learning (ICML), Stockholm, Sweden, 10–15 July 2018; pp. 1861–1870. [Google Scholar]
- Seo, Y.; Chen, L.; Shin, J.; Lee, H.; Abbeel, P.; Lee, K. State Entropy Maximization with Random Encoders for Efficient Exploration. In Proceedings of the International Conference on Machine Learning (ICML), Online, 18–24 July 2021; pp. 9443–9454. [Google Scholar]
- Zhang, Y.; Vuong, Q.H.; Song, K.; Gong, X.Y.; Ross, K.W. Efficient Entropy for Policy Gradient with Multidimensional Action Space. arXiv
**2018**, arXiv:1806.00589. [Google Scholar] - Ahmed, Z.; Le Roux, N.; Norouzi, M.; Schuurmans, D. Understanding the Impact of Entropy on Policy Optimization. In Proceedings of the International Conference on Machine Learning (ICML), Long Beach, CA, USA, 9–15 June 2019; pp. 151–160. [Google Scholar]
- Chen, J.; Li, S.E.; Tomizuka, M. Interpretable End-to-End Urban Autonomous Driving with Latent Deep Reinforcement Learning. IEEE Trans. Intell. Transp. Syst.
**2021**, 23, 5068–5078. [Google Scholar] [CrossRef] - Williams, R.J. Simple Statistical Gradient-following Algorithms for Connectionist Reinforcement Learning. Mach. Learn.
**1992**, 8, 229–256. [Google Scholar] [CrossRef][Green Version] - Mnih, V.; Badia, A.P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; Kavukcuoglu, K. Asynchronous Methods for Deep Reinforcement Learning. In Proceedings of the International Conference on Machine Learning (ICML), New York, NY, USA, 19–24 June 2016; pp. 1928–1937. [Google Scholar]
- Zhao, R.; Sun, X.; Tresp, V. Maximum Entropy-regularized Multi-goal Reinforcement Learning. In Proceedings of the International Conference on Machine Learning (ICML), Long Beach, CA, USA, 9–15 June 2019; pp. 7553–7562. [Google Scholar]
- Wang, Z.; Zhang, Y.; Yin, C.; Huang, Z. Multi-agent Deep Reinforcement Learning based on Maximum Entropy. In Proceedings of the IEEE Advanced Information Management, Communicates, Electronic and Automation Control Conference (IMCEC), Chongqing, China, 18–20 June 2021; Volume 4, pp. 1402–1406. [Google Scholar]
- Shi, W.; Song, S.; Wu, C. Soft Policy Gradient Method for Maximum Entropy Deep Reinforcement Learning. arXiv
**2019**, arXiv:1909.03198. [Google Scholar] - Cohen, A.; Yu, L.; Qiao, X.; Tong, X. Maximum Entropy Diverse Exploration: Disentangling Maximum Entropy Reinforcement Learning. arXiv
**2019**, arXiv:1911.00828. [Google Scholar] - Andrychowicz, M.; Raichuk, A.; Stańczyk, P.; Orsini, M.; Girgin, S.; Marinier, R.; Hussenot, L.; Geist, M.; Pietquin, O.; Michalski, M.; et al. What Matters for On-policy Deep Actor-critic Methods? A Large-scale Study. In Proceedings of the International Conference on Learning Representations (ICLR), Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
- Liang, E.; Liaw, R.; Nishihara, R.; Moritz, P.; Fox, R.; Goldberg, K.; Gonzalez, J.; Jordan, M.; Stoica, I. RLlib: Abstractions for Distributed Reinforcement Learning. In Proceedings of the International Conference on Machine Learning (ICML), Stockholm, Sweden, 10–15 July 2018; pp. 3053–3062. [Google Scholar]
- Glorot, X.; Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), Sardinia, Italy, 13–15 May 2010; pp. 249–256. [Google Scholar]
- Abadi, M.; Agarwal, A.; Barham, P.; Brevdo, E.; Chen, Z.; Citro, C.; Corrado, G.S.; Davis, A.; Dean, J.; Devin, M.; et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. 2015. Available online: https://www.tensorflow.org/ (accessed on 20 July 2022).
- Guadarrama, S.; Korattikara, A.; Ramirez, O.; Castro, P.; Holly, E.; Fishman, S.; Wang, K.; Gonina, E.; Wu, N.; Kokiopoulou, E.; et al. TF-Agents: A library for Reinforcement Learning in TensorFlow. 2018. Available online: https://github.com/tensorflow/agents (accessed on 20 July 2022).
- Dhariwal, P.; Hesse, C.; Klimov, O.; Nichol, A.; Plappert, M.; Radford, A.; Schulman, J.; Sidor, S.; Wu, Y.; Zhokhov, P. OpenAI Baselines. 2017. Available online: https://github.com/openai/baselines (accessed on 20 July 2022).
- Brockman, G.; Cheung, V.; Pettersson, L.; Schneider, J.; Schulman, J.; Tang, J.; Zaremba, W. OpenAI Gym. arXiv
**2016**, arXiv:1606.01540. [Google Scholar] - Bellemare, M.; Srinivasan, S.; Ostrovski, G.; Schaul, T.; Saxton, D.; Munos, R. Unifying count-based exploration and intrinsic motivation. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Barcelona, Spain, 5–10 December 2016; Volume 29. [Google Scholar]
- Gym Documentation. 2022. Available online: https://www.gymlibrary.ml/ (accessed on 20 July 2022).
- Saxe, A.M.; McClelland, J.L.; Ganguli, S. Exact Solutions to the Nonlinear Dynamics of Learning in Deep Linear Neural Networks. In Proceedings of the International Conference on Learning Representations (ICLR), Banff, AB, Canada, 14–16 April 2014. [Google Scholar]
- Dulac-Arnold, G.; Evans, R.; van Hasselt, H.; Sunehag, P.; Lillicrap, T.; Hunt, J.; Mann, T.; Weber, T.; Degris, T.; Coppin, B. Deep Reinforcement Learning in Large Discrete Action Spaces. arXiv
**2015**, arXiv:1512.07679. [Google Scholar] - Tang, Y.; Agrawal, S. Discretizing Continuous Action Space for On-Policy Optimization. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), New York, NY, USA, 7–12 February 2020; Volume 34, pp. 5981–5988. [Google Scholar]

**Figure 1.**Example of Atari games (with action space size of 3, 4, 6, 6, 9, 14, 18, 18 for Freeway, Breakout, Pong, Qbert, Enduro, KungFuMaster, Alien, Boxing, respectively) used for the experimental study. (

**a**) Freeway; (

**b**) Breakout; (

**c**) Pong; (

**d**) Qbert; (

**e**) Enduro; (

**f**) KungFuMaster; (

**g**) Alien; (

**h**) Boxing.

**Figure 2.**Reward depending on the initial entropy for 8 tasks, where 50 models for each task were generated to investigate the effect of the initial entropy on the performance.

**Figure 3.**The histograms of the initial entropy for eight tasks. For each task, 1000 models were generated using Glorot uniform initializer with different random seeds.

**Figure 4.**The histograms of the initial entropy for eight tasks. For each task, 1000 models were generated using an orthogonal initializer with different random seeds.

**Figure 5.**Comparison of the entropy-aware model initialization-based PPO (PPO-Proposed) with the conventional PPO (PPO-Default) for eight tasks.

**Figure 6.**Learning curves for 50 individual experiments of (

**a**) the conventional PPO and (

**b**) the proposed entropy-aware model initialization-based PPO for 8 tasks. (

**a**) default; (

**b**) proposed.

**Figure 7.**Comparison of the entropy-aware model initialization-based A2C (A2C-Proposed) with the conventional A2C (A2C-Default) for four tasks.

**Figure 8.**Learning curves for 30 individual experiments of (

**a**) the conventional A2C and (

**b**) the proposed entropy-aware model initialization-based A2C for four tasks. (

**a**) default; (

**b**) proposed.

**Figure 9.**The number (solid line) and time (dashed line) for initialization by the entropy-aware model initialization along the different entropy threshold (${h}_{th}$).

**Table 1.**Initial entropy of (Pong, Qbert) pair with action space size 6 and (Alien, Boxing) pair with action space size 18 under different random seeds, where “STD” denotes the standard deviation of the initial entropy values for 10 different random seeds.

Size of Action Space | ||||
---|---|---|---|---|

6 | 18 | |||

Task | Pong | Qbert | Alien | Boxing |

Seed 01 | $1.48\times {10}^{-3}$ | $4.74\times {10}^{-1}$ | $3.31\times {10}^{-1}$ | $2.55\times {10}^{-3}$ |

Seed 02 | $9.68\times {10}^{-4}$ | $8.61\times {10}^{-1}$ | $8.85\times {10}^{-2}$ | $8.05\times {10}^{-9}$ |

Seed 03 | $9.76\times {10}^{-1}$ | $2.70\times {10}^{-1}$ | $9.13\times {10}^{-4}$ | $1.07\times {10}^{-6}$ |

Seed 04 | $7.20\times {10}^{-4}$ | $2.23\times {10}^{-2}$ | $1.33$ | $2.05\times {10}^{-1}$ |

Seed 05 | $8.04\times {10}^{-1}$ | $1.58\times {10}^{-1}$ | $2.25\times {10}^{-1}$ | $6.35\times {10}^{-1}$ |

Seed 06 | $8.98\times {10}^{-5}$ | $4.68\times {10}^{-1}$ | $2.76\times {10}^{-1}$ | $2.11\times {10}^{-1}$ |

Seed 07 | $5.64\times {10}^{-1}$ | $1.58\times {10}^{-1}$ | $6.05\times {10}^{-1}$ | $4.39\times {10}^{-1}$ |

Seed 08 | $1.18\times {10}^{-1}$ | $3.42\times {10}^{-1}$ | $7.95\times {10}^{-1}$ | $2.88\times {10}^{-2}$ |

Seed 09 | $5.73\times {10}^{-1}$ | $4.34\times {10}^{-1}$ | $7.28\times {10}^{-2}$ | $2.30\times {10}^{-1}$ |

Seed 10 | $1.73\times {10}^{-3}$ | $2.79\times {10}^{-1}$ | $7.91\times {10}^{-1}$ | $3.79\times {10}^{-2}$ |

STD | $3.85\times {10}^{-1}$ | $2.33\times {10}^{-1}$ | $4.22\times {10}^{-1}$ | $2.15\times {10}^{-1}$ |

**Table 2.**Initial entropy of Freeway, Breakout, Enduro, and KungFuMaster under different random seeds, where “STD” denotes the standard deviation of the initial entropy values for 10 different random seeds.

Size of Action Space | ||||
---|---|---|---|---|

3 | 4 | 9 | 14 | |

Task | Freeway | Breakout | Enduro | KungFuMaster |

Seed 01 | $3.78\times {10}^{-1}$ | $2.95\times {10}^{-1}$ | $1.22$ | $5.55\times {10}^{-4}$ |

Seed 02 | $9.93\times {10}^{-14}$ | $5.37\times {10}^{-12}$ | $5.52\times {10}^{-1}$ | $4.30\times {10}^{-8}$ |

Seed 03 | $2.65\times {10}^{-1}$ | $3.38\times {10}^{-1}$ | $1.52$ | $4.31\times {10}^{-2}$ |

Seed 04 | $9.27\times {10}^{-1}$ | $8.27\times {10}^{-5}$ | $2.05\times {10}^{-3}$ | $1.63\times {10}^{-1}$ |

Seed 05 | $9.79\times {10}^{-5}$ | $6.65\times {10}^{-10}$ | $9.87\times {10}^{-1}$ | $3.01\times {10}^{-1}$ |

Seed 06 | $2.18\times {10}^{-1}$ | $9.47\times {10}^{-2}$ | $1.12$ | $6.29\times {10}^{-2}$ |

Seed 07 | $3.75\times {10}^{-2}$ | $7.13\times {10}^{-2}$ | $1.58$ | $1.86\times {10}^{-4}$ |

Seed 08 | $1.23\times {10}^{-1}$ | $7.73\times {10}^{-1}$ | $4.63\times {10}^{-1}$ | $6.79\times {10}^{-8}$ |

Seed 09 | $2.89\times {10}^{-3}$ | $2.18\times {10}^{-2}$ | $8.16\times {10}^{-1}$ | $6.37\times {10}^{-1}$ |

Seed 10 | $7.91\times {10}^{-4}$ | $5.79\times {10}^{-1}$ | $6.18\times {10}^{-1}$ | $8.08\times {10}^{-2}$ |

STD | $2.90\times {10}^{-1}$ | $2.74\times {10}^{-1}$ | $4.96\times {10}^{-1}$ | $2.03\times {10}^{-1}$ |

**Table 3.**Statistical results for the experimentation of the entropy-aware model initialization-based PPO and the conventional PPO.

Task | Method | Avg. Reward | STD of Reward | Min Reward | Max Reward |
---|---|---|---|---|---|

Freeway | Default | 11.067 | 11.369 | 0 | 31.04 |

Proposed | 18.376 | 7.479 | 0 | 31.55 | |

Breakout | Default | 81.847 | 97.855 | 0 | 239.27 |

Proposed | 181.905 | 68.739 | 2 | 348.67 | |

Pong | Default | −11.736 | 16.507 | −21 | 20.82 |

Proposed | 4.119 | 12.319 | −21 | 20.86 | |

Qbert | Default | 9141.865 | 5913.837 | 0 | 14,994.75 |

Proposed | 12,671.130 | 2068.368 | 125 | 15,605.00 | |

Enduro | Default | 74.247 | 97.230 | 0 | 283.69 |

Proposed | 104.804 | 72.493 | 0 | 326.18 | |

KungFuMaster | Default | 6926.000 | 8241.017 | 0 | 23,356.00 |

Proposed | 14,896.011 | 4562.688 | 0 | 34,334.00 | |

Alien | Default | 854.550 | 498.047 | 0 | 1665.00 |

Proposed | 1148.814 | 233.470 | 693.60 | 1665.30 | |

Boxing | Default | −36.100 | 41.182 | −99.94 | 36.55 |

Proposed | 6.113 | 18.284 | −99.88 | 42.10 |

**Table 4.**Statistical results for the experimentation of the entropy-aware model initialization-based A2C and the conventional A2C.

Task | Method | Avg. Reward | STD of Reward | Min Reward | Max Reward |
---|---|---|---|---|---|

Freeway | Default | 29.839 | 4.710 | 18.06 | 33.41 |

Proposed | 31.199 | 3.094 | 19.59 | 33.59 | |

Breakout | Default | 198.892 | 131.255 | 31.00 | 398.53 |

Proposed | 287.870 | 106.686 | 45.36 | 412.72 | |

Enduro | Default | 141.083 | 115.656 | 0 | 328.90 |

Proposed | 285.711 | 79.364 | 78.26 | 432.87 | |

Boxing | Default | 25.184 | 30.662 | −7.51 | 90.07 |

Proposed | 78.129 | 15.385 | 48.09 | 99.39 |

**Table 5.**The average number and time for initialization, and overhead ratio to the total training time by the proposed entropy-aware model initialization.

Task | Average Number of Initialization (#) | Average Time for Initialization (s) | Time Overhead (%) |
---|---|---|---|

Freeway | 9.86 | 119.993 | 4.000 |

Breakout | 5.30 | 72.544 | 1.451 |

Pong | 5.62 | 77.335 | 2.578 |

Qbert | 3.94 | 54.540 | 1.091 |

Enduro | 1.60 | 20.516 | 0.410 |

KungFuMaster | 4.10 | 52.141 | 1.738 |

Alien | 1.84 | 24.536 | 0.491 |

Boxing | 3.86 | 53.115 | 1.771 |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations. |

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Jang, S.; Kim, H.-I. Entropy-Aware Model Initialization for Effective Exploration in Deep Reinforcement Learning. *Sensors* **2022**, *22*, 5845.
https://doi.org/10.3390/s22155845

**AMA Style**

Jang S, Kim H-I. Entropy-Aware Model Initialization for Effective Exploration in Deep Reinforcement Learning. *Sensors*. 2022; 22(15):5845.
https://doi.org/10.3390/s22155845

**Chicago/Turabian Style**

Jang, Sooyoung, and Hyung-Il Kim. 2022. "Entropy-Aware Model Initialization for Effective Exploration in Deep Reinforcement Learning" *Sensors* 22, no. 15: 5845.
https://doi.org/10.3390/s22155845