# Enhanced DQN Framework for Selecting Actions and Updating Replay Memory Considering Massive Non-Executable Actions


## Abstract


## 1. Introduction

## 2. Related Works

## 3. Action Selection and Replay Update for Non-Executable Massive Actions

#### 3.1. Framework Overview

#### 3.2. Action Filter

#### 3.3. Replay Memory Updater

## 4. Experiments

#### 4.1. Experimental Environment

#### 4.2. Number of Winning Games without Action Filter

#### 4.3. Batch Sizes of Replay Memory without Action Filter

#### 4.4. Number of Winning Games with Action Filter

#### 4.5. Batch Sizes of Replay Memory with Action Filter

## 5. Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Conflicts of Interest

## References


**Figure 2.** Win rate for the Gomoku game with the traditional DQN. In panels (**a**–**e**), the replay memory batch size is 5, 10, 15, 20, and 25, respectively.

**Figure 3.** Win rate for the Gomoku game with the proposed enhanced DQN framework without the action filter applied. In panels (**a**–**e**), the ANN distance of the replay memory is 1, 2, 3, 4, and 5, respectively.

**Figure 4.** Replay memory batch size for the Gomoku game with the proposed enhanced DQN framework without the action filter applied. As the ANN distance of the replay memory increases from 1 to 5, the batch size falls from 2433 to 378.

**Figure 5.** Win rate for the Gomoku game with the traditional DQN with the action filter applied. In panels (**a**–**e**), the replay memory batch size is 5, 10, 15, 20, and 25, respectively.

**Figure 6.** Win rate for the Gomoku game with the proposed enhanced DQN framework with the action filter applied. In panels (**a**–**e**), the ANN distance of the replay memory is 1, 2, 3, 4, and 5, respectively.

**Figure 7.** Replay memory batch size for the Gomoku game with the proposed enhanced DQN framework with the action filter applied. As the ANN distance of the replay memory increases from 1 to 5, the batch size falls from 790 to 2.

```
PROCEDURE Agent_Process
BEGIN
    SET D ← ∅
    FOR each episode
        FOR each time t
            SET s_t ← CALL State_Receiver
            SET Q(s_t, a) ← CALL Action_Estimator with s_t
            SET a_t ← CALL Action_Filter with s_t, Q(s_t, a)
            SET s_{t+1}, r_t ← CALL Action_Executer with a_t
            CALL Replay_Memory_Updater with D, s_t, s_{t+1}, a_t, r_t
        END FOR
    END FOR
END
```
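The Agent_Process loop above can be sketched in Python. The environment's `reset`/`step` interface and all callable names here are illustrative assumptions, not the paper's API:

```python
def run_agent(n_episodes, env, estimate_q, action_filter, update_replay):
    """Skeleton of the Agent_Process loop; all callables are placeholders."""
    D = []  # replay memory
    for _ in range(n_episodes):
        s_t, done = env.reset(), False
        while not done:
            q = estimate_q(s_t)                       # Action_Estimator
            a_t = action_filter(s_t, q)               # Action_Filter
            s_next, r_t, done = env.step(a_t)         # Action_Executer
            update_replay(D, s_t, a_t, r_t, s_next)   # Replay_Memory_Updater
            s_t = s_next
    return D
```

Each component from the pseudocode is passed in as a function, so the loop itself stays independent of how actions are estimated, filtered, and stored.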

```
PROCEDURE Action_Filter
INPUT: s_t, Q(s_t, a)
OUTPUT: a_t
BEGIN
    Initialize M(s_t, a) to 0 for all a
    SET M(s_t, a) ← −∞, where a is not executable at s_t
    SET a_t ← argmax_a (Q(s_t, a) + M(s_t, a))
END
```
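A minimal NumPy sketch of the masking step in Action_Filter; the `executable` boolean mask is a hypothetical input (in Gomoku it would flag the empty board cells):

```python
import numpy as np

def action_filter(q_values: np.ndarray, executable: np.ndarray) -> int:
    """Pick the best action among the executable ones.

    q_values:   Q(s_t, a) for every action a (shape: [n_actions]).
    executable: boolean mask, True where action a can be taken in s_t.
    """
    mask = np.where(executable, 0.0, -np.inf)  # M(s_t, a): 0 or -inf
    return int(np.argmax(q_values + mask))
```

Adding −∞ to the Q-values of non-executable actions guarantees they can never win the argmax, even when their estimated Q-value is the highest.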

```
PROCEDURE Replay_Memory_Updater_with_ANN
INPUT: D, s_t, s_{t+1}, a_t, r_t
BEGIN
    IF D is ∅
        SET y_t ← Q(s_t, a_t) · (1 − α) + (r_t + γ · max_{a'} Q(s_{t+1}, a')) · α
        SET D ← D ∪ {[s_t, a_t, r_t, y_t, s_{t+1}]}
    ELSE                                    # the start of ANN
        SET d ← min_i dist(s_i, s_t), where s_i is from D
        SET i ← argmin_i dist(s_i, s_t), where s_i is from D
        IF d > δ
            SET y_t ← Q(s_t, a_t) · (1 − α) + (r_t + γ · max_{a'} Q(s_{t+1}, a')) · α
            SET D ← D ∪ {[s_t, a_t, r_t, y_t, s_{t+1}]}
        ELSE
            SET y_i ← y_i · (1 − α) + (r_t + γ · max_{a'} Q(s_{t+1}, a')) · α
        END IF                              # the end of ANN
    END IF
    SET l ← (1/|D|) · Σ_{j=1}^{|D|} (y_j − Q(s_j, a_j))², where y_j, s_j, a_j are from D
    Update the Q-function using gradient descent by minimizing the loss l
END
```
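The replay update above can be sketched in Python, assuming states are NumPy vectors compared by Euclidean distance and replacing the ANN lookup with an exact linear scan; `q_fn`, `alpha`, `gamma`, and `delta` are illustrative names, not the paper's:

```python
import numpy as np

def update_replay_memory(D, s_t, a_t, r_t, s_next, q_fn,
                         alpha=0.1, gamma=0.99, delta=1.0):
    """Append a new transition, or merge it into its nearest neighbor.

    D is a list of [s, a, r, y, s'] entries; q_fn(s) returns the Q-value
    vector for state s. Parameter names are illustrative assumptions.
    """
    target = r_t + gamma * np.max(q_fn(s_next))
    if not D:
        y_t = q_fn(s_t)[a_t] * (1 - alpha) + target * alpha
        D.append([s_t, a_t, r_t, y_t, s_next])
        return
    # Nearest stored state (the "ANN" step, here an exact linear scan)
    dists = [np.linalg.norm(entry[0] - s_t) for entry in D]
    i = int(np.argmin(dists))
    if dists[i] > delta:
        y_t = q_fn(s_t)[a_t] * (1 - alpha) + target * alpha
        D.append([s_t, a_t, r_t, y_t, s_next])
    else:
        # Refine the stored target instead of growing the memory
        D[i][3] = D[i][3] * (1 - alpha) + target * alpha
```

When a new state lies within distance δ of a stored one, its target is folded into the existing entry, which is what keeps the replay memory's batch size from growing with every transition.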

| When the Opponent Wins (Black Wins) | When the Player Wins (White Wins) | When the Player Cannot Place a Stone |
|---|---|---|
| −1 reward | +1 reward | −0.5 reward |
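The reward scheme in the table above can be written as a small lookup; the outcome labels are illustrative names, not identifiers from the paper:

```python
def gomoku_reward(outcome: str) -> float:
    """Reward from the player's (White's) perspective; labels are illustrative."""
    rewards = {
        "black_win": -1.0,  # the opponent wins
        "white_win": +1.0,  # the player wins
        "no_move":   -0.5,  # the player cannot place a stone
    }
    return rewards[outcome]
```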

| When the Opponent Wins (Black Wins) | When the Player Wins (White Wins) |
|---|---|
| −1 reward | +1 reward |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Gu, B.; Sung, Y.
Enhanced DQN Framework for Selecting Actions and Updating Replay Memory Considering Massive Non-Executable Actions. *Appl. Sci.* **2021**, *11*, 11162.
https://doi.org/10.3390/app112311162
