# Enhanced DQN Framework for Selecting Actions and Updating Replay Memory Considering Massive Non-Executable Actions

## Abstract

## 1. Introduction

## 2. Related Works

## 3. Action Selection and Replay Update for Non-Executable Massive Actions

#### 3.1. Framework Overview

#### 3.2. Action Filter

#### 3.3. Replay Memory Updater

## 4. Experiments

#### 4.1. Experimental Environment

#### 4.2. Number of Winning Games without Action Filter

#### 4.3. Batch Sizes of Replay Memory without Action Filter

#### 4.4. Number of Winning Games with Action Filter

#### 4.5. Number of Batch Sizes of Replay Memory with Action Filter

## 5. Conclusions

## References

**Figure 2.** Win rate for the Gomoku game with the traditional DQN. In (**a**–**e**), the replay memory batch size is 5, 10, 15, 20, and 25, respectively.

**Figure 3.** Win rate for the Gomoku game with the proposed enhanced DQN framework, without the action filter applied. In (**a**–**e**), the ANN distance of the replay memory is 1, 2, 3, 4, and 5, respectively.

**Figure 4.** Replay memory batch size for the Gomoku game with the proposed enhanced DQN framework, without the action filter applied. As the ANN distance of the replay memory goes from 1 to 5, the replay memory batch size is reduced from 2433 to 378.

**Figure 5.** Win rate for the Gomoku game with the traditional DQN, with the action filter applied. In (**a**–**e**), the replay memory batch size is 5, 10, 15, 20, and 25, respectively.

**Figure 6.** Win rate for the Gomoku game with the proposed enhanced DQN framework, with the action filter applied. In (**a**–**e**), the ANN distance of the replay memory is 1, 2, 3, 4, and 5, respectively.

**Figure 7.** Replay memory batch size for the Gomoku game with the proposed enhanced DQN framework, with the action filter applied. As the ANN distance of the replay memory goes from 1 to 5, the replay memory batch size is reduced from 790 to 2.

PROCEDURE Agent_Process |
---|
BEGIN |
SET $D \leftarrow \varnothing$ |
FOR each episode |
FOR each time $t$ |
SET $s_t \leftarrow$ CALL State_Receiver |
SET $\mathrm{Q}(s_t, a) \leftarrow$ CALL Action_Estimator with $s_t$ |
SET $a_t \leftarrow$ CALL Action_Filter with $s_t$, $\mathrm{Q}(s_t, a)$ |
SET $s_{t+1}, r_t \leftarrow$ CALL Action_Executer with $a_t$ |
CALL Replay_Memory_Updater with $D$, $s_t$, $s_{t+1}$, $a_t$, $r_t$ |
END FOR |
END FOR |
END |
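
The control flow of Agent_Process can be sketched in Python as follows. This is only an illustrative outline, not the authors' implementation: the `env` object with `reset`, `observe`, and `step` methods, the `done` flag that ends an episode, the `max_steps` limit, and the three callables passed in are all hypothetical names introduced here.

```python
def agent_process(env, estimate_q, select_action, update_memory,
                  n_episodes, max_steps):
    """Run the agent loop and return the replay memory D.

    env, estimate_q, select_action and update_memory are hypothetical
    stand-ins for State_Receiver/Action_Executer, Action_Estimator,
    Action_Filter and Replay_Memory_Updater.
    """
    memory = []                                   # D <- empty set
    for _ in range(n_episodes):                   # FOR each episode
        env.reset()                               # assumption: episodes start fresh
        for _ in range(max_steps):                # FOR each time t
            s_t = env.observe()                   # State_Receiver
            q = estimate_q(s_t)                   # Action_Estimator: Q(s_t, a)
            a_t = select_action(s_t, q)           # Action_Filter
            s_next, r_t, done = env.step(a_t)     # Action_Executer
            update_memory(memory, s_t, s_next, a_t, r_t)  # Replay_Memory_Updater
            if done:                              # assumption: game over ends the episode
                break
    return memory
```

Passing the components in as callables keeps the loop independent of how the Q-function, the filter, and the memory update are realized.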

PROCEDURE Action_Filter |
---|
INPUT: $s_t$, $\mathrm{Q}(s_t, a)$ |
OUTPUT: $a_t$ |
BEGIN |
INITIALIZE $\mathrm{M}(s_t, a) \leftarrow 0$, $\forall a$ |
SET $\mathrm{M}(s_t, a) \leftarrow -\infty$, where $a$ is not executable at $s_t$ |
SET $a_t \leftarrow \underset{a}{\mathrm{argmax}}\left(\mathrm{Q}(s_t, a) + \mathrm{M}(s_t, a)\right)$ |
END |
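
The masking step of Action_Filter can be illustrated with a short NumPy sketch. It assumes the Q-values arrive as a flat array with one entry per action and that executability is given as a boolean array of the same length. The function name, the argument names, and the 3x3 board in the example are hypothetical and chosen only for illustration.

```python
import numpy as np

def action_filter(q_values, executable):
    """Return the index of the highest-valued executable action.

    q_values:   1-D array holding Q(s_t, a) for every action a.
    executable: 1-D boolean array, True where a can be executed at s_t
                (hypothetical input format).
    """
    q_values = np.asarray(q_values, dtype=float)
    executable = np.asarray(executable, dtype=bool)
    if not executable.any():
        raise ValueError("no executable action in this state")
    # M(s_t, a) is 0 for executable actions and -inf otherwise.
    mask = np.where(executable, 0.0, -np.inf)
    # a_t = argmax_a (Q(s_t, a) + M(s_t, a))
    return int(np.argmax(q_values + mask))

# Flattened 3x3 board: cells 0 and 4 are already occupied.
q = np.array([0.9, 0.1, 0.3, 0.2, 0.8, 0.5, 0.0, 0.4, 0.6])
free = np.array([False, True, True, True, False, True, True, True, True])
best = action_filter(q, free)
```

Because the mask adds $-\infty$ to every non-executable action, the two occupied cells can never be selected, even though they carry the largest raw Q-values in this example.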

PROCEDURE Replay_Memory_Updater_with_ANN |
---|
INPUT: $s_t$, $s_{t+1}$, $a_t$, $r_t$ |
BEGIN |
IF $D$ is $\varnothing$ |
SET $y_t \leftarrow \mathrm{Q}(s_t, a) \cdot (1 - \alpha) + \left(r_t + \gamma \underset{a'}{\max} \mathrm{Q}(s_{t+1}, a')\right) \cdot \alpha$ |
SET $D \leftarrow D \cup \{[s_t, a_t, r_t, y_t, s_{t+1}]\}$ |
ELSE |
# The Starting of ANN |
SET $d \leftarrow \min\left(s - s_t\right)$ where $s$ is from $D$ |
SET $i \leftarrow \underset{i}{\mathrm{argmin}}\left(s_i - s_t\right)$ where $s_i$ is from $D$ |
IF $d > \delta$ |
SET $y_t \leftarrow \mathrm{Q}(s_t, a) \cdot (1 - \alpha) + \left(r_t + \gamma \underset{a'}{\max} \mathrm{Q}(s_{t+1}, a')\right) \cdot \alpha$ |
SET $D \leftarrow D \cup \{[s_t, a_t, r_t, y_t, s_{t+1}]\}$ |
ELSE |
SET $y_i \leftarrow y_i \cdot (1 - \alpha) + \left(r_t + \gamma \underset{a'}{\max} \mathrm{Q}(s_{t+1}, a')\right) \cdot \alpha$ |
END IF |
# The End of ANN |
END IF |
SET $l \leftarrow \frac{1}{\lvert D \rvert} \sum_{j=1}^{\lvert D \rvert} \left(y_j - \mathrm{Q}(s_j, a_j)\right)^2$ where $y_j$, $s_j$, $a_j$ are from $D$ |
Update the Q function using the gradient descent algorithm by minimizing the loss $l$ |
END |
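
The append-or-merge logic of the replay memory update can be sketched as below. Several choices here are assumptions rather than details from the source: an exact brute-force nearest-neighbor search with Euclidean distance stands in for the ANN step, $\mathrm{Q}(s_t, a)$ is read as the value of the chosen action $a_t$, `q_fn` is a hypothetical callable returning one Q-value per action, and the default values of `alpha`, `gamma`, and `delta` are arbitrary. The final gradient-descent step on the loss $l$ is left out.

```python
import numpy as np

def update_replay_memory(memory, s_t, a_t, r_t, s_next, q_fn,
                         alpha=0.5, gamma=0.9, delta=1.0):
    """Append a new transition to D or merge it into its nearest neighbour.

    memory: list of dicts with keys s, a, r, y, s_next (plays the role of D).
    q_fn:   hypothetical callable, q_fn(state) -> 1-D array of Q-values.
    """
    s_t = np.asarray(s_t, dtype=float)
    s_next = np.asarray(s_next, dtype=float)
    # r_t + gamma * max_a' Q(s_{t+1}, a')
    target = r_t + gamma * np.max(q_fn(s_next))

    nearest, dist = None, np.inf        # an empty D behaves like d > delta
    if memory:
        # Brute-force search (assumption: Euclidean distance between states).
        dists = [np.linalg.norm(e["s"] - s_t) for e in memory]
        nearest = int(np.argmin(dists))
        dist = dists[nearest]

    if dist > delta:
        # No stored state is close enough: store a new entry with target y_t.
        y_t = q_fn(s_t)[a_t] * (1 - alpha) + target * alpha
        memory.append({"s": s_t, "a": a_t, "r": r_t, "y": y_t, "s_next": s_next})
    else:
        # A close state exists: blend the new target into its stored y_i.
        entry = memory[nearest]
        entry["y"] = entry["y"] * (1 - alpha) + target * alpha
    return memory

# Toy run with a Q-function that returns zero for every action.
q_zero = lambda s: np.zeros(4)
memory = []
update_replay_memory(memory, [0.0, 0.0], 1, 1.0, [0.0, 1.0], q_zero)   # appended
update_replay_memory(memory, [0.5, 0.0], 2, 1.0, [1.0, 1.0], q_zero)   # merged into entry 0
update_replay_memory(memory, [3.0, 0.0], 0, -1.0, [3.0, 1.0], q_zero)  # appended
```

In this toy run the second transition lies within `delta` of the first stored state, so it only refines the stored target instead of growing the memory. The third transition is far from every stored state and is added as a new entry.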

When the Opponent Wins (Black Win) | When the Player Wins (White Win) | When the Player Can’t Place Stone |
---|---|---|
−1 reward | +1 reward | −0.5 reward |

When the Opponent Wins (Black Win) | When the Player Wins (White Win) |
---|---|
−1 reward | +1 reward |

