Enhanced Reinforcement Learning Method Combining One-Hot Encoding-Based Vectors for CNN-Based Alternative High-Level Decisions

Abstract: Gomoku is a two-player board game that originated in ancient China. There are various cases of developing Gomoku using artificial intelligence, such as a genetic algorithm and a tree search algorithm. Alpha-Gomoku, a Gomoku AI built with Alpha-Go's algorithm, defines all possible situations on the Gomoku board using Monte-Carlo tree search (MCTS), and minimizes the probability of learning other correct answers in duplicated Gomoku board situations. However, in the tree search algorithm, accuracy drops because the classification criteria are set manually. In this paper, we propose an improved reinforcement learning-based high-level decision approach using convolutional neural networks (CNN). The proposed algorithm expresses each state as a one-hot-encoding-based vector and determines the state of the Gomoku board by combining similar one-hot-encoding-based state vectors. Thus, in cases where the stone position determined by the CNN is already occupied or unavailable, we suggest a method for selecting an alternative. We verify the proposed Gomoku AI in GuPyEngine, a Python-based 3D simulation platform.


Introduction
Gomoku is a two-player board game that originated in ancient China. Two players alternate in placing a stone of their own color, and the player who first completes five-in-a-row horizontally, vertically, or diagonally wins the game. Various studies have attempted to develop Gomoku using artificial intelligence (AI), for example, using a genetic algorithm and a tree search algorithm [1,2]. Deep learning algorithms have been actively employed in recent years after Alpha-Go showed its outstanding performance. For example, efficient Gomoku board recognition and decision-making was made possible using the convolution layers of the deep-learning convolutional neural network (CNN) algorithm [3]. However, as CNN-based decision-making determines a single optimal position for a given recognized Gomoku board state, cases arise where a stone cannot be placed in the relevant position. In order to address this problem, Alpha-Gomoku (a Gomoku AI built with Alpha-Go's algorithm) defines all possible situations on the Gomoku board using Monte-Carlo tree search (MCTS) and minimizes the probability of learning other correct answers in a duplicated Gomoku board situation [4]. However, in the tree search algorithm, accuracy drops because the classification criteria are set manually.
In this paper, we propose an improved reinforcement learning-based high-level decision algorithm using a CNN. The proposed algorithm expresses each state as a one-hot-encoding-based vector and determines the state of the Gomoku board by combining similar one-hot-encoding-based state vectors. Thus, in a case where the stone position determined by the CNN is already occupied or unavailable, we suggest a method for selecting an alternative.

Related Works
Several researchers have attempted to develop Gomoku AI models using various algorithms. A genetic algorithm is an evolutionary-type algorithm that is used in various game AI models, and it has been used in Gomoku AI models [2]. Shah et al. [5] built a Gomoku AI model by applying the genetic algorithm. Wang et al. [7] used the genetic algorithm to solve the bottleneck [6] that arises during deep search in a tree-based Gomoku AI.
In addition to the genetic algorithm, game AI models have been built using various tree search-based algorithms [1]. Typically, behavior trees (BT) allow for the efficient structuring of the behavior of artificial agents [8]. BT is widely used in computer games and robotics because the approach is efficient for building complex AI models that are modular as well as reactive. Allis et al. described a Gomoku AI model using threat-space search [9] and proof-number search [10,11]. This algorithm was implemented in the Gomoku program called Victoria. When Victoria moved first, the algorithm yielded a winning result without fail. Cao et al. presented a Gomoku AI model using an algorithm that combined upper confidence bounds applied to trees (UCT) [12] and adaptive dynamic programming (ADP) [13,14]. This algorithm could handle the search depth defect more accurately and efficiently than UCT alone, thereby improving the performance of Gomoku AI.
Gomoku models using CNN have also been developed, as Deep-Mind's Alpha-Go demonstrated outstanding deep learning performance. Li et al. built a Gomoku model using MCTS and deep learning [15,16]. An expert-level AI model was developed without the use of human expert knowledge by integrating the effective depth exploration of MCTS with deep neural networks (DNN). Yan et al. proposed a hybrid Gomoku AI model that improved performance by including a hard-coded convolution-based Gomoku evaluation function [17]. Shao et al. predicted the next stone position with an accuracy of 42% using only the current board state, through various DNNs with different hyper-parameters [18].
Various studies have developed expert-level game AI models using reinforcement learning. For instance, Deep-Mind applied the Deep Q-Networks (DQN) reinforcement learning algorithm to seven Atari 2600 games, enabling play at an expert level [19]. Oh et al. considered a real-time fighting AI game using deep reinforcement learning [20]. The AI model achieved a winning rate of 62% when competing with five professional gamers. Tang et al. presented an efficient Gomoku AI by integrating MCTS and ADP [21]. This algorithm could effectively eliminate the "short-sighted" defect of the neural network evaluation function, yielding results exceeding those of a Gomoku model adopting only ADP. Zhao et al. designed a Gomoku AI model that is capable of self-teaching through ADP [22]. Self-teaching is a method that penalizes the final action of the loser while rewarding the last action of the winner. The present study also uses "Record" for self-teaching, inspired by this basic idea.
With the help of reinforcement learning, a model-free algorithm can represent the next state of AI as a probability when AI acts in a specific state [23]. In Q-Learning, the next Q state is calculated while using Q-values [24]. In this regard, de Lope et al. designed an efficient KNN-TD algorithm for a model-free algorithm by integrating the K-nearest neighbor (KNN) algorithm [25] and the temporal difference (TD) [26,27].
In this paper, by combining one-hot-encoding-based vectors, we build a model-free Gomoku AI that is capable of self-teaching. After generating the state vectors by performing convolution operations on the relationships between the stones placed on the Gomoku board, we determine similar situations for each input situation vector with an approximate nearest neighbor (ANN) search [28]. In similar situations, the one-hot-encoding-based vectors are combined; subsequently, a CNN is utilized. The row values of the one-hot-encoding-based vectors are independent of each other, but combining the vectors converts them into related, non-independent rows. Following this approach, when applying self-teaching, if a stone cannot be placed in the optimal position, then the next optimal position can be considered. We also develop 'GuPyEngine' for Python and 3D programming education. For the proposed method, we conduct experiments by implementing a Gomoku system in GuPyEngine.

Advanced CNN-Based Decision-Making
This paper proposes an enhanced reinforcement learning model to improve the training data of a CNN, in order to enable optimal decision-making and alternative selection. By selecting an alternative, it solves the problem of CNN-based decision-making allowing only the optimal solution to be selected, and it minimizes the forgetting problem. The forgetting problem refers to the phenomenon in which weights obtained through previous learning in neural network algorithms are overwritten by new learning [29]; it generally arises when inputs are similar but labels differ. Figure 1 shows the proposed framework. As can be seen, the framework is based on multiple encoding-based vector combinations, containing the following stages: "State Vector Generation", "CNN-based Estimation", "Decision Making", "One-Hot-Encoding-based Vector Generation", "ANN-based One-Hot-Encoding Vector Combination", and "Enhanced CNN Training".

Framework Overview

The state vector generation stage generates the vectors for simulating the states. When f_1(s_t) is the state vector generation function, f_1(s_t) = <s'_{t,1}, s'_{t,2}, ..., s'_{t,n}>, where s_{t,i} is the i-th element of s_t and s'_{t,i} is the converted value of s_{t,i}, defined while considering both the environment and the CNN. In the CNN-based estimation stage, alternative actions are derived by applying the state vector to a CNN; the proposed CNN uses convolution with a kernel size of 3 × 3 [18]. As the position of stones must be determined exactly in Gomoku, the minimum kernel size is used. In the decision-making stage, a case might arise where the action corresponding to the highest probability in the CNN result cannot be performed. Such a position is treated as −1, which makes its selection impossible, so the alternative action with the next highest probability is selected and performed.
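The alternative-selection rule of the decision-making stage can be sketched as follows. This is a minimal illustration, not the authors' implementation; the function name and the flattened 49-entry board representation are assumptions.

```python
import numpy as np

def select_action(probs, board):
    """Select the position with the highest CNN probability.
    Positions that already hold a stone are set to -1, as in the
    decision-making stage, so the next-best position is chosen instead.
    probs: length-49 CNN output; board: length-49 array, 0 = empty."""
    masked = np.asarray(probs, dtype=float).copy()
    masked[np.asarray(board) != 0] = -1.0  # occupied: selection impossible
    return int(np.argmax(masked))
```

Because occupied positions are forced to −1, the argmax naturally falls through to the alternative with the next highest probability.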

The one-hot-encoding-based vector generation stage generates the vectors for simulating the actions. f_2(s_t) returns o_t, the one-hot-encoding-based vector of s_t. The ANN-based one-hot-encoding vector combination stage receives <s'_t, o_t> and returns <s'_t, o''_t> as training data, where o'_t corresponds to the most similar vector in the training data and o''_t is the combination of o_t and o'_t. The following section describes how these two vectors are combined. The CNN training stage trains the CNN with <s'_t, o''_t>.

ANN-Based One-Hot Encoding Vector Combination Stage
In this paper, we define "Record" as the action of going back several moves at the end of each game and placing a stone in the position where the opponent's stone is likely to be placed, using the proposed method. There are two considerations when updating rewards. First, when the player wins, no reward is received. If the rewards of wins were reflected, a specific row in the combined one-hot-encoding-based vector would increase rapidly; diverse stone positions would then not be considered, and the Gomoku AI would converge to a single line of play, i.e., the local optimum problem [30]. Second, the reward is updated only when the player loses. For example, in the case of one record, the game goes back one move from the end, the last position of the opponent's stone is replaced with the player's own stone, the reward for the resulting situation is updated by 1, and the one-hot-encoding-based vectors are combined into one vector. In the case of three records, this process is repeated three times, going back three moves and replacing the black stone with the white stone. Figure 2 shows the combination of one-hot-encoding-based vectors in the proposed Gomoku system. In Gomoku, similar stone moves occur frequently, yet depending on the player's tendency, the moves can differ completely even in similar situations. The proposed Gomoku system groups similar situations for each feature vector using the ANN algorithm. Within a group of similar situations, one-hot-encoding-based vectors are combined in order to calculate the probability; this avoids an exponential increase in the amount of training data. ANN algorithms are widely utilized for large-scale image search and classification problems [28]. In this paper, the similarity between one-hot-encoding-based vectors is calculated by Euclidean distance, a representative metric for ANN search.
If the similarity is smaller than a specific threshold, the corresponding vectors are combined. If the similarity is larger than the threshold, i.e., there is no sufficiently similar vector, a new vector containing the corresponding vector is inserted.
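The threshold rule can be sketched as follows. This is a simplified illustration under the assumption that each ANN node stores a state vector together with its combined one-hot vector; the names are illustrative, and a linear scan stands in for a real ANN index.

```python
import numpy as np

def insert_or_combine(nodes, state_vec, onehot, alpha):
    """nodes: list of [state_vector, combined_onehot] pairs.
    If an existing state vector lies within Euclidean distance alpha
    of state_vec, combine the one-hot vectors row-wise; otherwise
    insert a new node for this situation."""
    state_vec = np.asarray(state_vec, dtype=float)
    onehot = np.asarray(onehot, dtype=float)
    for node in nodes:
        if np.linalg.norm(node[0] - state_vec) < alpha:
            node[1] = node[1] + onehot  # combine similar situations
            return nodes
    nodes.append([state_vec, onehot])  # no similar vector: new node
    return nodes
```

Combining rather than appending is what keeps the training set from growing exponentially with the number of games.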
A one-hot-encoding-based vector is a binary vector in which the value at the index of the labeled datum is set to 1 and all other values are set to 0 [31]. When a neural network is trained with labels encoded as plain numbers, the outputs can fall between integer values, and categorical data cannot then be classified accurately; with one-hot encoding, categorical data can be classified accurately. Given that only one stone can be placed at any intersection in Gomoku, the possible stone positions are treated as 49 classes. One-hot-encoding-based vector expression is a widely used approach for classifying categorical data with deep neural networks (DNN) [31][32][33][34]. In this paper, the positions of stones in each state are expressed as one-hot-encoding-based vectors.
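As a concrete example of the 49-class encoding, a board intersection can be mapped to a binary vector as follows (a minimal sketch; the row-major index convention is an assumption):

```python
import numpy as np

def position_to_onehot(x, y, size=7):
    """Encode intersection (x, y) of a size x size Gomoku board as a
    one-hot vector of length size*size (49 here): the entry at the
    position's index is 1 and all other entries are 0."""
    vec = np.zeros(size * size)
    vec[y * size + x] = 1.0
    return vec
```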
By combining one-hot encodings, the 49 rows, which are independent of each other, are made non-independent. A CNN trained on such non-independent one-hot encodings outputs results that closely resemble the learned one-hot-encoding vectors; thus, the CNN output carries non-independent values in each row. Even if a black or white stone already occupies the position with the highest probability in the CNN output, the position with the next highest probability is still related, non-independently, to the current situation, and therefore becomes the next best position. In addition, as one-hot encodings are continuously combined for each game in similar situations, the learning result varies according to the user's tendency.

CNN Training Stage
The record data generated for each game are combined for each similar situation through the ANN. Nodes are generated for each different situation in the ANN, and each node has combined one-hot-encoding-based vectors (Figure 3). The vector for each ANN node frequently varies every time that learning is performed with the record data. Hence, in the 'Enhanced CNN Training' stage of each game, the existing CNN model is removed, and a new CNN model is regenerated based on the learned ANN result. In addition, the training data of the ANN are regenerated with the latest ANN node. As similar situations are combined through the ANN, the amount of training data does not increase incrementally, even though CNN training is performed for each game. This also mitigates the forgetting problem by reducing the amount of useless data learning during CNN training.
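The per-game retraining step can be sketched as below. This is an illustrative outline, not the authors' code: `build_cnn` is a hypothetical factory for a fresh model, and each ANN node is assumed to hold a state vector and its combined one-hot vector, normalized here into a probability target.

```python
import numpy as np

def retrain(ann_nodes, build_cnn):
    """After each game, discard the old CNN and train a new one on the
    latest ANN nodes, so the training set reflects the regenerated data
    and does not grow with every game."""
    X = np.stack([np.asarray(s, dtype=float) for s, _ in ann_nodes])
    # normalize combined one-hot counts into probability targets
    Y = np.stack([np.asarray(o, dtype=float) / np.sum(o) for _, o in ann_nodes])
    model = build_cnn()   # a fresh model replaces the previous one
    model.fit(X, Y)       # relearn from the latest ANN nodes only
    return model
```

Rebuilding the model each game, rather than fine-tuning the old one, is what the text describes as mitigating the forgetting problem.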


Experiments
We verify the performance of the proposed reinforcement learning model by applying it to a Gomoku game. Section 4.1 describes the learning and development environments, whereas Section 4.2 introduces GuPyEngine. Sections 4.3-4.5 present the performance verification of the proposed reinforcement learning Gomoku AI model while considering various criteria.

Experimental Environment
We verified the performance of the proposed Gomoku system through various methods. We performed experiments measuring "the number of wins", "the number of ANN nodes", and "the next best solution" of the Gomoku game, and verified the difference in performance according to the number of records generated for each game. The experiments were developed with C/C++, Python 3.0, and TensorFlow, on an Intel i7 CPU with 16 GB RAM and a GTX-1050 GPU.
In the experiments, 600 Gomoku games were iterated for each experimental case. The genetic algorithm was applied to the black stones. Identical openings across games were prevented by randomly initializing the X and Y positions between 3 and 5. As the performance of the ANN varies depending on the defined distance, we verified the performance for various distances in the experiments.
The proposed method was applied to white stones. The goal of the white stone was to prevent the black stone from completing five-in-a-row, by predicting the position of the black stone. The proposed Gomoku system considered the white stone as the winner if it completes the five-in-a-row first, or if all 49 intersections are filled. If a white stone wins, it moves to the next game, whereas, if a black stone wins, it relearns and proceeds to the next game.
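The win condition above can be checked with a straightforward scan of the 7 × 7 board; the following is a minimal sketch (function name and board representation are illustrative):

```python
def five_in_a_row(board, player, size=7):
    """Return True if `player` has five consecutive stones horizontally,
    vertically, or diagonally. board: size x size nested list,
    indexed board[y][x], with 0 denoting an empty intersection."""
    dirs = [(1, 0), (0, 1), (1, 1), (1, -1)]  # four line directions
    for y in range(size):
        for x in range(size):
            if board[y][x] != player:
                continue
            for dx, dy in dirs:
                n, cx, cy = 0, x, y
                while 0 <= cx < size and 0 <= cy < size and board[cy][cx] == player:
                    n += 1
                    cx += dx
                    cy += dy
                if n >= 5:
                    return True
    return False
```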
In the CNN-based estimation stage, a total of three hidden layers were used, with 49, 128, and 516 neurons, respectively. The input layer and output layer were designed with 49 (7 × 7) neurons, which corresponds to the total number of intersections on the board. The CNN result of the proposed Gomoku system was output as a vector with 49 rows, in order to predict the position of the next white stone by matching 1:1 to the 49 intersections of the board.
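One possible reading of this architecture is sketched below with the Keras API. The arrangement of the convolution relative to the dense layers, the number of convolution filters (32), and the activations are assumptions not stated in the text; only the 3 × 3 kernel, the 49/128/516 hidden sizes, and the 49-way input and output come from the description.

```python
import tensorflow as tf

def build_model():
    """Sketch of the described network: a 3 x 3 convolution over the
    7 x 7 board, the three reported hidden layers (49, 128, and 516
    neurons), and a 49-way softmax over the board intersections."""
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(7, 7, 1)),
        tf.keras.layers.Conv2D(32, kernel_size=3, padding="same",
                               activation="relu"),  # filter count assumed
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(49, activation="relu"),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(516, activation="relu"),
        tf.keras.layers.Dense(49, activation="softmax"),
    ])
```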

GuPyEngine
GuPyEngine is a Python-based simulation platform. Figure 4 depicts screenshots of GuPyEngine, which aims to easily and simply implement a three-dimensional (3D) simulation environment and simulate Python-based math and physics, among others. OpenGL 4.0 was used for 3D rendering, and the GUI supports high-speed rendering using GDI+ and OpenCV. State changes of 3D objects, such as position, rotation, and scale, are supported, as well as a camera, an FBX file loader, and bone-key animation.
Python code can be linked to each 3D object and written by linking with the PyCharm tool, as when using the Unity 3D tool (Figure 5). Currently, in GuPyEngine, we can change the object state, perform creation and deletion, and modify the object texture using Python code. In addition, it is possible to install and link the code with deep learning tools, such as TensorFlow and Keras, through PIP, since both 32-bit and 64-bit versions have been developed. GuPyEngine supports PIP versions of up to 3.0, and errors in Python code are displayed in the console (cmd.exe) window to facilitate debugging.
GuPyEngine rebuilds and uses Python code in C/C++. Hence, Python runs with high processing speed, and external libraries are easily incorporated.
A function for converting Python code into DLL/So files will be developed in the future for GuPyEngine; this will ensure that the content that is written in Python code in GuPyEngine can be easily linked with Unity 3D Engine, Unreal Engine, and other external programs. The Gomoku system that is proposed in this paper was developed by installing and using TensorFlow in GuPyEngine.

Number of Winning Games
We calculated the winning rate by grouping the 600 experimental data into five categories. Figure 6 shows the experimental results regarding the winning rate for four Records. Figure 6a shows that the winning rate is low, owing to frequent forgetting problems, as only the CNN was used without relearning from the records. Figure 6b confirms that the winning rate gradually increases when the records are relearned; however, new moves of the GA AI continue to be created, which results in frequent variations in the winning rate. Figure 6c shows that the winning rate gradually increases, with all wins observed from the 525th game onward, as seen in the green box area depicted in Figure 6c. The figure also shows the result of learning all of the possible moves by which the GA AI can place a stone, as the ANN minimizes the forgetting problem. Gomoku is a game that requires the exact stone position to be determined; therefore, the ideal result is obtained when the distance α of the ANN is five. Figure 6d-h show the results of gradually increasing the α of the ANN. The figures show that, as the α of the ANN increases beyond what is required, more errors occur, which hinders the learning. Figure 6d,e display similar shapes: ANN learning aggravates the forgetting problem, which results in a decrease in the winning rate, as seen in the red box areas in Figure 6d,e. Figure 6h shows a lower winning rate than Figure 6a.
Figure 7 shows that, overall, the winning rate is lower than in Figure 6. The proposed Gomoku system performs Enhanced CNN Training for each game; therefore, the results of this experiment show that the winning rate increases toward the end rather than the beginning. However, in Figure 7b-e, the red box areas show fewer learning records than in the previous experiment. In particular, Figure 7c confirms that the winning rate in the all-winning section shown in Figure 6c also drops, owing to the lack of records. Nevertheless, as in Figure 6, Figure 7c shows a very high winning rate compared to the other experimental cases presented in Figure 7, showing that an ANN α of five corresponds to the optimized case in the proposed Gomoku system. Figures 8 and 9 show the experimental results regarding the winning rates for two records and a single record, respectively.
These results confirm that the number of records to be learned is insufficient; therefore, the winning rate drops in all of the sections, and "Self-Teaching" occurs frequently. The experiments confirmed the satisfactory functioning of the ANN in the proposed Gomoku system when characterizing similar situations. We could also confirm that the optimal result was obtained when the α of the ANN was five. Furthermore, when α was greater than required, for example, in the range 25-30, CNN learning was instead hindered.

Number of ANN Nodes
The number of ANN nodes refers to the number of cases grouped by similar situations in Gomoku, which indicates the amount of CNN training data. Figure 10a shows the results of saving four Records per game without the ANN. During the 600 games, 1000 record data were saved, excluding the winning games. Figure 10b shows the ANN case: because records with α less than five are combined into a similar situation, the number of record data is reduced compared to Figure 10a, with 874 record data saved. Figure 10c-g show the results of saving 633, 448, 217, 78, and 50 record data, respectively, from the 600 games.

Number of ANN Nodes
The number of ANN nodes refers to the number of cases grouped by similar situations in Gomoku, which corresponds to the amount of CNN training data. Figure 10a shows the results of saving four record data for each game without the ANN: during 600 games, 1000 records were saved, excluding the winning games. Figure 10b shows the results with the ANN; because records with α less than five are combined into a similar situation, the number of saved records is reduced compared to Figure 10a, to 874. Figure 10c–h show that, as the α of the ANN increases, fewer record data are saved. Figure 10b thus contains less training data than Figure 10a, and Figure 6b,c show that the winning rate is also higher than when using only the existing CNN; we consider this an ideal learning result. However, as α increases further, records that are not similar are also merged, as seen in Figure 10g,h. This hinders CNN learning, and the winning rate drops, as shown in Figure 6g,h.
Figure 11 shows results similar to those presented in Figure 10. Figures 12 and 13 confirm that, because the number of record data to be learned is small, the number of ANN nodes is, overall, lower than required. These experiments also confirm that the amount of CNN training data does not increase linearly when the ANN combines records according to similar situations. When the α of the ANN is 5, the amount of training data is reduced by 12% on average while still ensuring effective performance.
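The merging behavior described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the record representation, the cell-wise distance, and the function names (`hamming_distance`, `merge_records`) are assumptions; the only property taken from the text is that records within distance α of an existing node are merged into it rather than stored as new nodes, so the node count grows sublinearly in the number of records.

```python
import numpy as np

# Hypothetical sketch of merging similar Gomoku board records into ANN nodes.
# A record is a flattened one-hot board vector; alpha is the merge threshold
# (maximum number of differing cells for two records to count as "similar").

def hamming_distance(a, b):
    """Number of board cells on which two one-hot vectors differ."""
    return int(np.count_nonzero(a != b))

def merge_records(records, alpha):
    """Group records into nodes; each node keeps one representative vector."""
    nodes = []  # list of representative vectors
    for rec in records:
        for node in nodes:
            if hamming_distance(rec, node) <= alpha:
                break  # a similar node already exists; merge (do not store)
        else:
            nodes.append(rec)  # no similar node found: create a new ANN node
    return nodes

# Example: three records on a toy 3x3 board, alpha = 1
r1 = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0])
r2 = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0])  # differs from r1 in one cell
r3 = np.array([0, 0, 0, 1, 1, 1, 0, 0, 0])  # clearly different position
print(len(merge_records([r1, r2, r3], alpha=1)))  # 2 nodes: r2 merges into r1
```

With a larger α, dissimilar records such as `r3` would also merge into `r1`, which mirrors the degradation observed in Figure 10g,h.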

Next Best Answer Selection
The combination of one-hot encodings showed that the rows have a dependent relationship. Thus, we also conducted an experiment to verify how many times the next best answer was selected in the proposed Gomoku system. The experimental criteria were four records and an ANN α of five, which yielded the optimal results. Figure 14 shows the number of times each rank was selected in the proposed Gomoku system. The highest rank was selected 3483 times. When a new case was added through the ANN, the proposed CNN Gomoku system performed the greatest number of selections, because there was no other one-hot encoding to merge. If a stone was already placed at the highest-ranked position or could not be placed there, the second-best result was selected. The second-best answer was selected 1693 times, the third-best 751 times, the fourth-best 472 times, and the fifth-best 322 times. The remaining ranks were selected 1377 times. Through this experiment, we confirmed that the proposed Gomoku system provides the next best answer by combining the independent one-hot encodings.
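The fallback described above (take the highest-probability cell; if it is occupied, move to the next rank) can be sketched as follows. This is an illustrative assumption of the selection step, not the paper's code: the names `select_move`, `probs`, and `occupied` are hypothetical, and the toy board has only four cells.

```python
import numpy as np

# Hypothetical sketch: choose the next best move when the CNN's top choice
# is illegal (the cell is already occupied). `probs` is the CNN's per-cell
# probability map; `occupied` marks cells where a stone cannot be placed.

def select_move(probs, occupied):
    """Return (cell index, rank) of the highest-probability legal cell."""
    ranked = np.argsort(probs)[::-1]          # cells ordered best to worst
    for rank, cell in enumerate(ranked, 1):   # rank 1 = highest probability
        if not occupied[cell]:
            return int(cell), rank            # first legal cell wins
    raise ValueError("no legal move: the board is full")

probs = np.array([0.05, 0.60, 0.25, 0.10])        # CNN output, 4-cell toy board
occupied = np.array([False, True, False, False])  # best cell (index 1) is taken
print(select_move(probs, occupied))               # (2, 2): second-best chosen
```

Counting the `rank` values returned over many games would reproduce the per-rank selection counts reported in Figure 14.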

Conclusions
We proposed a new reinforcement learning algorithm based on a CNN and verified its performance by applying it to Gomoku. By determining similar situations with an ANN and combining the rows of the one-hot-encoding-based vectors within those situations, we enabled the CNN to determine the correct answers according to the situation by itself. Because the rows of the previously independent one-hot-encoding-based vectors become non-independent through the proposed algorithm, the next best answer can be selected when a stone cannot be placed at the position with the highest probability. This minimizes the forgetting problem, which arises when learning data indiscriminately, and made it possible to win every game. Further, the amount of training data was reduced by 12% on average when the α of the ANN was five, as the amount of data did not increase linearly with each game.
We developed the proposed reinforcement-learning-based Gomoku AI and verified its performance in GuPyEngine, and we also introduced the functions and performance of GuPyEngine. In the future, we will perform various experiments and optimizations on linking GuPyEngine with deep learning tools. In this way, we expect to contribute to 3D simulation research by improving the functions of the 3D simulation platform. Further, we will apply the proposed reinforcement learning algorithm to other games and conduct the relevant experiments.