A Long / Short-Term Memory Based Automated Testing Model to Quantitatively Evaluate Game Design

: The mobile casual game application lifespan is getting shorter. A company has to shorten the game testing procedure to avoid being squeezed out of the game market share. There is no su ﬃ cient testing indicator to objectively evaluate the operability of di ﬀ erent game designs. Many automated testing methodologies are proposed, but they adopt rule-based approaches and cannot provide quantitative analysis to statistically evaluate gameplay experience. This study suggests applying “Learning Time” as a testing indicator and using the learning curve to identify the operability of di ﬀ erent game designs. This study also proposes a Long / Short-Term Memory based automated testing model (called LSTM-Testing) to statistically testing game experience through end-to-end functionality (Input: game image; Output: game action) without any manual intervention. The experiment results demonstrate LSTM-Testing can provide quantitative testing data by using learning time as the control variable, game design as the independent variable, and time to complete game as the dependent variable. This study also demonstrates how LSTM-Testing evaluates the e ﬀ ectiveness of di ﬀ erent gameplay learning strategies, e.g., reviewing the newest decisions, reviewing the correct decision, or reviewing the wrong decisions. The contributions of LSTM-Testing are (1) providing an objective and quantitative analytical game-testing framework, (2) reducing the labor cost of ine ﬃ cient and subjective manual game testing, and (3) allowing game company boosts software development by focusing on game intellectual property and leaves game testing to artiﬁcial intelligence (AI).


Introduction
The mobile casual game with a simple gameplay feature (called casual game shortly) contributes 15~25% of the total sales volume of the game industry [1]. A gameplay is a series of actions in which players interact with a game, such a jump to dodge attacks. The design of the casual game targets casual players by reducing demands on time and learned gameplay skills, in contrast to complex hardcore games, such as Starcraft/Diablo-like games. This is because casual players only play games in short bursts for relaxing their brain or social purposes, and 70% of people spend only six months to play a game [2]. This leads that the lifespan of the casual game is getting shorter. A game company analyze game designs, including the difficulties between game levels and the reasonability of the leveling path. They also cannot validate that the game level design is proportional to the game's difficulty levels. It should be noted that game testing is a crucial procedure to ensure game quality through the development cycle. The major challenge of game testing is creating a comprehensive list of testing cases which consumes enormous labor cost.
Some AI assisted testing methods are proposed [7]. The difference between the previous works and this study is twofold. First, from the perspective of testing approach design, they mainly use reinforcement learning with Q-algorithm to learn to complete a game rather than focusing on the learning strategy of a gamer. This study extends reinforcement learning by adding long/short term memory functionality to emulate gamer behaviors, including (a) how to play a game based on long-term memory and (b) how to relearn it based on short-term memory when a gamer forgets how to play it. Second, from the perspective of the testing objective, their AI testing system is usually designed to find out problems/bugs of a game. This study's AI testing system is designed to compare the reasonability of game designs (e.g., reasonable leveling or reasonable maze design) by playing it iteratively.

Proposed Framework: LSTM-Testing
This study, therefore, proposes a Long/Short-Term Memory based automated testing model (called LSTM-Testing) to provide quantitative testing data. These data help the game designer to objectively evaluate the game design. LSTM-Testing emulates real player learning behaviors and iteratively plays the game through end-to-end functionality (Input: game image frame; Output: game action) without any manual feature extraction and intervention. The architecture of LSTM-Testing shown in Figure 1 is developed based on [9] and involves five logic gates and 2 memory pools including an input gate, convolutional gate, forget gate, training gate, output gate. Furthermore, short-term memory pool, and long-term memory pool are added to emulate gamer behaviors, including (a) how to play a game based on long-term memory and (b) how to relearn it based on short-term memory when a gamer forgets how to play it.

Phase 1: Acquiring Sample
In the first phase, LSTM-Testing screenshots video frame as an input image to the input gate as shown in the top-left side of Figure 1. Figure 2 presents the hardware settings: first to prepare a suitable testing room, projector, and project screen, then to set up a webcam and computer with LSTM-Testing installed, finally to run the game, project the game video frame on the project screen, Four phases of 11 steps of LSTM-Testing, as shown in Figure 1, are adopted: (1) Acquire Sample, (2) Extract Feature by Deep Neural Network, (3) Review by LSTM, and (4) Action and Summarize the Testing. Furthermore, Figure 1 demonstrates three working flow: action decision flow, learning flow, and memorizing flow. The action decision flow (red flow) is an iterative flow consisting of steps 1, 2, 3, 4, 5, 6, 10, 11, and then back to step 1. The action decision flow decides what gameplay has the highest possibility to complete a game. The learning flow (blue flow) is also in iterative flow consisting of step 9.1 and step 9.2. The learning flow is an independent background thread that decides what gameplay decisions are picked up for reviewing periodically. One training cycle means a game session (e.g., playing until win or loss), and a cycle owns several times of reviewing. The memorizing flow is a sequential flow consisting of steps 7 and 8 to decide which sample should be stored in and drop from the short-term memory.
The detailed operations of each phase are as follows.

Phase 1: Acquiring Sample
In the first phase, LSTM-Testing screenshots video frame as an input image to the input gate as shown in the top-left side of Figure 1. Figure 2 presents the hardware settings: first to prepare a suitable testing room, projector, and project screen, then to set up a webcam and computer with LSTM-Testing installed, finally to run the game, project the game video frame on the project screen, and take a screenshot of the frame by the webcam. The software platform used by LSTM-Testing is developed by Python 3.8.3, and TensorFlow GPU 2.2.0 on Windows 10 platform.

Phase 2: Extracting Feature by Deep Neural Network
In the second phase, LSTM-Testing extracts features from the screenshot image by deep neural network. The second phase consists of steps 2~6 as shown in the top-right corner of Figure 1.
Steps 2 and 3 use input gate to aggregate continuous pictures as a sample. The input gate owns a screenshot queue to store k screenshots. After one screenshot taken in Step 1, the input gate pops the oldest frame and pushes the new one into its queue, then the input gate combines the screenshots in the queue as a sample of a gameplay, and then it normalizes this sample to standard input (threedimensional array) for the convolution gate, i.e., . For example, in order to store the action of a

Phase 2: Extracting Feature by Deep Neural Network
In the second phase, LSTM-Testing extracts features from the screenshot image by deep neural network. The second phase consists of steps 2~6 as shown in the top-right corner of Figure 1.
Steps 2 and 3 use input gate to aggregate continuous pictures as a sample. The input gate owns a screenshot queue to store k screenshots. After one screenshot taken in Step 1, the input gate pops the oldest frame and pushes the new one into its queue, then the input gate combines the screenshots in the queue as a sample of a gameplay, and then it normalizes this sample to standard input (three-dimensional array) for the convolution gate, i.e., S std . For example, in order to store the action of a gameplay moving right-ward, the input gate pops the oldest screenshot at time t-4 from the queue and pushes the current screenshot at time t into the queue, and then the input gate combines the screenshots in the queue at time (t-3)~t to form a gameplay sample of moving left-ward (Figure 3a). The input gate then converts these images to two-dimensional (2D) arrays of pixel values (Figure 3b), and finally aggregates these 2D arrays to a 3D array as a sample to demonstrate moving right-ward action (Figure 3c). It should be noted that the LSTM-Testing does not collect all video frames of a game but only collects the frames based on the game testing scenario (e.g., eye strain or different hardware requirements). For example, LSTM-Testing may only collect 20 frames per second (one frame per 50 milliseconds) to emulate the game experience of a gamer with high eye strain, given that this game can support 120 frames per second for better flicker fusion experience.
Appl. Sci. 2020, 10, x FOR PEER REVIEW 7 of 23 In steps 4~6, LSTM-Testing applies the convolutional gate to reduce the sample size, which can lower the computational complexity during the training process. The convolution gate consists of convolution, pooling, and fully-connected operations.
The convolution operation in step 4 attempts to extract targeted features from the origination sample and filter out unnecessary noise by calculating the input sample with multiple vectors of weight , , and bias , , . The vector of weights and the bias are called a filter, and this filter represents feature extraction of the input sample. As shown in Figure 4, K filters are first selected in this convolutional operation, and each filter k contains a weight (1) where d is the sliding distance (called stride) that a filter attempts to decrease the overlay extracting cells process. In steps 4~6, LSTM-Testing applies the convolutional gate to reduce the sample size, which can lower the computational complexity during the training process. The convolution gate consists of convolution, pooling, and fully-connected operations.
The convolution operation in step 4 attempts to extract targeted features from the origination sample and filter out unnecessary noise by calculating the input sample S std with multiple vectors of weight w cv p,q,l and bias b cv p,q,l . The vector of weights and the bias are called a filter, and this filter represents feature extraction of the input sample. As shown in Figure 4, K filters are first selected in this convolutional operation, and each filter k contains a weight w cv,k p,q,l (P × Q × L matrix) and bias b cv,k i,j (I × J matrix). Then, the sample S std is convolutional calculated with each filter by equation (1), extracting a convoluted sample S cv i,j,k (I × J matrix): where d is the sliding distance (called stride) that a filter attempts to decrease the overlay extracting cells process.
(1), extracting a convoluted sample , , ( × matrix): , , = � � �( × + , × + , × , , , + , , , ) where d is the sliding distance (called stride) that a filter attempts to decrease the overlay extracting cells process.   Figure 5 demonstrates an example of different strides with S std (7 × 7 matrix) and W (7 × 7 matrix). The number of overlay cell in the feature extraction process with a stride of one is two, and the extracted S cv will be a 5 × 5 matrix. On the other hand, the number of overlay cells in the feature extraction process with a stride of one is one, and the extracted S cv will be a 3 × 3 matrix. The S cv therefore shrinks when the stride is set to two from one. The game tester can choose to decrease the size of S cv and lower the computational complexity with the trade-off of learning speed.
Appl. Sci. 2020, 10, x FOR PEER REVIEW 8 of 23 Figure 5 demonstrates an example of different strides with (7 × 7 matrix) and W (7 × 7 matrix). The number of overlay cell in the feature extraction process with a stride of one is two, and the extracted will be a 5 × 5 matrix. On the other hand, the number of overlay cells in the feature extraction process with a stride of one is one, and the extracted will be a 3 × 3 matrix. The therefore shrinks when the stride is set to two from one. The game tester can choose to decrease the size of and lower the computational complexity with the trade-off of learning speed.  Figure 6 shows an example of a convolution operation to transform a 7 × 7 array to a 3 × 3 array by 5 × 5 filter and stride = 2. At first, the 5 × 5 array in the top-left corner of the original array is picked up, multiply the filter (element by element), and fill it in the top-left corner of the new array ( Figure  6a). Figure 6b follows the same principle to fill the new array.     Figure 6 shows an example of a convolution operation to transform a 7 × 7 array to a 3 × 3 array by 5 × 5 filter and stride = 2. At first, the 5 × 5 array in the top-left corner of the original array is picked up, multiply the filter (element by element), and fill it in the top-left corner of the new array ( Figure  6a). Figure 6b follows the same principle to fill the new array. The Pooling operation in step 4 reduces the scale of by combining the data clusters into one single cluster. Figure 7 shows the pooling operation that first splits the into U samples, then divides each sample into different clusters, and finally combining the cluster into one value. Two combination methods, max-pooling and average-pooling, are usually adopted. The max-pooling The Pooling operation in step 4 reduces the scale of S cv by combining the data clusters into one single cluster. Figure 7 shows the pooling operation that first splits the S cv into U samples, then divides each sample into different clusters, and finally combining the cluster into one value. Two combination methods, max-pooling and average-pooling, are usually adopted. The max-pooling picks up the biggest value from the cluster, and the average-pooling calculates the average value of the cluster.        The fully-connected operation in step 6 aggregates information from that matters the most and then generates a short-term memory at time t, namely , as the input for step 7. Figure 9 presents the fully-connected operation that first flattens the to a one-dimensional array = [ 1 , … , ], and then executes the fully-connected calculation by Equation (2): where ＝[ 1 , … , ] is the short-term memory converted from this sample, ] is the bias vector. The fully-connected operation in step 6 aggregates information from S pl that matters the most and then generates a short-term memory at time t, namely sm t , as the input for step 7. Figure 9 presents the fully-connected operation that first flattens the S pl to a one-dimensional array C pl = c pl 1 , . . . , c pl r , and then executes the fully-connected calculation by Equation (2): where sm = [c 1 , . . . , c v ] is the short-term memory converted from this sample, Appl. Sci. 2020, 10, x FOR PEER REVIEW 10 of 23 Figure 9. Fully-connected operation.

Phase 3: Review by LSTM
In Phase 3, LSTM-Testing adopts long/short-term memory to emulate gamer learning behaviors. As steps 7~10 shown in the center part of Figure 1, LSTM-Testing stores the short-term memory into the short-term memory pool for the following training. The training strategy is the learning strategy that what gameplay decisions are kept in the short memory queue for periodical reviewing and, on the other hand, what gameplay decisions are unnecessary and dropped from the short memory queue. The learning strategy designed in this study is reviewing the newest decision and

Phase 3: Review by LSTM
In Phase 3, LSTM-Testing adopts long/short-term memory to emulate gamer learning behaviors. As steps 7~10 shown in the center part of Figure 1, LSTM-Testing stores the short-term memory sm t into the short-term memory pool SM for the following training. The training strategy is the learning strategy that what gameplay decisions are kept in the short memory queue for periodical reviewing and, on the other hand, what gameplay decisions are unnecessary and dropped from the short memory queue. The learning strategy designed in this study is reviewing the newest decision and drop the older ones. That is, the decisions made are kept in the short memory queue and pick up periodically for training, and the oldest decision will be dropped when the short memory queue is full. As shown in Figure 10, the former is controlled by the forget gate F f t ( ), and the latter is controlled by the training gate F tr ( ).

Phase 3: Review by LSTM
In Phase 3, LSTM-Testing adopts long/short-term memory to emulate gamer learning behaviors. As steps 7~10 shown in the center part of Figure 1, LSTM-Testing stores the short-term memory into the short-term memory pool for the following training. The training strategy is the learning strategy that what gameplay decisions are kept in the short memory queue for periodical reviewing and, on the other hand, what gameplay decisions are unnecessary and dropped from the short memory queue. The learning strategy designed in this study is reviewing the newest decision and drop the older ones. That is, the decisions made are kept in the short memory queue and pick up periodically for training, and the oldest decision will be dropped when the short memory queue is full. As shown in Figure 10, the former is controlled by the forget gate ( ), and the latter is controlled by the training gate ( ). The forget gate manages samples in the short-term memory pool. That is, the forget gate determines which sample is valuable and should stay in the pool for training gate, and drops the sample when it is no longer usable, according to ( ) as shown in Figure 11. The forget gate manages samples in the short-term memory pool. That is, the forget gate determines which sample is valuable and should stay in the pool for training gate, and drops the sample when it is no longer usable, according to F f t ( ) as shown in Figure 11. The training gate applies a stochastic approximation method, which interactively loads samples in the short-term memory and then updates the long-term memory optimizing a differentiable objective according to ( ). As shown in Figure 12, in this study, ( ) applies Adaptive Moment Estimation [10] that evaluates averages of both the gradients and the second moments of the gradients to iteratively optimizing the objective function. The training gate applies a stochastic approximation method, which interactively loads samples in the short-term memory and then updates the long-term memory LM = by optimizing a differentiable objective according to F tr ( ). As shown in Figure 12, in this study, F tr ( ) applies Adaptive Moment Estimation [10] that evaluates averages of both the gradients and the second moments of the gradients to iteratively optimizing the objective function. Figure 11. Algorithm of ( ).
The training gate applies a stochastic approximation method, which interactively loads samples in the short-term memory and then updates the long-term memory optimizing a differentiable objective according to ( ). As shown in Figure 12, in this study, ( ) applies Adaptive Moment Estimation [10] that evaluates averages of both the gradients and the second moments of the gradients to iteratively optimizing the objective function. Then, step 10 calculates the probability array = [ 1 , … , ] to demonstrate the probability of z possible gameplays to complete the game. The calculation is defined by: where ＝[ 1 , … , ] is the bias vector.

Phase 4: Action and Summarize the Testing
Step 11 picks up the action with the highest probability from the result A of the output gate. LSTM-Testing then triggers the corresponding keyboard/mouse event and go back to step 1 iteratively. After acquiring sufficient testing results according to the testing plan, LSTM-Testing then where B lm = b lm 1 , . . . , b lm z is the bias vector.

Phase 4: Action and Summarize the Testing
Step 11 picks up the action with the highest probability from the result A of the output gate. LSTM-Testing then triggers the corresponding keyboard/mouse event and go back to step 1 iteratively. After acquiring sufficient testing results according to the testing plan, LSTM-Testing then summarizes "learning time to complete a game" as the dependent variable, "game design" as the control variable.

The LSTM-Testing Parameter Setting in This Study
The design of these logic gates can be adjusted according to the complexity of games. The more complex the gates, the better performance it achieves. However, the training time and computation time is getting higher for a deeper network, the game tester management has to gauge the trade-off between performance and computation-cost.
The LSTM-Testing settings for the following testing case in this study are presented in Figure 13. LSTM-Testing first screenshots each gameplay as the input, and it uses 4 continuous screenshots as a sample of combination gameplays. The standard input size is a 3D array (80 × 80 × 4) for the following steps. LSTM-Testing adopts two combinations of convolution and subsampling for feature extraction. The Adaptive Moment Estimation then calculates the probabilities of possible gameplay actions. Then LSTM-Testing chooses the action with the highest probability to complete the game. The size of short-term memory is 50,000, the training gate F tr ( ) randomly selects 128 samples from the short-term memory for training, and the forgetting gate F f t ( ) drops the oldest sample in the short-term memory when the queue of short-term memory is full. as a sample of combination gameplays. The standard input size is a 3D array (80 × 80 × 4) for the following steps. LSTM-Testing adopts two combinations of convolution and subsampling for feature extraction. The Adaptive Moment Estimation then calculates the probabilities of possible gameplay actions. Then LSTM-Testing chooses the action with the highest probability to complete the game. The size of short-term memory is 50,000, the training gate ( ) randomly selects 128 samples from the short-term memory for training, and the forgetting gate ( ) drops the oldest sample in the short-term memory when the queue of short-term memory is full. Figure 13. LSTM-Testing setting for the following cases in this study.

Environment Settings
This section uses a classic and popular casual game, i.e., Pacman [11], as a case study to ensure that LSTM-Testing can replace manual testing to a certain extent. Pacman is a classic and popular casual game developed by Namco Bandai Games Inc. in 1980. Three different difficulties of mazes shown in Figure 14 are applied in this evaluation. The testing results between LSTM-Testing and manual testing are compared by using learning time as the control variable, game design as the independent variable, and time to complete game as the dependent variable.

Environment Settings
This section uses a classic and popular casual game, i.e., Pacman [11], as a case study to ensure that LSTM-Testing can replace manual testing to a certain extent. Pacman is a classic and popular casual game developed by Namco Bandai Games Inc. in 1980. Three different difficulties of mazes shown in Figure 14 are applied in this evaluation. The testing results between LSTM-Testing and manual testing are compared by using learning time as the control variable, game design as the independent variable, and time to complete game as the dependent variable. The feedback setting is (1)-1 point when Pacman does not eat a bean in a gameplay, (2)-10 points when Pacman eats a bean, and (3)-10 points when Pacman is hunt by a monster. Each maze is tested by LSTM-Testing for 150,000 games including the games that eating all beans or failing to eat all beans.
Regarding the manual testing, each maze is tested by three testers who never play Pacman, and each tester has to finish 150 games including the games that eating all beans or failing to eat all beans. First, considering the order of difficulty of three maze types, as shown in Figure 15a, On the other hand, manual-testing enters the steady-state to eat all beans at about 80, 100, 120 games for the "maze with 2 holes", "maze with 1 hole", and "maze with no hole". The order of the maze difficulty The feedback setting is (1)-1 point when Pacman does not eat a bean in a gameplay, (2)-10 points when Pacman eats a bean, and (3)-10 points when Pacman is hunt by a monster. Each maze is tested by LSTM-Testing for 150,000 games including the games that eating all beans or failing to eat all beans.

Experiment Results
Regarding the manual testing, each maze is tested by three testers who never play Pacman, and each tester has to finish 150 games including the games that eating all beans or failing to eat all beans. On the other hand, considering the learning rate of the casual player to the different mazes, we design a Game Learning Rate Similarity algorithm (GLRS) to evaluate the similarity between the results of manual-testing and that of LSTM-Testing. Figure 16 shows that GLRS consists of three steps. First, GLRS normalizes the testing time index of Manual-Testing and LSTM-Testing data sets to the range 0 to 1, since their testing times are quite different. Second, GLRS selects k time-series samples with the same interval from Manual-Testing and LSTM-Testing, respectively, and to build corresponding vectors to calculate similarity. In step 3, GLRS applies cosine similarity method [12] to First, considering the order of difficulty of three maze types, as shown in Figure 15a, On the other hand, manual-testing enters the steady-state to eat all beans at about 80, 100, 120 games for the "maze with 2 holes", "maze with 1 hole", and "maze with no hole". The order of the maze difficulty is, from easy to difficult: "maze with 2 holes", "maze with 1 hole", and "maze with no hole". Figure 15b demonstrates that LSTM-Testing enters the steady-state to eat all beans at about 40,000, 60,000, 80,000 games for the "maze with 2 holes", "maze with 1 hole", and "maze with no hole". This means that LSTM-Testing suggests that the order of the maze difficulty is, from easy to difficult: "maze with 2 holes", "maze with 1 hole", and "maze with no hole". Therefore, the result of LSTM-testing shows the order of the maze difficulty matches that of manual-testing.

Experiment Results
On the other hand, considering the learning rate of the casual player to the different mazes, we design a Game Learning Rate Similarity algorithm (GLRS) to evaluate the similarity between the results of manual-testing and that of LSTM-Testing. Figure 16 shows that GLRS consists of three steps. First, GLRS normalizes the testing time index of Manual-Testing and LSTM-Testing data sets to the range 0 to 1, since their testing times are quite different. Second, GLRS selects k time-series samples with the same interval from Manual-Testing and LSTM-Testing, respectively, and to build corresponding vectors to calculate similarity. In step 3, GLRS applies cosine similarity method [12] to compare the testing result between Manual-Testing and LSTM-Testing. Cosine similarity is widely applied to estimate the similarity between two non-zero vectors by measuring the cosine of the angle between the two vectors. The cosine of 0 • means the similarity is 100% of the two vectors, and it is less than 100% for any angle in the interval (0, π] radians. As the result calculated by GLRS, the game learning rate similarity of Manual-Testing and LSTM-Testing is 97.60%. We also provide a case study to show how LSTM-Testing to explore an unnecessary trap of a game design.
Four case studies in the following subsections demonstrate how to apply LSTM-Testing to evaluate the reasonability of the game leveling.

Ensure Reasonable Leveling Path Design
This subsection described how LSTM-Testing evaluates the reasonability of leveling between From the above evaluation, the experiment results support the argument that LSTM-Testing can replace manual testing to a certain extent.

Implications of LSTM-Testing
Game leveling provides new challenges and difficulty to keep player's interest high in a game. However, manually testing different leveling designs is difficult due to two reasons. First, we cannot stop game testers spontaneously learning higher gameplay skills, and their skill is getting better in-game testing procedures. This leads that a high-skilled tester, from a beginner's viewpoint, hardly objectively evaluates the reasonability of leveling from easy to hard. The second reason is the game tester may feel paralyzed about the game leveling after numbers of game testing. This causes the tester may think the leveling is indifferent and boring.
In order to overcome the problem of manual-testing mentioned above, LSTM-Test can evaluate different game leveling design by using the trained model in different levels as the control variable, game design as the independent variable, and time to complete game as the dependent variable. This means LSTM-Testing can force the AI to stop learning how to play the game in order to acquire objective and quantitative testing data between different levels. LSTM-Testing can also emulate the game players with varying game skills to acquire corresponding testing data in different game leveling path. Therefore, the game designer can construct a reasonable leveling path and guide the player towards the goal of the level and preventing confusion, idling, and boring, which may cause the player to leave the game.
We also provide a case study to show how LSTM-Testing to explore an unnecessary trap of a game design.
Four case studies in the following subsections demonstrate how to apply LSTM-Testing to evaluate the reasonability of the game leveling.

Ensure Reasonable Leveling Path Design
This subsection described how LSTM-Testing evaluates the reasonability of leveling between mazes with a different number of holes.
As Section 4 shows that order of the maze difficulty, LSTM-Testing then explores the reasonable leveling path from one maze to another maze by using a trained model in the different maze as a control variable, game design as an independent variable, and time to complete game as a dependent variable. Therefore, the test cases are: → Leveling Path A: from "maze with 2 holes" to "maze with 1 hole". LSTM-Testing first load the trained model which can eat all beans in "maze with 2 holes". Then, LSTM-Testing applies this trained model to play the "maze with 1 hole" and observes the training time to eat all beans. → Leveling Path B: from "maze with 2 holes" to "maze with no hole". LSTM-Testing first load the trained model which can eat all beans in "maze with 2 holes". Then, LSTM-Testing applies this trained model to play the "maze with no hole" and observes the training time to eat all beans. → Leveling Path C: from "maze with 1 hole" to "maze with no hole". LSTM-Testing first load the trained model which can eat all beans in "maze with 1 hole". Then, LSTM-Testing applies this trained model to play the "maze with no hole" and observes the training time to eat all beans.
The evaluation result is shown in Figure 17. The training time of leveling path A (orange line), B (blue line), and C (green line) are about 2000 games, 7000 games, and 5000 games, respectively. It should be noted that the leveling path A and C own high percent, 60% and 80% respectively, to eat all beans at the beginning of the training, compared to path B cannot eat most beans at the beginning of the training. These phenomena indicate that: Testing can acquire a high percent to eat all beans at the beginning of the training.
Third, path C requires fewer games than paths A and B is due to that the player can apply his learned skill, i.e., not using "hole" to dodge ghosts, to complete the game without hole of next level. However, the player of the path B and C confronts difficulties to complete the game with one or no hole because he/she needs a hole to doge ghost that is learned from the previous level. Furthermore, path A requires fewer games than path B is because the player of the path A can still use the hole to dodge the ghost and the player of the path B cannot use the hole to doge the ghost.

Validate Game Difficulty Is Proportional to Level Scales
This subsection described how LSTM-Testing validates the argument that the maze difficulty is proportional to the maze size for the first-time gamer. The argument origins from that some game testers suggest that the bigger the size the maze is, the more beans this maze has, and the higher difficulty for the beginner to eat all beans. On the other hand, some game testers argue that the bigger size the maze is, the easier the beginner can dodge the chasing of the monster. However, there is still no sufficient quantitative testing results, to the best of our knowledge, to validate this argument.
The testing result is demonstrated in Figure 18, where three different maze sizes, i.e., small maze (3 rows with 9 columns), median maze (5 rows with 11 columns), and large maze median maze (7 rows with 11 columns), are applied in this testing as shown in Figure 19. The training time of small maze, median maze, and large maze to eat all beans are about 4000 games, 5000 games, and 16,000 First, the leveling path C might be a good leveling design that provides new challenges and difficulty to keep player's interest high in this maze and also guides the gamer to develop a higher skill to eat all beans.
Second, the leveling path A and C might be too easy and unnecessary for the gamer, since LSTM-Testing can acquire a high percent to eat all beans at the beginning of the training.
Third, path C requires fewer games than paths A and B is due to that the player can apply his learned skill, i.e., not using "hole" to dodge ghosts, to complete the game without hole of next level. However, the player of the path B and C confronts difficulties to complete the game with one or no hole because he/she needs a hole to doge ghost that is learned from the previous level. Furthermore, path A requires fewer games than path B is because the player of the path A can still use the hole to dodge the ghost and the player of the path B cannot use the hole to doge the ghost.

Validate Game Difficulty Is Proportional to Level Scales
This subsection described how LSTM-Testing validates the argument that the maze difficulty is proportional to the maze size for the first-time gamer. The argument origins from that some game testers suggest that the bigger the size the maze is, the more beans this maze has, and the higher difficulty for the beginner to eat all beans. On the other hand, some game testers argue that the bigger size the maze is, the easier the beginner can dodge the chasing of the monster. However, there is still no sufficient quantitative testing results, to the best of our knowledge, to validate this argument.
The testing result is demonstrated in Figure 18, where three different maze sizes, i.e., small maze (3 rows with 9 columns), median maze (5 rows with 11 columns), and large maze median maze (7 rows with 11 columns), are applied in this testing as shown in Figure 19. The training time of small maze, median maze, and large maze to eat all beans are about 4000 games, 5000 games, and 16,000 games, respectively. This result supports the argument that the bigger maze size is, the more difficult the maze has. It also shows that the maze difficulty might be proportional to the maze size to a certain extent for the Pacman game. However, it should be noted that the difficulty of a game depends on the game design itself and may not be the proportional to the size of other games' playground.
games, respectively. This result supports the argument that the bigger maze size is, the more difficult the maze has. It also shows that the maze difficulty might be proportional to the maze size to a certain extent for the Pacman game. However, it should be noted that the difficulty of a game depends on the game design itself and may not be the proportional to the size of other games' playground.

The Leveling of the Snake Game
In this experiment, we evaluate the leveling of the Snake Game [13]. The basic Snake game is that a sole player attempts to eat beans by running into them with the head of the snake, as shown in Figure 20. Each bean eaten makes the snake longer, so controlling is progressively more difficult. The increasing difficulty of eating one more bean is evaluated by LSTM-Testing in this subsection.
LSTM-Testing uses AI learning time as a dependent variable to quantitatively evaluate the Snake Game leveling design. That is, we attempt to observe how many games when the LSTM-Testing has learned to eat one bean, two beans, and three beans, respectively. Figure 21 demonstrates that LSTM-Testing needs about 10,000 games to learn how to eat the first bean, given that success rate for 80% means LSTM-Testing has learned the skill. Then, LSTM-Testing needs about 20,000 games to learn to eat two beans, and that about 50,000 games to eat three beans. Therefore, this evaluation result can Appl. Sci. 2020, 10, x FOR PEER REVIEW 18 of 23 games, respectively. This result supports the argument that the bigger maze size is, the more difficult the maze has. It also shows that the maze difficulty might be proportional to the maze size to a certain extent for the Pacman game. However, it should be noted that the difficulty of a game depends on the game design itself and may not be the proportional to the size of other games' playground.

The Leveling of the Snake Game
In this experiment, we evaluate the leveling of the Snake Game [13]. The basic Snake game is that a sole player attempts to eat beans by running into them with the head of the snake, as shown in Figure 20. Each bean eaten makes the snake longer, so controlling is progressively more difficult. The increasing difficulty of eating one more bean is evaluated by LSTM-Testing in this subsection.
LSTM-Testing uses AI learning time as a dependent variable to quantitatively evaluate the Snake Game leveling design. That is, we attempt to observe how many games when the LSTM-Testing has learned to eat one bean, two beans, and three beans, respectively. Figure 21 demonstrates that LSTM-Testing needs about 10,000 games to learn how to eat the first bean, given that success rate for 80% means LSTM-Testing has learned the skill. Then, LSTM-Testing needs about 20,000 games to learn to eat two beans, and that about 50,000 games to eat three beans. Therefore, this evaluation result can

The Leveling of the Snake Game
In this experiment, we evaluate the leveling of the Snake Game [13]. The basic Snake game is that a sole player attempts to eat beans by running into them with the head of the snake, as shown in Figure 20. Each bean eaten makes the snake longer, so controlling is progressively more difficult. The increasing difficulty of eating one more bean is evaluated by LSTM-Testing in this subsection.
Appl. Sci. 2020, 10, x FOR PEER REVIEW 19 of 23 make a hypothesis that each leveling of the Snake Game is two times of difficulty comparing the previous level. LSTM-Testing uses AI learning time as a dependent variable to quantitatively evaluate the Snake Game leveling design. That is, we attempt to observe how many games when the LSTM-Testing has learned to eat one bean, two beans, and three beans, respectively. Figure 21 demonstrates that LSTM-Testing needs about 10,000 games to learn how to eat the first bean, given that success rate for 80% means LSTM-Testing has learned the skill. Then, LSTM-Testing needs about 20,000 games to learn to eat two beans, and that about 50,000 games to eat three beans. Therefore, this evaluation result can make a hypothesis that each leveling of the Snake Game is two times of difficulty comparing the previous level.
Appl. Sci. 2020, 10, x FOR PEER REVIEW 19 of 23 make a hypothesis that each leveling of the Snake Game is two times of difficulty comparing the previous level.

The Impact of Picture Background on the Game Difficulty: Encounter Unnecessary Trap
We design an experiment to observe the impact of picture background on game difficulty. The motive is that using pictures as the game background will increase the game difficulty because the object of the game will be obscured by the background and cause mistake of gameplay. However, there is no sufficient quantitative analysis to evaluate how much game difficulty a picture background increases. This experiment, therefore, uses the ping-pong game as a case study to observe the scale of rising difficulty of a picture background by using LSTM-Testing.
The experiment, as shown in Figure 22, uses a simple Ping-pong game without a picture background as the control group (as shown in Figure 22a), and uses a simple ping-pong game with a picture background as the experiment group (as shown in Figure 22b). Then, we observe how many balls coming down when the LSTM-Testing has learned to return the ball.

The Impact of Picture Background on the Game Difficulty: Encounter Unnecessary Trap
We design an experiment to observe the impact of picture background on game difficulty. The motive is that using pictures as the game background will increase the game difficulty because the object of the game will be obscured by the background and cause mistake of gameplay. However, there is no sufficient quantitative analysis to evaluate how much game difficulty a picture background increases. This experiment, therefore, uses the ping-pong game as a case study to observe the scale of rising difficulty of a picture background by using LSTM-Testing.
The experiment, as shown in Figure 22, uses a simple Ping-pong game without a picture background as the control group (as shown in Figure 22a), and uses a simple ping-pong game with a picture background as the experiment group (as shown in Figure 22b). Then, we observe how many balls coming down when the LSTM-Testing has learned to return the ball. Figure 23 demonstrates that the game without picture background needs about 35,000 balls to learn how to return the ball, given that the LSTM-Testing reaches 90% percent to successfully return the balls. However, the game with picture background needs about 43,000 balls to learn how to return the ball, which means the picture used in this test increases by 15% (43,000/35,000) difficulty. This is because the LSTM-Testing encounter an unnecessary trap that LSTM-testing cannot identify the white ball when this ball passes the playground with light color, as shown the position of ball near the nose of the dog. Therefore, this evaluation result can make a hypothesis that a game with a picture background does increase the game difficulty.  Figure 23 demonstrates that the game without picture background needs about 35,000 balls to learn how to return the ball, given that the LSTM-Testing reaches 90% percent to successfully return the balls. However, the game with picture background needs about 43,000 balls to learn how to return the ball, which means the picture used in this test increases by 15% (43,000/35,000) difficulty. This is because the LSTM-Testing encounter an unnecessary trap that LSTM-testing cannot identify the white ball when this ball passes the playground with light color, as shown the position of ball near the nose of the dog. Therefore, this evaluation result can make a hypothesis that a game with a picture background does increase the game difficulty.  This experiment further evaluates the effect of learning strategy on returning a ball. The learning strategy means what gameplay decisions are kept in the short memory queue for periodical reviewing and, on the other hand, what gameplay decisions are dropped from the short memory queue. Three review strategies are evaluated in this experiment including (1) reviewing the newest decision which is the default setting, (2) reviewing the wrong decision, and (3) reviewing the correct decision. Reviewing the newest decision means the decisions made are kept in the short memory queue, and the oldest decision will be dropped when the short memory queue is full. Reviewing the wrong decision means the keeps the wrong decisions, which LSTM-testing makes, in the short-term memory queue, and drops the right decisions to return the ball when the queue is full. In contrast to reviewing the wrong decisions, reviewing the correct decision is to keep the right decisions in the queue and drop the wrong decisions.
The effect of learning strategy to return the ball, as shown in Table 1, shows that reviewing the wrong decision is the most effective strategy than other strategies for the ping-pong game. The strategy of reviewing the wrong decision requires 33,000 times of trainings, which is the fewest training iterations to have 90% successfully return the ball. This is because, for the ping-pong game, the width of the bar to return the ball is shorter than the width of the bottom of the gameplay ground. This leading that, in a random fresh, the opportunities to make the correct decision to move the bar and catch/return the ball is fewer than that to lose the ball. The LSTM-testing then does not have sufficient learning opportunities since the short-term memory does not have enough samples of correct decisions.

Review Correct Decision
Training iterations to reach 90% of successfully return ball (unit: k) 35 33 42

Conclusions
The proposed LSTM-Testing is capable to objectively and quantitatively evaluate the reasonability of casual game design. The experiment result shows that the game learning rate similarity of Manual-Testing and LSTM-Testing is 97.60%, which supports that LSTM-Testing can replace manual testing to a certain extent. As the experiment demonstrated in Sections 4-6, the most advantages of LSTM-Testing compared to the manual testing are: (1) to quantitatively evaluate the difficulty of different game design, (2) to ensure reasonable leveling path design, and (3) to validate the game difficulty is proportional to level scales.
Currently, LSTM-Testing is only applied to test the casual games. This study will extend the proposed LSTM-Testing to hard-core game testing and strategy training for Decision Makings of Project Management [13]. On the other hand, however, the effectiveness of AI Software Testing procedure for the multiplayer game is still required.