Feature Extraction for StarCraft II League Prediction

Abstract: In a player-versus-player game such as StarCraft II, it is important to match players with others of similar skill. Previous studies modeled player skills, achieving accuracies of 47.3% and 61.7%. To improve on this performance, we collected 46,398 replays and compared features extracted from six different sections of the replays. Through a comparison of the six datasets we created, we propose a method for extracting features from a single replay. We compare two algorithms, k-Nearest Neighbors and Random Forest, which are the most commonly used in related studies. Our method achieved an accuracy of 75.3%, outperforming previous works. Although no direct comparison has been made with the current system, we conclude that our approach can replace the five rounds of placement games.


Introduction
Commercially successful games tend to provide environments large enough to hold a sufficient number of players to compete [1]. For competitive games such as StarCraft II, the playing skill of each player should be taken into consideration as well as the game content [2]. If two competing players differ greatly in skill, one will win almost every match, and the losing player will soon lose interest. It is therefore essential to determine each player's skill accurately in order to match players of similar skill [3]. The main system used in the original game is based on ratings. Rating systems have been employed to implement pairwise assessments of groups and player-versus-player matchmaking. Each player has a rating that changes depending on match results and is placed into one of several groups based on that rating [4].
StarCraft II is a real-time strategy game in which players execute their actions in real time. Players employ a wide range of strategies to build their base, gather resources and units, and defeat opponents. Due to the variety of maps, three races, and diverse buildings and units, there is an effectively infinite number of possible in-game situations. Additionally, since players can choose among a large variety of actions at each moment, StarCraft II data have a high level of complexity. This makes the game an attractive testbed for machine learning algorithms [5].
In StarCraft II, players are distributed into seven different leagues according to their ratings. To place players into appropriate leagues, assigning each player a proper rating is essential; however, ratings are inaccurate after only a few rounds of play. The current ladder system lets each player start with a default rating. Based on the results of the first five placement games, the system gives players a provisional rating and league. Ratings vary more during the placement games than in later matches. The weakness of this system is that even if a professional player wins all of the placement games, he or she would still not be placed in the highest league right away. Beginners likewise suffer from being placed in leagues that require more adeptness than they currently have [6]. As a result, each player needs at least five rounds to be assigned to an appropriate league.
This research proposes predicting each player's appropriate league by extracting data from replays. We selected 14 different features and extracted them from each replay. Since we use the average value of each feature, the choice of extraction section affects performance. We generated datasets from six different sections and performed comparative testing on them. Two machine learning algorithms were applied to classify leagues with the extracted data.

Related Work
A game replay records the game log and allows players to review past games. Through replays, human data of high complexity can be used instead of data from simulations. StarCraft's game data are useful for testing algorithms because they have higher complexity than those of other games [7]. Several studies provide game data by constructing datasets with information extracted from replays [7][8][9].
The most representative task using StarCraft data is strategy prediction [10][11][12][13]. Ben G. Weber and Michael Mateas illustrate a data mining methodology for modeling the opponent's strategy [13]. Their method encodes game logs as feature vectors containing unit and building production information. They developed a model to predict the opponent's behavior by analyzing the extracted features. If the opponent's strategy is predicted, an AI can improve its performance by employing an appropriate counter-strategy. Similarly, some studies design mathematical models to predict the winner [14,15]. These studies have shown that such models can help predict the outcome of the game.
There are also multiple studies focused on player modeling. Siming Liu et al. [16] recognized players through extracted features and Random Forest. T. Avontuur et al. [6] developed a model that predicts skills based on data collected during the early portions of the game. This work is the most similar study to ours, showing an accuracy of 47.3%.
The key features of their model were related to the behavior and control of the players. We referred to this result when selecting features. Yinheng Chen et al. [17] showed a better accuracy of 61.7% using macro features related to economic performance. The disparities in player skill among the leagues have been investigated by Thompson et al. [18], who reported that behavior differs depending on the player's league. Based on these studies, we inferred that control-related features could identify the leagues that players belong to.

Data Collection
The first step was to collect game records of various players in each league. StarCraft II provides records of previous games through replays, which allow users to review all actions and results that occurred in a game. We collected 46,398 replays from Spawning Tool, a StarCraft II online community. Since each replay yields only one instance per player, a large number of replays is required. The collected replays include multiple types of games, including some against AI. We kept only one-versus-one games between two human players and excluded matches in which the players' leagues differed by two or more levels: when skills differ greatly, the game can become one-sided, so the extracted features may not capture the players' characteristics. The details of the dataset for each league are shown in Table 1.

A parsing process is necessary to extract information from StarCraft II replay files. We used the sc2reader Python library to parse replays and obtain game logs. Sc2reader provides details about the players' actions and in-game events per frame. We extract 14 types of features through sc2reader, mostly associated with player controls. The first feature is camera switching. The player can only see a fraction of the entire map through the camera, so moving the camera is essential for following a game that unfolds in several places simultaneously. Oriol Vinyals et al. [19] found that the camera interface affects the performance of an agent. We therefore expect the number of camera movements to reflect human skill and to be an important feature. The second is actions (APM), which is used as a key feature in player modeling research [6,20]. Train and Build are the numbers of times the player ordered units to be trained or buildings to be constructed, which relate to the consumption of resources. Control groups contribute 4 features: setting (ctrl + #), using (#), adding (shift + ctrl + #), and the number of control groups used.
Commands contribute 5 features: basic, targeted at a unit, targeted at a point, update unit, and update point. The last feature is the player's race, which changes the choice of units and buildings available to the player. All features other than race are averaged per second over each extraction section and min-max scaled to the range [1, 10].
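As an illustration of the per-second averaging and scaling step, the sketch below assumes the event-type names have already been collected from a parsed replay (with sc2reader, roughly `[type(e).__name__ for e in replay.events]`); the function names and record format here are ours, not part of sc2reader.

```python
from collections import Counter

def per_second_rates(event_names, duration_s):
    """Average each event type per second over one extraction section.

    `event_names` is a flat list of event-type names recorded for one
    player during the section; `duration_s` is the section length.
    """
    counts = Counter(event_names)
    return {name: n / duration_s for name, n in counts.items()}

def minmax_scale(values, lo=1.0, hi=10.0):
    """Scale raw feature values into the range [lo, hi]."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:                       # avoid division by zero
        return [lo for _ in values]
    return [lo + (hi - lo) * (v - vmin) / (vmax - vmin) for v in values]
```

In practice one such rate vector is computed per player per section, and scaling is fitted over the whole dataset.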
We extracted the features from six different sections of the game. D1 is data during combat. The results of combat affect the flow and outcome of the game [15]. To extract data during combat, it is necessary to define the start and end of a combat. In this paper, combat is defined as follows. First, units that do not participate in combat, such as workers and Larvae, are excluded. If the difference between the times at which two units die is less than or equal to 3 s, the units are deemed killed in the same combat. Furthermore, any combat shorter than 10 s is excluded. The average combat length across all replays is 18 s, so we create two further datasets that are 18 s long: D2 and D3 are extracted for a duration of 18 s from the beginning and the end of the game, respectively. In contrast, D4 and D5 are extracted in the same manner for 5 min. Finally, D6 is created by extracting the entire game. The description of each dataset is given in Table 2.
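The combat-definition rules above (deaths at most 3 s apart belong to the same combat; combats shorter than 10 s are discarded) can be sketched as follows. The function and parameter names are ours, for illustration only.

```python
def find_combats(death_times, same_combat_gap=3.0, min_length=10.0):
    """Group unit-death timestamps (in seconds) into combat intervals.

    Deaths separated by at most `same_combat_gap` seconds are deemed part
    of the same combat; combats shorter than `min_length` seconds are
    dropped. Non-combat units (workers, Larvae) are assumed to have been
    filtered out already.
    """
    combats = []
    start = prev = None
    for t in sorted(death_times):
        if start is None:
            start = prev = t
        elif t - prev <= same_combat_gap:
            prev = t                       # still the same combat
        else:                              # gap too large: close the combat
            if prev - start >= min_length:
                combats.append((start, prev))
            start = prev = t
    if start is not None and prev - start >= min_length:
        combats.append((start, prev))
    return combats
```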

Evaluation
Two algorithms are used to classify the extracted data in this paper. The first is k-Nearest Neighbors (k-NN), which classifies test data according to the k nearest training samples in the feature space. We chose k-NN because it is commonly used in data mining due to its simplicity and high performance [21]. The parameter k is set to 100. The other algorithm is Random Forest, an ensemble method that combines multiple decision trees. Random Forest has shown outstanding performance in player identification research, so we use the same setting with 100 random trees.
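A minimal sketch of the two classifiers with these settings (k = 100 neighbors, 100 random trees), using synthetic data in place of the replay features; the dataset shape and variable names are illustrative, not the paper's.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in: 7 classes (leagues), 13 numeric features.
X, y = make_classification(n_samples=2000, n_features=13, n_classes=7,
                           n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=100).fit(X_train, y_train)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

print("k-NN accuracy:", knn.score(X_test, y_test))
print("Random Forest accuracy:", rf.score(X_test, y_test))
```

Note that k = 100 only makes sense with substantially more than 100 training samples per class, as is the case with the paper's 46,398 replays.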
To compare the performance of the algorithms in detail, we calculate accuracy, precision, recall, and F1-score. Each mathematical definition is as follows:

Accuracy = (TP + TN)/(TP + TN + FP + FN) (1)

Precision = TP/(TP + FP) (2)

Recall = TP/(TP + FN) (3)

F1-score = 2 * Precision * Recall/(Precision + Recall) (4)

TP (true positive) means that both the predicted and the actual class are positive; TN (true negative) means that both are negative. FP (false positive) and FN (false negative) mean that the prediction and the actual class do not match.
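These definitions translate directly into code. The sketch below handles the binary case; for the seven-league task the metrics would be computed per class and averaged.

```python
def binary_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1 from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1
```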

Results
We applied both algorithms to all datasets for league prediction. Table 3 compares the six datasets' performances for the two algorithms. D6 shows the best performance in all evaluations; however, D4, D5, and D6 show similar results. A paired t-test conducted for detailed analysis finds no significant difference among them (p > 0.05). D1, D2, and D3 also show similar performance (t-test, p > 0.05). Based on this evaluation, we grouped the datasets into two groups of similar performance (D1 + D2 + D3 and D4 + D5 + D6).

Table 4 shows the performance of the two groups in more detail. The second group is higher in all evaluations. In particular, its precision in Bronze is 0.25 higher, an outstanding 0.91. The t-test also shows that the difference between the groups is significant (p < 0.001). These results confirm that each group is a set of datasets with similar performance and that the performance difference between the two groups is significant. Comparing with Table 2, the two groups are distinguished by the length of the extraction section: D1, D2, and D3 are extracted from different short sections of 18 s, whereas D4, D5, and D6 are extracted from longer sections. There is no performance difference between sections of similar length, but between sections of different lengths, longer sections perform better. We therefore conclude, first, that the features we used are not influenced by the timing of the extraction section: the three short sections we compared, including combat, which contains important moments of the game, showed similar performance. Second, the extraction section should be longer to achieve better performance.
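The paired t-test used above is a standard one; it can be reproduced with `scipy.stats.ttest_rel`. The accuracy values below are synthetic illustrations, not the paper's measurements.

```python
from scipy.stats import ttest_rel

# Hypothetical per-fold accuracies for three datasets (illustrative only).
acc_a = [0.70, 0.72, 0.71, 0.73, 0.69]   # one dataset
acc_b = [0.71, 0.71, 0.72, 0.72, 0.70]   # similar-performing dataset
acc_c = [0.60, 0.61, 0.62, 0.60, 0.59]   # clearly weaker dataset

_, p_similar = ttest_rel(acc_a, acc_b)      # expect no significant difference
_, p_different = ttest_rel(acc_a, acc_c)    # expect a significant difference
```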
This experiment compares the two algorithms using D6, which showed the best performance. Figures 1 and 2 show the confusion matrix of each algorithm. Both algorithms classify all leagues properly, and even when they misclassify, the predicted league is close to the correct one. If an error of one league level is allowed, all leagues except Bronze show an accuracy of about 90%. This means that our results can place players into appropriate leagues. The two algorithms are compared in detail based on accuracy, precision, recall, and F1-score. As shown in Table 5, k-NN outperforms in the Bronze class with a precision of 96%, indicating that a player classified as Bronze by k-NN is correct with 96% likelihood. Random Forest shows 90% precision in Bronze. In the other classes, Random Forest achieves higher precision than k-NN; although k-NN shows outstanding precision in Bronze, Random Forest has better overall precision. In recall, Random Forest is better than k-NN except in the Gold class. In contrast to precision, both algorithms show low recall in Bronze. This high-precision, low-recall problem in Bronze is apparently due to its smaller amount of data: Table 2 shows that the other leagues contain between 4 and 40 times as much data as Bronze. The F1-score of Random Forest is higher than that of k-NN in all leagues. Based on these evaluation metrics, Random Forest performs better than k-NN with our features.
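The one-league tolerance described above can be computed from a confusion matrix by also counting the diagonals immediately off the main one; the function below is an illustrative sketch, not the paper's code.

```python
def within_one_accuracy(cm):
    """Accuracy when predictions one league off also count as correct.

    `cm` is a confusion matrix: cm[i][j] is the number of players of
    actual league i predicted as league j, with leagues ordered by skill.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][j] for i in range(n) for j in range(n)
                  if abs(i - j) <= 1)
    return correct / total
```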

Conclusions
In this paper, we propose a method to predict a player's skill from a single game replay. In an experiment with six sections of different timing and length, the data from the entire game showed the best performance. From this result, we observe that sustained control has a greater impact on measured skill than instantaneous control. Comparing the two algorithms, k-NN achieves an accuracy of 72.7% and Random Forest 75.3%. Our results improve on previous studies, which reported 47.3% [6] and 61.7% [17]. Random Forest shows better performance in accuracy, precision, recall, and F1-score. We therefore suggest extracting features from the entire game and using Random Forest for classification.
To use a large number of replays for training, we did not subdivide the analysis by detailed conditions such as country or race. Skill levels within a league differ by country, and strategies vary with the race match-up, so performance is expected to improve when such detailed conditions are considered. Since the results show that the length of the extraction section is related to feature performance, they demonstrate a relationship between section length and performance. We could not compare directly with the current system because results of the placement games were unavailable. If the ladder system used our results instead of the placement games, it could quickly and accurately place players after a single match. Furthermore, applying these results to AI control could improve AI performance and help create league-specific AI guidelines.