Spellcaster Control Agent in StarCraft II Using Deep Reinforcement Learning

This paper proposes a DRL-based training method for spellcaster units in StarCraft II, one of the most representative Real-Time Strategy (RTS) games. During combat situations in StarCraft II, micro-controlling various combat units is crucial in order to win the game. Among the many combat units, the spellcaster is one of the most significant components and greatly influences combat results. Despite the importance of spellcaster units in combat, training methods to carefully control spellcasters have not been thoroughly considered in related studies due to their complexity. Therefore, we suggest a training method for spellcaster units in StarCraft II using the A3C algorithm. The main idea is to train two Protoss spellcaster units under three newly designed minigames, each representing a unique spell usage scenario, to use 'Force Field' and 'Psionic Storm' effectively. As a result, the trained agents show winning rates of more than 85% in each scenario. We present a new training method for spellcaster units that relaxes a long-standing limitation of StarCraft II AI research. We expect that our training method can be used for training other advanced and tactical units by applying transfer learning in more complex minigame scenarios or full-game maps.


Introduction
Reinforcement Learning (RL), one of the nature-inspired computing algorithms, is a powerful method for building intelligent systems in various areas. In developing such systems, Artificial Intelligence (AI) researchers often examine their algorithms in complex environments for the sake of prospective research applying their models in challenging real-world settings. In doing so, solving complex game problems with Deep Reinforcement Learning (DRL) algorithms has become common. DeepMind solved Atari Breakout with DQN [1], for instance, and OpenAI showed significant performance in Dota2 team play with PPO [2]. In particular, Real-Time Strategy (RTS) games have emerged as a major challenge in developing real-time AI based on DRL algorithms.
StarCraft II, one of the representative games of the RTS genre, has become an important challenge in AI research. It is supported by a substantial research community: AI competitions on StarCraft II, e-sports fans vitalizing the StarCraft II community, and open Application Programming Interfaces (APIs) that let researchers manipulate actual game environments. Much related and advanced research has been published since DeepMind's work on creating AI players in StarCraft II using RL algorithms [3]. Especially, training an agent to control units in real-time combat situations is an active as well as challenging area of RL research in StarCraft II due to its complexity: the agent has to decide over thousands of options in a very short period of time. Unlike turn-based games such as Chess or Go, an agent playing StarCraft II is unable to obtain complete information about the game's state and action space, and this feature hinders an agent's long-term planning of a game.

Related Works
Recent AI research in StarCraft II has diversified its extent from macroscopic approaches, such as training entire game sets to win a full game, to microscopic approaches, such as selecting game strategies or controlling units [4][5][6]. Several studies have partially implemented detailed controls over units (often referred to as 'micro') in combat. Although spellcaster units are crucial for winning the game, especially in the mid-late phase of combat, there is no research on controlling spellcasters. When spellcaster units cast spells, they have a large number of options to consider, such as terrain, enemy, and ally information; for this reason, effectively casting spells during combat is generally thought of as a difficult task for AI players. Moreover, since StarCraft II is an RTS game, it is challenging to apply the algorithms used in turn-based games directly to StarCraft II. Players of StarCraft II have to deal with various sub-problems such as micro-management of combat units, build-order optimization of units and constructions, and efficient usage of mineral resources.
There have been various research works approaching the difficulties of developing StarCraft II AI. First, there are methods that train full-game agents. Hierarchical policy training using macro-actions won against StarCraft II's built-in AI in a simple map setting [7]. A self-playing, pre-planned modular architecture showed a 60% winning rate against the most difficult built-in AI in ladder map settings [8]. DeepMind produced a Grandmaster-level AI by combining imitation learning on human game replay data with self-play within the AlphaStar League [9].
Some works have suggested training micro-control in specific situations; micro-control means controlling units during combat. For instance, several works have trained unit control with DRL and applied transfer learning to employ the training results in game situations [9][10][11]. The minigame method was first introduced by DeepMind. It focuses on scenarios placed on small maps constructed for the purpose of testing combat scenarios with a clear reward structure [3]. The main purpose of a minigame is to win a battle against the enemy via micro-control. DeepMind trained its agents using 'Atari-Net', 'FullyConv', and 'FullyConv LSTM' in the minigame method and showed higher scores than other AI players. By applying Relational RL architectures, some minigame agents scored higher than Grandmaster-ranked (the highest rank in StarCraft II) human players [10].
Some other studies have worked on predicting combat results. Knowing the winning probability of a combat in advance, based on the current resource information, assists the AI's decision on whether to engage the enemy. In StarCraft I, this prediction was done by a StarCraft battle simulator called SparCraft. The limitation of this research is that units such as spellcasters are not included due to the numerous variables they introduce [12]. There is a battle simulator that takes one spellcaster unit into account [13]; however, considering only one among the many spellcaster units is insufficient for precise combat predictions.
Lastly, existing StarCraft II AI bots do not give spell use significant consideration. For instance, the built-in AI in StarCraft II does not utilize sophisticated spells such as 'Force Field'. TStarBot avoided the complexity of micro-control by hard-coding the actions of units [14]; however, this approach is not applicable to spell use in complicated situations, and good performance is not guaranteed. Also, SparCraft, a combat prediction simulator, does not include spellcasters in its combat predictions because prediction accuracy was poor when spellcasters were considered [12]. One related study integrated its neural network with UAlbertaBot, an open-source StarCraft bot, by learning macro-management from StarCraft replays; however, advanced units such as spellcasters were not included [15]. In this respect, research on the control of spellcaster units is needed in order to build improved StarCraft II AI.

Contribution
Among the aforementioned studies, examinations of the spell usage of spellcaster units are mainly excluded. Although spells are one of the most important components in combat, most StarCraft II AIs do not give spell use significant consideration. This is because spellcaster units are required to make demanding decisions in complex situations: it is difficult to precisely analyze situations based on the given information or to accurately predict combat results in advance, and inaccurate use of spells can lead to a crucial defeat. Combat in StarCraft II is inevitable. To win the game, a player must efficiently control units with 'micro', appropriately utilize spellcaster units, and destroy all the enemy buildings. In the early phase of a game, micro-control highly affects the combat results; during the mid-late phase, however, effective use of spells such as 'Force Field' and 'Psionic Storm' significantly influences the overall result of a game [16].
In this paper, we propose a novel method for training agents to control spellcaster units via three experiment sets with a DRL algorithm. Our method constructs benchmark agents as well as newly designed minigames for training the micro-control of spellcasters. They are designed to train spellcaster units in specific minigame situations where spells must be used effectively in order to win a combat scenario. Our work can be applied to general combat scenarios in prospective research by using the trained parameters to improve the overall performance of StarCraft II AI.

Deep Reinforcement Learning
Reinforcement Learning (RL) searches for the optimal actions a learner should take in given states of an environment. Unlike supervised or unsupervised learning, RL does not require training data; instead, an agent continuously interacts with an environment to learn the actions that return the largest rewards. This feature is one of the main reasons that RL is frequently used to solve complex problems in games such as Backgammon [17], Chess [18], and Mario [19]. Deep Reinforcement Learning (DRL) is a sophisticated approach to RL that employs deep network structures in its algorithms. DRL has solved problems that were once considered unsolvable by RL alone. Adopting deep network methods in RL, however, brings a few drawbacks: DRL algorithms are thought to be unstable for non-stationary targets, and the correlation between consecutive training samples during online RL updates is very strong.
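The interaction loop described here can be made concrete with a minimal tabular Q-learning sketch; the corridor environment and all constants below are illustrative inventions for exposition, not part of this paper's method:

```python
import random

def q_learning(env_step, n_states, n_actions, episodes=300, alpha=0.1,
               gamma=0.99, epsilon=0.3, max_steps=500, seed=0):
    """Minimal tabular Q-learning loop: the agent repeatedly interacts
    with an environment and improves its action-value estimates from
    the rewards it receives."""
    random.seed(seed)
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        state = 0
        for _ in range(max_steps):
            # epsilon-greedy: explore with probability epsilon
            if random.random() < epsilon:
                action = random.randrange(n_actions)
            else:
                action = max(range(n_actions), key=lambda a: q[state][a])
            next_state, reward, done = env_step(state, action)
            # temporal-difference update toward reward + discounted future value
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
            if done:
                break
    return q

# Toy corridor environment (illustrative only): states 0..3, action 1 moves
# right, action 0 moves left; reaching state 3 ends the episode with reward 1.
def corridor(state, action):
    next_state = min(state + 1, 3) if action == 1 else max(state - 1, 0)
    return next_state, (1.0 if next_state == 3 else 0.0), next_state == 3
```

After training, the learned Q-values prefer moving right (toward the reward) in every state, which is exactly the "best actions that return the largest rewards" behavior described above.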
Among others, Deep Q-Network (DQN) is one of the most famous DRL algorithms that addressed these problems. DQN used a Convolutional Neural Network (CNN) as a function approximator and applied the 'experience replay' and 'target network' methods. DQN showed meaningful achievements in Atari 2600 games using raw pixel input data, and it surpassed humans in competition [1]. Before DQN, it was challenging for agents to deal directly with high-dimensional vision or language data in the field of RL. After the success of DQN, many DRL algorithms were published, including double DQN [20], dueling DQN [21], and others [22,23], as well as policy-based DRL methods such as DDPG [24], A3C [25], TRPO [26], and PPO [27]. These methods solved problems involving continuous control in real-time games: StarCraft II [9], Dota2 [2], and Roboschool [27].
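A minimal sketch of the 'experience replay' idea: transitions are stored in a fixed-size buffer and sampled uniformly at random, which breaks the temporal correlation between consecutive online updates. The class below is illustrative, not DQN's actual implementation:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size experience replay buffer. Storing (s, a, r, s', done)
    transitions and sampling them uniformly at random decorrelates the
    training batches, one of the stabilizing tricks behind DQN."""

    def __init__(self, capacity, seed=0):
        # deque with maxlen silently evicts the oldest transition when full
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # uniform random sample without replacement
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)
```

Usage: the agent pushes one transition per environment step and trains on random mini-batches once the buffer holds enough experience.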
This paper employs the Asynchronous Advantage Actor-Critic (A3C) algorithm [25]. While DQN demonstrated meaningful performance, it has a few weaknesses: its computational load is heavy because it uses a large amount of memory, and it requires off-policy algorithms that update from data generated by older policies [25]. A3C solves the strong-correlation problem by using an asynchronous method that executes multiple agents in parallel; this also reduces the computational burden compared to DQN. For these reasons, DeepMind used the A3C algorithm in its StarCraft II AI research. Detailed descriptions of the algorithm as applied in this paper are provided in the Method section.

StarCraft II and Spellcasters
In StarCraft II, the most common game setting is a duel. Each player chooses one of the three races (Terran, Protoss, or Zerg), each with distinct units and buildings that require different strategies. Each player starts with a small base; in order to win, one has to build constructions and armies and destroy all of the opponent's buildings. Spellcasters refer to any unit that uses energy-based spells. Spells can be powerful skills depending on how often and how effectively a player uses them; each cast consumes energy, which is passively generated at 0.7875 energy per second and capped at 200 per unit. Trading energy, in the form of spells, against enemy units is the most cost-efficient trade in the game, since energy is an infinite and free resource [16]. In short, spellcasters have powerful abilities compared to other combat units, but their spells must be used effectively and efficiently due to the limited energy resources.
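Given the regeneration figures quoted above, the waiting time before a spellcaster can afford a spell follows directly; the 50-energy cost in the usage note matches the standard 'Force Field' cost, but treat the specific numbers as illustrative:

```python
REGEN_PER_SEC = 0.7875   # passive energy regeneration quoted in the text
ENERGY_CAP = 200         # per-unit energy cap quoted in the text

def seconds_until_affordable(current_energy, spell_cost):
    """Seconds of passive regeneration needed before a spell is castable.
    Returns None if the cost exceeds the energy cap (never affordable)."""
    if spell_cost > ENERGY_CAP:
        return None
    deficit = spell_cost - current_energy
    return max(deficit, 0) / REGEN_PER_SEC
```

For example, a Sentry at 0 energy must wait 50 / 0.7875, roughly 63.5 seconds, before it can cast a 50-energy 'Force Field', which is why wasting a cast is so costly.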
Protoss has two significant spellcaster units: the Sentry and the High Templar. Screenshot images of the two spellcaster units are provided in Figure 1. Sentries provide 'Force Field', a barrier that lasts for 11 s and impedes the movement of ground units. Designed to seclude portions of the enemy forces, it helps realize a divide-and-conquer strategy, but only if properly placed. 'Force Field' is one of the spells that are difficult to use, because the surrounding situation must be assessed before choosing proper placement coordinates; generally, it benefits greatly from narrow terrain such as choke points, where a single well-placed barrier can seal off a passage. High Templars cast 'Psionic Storm', a spell that deals damage over time to every unit, friend or foe, within a target area, which makes careless placement costly.

Method
In particular, we use the A3C algorithm constructed with an Atari Network (Atari-Net) to train the StarCraft II spellcaster units. 'Atari-Net' is DeepMind's first StarCraft II baseline network [28]. One of the reasons we chose Atari-Net and the A3C algorithm is to improve on ongoing research and to learn under the same conditions as the minigame method in DeepMind's StarCraft II combat research [3]. This paper demonstrates that spellcaster units can be trained well through the methodology we present.
A3C, a policy-based DRL method, has two key characteristics: it is (1) asynchronous and (2) actor-critic. (1) The asynchronous feature utilizes multiple workers to train more efficiently: multiple worker agents interact with the environment at the same time, so the algorithm acquires more diverse experience than a single agent would. (2) The actor-critic feature estimates both a value function V(s) and a policy function π(s), and the agent uses V(s) to update π(s). Intuitively, this allows the algorithm to focus on states where the network's predictions perform poorly. The A3C update is calculated as ∇_θ log π(a_t|s_t; θ) A(s_t, a_t; θ, θ_v), where A(s_t, a_t; θ, θ_v) is an estimate of the advantage function given by Σ_{i=0}^{k−1} γ^i r_{t+i} + γ^k V(s_{t+k}; θ_v) − V(s_t; θ_v), in which k varies among states up to the rollout length. A3C uses a CNN that produces one softmax output for the policy π(a_t|s_t; θ) and one linear output for the value function V(s_t; θ_v); the policy parameter θ is updated asynchronously with the gradient above.

The Atari-Net accepts the currently available actions and features from both the minimap and the screen, which allows the algorithm to recognize the situation of the whole map as well as of specific units. The detailed network architecture is shown in Figure 2. A3C implements parallel training, where multiple workers in parallel environments independently update a global network. In this experiment, we initially utilized eight workers but found that two workers made the training more time-efficient, so we ran two asynchronous actors in the same environment simultaneously and used the epsilon-greedy method for initial exploration, casting spells at random coordinates [29]. Figure 3 describes the overall structure of the training method used in this paper. The agent in each worker executes minigames by interacting with the StarCraft II environment directly through the PySC2 Python package.
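The advantage estimate in the A3C update can be computed from a k-step rollout with a single backward pass; a framework-free sketch (the variable names are ours, not the paper's):

```python
def k_step_advantages(rewards, values, bootstrap_value, gamma=0.99):
    """Compute A(s_t, a_t) = sum_{i<k} gamma^i * r_{t+i}
    + gamma^k * V(s_{t+k}) - V(s_t) for every step of a rollout,
    as in the A3C update. `values` holds V(s_t; theta_v) for each
    visited state; `bootstrap_value` is V of the state after the
    rollout (0 if the rollout ended in a terminal state)."""
    advantages = []
    ret = bootstrap_value
    # walk the rollout backwards, accumulating the discounted return
    for r, v in zip(reversed(rewards), reversed(values)):
        ret = r + gamma * ret
        advantages.append(ret - v)
    advantages.reverse()
    return advantages
```

Each worker would multiply these advantages by ∇_θ log π(a_t|s_t; θ) to form its asynchronous gradient contribution; note how k shrinks toward the end of the rollout, matching "k varies among states".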
The (x, y) coordinates for spells in minigames are based on three input values: the minimap, the screen, and the available actions. With those inputs, the agent decides the spell coordinates, taking into account the policy parameters given by the global network. An episode repeats until the agent is defeated in combat or reaches its time limit. After each episode, the agent returns the scores based on the combat result and the coordinates where spells were used. Each worker then trains through the network to update the policy parameters using the coordinate and score values. Lastly, the global network receives each worker's policy loss and value loss, accumulates those values in its network, and sends the updated policy values back to each agent. Algorithm 1 provides a detailed description of the A3C minigame training method; the core of its inner loop is:

    if random(0, 1) < ε then
        (x, y) ← randint(0, M)
    else
        s_t ← minigame's screen, minimap, and available actions
        set (x, y) according to policy π(a_t|s_t; θ)
    end if
    use spell at (x, y) coordinates
    if Defeat or TimeOver then ...
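The per-step decision of Algorithm 1 (an epsilon-greedy choice between a random coordinate and the policy's coordinate) can be sketched as follows; `policy_xy` and `map_size` are placeholder names for illustration, not the paper's actual interfaces:

```python
import random

def choose_spell_target(policy_xy, observation, epsilon, map_size, rng=random):
    """Epsilon-greedy coordinate selection: with probability epsilon pick a
    random (x, y) on the map for exploration, otherwise follow the policy.
    `policy_xy` stands in for pi(a_t | s_t; theta) restricted to the spell's
    target coordinates; `observation` stands in for (screen, minimap,
    available actions)."""
    if rng.random() < epsilon:
        return rng.randrange(map_size), rng.randrange(map_size)
    return policy_xy(observation)
```

With epsilon = 0 the agent always follows its policy; early in training, a high epsilon produces the random spell placements used for initial exploration.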

Minigame Setup
The minigames are a key approach to training the spellcasters in this experiment. Each minigame is a special environment created in the StarCraft II Map Editor. Unlike normal games, each map provides a combat situation in which the agent repeatedly fights its enemies. These are specifically designed scenarios on small maps constructed with a clear reward structure [3]. They avoid the full game's complexity by isolating a certain aspect of it and can be used as a training environment whose results can then be applied to the full game.
We trained the agent in three typical circumstances, using three different minigame scenarios summarized in Table 1.

The '(a) Choke Playing' minigame is designed to test whether training the agent under a rather simplified combat environment with a spell is feasible. The minigame consists of two Sentry ally units and eight Zergling enemy units. It also tests whether the training agent can use 'Force Field' to exploit the benefit of the choke point.

The '(b) Divide Enemy' minigame is designed to test whether training the agent under an environment that resembles real combat with spells is feasible. The minigame consists of six Stalker and one Sentry ally units, and ten Roach enemy units. It also tests whether the training agent can use 'Force Field' to divide the enemy army. This technique can also force the enemy to engage if it manages to trap a portion of its army.

The '(c) Combination with Spell' minigame is designed to test whether the agent can learn two spells at the same time in an environment similar to Choke Playing. The minigame consists of one Sentry and one High Templar ally unit, and ten Zergling enemy units. It tests whether the training agent can use a combination of spells ('Force Field' and 'Psionic Storm').

Figure 5 shows images of the minigame combats after training. The objective of these minigames is to defeat the enemy armies on the right side of the map in every episode of training by effectively casting spells around the choke point (the narrow passage between the walls). Each episode takes two minutes. Each time all the enemy armies are defeated, the environment is reset; when the player's units are all defeated in battle, the episode terminates as well. The score of the minigame increases by ten points when an enemy unit dies and decreases by one when a player's unit dies.
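The scoring rule above (+10 per enemy kill, −1 per lost player unit) maps onto a one-line score function; a sketch for sanity-checking reported episode scores:

```python
def episode_score(enemy_kills, ally_losses):
    """Minigame score as described in the text: +10 for each enemy unit
    killed, -1 for each of the player's units lost."""
    return 10 * enemy_kills - ally_losses
```

For example, clearing all eight Zerglings in '(a) Choke Playing' while losing one Sentry would contribute 10 * 8 - 1 = 79 points for that wave.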
Since the enemy units rush forward to kill the player's units, the agent should strive to kill as many enemy units as possible without losing too many of its own. Consequently, careful 'micro' of the Sentries, with efficient use of 'Force Field', is required because the scenario is disadvantageous to the player: the Sentries can use 'Force Field' only once. During the process of learning to win, the agent thus learns to use the appropriate spells in the situations that require them.

Result
The graphs of the experiment results are presented in the figures below. Figure 6 shows an integrated score graph of the three minigame methods. Figure 7 provides score graphs for each training method; the orange line depicts the scores of trained agents and the blue line depicts the scores of random agents that cast spells at random coordinates. Table 2 provides the statistics of the training results. Win rate corresponds to the winning rate over episodes, Mean to the average score of the agent's performance, and Max to the maximum observed individual episode score. In order to win and get a high score, spells must be used properly at relevant coordinates. Because the coordinates of the enemy units' locations are random in every situation, a high score indicates that spells were used effectively across varying situations; in that respect, a high max score implies that the agent is well trained to use spells at optimal coordinates. In short, trained agents achieved high mean and max scores compared to random agents. This is especially noticeable in scenarios (a) and (c). In scenario (b), due to the characteristics of the minigame, the score difference between the trained agent and the random agent was not as significant as in the other two scenarios; compared to them, minigame (b) also took a comparatively long time to converge to a certain score during training.
In the beginning, agents trained with '(a) Choke Playing' used 'Force Field' to divide the enemy army and delay defeat. Over 1000 episodes, however, the agent converged to using 'Force Field' at the narrow choke point to block the area and attacking the enemy with ranged attacks without losing health points. The agent took an especially long time to learn to use 'Force Field' properly in the '(b) Divide Enemy' minigame: because a score could be obtained without using 'Force Field', the presence or absence of the spell was comparatively insignificant. Over 1600 episodes, the agent converged to using 'Force Field' to divide the enemy army efficiently, taking advantage of the enemy's poor ability to focus fire. The agent trained with '(c) Combination with Spell' initially could not use 'Psionic Storm' in combination with 'Force Field' and instead had its own units killed along with the enemy Zerglings under 'Psionic Storm'. Over 1000 episodes, the agent converged to using both spells together, solving the scenario by blocking the choke point first and then casting 'Psionic Storm' properly. Probability Density Functions (PDFs) of the linear output from the global network structure are shown in Figures 8-10; the graphs depict the probability of the (x, y) coordinates converging to a specific location as the training episodes increase.

For these experiments, we applied the following hyperparameters. We used an 84 × 84 feature screen map and a 64 × 64 minimap. The learning rate was 5 × 10⁻⁴ and the discount was 0.99. For exploration, we used the epsilon-greedy method; the initial epsilon (ε) value was 1.0 and the epsilon decay rate was 5 × 10⁻⁴. We trained the agent on a single machine with 1 GPU and 16 CPU threads. Agents achieved a winning rate of more than 85% against the enemy army in the respective minigame situations.
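The exploration schedule can be reproduced from the stated hyperparameters (initial ε = 1.0, decay rate 5 × 10⁻⁴); the text does not state whether the decay is linear or multiplicative, nor whether it is applied per episode or per step, so the linear per-episode schedule below is an assumption:

```python
def epsilon_at(episode, initial=1.0, decay=5e-4, floor=0.0):
    """Assumed exploration schedule: epsilon starts at `initial` and drops
    linearly by `decay` each episode, never going below `floor`.
    (The decay form is our assumption, not stated in the text.)"""
    return max(initial - decay * episode, floor)
```

Under this assumption, ε would be 0.5 after 1000 episodes (roughly where minigames (a) and (c) converged) and would reach the floor after 2000 episodes.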

Discussion
This paper presents a method of minigame production to train spellcasters and build a combat module. First of all, we showed that spellcaster units can be trained to use spells properly in accordance with the minigame situations. Designing unique maps that provide various situations for spell usage proved effective in training spellcasters. For example, minigames (a) and (c) were particularly well suited to the training method because, in their environments, it is difficult to defeat the enemy units without learning the spells.
This approach breaks through the limitation of using only 'attack, move, stop, and hold' commands in combat training by including the spell commands of the spellcasters. The training method teaches agents to use spells effectively for a given situation in both simple and complex battles. In addition, the minigame method can be used not only for training spellcaster unit control but also for training advanced unit control, given additional minigame designs aimed at controlling other StarCraft II units. Those units may include advanced combat units, such as the Carrier or Battlecruiser, which do not appear frequently in melee games due to their expensive costs and the trade-offs in managing resources, or tactical units like the Reaper or Oracle, which are used for hit-and-run, scouting, and harassment strategies. Through this training method, the scope for employing transfer learning in StarCraft II AI research will be enlarged.
In order to use this agent in real combat, one needs to train minigames in a wider variety of situations and apply transfer learning. A related study that trained existing minigames using the A3C algorithm proposed a transfer learning method that applies pre-trained weights to other minigames and showed significant performance [30]; that paper also suggests applying transfer learning from minigames to the Simple64 map or other full-game maps. Because DRL models have low sample efficiency, a large number of samples is required [10]; our work can be utilized to generate many and varied samples. Additionally, it might be effective to add conditions such as region and enemy information in further studies. Therefore, we recommend extending the training of spellcaster units to the same degree as training other combat units.
We deem this an unprecedented approach, since most other studies have trained their agents merely on the units' basic combat specifications without considering the effects of exploiting complex spells. The strengths of our method are that minigame situations reduce the complexity of training micro-control and that various minigame scenarios can cover challenging combat situations. Its limitations are that we used only two of the many spellcasters, and the proposed minigames are designed specifically for the 'Force Field' and 'Psionic Storm' spells. We focused on 'Force Field', one of the most used and most complex-to-use utility spells, to provide a baseline for training spellcaster units; through the training, we succeeded in inducing our agent to deliberately activate 'Force Field' at critical moments. Our training method can serve as a reference for training other advanced units as well, and we plan to make use of the spellcaster agents in AI bots.
Our study is a part of the contribution to building more advanced StarCraft II AI players, especially in the aspect of maximizing the benefit of spells that can potentially affect combat situations to a great degree. We expect that developing more versatile agents capable of employing appropriate spells and applying their performances to realistic combat situations will provide another step to an ultimate StarCraft II AI. In the future, we will put effort into transfer learning techniques to use this method.

Conclusions
We proposed a method for training spellcaster units to use spells effectively in specific situations. The method trains an agent in a series of minigames designed around specific spell usage scenarios. We made three minigames, '(a) Choke Playing', '(b) Divide Enemy', and '(c) Combination with Spell', each representing a unique spell usage situation, and used them to train two spellcasters to use 'Force Field' and 'Psionic Storm'. The experiments showed that training spellcasters within the minigames improved overall combat performance. We suggested a unique way to use minigames for training spellcasters, and this minigame method can be applied to train other advanced units besides spellcaster units. In the future, we look forward to finding a way to utilize this method in real matches; by adopting transfer learning, we expect other complex sub-problems in developing powerful StarCraft II AI can be solved. While this paper applied a DRL algorithm, A3C, to the minigame method, there is great potential for other nature-inspired methods to be applied to the problem. Representative nature-inspired algorithms such as Monarch Butterfly Optimization (MBO) [31], the Earthworm Optimization Algorithm (EWA) [32], Elephant Herding Optimization (EHO) [33], and the Moth Search (MS) algorithm [34] should be considered in future research. We believe that such an attempt will be another step towards the construction of an ultimate StarCraft II AI.