Considerations for Comparing Video Game AI Agents with Humans

Abstract: Video games are sometimes used as environments to evaluate AI agents' ability to develop and execute complex action sequences to maximize a defined reward. However, humans cannot match the fine precision of the timed actions of AI agents; in games such as StarCraft, build orders take the place of chess opening gambits. Unlike strategy games such as chess and Go, however, video games also rely heavily on sensorimotor precision. If the "finding" was merely that AI agents have superhuman reaction times and precision, no one would be surprised. The goal is rather to look at adaptive reasoning and strategies produced by AI agents that may replicate human approaches or even result in strategies not previously produced by humans. Here, I will provide: (1) an overview of observations where AI agents are perhaps not being fairly evaluated relative to humans, (2) a potential approach for making this comparison more appropriate, and (3) highlights of some important recent advances in video game play provided by AI agents.

Many of these studies use a variety of retro games, such as those for the Atari 2600 (Atari Inc., Sunnyvale, CA, USA) or the original Nintendo Entertainment System (NES) (Nintendo Co., Ltd., Kyoto, Japan), to demonstrate the generalisability of an AI agent to play many games and optimise actions with minimal coding that is unique to each game, such as knowing where the score is shown (see Figure 1A). Some games are now well known to be difficult for AI agents, such as the Atari 2600 games Montezuma's Revenge (Parker Brothers, Beverly, MA, USA) and Battle Zone (Atari Inc., Sunnyvale, CA, USA), where the mapping between increases in calculable score and actions is particularly complex, with subgoals not clearly defined [9,10]. Other recent advances in AI agents have been in strategy games that involve direct competition, such as chess and Go. More recently, Vinyals et al. [8] developed AlphaStar to play the competitive computer-based strategy game StarCraft 2 (Blizzard Entertainment, Inc., Irvine, CA, USA). The original StarCraft (Blizzard Entertainment, Inc., Irvine, CA, USA) computer game has been a staple in AI research for well over a decade [11][12][13], but with the recent advances in both deep learning and reinforcement learning, its successor, StarCraft 2, has stood as a clear benchmark for future developments in video game AI.

Figure 1. (A) Infinite Mario and (B,C) AlphaStar in the early game and the middle of the game; from [14,15], respectively.

Briefly, StarCraft and StarCraft 2 are real-time strategy (RTS) games where each player is a commander of a science-fiction military force of one of three species: Terran (advanced humans), Protoss (advanced technological aliens), or Zerg (advanced biological aliens). In this role, the player is tasked with building one or more military bases, harvesting resources from around the base, and building any of a variety of military units. The goal of each game is to defeat the other players (i.e., destroy their bases). Each species involves different strategies and has approximately 15 unique unit types available to it. Units vary in characteristics such as attack range, movement (either ground-based or flying), and abilities such as being able to gather resources, construct buildings, carry other units, heal/repair other units, or even become invisible for a limited time. All species have units that are designed to be suitable counters to units from each species, reflecting a rock-paper-scissors dynamic.
AlphaStar has been provided with a direct API to the StarCraft game engine, allowing the agent to skip past the tedious and arguably less interesting problems of computer vision and sensorimotor precision (e.g., if this was an independent robot that had to control a computer keyboard and mouse); see Figure 1B,C. Early versions of the AlphaStar agent did not need to control the camera and had an all-seeing perspective of all actively viewable regions, as well as inhuman "micro" precision of actions. Thus, while AlphaStar was able to develop novel and intriguing strategies, this is not its only advantage relative to a human player. In this article, I will discuss considerations for how AI agents can be made more comparable to humans, allowing us to learn the most from their strategies, rather than their capabilities that we cannot hope to match.
Using AlphaStar [8] as the main example, one approach to adjust for this is to limit AI agents to the same abilities as expert humans:
"Camera view. Humans play StarCraft through a screen that displays only part of the map along with a high-level view of the entire map (to avoid information overload, for example). The agent interacts with the game through a similar camera-like interface, which naturally imposes an economy of attention, so that the agent chooses which area it fully sees and interacts with. The agent can move the camera as an action. [...]
APM limits. Humans are physically limited in the number of actions per minute (APM) they can execute. Our agent has a monitoring layer that enforces APM limitations. This introduces an action economy that requires actions to be prioritized. [...] [A]gent actions are hard to compare with human actions (computers can precisely execute different actions from step to step)."
Indeed, it was generally agreed that "forcing AlphaStar to use a camera helped level the playing field" [16]. However, even with the camera limitations later added to AlphaStar [8], the agent is able to click on objects on the screen that are not actually visible to human experts (i.e., objects at the edge of the screen with only a few viewable pixels) ([17]; confirmed in replay "AlphaStarMid_053_TvZ"). Even with these constraints, it can be debated whether the limitations imposed on AlphaStar are sufficient given that humans have considerable limitations in their sensory and motor abilities, for instance: (1) the inability to actively attend to all of the presented visual information simultaneously; and (2) less control over the temporal and spatial precision of their actions. In human experts, declines in performance can emerge as early as 24 years of age [18]. Moreover, even though humans and AlphaStar could be matched to the same APM distribution characteristics, human actions are not all efficiently made, unlike those of AlphaStar. In a preliminary show-match [19,20], AlphaStar was able to demonstrate superhuman control, "it could attack with a big group of Stalkers, have the front row of stalkers take some damage, and then blink them to the rear of the army before they got killed" [16] (watch the demonstration with commentary [20] from 1:30:15). Humans have previously used this strategy as well, but lack the spatial and temporal precision to execute it as well as AlphaStar can. In another example, AlphaStar can exhibit precision control of multiple groups of units beyond human capabilities (e.g., see 1:41:35 and 1:43:30 in the previous video [20]). Clearly, human performance, even that of skilled humans, is not a suitable benchmark for modern AI agents; they simply are not matched in the sensorimotor processing and precision necessary for comparable real-time performance.
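To make the idea of an APM monitoring layer concrete, the following is a minimal sketch of a sliding-window action cap. It is purely illustrative: the class name `ApmLimiter`, its parameters, and the window logic are assumptions made here, not a description of how AlphaStar's monitoring layer is implemented.

```python
from collections import deque


class ApmLimiter:
    """Sliding-window cap on actions per minute (hypothetical sketch)."""

    def __init__(self, max_actions: int, window_s: float = 60.0):
        self.max_actions = max_actions
        self.window_s = window_s
        self._times = deque()  # timestamps of accepted actions

    def try_act(self, now_s: float) -> bool:
        """Return True and record the action if the cap allows it at now_s."""
        # Forget accepted actions that have left the time window.
        while self._times and now_s - self._times[0] >= self.window_s:
            self._times.popleft()
        if len(self._times) >= self.max_actions:
            return False  # over budget: the action must be dropped or deferred
        self._times.append(now_s)
        return True


# Toy cap of 3 actions per 60 s; timestamps are in seconds.
limiter = ApmLimiter(max_actions=3, window_s=60.0)
accepted = [limiter.try_act(t) for t in (0.0, 0.1, 0.2, 0.3, 60.0)]
print(accepted)
```

A cap of this kind bounds how many actions are issued, but says nothing about how accurately each one is timed or placed, which is the gap discussed above.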
Note, however, that I am not critiquing the differences in the amount of experience between humans and AI agents, e.g., "During training, each agent experienced up to 200 years of real-time StarCraft play". While AI agents often are provided with orders of magnitude more experience than human agents in the tested scenario, humans benefit from a biological architecture that makes this relative difference hard to evaluate directly (e.g., genetic and physiological optimisation; see Zador [21] and LeDoux [22]). Moreover, I suggest that the principal measure of interest is not how quickly an agent reaches its current level of performance (e.g., number of matches or years of play), but the decision-making strategies that were present at the point of evaluation. Notably, these highlighted issues do not apply to turn-based strategy games, such as chess or Go, only to real-time strategy games. It is also important to acknowledge that these concerns about AlphaStar's lack of limitations (e.g., camera view and APM limits) were not as relevant to previous StarCraft AI agents, which were not nearly as able to defeat human players [23].
Issues in comparing AI agents to humans are, of course, not limited to AlphaStar. In work using retro games where humans have been included as a benchmark, the human data may not truly reflect expert performance. Mnih et al. [7] reported many measures based on comparisons to a "professional human games tester"; however, this appears to be a single human for all games, who likely was not an expert in any of them. The level of expertise for the human data here is particularly questionable; for instance, the human score for Seaquest was 20,182 points (see Extended Data Table 2 of [7]), whereas a community high-score website with rigorous reporting procedures has 276,510 as the current world high-score [24], a value more than 13 times higher than the human score reported in Mnih et al. [7]. Moreover, using tool-assistance (described in more detail later on), humans have achieved the maximum score possible: 999,999 points [25]. These tool-assisted playthroughs can be precisely re-played based on a recorded set of button presses with frame-by-frame precision. This discrepancy is not unique to a single game, with other high scores also being well above those obtained by the human in Mnih et al. [7] (e.g., Kangaroo, 3035 vs. 47,800 [26]). Similar critiques have also been raised by Toromanoff et al. [27]. Even in more recent work (e.g., [28][29][30]), obtaining appropriate human benchmarks when making claims of the AI agent "achiev[ing] superhuman performance in a range of challenging and visually complex domains" [30] appears to be an afterthought.
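The size of the gap between the reported human baseline and community records can be checked with simple arithmetic. The short sketch below uses only the four scores quoted in the text; the dictionary layout and variable names are illustrative choices.

```python
# Scores quoted in the text: (human baseline in [7], community record).
scores = {
    "Seaquest": (20_182, 276_510),
    "Kangaroo": (3_035, 47_800),
}

# Ratio of the community record to the reported human baseline.
ratios = {game: record / baseline for game, (baseline, record) in scores.items()}

for game, ratio in ratios.items():
    print(f"{game}: record is {ratio:.1f}x the reported human baseline")
```

For both games, the community record exceeds the reported baseline by more than an order of magnitude.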

Learning from the Cube
While the goal of this paper is to consider how machine-learning algorithms can be fairly compared to humans using video game benchmarks, a simplified case is also worth considering: the Rubik's Cube, which itself is no stranger to machine-learning optimisation (e.g., [31][32][33]). The Rubik's Cube, which arguably could be implemented as a video game itself, is a well-known game where the goal state is a 3 × 3 × 3 cube in which each face shows only a single colour, with six unique colours in total. The initialisation state is any valid configuration of colours that can be obtained from an indeterminate number of twists such that the cube is considered scrambled. The state space of the Rubik's Cube is 4.33 × 10^19 possible configurations (i.e., states). It is worth considering that this state space is significantly smaller than the number of permutations of the six colours along the 9 × 6 cubies (i.e., the visible small square panels that make up each face), as for instance, the centre white cubie is always directly opposite the centre yellow cubie. For every valid Rubik's Cube state, there are 11 unreachable states based on how colours along the cubies are inter-related; in other words, if one were to take a Rubik's Cube apart and reassemble its pieces at random, there is an 11/12 (92%) chance that it would be unsolvable.
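The state count and the 11/12 figure can be reproduced with the standard counting argument for the 3 × 3 × 3 cube. This is a sketch of that textbook calculation, not something taken from the cited works; the variable names are illustrative.

```python
from math import factorial

# Ways to assemble the pieces with no legality constraint:
# 8 corners (any order, 3 orientations each) and
# 12 edges (any order, 2 orientations each).
assemblies = factorial(8) * 3**8 * factorial(12) * 2**12

# Three independent constraints hold on a cube scrambled by legal twists:
# total corner twist (factor 3), total edge flip (factor 2), and
# matching corner/edge permutation parity (factor 2).
reachable = assemblies // (3 * 2 * 2)

print(f"{reachable:,}")                    # exact number of valid states
print(f"{reachable:.2e}")                  # scientific notation
print(f"{1 - reachable / assemblies:.1%}") # share of unsolvable assemblies
```

The factor of 12 separating the two counts is where the 11 unreachable states per valid state come from.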
The so-called "beginner's method" for solving the Rubik's Cube [34] often averages more than 100 moves to solve the cube, where each move corresponds to a quarter-turn of a face or middle layer (either clockwise or counter-clockwise), a half turn, or a rotation of the cube's perspective. Here, the methods are composed of numerous algorithms, i.e., sequences of moves, designed to swap colours from portions of the cube or otherwise reach intermediate goal states (e.g., cross/"daisy", complete layer, 2 × 3 × 1 block), as illustrated in Figure 2.

Figure 2. Illustration of various methods for solving a scrambled Rubik's Cube. A scrambled cube, with the scramble notation shown at the top, can be solved using a variety of methods. The number of moves required to solve the cube using each method is shown, along with pictures of intermediate goal states (e.g., cross/"daisy", complete layer, 2 × 3 × 1 block).
Competitive approaches for solving the cube rely on increasing numbers of algorithms and increasing complexity of the intermediate goal states [35]. For instance, the beginner layer-by-layer method only has a few defined/memorised algorithms, but a popular competitive solving method, the Fridrich/CFOP method, has 78 to 119 algorithms (depending on the variant of the method) and averages 55 moves to solve the cube. The ZZ method relies on more complicated intermediate goal states, as well as more memorised algorithms (up to 537), but results in slightly fewer moves, averaging in the low 40s. Most competitive Rubik's Cube solvers, known as "cubers", use the CFOP method due to its balance between the number of algorithms to memorise and the number of moves, even though the Roux and ZZ methods would reduce the number of moves necessary to solve the cube; most competitive cubing is based on overall solving speed, not number of moves, and the two are only indirectly related. As such, a limitation for humans, even within tractable strategies, is the algorithmic complexity required to identify and execute more efficient solving strategies, even when this can be dissociated from the time related to sensorimotor actions.
Thistlethwaite [36] proposed a method that could solve a Rubik's Cube from any initialisation state in up to 52 moves, later refined to 45 moves. However, this approach requires too many algorithms for a human to memorise and functions more as a look-up table of input and output state configurations, although it still involves intermediate goal states. Kociemba refined this approach by reducing the number of intermediate goal states, reducing the number of steps required to solve any cube to at most 29 moves. Later work has shown that any possible configuration of a Rubik's Cube can be transformed to the goal state using no more than 20 moves [37], and from this, a complete look-up table of all 4.33 × 10^19 valid states could be stored along with their associated solving moves, fully replacing the strategy with pre-memorised optimal solutions. While I do not think it is necessary for AI agents to be compared to humans, some researchers have developed robots that can solve a physical Rubik's Cube, rather than merely virtual representations of it (e.g., [38,39]).

Developing Better Benchmarks for AI Agents
A better comparison for real-time AI agents would be tool-assisted human performance, where the precise timing of actions is optimised frame-by-frame, as is done in tool-assisted speed-runs (TAS) [40]. In this approach, humans can play a game with the option of slowing it down and timing their sequence of button presses to perform actions with frame-by-frame temporal precision (as retro games do not use a mouse, the spatial precision concern with AlphaStar does not apply here). This approach would remove the definitively superhuman sensorimotor advantage of AI agents, while still allowing for the comparison of strategic planning abilities. This level of fine control allows for a theoretically perfect playthrough that would otherwise be nearly impossible for a human to perform. The usual goal of this approach is a "speed-run", that is, to play through the game as quickly as possible, but it could readily be applied to maximising a score or any other goal criterion. This would provide the humans with external aids to level the playing field and focus more on the strategic component of the gym environment (also see [41]). Some tool-assisted speed-runs use modified hardware to directly provide inputs into the original video game console as a controller [42]; however, in most cases, speed-runs are evaluated using software emulation [43].
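The core mechanism of a tool-assisted run, replaying a recorded per-frame input log against a deterministic game, can be sketched in a few lines. Everything in this sketch is hypothetical: the toy scoring rule, the `replay` function, and the log format are assumptions made for illustration and do not correspond to any particular emulator or TAS tool.

```python
from typing import Dict

# Hypothetical input log: frame number -> button held on that frame.
InputLog = Dict[int, str]


def replay(inputs: InputLog, n_frames: int) -> int:
    """Run a toy deterministic 'game' and return the final score.

    Toy rule (assumed for illustration): a target can be hit only on
    frames divisible by 7, and pressing 'A' on such a frame scores 100.
    """
    score = 0
    for frame in range(n_frames):
        button = inputs.get(frame, "")
        if button == "A" and frame % 7 == 0:
            score += 100
    return score


# A tool-assisted log presses 'A' on exactly the scoring frames.
tas_log: InputLog = {frame: "A" for frame in range(0, 60, 7)}
# A player who is consistently two frames late misses every window.
late_log: InputLog = {frame + 2: "A" for frame in range(0, 60, 7)}

print(replay(tas_log, 60), replay(tas_log, 60), replay(late_log, 60))
```

Because the toy game is deterministic, the same log always yields the same score, which is what allows such playthroughs to be re-played and verified; the two logs use the same "strategy" and differ only in timing precision.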

AI Agents Performing beyond Human Capabilities
The suggestions here are not intended to be overly critical, just to make for a "fairer" comparison. With the Rubik's Cube, advanced strategies (i.e., those beyond the capabilities of humans) can transition from the initialisation state to being solved without requiring the more comprehensible intermediate states, such as a white daisy/cross as a first step. As outlined earlier, any Rubik's Cube state can be solved to the goal state in twenty moves or fewer [37]. This same work identified the number of actions required to solve a cube from each of the possible 4.33 × 10^19 initialisation states. Based on this algorithm, only 3% of states can be solved in 16 moves or fewer, but this increases to 30% for 17 moves or fewer and up to 97% for 18 moves or fewer. In contrast, common human strategies will often take at least 40 to 60 moves, though as highlighted earlier, these strategies are developed to optimise human processing and execution time (i.e., speed-cubing), not to minimise the number of moves taken. Planning the optimal Rubik's Cube state a dozen moves ahead, without executing them yet, has an algorithmic complexity that humans are not well suited for, which is why the human solving methods involve memorising algorithms that help achieve intermediate goal states, somewhat similar to chess gambits and Go joseki.
Of interest are surprising examples of emergent behaviour. Chrabaszcz et al. [44] found that AI agents have the potential to grossly out-perform humans and find novel solutions, even if an "environment" has been available to humans for decades. In their work with a Q*bert agent, "the agent learns that it can jump off the platform when the enemy is right next to it, because the enemy will follow: although the agent loses a life, killing the enemy yields enough points to gain an extra life again. The agent repeats this cycle of suicide and killing the opponent over and over again". More impressively, however, "the agent discovers an in-game bug. First, it completes the first level and then starts to jump from platform to platform in what seems to be a random manner. For a reason unknown to us, the game does not advance to the second round but the platforms start to blink and the agent quickly gains a huge amount of points (close to 1 million for our episode time limit)". This behaviour has since been reproduced in the console version of Q*bert by a human [45], ruling out the possibility that this behaviour was only possible (1) due to a bug in the game emulation or (2) by relying on rapid button presses that exceeded plausible human reaction times. In another, more comical example, reference [46] reported on an overly greedy AI agent that can play Tetris, though not well: it places blocks haphazardly for their associated increase in points, but learns to pause the game just before the next block would cause the game to be over. See [47] for further examples of emergent behaviour.
In a recent study, OpenAI researchers constructed an environment where multiple AI agents play a game of hide-and-seek in teams and adapt to each other's strategies, in a form of multi-agent generative adversarial learning [48], as shown in Figure 3. In certain circumstances, the agents learned to exploit the physics of the environment to get on top of boxes and move around the environment "surfing" on a box, or even launching themselves into the air and successfully "seeking" the hiders mid-flight.
Figure 3. Visualisations of multi-agent adversarial learning and adaptations in strategy in OpenAI's hide-and-seek game [48]. Panels demonstrate the different strategies developed by the AI agents: (a) seekers learn to chase hiders, and hiders learn to run away; (b) hiders build forts from boxes and existing walls; (c) seekers use ramps to jump into the hiders' shelter; (d) hiders move ramps to the edge of the play area and lock them in place; (e) seekers surf on unlocked boxes to reach the shelter; (f) hiders lock all unused boxes before constructing their fort (from Figure 1 of the ICLR 2020 conference paper).
In another setting, it could be considered that the most impressive accomplishment of AlphaGo [49] is not that it can beat human grandmasters, but that the agent was able to develop never-before-seen play styles and strategies, even in a game invented millennia ago. As examples, this can be observed in the commentary from Go experts from the March 2016 challenge matches between Lee Sedol, one of the best players in the world, and AlphaGo, highlighted in the AlphaGo movie [50]. In some of the cases, commentators first express confusion and/or consider the move to be an error, with the true implications only realised much later; e.g., in the Match 5 commentary: "Are we seeing another short circuit?", "There's no reason for white to be playing that move. It's a bad move...", "We all say some of AlphaGo's moves are so weird and strange, and maybe mistakes. But after a game is finished, we have to doubt ourselves, our judgment", "I think it's important to study more about AlphaGo's mistake-like moves. Then maybe we can adjust our knowledge about Go", "It's playing some moves that are not really necessary". In other cases, the awe and surprise at the move are nearly immediate (e.g., Match 2: "Yeah, that's an exciting move. I think we are seeing an original move here. That is the kind of move that you play Go for."). AlphaGo can plan and consider many more moves ahead than any human, often 50-60 moves. AlphaGo even has the "awareness" that its move would be extremely unlikely to come from a human player (Match 2) and, similarly, had the awareness that Lee Sedol's critical move in Match 4 was also extremely unlikely. Since his matches against AlphaGo, Lee Sedol has retired from Go, shortly after the unveiling of AlphaGo Zero, which was able to beat the earlier AlphaGo (Lee) 100 games to zero [51].
Lee directly cited the AI agents as his reason for retiring: "With the debut of AI in Go games, I've realized that I'm not at the top even if I become the number one through frantic efforts. Even if I become the number one, there is an entity that cannot be defeated" [52]. Whereas AlphaGo was trained on matches from expert players, followed by self-play, AlphaGo Zero began with only the game rules and did not have access to any human experiences. The report on AlphaGo Zero shows how, with continued learning (i.e., hours of training), the agent is able to develop known Go joseki (patterns of moves) and in some cases mature past them and generate new, unrecorded winning joseki (see [51], Extended Data Figures 2 and 3). AlphaGo Zero has since been superseded by AlphaZero, where the AI agent has been generalised to be able to play chess and shogi at expert levels, as well as Go [53].
Like AlphaGo, AlphaStar has also been observed to develop novel strategies not yet explored by human players [8,17]. One of the expert StarCraft players reflected: "I was surprised by how strong the agent was, AlphaStar takes well-known strategies and turns them on their head. The agent demonstrated strategies I hadn't thought of before, which means there may still be new ways of playing the game that we haven't fully explored yet" [19]. In comparison to humans, AlphaStar appears more content to strategically kill a few units and immediately back away, rather than going "all in". More generally, AlphaStar is more decisive about when to engage in an attack, while the opposing humans have more difficulty deciding when to begin a conflict (e.g., back-and-forth movement showing hesitancy in committing to an attack). Watching replays from AlphaStar's perspective, recorded since the camera constraint was introduced, is somewhat jarring, as the agent often rapidly cycles between groups of units in different parts of the map, issuing a specific action command and then changing to another set of units within just a few seconds; i.e., no decision has to be made after the camera is changed, as the action for those units has already been determined and simply was yet to be issued. More recent individual games of AlphaStar also include odd behaviour that seems to be non-adaptive, though like AlphaGo, this may be preparing for situations we cannot identify. For instance, AlphaStar has been observed lifting and moving Terran buildings without an apparent goal and, in other instances, massing units that have limited utility (and are not sensible counters to the player). There are now more variants of AlphaStar playing against players on public servers, whereas the set of available play styles was initially more constrained and could only be played against in invitation-only demonstrations.

Conclusions
The discussion and advancement of AI technology is an on-going topic and ever evolving, with What Computers Can't Do [54] and Rebooting AI [55] serving as comprehensive overviews.
If the intent is to model the full agent capabilities, then AI agents must include, for example, computer vision and robotic actuators to appropriately see and act within the world. For instance, AlphaGo does not actually move the pieces on the board itself, though some are advancing the field in this direction, e.g., the one-handed robot in [39] can solve the Rubik's Cube. If this is not the intended goal and only an aspect of agent behaviour is of interest, then accommodations must also be included to constrain the behaviour to parameters that are meaningful and relevant to the environmental conditions. While this limitation hopefully is considered reasonable, it is not the current norm. Moreover, this approach will also help bring us closer to explainable AI, where humans can understand the inner workings of a "glass box" rather than the "black box" that underlies an AI agent's decision process [56][57][58][59]. For us to learn from AI developments, this is critical.
Some have posited a future where "robot psychologist" is a potential profession [60][61][62][63]. That is, a psychologist for robots, not a robot that acts as a psychologist; the goal here is to understand how robots think. The underlying "thinking" of object detection algorithms can now be somewhat explored using neural feature visualisation and inversion [64][65][66]. Before this hypothetical future can be realised, we need to make considerations about the tractable limitations in our AI; currently, a notable factor in the improvement of machine learning performance can be attributed to the use of increasing amounts of compute time and cores, not just more advanced agent architectures [67,68]. However, maybe, this future is not as distant as one may have previously thought.
Funding: This research received no external funding.

Conflicts of Interest: The author declares no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

AI: Artificial intelligence
APM: Actions per minute
CFOP: Cross, F2L (first two layers), OLL (orient last layer), PLL (permute last layer); Rubik's Cube solving method
RTS: Real-time strategy
TAS: Tool-assisted speed-run