Algorithm for Computing Approximate Nash Equilibrium in Continuous Games with Application to Continuous Blotto

Successful algorithms have been developed for computing Nash equilibrium in a variety of finite game classes. However, solving continuous games -- in which the pure strategy space is (potentially uncountably) infinite -- is far more challenging. Nonetheless, many real-world domains have continuous action spaces, e.g., where actions refer to an amount of time, money, or other resource that is naturally modeled as being real-valued as opposed to integral. We present a new algorithm for approximating Nash equilibrium strategies in continuous games. In addition to two-player zero-sum games, our algorithm also applies to multiplayer games and games with imperfect information. We experiment with our algorithm on a continuous imperfect-information Blotto game, in which two players distribute resources over multiple battlefields. Blotto games have frequently been used to model national security scenarios and have also been applied to electoral competition and auction theory. Experiments show that our algorithm is able to quickly compute close approximations of Nash equilibrium strategies for this game.


Introduction
Successful algorithms have been developed for computing approximate Nash equilibrium strategies in a variety of finite game classes, even classes that are challenging from a computational complexity perspective. For example, an algorithm that was recently applied for approximating Nash equilibrium strategies in six-player no-limit Texas hold'em poker defeated strong human professional players [8]. This is an extremely large extensive-form game of imperfect information. Even solving three-player perfect-information strategic-form games is challenging from a theoretical complexity perspective; it is PPAD-hard to compute a Nash equilibrium in two-player general-sum and multiplayer games, and it is widely believed that no efficient algorithms exist [10,11,12]. Strong algorithms have also been developed for stochastic games, even with multiple players and imperfect information [18]. Stochastic games have potentially infinite duration but a finite number of states and actions.
Continuous games are fundamentally different from finite games in several important ways. The first is that they are not guaranteed to have a Nash equilibrium; Nash's theorem only proved the existence of a Nash equilibrium in finite games [27]. A second challenge is that we may not even be able to represent mixed strategies in continuous games, as they correspond to probability distributions over a potentially (uncountably) infinite pure strategy space. So even if a game has a Nash equilibrium, we may not even be able to represent it, let alone compute it. Equilibrium existence results and algorithms have been developed for certain specialized classes; however, there are still many important game classes for which these results do not hold. Even two-player zero-sum games remain a challenge. For example, the fictitious play algorithm has been proven to converge to Nash equilibrium for finite two-player zero-sum games (and certain classes of multiplayer and nonzero-sum games), but this result does not extend to continuous games [32].
A strategic-form game consists of a finite set of players $N = \{1, \ldots, n\}$, a finite set of pure strategies $S_i$ for each player $i$, and a real-valued utility for each player for each strategy vector (also known as a strategy profile), $u_i : S_1 \times \cdots \times S_n \to \mathbb{R}$. A two-player game is called zero sum if the payoffs sum to zero for every strategy profile, i.e., $u_1(s_1, s_2) + u_2(s_1, s_2) = 0$ for all $s_1 \in S_1$, $s_2 \in S_2$.
A mixed strategy $\sigma_i$ for player $i$ is a probability distribution over pure strategies, where $\sigma_i(s_i)$ is the probability that player $i$ plays $s_i \in S_i$ under $\sigma_i$. Let $\Sigma_i$ denote the full set of mixed strategies for player $i$. A strategy profile $\sigma^* = (\sigma^*_1, \ldots, \sigma^*_n)$ is a Nash equilibrium if
$$u_i(\sigma^*_i, \sigma^*_{-i}) \geq u_i(\sigma_i, \sigma^*_{-i}) \quad \text{for all } \sigma_i \in \Sigma_i \text{ and all } i \in N,$$
where $\sigma^*_{-i}$ denotes the vector of the components of strategy $\sigma^*$ for all players excluding $i$. It is well known that a Nash equilibrium exists in all finite games [27]. In practice, all that we can hope for in many games is the convergence of iterative algorithms to an approximation of Nash equilibrium. For a given candidate strategy profile $\sigma^*$, define
$$\epsilon(\sigma^*) = \max_{i \in N} \max_{\sigma_i \in \Sigma_i} \left[ u_i(\sigma_i, \sigma^*_{-i}) - u_i(\sigma^*_i, \sigma^*_{-i}) \right].$$
The goal is to compute a strategy profile $\sigma^*$ with as small a value of $\epsilon$ as possible (i.e., $\epsilon = 0$ indicates that $\sigma^*$ comprises an exact Nash equilibrium). We say that a strategy profile $\sigma^*$ with value $\epsilon$ constitutes an $\epsilon$-equilibrium. For two-player zero-sum games, there are algorithms with bounds on the value of $\epsilon$ as a function of the number of iterations and game size, and for different variations $\epsilon$ is proven to approach zero in the limit at different worst-case rates (e.g., [20]).
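For a finite two-player game, the approximation quality $\epsilon$ of a candidate profile can be evaluated directly, because each player's best deviation is attained at a pure strategy. The following is a minimal Python sketch under that assumption; the function names and the rock-paper-scissors payoff matrices are illustrative choices and do not come from the paper.

```python
from typing import List

Matrix = List[List[float]]


def expected_utility(payoff: Matrix, sigma1: List[float], sigma2: List[float]) -> float:
    """Expected payoff when player 1 mixes with sigma1 and player 2 with sigma2."""
    return sum(
        sigma1[a] * sigma2[b] * payoff[a][b]
        for a in range(len(sigma1))
        for b in range(len(sigma2))
    )


def epsilon(u1: Matrix, u2: Matrix, sigma1: List[float], sigma2: List[float]) -> float:
    """Largest gain any player can obtain by deviating from (sigma1, sigma2)."""
    n1, n2 = len(sigma1), len(sigma2)
    v1 = expected_utility(u1, sigma1, sigma2)
    v2 = expected_utility(u2, sigma1, sigma2)
    # Utility is linear in a player's own mixed strategy,
    # so it suffices to check pure-strategy deviations.
    best1 = max(sum(sigma2[b] * u1[a][b] for b in range(n2)) for a in range(n1))
    best2 = max(sum(sigma1[a] * u2[a][b] for a in range(n1)) for b in range(n2))
    return max(best1 - v1, best2 - v2)


# Rock-paper-scissors (rows: player 1's action, columns: player 2's action).
RPS_U1 = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]
RPS_U2 = [[-x for x in row] for row in RPS_U1]

uniform = [1 / 3, 1 / 3, 1 / 3]
always_rock = [1.0, 0.0, 0.0]

eps_uniform = epsilon(RPS_U1, RPS_U2, uniform, uniform)        # exact equilibrium
eps_rock = epsilon(RPS_U1, RPS_U2, always_rock, always_rock)   # either player gains by switching to paper
```

In this example the uniform profile has $\epsilon$ equal to zero up to floating-point error, while the profile in which both players always play rock has $\epsilon = 1$.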
If $\sigma^1_i$ and $\sigma^2_i$ are two mixed strategies for player $i$ and $p \in (0, 1)$, then we can consider the mixed strategy $\sigma_i = p\sigma^1_i + (1 - p)\sigma^2_i$ in two different ways. The first interpretation, which is the traditional one, is that $\sigma_i$ is the mixed strategy that plays pure strategy $s_i \in S_i$ with probability $p\sigma^1_i(s_i) + (1 - p)\sigma^2_i(s_i)$. Thus, $\sigma_i$ can be represented as a single mixed strategy vector of length $|S_i|$. A second interpretation is that $\sigma_i$ is the mixed strategy that with probability $p$ selects an action by randomizing according to the probability distribution $\sigma^1_i$, and with probability $1 - p$ selects an action by randomizing according to $\sigma^2_i$. Under this interpretation, implementing $\sigma_i$ requires storing full strategy vectors for both $\sigma^1_i$ and $\sigma^2_i$, though clearly the resulting distribution over actions is the same as in the first case.
In extensive-form imperfect-information games, play proceeds down nodes in a game tree. At each node x, the player function P (x) denotes the player to act at x. This player can be from the finite set N or an additional new player called Chance or Nature. Each player's nodes are partitioned into information sets, where the player cannot distinguish between the nodes at a given information set. Each player has a finite set of available actions at each of the player's nodes (note that the action sets must be identical at all nodes in the same information set because the player cannot distinguish the nodes). When play arrives at a leaf node in the game tree, a terminal real-valued payoff is obtained for each player according to utility function u i . Nash equilibrium existence and computational complexity results from strategic-form games hold similarly for imperfect-information extensive-form games; e.g., all finite games are guaranteed to have a Nash equilibrium, two-player zero-sum games can be solved in polynomial time, and equilibrium computation for other game classes is PPAD-hard.
Randomized strategies can have two different interpretations in extensive-form games. Note that a pure strategy for a player corresponds to a selection of an action for each of that player's information sets. The classic definition of a mixed strategy in an extensive-form game is the same as for strategic-form games: a probability distribution over pure strategies. However, in general the number of pure strategies is exponential in the size of the game tree, so a mixed strategy corresponds to a probability vector of exponential size. By contrast, the concept of a behavioral strategy in an extensive-form game corresponds to a strategy that assigns a probability distribution over the set of possible actions at each of the player's information sets. Since the number of information sets is linear in the size of the game tree, representing a behavioral strategy requires only storing a probability vector of size that is linear in the size of the game tree. Therefore, it is much preferable to work with behavioral strategies than mixed strategies, and algorithms for extensive-form games generally operate on behavioral strategies. Kuhn's theorem states that in any finite extensive-form game with perfect recall, for any player and any mixed strategy, there exists a behavioral strategy that induces the same distribution over terminal nodes as the mixed strategy against all opponent strategy profiles [26]. The converse is also true. Thus, mixed strategies are still functionally equivalent to behavioral strategies, despite the increased complexity of representing them.
Continuous games generalize finite strategic-form games to the case of (uncountably) infinite strategy spaces. Many natural games have an uncountable number of actions; for example, games in which strategies correspond to an amount of time, money, or space. One example of a game that has recently been modeled as a continuous game in the AI literature is computational billiards, in which the strategies are vectors of real numbers corresponding to the orientation, location, and velocity at which to hit the ball [3]. Mixed strategies are Borel probability measures on $S_i$. The existence of a Nash equilibrium for any continuous game with compact strategy spaces and continuous utility functions can be proven using Glicksberg's generalization of the Kakutani fixed point theorem [21]. The result is stated formally in Theorem 1 [14].
Theorem 1. Consider a strategic-form game whose strategy spaces $S_i$ are nonempty compact subsets of a metric space. If the payoff functions $u_i$ are continuous, there exists a (mixed strategy) Nash equilibrium.
In general, there may not be a solution if we allow non-compact strategy spaces or discontinuous utility functions. We can define extensive-form imperfect-information continuous games in the same way as for finite games, with analogous definitions of mixed and behavioral strategies.
While this existence result has been around for a long time, there has been very little work on practical algorithms for computing equilibria in continuous games. One interesting class of continuous games for which algorithms have been developed is separable games [34]; however, this imposes a significant restriction on the utility functions, and many interesting continuous games are not separable. Additionally, algorithms for computing approximate equilibria have been developed for several other classes of continuous games, including simulation-based games [35], graphical tree-games [33], and continuous poker models [19]. The continuous Blotto game that we consider does not fit in any of these classes, and in fact has discontinuous utility functions, so we cannot apply Theorem 1 or these algorithms.

Continuous Blotto Game
The Blotto game is a type of two-player zero-sum game in which the players are tasked to simultaneously distribute limited resources over several objects (or battlefields). In the classic version of the game, the player devoting the most resources to a battlefield wins that battlefield and the gain (or payoff) is then equal to the total number of battlefields won. The Blotto game was first proposed and solved by Borel in 1921 [5] and has been frequently applied to national security scenarios. It has also been applied as a metaphor for electoral competition, with two political parties devoting money or resources to attract the support of a fixed number of voters: each voter is a "battlefield" that can be won by one party. The game also finds application in auction theory where bidders must make simultaneous bids [36].
Initial approaches derived analytical solutions for special cases of the general problem. Borel and Ville proposed the first solution for three battlefields [6], and Gross and Wagner generalized this result for any number of battlefields [22]. However, they assumed that colonels have the same number of troops. Roberson computed optimal strategies for the continuous version of the Blotto game in which all of the battlefields have the same weight, for models with both symmetric and asymmetric budgets [31]. Hart considered the discrete version, again when all battlefields have equal weight, and solved it for certain special cases [24]. It was not until 2016 that the first algorithm was provided to solve the general version of the game. Initially a polynomial-time algorithm that involved solving exponential-sized linear programs was presented [2], which was later improved to a linear program of polynomial size [4]. These polynomial-time algorithms are for the discrete version of the game; however, no general algorithm has been devised for the original continuous Blotto game. As described earlier, there are many challenges present for solving continuous games that do not exist for finite games, even for two-player zero-sum games.
Most of the prior approaches solve perfect-information versions of the game in which all players have public knowledge of the values of the battlefields. Adamo and Matros studied a Blotto game in which players have incomplete information about the other player's resource budgets [1]. Kovenock and Roberson studied a model where the players are subject to incomplete information about the battlefield valuations [25]. In both of these works, all players are equally uninformed about the parameters. Recently some work has provided analytical solutions for certain settings with asymmetric information, in which both players know the values of the battlefields but one player knows their order while the other player only knows a distribution over the possible orders [28,9]. This model is an imperfect-information game in which player 1 must select a strategy without knowing the order, while player 2 can select a different mixed strategy conditional on the actual order. We study and present an algorithm for the asymmetric imperfect-information continuous version of the Blotto game, which is perhaps the most challenging variant. Note that our approach also applies to the perfect-information version.
The components of the game include the following:
• Probability mass function $p$ over outcomes, where $p(o)$ is the probability of outcome $o \in O$
• Pure strategy space $S_1$ of player 1. Let $s_1(q)$ denote the amount of resources placed on slot $q$ for $s_1 \in S_1$.
• Pure strategy space $S_2$ of player 2. Let $s_2(o, q)$ denote the amount of resources placed on slot $q$ under outcome $o$ for $s_2 \in S_2$.
• $\delta \in \mathbb{R}_{>0}$
• Utility function $u_1(s_1, s_2) = \sum_{o \in O} p(o) \sum_{q \in Q} C_1(s_1(q), s_2(o, q))$ for $s_1 \in S_1$, $s_2 \in S_2$, where
$$C_1(s_1(q), s_2(o, q)) = \begin{cases} v_{o(q)} & \text{if } s_1(q) \geq s_2(o, q) + \delta \\ -v_{o(q)} & \text{if } s_2(o, q) \geq s_1(q) \\ 0 & \text{otherwise} \end{cases}$$
Each player must select a real-valued amount of resources to put on the battlefield in slot $q \in Q$, subject to the constraint that the total does not exceed the player's budget $B_i$. Player 1 does not know the outcome $o$, which defines the order of the battlefields; they only know that the outcome is $o \in O$ with probability $p(o)$. Player 2 knows the order and is able to condition their strategy on this additional information. For each slot $q$, if player 1 uses an amount of resources $s_1(q)$ that exceeds player 2's amount $s_2(o, q)$ by at least $\delta$, then player 1 "wins" the battlefield $o(q)$ in slot $q$ and receives its value $v_{o(q)}$ (and player 2 receives $-v_{o(q)}$); if $s_2(o, q) \geq s_1(q)$ then player 2 wins $v_{o(q)}$ and player 1 loses $v_{o(q)}$; otherwise, both players get zero. This game is clearly zero sum because the payoffs of player 1 and player 2 sum to zero in every situation.
Note that the utility function is discontinuous: payoffs for a given slot can shift abruptly between $v_{o(q)}$, $0$, and $-v_{o(q)}$ with arbitrarily small changes in the strategies. This means that Theorem 1 does not apply, and the game is not guaranteed to have a Nash equilibrium. The game also does not fall into the specialized classes, such as separable games, for which prior algorithms have been developed. Note that the Blotto game is often presented without the $\delta$ term; typically player 1 wins the battlefield if $s_1(q) > s_2(o, q)$, and player 2 wins if $s_2(o, q) \geq s_1(q)$. We add the $\delta$ term because our algorithm involves the invocation of an optimization solver, and optimization algorithms typically cannot handle strict inequalities. We can set $\delta$ to a value very close to zero.
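The payoff rule for a single slot, and the resulting utility for player 1, can be sketched in a few lines of Python. The representation below is an assumption made for illustration: an outcome is a tuple mapping each slot index to a battlefield index, player 2's strategy is a dictionary from outcomes to allocations, and the two-battlefield instance is hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Outcome = Tuple[int, ...]  # outcome[q] is the index of the battlefield placed in slot q


def slot_payoff(x: float, y: float, value: float, delta: float) -> float:
    """Player 1's payoff in one slot, given allocations x (player 1) and y (player 2)."""
    if x >= y + delta:
        return value       # player 1 wins the battlefield
    if y >= x:
        return -value      # player 2 wins the battlefield
    return 0.0             # 0 < x - y < delta: neither player wins


def blotto_u1(
    s1: Sequence[float],
    s2: Dict[Outcome, Sequence[float]],
    outcomes: List[Outcome],
    p: List[float],
    values: List[float],
    delta: float,
) -> float:
    """Expected utility of player 1 over the outcome distribution p."""
    total = 0.0
    for o, prob in zip(outcomes, p):
        for q, battlefield in enumerate(o):
            total += prob * slot_payoff(s1[q], s2[o][q], values[battlefield], delta)
    return total


# Hypothetical instance: two battlefields worth 1 and 2, two equally likely orders.
values = [1.0, 2.0]
outcomes = [(0, 1), (1, 0)]
p = [0.5, 0.5]
delta = 1e-4
s1 = (1.5, 1.5)                                   # player 1 cannot condition on the order
s2 = {(0, 1): (0.0, 2.0), (1, 0): (2.0, 0.0)}     # player 2 puts everything on the valuable battlefield
u1 = blotto_u1(s1, s2, outcomes, p, values, delta)
```

Here player 1 wins the battlefield worth 1 and loses the battlefield worth 2 under both orders, so `u1` evaluates to -1. Calling `slot_payoff` with player 1's amount slightly below, inside, and above the $\delta$ margin shows the three payoff levels that make the utility discontinuous.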

Algorithm
Fictitious play is an iterative algorithm that is proven to converge to Nash equilibrium in two-player zero-sum games (and in certain other game classes), though not in general for multiplayer or non-zero-sum games [7,32]. While it is not guaranteed to converge in multiplayer games, it has been proven that if it does converge, then the average of the strategies played throughout the iterations constitutes an equilibrium [13]. Fictitious play has been successfully applied to approximate Nash equilibrium strategies in a three-player poker tournament to a small degree of approximation error [17,18]. More recently, fictitious play has also been used to approximate equilibrium strategies in multiplayer auction [29,30] and national security [16] scenarios. Fictitious play has been demonstrated to outperform another popular iterative algorithm, counterfactual regret minimization, in convergence to equilibrium in a range of multiplayer game classes [15].
In classical fictitious play, each player plays a best response to the average strategies of their opponents thus far. Strategies are initialized arbitrarily (typically they are initialized to be uniformly random). Then each player uses the following rule to obtain the average strategy at time $t$:
$$\sigma^t_i = \left(1 - \frac{1}{t}\right) \sigma^{t-1}_i + \frac{1}{t} \sigma'^{t}_i,$$
where $\sigma'^{t}_i$ is a best response of player $i$ to the profile $\sigma^{t-1}_{-i}$ of the other players played at time $t - 1$. The final strategy output after $T$ iterations, $\sigma^T$, is the average of the strategies played in the individual iterations (while the best response $\sigma'^{t}_i$ is the strategy actually played at iteration $t$).
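The averaging rule can be illustrated on a small finite game. The sketch below assumes a two-player zero-sum matrix game and breaks best-response ties by taking the lowest index; the function names and the rock-paper-scissors example are illustrative and not part of the paper.

```python
from typing import List, Tuple


def best_response(payoff_rows: List[List[float]], opponent_mix: List[float]) -> int:
    """Index of a pure strategy maximizing expected payoff against opponent_mix."""
    scores = [sum(p * u for p, u in zip(opponent_mix, row)) for row in payoff_rows]
    return max(range(len(scores)), key=scores.__getitem__)


def fictitious_play(u1: List[List[float]], iterations: int) -> Tuple[List[float], List[float]]:
    """Classical fictitious play; returns the average strategies of both players."""
    n1, n2 = len(u1), len(u1[0])
    # Player 2's payoffs indexed [own action][player 1's action] (zero-sum assumption).
    u2 = [[-u1[a][b] for a in range(n1)] for b in range(n2)]
    sigma1 = [1.0 / n1] * n1
    sigma2 = [1.0 / n2] * n2
    for t in range(1, iterations + 1):
        br1 = best_response(u1, sigma2)
        br2 = best_response(u2, sigma1)
        w = 1.0 / t
        # sigma^t = (1 - 1/t) * sigma^{t-1} + (1/t) * (pure best response)
        sigma1 = [(1 - w) * prob + (w if a == br1 else 0.0) for a, prob in enumerate(sigma1)]
        sigma2 = [(1 - w) * prob + (w if b == br2 else 0.0) for b, prob in enumerate(sigma2)]
    return sigma1, sigma2


def exploitability(u1: List[List[float]], sigma1: List[float], sigma2: List[float]) -> float:
    """Maximum gain from a unilateral deviation in a zero-sum matrix game."""
    n1, n2 = len(u1), len(u1[0])
    value = sum(sigma1[a] * sigma2[b] * u1[a][b] for a in range(n1) for b in range(n2))
    gain1 = max(sum(sigma2[b] * u1[a][b] for b in range(n2)) for a in range(n1)) - value
    gain2 = max(-sum(sigma1[a] * u1[a][b] for a in range(n1)) for b in range(n2)) + value
    return max(gain1, gain2)


RPS = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]
avg1, avg2 = fictitious_play(RPS, 2000)
eps = exploitability(RPS, avg1, avg2)
```

Only the running averages `sigma1` and `sigma2` are kept in memory, which is what makes this representation problematic once a mixed strategy can no longer be stored as a finite vector.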
The classical version of fictitious play involves representing two strategies per player: the current strategy $\sigma^t_i$ and the current best response $\sigma'^{t}_i$. Note that once we compute the next-round strategy $\sigma^{t+1}_i$ from $\sigma^t_i$ and $\sigma'^{t+1}_i$, we no longer need to maintain either $\sigma^t_i$ or $\sigma'^{t+1}_i$ in memory. We interpret $\sigma^t_i$ as a single mixed strategy that selects action $s_j$ with probability $\left(1 - \frac{1}{t}\right)\sigma^{t-1}_i(s_j) + \frac{1}{t}\sigma'^{t}_i(s_j)$. An alternative, and seemingly nonsensical, way to implement fictitious play would be to separately store each of the pure strategies $\sigma'^{t}_i$ that are played, rather than to explicitly average them at each step. Using this representation, the best response can be computed by selecting the pure strategy that maximizes the average (or sum) of the utilities against $\sigma'^{0}_{-i}, \ldots, \sigma'^{t-1}_{-i}$. This method of implementing fictitious play seems nonsensical for several reasons. First, it involves picking a strategy that maximizes the sum of utilities against $t$ different opponent strategies as opposed to maximizing the utility against a single strategy. Second, it involves storing $t$ pure strategies for each player, which would require significantly more memory than the original approach when $t$ exceeds $|S_i|$. Despite these clear drawbacks, this approach is equivalent to the original approach and results in the same sequence of strategies being played. When the algorithm is applied to an imperfect-information game, we can view it as operating with mixed as opposed to behavioral strategies (in contrast to prior algorithms for solving imperfect-information games). We refer to this new approach as "Redundant fictitious play" due to the fact that it "redundantly" stores all of the strategies played individually instead of storing them as a single mixed strategy. Redundant fictitious play is depicted in Algorithm 1.

Algorithm 1 Redundant fictitious play for two-player games
In Algorithm 1, we store $T$ strategies for each player, where $T$ is the total number of iterations. We can initialize strategies arbitrarily for the first iteration (e.g., to uniform random). For all subsequent iterations the strategy $S_i[t]$ is a pure strategy best response to a strategy of the opponent. The notation $\mathrm{Mix}(S_i, 0, t-1)$ refers to the mixed strategy for player $i$ that plays strategy $S_i[u]$ with probability $\frac{1}{t}$, for $0 \leq u \leq t - 1$; that is, it mixes uniformly over the strategies $S_i[0], \ldots, S_i[t-1]$. The algorithm then computes the game value to player $i$ under the current iteration strategies as well as the exploitability of each player (the difference between best response payoff and game value). This determines the maximum amount that each player can gain by deviating from the strategies; we can then say that the strategies computed at iteration $t - 1$ constitute an $\epsilon[t]$-equilibrium, where $\epsilon[t] = \max_i \epsilon_i[t]$.
Now, suppose that $G$ is a continuous game and no longer a finite game. Assuming that we initialize the strategies $S_i[0]$ to be pure strategies, all of the strategies $S_i[t]$ are now pure strategies and the algorithm does not need to represent any mixed strategies. This is very useful, since for continuous games a mixed strategy may be a probability distribution that puts weight on infinitely many pure strategies and cannot be compactly represented. However, pure strategies can typically be represented compactly in continuous games. For example, if the strategy spaces are compact subsets of $\mathbb{R}^n$, then each pure strategy corresponds to a vector of $n$ real numbers, which can be easily represented assuming that $n$ is not too large. For example, in continuous Blotto player 1 must select an amount of resources to use for each of $|F|$ battlefields, and therefore storing a pure strategy requires storing $|F|$ real numbers, which is easy to do. Thus, Redundant Fictitious Play can be feasibly applied to continuous games, while the classical version cannot.
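The procedure described above can be sketched in Python as follows. This is an illustrative reading of the description, not the paper's listing: the function signature, the callable best-response oracles, and the rock-paper-scissors demonstration are all assumptions. Pure strategies are treated as opaque objects, so the same loop would apply if each stored strategy were a vector of real numbers and the oracles were optimization routines.

```python
from typing import Callable, List, Tuple, TypeVar

S = TypeVar("S")  # type of a pure strategy (an index, a vector of reals, ...)


def redundant_fictitious_play(
    u1: Callable[[S, S], float],
    u2: Callable[[S, S], float],
    best_response_1: Callable[[List[S]], S],
    best_response_2: Callable[[List[S]], S],
    init_1: S,
    init_2: S,
    iterations: int,
) -> Tuple[List[S], List[S], List[float]]:
    """Store every pure strategy played instead of averaging them into one mixed strategy."""
    S1, S2 = [init_1], [init_2]
    eps_history = []
    for t in range(1, iterations + 1):
        # Pure best responses to the uniform mixtures over the opponent's t stored strategies.
        br1 = best_response_1(S2)
        br2 = best_response_2(S1)
        # Values of the current uniform mixtures to each player.
        v1 = sum(u1(a, b) for a in S1 for b in S2) / (t * t)
        v2 = sum(u2(a, b) for a in S1 for b in S2) / (t * t)
        # Exploitability: best-response payoff minus current value.
        eps1 = sum(u1(br1, b) for b in S2) / t - v1
        eps2 = sum(u2(a, br2) for a in S1) / t - v2
        eps_history.append(max(eps1, eps2))
        S1.append(br1)
        S2.append(br2)
    return S1, S2, eps_history


# Demonstration on rock-paper-scissors, with pure strategies stored as indices 0, 1, 2.
RPS = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]


def rps_u1(a: int, b: int) -> float:
    return RPS[a][b]


def rps_u2(a: int, b: int) -> float:
    return -RPS[a][b]


def rps_br1(opponent: List[int]) -> int:
    # Maximizing the sum of utilities is equivalent to maximizing the average.
    return max(range(3), key=lambda a: sum(rps_u1(a, b) for b in opponent))


def rps_br2(opponent: List[int]) -> int:
    return max(range(3), key=lambda b: sum(rps_u2(a, b) for a in opponent))


S1, S2, eps_history = redundant_fictitious_play(rps_u1, rps_u2, rps_br1, rps_br2, 0, 0, 100)
```

The loop never forms a probability vector over the strategy space; the only mixed strategies it uses are uniform mixtures over the stored lists, which are represented implicitly by the lists themselves.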
The only remaining challenge for continuous games is the best response computation, which may be challenging for certain complex utility functions. However, for the common assumptions that the pure strategy spaces are compact and the utility functions are continuous, this optimization is typically feasible to compute.
For the continuous Blotto game, we present optimization formulations for computing player 1 and 2's best response below. Both of these are mixed integer linear programs (with a polynomial number of variables and constraints). Note that we are able to construct efficient best response procedures for this game despite the fact that the utility function is discontinuous.
Player 1's best response function is the following, where $X_q$ is a variable denoting the amount of resources put on slot $q$, and $Y_{t,o,q}$ is the amount of resources put on slot $q$ under outcome $o$ by player 2's fixed strategy at iteration $t$:
$$\max \; \sum_{t=0}^{T-1} \sum_{o \in O} \sum_{q \in Q} p(o) \cdot b_{t,o,q} \cdot v_{o(q)}$$
subject to
$$b_{t,o,q} = 1 \;\rightarrow\; X_q \geq Y_{t,o,q} + \delta \quad \text{for all } t, o, q \qquad (1)$$
$$\sum_{q \in Q} X_q = B_1$$
$$X_q \geq 0 \text{ for all } q, \qquad b_{t,o,q} \in \{0, 1\} \text{ for all } t, o, q$$
The constraints in Equation (1) are called indicator constraints and state that if the binary variable $b_{t,o,q}$ has value equal to 1, then the linear constraint $X_q \geq Y_{t,o,q} + \delta$ must hold. Indicator constraints are supported by many integer-linear program optimization solvers, such as CPLEX and Gurobi. We could additionally impose indicator constraints $b_{t,o,q} = 0 \rightarrow X_q \leq Y_{t,o,q}$; however, these are unnecessary and would significantly increase the size of the problem. To see the correctness of the procedure, suppose that $X_q \geq Y_{t,o,q} + \delta$ and $\sum_q X_q = B_1$ but that $b_{t,o,q} = 0$. Then the objective clearly increases by setting $b_{t,o,q} = 1$ instead to include the additional term $p(o) \cdot b_{t,o,q} \cdot v_{o(q)}$. Hence in any optimal solution $b_{t,o,q} = 1$ whenever $X_q \geq Y_{t,o,q} + \delta$, and there cannot exist another solution satisfying the budget and indicator constraints with higher objective value.
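On very small instances, the output of such a best response optimization can be cross-checked by exhaustive enumeration instead of a mixed-integer solver. The sketch below is not the paper's method. It relies on the observation that, for an objective that only rewards slots where player 1's amount reaches $Y_{t,o,q} + \delta$, some optimal allocation uses only amounts of the form $0$ or $Y_{t,o,q} + \delta$, and it assumes leftover budget may remain unspent. All names, the instance, and the value $\delta = 0.125$ (chosen so the arithmetic is exact in floating point) are hypothetical.

```python
from itertools import product
from typing import Dict, List, Sequence, Tuple

Outcome = Tuple[int, ...]  # outcome[q] is the index of the battlefield placed in slot q


def best_response_p1(
    opp_strategies: List[Dict[Outcome, Sequence[float]]],
    outcomes: List[Outcome],
    p: List[float],
    values: List[float],
    budget: float,
    delta: float,
) -> Tuple[Tuple[float, ...], float]:
    """Brute-force maximizer of the summed value of slots won against all stored opponent strategies."""
    num_slots = len(outcomes[0])
    # Candidate amounts per slot: zero, or just enough to win against some stored allocation.
    candidates = []
    for q in range(num_slots):
        thresholds = {0.0}
        for s2 in opp_strategies:
            for o in outcomes:
                thresholds.add(s2[o][q] + delta)
        candidates.append(sorted(thresholds))

    best_x, best_obj = None, float("-inf")
    for x in product(*candidates):
        if sum(x) > budget:
            continue  # violates the budget constraint
        obj = sum(
            prob * values[o[q]]
            for s2 in opp_strategies
            for o, prob in zip(outcomes, p)
            for q in range(num_slots)
            if x[q] >= s2[o][q] + delta  # same condition as the indicator constraint
        )
        if obj > best_obj:
            best_x, best_obj = x, obj
    return best_x, best_obj


# Hypothetical instance: two battlefields worth 1 and 2, two equally likely orders.
values = [1.0, 2.0]
outcomes = [(0, 1), (1, 0)]
p = [0.5, 0.5]
stored_p2 = [{(0, 1): (0.0, 2.0), (1, 0): (2.0, 0.0)}]
best_x, best_obj = best_response_p1(stored_p2, outcomes, p, values, budget=3.0, delta=0.125)
```

The number of candidate allocations grows exponentially in the number of slots, so this check is only practical for toy instances.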
While player 1 must assume that the outcome is distributed according to p, player 2 is aware of the outcome and therefore can condition their strategy on it. Therefore, player 2 solves a separate optimization for each value of o ∈ O to compute the best response to the strategy of player 1.
Player 2's best response function given outcome $o \in O$ is defined analogously, where $Y_q$ is a variable denoting the amount of resources put on slot $q$ and $X_{t,q}$ is the amount of resources put on slot $q$ according to player 1's fixed strategy at iteration $t$. Correctness of player 2's best response function follows by similar reasoning to that of player 1's.
Player 1's best response optimization has $TM|Q|$ binary variables $b_{t,o,q}$, where $T$ is the current algorithm iteration and $M = |O|$ denotes the number of outcomes, and $|Q|$ continuous variables $X_q$. Since the number of indicator constraints is also $TM|Q|$, the size of the formulation is $O(TM|Q|) = O(TM|F|)$, which is polynomial in all of the input parameters. Similarly, player 2 must solve $M$ optimizations, each one with size $O(T|Q|)$. Note that in practice this algorithm could be parallelized by solving each of these $M + 1$ optimizations simultaneously on separate cores as opposed to solving them sequentially (in our implementation we solve them sequentially). However, since player 1's optimization is much larger than each of player 2's, the bottleneck step is player 1's optimization, and such a parallelization may not provide a significant reduction in the runtime.
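By analogy with the indicator-constraint description of player 1's problem, one plausible form of player 2's program for a fixed outcome $o$ is sketched below. This is an assumption rather than the paper's stated formulation: the binary variables $b_{t,q}$, the use of the tie rule $s_2(o, q) \geq s_1(q)$ without a $\delta$ margin, and the equality form of the budget constraint are all inferred from the game definition.

```latex
\begin{align*}
\max \quad & \sum_{t=0}^{T-1} \sum_{q \in Q} b_{t,q} \cdot v_{o(q)} \\
\text{subject to} \quad & b_{t,q} = 1 \;\rightarrow\; Y_q \geq X_{t,q} && \text{for all } t, q \\
& \sum_{q \in Q} Y_q = B_2 \\
& Y_q \geq 0 && \text{for all } q \\
& b_{t,q} \in \{0, 1\} && \text{for all } t, q
\end{align*}
```

Under this reading each of the $M$ programs has $T|Q|$ binary variables and $T|Q|$ indicator constraints, which agrees with the $O(T|Q|)$ size stated above.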
Note that as we run successive iterations of Algorithm 1, the size of these optimization problems becomes larger, since the opponent's strategy is a mixture over $t$ pure strategies, where $t$ is the current algorithm iteration. We have seen that the number of variables and constraints scales linearly in $t$. Therefore, we expect earlier iterations of the algorithm to run significantly faster than later iterations. We will see the exact magnitude of this disparity in the experiments in Section 4. A potential solution to this issue would be to include an additional parameter $K$ in Algorithm 1. Instead of computing a best response to the mixture over all $t$ of the opponent's pure strategies, a subset of $K$ of them is selected by sampling and a best response is computed just to a uniform mixture over the pure strategies in the sampled subset. This sampling would occur at each iteration, so a potentially different subset of size $K$ would be selected each time. This would ensure that the complexity of the best response computations remains constant over all iterations and does not become intractable for later iterations. This approach would be unbiased and would produce the same result in expectation over the sampling outcomes. However, it may lead to high variance in results and to poor convergence in practice. Perhaps this could be mitigated by performing multiple runs of the sampling algorithm in parallel and selecting the run with the lowest value of $\epsilon$.
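The sampling step of this proposed variant is straightforward to sketch. The helper below is hypothetical (the paper does not implement sampling in its experiments); it draws $K$ stored strategies uniformly without replacement, and falls back to the full list while fewer than $K$ strategies have been stored.

```python
import random
from typing import List, Sequence, TypeVar

S = TypeVar("S")  # type of a stored pure strategy


def sample_opponent_subset(strategies: Sequence[S], k: int, rng: random.Random) -> List[S]:
    """Return k of the opponent's stored pure strategies, chosen uniformly without replacement."""
    if k >= len(strategies):
        return list(strategies)  # early iterations: fewer than k strategies stored so far
    return rng.sample(list(strategies), k)


# Hypothetical stored strategies: 50 two-slot allocations for one player.
stored = [(10.0 - 0.1 * i, 0.1 * i) for i in range(50)]
rng = random.Random(0)  # fixed seed so the run is reproducible

subset = sample_opponent_subset(stored, 5, rng)
early_subset = sample_opponent_subset(stored[:3], 5, rng)
```

A best response would then be computed against a uniform mixture over `subset` rather than over all of `stored`, so the size of each optimization depends on $K$ and not on the iteration number.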
Note that Algorithm 1 can be applied to extensive-form imperfect-information games in addition to simultaneous strategic-form games (in fact the continuous Blotto game that we apply it to has imperfect information for player 1, since player 1 does not know the value of o while player 2 does). As long as pure strategies can be represented and best responses can be computed efficiently (which are both the case for imperfect-information games), the algorithm can be applied. Also note that while we presented the algorithm just for a two-player game, it can also be run on multiplayer games (just as for standard fictitious play). The best response computations are still just a single agent optimization problem given fixed strategies for the opposing players. In fact, fictitious play has been demonstrated to obtain successful convergence to Nash equilibrium in a variety of multiplayer settings [15], despite the fact that it is not guaranteed to converge to Nash equilibrium in general for games that are not two-player zero-sum.
We can compute $v^*_1[t]$ and $\epsilon_1[t]$ for Algorithm 1 in the continuous Blotto game using the procedures depicted in Algorithms 2 and 3 (and analogously for $v^*_2[t]$ and $\epsilon_2[t]$).
Algorithm 3 Procedure to compute $\epsilon_1[t]$ in continuous Blotto
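One way such quantities could be computed from the stored pure strategies is sketched below. It reads $v^*_1[t]$ as the payoff of player 1's best response against the uniform mixture over player 2's stored strategies and $\epsilon_1[t]$ as that payoff minus the value of the current mixtures; this reading, the function names, and the small instance are assumptions. The best response itself is taken as given (for example, from an optimization solver), and $\delta = 0.125$ is used only to keep the arithmetic exact.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Outcome = Tuple[int, ...]  # outcome[q] is the index of the battlefield placed in slot q
P2Strategy = Dict[Outcome, Sequence[float]]

VALUES = [1.0, 2.0]
OUTCOMES = [(0, 1), (1, 0)]
P = [0.5, 0.5]
DELTA = 0.125


def blotto_u1(s1: Sequence[float], s2: P2Strategy) -> float:
    """Player 1's expected utility for one pair of pure strategies."""
    total = 0.0
    for o, prob in zip(OUTCOMES, P):
        for q, battlefield in enumerate(o):
            if s1[q] >= s2[o][q] + DELTA:
                total += prob * VALUES[battlefield]   # player 1 wins the slot
            elif s2[o][q] >= s1[q]:
                total -= prob * VALUES[battlefield]   # player 2 wins the slot
    return total


def mixture_value(S1: List[Sequence[float]], S2: List[P2Strategy],
                  utility: Callable[[Sequence[float], P2Strategy], float]) -> float:
    """Value to player 1 when both players mix uniformly over their stored strategies."""
    return sum(utility(a, b) for a in S1 for b in S2) / (len(S1) * len(S2))


def deviation_value(candidate: Sequence[float], S2: List[P2Strategy],
                    utility: Callable[[Sequence[float], P2Strategy], float]) -> float:
    """Payoff of a single pure strategy against the uniform mixture over S2."""
    return sum(utility(candidate, b) for b in S2) / len(S2)


S1 = [(1.5, 1.5)]
S2 = [{(0, 1): (0.0, 2.0), (1, 0): (2.0, 0.0)}]
candidate_br = (0.125, 2.125)   # hand-picked deviation standing in for a solver's output

v1 = mixture_value(S1, S2, blotto_u1)
v1_star = deviation_value(candidate_br, S2, blotto_u1)
eps1 = v1_star - v1
```

For this instance the stored strategies give player 1 a value of -1, the deviation earns 1, and the resulting gain is 2.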

Experiments
We assume that player 2 observes the outcome while player 1 does not. We used a budget $B_1 = 10$ for player 1 and $B_2 = 7$ for player 2. We used $\delta = 0.0001$. We used the default feasibility tolerance in Gurobi, which is $1.0 \times 10^{-6}$. We ran our algorithm for 5000 iterations and computed $\epsilon_i$ for each player every 10 iterations. Recall that we defined the exploitability of the computed strategies at iteration $t$ as $\epsilon[t] = \max_i \epsilon_i[t]$. The experiments did not use any sampling and computed the best response against the opponent's full mixed strategy at each iteration using the mixed-integer linear programs described in Section 3. We used the parallel version of Gurobi's mixed integer linear programming solver [23] with six cores on a laptop.
The results are shown in Figure 1. It took slightly under 25,000 seconds (around 6.9 hours) to run 5000 iterations of our algorithm. The final strategies had an exploitability of 0.0307 for player 1 and 0.0292 for player 2, indicating that the strategies constitute an $\epsilon$-equilibrium for $\epsilon = 0.0307$. (After 5000 additional iterations, $\epsilon$ decreased further to 0.021.) The exploitability values are not monotonically decreasing, and the lowest value in these experiments was actually obtained with $\epsilon = 0.0259$ at iteration 4480. The expected value for player 1 in the final strategies is $-0.10969$. The exploitability fell below 0.05 for the first time after 1759.4 seconds (29.3 minutes), obtaining $\epsilon = 0.0494$ on iteration 1400. From the figure we can also see that the runtimes varied for the different iterations, as expected (nearly half of the 5000 iterations were completed in the first 5000 seconds).

Conclusion
We presented a new algorithm for computing Nash equilibrium in a broad class of continuous games. The algorithm is based on integrating a novel variant of fictitious play in which the strategies from all iterations are stored with custom best response functions. Solving continuous games is particularly challenging as a Nash equilibrium is not even guaranteed to exist and mixed strategies may put weight on infinitely many pure strategies; yet for many realistic games it is more natural to model strategies as subsets of real numbers than as integers. We implemented our algorithm on a continuous imperfect-information model of the Blotto game, a well-studied model of resource allocation with applications to national security. We created a new mixed-integer linear program formulation for the best response function. We demonstrated that the algorithm converged quickly to an $\epsilon$-equilibrium with $\epsilon$ equal to 0.03 after 5000 iterations of the algorithm (several hours), which corresponds to 30% of the minimum battlefield value. While the Blotto game has been studied analytically and efficient algorithms have been developed for the discrete case, this is the first algorithm for solving the continuous case.
"A" (Approved for Public Release, Distribution Unlimited)