A Two-Player Resource-Sharing Game with Asymmetric Information

This paper considers a two-player game where each player chooses a resource from a finite collection of options. Each resource brings a random reward. Both players have statistical information regarding the rewards of each resource. Additionally, there exists an information asymmetry where each player knows the reward realizations of a different subset of the resources. If both players choose the same resource, the reward is divided equally between them, whereas if they choose different resources, each player gains the full reward of its chosen resource. We first implement the iterative best response algorithm to find an $\epsilon$-approximate Nash equilibrium for this game. This method of finding a Nash equilibrium may not be desirable when players do not trust each other and place no assumptions on the incentives of the opponent. To handle this case, we solve the problem of maximizing the worst-case expected utility of the first player. The solution leads to counter-intuitive insights in certain special cases. To solve the general version of the problem, we develop an efficient algorithmic solution that combines online convex optimization and the drift-plus-penalty technique.

W = (W_1, W_2, ..., W_n)^⊤, where W_k is the reward random variable of resource k. We assume the W_k to be independent random variables for 1 ≤ k ≤ n, taking non-negative real values. If both players choose the same resource k, each gets a utility of W_k/2. If they choose different resources k and l, they receive utilities of W_k and W_l, respectively. It is assumed that the mean and the variance of W_k exist and are finite for each 1 ≤ k ≤ n. Both players know the distribution of W. Our formulation allows for an information asymmetry between the players. In particular, {1, 2, . . ., n} can be partitioned into four sets {A, B, C, AB}, where only player A observes the realizations of W_k for k ∈ A, only player B observes the realizations of W_k for k ∈ B, no player observes the realizations of W_k for k ∈ C, and both players observe the realizations of W_k for k ∈ AB.

This game can be used to model different real-world scenarios where the agents have asymmetric information regarding the involved information structure. One classic example is the problem of Multiple-Access Control (MAC) in communication systems. Here, communication channels are accessed by multiple users, and the data rate of a channel is shared amongst the users who select it [1]. A channel can be shared using Time Division Multiple Access (TDMA) or Frequency Division Multiple Access (FDMA): in TDMA, the channel is time-shared among the users [2], [3], whereas in FDMA, the channel is frequency-shared among the users [4].
In both cases, the total data rate supported by the channel can be considered the utility of the channel. The problem of information asymmetry arises since a user might have precise information regarding the total data rate offered by some channels but not others, and the known channels can be different for different users. On the other hand, the users in such a system cannot be trusted, since the system may have malicious users (for instance, jammers) who focus on reducing the data rate available to genuine users.
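As a concrete illustration of the sharing rule, the following Python sketch simulates one round of the game. The function name and the exponential reward distribution (which matches the simulation setup in Section VI) are our illustrative choices:

```python
import random

def play_round(means, choice_a, choice_b, rng=None):
    """Simulate one round of the game. Rewards are drawn independently
    (here exponential with the given means, as in the Section VI
    simulations); a resource chosen by both players has its realized
    reward split equally."""
    rng = rng or random.Random()
    rewards = [rng.expovariate(1.0 / m) for m in means]
    if choice_a == choice_b:
        return rewards[choice_a] / 2, rewards[choice_a] / 2
    return rewards[choice_a], rewards[choice_b]

# If both players pick resource 0, each receives half its realized reward.
u_a, u_b = play_round([2.0, 1.0], 0, 0, rng=random.Random(0))
assert u_a == u_b
```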
Modified versions of this game apply to problems in economics. For instance, consider a firm that chooses a market to enter from a pool of market options. The chosen market may also be chosen by another firm. The reward of a market is the revenue it brings. Assume a simplified model where there exists a total revenue for each market, and the total revenue is divided equally among the firms entering the market. A reward known to all firms can be considered public information, while a reward known only to one firm is private information of that firm. The game defined above can be viewed as a stochastic version of the class of games defined in [5], which are resource-sharing games, also known as congestion games. In resource-sharing games, players compete for a finite collection of resources. In a single turn of the game, each player is allowed to select a subset of the collection of resources, where the allowed subsets make up the action space of the player. Each resource offers a reward to each player who selected the particular resource, where the reward offered depends on the number of players who selected it.
The relationship between the reward offered to a player by a resource and the number of users selecting it is captured by the reward function of the resource. A player's utility is equal to the sum of the rewards offered by the resources in the subset selected by the player. In [5], it is established that the above game has a pure-strategy (deterministic) Nash equilibrium.
Although in the classical setting these games ignore the stochastic nature of the rewards offered by the resources, the idea of resource-sharing games has been extended to different stochastic versions [6], [7]. Versions of the game with information asymmetry have been considered through the work of [8] in the context of Bayesian games, which considers the information design problem for resource-sharing with uncertainty. Similar Bayesian games have also been considered in [9], [10]. It should be noted that in general resource-sharing games, no conditions are placed on the reward functions of the resources. The special case where the reward functions are non-decreasing in the number of players selecting the resource is called a cost-sharing game [11]. These games are typically treated as games where a cost is minimized rather than a utility being maximized. In fair cost-sharing games, the cost of a resource is divided equally among the players selecting the resource. We consider a fair reward allocation model, where the reward of a resource is equally shared among the players selecting the resource. It should be noted that in this model, the players have opposite incentives compared to a fair cost-sharing model.
The work on resource-sharing games assumes that the players either cooperate or have the incentive to maximize a private or a social utility. It is interesting to consider a stochastic version of the game with asymmetric information between players who do not necessarily trust each other and who place no assumptions on the incentives of the opponents. In this context, the players have no signaling or external feedback and take actions based only on their personal knowledge of the reward realizations for a subset of the resource options. In this paper, we consider the above problem and limit our attention to the two-player singleton case, where each player can choose only one resource.
In the first part of the paper, we provide an iterative best response algorithm to find an ε-approximate Nash equilibrium of the system. In the second part, we solve the problem of maximizing the worst-case expected utility of the first player. We solve the problem in two cases. The first case is when both players do not know the realizations of the reward random variables of any of the resources, in which case an explicit solution can be constructed. This case yields a counter-intuitive solution that provides insight into the problem. One such insight is that, while it is always optimal to choose from a subset of resources with the highest average rewards, within that subset, one chooses the higher-valued rewards with lower probability. For the second case, we solve the general version of the problem by developing an algorithm that leverages online optimization techniques [12], [13] and the drift-plus-penalty method [14]. This algorithm generates a mixture of O(1/ε^2) pure strategies, which, when used in an equiprobable mixture, provides a utility within ε of optimality on average. Below, we summarize our major contributions.
• We consider the problem of a two-player singleton stochastic resource-sharing game with asymmetric information. We first provide an iterative best response algorithm to find an ε-approximate Nash equilibrium of the system. This equilibrium analysis uses potential game concepts.
• When the players do not trust each other and place no assumptions on the incentives of the opponent, we solve the problem of maximizing the worst-case expected utility of the first player using a novel algorithm that leverages techniques from online optimization and the drift-plus-penalty method. The algorithm developed can be used to solve the general unconstrained problem of finding the randomized decision α ∈ {1, 2, . . ., n} that maximizes E{h(x; Θ)}, where x and Θ are non-negative random vectors with finite second moments, and h(·; Θ) is a concave function such that h̄(x) = E{h(x; Θ)} is Lipschitz continuous, entry-wise non-decreasing, and has bounded subgradients.
• We show our algorithm uses a mixture of only O(1/ε^2) pure strategies using a detailed analysis of the sample path of the related virtual queues (our preliminary work on this algorithm used a mixture of O(1/ε^3) pure strategies). Virtual queues are also used for constrained online convex optimization in [13], but our problem structure is different and requires a different and more involved treatment.

A. Background on Resource-Sharing Games
The classical resource-sharing game defined in [5] is a tuple (M, N, T, r), where M is a set of m players, N is a set of n resources, T = T_1 × T_2 × ... × T_m, where T_j is the set of possible actions of player j (which is a subset of 2^N), and r = (r_1, r_2, ..., r_n), where r_i is the reward function of resource i, mapping the number of players selecting the resource to the reward it offers. Many game variations of the resource-sharing game have been studied [16]. Weighted resource-sharing games [17], games with player-dependent reward functions [18], and games with resources having preferences over players [19] are some of the extensions. Singleton games, where each player is allowed to choose only one resource, have also been explored explicitly in the literature [20], [21]. Some of the extensions of the classical resource-sharing game possess a pure Nash equilibrium in the singleton case. Two examples would be the games with player-specific reward functions for a resource [18] and the games with priorities where the resources have preferences over the players [19].
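In code, a singleton instance of such a game can be represented by giving each resource a reward function of the player count. The Python sketch below (with names of our own choosing) computes the utility profile for a joint action, using the fair-sharing rule r_k(c) = E_k/c studied in this paper:

```python
from collections import Counter

def utilities(actions, reward_fns):
    """Utility profile in a singleton resource-sharing game: a player
    choosing resource k receives r_k(c_k), where c_k is the number of
    players who chose k."""
    counts = Counter(actions)
    return [reward_fns[k](counts[k]) for k in actions]

# Fair reward sharing (the model studied in this paper): r_k(c) = E_k / c.
E = [4.0, 2.0, 1.0]
fair = [lambda c, e=e: e / c for e in E]
print(utilities([0, 0, 2], fair))  # → [2.0, 2.0, 1.0]
```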
Resource-sharing games have been extended to several stochastic versions. For instance, ref. [6] considers the selfish routing problem with risk-averse players in a network with stochastic delays. The work of [7] considers two scenarios where, in the first scenario, each player participates in the game with a certain probability, and in the second scenario, the reward functions are stochastic. The problem of information asymmetry in resource-sharing games has been addressed through the work of [22], [8], [9], [10]. The work of [22] considers a network congestion game where the players have different information sets regarding the edges of the network. Further, ref. [8] considers a scenario with a single random state θ, which determines the reward functions.
The realization of θ is known to a game manager who strategically provides recommendations (signaling) to the players to minimize the social cost.An information asymmetry arises among the players in this case due to the actions of the game manager during private signaling, where the game manager provides player-specific recommendations.
Our formulation differs from the literature on resource-sharing games since we consider a scenario that is difficult to analyze using the standard equilibrium-based approaches. This is because the players do not trust each other, place no assumptions on the incentives of the opponents, and take action in the absence of a signaling mechanism or external feedback, using just their knowledge of the reward random variables. This motivates our formulation as a one-shot problem tackled using worst-case expected utility maximization.

B. Notation
We use calligraphic letters to denote sets. Vectors and matrices are denoted by boldface characters. For integers n and m, we denote by [n : m] the inclusive set of integers between n and m. Given a vector w ∈ R^m, w_k is used to denote the k-th element of w; w_{k:l} for l ≥ k represents the (l − k + 1)-dimensional sub-vector (w_k, w_{k+1}, . . ., w_l)^⊤ of w; for a subset S of integers from 1 to n, {w_k ; k ∈ S} represents the sub-vector of w with indices in S. For z ∈ R^m, we use ‖z‖_2 and ‖z‖_∞ to denote the standard Euclidean norm (L2 norm) and the supremum norm, respectively, and ∂f(z) to denote a subgradient of f at z.

II. MATERIALS AND METHODS
The code used for the simulations is implemented in the Python programming language in the notebook https://rb.gy/wvt33.

III. FORMULATION

Denote X = {W_k ; k ∈ A}, Y = {W_k ; k ∈ B}, Z = {W_k ; k ∈ AB}, and V = {W_k ; k ∈ C}.
Recall that X is known only to player A, Y is known only to player B, and Z is known to both players. Let R_C(g_A, g_B) be the random variable representing the utility of player C ∈ {A, B}, given that player A uses strategy g_A and player B uses strategy g_B. General strategies for players A and B can be represented by the Borel-measurable functions α_A = g_A(X, Z, U_A) and α_B = g_B(Y, Z, U_B), whose outputs are the resources chosen by players A and B, respectively. Here, U_A and U_B are independent randomization variables uniformly distributed in [0, 1) and independent of W. A pure strategy for player A is a function g_A that does not depend on U_A, whereas a mixed strategy is a function g_A that depends on U_A. Hence, we drop the randomization variable when depicting a pure strategy. Pure strategies and mixed strategies for player B are defined similarly. Let S_A and S_B denote the sets of all possible strategies for players A and B, respectively.
It turns out that our analysis is simplified when Z is fixed. Fixing Z does not affect the symmetry between players A and B, since Z is observed by both players. Hereafter, we conduct the analysis by considering all quantities conditioned on Z. Define p_k^A and p_k^B, the conditional probabilities of players A and B choosing resource k given Z, together with the quantities q_k^A and q_k^B in (5) and (6), and collect them into vectors. Taking conditional expectations then uses the independence of W_k and Z when k ∉ AB.
Note that the utility achieved by player A given the strategies g_A and g_B can be written as in (8). Given the strategies g_A and g_B, we provide an expression for the expected utility of player A given Z, where the expectation is over the random variables X, Y, V, and the possibly random actions α_A and α_B. Taking expectations of (8) gives (9). Note that given Z, the random variables α_A and α_B are independent. Hence, we can split the last term of (9) as in (10).

IV. COMPUTING THE ε-APPROXIMATE NASH EQUILIBRIUM

This section focuses on finding an ε-approximate Nash equilibrium of the game. Fix ε > 0.
A strategy pair (g_A, g_B) is defined as an ε-approximate Nash equilibrium if neither player can improve its expected utility by more than ε by changing its strategy (while holding the strategy of the other player fixed).
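When the expected utilities are computable, this definition can be checked mechanically. The following Python sketch (the helper name and the two-resource example are ours, not the paper's) verifies the ε-approximate Nash condition by enumerating unilateral deviations:

```python
def is_eps_nash(u_a, u_b, g_a, g_b, devs_a, devs_b, eps):
    """Check the eps-approximate Nash condition: no unilateral deviation
    improves a player's expected utility by more than eps. u_a and u_b
    map a strategy pair to that player's expected utility."""
    gain_a = max(u_a(g, g_b) for g in devs_a) - u_a(g_a, g_b)
    gain_b = max(u_b(g_a, g) for g in devs_b) - u_b(g_a, g_b)
    return gain_a <= eps and gain_b <= eps

# Two resources with mean rewards E = (2, 1); the reward is halved on collision.
E = [2.0, 1.0]
u_a = lambda k, l: E[k] / 2 if k == l else E[k]
u_b = lambda k, l: E[l] / 2 if k == l else E[l]
print(is_eps_nash(u_a, u_b, 0, 1, [0, 1], [0, 1], eps=0.0))  # → True
```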
Combining (10) with (9), we have (11). Similarly, for player B, we have (12). First, we focus on finding the best response for players A and B, given that the other player's strategy is fixed.

Lemma 1:
The best responses for players A and B are given by α_A = arg max_{1≤k≤n} A_k and α_B = arg max_{1≤k≤n} B_k, where A_k and B_k are defined below.

Proof: We find the best response for A; the best response for B follows similarly. Notice that we can rearrange (11) accordingly; the resulting expectation is maximized when A chooses according to the given policy.
Next, we find a potential function for the game. A potential function is a function of the strategies of the players such that the change in the utility of a player when changing its strategy (while the strategies of the other players are held fixed) is equal to the change in the potential function [15].

Theorem 1:
The function H(g_A, g_B) given by (15) is a potential function for the game, where E_k and q_k^B for k ∈ B are defined in (5) and (6). Moreover, H(g_A, g_B) is bounded for all (g_A, g_B) ∈ S_A × S_B.

The key to the proof is separating (15) (using (11), (12)) as in (16). Consider updating the strategy of player A while holding the strategy of player B fixed. Notice that since ∑_{k∈B^c} E_k p_k^B + ∑_{k∈B} q_k^B is not affected in this process, from (16), we have that the change in the expected utility of player A is equal to the change of the H function. The same holds when player B updates its strategy while holding player A's strategy fixed. Hence, H is indeed a potential function.
To prove the result on the boundedness of H(g_A, g_B), notice that from the definition of H(g_A, g_B) we obtain the stated bound, where the last inequality follows since 1{α_A = k}, 1{α_B = k} ≤ 1 and the W_i are independent.

June 2023 DRAFT
Using Theorem 1 with standard potential game theory (see, for example, [36]), we have that the iterative best response algorithm, with the best response found in Lemma 1, converges to an ε-approximate Nash equilibrium in at most (2∑_{k=1}^n E_k)/ε iterations.
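For intuition, the iterative best response dynamic can be sketched in Python. This is a generic version over expected-utility oracles, not the exact Lemma 1 computation, and the names and the two-resource instance are our illustrative choices; for a potential game, the ε-stopping rule bounds the number of improvement steps as stated above:

```python
def iterative_best_response(u_a, u_b, strategies, g_a, g_b, eps, max_iter=1000):
    """Alternate exact best responses until neither player can gain more
    than eps (an eps-approximate Nash equilibrium). For a potential game,
    each accepted improvement raises the potential, which bounds the
    number of iterations."""
    for _ in range(max_iter):
        best_a = max(strategies, key=lambda g: u_a(g, g_b))
        best_b = max(strategies, key=lambda g: u_b(g_a, g))
        gain_a = u_a(best_a, g_b) - u_a(g_a, g_b)
        gain_b = u_b(g_a, best_b) - u_b(g_a, g_b)
        if gain_a <= eps and gain_b <= eps:
            break
        if gain_a >= gain_b:
            g_a = best_a
        else:
            g_b = best_b
    return g_a, g_b

# Two-resource fair-sharing example with means E = (4, 1), starting from (1, 1).
E = [4.0, 1.0]
u_a = lambda k, l: E[k] / 2 if k == l else E[k]
u_b = lambda k, l: E[l] / 2 if k == l else E[l]
print(iterative_best_response(u_a, u_b, [0, 1], 1, 1, eps=0.0))  # → (0, 0)
```

Note that for these means, sharing resource 1 (each player receiving 2 in expectation) beats deviating to resource 2 (receiving 1), so the dynamic settles on a collision.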

V. WORST-CASE EXPECTED UTILITY
Finding a Nash equilibrium using the above algorithm may not be desirable when the players do not trust each other and place no assumptions on the incentives of the opponent.To mitigate this issue, we consider maximizing the worst-case expected utility of player A. Similar to the case of finding the Nash equilibrium, the analysis is simplified when Z is fixed.
Notice that we can simplify (10), where Ω_k is defined in (21). Plugging the simplified expression into (9) yields (22). The difficulty in dealing with E{R_A(g_A, g_B)|Z} is that it depends on the strategy g_B of player B, which is not known to player A. Hence, given a strategy g_A of player A, we first focus on obtaining the worst-case strategy g_B of player B. Then we focus on finding the strategy g_A of player A that maximizes the resulting worst-case conditional expected utility. This way, we can guarantee a minimum expected utility for player A irrespective of player B's strategy.
Lemma 2: For a given g_A ∈ S_A, the strategy g_B ∈ S_B that minimizes E{R_A(g_A, g_B)|Z} chooses α_B = arg max_{1≤k≤n} Λ_k, where Λ_k is defined in (23) and Ω_k is defined in (21).
Proof: Notice that the only term of E{R_A(g_A, g_B)|Z} in (22) that depends on the strategy of player B is the last expectation. This expectation is maximized when player B chooses the k for which Λ_k is maximized.¹ Hence we have (24), where the Λ_k are defined in (23). We formulate a strategy for player A using the optimization problem (P1), where f : R^n → R is defined by (26). Although not used immediately, we derive certain properties of f in the following theorem, which are useful later.
Theorem 2: The function f defined in (26) 1) is concave, 2) is entry-wise non-decreasing, and 3) satisfies a Lipschitz-type inequality for any x, y ∈ R^n.

Proof: See Appendix A.

¹Ideally, player B may not have information about q_j^A and p_j^A. Hence, player B may not be able to utilize this exact strategy. Nevertheless, obtaining a better bound is impossible, since we do not have any assumptions or information about player B's strategy. For instance, if player B assumes that player A is using a particular strategy, and if player B's assumption turns out to be correct, then since player B knows the distributions of all W_j for 1 ≤ j ≤ n, player B's estimates of q_j^A and p_j^A are exact.
It is important to notice that the values of p_k^A for 1 ≤ k ≤ n and q_l^A for l ∈ A defined in (5) completely determine the optimal value of (P1). Hence, if we have a mechanism to find the set G_A ⊂ R^{n+a} of achievable pairs (q, p), we can solve the optimization problem

(P1.1): maximize f over (q, p) ∈ G_A

to find the optimal (p*, q*), after which we find an optimal strategy g* for player A that achieves it. It turns out that G_A is a convex set, as established by the following theorem. Hence, combining with Theorem 2, we have that (P1.1) is a convex optimization problem.
Theorem 3: G_A is a convex set. Further, in the special case a = 0, G_A is the n-dimensional probability simplex.

It turns out that finding G_A for the general scenario is difficult, although the case a = 0 is handled by Theorem 3. In fact, when a = b = d = 0, an explicit solution to (P1.1) can be obtained, which we describe in Section V-A. In Section V-B, we describe the solution to the general case. In Appendix H, we provide simpler alternative solutions to the special cases a = 0 (with no restriction on b) and a = 1 (with the additional assumption that W_1 has a continuous CDF).

A. The special case a = b = d = 0

In this case, (P1.1) reduces to the problem (P2) of maximizing over the n-dimensional probability simplex. For this section, we assume without loss of generality that E_k > 0 for all k. If at least one of the E_k were zero, we could transform (P2) into a lower-dimensional problem with non-zero E_k. The following lemma constructs an explicit solution, where the lowest index is chosen in the case of ties.

Lemma 3: The optimal solution for (P2) is given by p* as specified below.

Proof: See Appendix C.
It should be noted that this solution is not unique. For instance, consider the case when n = 2 and E_2 = 1. In this case, the lemma finds the solution (p_1, p_2) = (1, 0), but it should be noted that (p_1, p_2) = (1/3, 2/3) is also a solution. It is also interesting that the solution assigns positive probabilities to the r resources with the highest average rewards, although within these r resources, higher probabilities are assigned to the resources with lower rewards.
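The non-uniqueness can be checked numerically. As an illustration only, suppose the worst-case objective in the no-information case takes the form f(p) = Σ_k E_k p_k − (1/2) max_k E_k p_k (the worst-case opponent colliding where E_k p_k is largest); this functional form and the choice E_1 = 2 are our assumptions, since the paper's display is not reproduced here. Under these assumptions, both candidate solutions from the example attain the same value:

```python
def worst_case_utility(p, E):
    """Assumed worst-case objective for the no-information case: total
    expected reward minus half of the largest E_k * p_k (the worst-case
    opponent collides where player A places the most expected reward).
    This functional form is our assumption, not the paper's display."""
    products = [e * q for e, q in zip(E, p)]
    return sum(products) - 0.5 * max(products)

E = [2.0, 1.0]  # illustrative choice of E_1; the paper elides this value
v1 = worst_case_utility([1.0, 0.0], E)
v2 = worst_case_utility([1 / 3, 2 / 3], E)
assert abs(v1 - v2) < 1e-12  # both candidate solutions attain the same value
```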
It should also be noted that the worst-case strategy can be arbitrarily worse than the Nash equilibrium strategy. For instance, consider a simple scenario with two resources such that E_1 = E_2, where none of the players observes any of the reward realizations. In this case, one Nash equilibrium has player A always choosing resource 1 and player B always choosing resource 2; another has player B always choosing resource 1 and player A always choosing resource 2. In either case, player A's expected utility is E_1. However, notice that, from Lemma 3, the maximum worst-case expected utility of player A is strictly smaller than E_1. Hence, E_1 can be scaled to obtain an arbitrarily large deviation between the worst-case and the Nash equilibrium solutions.

B. Solving the general case
In this section, we focus on solving the most general version of (P1) (with no restrictions on the sets A, B, AB, C). In particular, we focus on finding a mixed strategy that optimizes the worst-case expected utility for player A. It turns out that our optimal solution chooses from a mixture of pure strategies parameterized by Q ∈ R^n, of the form given in (35). We name this special class of pure strategies threshold strategies. We develop a novel algorithm to solve this problem. Our algorithm leverages techniques from drift-plus-penalty theory [14] and online convex optimization [12], [13]. It should be noted that our algorithm runs offline and is used to construct an appropriate strategy for player A that approximately solves (P1) conditioned on the observed realization of Z. We show that we can get arbitrarily close to the optimal value of (P1) by using a finite equiprobable mixture of pure strategies of the above form. It should also be noted that the algorithm developed in this section can be used to solve the general unconstrained problem of finding the randomized decision α ∈ {1, 2, . . ., n} that maximizes E{h(x; Θ)}, where x and Θ are non-negative random vectors with finite second moments, and h(·; Θ) is a concave function such that h̄(x) = E{h(x; Θ)} is Lipschitz continuous, entry-wise non-decreasing, and has bounded subgradients.
We first provide an algorithm that generates a mixture of T pure strategies, after which we establish the closeness of the mixture to optimality. We generate the mixture {g^A_{Q(t)}}_{t=1}^T of T pure strategies by iteratively updating the vector Q for T iterations, where Q(t) and g_{Q(t)}(X) denote the state of Q and the pure strategy generated in the t-th iteration, respectively.
In addition to Q(t), we require another state vector γ(t) ∈ R n , which we also update in each iteration, and a parameter V which decides the convergence properties of the algorithm.We provide the specific details on setting V later in our analysis.We begin with Q(1) = γ(0) = 0.
In the t-th iteration (t ≥ 1), we independently sample X(t) and Ω(t) from the distributions of X and Ω, respectively, where Ω is defined in (21), while keeping Z fixed to its observed value. Then we update γ(t) and Q(t + 1) as follows. First, we solve (P3) to find γ(t), where arg max returns the lowest index in the case of ties. Notice that f_t is a concave function, which can be established by repeating the same argument used to establish the concavity of f in Theorem 2. Then we choose the action for the t-th iteration, α_A(t) = g^A_{Q(t)}(X(t)) (see (35)). Then, to update Q(t + 1), we use (39). The algorithm is summarized as Algorithm 1 for clarity.
After creating the mixture {g^A_{Q(t)}}_{t=1}^T of pure strategies, we choose one of them uniformly at random (each with probability 1/T) to make the decision. In the following two sections, we focus on solving (P3) and evaluating the performance of Algorithm 1.
Algorithm 1: Generating the mixture of T pure strategies
   Initialize Q(1) = γ(0) = 0.
   For t = 1, . . ., T:
      Sample X(t) and Ω(t).
      Choose γ(t) by solving (P3).
      Choose the action α_A(t) = g^A_{Q(t)}(X(t)).
      Obtain Q(t + 1) using (39).

1) Solving (P3): Notice that the objective of (P3) can be written as a sum of separable terms. Hence, (P3) seeks to minimize a separable convex function over the box constraint γ(t) ∈ K.
The solution vector γ(t) is found by separately minimizing each component γ_j(t) over [0, u_j]. The resulting solution is obtained by projecting the global minimizer of each one-dimensional function onto [0, u_j], where Π_{[0,u_j]} denotes the projection onto [0, u_j].
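This componentwise projection is a one-line operation in Python. The formula for each unconstrained minimizer (a function of Q(t), V, and the sampled Ω(t)) is not reproduced here, so the sketch takes it as an input; the function name is ours:

```python
import numpy as np

def solve_p3_componentwise(gamma_unconstrained, u):
    """Solve the separable box-constrained problem by projecting each
    component's unconstrained minimizer onto [0, u_j]. The unconstrained
    minimizers (which depend on Q(t), V, and Omega(t)) are passed in."""
    return np.clip(gamma_unconstrained, 0.0, u)

gamma = solve_p3_componentwise(np.array([-0.5, 0.3, 2.0]),
                               np.array([1.0, 1.0, 1.0]))
# gamma == [0.0, 0.3, 1.0]: each entry clipped into its box [0, u_j]
```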
2) How good is the mixed strategy generated by Algorithm 1: Without loss of generality, we assume that E_k > 0 for all 1 ≤ k ≤ n. The following theorem establishes the closeness of the expected utility generated by Algorithm 1 to the optimal value f_opt of (P1).
Theorem 4: Assume α is set such that α ≥ V^2, and that we use the mixed strategy g_A generated by Algorithm 1 to make the decision. Then the stated bound holds, where Ω is defined in (21) and f_opt is the optimal value of (P1). Hence, by fixing ε > 0 and choosing the parameters accordingly, the expected utility is within ε of f_opt.

We begin with the following two lemmas, which will be useful in the proof.

Lemma 4:
The drift is bounded above as, where D 1 is defined in (44).
Proof: See Appendix D.
The following is a well-known result regarding the minimization of strongly convex functions (See, for example [37]).
Lemma 5: Let G ⊂ R^n be a convex set with a non-empty interior, and let C ⊂ G. Let g : G → R be a convex function. Fix α > 0 and y ∈ C^0, and let x* = arg min_{x∈C} {g(x) + α D_ω(x‖y)}. Additionally, assume that x* ∈ C^0. Then the stated inequality holds for all z ∈ C. We have the following special cases.
• Using G = [0, ∞)^n and C = I, we have C^0 = I^0. In this case, if ω(x) = ∑_{j=1}^n x_j ln(x_j), we have that D_ω(x‖y) = D(x‖y), where D(x‖y) is the Kullback-Leibler divergence.

Now we move on to the main proof. Notice that the objective of (P3) can be written in the form below. Let g^{A,*} be the strategy that is optimal for (P1). Let us define x*(t) ∈ R^n as indicated, where U_A(t) for 1 ≤ t ≤ T is a collection of i.i.d. uniform-[0, 1) random variables. Notice that y* = E{x*(t)|Z} is independent of t and belongs to K. Hence y* is feasible for (P3). Notice that (54) holds, where (a) follows from Lemma 5 applied to the convex function h given by h(x) = g'_t(γ(t − 1))^⊤ x with G = R^n and C = K, since γ(t) is the solution to (P3) and y* is feasible for (P3). Also, step 5 in each iteration of Algorithm 1, finding the action, can be represented as a maximization over all possible actions α_A ∈ {1, 2, . . ., n} at time-slot t. Hence, comparing the scenario where g^A_{Q(t)} is used in the t-th iteration with the scenario where g^{A,*} is used with randomization variable U_A(t) in the t-th iteration, we have the inequality (56), where the last equality follows since x*(t) is independent of H(t) conditioned on Z. Summing (54) and (56), adding the drift to both sides, and using Lemma 4 yields (58), where the last inequality follows from the sub-gradient inequality for the concave function f_t. Now we introduce the following lemma.
Lemma 6: We have, Proof: See Appendix E.
Substituting the bound from Lemma 6 into (58), we obtain a per-iteration bound that holds for each t ∈ {1, 2, . . ., T}. Hence, we first take the expectation conditioned on Z of both sides, after which we sum from 1 to T, which results in (61). Notice that (62) holds, where the functions f and f_t are defined in (26) and (37), respectively. Also, we have (63), where (a) follows from the definition of f_t in (37), since γ(t − 1) is a function of H(t − 1) and Ω(t) is independent of H(t − 1) conditioned on Z. Substituting (62) and (63) into (61) gives a further bound, where (a) follows since y*, γ(T), γ(0) ∈ K, and the last inequality follows from Jensen's inequality applied to the concave function f (see the definition of D_3 in (44)). Since Q(1) = 0 and E{L(T + 1)|Z} ≥ 0, after some rearrangement the above translates to (65), where D_2 is defined in (44). Now we prove the following lemma.
Lemma 7: The following bound holds.

Proof: We first introduce the following two lemmas.

Lemma 8:
The queues Q_j(t) for 1 ≤ j ≤ n updated according to Algorithm 1 satisfy the stated bound.

Proof: See Appendix F.
The following lemma is vital in constructing the O(√α) bound on the queue sizes, which leads to the O(1/ε^2) solution. It should be noted that an easier bound can be obtained on the queue sizes, which leads to an O(1/ε^3) solution.
Proof: See Appendix G.

Now we move on to the main proof. Notice that the chain of inequalities (69) holds, where (a) follows from the entry-wise non-decreasing property of f (Theorem 2-2) and (b) follows from Theorem 2-3. Combining (69) and Lemma 8 with the bound on Q(T) given by Lemma 9, we are done with the proof of the lemma.
Combining Lemma 7 with (65), we are done with the proof of the theorem.

VI. SIMULATIONS
For the simulations, we use exponential random variables for the W_j. Notice that since we are conditioning on Z to solve the problem, the objective of (P1) defined in (26) is evaluated for the observed realization of Z. Notice that it is difficult to compare the worst-case strategy and the ε-approximate Nash equilibrium strategy in general, since the first can be computed without any cooperation between the players, whereas computing the second requires cooperation among the players. Further, as described in Section V-A, the worst-case strategy can be arbitrarily worse than the Nash equilibrium strategy. Nevertheless, comparing Figures 4-left and 5-top, it can be seen that the worst-case strategy and the strategy at the ε-approximate Nash equilibrium yield comparable expected utilities for player A when E_1 ≥ 2. For instance, in scenario 1, for E_1 ≥ 2, the approximate Nash equilibrium strategy coincides with the worst-case strategy of choosing resource 1 with probability 1. However, it should be noted that our algorithm for finding the ε-approximate Nash equilibrium does not necessarily converge to a socially optimal solution. For instance, in scenario 1, when E_1 = 2, having player A choose resource 1 with probability 1 and player B choose resource 2 with probability 1 gives a higher utility for player A without changing the utility of player B.
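Expected utilities of this kind can be reproduced by straightforward Monte Carlo. A minimal Python sketch (the function name and the two-resource instance are ours) using exponential rewards, as in our simulation setup:

```python
import random

def mc_expected_utility(means, p_a, p_b, trials=100000, seed=1):
    """Monte Carlo estimate of player A's expected utility when both
    players randomize independently with choice probabilities p_a, p_b
    and rewards are exponential with the given means."""
    rng = random.Random(seed)
    total = 0.0
    resources = range(len(means))
    for _ in range(trials):
        a = rng.choices(resources, weights=p_a)[0]
        b = rng.choices(resources, weights=p_b)[0]
        w = rng.expovariate(1.0 / means[a])
        total += w / 2 if a == b else w
    return total / trials

# Both players uniform over two resources with means (2, 1):
est = mc_expected_utility([2.0, 1.0], [0.5, 0.5], [0.5, 0.5])
```

For this instance the analytic value is ∑_k p_k^A E_k (1 − p_k^B/2) = 0.5·2·0.75 + 0.5·1·0.75 = 1.125, which the estimate should approach as the number of trials grows.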
In Figure 5, it is interesting to notice the variation in the choice probabilities of different resources with E_1. Notice that in scenario 1, the choice probability of resource 1 is non-decreasing for small E_1, non-increasing for E_1 ∈ [0.8, 1.9], and non-decreasing for E_1 ≥ 1.9. Similar behavior can also be observed for scenario 3. This is surprising, since intuition suggests that the probability of choosing a resource should increase with the increasing mean of its reward random variable. However, notice that in scenarios 1 and 3, player B does not observe the reward realization of resource 1. This might force player A, playing for the worst case, to believe that player B increases the probability of choosing resource 1 with increasing E_1, as a result of which player A chooses resource 1 with a lower probability. Notice that the probability of choosing resource 1 in scenario 3 does not grow as fast as in the other two. This is because player A observes W_1 and hence can refrain from choosing it when W_1 takes low values.

VII. CONCLUSIONS
We have implemented the iterative best response algorithm to find an ε-approximate Nash equilibrium of a two-player stochastic resource-sharing game with asymmetric information. To handle situations where the players do not trust each other and place no assumptions on the incentives of the opponent, we solved the problem of maximizing the worst-case expected utility of the first player using a novel algorithm that combines drift-plus-penalty theory and online optimization techniques. An explicit solution can be constructed when both players do not observe the realizations of any of the reward random variables. This special case leads to counter-intuitive insights.
In our approach, we have assumed that the reward random variables of different resources are independent. This assumption can be relaxed without affecting the analysis in the special case where neither player observes the realizations of any of the reward random variables. An interesting question is what happens in the general case when the reward random variables are not independent. While it is still possible to implement our algorithm in this setting, the algorithm is not guaranteed to converge to the optimal solution. Hence, finding an algorithm for this case that exploits the correlations between the reward random variables is a potential direction for future work.
Several other extensions can be considered as well. One would be a scenario with multiple players. The general multiplayer case yields a complex information structure, since the set of resources has to be split into 2^m subsets, where m is the number of players. Additionally, the idea of conditioning on the common information is difficult to adapt to this case.
Nevertheless, various simplified schemes could be considered. One example is the case with no common information. Here, the set of resources is split into m + 1 disjoint subsets, where the i-th subset (1 ≤ i ≤ m) contains the resources whose rewards are observed only by the i-th player, and the (m + 1)-th subset contains the resources whose rewards are observed by none of the players. Another interesting scenario is when no player observes any of the reward realizations. In both cases, the expected utility can be calculated following a procedure similar to the two-player case, but finding the worst-case expected utility is difficult. Hence, we believe both cases are potential directions for future work. Another extension would be implementing the algorithm within a repeated game structure in an online scenario.

APPENDIX A PROOF OF THEOREM 2
Notice that the term E{max{Ω_j x_j : 1 ≤ j ≤ n} | Z} of f is convex, since the max function is convex and expectation preserves convexity. Hence f is concave.
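The convexity claim can be sanity-checked numerically (an illustration, not a proof): on a fixed set of samples, the sample average of max_j Ω_j x_j is an average of pointwise maxima of linear functions, so the convex-combination inequality must hold exactly up to floating-point error. The exponential distribution for Ω below is an assumption made only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
Omega = rng.exponential(1.0, size=(100_000, 3))  # fixed samples of Omega

def g(x):
    # Sample average of max_j Omega_j * x_j; convex in x since it is an
    # average of pointwise maxima of linear functions of x.
    return np.max(Omega * np.asarray(x), axis=1).mean()

x = np.array([0.2, 0.5, 0.3])
y = np.array([0.6, 0.1, 0.3])
lam = 0.4
# Convexity: g(lam*x + (1-lam)*y) <= lam*g(x) + (1-lam)*g(y)
lhs = g(lam * x + (1 - lam) * y)
rhs = lam * g(x) + (1 - lam) * g(y)
```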

APPENDIX B
PROOF OF THEOREM 3

Define the vector-valued function P_A : R^(a+d) → R^n where, Notice that the function P_A completely determines the probabilistic decision of player A. We call P_A the strategy function of player A.
where I is the n-dimensional probability simplex defined in (32), then p * solves (P2).
Lemma 12: Let where arg max returns the lowest index in the case of ties. Let us also define µ ∈ R^n as Then we have the following.

Proof of Lemma 12:
1) Notice that by the definition of µ_k, it is enough to prove the result for 1 ≤ k ≤ r. Notice that we are required to prove that for all We consider two cases.
Case 1: r = 1. This case reduces to, which is trivial.
Case 2: r > 1. Note that from the definition of r in (83), we have After substituting S_{r−1} = S_r − 1/E_r and rearranging, we have the desired result.
2) Notice that
3) This follows from the definition of µ_k for 1 ≤ k ≤ r.
4) There is nothing to prove if r = n. Hence, we can assume r < n. Since µ_k = 0 for k > r, S_r. Notice that if we can prove the result for After substituting S_{r+1} = S_r + 1/E_{r+1} and rearranging, we have the desired result.

Now, we solve the problem using the above lemmas. Consider the problem defined in Lemma 11 with µ defined in Lemma 12. Specifically, consider the problem, where µ and r are defined in (83)-(84). For this choice of µ_k we have where the last equality follows from Lemma 12-2. Now, due to Lemma 12-3 and Lemma 12-4, the optimal solution for (P2-3) is any (p, γ) such that γ ∈ R and p ∈ I with p_k = 0 for k > r. In particular, consider the solution (p*, γ*) given by, where inequality (a) follows since γ(t) ∈ K and equality (b) follows from the fact that X(t) is independent of H(t) and Z.

APPENDIX E PROOF OF LEMMA 6
Notice that, where v is defined by, Ω̃(t) is given by, Ω̃_k(t) = Ω_k(t) 1{arg max_{1≤j≤n} {γ_j(t−1) Ω_j(t)} = k}, and arg max returns the least index in case of ties. Notice that where (a) follows from the Cauchy-Schwarz inequality, and (b) follows since v_k ≥ 0 and Ω̃_k(t) ≥ 0 for all 1 ≤ k ≤ n.
1) Notice that from the definition of Q_j(t + 1) in (39), for j ∈ A we have where the inequality follows from the definition of u_j in (101). The same argument can be repeated for j ∈ A^c.
2) Notice that if γ_j(t) = 0, then we have which follows since γ_j(t) is the projection of γ_j(t − 1) (see (42)). Hence, we have that, where (a) follows from the subgradients of f_t found in (38) and (b) follows from α ≥ V 2.
3) Notice that if we prove γ_j(t) = 0, we can use the same argument inductively to establish the result. Assume the contrary, that γ_j(t) ≠ 0. Then from part 2, we should have, which is a contradiction since γ_j(t) ≥ 0. Hence, we have the result. Now we use an inductive argument to prove the main result. Notice that the result is true for the initial value of t. We consider three cases.
where I is the simplex set defined in Theorem 3. Define I_0 = I ∩ R^n_+. We use the online mirror descent algorithm to solve this problem. We first formulate the above problem as an online convex optimization problem, where we obtain the solution by iteratively updating p over T iterations. In the t-th iteration, we sample Ω(t) from the distribution of Ω, where Ω is defined in (21). Let, where D(x ∥ y) denotes the Kullback-Leibler divergence between x and y, given by, for x ∈ I and y ∈ I_0, ∇f_t(x) ∈ R^n is the subgradient of f_t at x, given by

(∇f_t(x))_j = −E_j + (1/2) 1{arg max_{1≤k≤n} {x_k Ω_k(t)} = j} Ω_j(t),

for each 1 ≤ j ≤ n, and α is the constant step size. It is assumed that arg max returns the lowest index in the case of ties. It should be noted that p(t) can be found explicitly [38]. The solution is given by, for each k such that 1 ≤ k ≤ n. Note that p(t) ∈ I_0. This is useful when establishing the performance bound of the algorithm.
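The update loop above can be sketched in a few lines. This is a hedged illustration, not the paper's exact algorithm: it assumes, for concreteness, that Ω(t) has independent exponential entries with means E_j, and the step size and horizon are chosen arbitrarily. The multiplicative form p_k(t) ∝ p_k(t−1) exp(−α (∇f_t)_k) is the explicit solution of the KL-regularized mirror descent step mentioned in the text.

```python
import numpy as np

rng = np.random.default_rng(2)

def omd(E, alpha=0.05, T=500):
    """Online mirror descent (exponentiated gradient) on the simplex.

    Uses the subgradient (grad f_t(x))_j = -E_j + (1/2)*1{argmax_k x_k
    Omega_k(t) = j}*Omega_j(t). The distribution of Omega(t) is an
    illustrative assumption (exponential with means E).
    """
    n = len(E)
    p = np.full(n, 1.0 / n)            # start at the uniform point of I_0
    for _ in range(T):
        Omega = rng.exponential(E)     # sample Omega(t)
        j_star = np.argmax(p * Omega)  # np.argmax breaks ties at lowest index
        grad = -np.asarray(E, dtype=float)
        grad[j_star] += 0.5 * Omega[j_star]
        p = p * np.exp(-alpha * grad)  # multiplicative (mirror) step
        p /= p.sum()                   # explicit KL projection onto simplex
    return p

p = omd([2.0, 1.0, 0.5])  # illustrative means
```

Because each iterate is a normalized product of strictly positive factors, p(t) stays in the interior set I_0, matching the remark above.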
We summarize the above algorithm under Algorithm 2 for brevity.We have the following lemma regarding the performance of the above algorithm.
Lemma 16: We have, (Notice that the upper bound is finite since W_k has finite variance for each k ∈ B.)

B. Case a = 1
When a = 1, the set G_A is more complex than the simplex I. The next two lemmas compute, for each p ∈ I, the largest q ∈ R such that (q, p) ∈ G_A. For simplicity of exposition, this subsection assumes that W_1 has a continuous CDF F_{W_1} : R → [0, 1].
Lemma 17: If (q, p) ∈ G_A, then we have, Proof: See Appendix I.

Here, r_i is the reward function of resource i, and we use the notation N_0 = N ∪ {0}. Each player has complete knowledge of the tuple (M, N, T, r), but no knowledge of the actions chosen by the other players. For an action profile a = (a_1, a_2, ..., a_m) ∈ T, the count function # : N × T → N_0 is defined by #(i, a) = Σ_{k=1}^m 1{i ∈ a_k}. In other words, #(i, a) is the number of players choosing resource i under action profile a. We call the quantity r_i(#(i, a)) the per-player reward of resource i under action profile a. The utility u_j of player j is a function from T to R, where u_j(a) = Σ_{i=1}^n 1{i ∈ a_j} r_i(#(i, a)). In other words, u_j(a) is the sum of the per-player rewards of the resources chosen by player j under action profile a. Resource-sharing games fall under the general category of potential games [15]. Potential games are the class of games for which the change in reward of any player as a result of changing their strategy can be captured by the change in a global potential function.
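The count function and utility defined above translate directly into code. The sketch below uses hypothetical reward functions r_i (a base value split evenly among the players sharing the resource, mirroring the two-player sharing rule); the base values and action profile are illustrative.

```python
def count(i, profile):
    """#(i, a): number of players whose chosen resource set contains i."""
    return sum(1 for a_k in profile if i in a_k)

def utility(j, profile, r):
    """u_j(a): sum of per-player rewards r_i(#(i, a)) over resources chosen by j."""
    return sum(r[i](count(i, profile)) for i in profile[j])

# Illustrative example: 3 players, 2 resources. The assumed per-player
# reward of resource i splits its base value evenly among its users.
base = {1: 6.0, 2: 4.0}
r = {i: (lambda m, v=base[i]: v / m) for i in base}

a = ({1}, {1}, {2})    # players 1 and 2 share resource 1; player 3 takes 2
u1 = utility(0, a, r)  # resource 1 shared by 2 players: 6.0 / 2 = 3.0
u3 = utility(2, a, r)  # resource 2 used alone: 4.0
```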
A. Explicit solution for a = b = d = 0

When neither player knows any of the reward realizations, we have a = b = d = 0, and the problem reduces to the following. (P2): maximize Σ_{k=1}^n

Figures 1-3 show pictorial representations of these cases.

Fig. 4. Top: Case a = 0, b = 0, c = 3, d = 0. Middle: Case a = 0, b = 1, c = 2, d = 0. Bottom: Case a = b = c = 1, d = 0. Left: The expected utility of the players at the ε-approximate Nash equilibrium vs. E_1. Middle: One possible solution for the probabilities of choosing different resources at the ε-approximate Nash equilibrium for player A vs. E_1. Right: One possible solution for the probabilities of choosing different resources at the ε-approximate Nash equilibrium for player B vs. E_1.