Analytical Method for Mechanism Design in Partially Observable Markov Games

A theme that has become common knowledge in the literature is the difficulty of developing a mechanism that is compatible with individual incentives and simultaneously results in efficient decisions that maximize the total reward. In this paper, we suggest an analytical method for computing a mechanism design. The problem is explored within a framework in which the players follow an average utility in a non-cooperative Markov game with incomplete state information. All of the Nash equilibria are approximated in a sequential process. We describe a method for the derivation of the player's equilibrium that instruments the design of the mechanism, and we show the convergence and the rate of convergence of the proposed method. For computing the mechanism, we consider an extension of the Markov model in which a new variable is introduced that represents the product of the mechanism design and the joint strategy. We derive formulas to recover the variables of interest: the mechanisms, the strategies, and the distribution vectors. The mechanism design and the computation of the equilibrium strategies differ from those in the previous literature. A numerical example illustrates the usefulness and effectiveness of the proposed method.


Brief Review
Hurwicz [1] published his seminal work on mechanism design, which has emerged as a practical framework for tackling game-theory problems from an engineering viewpoint, considering players that interact rationally [2][3][4]; for a survey, see [5]. This theory is based on games with incomplete information for modeling mechanisms (implementing a social choice function) that are compatible with individual incentives and result in efficient decisions maximizing the total reward. The primary aim consists of establishing games with independent private values and quasilinear payoffs [6,7], in which players receive messages containing payoff-relevant information [8]. During the evolution of the game, players commit to a mechanism that presents a result as a function of the possibly untruthfully reported types. It should be pointed out that the mechanism is unknown. The mechanism designer determines a social choice function that maps the true type profile directly to the alternatives, whereas a mechanism maps the reported type profile to the alternatives. The main task in computational mechanism design is to find a mechanism that both maintains the original game-theoretic features and is computationally "efficient" and "feasible".
This approach makes it possible to manage the restrictions and to control the information of the players engaged in a game. From this perspective, Arrow [9] presented a framework for claim revelation that achieves efficiency and avoids the waste of resources on incentive payments. d'Aspremont and Gerard-Varet [10] suggested two separate methods to design a mechanism with incomplete information: the first, in which players' beliefs are not considered, and the second, in which they are. Saari [11] presented a mechanism design involving types of information. Rogerson [12] proposed a general approach to the hold-up problem, in which several players make relation-specific investments and then decide on some cooperative action, proving that first-best solutions exist under a variety of assumptions regarding the nature of the information asymmetries. Mailath and Postlewaite [13] established an approach for bargaining problems with asymmetric information while considering multiple agents. Miyakawa [14] provided a necessary and sufficient condition for the existence of a stationary perfect Bayesian equilibrium. Athey and Bagwell [15] and Hörner et al. [16] obtained relevant results on equilibria in repeated games with communication. Clempner and Poznyak [17] suggested a Bayesian partially observable Markov game model supported by an AI approach. Different approaches are presented in the literature; for instance, see [18][19][20].

Main Results
We contribute to this literature by proposing original outcomes: we present an analytical method for developing a mechanism that considers incomplete state information, with preferences that evolve following a Markov process, and we characterize approximate equilibrium behavior in game-theoretic models [17]. The foundation of the proposed method is the derivation of formulas for computing the mechanism µ; subsequently, given the mechanism, the equilibrium strategy is computed. The derivation of these formulas relies on a direct mechanism design. We propose an extension of the Markov model, suggesting a new variable z that represents the product of the mechanism µ and the joint strategy c. Additionally, the joint strategy c is defined by the product of the strategy π, the observer q, and the distribution vector P. We derive formulas to recover the variables of interest: the mechanism µ, the strategies π, and the distribution vectors P. We describe a method for the derivation of the player's equilibrium that instruments the design of the mechanism, and we also show the convergence of the proposed method.

Organization of the Paper
For ease of exposition, in the next section, we describe the Markov game model. In Section 3, we introduce the variables c and z and suggest the derivation of the formulas. The ergodicity condition expressed in z variables is proven in Section 4. The convergence to a Nash equilibrium is presented in Section 5. Section 6 concludes with some remarks.

Markov Games with Incomplete Information
Let us introduce a probability space (Ω, F, P), where Ω is a finite set of elementary events, F is the discrete σ-algebra of the subsets of Ω, and P is a given probability measure defined on F. Let us also consider the natural sequence t = 1, 2, ... as a time argument. Let S be a finite set of states {s_1, ..., s_N}, N ∈ N, called the state space. A stationary Markov chain [21,22] is a sequence of S-valued random variables s(t), t ∈ N, satisfying the Markov condition

P(s(t+1) = s_j | s(t) = s_i, s(t−1), ..., s(1)) = P(s(t+1) = s_j | s(t) = s_i) = p_{j|i}.   (1)

The random variables s(t) are defined on the sample space Ω and take values in S. The stochastic process {s(t), t ∈ N} is assumed to be a Markov chain. The Markov chain can be represented by a complete graph whose nodes are the states, where each edge (s_i, s_j) ∈ S² is labeled by the transition probability in Equation (1). The matrix P = (p_{j|i})_{(s_i, s_j) ∈ S²} ∈ [0,1]^{N×N} determines the evolution of the chain: for each n ∈ N, the power P^n has in each entry (s_i, s_j) the probability of going from state s_i to state s_j in exactly n steps.
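The n-step transition property of the matrix P can be illustrated with a short numerical sketch (the 3-state matrix below is a hypothetical example, not taken from the paper):

```python
import numpy as np

# Hypothetical 3-state chain: P[i, j] = p_{j|i}, each row sums to 1.
P = np.array([
    [0.5, 0.3, 0.2],
    [0.1, 0.6, 0.3],
    [0.2, 0.2, 0.6],
])

# The (i, j) entry of P^n is the probability of moving from s_i to s_j
# in exactly n steps.
P5 = np.linalg.matrix_power(P, 5)

# Rows of any power of a stochastic matrix still sum to 1.
assert np.allclose(P5.sum(axis=1), 1.0)
```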
Let MC = (S, A, {A(s)}_{s∈S}, K, P) be a Markov chain [21,22], where S is a finite set of states, S ⊂ N, and A is a finite set of actions. For each s ∈ S, A(s) ⊂ A is the nonempty set of admissible actions at state s. Without loss of generality, we may take A = ∪_{s∈S} A(s), and K = {(s, a) | s ∈ S, a ∈ A(s)} is the set of admissible state-action pairs. The variable p_{j|ik} is a stationary controlled transition matrix, where p_{j|ik} := P(X_{t+1} = s_j | X_t = s_i, A_t = a_k) for all t ∈ N represents the probability associated with the transition from state s_i to state s_j, i, j = 1, ..., N, under an action a_k ∈ A(s_i), k = 1, ..., K. The distribution vector is given by P(X_t = s_i) = P_i, such that P ∈ S_N, where S_N = {P ∈ R^N : ∑_{i=1}^N P_i = 1, P_i ≥ 0}. We consider the case where the process is not directly observable [23]. Let us associate with S the observation set Y, which takes values in a finite space {1, ..., M}, M ∈ N. The stochastic process {Y_t, t ∈ N} is called the observation process. By observing Y_t at time t, information regarding the true value of X_t is obtained. If X_t = s_i and A_t = a_k, an observation Y_t = y_m will have probability q_{m|ik} := P(Y_t = y_m | X_t = s_i, A_t = a_k), which denotes the relationship between the state and the observation when an action a_k ∈ A(s_i) is chosen at time t. The observation kernel is a stochastic kernel on Y, given by Q = [q_{m|ik}]. We restrict ourselves to the case Q = [q_{m|i}].
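The hidden-state/observation structure above can be simulated directly; the sketch below assumes the action-free observation kernel Q = [q_{m|i}] adopted at the end of the paragraph, with toy sizes and matrices of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(0)

N, M = 3, 2                      # states and observations (toy sizes)
P = np.array([[0.7, 0.2, 0.1],   # p_{j|i}: state transition matrix
              [0.1, 0.8, 0.1],
              [0.3, 0.3, 0.4]])
Q = np.array([[0.9, 0.1],        # q_{m|i}: observation kernel
              [0.2, 0.8],
              [0.5, 0.5]])

def simulate(T, s0=0):
    """Sample (hidden state, observation) pairs of the chain."""
    s, out = s0, []
    for _ in range(T):
        y = rng.choice(M, p=Q[s])      # emit observation from q_{m|i}
        out.append((s, y))
        s = rng.choice(N, p=P[s])      # hidden transition p_{j|i}
    return out

traj = simulate(10)
```

Only the observations (second components) would be available to a controller; the hidden states are kept here just to show the generative process.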
iv) Q_0 = [q_{m|i}], i = 1, ..., N, denotes the initial observation kernel; v) P is the (a priori) initial distribution; and vi) V_{ijmk} is the reward function at time t, given the state s_i and the observable state y_m, when the action a_k ∈ A(s_i, y_m) is taken.
A realization of the partially observable system at time t is given by the sequence (s_0, y_0, a_0, s_1, y_1, a_1, ...) ∈ Ω := (S Y A)^∞, where s_0 is distributed according to P(X_0 = s_0) and {A_t} is a control sequence in A determined by a control policy. To define a policy, we cannot use the (unobservable) states s_0, s_1, .... We therefore introduce the observable histories h_0 := (P_0, Y_0) ∈ H_0 and h_t := (s_0, y_0, a_0, ..., y_{t−1}, a_{t−1}, y_t) ∈ H_t for all t ≥ 1, with H_t := H_{t−1}(A Y) for t ≥ 1. A policy is then defined as a sequence π_{k|m}(t) such that, for each t, π_{k|m}(t) is a stochastic kernel on A given H_t. The set of all policies is denoted by Π. A policy π_{k|m}(t) ∈ Π and an initial distribution P(X_0 = s_0), also denoted by P_0, determine all possible realizations of the POMDP. A control strategy satisfies ∑_k π_{k|m}(t) = 1 and π_{k|m}(t) ≥ 0, m = 1, ..., M.
A game consists of a set N = {1, ..., n} of players (indexed by l = 1, ..., n). We employ the superscript l to emphasize the l-th player's variables, and −l subsumes all the other players' variables. The dynamics are described as follows. At time t = 0, the initial state s_0 has a given a priori distribution P^l_i, and the initial observation y_0 is generated according to the initial observation kernel Q^l_0(y_0 | s_0). If, at time t, the state of the system is X_t and the control A^l_t ∈ A^l is applied, then each player is allowed to randomize, with distribution π^l_{k|m}(t), over the pure action choices A^l_t ∈ A^l(X_t). These choices induce the immediate utilities V^l_{ijmk}, and each system tries to maximize the corresponding one-step utility. Next, the system moves to a new state X_{t+1} = s_j according to the transition probabilities P^l(π^l_{k|m}(t)), and the observation Y_t is generated by the observation kernel Q^l(Y_t | X_t). Based on the obtained utility, the players adapt their mixed strategies, computing π^l_{k|m}(t+1) for the next selection of control actions. For any stationary strategies π^l_{k|m}(t) = π^l_{k|m}, each player maximizes the individual payoff function U^l(π_{k|m}). The strategies π*_{k|m} satisfy the Nash equilibrium [24,25] if, for every player l and all admissible π^l_{k|m},

U^l(π^{l*}_{k|m}, π^{−l*}_{k|m}) ≥ U^l(π^l_{k|m}, π^{−l*}_{k|m}).
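For stationary strategies, one natural form of the one-step average utility (an assumption on our part, since the paper's display for U^l is not reproduced above) sums V_{ijmk} against the distribution, the observation kernel, the policy, and the transition law. A minimal sketch with random toy data:

```python
import numpy as np

rng = np.random.default_rng(1)
N, M, K = 3, 2, 2

Pdist = np.full(N, 1.0 / N)                  # state distribution P_i
q = rng.dirichlet(np.ones(M), size=N)        # q[i, m] = q_{m|i}
pi = rng.dirichlet(np.ones(K), size=M)       # pi[m, k] = π_{k|m}
p = rng.dirichlet(np.ones(N), size=(N, K))   # p[i, k, j] = p_{j|ik}
V = rng.normal(size=(N, N, M, K))            # V[i, j, m, k] = V_{ijmk}

# Assumed average utility:
#   U = Σ_{i,j,m,k} P_i · q_{m|i} · π_{k|m} · p_{j|ik} · V_{ijmk}
U = np.einsum('i,im,mk,ikj,ijmk->', Pdist, q, pi, p, V)
```

A best reply for one player then amounts to maximizing U over the simplex variables π_{k|m} while holding the other players' terms fixed.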

Main Relations
Following [21,26] and [27], let us introduce a matrix of elements c = [c_{imk}], where each entry is the product of the strategy, the observer, and the distribution vector:

c^l_{imk} = π^l_{k|im} q^l_{m|i} P^l_i.

Formally, a mechanism is any function µ_{k'|m} such that, given c^l_{imk}, it represents the nonlinear programming problem; defining µ^l_{k'|m} = µ_{k'|m} for all l = 1, ..., n, let us now introduce the z-variable as the product of the mechanism and the joint strategy:

z^l_{imkk'} = µ_{k'|m} c^l_{imk}.

We define the solution of problem (7) as z^{l*}. The next lemma clarifies how we may recover µ*_{k'|m} and c^{l*}_{imk}.
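The construction of c and z can be checked numerically. The sketch below builds c^l_{imk} = π^l_{k|im} q^l_{m|i} P^l_i and z^l_{imkk'} = µ_{k'|m} c^l_{imk} from random toy kernels (all matrices are placeholders, not the paper's data):

```python
import numpy as np

rng = np.random.default_rng(2)
N, M, K = 3, 2, 2

Pdist = rng.dirichlet(np.ones(N))             # P_i
q = rng.dirichlet(np.ones(M), size=N)         # q[i, m] = q_{m|i}
pi = rng.dirichlet(np.ones(K), size=(N, M))   # pi[i, m, k] = π_{k|im}
mu = rng.dirichlet(np.ones(K), size=M)        # mu[m, k'] = µ_{k'|m}

# c_{imk} = π_{k|im} · q_{m|i} · P_i   (the "joint strategy")
c = np.einsum('imk,im,i->imk', pi, q, Pdist)

# z_{imkk'} = µ_{k'|m} · c_{imk}   (mechanism times joint strategy)
z = np.einsum('mj,imk->imkj', mu, c)

# Both c and z live on the probability simplex.
assert np.isclose(c.sum(), 1.0)
assert np.isclose(z.sum(), 1.0)
```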

Lemma 1.
The variables µ*_{k'|m} and c^{l*}_{imk} can be recovered from z^{l*}_{imkk'} as follows:

c^{l*}_{imk} = ∑_{k'=1}^{K} z^{l*}_{imkk'},   µ*_{k'|m} = ∑_{i,k} z^{l*}_{imkk'} / ∑_{i,k,k'} z^{l*}_{imkk'}.

Proof. See Appendix A.
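The recovery step can be verified numerically. The formulas below (marginalize z over k' to get c; sum over i and k and normalize over k' to get µ) follow from z = µ·c together with ∑_{k'} µ_{k'|m} = 1; the test data are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(3)
N, M, K = 3, 2, 2

# Ground truth: a joint strategy c on the simplex and a mechanism µ.
c = rng.dirichlet(np.ones(N * M * K)).reshape(N, M, K)
mu = rng.dirichlet(np.ones(K), size=M)        # mu[m, k'] = µ_{k'|m}
z = np.einsum('mj,imk->imkj', mu, c)          # z_{imkk'} = µ_{k'|m} c_{imk}

# c_{imk} = Σ_{k'} z_{imkk'}   (since Σ_{k'} µ_{k'|m} = 1)
c_rec = z.sum(axis=3)

# µ_{k'|m} = Σ_{i,k} z_{imkk'} / Σ_{i,k,k'} z_{imkk'}
num = z.sum(axis=(0, 2))                      # shape (M, K'): Σ_{i,k} z
mu_rec = num / num.sum(axis=1, keepdims=True)

assert np.allclose(c_rec, c)
assert np.allclose(mu_rec, mu)
```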
Now, in order to derive π^{l*}_{k|m} and P̃^{l*}_m, we have the following.

Corollary 2. The strategy π^{l*}_{k|m}, constructed from π^{l*}_{k|im} in (11), and the distribution P̃^{l*}_m are given by

π^{l*}_{k|im} = c^{l*}_{imk} / ∑_{k=1}^{K} c^{l*}_{imk},   P̃^{l*}_m = ∑_{i,k} c^{l*}_{imk}.

Ergodicity Conditions Expressed in z Variables
We have derived the formulas that maximize Equation (7), based on the variables z^{l*}_{imkk'}, together with the formulas to recover the policy π^{l*}_{k|m}, the mechanism µ*_{k'|m}, and P̃^{l*}_m. Accordingly, we focus our attention on the ergodicity restrictions.

Theorem 1. The strategy π^{l*}_{k|m} and the mechanism µ^{l*}_{k'|m} are in Nash equilibrium, where every agent maximizes its expected utility, i.e., for every l = 1, ..., n,

U^l(µ^{l*}_{k'|m} π^{l*}_{k|m} q^{l*}_{m|i} P^{l*}_i) ≥ Ũ^l(µ^l_{k'|m} π^l_{k|m} q^l_{m|i} P^l_i),

if the quantities z^l_{imkk'} satisfy the following restrictions:

∑_{m,k,k'} z^l_{jmkk'} = ∑_{i,m,k,k'} p^l_{j|ik} z^l_{imkk'},  j = 1, ..., N,   z^l_{imkk'} ≥ 0,   ∑_{i,m,k,k'} z^l_{imkk'} = 1.

Proof. See Appendix B.
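The stationarity part of these restrictions can be checked on toy data. The sketch below assumes the restriction takes the marginal form ∑_{m,k,k'} z_{jmkk'} = ∑_{i,m,k,k'} p_{j|ik} z_{imkk'} (a reconstruction, since the theorem's display is not reproduced above), builds the closed-loop chain, and verifies the identity at its stationary distribution:

```python
import numpy as np

rng = np.random.default_rng(4)
N, M, K = 3, 2, 2

q = rng.dirichlet(np.ones(M), size=N)         # q_{m|i}
pi = rng.dirichlet(np.ones(K), size=(N, M))   # π_{k|im}
mu = rng.dirichlet(np.ones(K), size=M)        # µ_{k'|m}
p = rng.dirichlet(np.ones(N), size=(N, K))    # p[i, k, j] = p_{j|ik}

# Closed-loop transition matrix: P'_{ij} = Σ_{m,k} q_{m|i} π_{k|im} p_{j|ik}
Pcl = np.einsum('im,imk,ikj->ij', q, pi, p)

# Stationary distribution: left eigenvector of P' for eigenvalue 1.
w, v = np.linalg.eig(Pcl.T)
Pstat = np.real(v[:, np.argmax(np.real(w))])
Pstat = Pstat / Pstat.sum()

c = np.einsum('imk,im,i->imk', pi, q, Pstat)  # c_{imk}
z = np.einsum('mj,imk->imkj', mu, c)          # z_{imkk'}

# Ergodicity restriction: Σ_{m,k,k'} z_{jmkk'} = Σ_{i,m,k,k'} p_{j|ik} z_{imkk'}
lhs = z.sum(axis=(1, 2, 3))
rhs = np.einsum('ikj,imkl->j', p, z)
assert np.allclose(lhs, rhs)
```

Both sides reduce to the stationary distribution P_j, which is why the identity holds exactly at stationarity.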

Convergence Analysis
The Nash equilibrium is a game-theoretic concept that determines the solution of a non-cooperative game with several players, in which each player lacks any incentive to unilaterally change his/her own strategy. A practical notion for deriving Nash equilibria is a player's best reply: the strategy (or set of strategies) that maximizes (or minimizes) his/her payoff, taking the other players' strategies as given. A player therefore has not just one best-reply strategy, but a best-reply strategy for each arrangement of strategies of the other players. All of the Nash equilibria can be approximated in a (best-reply) sequential process. We want to compute the solution of problem (7), defined as z^{l*} = µ^{l*}_{k'|m} π^{l*}_{k|m} q^l_{m|i} P^{l*}_i, by considering the best-reply approach. For solving problem (7), let us consider a game whose strategies are denoted by x^l ∈ X^l, where X^l is a convex and compact set, x^l := col(z^l_{imkk'}) and X^l := Z^l_adm. Let x = (x^1, ..., x^n) ∈ X be the joint strategy of the players and x̂^l := (x^1, ..., x^{l−1}, x^{l+1}, ..., x^n) ∈ X̂^l be the strategy of the rest of the players adjoint to x^l ∈ X^l_adm. We consider a Nash equilibrium problem with n players and denote by x = (x^l, x̂^l) the vector representing the players' strategies, with X_adm = X^l_adm × X̂^l_adm. The method of Lagrange multipliers is an optimization approach for finding the local minimum (maximum) of a function subject to the equality constraints (A_eq) given in Equation (8). Let us consider the Lagrange function L(x, x̂(x), λ), where the Lagrange vector-multipliers λ ∈ Λ may have any sign, and consider the optimization problem L(x, x̂(x), λ) → min.
Gradient approximation step: the equilibrium point that satisfies Equations (17) and (18) is denoted by ṽ*. In addition, let us introduce the variables w̃ = (w̃_1, w̃_2) and ṽ = (ṽ_1, ṽ_2), and define the Lagrangian in terms of these variables as

L̃(w̃, ṽ) := L(w̃_1, ṽ_2) − L(ṽ_1, w̃_2).

For w̃_1 = x̃, w̃_2 = ỹ, ṽ_1 = ṽ*_1 = x̃*, and ṽ_2 = ṽ*_2 = ỹ*, the relations in Equations (17) and (18) can be represented by

ṽ* = arg min_{w̃ ∈ X̃×Ỹ} { (1/2)‖w̃ − ṽ*‖² + γ L̃(w̃, ṽ*) }.

We provide the convergence analysis of the sequence {v_n}_{n∈N} in the following theorem [28]: if the step-size parameter γ satisfies condition (20), then the sequence {v_n}_{n∈N} converges to a Nash equilibrium point v* ∈ V_adm.
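The two-half-step (prediction plus correction) structure behind Equations (17) and (18) can be illustrated on a toy problem. The sketch below is not the paper's constrained scheme: it strips the projections and constraints and applies the extragradient form of the method to the bilinear saddle function L(x, y) = x·y, whose unique saddle point is (0, 0); plain gradient descent-ascent cycles on this problem, while the prediction step makes the iteration contract:

```python
# Extragradient (extraproximal with Euclidean prox, no constraints)
# on L(x, y) = x * y; saddle point at (0, 0).
gamma = 0.5
x, y = 1.0, 1.0

for _ in range(200):
    # Prediction half-step at the current point.
    xb = x - gamma * y
    yb = y + gamma * x
    # Correction step: gradients evaluated at the predicted point.
    x, y = x - gamma * yb, y + gamma * xb

assert abs(x) < 1e-6 and abs(y) < 1e-6
```

For this bilinear case the iteration matrix has spectral radius sqrt(1 − γ² + γ⁴) < 1 for 0 < γ < 1, which is the toy analogue of the step-size condition (20).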
Proof. See Appendix C.

Political Numerical Example
The theory related to electoral competition originates in the seminal contributions of Hotelling [29] and Downs [30]. The proposed framework considers a majority-rule election, where political candidates compete for a position by simultaneously and independently proposing a platform from a unidimensional policy space. It is common knowledge that the equilibrium of this model is fundamentally determined by the candidates' incentives for running for such a position. This example considers a three-player game (l = 1, 2, 3) engaged in a political contest, in which the player with the highest performance wins. A question arises: what is the design of a mechanism to select a candidate? The goal of each candidate is to end up on top. During a campaign for a political position, candidates who are behind will talk not only about what a good choice they are for the position, but also about what a bad choice the front-runner is.
The assumption involved in this example considers the incomplete-information version of the game, in which candidates give the same relative weight to their preferred strategies versus their desire to win the position. This case is relevant from a theoretical point of view, and it is empirically important. The dynamics are modeled considering N = 4, M = 4, and K = 2, with transition matrices describing the evolution of the partially observed Markov game. The initial transition matrices are defined as follows: Fixing θ = 0.055 in the extraproximal method given in Equations (17) and (18), the Nash equilibrium results from computing the strategies and the mechanism design by applying Equations (10) and (13), which are given as follows:
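The example's concrete matrices and resulting strategies are not reproduced above; the sketch below only sets up placeholder stochastic matrices with the stated dimensions (N = 4 states, M = 4 observations, K = 2 actions, three players) to show the shapes the method operates on:

```python
import numpy as np

rng = np.random.default_rng(5)
N, M, K, n_players = 4, 4, 2, 3

# Placeholder controlled transition matrices p^l[i, k, j] = p_{j|ik}
p = {l: rng.dirichlet(np.ones(N), size=(N, K))
     for l in range(n_players)}
# Placeholder observation kernels q^l[i, m] = q_{m|i}
Q = {l: rng.dirichlet(np.ones(M), size=N)
     for l in range(n_players)}

# Stochasticity checks: rows over j (resp. m) must sum to 1.
for l in range(n_players):
    assert np.allclose(p[l].sum(axis=2), 1.0)
    assert np.allclose(Q[l].sum(axis=1), 1.0)
```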

Conclusions
This paper contributed to the literature on mechanism design for Markov games with incomplete state information (partially observable). We suggested an analytical method for the design of a mechanism. The main result of this work is based on the introduction of the new variable z, which makes the game problem computationally tractable and allows for obtaining the mechanism solution µ and the strategies π for all of the players in the game. The variable z allows the introduction of new, natural, additional linear restrictions for computing the Nash equilibrium of the game. An infeasible solution can be detected with a simple test on the variable z; i.e., it is possible to detect unusual conditions in the solver of the game given the information available for the simplex. A major advantage of introducing this variable relies on the fact that it can be efficiently implemented in real settings, which is consistent with the engineering approach of designing economic mechanisms or incentives toward desired objectives, where players act rationally. We applied these results to a numerical example related to political promotion.
In relation to future work, there are several challenges left to address. One interesting technical challenge is that of addressing extremum seeking in the context of mechanism design [31][32][33]. Another interesting challenge would be to consider the observer design approach in order to extend the mechanism design theory [23].

Appendix A
Proof. Since µ*_{k'|m} does not depend on the indices l, i, k, it may be obtained from Equations (8) and (9). Let us define z^{l*}_{αβγk'} accordingly. To verify that the definitions of µ*_{k'|m} and c^{l*}_{imk} in (10) are correct, we need to check the fulfillment of Equations (5) and (6), i.e., µ*_{k'|m} ∈ M_adm and c^{l*}_{imk} ∈ C^l_adm.
(a) As for the variables µ*_{k'|m}, these properties follow directly, since z^{l*}_{imkk'} ≥ 0; summing (A2) over k' directly leads to the property ∑_{k'=1}^{K} µ*_{k'|m} = 1. (b) To prove that c^{l*}_{imk} ∈ C^l_adm, as defined by (A1), notice the relation which leads to z^{l*}_{αβγk'} ∈ S; see Equation (9). The Lemma is proved.