A Markovian Mechanism of Proportional Resource Allocation in the Incentive Model as a Dynamic Stochastic Inverse Stackelberg Game

This paper considers resource allocation among producers (agents) in the case where the Principal knows nothing about their cost functions while the agents have Markovian awareness about his/her strategies. We use a dynamic setup of the stochastic inverse Stackelberg game as the model. We suggest an algorithm for solving this game based on Q-learning. The associated Bellman equations contain functions of one variable for the Principal and also for the agents. The new results are illustrated by numerical examples.


Introduction
Stackelberg games date back to the monograph [1]. The original setup includes two players, Leader (Principal) and Follower (Agent). The Leader makes the first move by choosing his/her strategy and informing the Follower of it. After that, the Follower seeks an optimal response by maximizing his/her payoff function. There are Stackelberg games, in which the Leader has a constant strategy, and inverse Stackelberg games, in which the Leader's strategy is a function of the Follower's actions (feedback control mechanism). Stackelberg games with several Leaders and/or Followers were considered later. Inverse Stackelberg games in the static and dynamic setups are discussed in the surveys [2,3].
Inverse Stackelberg games provide a mathematical formalization for the incentive problem. The Principal designs a feedback control mechanism stimulating the agents to choose actions that are most beneficial for him/her. In this paper, we suggest a method for solving such problems in the dynamic setup under incomplete information about the agents' behavior.
Below the problem is stated so that the associated Bellman equations contain functions of one variable for the Principal and also for the agents. In addition, the agents become independent players as soon as they receive the information from the Principal. The latter chooses his/her behavior based on the response of the players. The agents make their decisions by predicting the Principal's behavior. An elementary implementation of this principle is the Markov principle, i.e., at a time t the Principal uses the agents' response at the time (t − 1) to design his/her behavior. In turn, the agents observe the behavioral history of the Principal in order to predict his/her choice at the time (t + 1). The equilibrium calculation method for the Stackelberg game relies on online learning (more specifically, the Q-learning procedure) and recursive statistical estimation. The Q-learning procedure is mostly used for solving the statistical dynamic programming problem through the calculation of the Q-function. As a rule, the Q-function depends on two variables, the phase and control variables; so, the Q-function is defined on the Cartesian product of finite sets. The method converges slowly, and the rate of convergence essentially depends on the number of elements in the domain of the arguments. Therefore, the Stackelberg game in one of the two setups considered below is stated so that the resulting Q-function depends on a single argument. In the other setup, the slow Q-learning procedure is replaced by a faster maximization algorithm for a concave function of one variable. In both setups, the agents use recursive statistics. Thus, in comparison to the standard Q-learning procedure, the suggested algorithms are expected to guarantee a faster convergence to the equilibrium of this game.
The main contributions of this paper are as follows:
1. We have developed fast converging algorithms to calculate the solution of the dynamic inverse Stackelberg game without sufficient information about the agents.
2. The suggested algorithms can be considered as numerical methods for solving the corresponding static inverse Stackelberg game without sufficient information about the payoff functions of the agents.
3. This game-theoretic model has been applied for optimal resource allocation among producers in the case of insufficient information about their cost functions.
The remainder of this paper is organized as follows. In Section 2, we provide a survey of the existing publications on this subject. In Section 3, we discuss the static incentive problem as an inverse Stackelberg game and also describe its dynamic extension in two setups. In Section 4, we present the results of calculations using numerical examples for both setups of the game. Finally, in Section 5, we give some concluding remarks and outline directions for future research.

Related Work
Resource allocation mechanisms in the static setup are studied in contract theory [4] and also in control of organizational systems [5]. Here, the main concern of the investigators is to design strategy-proof mechanisms [6,7]. More specifically, by assumption, the Principal does not know the exact characteristics of the agents and the latter can use this fact for strategic manipulation (information distortion for their own benefit), see [8]. A possible way to eliminate manipulation was described in the book [5]; the author suggested a strategy-proof direct resource allocation mechanism. Such an approach relies on the hypothesis that the optimal control is obtained by solving the static inverse Stackelberg game in which the Principal knows the goals of all agents. The paper [9] considered the discrete-time incentive model with Markovian dynamics and a discounted payoff function on the infinite planning horizon. As demonstrated there, the approximate Stackelberg solution can be found by solving an optimal control problem with the difference between the controller's income and the executor's cost as the optimality criterion.
Importantly, previous research suggested no general methods for solving inverse Stackelberg games. Meanwhile, the paper [24] proved the theorem on the ε-optimal guaranteeing strategy in the static inverse Stackelberg game, which reduces the maximin problem with bound variables to nonlinear programming problems with independent variables; as a result, calculations were considerably simplified. In [25,26], this approach was extended to dynamic inverse Stackelberg games. The corresponding theorem actually reduces constrained maximin calculation over complex functional spaces to the calculation of multiple maximins over finite-dimensional spaces.
The authors of [27] analyzed a dynamic modification of the proportional resource allocation mechanism. Their suggested approach to the game-theoretic modeling of resource allocation in the hierarchical Principal-agent system possesses the following features.
(1) Resource dynamics are explicitly described as the phase variable depending on the Principal's control. The control function can be non-differentiable; in this case, the dynamic equation is interpreted in terms of the Lebesgue-Stieltjes integral.
(2) The Principal's control has smooth variations, which is formalized using the Lipschitz property of the control function. This assumption seems natural for the majority of real organizational and economic systems.
(3) The Principal allocates resources among the agents proportionally to their actions, which stimulates the latter to choose more intensive plans.
(4) This hypothesis is used to develop a genetic algorithm for calculating the Principal's optimal strategy with a non-uniform partition of the time interval.
Evolutionary modeling and genetic algorithms were described in the monographs [28,29]. The paper [30] presented a hybrid learning procedure for artificial neural networks. The authors of [31] proposed a genetic algorithm for solving the Germeier game with one Follower and a control function that satisfies the Lipschitz condition. Genetic algorithms for solving Stackelberg games were also considered in [32,33].
This paper is focused on the case in which the Principal has insufficient information about the goals of all agents while the latter do not know the Principal's strategy for the whole duration of the game, merely its history (without loss of generality, the awareness structure will be considered Markovian). The problem statement involves statistical estimation and reinforcement learning, including Q-learning [34,35]. Reinforcement learning was used for calculating Stackelberg equilibria in [36,37]. In particular, as noted in [36], this algorithm is unstable in the case of several Followers, especially if the latter calculate the Nash equilibrium for solving their problems. In this paper, we suggest a Stackelberg game-based model for which there exists a stable algorithm.

Static Setup and Dynamic Generalization
Consider a single Principal and M agents controlled by him/her. The static incentive model as an inverse Stackelberg game has the form

$$\psi\Big(\sum_{i=1}^{M} x_i\Big) - \sum_{i=1}^{M} \varphi_i(x_i, \gamma) \to \max_{\gamma}, \qquad \varphi_i(x_i, \gamma) - f_i(x_i) \to \max_{x_i}, \quad i = 1, \ldots, M.$$

Here $\psi(x)$ denotes the Principal's income, a concave increasing function such that $\psi(0) = 0$; $f_i(x)$ is the cost of agent $i$, a convex increasing function such that $f_i(0) = 0$; $\varphi_i(x, \gamma)$ gives the incentive of agent $i$, i.e., the compensation of his/her cost by the Principal; $i = 1, \ldots, M$; finally, $\gamma$ indicates the Principal's control strategy. The optimal incentive mechanism for this model is described in [5]. However, this solution may be impossible to calculate directly because the Principal has insufficient information about the cost functions $f_i(x)$. So, we will suggest a computational scheme based on solving a dynamic stochastic inverse Stackelberg game.
Assume the Principal stimulates the activity of all agents by allocating resources among them. Their activity is described by a vector sequence $x^M(t) = (x_i(t))_{i=1}^{M}$, $t = 0, 1, \ldots$ At each time $t$, the Principal chooses a control strategy $\gamma(t)$, and each agent $i$ obtains the corresponding part $\varphi_i(x_i(t-1), \gamma(t))$, $i = 1, \ldots, M$, of the resource. This resource allocation mechanism is selected because the Principal does not know the costs of the agents; while choosing a local control strategy $\gamma(t)$, he/she may rely only on the available history $x^M(0), \ldots, x^M(t-1)$. This paper considers the case of Markovian processes: at each time $t$, only the preceding value $x^M(t-1)$ is used. The Principal cannot inform the agents about his/her control strategies for the whole duration of the game and the agents cannot calculate these strategies. Therefore, the agents (like the Principal) should be guided by the information available at the time $t$, i.e., by the sequence $\gamma(1), \ldots, \gamma(t)$. Below we consider the two setups of the dynamic inverse Stackelberg game as a resource allocation model.

Dynamic Setup 1
The first model is as follows. The set of admissible influences on the agents (the set of all the Principal's scenarios) is finite. The agents suppose that $\gamma(1), \ldots, \gamma(t)$ is a segment of a Markov sequence defined on a stochastic basis with the filtration $\mathcal{F}_t = \sigma(\gamma(1), \ldots, \gamma(t))$, i.e., the minimal sigma-subalgebra generated by these values and supplemented with the events of zero probability. The agents solve their problems with discounted payoff functions, where $0 < \beta < 1$ is the discount factor and $E_y^{\gamma}(\cdot)$ denotes the conditional expectation operator with the probability measure induced by the sequence $\gamma$ given $\gamma(0) = y$.
Consider the problem of agent $i$. This problem is solved by dynamic programming; the associated Bellman functions satisfy the equations

$$V_i(x, y) = \varphi_i(x, y) + \max_{z} \Big\{ -f_i(z) + \beta \int V_i(z, u)\, q^{\gamma}(du/y) \Big\}, \quad (1)$$

where $x_i(0) = x$, $\gamma(0) = y$. The measure $q^{\gamma}(du/y)$ in the integrand is the transition kernel of the Markov sequence $\gamma$. Write the Bellman function in the form $V_i(x, y) = \varphi_i(x, y) + W_i(y)$. As a result, we obtain the following Fredholm integral equation of the second kind in the unknown function $W_i(y)$:

$$W_i(y) = \varphi_i(y) + \beta \int W_i(u)\, q^{\gamma}(du/y), \qquad \varphi_i(y) = \sup_{z} \Big\{ -f_i(z) + \beta \int \varphi_i(z, u)\, q^{\gamma}(du/y) \Big\}. \quad (2)$$

So, the existence of a unique solution for the Bellman equation is equivalent to the existence of a unique solution for the Fredholm integral equation of the second kind [38]. Consider the space $B(-\infty, \infty)$ of all bounded functions with the sup norm. If the function $\varphi_i(y)$ is bounded above, then Equation (2) has a unique solution. For the function $\varphi_i(y)$ to be bounded, a sufficient condition is that the functions $\varphi_i(z, u) - f_i(z)$ are bounded above for all $u$. Moreover, if the functions $f_i(x)$ are convex and $\varphi_i(z, u)$ are concave in the variable $z$ for any $u$, then there exists a unique solution of the problem

$$\max_{z} \Big\{ -f_i(z) + \beta \int \varphi_i(z, u)\, q^{\gamma}(du/y) \Big\}.$$

The optimal activity of the agents satisfies the equality

$$x_i^*(t) = \arg\max_{z} \Big\{ -f_i(z) + \beta \int \varphi_i(z, u)\, q^{\gamma}(du/\gamma(t)) \Big\}. \quad (3)$$

Of fundamental importance is that the optimal behavior of all agents in this model depends only on the Principal's local control strategy $\gamma(t)$ and in no way on the preceding values of the sequence. Therefore, the Principal's problem is to calculate

$$\max_{\gamma} \Big[ \psi\Big(\sum_{i=1}^{M} x_i^*(\gamma)\Big) - \sum_{i=1}^{M} \varphi_i(x_i^*(\gamma), \gamma) \Big],$$

where $x_i^*(\gamma)$ is the response (3) of agent $i$ to the control strategy $\gamma$. The Principal does not know the relationship (3), while the agents do not know the transition kernel of the Markov sequence $\gamma$. In our case, this kernel is defined by the transition probability matrix because the Markov chain has a finite set of admissible states. As mentioned earlier, the Principal and agents can observe the response of each other. So, the Principal learns by observing the response of the agents to his/her actions while the agents learn by observing the Principal's response to their actions. The problem becomes much easier in comparison with the general Q-learning procedure for solving the dynamic inverse Stackelberg games that were considered in [36]: the agents need to approximate the transition probability matrix, and the Principal needs to maximize a function of one variable. Statistical estimation of a transition probability matrix and maximization of a function of one variable are not as computationally intensive as the calculation of the Q-functions for the Principal and agents.

Let $\Gamma = \{\gamma_1, \ldots, \gamma_r\}$ and denote by $P$ the transition probability matrix. The maximum likelihood estimate $P_t$ of this matrix is constructed as follows. Consider a matrix sequence $G_t$ such that $G_t(i, j) = G_{t-1}(i, j) + I_{\{\gamma(t-1) = \gamma_i,\ \gamma(t) = \gamma_j\}}$, where $I_{\{S\}}$ means the indicator of a set $S$. The initial value is $G_0 = 0$. The maximum likelihood estimate $P_t$ has the form

$$P_t(i, j) = \frac{G_t(i, j)}{\sum_{k=1}^{r} G_t(i, k)}.$$

Then, for $\gamma(t) = \gamma_j$, the problem $\max_{z} \{ -f_i(z) + \beta \int \varphi_i(z, u)\, q^{\gamma}(du/y) \}$ can be written as the approximate problem

$$\max_{z} \Big\{ -f_i(z) + \beta \sum_{k=1}^{r} P_t(j, k)\, \varphi_i(z, \gamma_k) \Big\}. \quad (5)$$

If the function $F_t^i(z) = -f_i(z) + \beta \sum_{k} P_t(j, k)\, \varphi_i(z, \gamma_k)$ is concave and bounded above, then problem (5) has the unique solution $x_i^*(t) = \arg\max F_t^i(z)$. Then the Principal observes $R(\gamma_j)$, i.e., the response of the agents to the control strategy $\gamma_j$:

$$R(\gamma_j) = \psi\Big(\sum_{i=1}^{M} x_i^*(t)\Big) - \sum_{i=1}^{M} \varphi_i(x_i^*(t), \gamma_j).$$

After that, the Principal modifies the Q-function using the reinforcement learning algorithm and chooses the next control strategy for the agents based on the probability distribution $P^l_{t+1} = (p^l_{t+1}(\gamma_1), \ldots, p^l_{t+1}(\gamma_r))$. The probabilities can be calculated by the Boltzmann scheme, which is used in annealing [36]. This is a random search algorithm for maximization (minimization) of generally non-differentiable functions $f(x)$. The idea consists in comparing the current solution $x_t$ with a randomly chosen one $y_t$ in a small neighborhood of the former. Transition to the randomly chosen solution occurs if $f(y_t) > f(x_t)$ (respectively, $f(y_t) < f(x_t)$). If this condition fails, then the transition to $y_t$ is performed with the probability $p_t = \exp\big(-\frac{f(x_t) - f(y_t)}{T}\big)$.
Therefore, it is natural to choose the next value $\gamma$ with the probability

$$p^l_{t+1}(\gamma_j) = \frac{\exp\big(Q^l_{t+1}(\gamma_j)/T\big)}{\sum_{k=1}^{r} \exp\big(Q^l_{t+1}(\gamma_k)/T\big)}. \quad (7)$$

In formula (7) and also above, the parameter $T$ ("temperature") adjusts the degree of randomness of the control strategies. This parameter decreases from a given maximal value to a given minimal value, e.g., $T_{t+1} = \delta T_t$, $0 < \delta < 1$. The initial condition is $Q^l_0 \equiv 0$. The values $h_t$ regulating the amplitude of variations are supposed to satisfy the standard conditions

$$\sum_{t} h_t = \infty, \qquad \sum_{t} h_t^2 < \infty.$$

They guarantee the almost sure convergence to the optimal function $Q$ that is close to $R$ if each element of the set $\Gamma$ appears an infinite number of times in the course of learning.
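The Principal's learning step can be illustrated by a minimal sketch, assuming a softmax (Boltzmann) choice over a one-argument Q-function and the common stochastic-approximation update Q(γ_j) ← Q(γ_j) + h (R(γ_j) − Q(γ_j)). The update rule, the step size h = 1/(number of visits), the function names, and the default parameters are assumptions made for illustration; they are not taken from the paper.

```python
import math
import random


def boltzmann_probs(q_values, temperature):
    """Boltzmann (softmax) distribution over scenarios for a given temperature."""
    q_max = max(q_values)  # subtract the maximum for numerical stability
    weights = [math.exp((q - q_max) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def q_update(q_values, j, reward, step):
    """Move Q(gamma_j) toward the observed response by the fraction `step`."""
    q_values[j] += step * (reward - q_values[j])


def learn(reward_fn, n_actions, n_steps=2000, t_max=1.0, t_min=0.01,
          delta=0.995, seed=0):
    """Hypothetical driver: reward_fn(j) plays the role of the response R(gamma_j)."""
    rng = random.Random(seed)
    q = [0.0] * n_actions              # initial condition Q_0 = 0
    visits = [0] * n_actions
    temperature = t_max
    for _ in range(n_steps):
        probs = boltzmann_probs(q, temperature)
        j = rng.choices(range(n_actions), weights=probs)[0]
        visits[j] += 1
        # h = 1/n: the sum of h diverges, the sum of h^2 converges
        q_update(q, j, reward_fn(j), 1.0 / visits[j])
        temperature = max(t_min, delta * temperature)  # T_{t+1} = delta * T_t
    return q
```

With a deterministic response such as `learn(lambda j: [0.1, 0.4, 0.2][j], 3)`, the learned Q-values of the visited scenarios coincide with the responses, and the largest entry identifies the preferred scenario.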
Thus, the suggested algorithm consists of the following iterative operations:
1. calculation of the maximum likelihood estimate for the transition probability matrix;
2. calculation of the next value of the Q-function.
Once again, we underline an important advantage of the suggested algorithm: the Q-function depends on a single argument only. The first operation calculates the maximum likelihood estimate instead of the next value of a Q-function, which guarantees faster convergence than in counterparts where the Q-function depends on two arguments and the next value of Q is calculated at both the first and second stages. Also, note that the maximum likelihood estimates are consistent and asymptotically efficient, i.e., they use all available information and are asymptotically normal.
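The count-based maximum likelihood estimate of the transition probability matrix can be sketched as follows. The function names are hypothetical, exact fractions are used only to keep the output readable, and the uniform fallback for a state that has not been left yet is an assumption of this sketch rather than part of the method.

```python
from fractions import Fraction


def transition_counts(path, states):
    """Count observed transitions gamma(t-1) -> gamma(t) along a path."""
    index = {s: i for i, s in enumerate(states)}
    r = len(states)
    counts = [[0] * r for _ in range(r)]
    for prev, cur in zip(path, path[1:]):
        counts[index[prev]][index[cur]] += 1
    return counts


def mle_transition_matrix(counts):
    """Row-normalise the counts: P(i, j) = G(i, j) / sum_k G(i, k)."""
    matrix = []
    for row in counts:
        total = sum(row)
        if total == 0:
            # Assumption: a state never left so far gets a uniform row.
            matrix.append([Fraction(1, len(row))] * len(row))
        else:
            matrix.append([Fraction(c, total) for c in row])
    return matrix


if __name__ == "__main__":
    states = [0.4, 0.45, 0.5]
    path = [0.4, 0.45, 0.45, 0.5, 0.4, 0.45]
    for row in mle_transition_matrix(transition_counts(path, states)):
        print([str(p) for p in row])
```

In a recursive implementation only the count matrix needs to be stored: each new observation increments one entry, and the affected row is renormalised.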

Dynamic Setup 2
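In the second model of this game, the agents assume that the Principal makes "no sudden moves" in control choice (see Postulate 2 from [27]). So, they describe the Principal's behavior by

$$\gamma(t+1) = \gamma(t) + \varepsilon_{t+1}.$$

The sequence $\varepsilon_t$ consists of independent identically distributed random elements with the mass concentrated near the origin. For the second model, the problem of agent $i$ is

$$\max_{z} \big\{ -f_i(z) + \beta E \varphi_i(z, \gamma(t) + \varepsilon_{t+1}) \big\}.$$

Consider the second term and apply the Taylor expansion up to the second order inclusive, following the standard approach of stochastic calculus (e.g., the Ito formula). As a result, we obtain the optimization problem

$$\max_{z} \Big\{ -f_i(z) + \beta \Big[ \varphi_i(z, y) + \frac{\partial \varphi_i(z, y)}{\partial y}\, a + \frac{1}{2} \frac{\partial^2 \varphi_i(z, y)}{\partial y^2}\, b \Big] \Big\}, \quad y = \gamma(t), \quad (9)$$

provided that the second derivative exists; here $a$ and $b$ are the first and second moments of $\varepsilon_t$. If for all $y$ the goal functions of the agents are concave and bounded, then there exist unique solutions $z_i^*(y)$ for the agents' problems (9). The sample means for the moments of distribution of $\gamma$ are consistent estimates for the moments on the right-hand side of Formula (9). Therefore, the agents solve the problems

$$\max_{z} \Big\{ -f_i(z) + \beta \Big[ \varphi_i(z, \gamma(t)) + \frac{\partial \varphi_i(z, \gamma(t))}{\partial y}\, a_t + \frac{1}{2} \frac{\partial^2 \varphi_i(z, \gamma(t))}{\partial y^2}\, b_t \Big] \Big\},$$

where $a_t$ and $b_t$ are the sample first and second moments, while the Principal can use any derivative-free method for maximizing a concave function of one variable. If the function $R(x)$ is concave and bounded above, then the consistent sample moments and the convergent one-dimensional search procedure guarantee that this algorithm converges to the equilibrium of the Stackelberg game in probability.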
Thus, the learning procedure for the second model consists of the following iterative operations:
1. calculation of the estimates for the first and second moments of distribution;
2. calculation of the next approximation γ.
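The first operation can be illustrated by a recursive estimator of the first and second moments. This is a sketch under the assumption that the random elements are the increments γ(t) − γ(t − 1) of the Principal's control; the class name and its attributes are hypothetical.

```python
class IncrementMoments:
    """Running sample moments of the increments gamma(t) - gamma(t-1)."""

    def __init__(self):
        self.n = 0        # number of increments seen so far
        self.a = 0.0      # sample first moment
        self.b = 0.0      # sample second moment
        self.last = None  # previous value of gamma

    def observe(self, gamma):
        """Update both moments with one new observation of gamma."""
        if self.last is not None:
            eps = gamma - self.last
            self.n += 1
            # Recursive form of the sample mean: m_n = m_{n-1} + (x_n - m_{n-1}) / n
            self.a += (eps - self.a) / self.n
            self.b += (eps * eps - self.b) / self.n
        self.last = gamma


if __name__ == "__main__":
    est = IncrementMoments()
    for g in [1.0, 1.5, 1.0, 2.0]:
        est.observe(g)
    print(est.a, est.b)
```

The recursive form stores only the current moments and the last observed value, so each agent needs constant memory regardless of the length of the history.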
In comparison with the classical Q-learning [34,35], this approach does not calculate the Q-function and is based on estimating the first and second moments of distribution and maximizing a function of one variable, which is an obvious advantage. In comparison with the previous algorithm (see Section 3.2), it does not need the preliminary analysis of the problem to find the Principal's behavioral scenarios (the set Γ).
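One possible derivative-free choice for the Principal's one-dimensional maximization is golden-section search. The sketch below is an illustration only: the paper does not name a specific method, and the function name, the bracketing interval, and the tolerance are assumptions.

```python
import math


def golden_section_max(func, lo, hi, tol=1e-8):
    """Maximise a concave (unimodal) function on [lo, hi] without derivatives."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0   # 1 / golden ratio
    c = hi - inv_phi * (hi - lo)
    d = lo + inv_phi * (hi - lo)
    fc, fd = func(c), func(d)
    while hi - lo > tol:
        if fc < fd:
            # The maximiser lies in [c, hi]; reuse d as the new left probe.
            lo, c, fc = c, d, fd
            d = lo + inv_phi * (hi - lo)
            fd = func(d)
        else:
            # The maximiser lies in [lo, d]; reuse c as the new right probe.
            hi, d, fd = d, c, fc
            c = hi - inv_phi * (hi - lo)
            fc = func(c)
    return (lo + hi) / 2.0


if __name__ == "__main__":
    print(golden_section_max(lambda x: -(x - 0.3) ** 2, 0.0, 1.0))
```

Each iteration costs a single function evaluation, i.e., a single observed response of the agents, which matters when every evaluation corresponds to one period of the game.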

Examples and Numerical Calculations
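Consider an example in which $\psi(x) = \sqrt{x}$, $\varphi_i(x, y) = xy$, and $f_i(x) = \mu_i x^2$. For the first model, problem (5) takes the form

$$\max_{z} \Big\{ -\mu_i z^2 + \beta z \sum_{k} P_t(j, k)\, \gamma_k \Big\},$$

with the solution $x_i^*(t) = \frac{\beta}{2\mu_i} \sum_{k} P_t(j, k)\, \gamma_k$, and the response is $R(\gamma_j) = \sqrt{\sum_i x_i^*(t)} - \gamma_j \sum_i x_i^*(t)$. For the second model, problem (9) takes the form

$$\max_{z} \big\{ -\mu_i z^2 + \beta z (\gamma(t) + a_t) \big\},$$

with the solution $z_i^*(t) = \frac{\beta (\gamma(t) + a_t)}{2\mu_i}$, where $a_t$ is the sample first moment. The response $R(\gamma(t))$ is the same as above.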
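For this problem, the Stackelberg equilibrium satisfies the equalities

$$x_i^* = \frac{\beta \gamma^*}{2\mu_i}, \qquad \gamma^* = \Big( 8 \beta \sum_{i} \frac{1}{\mu_i} \Big)^{-1/3}.$$

For the numerical parameters chosen for the trial calculations, the equilibrium values are $\gamma^* = 0.4524029361$, $x_1^* = 0.2035813212$, $x_2^* = 0.1017906606$.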
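The equilibrium values can be cross-checked with a short script. It assumes that in a stationary regime each agent responds with x_i = βγ/(2µ_i) and that the Principal maximizes √(Σ x_i) − γ Σ x_i, which gives γ* = (16c)^(−1/3) with c = Σ β/(2µ_i). The parameter values β = 0.9, µ_1 = 1, µ_2 = 2 are hypothetical: they are not stated in the text and were chosen here because they reproduce the reported numbers; only the ratios β/(2µ_i) are determined by the reported equilibrium.

```python
def equilibrium(beta, mus):
    """Stationary equilibrium of the example psi = sqrt(x), phi = x*y, f_i = mu_i*x^2."""
    c = sum(beta / (2.0 * mu) for mu in mus)   # total slope of the agents' response
    # First-order condition of sqrt(c*g) - c*g**2 in g gives g**3 = 1 / (16*c).
    gamma = (16.0 * c) ** (-1.0 / 3.0)
    xs = [beta * gamma / (2.0 * mu) for mu in mus]
    return gamma, xs


if __name__ == "__main__":
    # Hypothetical parameters (assumption, see the lead-in).
    gamma_star, x_star = equilibrium(0.9, [1.0, 2.0])
    print(gamma_star, x_star)
```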
Consider two examples as follows.
In the first example, the set Γ = {0.4, 0.45, 0.5} contains a value close to the equilibrium in position 2. Our calculations were performed in Maple. The algorithm yielded the following results. The calculated values of the Q-function are Q = vector(0.02789054257, 0.4073228841, 0.09271767315). The maximal value of the Q-function is 0.4073228841 (position 2), which corresponds to the value γ = 0.45 from the set Γ. The calculated transition probability matrix has the form P = matrix([0, 3/4, 1/4], [1/91, 87/91, 3/91], [3/4, 1/4, 0]). The maximal element of this matrix, which is close to 1, stands at the intersection of row 2 and column 2. In other words, the most probable path of the chain is 0.45, 0.45, . . ., which also corresponds to the value γ = 0.45 from the set Γ. The calculated values x 1 = 0.2029945055, x 2 = 0.1014972528 are close to the equilibrium.
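The calculated actions can be reproduced from row 2 of the estimated matrix. The sketch assumes that each agent responds with slope × (expected next γ), which follows from φ_i(x, y) = xy and f_i(x) = µ_i x², and it takes the slopes β/(2µ_i) to be 0.45 and 0.225; these slopes are inferred here from the ratios x_i*/γ* of the reported equilibrium and are not stated explicitly in the text.

```python
from fractions import Fraction

# The scenario set and the estimated transition matrix reported in the text.
GAMMAS = [Fraction(2, 5), Fraction(9, 20), Fraction(1, 2)]
P = [
    [Fraction(0), Fraction(3, 4), Fraction(1, 4)],
    [Fraction(1, 91), Fraction(87, 91), Fraction(3, 91)],
    [Fraction(3, 4), Fraction(1, 4), Fraction(0)],
]

# Assumed slopes beta / (2 * mu_i), inferred from the reported equilibrium.
SLOPES = [Fraction(9, 20), Fraction(9, 40)]


def expected_next_gamma(row):
    """Conditional mean of the next control given the current row of P."""
    return sum(p * g for p, g in zip(row, GAMMAS))


def response(slope, row):
    """Agent's action: slope times the expected next control."""
    return slope * expected_next_gamma(row)


if __name__ == "__main__":
    current = 1  # gamma = 0.45 is the second element of the set
    for slope in SLOPES:
        print(float(response(slope, P[current])))
```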
The second model. The results yielded by the algorithm for the second model are presented in Tables 1 and 2. These tables use the same notation as before. The obtained results indicate fast convergence, and additional comments are unnecessary.
The second algorithm seems to be preferable if the admissible set Γ has no values close to the equilibrium. In our example, the second algorithm demonstrated a faster convergence to the equilibrium than the first. However, for the set Γ containing a value close to the equilibrium, the first algorithm yielded more accurate results. At the same time, the second algorithm was more accurate for the set Γ without such values.

Conclusions and Future Work
Proportional allocation is the most natural mechanism to distribute resources, which has been confirmed by the practice of controlling organizational systems. For this mechanism, the problem of strategy-proofness (protection against manipulation) comes to the forefront because the agents are interested in overstating their real resource demands, which are unknown to the Principal. The static proportional resource allocation mechanism was studied in control of organizational systems (see [5]); a modification of this mechanism that guarantees strategy-proofness was also designed there.
In this paper, we have suggested a dynamic proportional resource allocation mechanism based on learning. We have constructed two stochastic models (setups) of the dynamic inverse Stackelberg game, each guaranteeing the existence of a stationary equilibrium. The first model relied on the idea of a finite set of the Principal's behavioral scenarios while the second relied on the natural limits on the Principal's actions. Both models have been illustrated using numerical examples. Each model is associated with an algorithm to find the equilibrium in the inverse Stackelberg game.
The experimental results allow us to formulate the following hypothesis. The developed algorithms for solving the dynamic stochastic inverse Stackelberg game can also be used for solving the corresponding static inverse Stackelberg game with insufficient information about the cost functions of all agents. This hypothesis still needs deeper analysis, which will be the subject of future research.

Table 1. The results yielded by the algorithm.

Table 2. The results of numerical calculations.