The Convergence of a Cooperation Markov Decision Process System

In a general Markov decision process system, only one agent's learning evolution is considered. However, considering the learning evolution of a single agent is limiting for many problems, and more and more applications involve multiple agents, which may interact in either a cooperative or a game (competitive) setting. Therefore, this paper introduces a Cooperation Markov Decision Process (CMDP) system with two agents, which is suitable for the learning evolution of cooperative decisions between two agents. It is further shown that the value function in the CMDP system also converges, and that the convergence value is independent of the choice of the initial value function. This paper presents an algorithm for finding the optimal strategy pair (π*_0, π*_1) in the CMDP system, whose fundamental task is to find an optimal strategy pair and form the evolutionary system CMDP(π*_0, π*_1). Finally, an example is given to support the theoretical results.


Introduction
Artificial intelligence technology has become one of the most important technologies today. AlphaGo, autonomous driving, voice recognition, face recognition and other well-known technologies involve artificial intelligence, and the technology has been widely applied in economic construction, education, medical treatment, social life and other fields. Many industries have also begun to evolve toward artificial intelligence, and intelligence has become an important direction of industrial development. Machine learning generally includes supervised learning, unsupervised learning and reinforcement learning [1]. Among them, reinforcement learning is an important research field. A reinforcement learning agent learns by interacting with a dynamic environment through trial and error; the field is an important branch of machine learning and artificial intelligence. Its key idea is to shape the behavior of agents through rewards and punishments. Reinforcement learning is widely used in many fields, such as video games [2], board games [3], maze games, recommendation systems [4], and so forth.
The important basis of reinforcement learning [5] is the Markov Decision Process (MDP) system [6]. In the evolution of the computational model, the Markov Process (MP) system is the basic model; a reward function and executable behaviors are then introduced, together with a mapping from the state set to the behavior set, called a strategy, to solve the Markov decision problem. Under a given policy π, the Markov decision process system degenerates into a Markov process system with a reward function, MDP_π. Once the policy is given, the main task of the Markov process system with the reward function is to evolve and generate the expected reward value collected in each state. The main objective of the Markov decision system is to find an optimal strategy [7,8] π*, such that the expected reward value collected on the Markov chain is maximized under this strategy.
In the usual Markov decision process system, only one agent learns and evolves. However, in real problems the decision-making ability of a single agent is often far from enough, and many problems can be solved by using multiple agents [9-11]. Two kinds of problems then arise. In the first, the agents restrict each other while accomplishing their own target tasks in the same environment. Such systems are called antagonistic systems [12] or game systems [13]. In this kind of system, each agent executes its strategy alone and finishes its own target task by restricting, or playing games with, the other agents; the optimal strategy is represented as a vector of agent strategies, and the reward value reflects each agent's own execution. Littman [12] proposed the minimax-Q algorithm for multi-agent competitive reinforcement learning in 1994, based on the Markov decision framework and zero-sum games. In the second kind, the agents work together to accomplish one target task in the same environment. Such systems are called collaborative [14] or cooperative systems [15]. In this kind of system, each agent executes its strategy alone but cooperates with the others to complete one target task jointly; the optimal strategy is a vector of agent strategies, and the reward value is the comprehensive social value of the combined execution. Littman [16] also proposed the collaborative Team Q-learning algorithm in 2001. Multi-agent cooperative algorithms have been used to solve problems such as traffic signal coordination control [17], multi-agent path planning [18], multi-satellite collaborative task planning [19], and so forth. In the multi-agent case, the state transition is the result of all agents acting together, so the rewards for the agents also depend on the joint strategy. Iwata et al. [6] analyzed the statistical properties of multi-agent learning based on Markov decision processes.
However, the multiple agents in such work are independent: the joint action drives the state transition and reward of the whole system, but this only migrates the Markov decision process of a single agent to multiple agents. Zhang et al. [20] began to introduce a communication mechanism into the multi-agent system. Although agents then communicate with each other, memory consumption is enlarged, and agents still cannot predict the impact of their actions on the action execution of other agents. This paper considers only the Cooperation Markov Decision Process (CMDP) system of two agents, which is suitable for an evolutionary learning system of cooperative decisions between two agents. Two-agent games are to multi-agent reinforcement learning what perceptrons are to neural networks. In this kind of learning model, the agents alternately execute behaviors, seek the optimal criterion based on social value, seek optimal strategies (π*_0, π*_1), and jointly complete the target task. This paper introduces the cooperation Markov decision process system in the form of a definition: two agents (Alice and Bob) each perform actions on the basis of their strategies. That is, after Bob observes that Alice has performed an action, Bob decides which action to perform, and Bob's action will in turn affect Alice's next action. The system learns evolutionally through a given strategy pair, looks for a pair of optimal strategies (π*_0, π*_1), forms the evolution system CMDP(π*_0, π*_1), and an algorithm [21,22] is given to search for the optimal strategy pair.
In this paper, the convergence property of the value function of the MDP system with a single participating agent is given first; the convergence of the value function in the proposed cooperation Markov decision process system is then explored, and the correctness of the property is demonstrated from both the experimental and theoretical perspectives.

Markov Reward Process System
A finite Markov process system is constituted by a two-tuple ⟨S, P⟩, where S = {s_1, ..., s_n} is a finite set of states (containing n states) and P is a probability distribution over S: P(s, s') represents the probability of transitioning from state s to state s', and ∀s ∈ S: Σ_{s'∈S} P(s, s') = 1. A Markov Reward Process is a kind of Markov chain with values, consisting of a four-tuple ⟨S, P, R, γ⟩. It introduces a reward function R : S × S → R+ (R+ denotes the set of non-negative real numbers), where R(s, s') is the reward obtained by transitioning from state s to state s'; 0 < γ < 1 is the attenuation factor, also known as the discount factor. The expected reward value (value function) collected by the Markov process system is defined recursively:

V(s) = Σ_{s'∈S} P(s, s')[R(s, s') + γ·V(s')].    (1)

Equation (1) records the reward value obtained on a Markov chain X_0, X_1, X_2, ..., X_t, X_{t+1}, ... with state s as the initial state. The term R(s, s') + γ·V(s') indicates that, starting from state s, the reward obtained by a one-step transition to s' is added to the discounted expected reward collected from state s'. Equivalently, (1) can be written as the following iterative formula:

V_{k+1} = diag(P·R^T) + γ·P·V_k,    (2)

where R^T represents the transpose of the matrix R and diag(A) represents the column vector formed by the diagonal elements of the matrix A.
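As a concrete sketch, the iterative formula (2) can be implemented in a few lines of NumPy. The matrices P and R below are hypothetical illustrations, not the matrices used in the paper's experiment.

```python
import numpy as np

# Hypothetical 3-state Markov reward process (illustrative matrices only).
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])
R = np.array([[1.0, 0.0, 2.0],
              [0.5, 1.5, 0.0],
              [2.0, 0.5, 1.0]])
gamma = 0.9

# diag(P R^T)[i] = sum_j P[i, j] * R[i, j] is the expected one-step reward.
expected_reward = np.diag(P @ R.T)

# Iterative formula (2): V_{k+1} = diag(P R^T) + gamma * P V_k.
V = np.zeros(3)
for _ in range(500):
    V = expected_reward + gamma * P @ V

# At the fixed point, V solves the linear system (I - gamma P) V = diag(P R^T).
V_exact = np.linalg.solve(np.eye(3) - gamma * P, expected_reward)
print(np.allclose(V, V_exact))  # True
```

The direct linear solve confirms that the iteration has reached the fixed point of (2).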
Now we carry out Experiment 1: take the probability matrix P, the reward function R, and the discount factor γ as follows, and observe the convergence of the value function V through this example.
Initially, the initial vector is set as V_0 = [0, 0, 0, 0, 0]^T; the convergence of the value function is shown in Figure 1. According to Figure 1, we can observe that, starting from the initial state, the value function (V-function) eventually converges as the number of iteration steps grows. We further conjecture that the final convergence value has nothing to do with the initial value setting.

Theorem 1. The value function of the Markov reward process system converges.

Proof. From the iterative Formula (2) it is easy to obtain

V_{k+1} - V_k = γ·P·(V_k - V_{k-1}) = (γ·P)^k · (V_1 - V_0),    (3)

which expresses the difference of the value function between step (k + 1) and step k. Let P be the probability transition matrix, R the reward matrix, and let the discount factor satisfy 0 < γ < 1. For the i-th component of V_1 we have

R^i_min + γ·V_min ≤ V_1(s_i) ≤ R^i_max + γ·V_max,

where R^i_min and R^i_max record the minimum and maximum reward values in the i-th row of the reward matrix R, respectively, V_0(s) denotes the value at the initial moment, and V_min = min{V_0(s)} and V_max = max{V_0(s)} are the minimum and maximum entries of V_0. Reducing the left side of this inequality, enlarging the right side, and noting that P^k is again a probability matrix, we conclude

γ^k·(R*_min + γ·V_min - V_max) ≤ V_{k+1}(s) - V_k(s) ≤ γ^k·(R*_max + γ·V_max - V_min),

where R*_min and R*_max record the minimum and maximum values in the whole reward matrix R. Finally, because the discount factor satisfies 0 < γ < 1, γ^k approaches zero as k increases, so the influence of the bounded constants R*_min, R*_max, V_min and V_max is damped by the factor γ^k. Therefore V_{k+1} - V_k → 0 and the value function converges.

Theorem 2. The convergence value of the value function is independent of the choice of the initial value function.

Proof. Simplifying Equation (3), consider two different initial vectors V_0 and V'_0 with corresponding iterates V_k and V'_k. Because both sequences follow Formula (2),

V_k - V'_k = γ^k · P^k · (V_0 - V'_0).

Because P is a probability transition matrix, P^k is a probability transition matrix as well, so every entry of P^k·(V_0 - V'_0) lies between the minimum and maximum entries of V_0 - V'_0. Since the discount factor satisfies 0 < γ < 1, γ^k approaches zero as k increases, and therefore V_k - V'_k → 0. Hence the effects of the extremes V_min and V_max of the initial value function V_0 on the final value function V_k can be ignored, and the choice of the initial V_0 does not affect the convergence value of the final value function V_k. This proves that our conjecture is correct.
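A quick numerical check of this independence property: starting the iteration (2) from two very different initial vectors yields the same limit. The matrices below are hypothetical, chosen only for demonstration.

```python
import numpy as np

# Illustrative MRP (hypothetical matrices, for demonstration only).
P = np.array([[0.7, 0.3],
              [0.4, 0.6]])
R = np.array([[1.0, 2.0],
              [0.0, 3.0]])
gamma = 0.8
r = np.diag(P @ R.T)  # expected one-step reward, diag(P R^T)

def iterate(V0, k):
    V = V0.astype(float)
    for _ in range(k):
        V = r + gamma * P @ V
    return V

# Two very different initial value vectors.
V_a = iterate(np.array([0.0, 0.0]), 300)
V_b = iterate(np.array([100.0, -100.0]), 300)

# The residual difference is (gamma P)^k (V0_a - V0_b), which vanishes as k grows.
print(np.allclose(V_a, V_b))  # True
```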

Markov Decision Process System
The Markov decision process system introduces behaviors into the Markov reward process system and moves the reward function to the behavior execution step. The MDP system is a random process over state, reward and action sequences, and consists of a five-tuple ⟨S, A, P, R, γ⟩. A = {a_1, ..., a_m} is a finite set of actions; P : S × A × S → [0, 1], where P(s, a, s') represents the probability of transitioning from state s to state s' by performing action a. Similarly, γ is the attenuation factor, also called the discount factor, satisfying 0 < γ < 1. A strategy of the MDP system is a map π : S → A; for example, π(s) = a represents the execution of action a in state s. Letting V^π(s) be the value of s under a given policy π, we have

V^π(s) = Σ_{s'∈S} P(s, π(s), s')[R(s, π(s), s') + γ·V^π(s')].    (13)

Equation (13) is called the Bellman recursive equation. If the behavior in V^π is regarded as a variable, we can induce a state-action value function, that is, a Q-function Q : S × A → R,

Q^π(s, a) = Σ_{s'∈S} P(s, a, s')[R(s, a, s') + γ·V^π(s')].    (14)

Q^π(s, a) is understood as follows: starting from state s, executing action a and thereafter following strategy π, it is the expected reward value collected. The goal for any given Markov decision process is to find an optimal strategy, that is, the strategy that obtains the most reward.
The policy function π : S → A determines the V-function and Q-function that indicate the expected collection of reward values, and Q^π can be determined by V^π. Therefore, our goal is to find a strategy π that maximizes the function V^π. A V-function V* is optimal if, for any policy π and any state s ∈ S, V*(s) ≥ V^π(s). A policy π* is optimal if, for any policy π and any state s ∈ S, V^{π*}(s) ≥ V^π(s). A natural question is whether an optimal strategy π* exists; second, if it exists, how should we calculate π*?
According to Theorem 1 in Section 2, the value function of the Markov reward process system converges, and the value function of the Markov decision process system formed by introducing behaviors into the Markov reward process system must also converge. Theoretical results [23,24] show that an optimal strategy for a finite Markov decision process system ⟨S, A, P, R, γ⟩ exists, and the optimal V-function V* and the optimal strategy π* have the following relationship:

(i) If the optimal function V* has been obtained, the optimal strategy π* can be obtained as

π*(s) = argmax_{a∈A} Σ_{s'∈S} P(s, a, s')[R(s, a, s') + γ·V*(s')];    (15)

(ii) If the optimal policy π* has been obtained, the optimal function V* can be obtained from

V*(s) = Σ_{s'∈S} P(s, π*(s), s')[R(s, π*(s), s') + γ·V*(s')].    (16)

Equations (15) and (16) give a cross-iterative solution process, from which the optimal strategy π* and the optimal function V* are finally obtained.
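The cross-iteration of (15) and (16) is exactly policy iteration. A minimal sketch for a small hypothetical MDP (random illustrative transition and reward tensors, not from the paper) might look as follows.

```python
import numpy as np

# Hypothetical finite MDP: 3 states, 2 actions (matrices are illustrative).
nS, nA = 3, 2
rng = np.random.default_rng(0)
P = rng.random((nS, nA, nS)); P /= P.sum(axis=2, keepdims=True)  # P(s, a, s')
R = rng.random((nS, nA, nS))                                      # R(s, a, s')
gamma = 0.9

def evaluate(pi, iters=1000):
    """Bellman equation (13): V(s) = sum_s' P(s, pi(s), s')[R + gamma V(s')]."""
    V = np.zeros(nS)
    for _ in range(iters):
        V = np.array([P[s, pi[s]] @ (R[s, pi[s]] + gamma * V) for s in range(nS)])
    return V

def improve(V):
    """Greedy step (15): pi(s) = argmax_a sum_s' P(s, a, s')[R + gamma V(s')]."""
    Q = np.einsum('san,san->sa', P, R + gamma * V[None, None, :])
    return Q.argmax(axis=1)

pi = np.zeros(nS, dtype=int)
while True:
    V = evaluate(pi)
    new_pi = improve(V)
    if np.array_equal(new_pi, pi):
        break
    pi = new_pi
# pi is now greedy with respect to its own value function, i.e. optimal.
print(pi)
```

The loop alternates the evaluation step (16) and the improvement step (15) until the policy is stable.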
The execution process of the MDP system is shown in Figure 2.

Cooperation Markov Decision Process System
This section introduces a joint formal system in which two agents run in the same Markov decision process system, called the cooperation Markov decision process system. Suppose A = (a_{i,j})_{n×n} and B = (b_{i,j})_{m×m} are two matrices; the tensor product of A and B is defined as A ⊗ B = (a_{i,j}·B). If P_1 and P_2 are two probability matrices, not necessarily with the same number of rows, then P_1 ⊗ P_2 is also a probability matrix.
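This closure property is easy to check numerically: the Kronecker (tensor) product of two row-stochastic matrices of different sizes is again row-stochastic. The two matrices below are illustrative.

```python
import numpy as np

# Two row-stochastic (probability) matrices of different sizes.
P1 = np.array([[0.6, 0.4],
               [0.1, 0.9]])
P2 = np.array([[0.2, 0.5, 0.3],
               [0.7, 0.1, 0.2],
               [0.3, 0.3, 0.4]])

# Tensor (Kronecker) product A ⊗ B = (a_ij * B).
P = np.kron(P1, P2)

# Every row of the 6x6 product still sums to 1, so P1 ⊗ P2 is again
# a probability (row-stochastic) matrix.
print(P.shape, np.allclose(P.sum(axis=1), 1.0))
```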

Cooperation Markov Decision Process System with Two Agents
We now consider two agents, Agent_0 and Agent_1, in the same system, each performing its own actions and completing the target task in a cooperative manner: the two agents cooperate to accomplish the same goal. Formally, such a system is defined as the quintuple ⟨S, A, P_i, R_i, γ⟩ (i = 0, 1). Here S is a finite state set, representing the state pairs of the two agents; A is an action set, representing the action pairs of the two agents; P_i is the probability transfer function of Agent_i, i ∈ {0, 1}, defined as P_i : S × S × A × S × A → [0, 1]. For any (s_1, s_2, a) ∈ S × S × A it holds that Σ_{(s',a')∈S×A} P_i(s_1, s_2, a, s', a') = 1; in other words, for any fixed (s_1, s_2, a), P_i(s_1, s_2, a, ·, ·) specifies a probability distribution over S × A. Intuitively, P_0(s_1, s_2, a, s', a') expresses, when Agent_0 and Agent_1 are in the state pair (s_1, s_2), the probability that Agent_0 transitions to state s' after executing action a and the collaborator Agent_1 responds by executing action a'. R_i is the reward function of Agent_i, i ∈ {0, 1}, defined as R_i : S × S × A × S × A → R (the real numbers). Intuitively, R_0(s_1, s_2, a, s', a') expresses, when Agent_0 and Agent_1 are in the state pair (s_1, s_2), the reward value obtained when Agent_0 transitions to state s' after executing action a and the collaborator Agent_1 responds by executing action a'.
Obviously, the two agents Agent_0 and Agent_1 correspond to two V-functions V_i : S × S → R (i = 0, 1). In the joint form, we use the social value to describe the joint V-function V : S × S → R, formally defined as

V(s_1, s_2) = α·V_0(s_1, s_2) + β·V_1(s_1, s_2), with α + β = 1,    (17)

where the parameter α is called the balance parameter. The policy functions are defined as π_i : S × S → A (i = 0, 1); π_i(s, s') = a is understood as: when Agent_i is in state s and observes that Agent_{1−i} is in state s', Agent_i executes action a. For a given strategy pair (π_0, π_1), the evolutionary cooperation Markov decision process system CMDP(π_0, π_1) obtains a stable V-function whose iterative equation is

V_{k+1} = α·diag(P_1·R_1^T) + β·diag(P_2·R_2^T) + γ·(α·P_1 + β·P_2)·V_k,    (18)

where P_1, R_1 and P_2, R_2 are the transition and reward matrices over state pairs induced by the strategy pair (π_0, π_1) for Agent_0 and Agent_1, respectively.
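Under a fixed strategy pair, the social value iteration (18) can be sketched directly. Here P1, R1 and P2, R2 stand for the induced per-agent transition and reward matrices over state pairs; all concrete values are hypothetical random matrices, not taken from the paper.

```python
import numpy as np

# Hypothetical CMDP with 4 joint state pairs; P1, R1 belong to Agent_0's
# induced dynamics under the fixed strategy pair, P2, R2 to Agent_1's.
n = 4
rng = np.random.default_rng(1)
P1 = rng.random((n, n)); P1 /= P1.sum(axis=1, keepdims=True)
P2 = rng.random((n, n)); P2 /= P2.sum(axis=1, keepdims=True)
R1 = rng.random((n, n)); R2 = rng.random((n, n))

alpha, beta, gamma = 0.5, 0.5, 0.9   # balance parameters, alpha + beta = 1

A1 = np.diag(P1 @ R1.T)  # Agent_0's expected one-step reward per state pair
A2 = np.diag(P2 @ R2.T)  # Agent_1's expected one-step reward per state pair

# Iterative equation (18):
# V_{k+1} = alpha*A1 + beta*A2 + gamma*(alpha*P1 + beta*P2) @ V_k
V = np.zeros(n)
for _ in range(400):
    V = alpha * A1 + beta * A2 + gamma * (alpha * P1 + beta * P2) @ V

# Fixed-point check: V solves (I - gamma*(alpha*P1 + beta*P2)) V = alpha*A1 + beta*A2.
V_exact = np.linalg.solve(np.eye(n) - gamma * (alpha * P1 + beta * P2),
                          alpha * A1 + beta * A2)
print(np.allclose(V, V_exact))  # True
```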

Convergence of the Social Value Function of CMDP System
Since the value function eventually converges in the Markov decision process, we ask whether the value function has the same property in the CMDP system. We first observe through an experiment. There are 9 state pairs in the CMDP system. Figure 3 depicts the change curve of the social value function under the strategy pair (π_0, π_1) given by π_0 : s_0 ↦ a_0, s_1 ↦ a_1, s_2 ↦ a_1 and π_1 : s_0 ↦ a_1, s_1 ↦ a_0, s_2 ↦ a_1. As can be seen from Figure 3, in the cooperation Markov decision process system introduced in this paper, as the number of iterations increases, the social value of the system increases and finally converges to a stable value. In the CMDP system, agent Agent_0 has NS0 states and agent Agent_1 has NS1 states, where NS0 = NS1. The set of actions the two agents can perform is {1, 2, 3, ..., a}, and in each state each agent performs one action. Let s denote the state of Agent_0 and t the state of Agent_1. For a given strategy pair (π_0, π_1), let i = π_0(s) denote the action performed by Agent_0 in state s under the strategy π_0, and let j = π_1(t) denote the action performed by Agent_1 in state t under the strategy π_1, where i, j ∈ {1, 2, 3, ..., a}. The cooperation Markov decision process system with two participating agents performs iterative calculation according to Formula (18).

Theorem 3.
In a Cooperation Markov Decision Process system in which two agents participate, the two agents perform coordinated actions on the CMDP system, and the social value function converges.
Proof. According to the iterative evolution process in Equation (18), the calculation proceeds as follows:

V_{k+1} = α·A_1 + β·A_2 + γ·(α·P_1 + β·P_2)·V_k.    (19)

In Formula (19), A_1 and P_1 respectively represent the expected reward value vector and probability transition matrix obtained by Agent_0 under a given state and a known strategy pair (π_0, π_1); here A_1 = diag(P_1·R_1^T), i.e., for each state the transition probabilities multiplied by the rewards for performing the action, summed over successor states. A_2 and P_2 respectively represent the expected reward value vector and probability transition matrix obtained by Agent_1 under a given state and the known strategy pair (π_0, π_1), with A_2 = diag(P_2·R_2^T) defined analogously. A_1 and A_2 have NS0·NS1 rows and 1 column, and P_1 and P_2 are square matrices of dimension NS0·NS1.
Step 1: Calculate V_1. Given any initial value function V_0, Equation (19) gives

V_1 = α·A_1 + β·A_2 + γ·(α·P_1 + β·P_2)·V_0.

Step 2: Calculate V_2. From Equation (19),

V_2 = α·A_1 + β·A_2 + γ·(α·P_1 + β·P_2)·V_1.

Step 3: Continuing in this way and subtracting consecutive equations, the value function difference between step k and step k + 1 is

V_{k+1} − V_k = γ^k·(α·P_1 + β·P_2)^k·(V_1 − V_0).

P_1 and P_2 are two probability transition matrices and α + β = 1; thus α·P_1 + β·P_2 is a probability transition matrix, and the k-th power of a probability transition matrix is still a probability matrix, so (α·P_1 + β·P_2)^k is a probability transition matrix. Because the discount factor satisfies 0 < γ < 1, γ^k decreases monotonically as k increases, and when k is large enough, γ^k → 0. That is to say, V_{k+1} − V_k → 0: for any given strategy pair (π_0, π_1), the final value function V of the CMDP system introduced in this paper converges. The larger the discount factor, the slower the convergence speed; the smaller the discount factor, the faster the convergence speed.
Obviously, since P_1 and P_2 are two probability transition matrices and α + β = 1, (α·P_1 + β·P_2) is a probability transition matrix, and for every k ∈ N, (α·P_1 + β·P_2)^k is a probability transition matrix. Since γ is the discount factor and 0 < γ < 1, γ^k decreases monotonically as k increases, and lim_{k→∞} γ^k = 0. If the iteration is started from two different initial functions, the difference of the resulting value functions at step k is γ^k·(α·P_1 + β·P_2)^k applied to the difference of the initial functions, which tends to zero. Therefore, the choice of the initial function V_0 does not affect the convergence value of the final value function.
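The claim about convergence speed can be illustrated numerically: counting the iterations of the social value update until the change falls below a tolerance shows that a larger discount factor needs more steps. All matrices below are hypothetical.

```python
import numpy as np

# Illustrative check: the social value iteration contracts by a factor of
# gamma per step, so a larger discount factor converges more slowly.
n = 3
rng = np.random.default_rng(2)
P1 = rng.random((n, n)); P1 /= P1.sum(axis=1, keepdims=True)
P2 = rng.random((n, n)); P2 /= P2.sum(axis=1, keepdims=True)
# Combined expected one-step reward alpha*A1 + beta*A2 with alpha = beta = 0.5.
A = 0.5 * np.diag(P1 @ rng.random((n, n)).T) + 0.5 * np.diag(P2 @ rng.random((n, n)).T)
Pmix = 0.5 * P1 + 0.5 * P2  # alpha*P1 + beta*P2, itself row-stochastic

def steps_to_converge(gamma, tol=1e-8):
    V, k = np.zeros(n), 0
    while True:
        V_next = A + gamma * Pmix @ V
        k += 1
        if np.max(np.abs(V_next - V)) < tol:
            return k
        V = V_next

print(steps_to_converge(0.5) < steps_to_converge(0.95))  # True
```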

Algorithm for Optimal Strategy Pairs in Cooperation Type CMDP System
From the theoretical results in Section 4.2, it is known that under the strategy pair (π_0, π_1), the V-function generated by the iteration of Equation (18), recorded as V^{(π_0,π_1)}, converges to a stable value. Similarly, according to this stable value function V^{(π_0,π_1)}, the pair (π_0, π_1) can be improved to a new pair (π'_0, π'_1) by the following greedy method.
To obtain the algorithm for the optimal strategy pair (π*_0, π*_1) in the Cooperation Markov Decision Process system, we follow the model of the optimal-strategy computation for a single agent. Formula (18) is the iterative calculation formula of the "evolution module" of the algorithm, and Formula (29) is the strategy improvement method of the "strategy improvement module". Following the method of solving the optimal strategy of the single-agent Markov decision process system, this section presents the algorithm for solving the optimal strategy of the joint cooperation Markov decision process system; the optimal strategy pair is (π*_0, π*_1). The detailed code of Algorithm 1 can be found in Appendix A. The execution process of the algorithm is shown in Figure 4.
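Algorithm 1 itself is given in Appendix A. As a rough sketch only, under simplifying assumptions (joint state pairs flattened to single indices, per-agent transition tensors P[i, a, s, s'] over those indices, hypothetical random matrices), the evolution module and the strategy-improvement module can be combined like this:

```python
import numpy as np

# Minimal sketch of the optimal-strategy-pair search (a simplified stand-in
# for Algorithm 1 in Appendix A; all quantities are hypothetical).
n, nA = 3, 2
rng = np.random.default_rng(3)
P = rng.random((2, nA, n, n)); P /= P.sum(axis=3, keepdims=True)  # P[i, a, s, s']
R = rng.random((2, nA, n, n))                                      # R[i, a, s, s']
alpha, beta, gamma = 0.5, 0.5, 0.9

def evaluate(pi0, pi1, iters=500):
    """Evolution module: iterate the social value (Eq. 18) under a fixed pair."""
    V = np.zeros(n)
    for _ in range(iters):
        q0 = np.array([P[0, pi0[s], s] @ (R[0, pi0[s], s] + gamma * V) for s in range(n)])
        q1 = np.array([P[1, pi1[s], s] @ (R[1, pi1[s], s] + gamma * V) for s in range(n)])
        V = alpha * q0 + beta * q1
    return V

def improve(i, V):
    """Strategy-improvement module: greedy action per joint state for agent i."""
    Q = np.einsum('asn,asn->sa', P[i], R[i] + gamma * V[None, None, :])
    return Q.argmax(axis=1)

pi0 = np.zeros(n, dtype=int); pi1 = np.zeros(n, dtype=int)
for _ in range(50):
    V = evaluate(pi0, pi1)
    new0, new1 = improve(0, V), improve(1, V)
    if np.array_equal(new0, pi0) and np.array_equal(new1, pi1):
        break
    pi0, pi1 = new0, new1
print(pi0, pi1)  # the converged strategy pair
```

The loop alternates evaluation of the social value with a greedy improvement of each agent's strategy, mirroring the evolution/improvement structure described above.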

An Application Example of Cooperation Markov Decision Process System
Reinforcement learning is one of the most prominent topics in artificial intelligence after deep learning. Reinforcement learning agents interact with the environment and take actions that receive feedback, similar to learning treatment strategies in medicine; in fact, many applications of reinforcement learning in health care concern finding the best treatment strategy. Much educational software also uses reinforcement learning to provide customized learning materials and content in tutoring systems, to meet the specific needs of students. The recommendation system of e-commerce changes its recommendation strategy in real time according to the feedback of users (ignore, click, or buy).
The core problem of multi-agent system research is to establish an effective cooperation mechanism, so that agents with simple, independent functions can complete complex target tasks or solve complex problems through negotiation, coordination and cooperation. In a multi-agent system, each agent learns and improves its own strategy by interacting with the environment to obtain reward values; obtaining the optimal strategy in this environment is the process of multi-agent reinforcement learning. The transition of the multi-agent state depends on the actions of all agents, and the joint actions determine the transition of the whole system. The return of each agent is related not only to its own actions; the actions of other agents also affect the return of the current agent. When multiple agents are involved, there are three types of interaction between agents in the same environment: cooperative games, competitive games, and mixed games.
The joint multi-agent system proposed in this paper analyzes only the relationship between two agents: the two agents take interactive actions and cooperate to achieve the goal together. When agent B observes agent A take an action, agent B performs an action; at the same time, the next action of agent A also depends on agent B's action. This process of performing actions is repeated: the behaviors of the two agents influence each other, and the agents evolve and learn in the CMDP system under a given strategy pair to finally find a pair of optimal strategies.
Background: a couple finds a gold bar not far from home at the same time. The gold bar must be carried by two people together, one at each end. To carry the gold bar home, they need to avoid a trap so that each person reaches one side of the gold bar; after lifting the gold bar, they also have to bypass the obstacles in front of the home, and finally the gold bar is successfully transported home.
The background is transformed into the environment shown in Figure 5, and the simulation is carried out on this grid. The two people correspond to the two agents. In the figure, an agent can only move up, down, left, or right, and each position can accommodate only one agent. In this experiment, the agents, in the same environment and through mutual cooperation, carry the gold bar to the designated location. The whole process is divided into the following three steps. Step 1: Bypass the trap. When passing through the trap area, each position can accommodate only one agent, and the two agents cannot be in the same position at the same time, so they need to negotiate which agent passes first. When agent B observes that agent A is going to pass through the trap position, agent B cannot pass through the trap position.
Step 2: Reach the designated positions. The two agents must reach the two ends of the gold bar to pick it up, so they cannot arrive at the same place at the same time. When one agent reaches one end of the gold bar, the other must walk to the other end before the two agents can work together to pick it up.
Step 3: Deliver the gold bar. Because the gold bar has a certain length, the two agents are at different positions while transporting it to the designated location. During transportation, the two agents also need to consider how to bypass the obstacles while keeping track of each other's positions; they need to cooperate and interact to complete the task. Through continuous exploration and trial and error in the environment, the two agents will find the best way to move: they cooperate to pass the trap, reach the two ends of the gold bar, and then bypass the obstacles to successfully transport the gold bar home. In this example, each agent can only move up, down, left, or right to an adjacent position in the environment, and the sum of the probabilities of moving to all adjacent positions equals 1. The agent obtains a certain benefit by moving. To encourage avoiding traps and obstacles, the probabilities and benefits of moving near traps and obstacles are relatively small compared to other locations. The goal is that, through constant trial and error and learning in the environment, the two agents adopt an optimal way of moving so as to obtain the greatest benefit. We express the probabilities and reward values of moving to adjacent positions with square matrices, and the probability and reward of the joint action of the two agents are the tensor products of the two probability matrices and the two reward matrices, respectively. Using Algorithm 1 of this article, given an initial strategy, the system obtains a value function under this strategy; under this value function, the agents consider all joint actions and select the joint action that enables the system to obtain the highest reward value.
This selection is used to modify the initial strategy, and a new value function is then obtained under the modified strategy. Iterating this modification of strategy and value function, the system finally learns the strategy that obtains the maximum reward value, that is, the optimal strategy.
Conversely, if the two agents do not cooperate in any of the steps described above, it is difficult to transport the gold bar home successfully. First, there will be a conflict when they bypass the trap: both agents cannot simultaneously adopt their own optimal way of passing the trap position. Second, after passing through the trap, the two agents may end up on the same side of the gold bar. After that, the two agents may fail to move in the same direction, which will also cause the task to fail. So in such an instance, the interaction between the two agents becomes extremely important.

Conclusions
This paper first proves the convergence of the value function of the single-agent Markov decision process system. Furthermore, a cooperation Markov decision process system of two agents is introduced, which is suitable for an evolutionary learning system of cooperative (or antagonistic) decisions between two agents. The main difference between cooperation and confrontation lies in how the value function is defined and how the agents combine (or restrict each other). In this paper, only the cooperative type of CMDP system is considered. In this kind of learning model, the agents alternately execute behaviors, seek the optimal criterion with respect to social value, find the optimal strategy pair (π*_0, π*_1), and jointly complete the target task. This paper presents an algorithm for finding the optimal strategy pair (π*_0, π*_1), whose fundamental task is to find a pair of optimal strategies and form the evolutionary system CMDP(π*_0, π*_1). Finally, the convergence of the value function in the CMDP system is proved, and the convergence value is not affected by the initial value function. This paper introduces only the cooperation Markov decision process system involving two agents, but many real problems involve more agents; as the number of agents increases, the interaction between agents becomes more complex, and the difficulty of analysis increases exponentially. Deep learning has strong perceptual ability but lacks decision-making ability, while reinforcement learning has decision-making ability but is weak on perceptual problems; combining the two and complementing each other's advantages leads to the concept of deep reinforcement learning. In future work, we will use deep reinforcement learning methods to study the properties of the cooperation Markov decision process system.