Multi-Agent Planning under Uncertainty with Monte Carlo Q-Value Function

Decentralized partially observable Markov decision processes (Dec-POMDPs) are general multi-agent models for planning under uncertainty, but they are intractable to solve. The doubly exponential growth of the search space as the horizon increases makes brute-force search impossible. Heuristic methods can quickly guide the search in the right direction and have been successful in different domains. In this paper, we propose a new Q-value function representation, the Monte Carlo Q-value function QMC, which is proven to be an upper bound of the optimal Q-value function Q∗. We introduce two Monte Carlo tree search enhancements, heavy playout for the simulation policy and adaptive sampling, to speed up the computation of QMC. Then, we present the clustering and expansion with Monte Carlo algorithm (CEMC), an offline planning algorithm that uses QMC as its Q-value function and is based on generalized multi-agent A* with incremental clustering and expansion (GMAA*-ICE, or ICE). CEMC calculates Q-value functions as required, without computing and storing all Q-value functions, and uses an extended policy pruning strategy. Finally, we present empirical results demonstrating that CEMC outperforms the best heuristic algorithm with a compact Q-value representation in terms of runtime for the same horizon, and uses less memory for larger problems.


Introduction
Cooperative decision-making under uncertainty is a common problem in many scenarios, such as sensor networks and airport service robots. Consider a scenario where a robot soccer team cooperates to beat its opponent. A total of 11 robot players dash around a simulated field, kicking, passing, and shooting a ball to score a goal. Each of them can acquire only its local observations, and every action taken is uncertain, which means a pass may be unsuccessful or a shot may be biased. Another example is congestion control, an important task on the Internet. A router's buffer accommodates transient packets to guarantee a stable network link, and the distributed routers need to come up with a strategy to avoid congestion. These kinds of problems can be modeled as a decentralized partially observable Markov decision process (Dec-POMDP), which provides a general multi-agent model for decision-making under uncertainty. However, Dec-POMDPs are provably intractable (NEXP-complete, where NEXP is nondeterministic exponential time; it is known that P ⊆ NP ⊆ PSPACE ⊆ EXP ⊆ NEXP) [1] and difficult to solve, with the number of joint policies growing explosively [2]. One of the primary reasons is that the search space of Dec-POMDPs grows exponentially with the number of actions and observations, and doubly exponentially with the horizon of the problem [3]. The optimal policy can be extracted from the optimal Q-value function Q * for Dec-POMDPs, but solving Q * is also proven to be NEXP-complete [4].
In previous work [5], Monte Carlo tree search (MCTS) [6] was introduced into the MMDP (multi-agent MDP) model, a simplified model of Dec-POMDPs, to solve ad hoc agent-team decision-making. MCTS is a sample-based tree search algorithm that uses Monte Carlo simulations as state evaluations and has been successful in computer Go [7]. Given sufficient time, MCTS converges to the optimal value if exploration (finding new nodes that have not been sampled) and exploitation (revisiting promising nodes that have already been sampled) are traded off appropriately, and it can be stopped at any time to obtain an approximate result.
To tackle the doubly exponential growth problem, we propose an offline algorithm called clustering and expansion with Monte Carlo (CEMC), based on GMAA*-ICE, a version of generalized multi-agent A* (GMAA*) with incremental clustering and expansion, which is the best heuristic method in the literature. MCTS is integrated into the CEMC framework to quickly obtain an approximate Q-value function. MCTS is an anytime algorithm, and its efficiency depends on the number of samples and the accuracy of the Q-value function evaluation. Thus, we propose an adaptive sampling method based on the horizon and the number of joint actions and observations. The best upper bound found by the MCTS algorithm is used to prune nodes that cannot be optimal in any case. The term MCTS in this paper denotes the specific algorithm that uses UCB1 (a version of the upper confidence bound) as the tree selection policy, as proposed by Kocsis and Szepesvári [6,8].
Our key contributions are twofold: we present a new Q-value function representation, the Monte Carlo Q-value function Q MC , and an offline planning algorithm, CEMC. Heavy playout for the simulation policy and adaptive sampling are introduced to speed up the computation of Q MC . CEMC calculates Q-value functions as required, without computing and storing all Q-value functions, and it uses an extended pruning strategy for the policy pool. We present empirical results demonstrating that CEMC outperforms the best heuristic algorithm with a compact Q-value representation in terms of runtime for the same horizon, and uses less memory for larger problems.
The remainder of this paper is structured as follows. We first introduce related work in Section 2. We then outline background knowledge on Dec-POMDPs, the ICE algorithm, approximate Q-value functions, MCTS, and collaborative Bayesian games in Section 3. Section 4 gives the Q-value function representation, proves that Q MC is an upper bound of the optimal Q-value function, and proposes the adaptive sampling and heavy rollout techniques. Section 5 presents the core algorithm, CEMC, and a pruning strategy. We empirically demonstrate the efficiency and effectiveness of CEMC in Section 6 and discuss the results. We conclude and outline future research in Section 7.

Related Work
The Dec-POMDP is an extension of the POMDP model, and solving it has been proven to be NEXP-complete [1]. Brute-force search becomes intractable when the number of agents or the problem size grows. In recent years, much work has focused on quickly computing an optimal policy for larger horizons, such as dynamic programming [9-12], heuristic search [2-4,13-15], and mixed integer linear programming [16].
The GMAA* framework [4] is one of the state-of-the-art exact algorithms in the heuristic search literature; it is based on multi-agent A* (MAA*) [2] and uses an approximate heuristic function to guide the search over the whole search space. GMAA*-ICE (or ICE) [3] is a heuristically guided algorithm based on the GMAA* framework [4] and has demonstrated impressive performance over GMAA*. Two successful components used in ICE are lossless clustering [13] and incremental expansion [14]. Recently, a Bayesian learning method [17] and approximate inference and heuristics [18] have been proposed for solving Dec-POMDPs.
Silver and Veness [19] proposed a Monte Carlo algorithm, POMCP, for online planning in large POMDPs. Unlike our approach, POMCP combines a Monte Carlo update of the agent's belief state with a Monte Carlo tree search from the current belief state. Importantly, POMCP is suited to single-agent, partially observable, unfactored Markov decision processes (MDPs). Amato and Oliehoek [20] proposed a scalable sample-based planning approach, factored-value partially observable Monte Carlo planning (FV-POMCP), to combat the intractability caused by an increasing number of agents. However, FV-POMCP requires a decomposition of the value function into a set of overlapping factors.

Background
Consider a scenario where a number of agents inhabit a particular environment modeled as a Dec-POMDP, in which the agents' perception is uncertain and they must work together to achieve their goal. At each discrete time step t = 0, 1, . . ., h − 1, every agent takes an action, and the combination of these actions influences the world, transforming it into the next state. The agents try to come up with a plan that maximizes their expected long-term reward. The goal or task needs to be completed cooperatively, because no single agent is competent enough on its own. Therefore, the computed plan should be decentralized.
In this section, we briefly review the background of the Dec-POMDP model, ICE, approximate Q-value functions, MCTS, and collaborative Bayesian games. We refer the reader to previous work [2-4,21,22] for detailed background.

Decentralized Partially Observable Markov Decision Processes (Dec-POMDPs) Model
A Dec-POMDP generalizes the POMDP to a multi-agent setting in which rewards are common and based on joint actions, but observations are individual [23,24]. The detailed definition is as follows.
Definition 1. A decentralized partially observable Markov decision process is defined as a 9-tuple <I, S, A, T, O, O, R, b 0 , h>, where:
• I = {1, . . ., n} is the finite set of agents;
• S = {s 0 , . . ., s |S|−1 } is the finite set of states, where s 0 is the initial state;
• A = A 1 × . . . × A n is the set of joint actions, where A i is the finite set of actions of agent i;
• T is the transition function, giving the probability Pr(s'|s, →a) of reaching state s' after taking joint action →a in state s;
• O = O 1 × . . . × O n is the set of joint observations, where O i is the finite set of observations of agent i;
• O is the observation function, giving the probability Pr(→o|→a, s') of receiving joint observation →o after taking joint action →a and reaching state s';
• R(s, →a) is the reward function;
• b 0 is the initial state distribution;
• h is the (finite) horizon.
A joint policy is denoted π = <π 1 , π 2 , . . ., π n >, which specifies a policy for each agent. The goal of solving a Dec-POMDP is to calculate an optimal joint policy π * that maximizes the expected long-term accumulated reward, π * = argmax π E[ Σ t=0..h−1 R(s t , →a t ) | π, b 0 ], and Q * (→θ t , →a t ) is the corresponding optimal Q-value function.
The policy π i of agent i specifies actions for all steps of the Dec-POMDP, and π i can be represented as a sequence of decision rules π i = (δ 0 i , δ 1 i , . . ., δ h−1 i ). The decision rule δ t i for agent i at step t is a mapping from length-t histories to actions, δ t i : θ t i → A i . The partial joint policy ϕ t is a sequence of joint decision rules specified for steps 0, . . ., t − 1, denoted ϕ t = (δ 0 , δ 1 , . . ., δ t−1 ).

Generalized MAA* with Incremental Clustering and Expansion (GMAA*-ICE)
GMAA*-ICE (or ICE) [3] is a heuristically guided algorithm based on the GMAA* framework [4], which generalizes the first multi-agent A* search, MAA* [2]. It uses V(ϕ t ) = V 0...t−1 (ϕ t ) + H t...h−1 (ϕ t ) to guide A* search over partially specified joint policies, constructing a tree top-down, where V 0...t−1 (ϕ t ) is the actual expected reward achieved over the first t steps and H t...h−1 (ϕ t ) is a heuristic value for the remaining h − t steps. ICE uses an admissible heuristic to guarantee that it finds an optimal policy.
The key contributions of ICE are lossless clustering and incremental expansion. The authors introduced a probabilistic equivalence criterion that can be used to decide whether two individual action-observation histories θ i,a and θ i,b of agent i can be clustered. Especially in problems with many histories satisfying probabilistic equivalence, ICE can significantly speed up planning. Incremental expansion is a method of incrementally constructing a search tree by expanding the most promising child ϕ t+1 = (ϕ t , β * ), where β * is the optimal solution of a collaborative Bayesian game. The authors also defined a node priority that helps find a tight lower bound early by first expanding deeper nodes [2].
ICE [3] is our baseline algorithm, but our method differs from it: we integrate Q-value function computation with policy solving, so CEMC calculates Q-value functions as required and does not have to store and maintain a large matrix of Q-values.

Approximate Q-Value Functions
As mentioned above, ICE uses V(ϕ t ) = V 0...t−1 (ϕ t ) + H t...h−1 (ϕ t ) to guide the search, where H t...h−1 (ϕ t ) is a heuristic value estimated using approximate Q-value functions. There are three well-known approximate Q-value functions, Q MDP , Q POMDP , and Q BG , which are all approximations of the optimal Q-value function Q * . Although the optimal policy can be extracted from Q * , Q * is costly and impractical to compute.
Szer and Charpillet [25] used Q MDP to solve Dec-POMDPs by approximating them with an underlying MDP. This underlying MDP is defined for a single agent that takes joint actions but has the same state set, transition function, and reward function as the original Dec-POMDP. It assumes that the state is fully observable, so the single agent gains more information than the agents in the Dec-POMDP do.
Q POMDP [25] is another approximation of Q * . In this case, the Dec-POMDP is treated as an underlying POMDP defined for a single agent that takes joint actions and receives joint observations but has the same state set, transition function, observation function, and reward function. It assumes the centralized-control setting of the underlying POMDP, which allows for better coordination: the single agent knows the joint action-observation history, whereas the agents in a Dec-POMDP can only access local information such as their individual actions and observations. Q BG , proposed by Oliehoek et al. [4], assumes that the agents know the joint action-observation history and the joint action of the previous time step t − 1.
This assumption defines a Bayesian game (BG) for each (→θ, →a) pair, and the actions taken by the agents can be found by solving the constructed BGs.
Q MDP , Q POMDP , and Q BG have all been proven to be upper bounds of Q * [4]. These approximate Q-value functions form a hierarchy of upper bounds on Q * : Q * ≤ Q BG ≤ Q POMDP ≤ Q MDP . This means that all of these approximate Q-value functions are admissible heuristics for Dec-POMDPs.
Apart from the Q-value functions mentioned above, Oliehoek et al. [18] proposed a factored Q-value function as a heuristic payoff function for factored Dec-POMDPs. The precondition of such a decomposition is that the Q-value function of every joint policy is factored. Each local component of this kind of Q-value function depends on a subset of state factors and on the action-observation histories and actions of a subset of agents.

Monte Carlo Tree Search (MCTS)
MCTS incrementally and asymmetrically builds a search tree in a given domain by random sampling. Each node in the tree denotes a state and links to child nodes that represent the actions taken. It works by repeating selection, expansion, simulation, and backpropagation in a loop until some predefined computational budget, usually runtime or memory, is exhausted [26].
Many extensions have been reviewed in the literature [22]. MCTS has been used, for example, to find an optimal feature subset in a feature space [27]. In this paper, we use plain UCT (upper confidence bounds for trees) [6] as the tree policy for computing the Q-value function. Plain UCT is MCTS with UCB1 (a version of the upper confidence bound) as its tree policy: UCB1 = x̄ i + c √(2 ln n / n i ), where x̄ i is the mean value of child i, n is the number of times the parent node has been visited, n i is the number of times child i has been visited, and c > 0 is a constant. The role of the tree policy, applied in the selection and expansion phases, is to select and create nodes. The default policy is used in the simulation phase; its function is to play out the domain to estimate a state value. In a previous paper [28], four criteria for selecting the final action are listed: (i) max child, (ii) robust child, (iii) max-robust child, and (iv) secure child. When the search terminates, one of these four selection mechanisms is used to choose the action to take. Our tree policy chooses the highest-value node in the selection phase, and only one node is expanded in the expansion phase. In the default policy (simulation phase), we replace random sampling with a best-first simulation enhancement. The advantage of randomly selecting actions is that it is simple, requires no domain knowledge, and can quickly run many simulations; however, it is unlikely to be realistic compared to rational players [22]. The enhancement we propose can use domain knowledge to simulate realistically; the details are introduced in Section 4.3.
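As a concrete illustration, the UCB1 selection step can be sketched in C++, the language of our implementation (a minimal sketch; the `Child` statistics and function names are our own, not part of the MADP Toolbox):

```cpp
#include <cmath>
#include <limits>
#include <vector>

// Statistics of one child node: empirical mean value and visit count.
struct Child {
    double mean;   // x_i, the mean value of child i
    int    visits; // n_i, times child i has been visited
};

// UCB1 score x_i + c * sqrt(2 * ln(n) / n_i); unvisited children
// get infinite priority so every action is tried at least once.
double ucb1(const Child& ch, int parentVisits, double c) {
    if (ch.visits == 0) return std::numeric_limits<double>::infinity();
    return ch.mean
         + c * std::sqrt(2.0 * std::log(double(parentVisits)) / ch.visits);
}

// Tree policy: return the index of the child maximizing UCB1.
int selectChild(const std::vector<Child>& children, int parentVisits, double c) {
    int best = 0;
    for (int i = 1; i < int(children.size()); ++i)
        if (ucb1(children[i], parentVisits, c) > ucb1(children[best], parentVisits, c))
            best = i;
    return best;
}
```

With c = 0 the rule is purely greedy on the mean values; a larger c increasingly favors rarely visited children.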

Collaborative Bayesian Games
A Bayesian game (BG) [29] is an augmented normal-form game in which a number of agents interact in a one-shot manner, each holding some private information; this private information defines the type of the agent. A collaborative Bayesian game (CBG) is a BG in which the agents receive identical payoffs. A CBG is defined as follows:
Definition 2. A collaborative Bayesian game (CBG) [3,4] B(b 0 , ϕ t ) = <I, A, Θ, Pr(•), u> modeling stage t of a Dec-POMDP, given initial state distribution b 0 and past joint policy ϕ t , consists of:
• I, the set of agents;
• A, the set of joint actions;
• Θ, the set of joint types, where a type of agent i corresponds to its individual action-observation history;
• Pr(•), a probability distribution over joint types;
• u, a common payoff function u(→θ, →a).
That is, given b 0 , for each ϕ t it is possible to construct a CBG B(b 0 , ϕ t ). Therefore, B(b 0 , ϕ t ) represents the decision-making problem at stage t, given that ϕ t was followed for the first t stages starting from b 0 . We simply write B(ϕ t ) below.
In a CBG, each agent uses a Bayesian game policy β i that maps individual types to actions, β i : Θ i → A i .

Monte Carlo Q-Value Function
In this section, we propose a new Q-value function representation, the Monte Carlo Q-value function Q MC , and prove that it is an upper bound of the optimal Q-value function Q * . Q MC combines the precision of tree search with the anytime character of sample-based Monte Carlo methods, and is more flexible than other Q-value functions. We then present two MCTS enhancements, heavy playout for the simulation policy and adaptive sampling, to speed up the computation of Q MC .

Q MC
In reinforcement learning [30], the Q-value function Q(s, a) is defined as the expected return starting from state s, taking action a, and then following policy π. In contrast, the "state" in Q MC is the joint action-observation history →θ t at time step t. We define Q MC (→θ t , →a) = MC(→θ t , →a), where MC denotes the Monte Carlo method, which estimates the value of (→θ t , →a) by sampled tree search.
At the selection phase of Q MC , UCB1 is used to choose the best action, →a * = argmax →a [ R + R̂ + c √(2 ln n / n →a ) ], where R and R̂ are respectively the immediate and the future reward at (→θ t , →a), R + R̂ is the mean value of the sampled rewards, n is the number of visits of →θ t , and n →a is the number of visits of (→θ t , →a). The term added to the mean, called the UCT bonus, is used to balance exploitation and exploration. The parameter c is a constant (c > 0) that tunes how much influence the bonus has on action selection; it is set larger for problems with more branches and smaller for deeper ones. We do not consider adaptive setting of the parameter c in this paper.
At the expansion phase, Q MC expands only one node. At the simulation phase, Q MC replaces the default random policy with heavy playout, an enhancement of Monte Carlo simulation presented in Section 4.3. At the backpropagation phase, Q MC updates the values and statistics of the nodes visited in this iteration with the result of the finished playout.
Lemma 1. Q MC is an upper bound of the optimal Q-value function Q * .

Proof. See the Appendix.
From the representation of Q MC , we can see that it treats the Dec-POMDP as a model in which a single agent takes joint actions and receives joint observations but has the same state set S, transition function T, observation function O, and reward function R. This assumption makes more information available than in the underlying Dec-POMDP. Q MC is therefore an overestimation of the optimal Q-value function, which allows an A*-like search to find the optimal policy. The pseudocode for computing Q MC is shown in Algorithms 1 and 2.
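To make the four phases concrete, the following C++ sketch runs selection, expansion, simulation, and backpropagation on a small depth-limited tree. It is an illustration only, not the paper's Algorithms 1 and 2; the node layout and the `playout` stand-in for heavy playout are our own assumptions.

```cpp
#include <cmath>
#include <functional>
#include <limits>
#include <vector>

// A node of the sampled tree; one child per joint action.
struct QNode {
    double total  = 0.0; // sum of sampled returns through this node
    int    visits = 0;
    std::vector<QNode> kids;
};

// UCB1 score of a child (mean value plus exploration bonus).
static double score(const QNode& k, int parentVisits, double c) {
    if (k.visits == 0) return std::numeric_limits<double>::infinity();
    return k.total / k.visits
         + c * std::sqrt(2.0 * std::log(double(parentVisits)) / k.visits);
}

// One iteration: select with UCB1, expand children once, simulate via
// `playout(action)` (the heavy-playout stand-in), and backpropagate.
double mcIteration(QNode& node, int depth, int nActions, double c,
                   const std::function<double(int)>& playout) {
    node.visits += 1;
    if (depth == 0) return 0.0;
    if (node.kids.empty()) node.kids.resize(nActions); // expansion
    int best = 0;                                      // selection
    for (int a = 1; a < nActions; ++a)
        if (score(node.kids[a], node.visits, c)
            > score(node.kids[best], node.visits, c)) best = a;
    double ret = playout(best)                         // simulation
               + mcIteration(node.kids[best], depth - 1, nActions, c, playout);
    node.kids[best].total += ret;                      // backpropagation
    return ret;
}

// Monte Carlo value estimate at the root after K iterations.
double qmcEstimate(int K, int depth, int nActions, double c,
                   const std::function<double(int)>& playout) {
    QNode root;
    double sum = 0.0;
    for (int i = 0; i < K; ++i)
        sum += mcIteration(root, depth, nActions, c, playout);
    return sum / K;
}
```

On a toy problem where one joint action always yields reward 1 and the other 0, the estimate concentrates on the better action while the UCT bonus keeps occasionally re-checking the other.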

Adaptive Sample
Due to the anytime character of MCTS, a sufficient number of samples (K) makes Q MC converge to Q * , but it is costly to compute a fully converged Q MC , while a small K leads to a loose Q-value function or even a wrong policy. Thus, in this article, we present an adaptive sampling method in which K is a function of the number of joint actions and the time step: K = m·|A|/t^2, where m is a constant determined by the specific problem, |A| is the number of joint actions, and t is the time step. K is proportional to the number of joint actions and inversely proportional to the square of the time step. This means that problems with more joint actions need a larger K to make Q MC converge, and as the Q MC tree is constructed top-down, K shrinks gradually. We use this technique to balance convergence accuracy and computing time.
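A minimal C++ sketch of this schedule (one assumption of ours: since stages start at t = 0, the denominator uses t + 1 to keep K finite at the first stage; the constant m is problem dependent):

```cpp
// Adaptive sample count: proportional to the number of joint actions
// and inversely proportional to the squared time step. The (t + 1)
// shift is our assumption to avoid division by zero at stage t = 0.
int adaptiveSamples(int m, int numJointActions, int t) {
    return m * numJointActions / ((t + 1) * (t + 1));
}
```

For example, with m = 100 and nine joint actions, the first stage uses 900 samples and the third stage only 100.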

Best-First Simulation Policy
The simulation policy uses a default policy to play out a game quickly, after which MCTS propagates back the simulation results. The default policy is typically domain dependent. As there is no value such as Q(s, a) to guide action selection, uniformly choosing among available actions is a common scheme. However, heuristic knowledge of the problem can be incorporated into the default policy to strengthen the simulation; Silver [31] discusses some factors that influence simulation performance. In this paper, we extend the rollout phase by incorporating domain knowledge into the playouts, which are then called heavy playouts [32]. If it is not the last time step, the algorithm recursively calls the Rollout function; it then returns the best immediate reward (immR), backpropagates the value obtained, and updates the visit counts N(·). This best-first simulation policy (BFSP) returns the promising node first to ensure that the heuristic value is admissible, i.e., a guaranteed overestimation. The details are shown in Algorithm 3.
In particular, we would like to note that the selectBestA() function selects the best action by combining an idea from reinforcement learning with the history heuristic. Reinforcement learning (RL) reinforces the actions that lead to success; the history heuristic assumes that an action good in one position may be good in another. Joint actions that acquire the highest simulated value are given more chances in later simulations. Concretely, we keep an average accumulated value v(→a) for each joint action in a playout. Apart from v(→a), the visit count n i of each action and the total visit count n over all actions in the simulation phase are also stored. An action is selected by a softmax function over these statistics, so a larger value corresponds to a larger selection probability. If the search is at an initial state or the chances of the nodes are equal, a random legal action is selected. For example, if a problem has nine joint actions, at the start of a simulation phase an action is selected through this softmax function; after the simulation ends, its accumulated value up to the terminal state is recorded and its visit count is increased by one. In later simulation phases, these statistics are reused, which is what we call the history heuristic.
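The softmax selection over per-action statistics can be sketched as follows (the paper's exact softmax inputs are not fully specified here, so this sketch uses only the per-action average values v(→a)):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Softmax over the accumulated playout values: a larger v(a) yields a
// larger selection probability. Subtracting the maximum stabilizes exp().
std::vector<double> softmaxProbs(const std::vector<double>& v) {
    double mx = v[0];
    for (double x : v)
        if (x > mx) mx = x;
    std::vector<double> p(v.size());
    double z = 0.0;
    for (std::size_t i = 0; i < v.size(); ++i) {
        p[i] = std::exp(v[i] - mx);
        z += p[i];
    }
    for (double& x : p) x /= z;
    return p;
}
```

The returned probabilities sum to one, and an action is then drawn from this distribution during the simulation phase.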

Multi Agent Planning Using Q MC
In this section, we present our offline planning algorithm, CEMC, which is based on the ICE framework, and introduce a new policy pruning strategy to speed up planning. CEMC integrates Q-value function computation with plan solving and calculates Q-value functions as required, which avoids the need to compute and store all Q-value functions. Thus, CEMC does not calculate and maintain a large Q-value matrix, which yields improvements in both runtime and memory.

Offline Planning Algorithm-CEMC
Offline algorithms generate a complete policy before execution begins; the policy specifies what actions to take in any possible runtime situation. Before presenting CEMC, we first introduce two successful techniques used in ICE that are also adopted by CEMC.
The first is lossless clustering. Two individual action-observation histories θ i,a and θ i,b of agent i can be losslessly clustered when they satisfy the probabilistic equivalence criterion Pr(s, →θ ≠i | θ i,a ) = Pr(s, →θ ≠i | θ i,b ), where →θ ≠i denotes the joint action-observation histories of all agents except agent i. This means that the probability distribution over states and the other agents' joint action-observation histories is identical for θ i,a and θ i,b .
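The criterion amounts to comparing two discrete distributions for equality. Assuming each distribution over (s, →θ ≠i ) pairs has been flattened into a vector (our own representation, for illustration), the check can be sketched as:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Probabilistic equivalence test for lossless clustering: two
// individual histories may be merged when they induce the same
// distribution over states and the other agents' histories. The
// tolerance absorbs floating-point noise in the computed beliefs.
bool probabilisticallyEquivalent(const std::vector<double>& distA,
                                 const std::vector<double>& distB,
                                 double tol = 1e-9) {
    if (distA.size() != distB.size()) return false;
    for (std::size_t i = 0; i < distA.size(); ++i)
        if (std::fabs(distA[i] - distB[i]) > tol) return false;
    return true;
}
```

Histories that pass this test can share one cluster, shrinking the Bayesian games that the planner must solve.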
The second is incremental expansion, a method of incrementally constructing a search tree by expanding the most promising child ϕ t+1 = (ϕ t , β * ), where β * is the optimal solution of the corresponding CBG. A child is selected by node priority. A node q in the search tree is a 3-tuple <ϕ t , v, h>, where v is its heuristic value and h is its depth; nodes are ordered by value and, if the heuristic values are equal, the greater depth is chosen first. This helps find a tight lower bound early by first expanding deeper nodes, and makes the search converge quickly.
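The node priority can be sketched as a comparator (our own struct layout, for illustration; the partial policy itself is omitted):

```cpp
// A search-tree node: heuristic value v and depth h.
struct SearchNode {
    double value; // v = V^{0..t-1}(phi^t) + heuristic
    int    depth; // h
};

// Higher value wins; on ties, the deeper (more specified) node is
// expanded first, which helps find a tight lower bound early.
bool higherPriority(const SearchNode& a, const SearchNode& b) {
    if (a.value != b.value) return a.value > b.value;
    return a.depth > b.depth;
}
```

This comparator can back a priority queue over the policy pool, so the most promising partial policy is always expanded next.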
The CEMC algorithm (Algorithm 4) starts planning with an unspecified partial joint policy ϕ 0 in an initialized policy pool. It greedily selects the best partial joint policy in the pool according to the node priority (Select(L)). Then, it proceeds by constructing a CBG using lossless clustering (ClusterBG()). After that, the CBG is solved, which returns a 2-tuple <β t , V(β t )>, where β t is a Bayesian game policy and V(β t ) is given by the function Solver(): V(β t ) = Σ →θ t P(→θ t | b 0 , ϕ t ) Q MC (→θ t , β t (→θ t )), where P(→θ t | b 0 , ϕ t ) is the probability distribution over joint histories and Q MC (→θ t , β t (→θ t )) is the Q-value function computed by Algorithm 1.
Algorithm 4: CEMC
Input: a Dec-POMDP, an empty policy pool L
Output: an optimal joint policy π *
1: Initialize()
2: while ¬empty(L) do
3:   ϕ t ← Select(L) //select the best partial policy by node priority
4:   B(ϕ t ) ← ClusterBG() //construct a CBG with lossless clustering
5:   <β t , V(β t )> ← Solver(B(ϕ t ))
6:   N IE ← IncExp() //incrementally expand nodes
7:   L ← Prune(v) //prune the policy pool
8: end while
The set of child nodes N IE is expanded using the incremental expansion technique, and only the promising child node β t,* is put into N IE (IncExp()). The heuristic value v of a node is given by the function ComputePolicyAndValue(): V(ϕ t+1 ) = V 0...t−1 (ϕ t ) + V(β t ), where V 0...t−1 (ϕ t ) is the actual expected reward of ϕ t and V(β t ) is given by Equation (12). We use two pruning strategies during planning. If the time step t < h − 1, nodes with values less than the best Q-value function Q t,* MC are pruned in advance. If t = h − 1, CEMC computes fully expanded policies, and nodes with values less than the best lower bound v can be pruned. CEMC ends when the policy pool becomes empty or the highest value in the policy pool is less than the value of the best full policy.
The main idea of CEMC is to integrate Q-value function computation with plan solving, benefiting from the anytime character of Q MC . The original ICE computes and stores the Q-value functions in an N jaoh × N ja matrix, where N jaoh is the number of joint action-observation histories and N ja is the number of joint actions. It is costly to calculate complete Q-value functions in advance. For example, the benchmark problem Mars Rovers (Appendix Table A1) has about 2.82 × 10^14 joint action-observation histories and 36 joint actions, so about 1.02 × 10^16 (→θ, →a) value pairs would need to be computed. CEMC calculates the Q-value functions only when required, without computing and storing all Q-value functions.

Policy Pruning Strategy
Pruning can be split into hard pruning, which removes branches of the tree permanently, and soft pruning, which temporarily removes branches but adds them back into the search at some later point [33]. In this paper, we use hard pruning as our pruning strategy.
The ICE algorithm maintains a policy pool. A node in the pool is denoted as a tuple <ϕ, v, h>, where ϕ is a partial policy, v is its value, and h is its depth. The value of a node is an actual value plus an estimated value: v = V 0...t−1 (ϕ t ) + V(β t ). The estimated value V(β t ) is an overestimation of the actual value from this node to a terminal. In effect, the value of a node ϕ is the value of a path, or a policy, across this node. Therefore, as the search descends toward the terminal or the finite horizon, the value of a node also descends (the actual part grows while the estimated part approaches the actual value). When the search reaches the terminal, the value of the terminal node can be taken as the best lower bound found so far. If the value of a node in the policy pool is smaller than the best lower bound, it can never be optimal; therefore, ICE uses the best lower bound v to prune the policy pool. However, for some problems in the literature, the policy pool shrinks slowly, which wastes a lot of time computing useless nodes.
We use an example to illustrate this. Consider the search tree in Figure 2. The policy pool is a priority list containing several nodes, such as [<ϕ 2 , 10.125, 2>, <ϕ 1 , 10.125, 1>, <ϕ 2 , 9.721, 2>, <ϕ 2 , 6.182, 2>, . . .], where the superscript of ϕ is its time step, i.e., its depth. At each loop, the ICE algorithm chooses the first node in the pool and expands it. If the algorithm reaches the last step, it returns the fully expanded node <ϕ 3 , 5.363, 3> as the best lower bound to prune the policy pool. However, node <ϕ 2 , 6.182, 2> ranks above node <ϕ 3 , 5.363, 3> according to the node priority, so it is not pruned. Assuming that the child of node <ϕ 2 , 6.182, 2> is <ϕ 3 , 4.182, 3>, this fully expanded policy is obviously not optimal and should have been pruned in advance, yet node <ϕ 2 , 6.182, 2> is still expanded in these iterations. Not pruning in advance sends the search algorithm in the wrong direction, and this is what our policy pruning strategy addresses.
Thus, useless nodes like <ϕ 2 , 6.182, 2> should be removed as early as possible. We improve the pruning function by using the maximal Q-value function at time step t, Q t,* MC , combined with the best lower bound v: a node is pruned if its value is less than v * when t < h − 1, or less than v when t = h − 1, where t is the deepest level in the policy pool and v * is computed through Q t,* MC (Equation (14)). Note that pruning with v * is done during planning and does not have to wait for a full expansion. This technique reduces the number of nodes in the policy pool and makes the algorithm converge quickly, which is tested experimentally in Section 6.
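The extended rule can be sketched as a single predicate (an illustration of the rule as stated; v* denotes the bound derived from the best Q MC value at the deepest level, and v the best lower bound):

```cpp
// Extended pruning: before the last stage, prune against vStar (the
// bound derived from the best Q_MC value at the deepest level); at
// the last stage, prune against the best lower bound v.
bool shouldPrune(double nodeValue, int t, int h,
                 double vStar, double vLowerBound) {
    if (t < h - 1) return nodeValue < vStar;
    return nodeValue < vLowerBound;
}
```

Applied to the example above with h = 3, a mid-level node whose value falls below v* is removed before it is ever fully expanded.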
Figure 2. Incremental clustering and expansion (ICE) search tree. Due to inappropriate policy pruning, some nodes like ϕ 2 will be left in the policy pool and may still be expanded in later iterations. The dashed circle is an unvisited node.

Experiments
This section experimentally demonstrates and validates the effectiveness of all proposed techniques: the new Q-value function representation Q_MC, adaptive sampling, the best-first simulation policy (BFSP), the policy pruning strategy, and the core algorithm CEMC. We first describe several benchmark problems from the literature, then present the experimental data. Finally, we analyze and discuss the results.

Experimental Setup
We use several benchmark problems, including Dec-Tiger, Grid-Small, Recycling Robots, Box Pushing, Mars Rovers, and Broadcast Channel. All of these problems have two agents, with different numbers of states, actions, and observations, as shown in Table A1 (see Appendix A), which also lists the number of joint policies for different horizons. These domains can be downloaded at MASPlan (http://masplan.org/).
In Broadcast Channel, each of the two agents must choose whether or not to send a message over a shared channel. If both agents send, a collision occurs, so they must cooperate to deliver messages. In Recycling Robots, two agents move around using their motors, pick up cans, and place them in a specified location using their mechanical arms; they must complete the task cooperatively. The benchmark Mars Rovers involves two rovers performing scientific experiments by choosing to check, drill, or sample, and they can move in four directions. Some locations require only one agent, while others require their cooperation.
Experiments were run on Ubuntu 16.04 with an Intel Core i5 CPU and 6 GB of memory. CEMC was implemented in C++ based on the MADP Toolbox [34]. Every reported result is an average over 100 runs. We compared CEMC with the state-of-the-art exact heuristic algorithm ICE using a compact Q-value function implementation Q_BG. Five statistics are taken into consideration: the time of computing the Q-value function t_Q, the time of calculating the optimal joint policy t_p, the overall time t_o, the value of the optimal policy V*, and the value of a random policy V_r. Note that ICE may not reach the performance reported in the original paper because of differing computing resources, but all results for both ICE and CEMC were obtained in the same environment, so the time and scalability results are directly comparable.

Comparing CEMC and ICE
Table 1 shows performance results for ICE and CEMC. "-" indicates unknown values, caused by wrong results or a time limit violation (runs under 3600 s are valid); "*" indicates out of memory. Table 1 clearly shows that CEMC outperforms the state-of-the-art heuristic algorithm with Q_BG. Especially on scalable problems such as Grid-Small, Box Pushing, and Mars Rovers, CEMC shows significant improvement without computing and storing all Q-values: it obtains the needed Q-values to guide the A* search without pre-computing the whole Q-value function. From Table 1 we can see that on problems like Grid-Small, Box Pushing, and Mars Rovers, CEMC is significantly better than ICE, while on problems like Dec-Tiger, Recycling Robots, and Broadcast Channel the difference is less pronounced. The number of joint decision rules for time step t in the search tree is ∏_{i=1..n} |A_i|^(|O_i|^t), which means the complexity is exponential in the number of actions and observations, and doubly exponential in the time step. The search tree contains the action-observation histories, the actions taken, and the observations received, so the number of branches is determined by the action and observation spaces. Although the Mars Rovers domain has a larger state space, this has little influence on the scale of the search tree; what it influences is the efficiency of state transitions and the time to compute probabilities such as P(o|a, s′).
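The doubly exponential growth of the number of joint decision rules can be made concrete with a small sketch. This is illustrative only; it assumes that each agent's decision rule at step t maps its length-t observation histories to actions, so agent i contributes |A_i| ** (|O_i| ** t) choices.

```python
def num_joint_decision_rules(num_actions, num_obs, t):
    """Number of joint decision rules at time step t:
    prod over agents i of |A_i| ** (|O_i| ** t)."""
    n = 1
    for a, o in zip(num_actions, num_obs):
        n *= a ** (o ** t)
    return n

# Dec-Tiger: 2 agents, 3 actions and 2 observations each.
for t in range(4):
    print(t, num_joint_decision_rules([3, 3], [2, 2], t))
# t = 0, 1, 2, 3 -> 9, 81, 6561, 43046721: doubly exponential in t
```

Even for this tiny domain the count passes 4 x 10^7 by step 3, which is why brute-force enumeration of the search tree is hopeless and a heuristic upper bound is needed.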

Analysis of Policy Pruning Strategy
To illustrate the effectiveness of the extended policy pruning strategy, we list the maximum number of policies in the policy pool for different horizons. Figure 3 shows a comparison of the original ICE and CEMC. Note that the original ICE maintains a larger policy pool than CEMC. For problems in which many nodes have approximately equal V(ϕ^t), ICE expands nodes back and forth to construct the tree carefully, which significantly increases the search time. We use Equation (14) to prune such nodes from the pool in advance.

Analysis and Discussion
CEMC outperforms the state-of-the-art exact heuristic algorithm ICE with the compact Q-value function Q_BG in terms of both runtime and memory. There are three reasons for CEMC's performance.
Firstly, we propose a new Q-value function representation Q_MC, which is an upper bound of the optimal Q-value function Q*. Q_MC is an anytime Q-value function, which makes it possible to compute it as required. Adaptive sampling and heavy rollouts speed up the computation of Q_MC.
Secondly, CEMC calculates the Q-value function as required and does not maintain a large matrix storing all Q(θ^t, a) values. For problems with a large joint action-observation space and horizon, maintaining such a matrix runs out of memory. In addition, the Box Pushing and Mars Rovers rows of Table 1 show that calculating the Q-value function is costly (the time of computing Q-values for ICE exceeds 1100 s); if not every (θ^t, a) pair is computed, much time is saved. A tighter upper bound can speed up planning, as it allows the algorithm to prune the policy pool more aggressively [4], but a tighter value function increases the time of computing the Q-value function, which accordingly increases the overall time (the total time for computing Q-values and planning).
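The compute-as-required idea can be sketched as a memoized lookup in place of a fully materialized |Θ| x |A| matrix. This is an illustrative sketch, not the paper's code; the estimator passed in is a hypothetical stand-in for the actual Q_MC computation.

```python
# Sketch: compute Q(theta, a) lazily and cache it, instead of filling a
# full history-by-action matrix up front. Only pairs the A* search actually
# requests are ever computed.

class LazyQ:
    def __init__(self, estimator):
        self.estimator = estimator   # stand-in for the Q_MC computation
        self.cache = {}

    def value(self, history, action):
        key = (history, action)
        if key not in self.cache:    # computed only on first request
            self.cache[key] = self.estimator(history, action)
        return self.cache[key]

q = LazyQ(lambda h, a: 1.0)          # dummy estimator for illustration
q.value(("o1",), "listen")
assert len(q.cache) == 1             # only the requested pair was computed
```

The memory saving comes from the cache holding only the (θ^t, a) pairs that the search visits, which for pruned trees is a small fraction of the full matrix.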
Thirdly, an extended policy pruning strategy is used in CEMC. The pruning strategy uses the best lower bound found so far, v, and the best accumulated Q-value function Q^MC_{t,*} at time step t to prune the policy pool. This technique avoids expanding nodes that cannot be optimal in any case.

Conclusions
In this paper, we combine the advantages of heuristic search and MCTS, and propose a new Q-value function representation and a novel offline algorithm, CEMC, generalizing the prior method. In particular, we demonstrate two important results: (i) computing the Q-value function only as required is an important reason for CEMC's performance, as it removes the need to maintain a large matrix storing all Q-value function pairs, which saves much memory; (ii) there is no need to compute Q-value functions that are as compact as possible, since experimental results demonstrate that computing compact Q-value functions is costly. Other techniques, such as adaptive sampling, heavy rollouts, and the policy pruning strategy, are introduced to speed up the algorithm. In future work, we plan to analyze the relationship between the looseness of the bound and the number of samples in order to find the smallest K that still optimally solves Dec-POMDPs.

Proofs
We use a Dec-POMDP with a k-steps delayed communication model to prove Lemma 1. This model can be denoted as an augmented MDP, which is a reformulation of prior work [35,36] extended to the Dec-POMDP setting [4].
The augmented MDP is defined as M = ⟨Ŝ, Â, T, R⟩, where an augmented state ŝ ∈ Ŝ is composed of a state together with the joint action-observation history of the last k steps.
Because of the partial observability, the probability multiplied by Q_MC is a conditional probability over all observations given θ^t and a. The computation of Q_MC is illustrated in Figure 1. The black filled nodes are joint histories, and the solid and dotted links are, respectively, the joint actions taken and the joint observations received by the agents. Each joint history node maintains two statistics: N(θ), the visit count of node θ, and the estimated value of each joint action at that node. At the selection phase of Q_MC, UCB1 is used to choose the best action a* = argmax_{a ∈ A} [Q(θ, a) + c · sqrt(ln N(θ) / N(θ, a))], where c is the exploration constant.
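The UCB1 selection step can be sketched directly from the statistics kept at each joint history node. This is a minimal illustrative version; the exploration constant c is a tunable assumption, not a value from the paper.

```python
import math

# UCB1 action selection over the per-action statistics stored at a joint
# history node: values[a] is the accumulated return for action a, counts[a]
# its visit count, and total_visits is N(theta) for the node.

def ucb1_select(values, counts, total_visits, c=1.414):
    best_a, best_score = None, float("-inf")
    for a in values:
        if counts[a] == 0:
            return a                 # try unvisited actions first
        score = values[a] / counts[a] + c * math.sqrt(
            math.log(total_visits) / counts[a])
        if score > best_score:
            best_a, best_score = a, score
    return best_a
```

The first term exploits actions with high average return; the second explores rarely tried actions, so every joint action keeps a nonzero chance of being re-examined as N(θ) grows.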

Figure 1. Tree of the Monte Carlo Q-value function.

Figure 2. Incremental clustering and expansion (ICE) search tree. Due to inappropriate policy pruning, some nodes like ϕ^2 will be left in the policy pool and may still be expanded in later iterations. The dashed circle is an unvisited node.


Figure 3. Maximum number of policies in the policy pool for different horizons. This indicator illustrates that ICE maintains a larger policy pool than CEMC, which may make the former expand useless nodes.

Convergence of Q_MC

More samples make the approximate Q-value function Q_MC converge toward Q*, but require more runtime. To illustrate the convergence of Q_MC, we gathered statistics of the max Q_MC for Dec-Tiger as the sample time K increases, as shown in Figure 4. The max values of Q_MDP, Q_POMDP, and Q_BG for Dec-Tiger with horizons h = 3 and h = 4 are also listed. We can see from the figure that the max Q_MC declines and converges as K increases. Note that Q_MC is as compact as Q_POMDP for horizon h = 3; for horizon h = 4, Q_MC is more compact than Q_POMDP but looser than Q_BG. The goal of Figure 4 is to show that Q_MC is an upper bound of Q* and that a less compact Q-value function can reduce the time of computing while still guiding the A* search to the optimal policy.

Appl. Sci. 2019, 9, x FOR PEER REVIEW
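The qualitative behavior in Figure 4, a Monte-Carlo estimate tightening as K grows, can be reproduced with a toy rollout. This sketch uses a uniform toy return rather than the Dec-Tiger simulator, and `mc_estimate` is a hypothetical name for illustration.

```python
import random

# Sketch: a Monte-Carlo value estimate converges to the true expected
# return as the number of samples K increases, mirroring how max Q_MC
# settles in Figure 4. The rollout here is a toy stand-in.

def mc_estimate(rollout, K, seed=0):
    rng = random.Random(seed)
    return sum(rollout(rng) for _ in range(K)) / K

toy_rollout = lambda rng: rng.uniform(0.0, 10.0)   # true mean is 5.0
for K in (10, 100, 10000):
    print(K, round(mc_estimate(toy_rollout, K), 3))
```

The estimate's standard error shrinks like 1/sqrt(K), which is why the paper trades off K (runtime) against the tightness of the Q_MC bound.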

Figure 4. Convergence of Q_MC as the sample time K increases. Q_MC converges to Q_POMDP, which means Q_MC is an upper bound for Q*, but is less compact than Q_BG.
A Bayesian game is a tuple ⟨I, A, Θ, Pr(·), u⟩, where:
• I = {1, . . ., n} is the finite set of agents;
• A is the set of joint actions;
• Θ is the set of joint types, each of which specifies a type for each agent, θ = ⟨θ_1, θ_2, . . ., θ_n⟩;
• Pr(·) is a probability distribution over joint types;
• u is a heuristic payoff function mapping joint types and joint actions to a real number: u(θ, a).
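The payoff structure above can be sketched by scoring a joint policy as its expected payoff under Pr. This is an illustrative sketch with hypothetical types, probabilities, and payoffs, not a benchmark from the paper.

```python
# Sketch: expected payoff of a joint policy in a Bayesian game.
# policy maps each joint type to a joint action; u(theta, a) is the
# heuristic payoff; type_probs is Pr over joint types.

def expected_payoff(policy, type_probs, u):
    return sum(p * u(theta, policy[theta]) for theta, p in type_probs.items())

type_probs = {"t1": 0.7, "t2": 0.3}                      # hypothetical Pr
u = lambda theta, a: {"t1": {"go": 2.0},
                      "t2": {"go": -1.0}}[theta][a]      # hypothetical u
policy = {"t1": "go", "t2": "go"}
assert abs(expected_payoff(policy, type_probs, u) - 1.1) < 1e-9
```

Solving the Bayesian game means choosing the policy maximizing this expectation; in heuristic Dec-POMDP planners, u is filled in by a Q-value function such as Q_BG or Q_MC.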
At the backpropagation phase, the visit counts and value statistics of the joint history nodes along the simulated path are updated.
Lemma 1. Q_MC is an upper bound of the optimal Q-value function Q*.

Proof. See the appendix.

From the representation of Q_MC we can see that it treats a Dec-POMDP as a model in which a single agent takes joint actions and receives joint observations, but which has the same state set S, transition function T, observation function O, and reward function R. This assumption underlies Algorithm 1.

Algorithm 1: Q_MC(θ^t, a)
Input: joint action-observation history θ^t at time t, joint action a
Output: Monte-Carlo Q-value function Q_MC(θ^t, a)
1: K ← adaptSet(numA, t) // adaptive setting of the number of samples
2: immR ← ComputeImmR(a)
3: t ← θ^t.timeStep
4: for k < K do
...
8:
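The outer loop of Algorithm 1 can be sketched as follows. This is a hedged approximation: `adapt_set` and `simulate` are hypothetical stand-ins for the paper's adaptSet and for the truncated loop body (selection, expansion, rollout, backpropagation), and the sample-count rule is an assumption for illustration.

```python
# Sketch of Algorithm 1's shape: pick an adaptive sample count K, run K
# simulations from the joint history, and return immediate reward plus the
# averaged Monte-Carlo estimate of future return.

def adapt_set(num_actions, t, base=100):
    # assumption: more samples for larger action spaces and deeper steps
    return base * num_actions * (t + 1)

def q_mc(history, action, immediate_reward, simulate, num_actions):
    t = len(history)                 # stand-in for theta_t.timeStep
    K = adapt_set(num_actions, t)    # adaptive number of samples
    future = sum(simulate(history, action) for _ in range(K)) / K
    return immediate_reward + future

# toy usage: a zero-future simulator gives back just the immediate reward
est = q_mc(("o1",), "listen", -1.0, lambda h, a: 0.0, num_actions=3)
assert est == -1.0
```

Because each call runs its own simulations, Q_MC is anytime and can be invoked only for the (θ^t, a) pairs the search actually needs, which is the property CEMC exploits.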

Table 1. Comparative results on some benchmarks.

Table A1. Domain parameters and the number of joint policies π for different horizons h.