# HMCTS-OP: Hierarchical MCTS Based Online Planning in the Asymmetric Adversarial Environment


## Abstract


## 1. Introduction

- We model the online planning problem in the asymmetric adversarial environment as an MDP and extend the MDP to the semi-Markov decision process (SMDP) by introducing the task hierarchies. This provides the theoretical foundation for MAXQ hierarchical decomposition.
- We derive the MAXQ value hierarchical decomposition for the defined hierarchical tasks. The MAXQ value hierarchical decomposition provides a scalable way to calculate the rewards of hierarchical tasks in HMCTS-OP.
- We use the MAXQ-based task hierarchies to reduce the search space and guide the search process. This significantly reduces the computational cost, which enables the MCTS to search deeper and find better actions within a limited time frame. As a result, HMCTS-OP performs better in online planning in the asymmetric adversarial environment.

## 2. Background

#### 2.1. Markov and Semi-Markov Decision Process

#### 2.2. MAXQ

- ${T}_{i}$ represents the termination condition of subtask ${M}_{i}$, which is used to judge whether ${M}_{i}$ is terminated. Specifically, ${S}_{i}$ and ${G}_{i}$ are the active states and termination states of ${M}_{i}$ respectively. If the current state $s\in {G}_{i}$ or the predefined maximum calculation time or number of iterations is reached, ${T}_{i}$ is set to 1, indicating that ${M}_{i}$ is terminated.
- ${A}_{i}$ is a set of actions; it contains both primitive actions and high-level subtasks.
- ${\tilde{R}}_{i}$ is the optional pseudo-reward function of ${M}_{i}$.
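The components above can be sketched as a small data structure. This is a minimal illustration; the field names and state encodings are our own, not the paper's.

```python
from dataclasses import dataclass
from typing import FrozenSet, List

@dataclass
class Subtask:
    """Illustrative sketch of a MAXQ subtask M_i (identifiers are ours)."""
    name: str
    actions: List[str]             # A_i: primitive actions and/or child subtasks
    active_states: FrozenSet[str]  # S_i
    goal_states: FrozenSet[str]    # G_i

    def terminated(self, s: str, budget_exceeded: bool = False) -> bool:
        # T_i = 1 when s is a goal state, s is not active, or the
        # predefined calculation budget has been exhausted.
        return s in self.goal_states or s not in self.active_states or budget_exceeded

# Hypothetical navigation subtask with a toy three-value state encoding.
nav = Subtask("NavToNeaOpp",
              ["NavUp", "NavDown", "NavLeft", "NavRight"],
              active_states=frozenset({"far", "near"}),
              goal_states=frozenset({"adjacent"}))
```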

#### 2.3. MCTS

## 3. Related Work

## 4. Method

#### 4.1. Asymmetric Adversarial Environment Modeling

**State space.** The joint state of online planning in the asymmetric adversarial environment contains state variables that cover all units. It is represented as a high-dimensional vector $s=\left({s}_{0},{s}_{1},{s}_{2},\dots,{s}_{n}\right)$, which includes the information of $n$ units. For each unit $i$, the state variable ${s}_{i}$ is defined as ${s}_{i}=\left(player,x,y,hp\right)$, where $\left(x,y\right)$ is the current position, $player$ is the owner of the unit (red or blue), and $hp$ represents the health points. The initial health points of the base, light, and worker units are 10, 4, and 1, respectively.

**Action space.** Each movable unit has nine optional primitive actions, which are listed as follows:

- Four navigation actions: NavUp (move upward), NavDown (move downward), NavLeft (move leftward), and NavRight (move rightward).
- Four attack actions: FirUp (fire upward), FirDown (fire downward), FirLeft (fire leftward), and FirRight (fire rightward).
- Wait.

**Transition function.** Each move action has a probability of 0.9 of moving to the target position successfully and a probability of 0.1 of staying in the current position. Each fire action has a probability of 0.9 of damaging the opponent successfully and a probability of 0.1 of failing.

**Reward function.** Each primitive action has a reward of −1. If the agent successfully attacks the opponent and the opponent loses 1 health point, the agent receives a reward of 20. Conversely, if the agent is attacked by the opponent and loses 1 health point, it receives a reward of −20. Moreover, the reward is 100 for winning the game and −100 for losing it.
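Under the transition and reward model above, a one-step generative model for a single unit can be sketched as follows. The function and helper names are illustrative, not the paper's simulator, and only one unit pair is modeled.

```python
import random

MOVE_SUCCESS = 0.9   # move reaches the target cell; else the unit stays put
FIRE_SUCCESS = 0.9   # fire removes 1 hp from the target; else it misses
STEP_REWARD = -1     # every primitive action costs 1
HIT_REWARD = 20      # bonus for a successful hit on an opponent unit

def generative_model(pos, target_pos, target_hp, action, rng=random.random):
    """Sample (new_pos, new_target_hp, reward) for one primitive action."""
    reward = STEP_REWARD
    if action.startswith("Nav"):
        if rng() < MOVE_SUCCESS:
            pos = target_pos            # move succeeded
    elif action.startswith("Fir"):
        if rng() < FIRE_SUCCESS:
            target_hp -= 1              # opponent loses 1 health point
            reward += HIT_REWARD
    # "Wait" changes nothing beyond the step cost.
    return pos, target_hp, reward
```

Injecting a deterministic `rng` makes the two outcome branches easy to exercise in isolation.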

#### 4.2. MAXQ Hierarchical Decomposition

- NavUp, NavDown, NavLeft, NavRight, FirUp, FirDown, FirLeft, FirRight, and Wait: These actions are defined by the RTS; they are primitive actions. When a primitive action is performed, it receives a local reward of −1. This ensures that the online policy of a high-level subtask reaches the corresponding goal state as soon as possible.
- NavToNeaOpp, NavToCloBaseOpp, and FireTo: The NavToNeaOpp subtask will move the light military unit to the closest enemy unit as soon as possible by performing NavUp, NavDown, NavLeft, and NavRight actions and taking into account the action uncertainties. Similarly, the NavToCloBaseOpp subtask will move the light military unit to the enemy unit closest to the base as fast as possible. The goal of the FireTo subtask is to attack enemy units within a range.
- Attack and Defense: The purpose of Attack is to destroy the enemy’s units to win by planning the attacking behaviors, and the purpose of Defense is to defend against the enemy’s units to protect bases by carrying out defensive behaviors.
- Root: This is a root task. The goal of Root is to destroy the enemy’s units and protect bases. In the Root task, the Attack subtask and the Defense subtask are evaluated by the hierarchical UCB1 policy according to the HMCTS-OP, which is described in the next section in detail.
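The hierarchy above can be written down as a nested mapping. The task names come from the paper (Figure 4); the exact parent–child assignment of NavToNeaOpp and NavToCloBaseOpp to Attack and Defense is our assumption, inferred from the subtask descriptions.

```python
PRIMITIVES = {"NavUp", "NavDown", "NavLeft", "NavRight",
              "FirUp", "FirDown", "FirLeft", "FirRight", "Wait"}

# Children of each composite task. The Attack/Defense assignments below
# are assumptions based on the subtask goals, not stated explicitly.
HIERARCHY = {
    "Root": ["Attack", "Defense"],
    "Attack": ["NavToNeaOpp", "FireTo"],
    "Defense": ["NavToCloBaseOpp", "FireTo"],
    "NavToNeaOpp": ["NavUp", "NavDown", "NavLeft", "NavRight"],
    "NavToCloBaseOpp": ["NavUp", "NavDown", "NavLeft", "NavRight"],
    "FireTo": ["FirUp", "FirDown", "FirLeft", "FirRight"],
}

def subtasks(t):
    """Children of t in the MAXQ graph; primitive actions have none."""
    return HIERARCHY.get(t, [])
```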

#### 4.3. Hierarchical MCTS-Based Online Planning (HMCTS-OP)

**Algorithm 1** PLAY.

```
1: function PLAY(task t, state s, MAXQ hierarchy M, rollout policy π_rollout)
2:   repeat
3:     a ← HMCTS-OP(t, s, M, π_rollout)
4:     s′ ← Execute(s, a)
5:     s ← s′
6:   until termination conditions
7: end function
```

**Algorithm 2** HMCTS-OP.

```
1:  function HMCTS-OP(task t, state s, MAXQ hierarchy M, rollout policy π_rollout)
2:    if t is primitive then return ⟨R(s, t), t⟩
3:    end if
4:    if s ∉ S_t and s ∉ G_t then return ⟨−∞, nil⟩
5:    end if
6:    if s ∈ G_t then return ⟨0, nil⟩
7:    end if
8:    initialize search tree T for task t
9:    while within computational budget do
10:     HMCTSSimulate(t, s, M, 0, π_rollout)
11:   end while
12:   return GetGreedyPrimitive(t, s)
13: end function
```

**Algorithm 3** HMCTSSimulate.

```
1:  function HMCTSSimulate(task t, state s, MAXQ hierarchy M, depth d, rollout policy π_rollout)
2:    steps ← 0                               // number of steps executed by task t
3:    if t is primitive then
4:      (s′, r, 1) ← Generative-Model(s, t)   // domain generative model (simulator)
5:      steps ← 1
6:      return (s′, r, steps)
7:    end if
8:    if t terminates or d > d_max then
9:      return (s, 0, 0)
10:   end if
11:   if node (t, s) is not in tree T then
12:     insert node (t, s) into T
13:     return Rollout(t, s, d, π_rollout)
14:   end if
15:   if node (t, s) is not fully expanded then
16:     choose a ∈ untried subtasks from subtasks(t)
17:   else
18:     a ← argmax_{a′ ∈ subtasks(t)} Q(t, s, a′) + c·sqrt(log N(t, s) / N(t, s, a′))
19:   end if
20:   (s′, r, k) ← HMCTSSimulate(a, s, M, d, π_rollout)
21:   (s″, r′, k′) ← Rollout(t, s′, d + k, π_rollout)
22:   r ← r + γ^k · r′
23:   steps ← steps + k + k′
24:   N(t, s) ← N(t, s) + 1
25:   N(t, s, a) ← N(t, s, a) + 1
26:   Q(t, s, a) ← Q(t, s, a) + (r − Q(t, s, a)) / N(t, s, a)
27:   return (s″, r, steps)
28: end function
```
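The selection step of HMCTSSimulate (lines 15–19) can be sketched in Python. The node representation below is an illustrative stand-in for the search-tree statistics $N(t,s)$, $N(t,s,a)$, and $Q(t,s,a)$; the exploration constant is an assumption.

```python
import math
import random

def select_subtask(node, c=0.05):
    """UCB1 selection over a node's child subtasks.

    `node` is a dict with keys:
      "untried":  subtasks not yet expanded at this node,
      "N":        visit count N(t, s),
      "children": {a: {"Q": mean return Q(t, s, a), "N": count N(t, s, a)}}.
    """
    if node["untried"]:
        # Node not fully expanded: pick one of the untried subtasks.
        return random.choice(node["untried"])
    # Fully expanded: UCB1 balances exploitation (Q) and exploration.
    return max(
        node["children"],
        key=lambda a: node["children"][a]["Q"]
        + c * math.sqrt(math.log(node["N"]) / node["children"][a]["N"]),
    )
```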

**Algorithm 4** Rollout.

```
1:  function Rollout(task t, state s, depth d, rollout policy π_rollout)
2:    steps ← 0
3:    if t is primitive then
4:      (s′, r, 1) ← Generative-Model(s, t)   // domain generative model (simulator)
5:      return (s′, r, 1)
6:    end if
7:    if t terminates or d > d_max then
8:      return (s, 0, 0)
9:    end if
10:   a ~ π_rollout(s, t)
11:   (s′, r, k) ← Rollout(a, s, d, π_rollout)
12:   (s″, r′, k′) ← Rollout(t, s′, d + k, π_rollout)
13:   r ← r + γ^k · r′
14:   steps ← steps + k + k′
15:   return (s″, r, steps)
16: end function
```

**Algorithm 5** GetGreedyPrimitive.

```
1: function GetGreedyPrimitive(task t, state s)
2:   if t is primitive then
3:     return t
4:   else
5:     a ← argmax_{a′} Q(t, s, a′)
6:     return GetGreedyPrimitive(a, s)
7:   end if
8: end function
```

## 5. Experiment and Results

#### 5.1. Experiment Setting

The maximal number of states reaches 10^{44}, which represents a very large planning problem.

**UCT**: This is a standard instance of an MCTS algorithm. UCB1 is used as the tree policy to select actions and is described in Equation (4). The parameter setting is $C=0.05$.

**NaiveMCTS**[21]: This uses naïve sampling in the MCTS. The parameter settings are ${\pi}_{0}\left({\epsilon}_{0}=0.4\right)$, ${\pi}_{1}\left({\epsilon}_{1}=0.33\right)$, ${\pi}_{g}\left({\epsilon}_{g}=0\right)$.

**InformedNaiveMCTS**[20]: This learns the probability distribution of actions in advance and incorporates the distribution into the NaiveMCTS.

**HMCTS-OP**: This is a hierarchical MCTS-based online planning algorithm with a UCB1 policy. The parameter setting is a discount factor of $\gamma =0.99$.

**Smart-HMCTS-OP**: This is an HMCTS-OP algorithm equipped with a hand-coded rollout policy, named the smart rollout policy. The smart rollout policy selects actions randomly, but attack actions are five times more likely to be chosen than other actions.
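The smart rollout policy can be sketched with weighted sampling. The 5:1 weighting follows the description above, while the function name and weighting scheme details are our own.

```python
import random

ACTIONS = ["NavUp", "NavDown", "NavLeft", "NavRight",
           "FirUp", "FirDown", "FirLeft", "FirRight", "Wait"]

def smart_rollout_policy(rng=random):
    # Attack (Fir*) actions get weight 5; all other actions get weight 1,
    # so an attack action is five times more likely than any other action.
    weights = [5 if a.startswith("Fir") else 1 for a in ACTIONS]
    return rng.choices(ACTIONS, weights=weights, k=1)[0]
```

With these weights, roughly 80% of sampled actions are attacks (total weight 20 out of 25).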

#### 5.2. Results

The cumulative rewards of the five algorithms were compared group by group for each scenario and each opponent ($Group_{R11}$–$Group_{R15}$, $Group_{U11}$–$Group_{U15}$, $Group_{R21}$–$Group_{R25}$, $Group_{U21}$–$Group_{U25}$, $Group_{R31}$–$Group_{R35}$, $Group_{U31}$–$Group_{U35}$). Figure 7 and Table 4 show that there were significant differences in cumulative rewards between different algorithms for all scenarios and opponents.

## 6. Conclusions

## Author Contributions

## Funding

## Conflicts of Interest

## References

- Vien, N.A.; Toussaint, M. Hierarchical Monte-Carlo Planning. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, Austin, TX, USA, 25–30 January 2015; pp. 3613–3619.
- Hostetler, J.; Fern, A.; Dietterich, T. State Aggregation in Monte Carlo Tree Search. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, Québec City, QC, Canada, 27–31 July 2014.
- He, R.; Brunskill, E.; Roy, N. PUMA: Planning under Uncertainty with Macro-Actions. In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, Atlanta, GA, USA, 11–15 July 2010; Volume 40, pp. 523–570.
- Dietterich, T.G. Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition. J. Artif. Intell. Res. **2000**, 13, 227–303.
- Puterman, M.L. Markov Decision Processes: Discrete Stochastic Dynamic Programming; John Wiley & Sons: New York, NY, USA, 1994; ISBN 0471619779.
- Browne, C.; Powley, E. A survey of Monte Carlo tree search methods. IEEE Trans. Comput. Intell. AI Games **2012**, 4, 1–49.
- Kocsis, L.; Szepesvári, C. Bandit Based Monte-Carlo Planning. In European Conference on Machine Learning; Springer: Berlin/Heidelberg, Germany, 2006; pp. 282–293; ISBN 978-3-540-45375-8.
- Auer, P.; Cesa-Bianchi, N.; Fischer, P. Finite-time analysis of the multiarmed bandit problem. Mach. Learn. **2002**, 47, 235–256.
- Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction, 2nd ed.; MIT Press: London, UK, 2018.
- Črepinšek, M.; Liu, S.H.; Mernik, M. Exploration and exploitation in evolutionary algorithms: A survey. ACM Comput. Surv. **2013**, 45, 1–33.
- Li, Z.; Narayan, A.; Leong, T. An Efficient Approach to Model-Based Hierarchical Reinforcement Learning. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–10 February 2017; pp. 3583–3589.
- Doroodgar, B.; Liu, Y.; Nejat, G. A learning-based semi-autonomous controller for robotic exploration of unknown disaster scenes while searching for victims. IEEE Trans. Cybern. **2014**, 44, 2719–2732.
- Schwab, D.; Ray, S. Offline reinforcement learning with task hierarchies. Mach. Learn. **2017**, 106, 1569–1598.
- Le, H.M.; Jiang, N.; Agarwal, A.; Dudík, M.; Yue, Y.; Daumé, H. Hierarchical imitation and reinforcement learning. In Proceedings of ICML 2018: The 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; Volume 7, pp. 4560–4573.
- Bai, A.; Wu, F.; Chen, X. Online Planning for Large Markov Decision Processes with Hierarchical Decomposition. ACM Trans. Intell. Syst. Technol. **2015**, 6, 1–28.
- Bai, A.; Srivastava, S.; Russell, S. Markovian state and action abstractions for MDPs via hierarchical MCTS. In Proceedings of IJCAI: International Joint Conference on Artificial Intelligence, New York, NY, USA, 9–15 July 2016; pp. 3029–3037.
- Menashe, J.; Stone, P. Monte Carlo hierarchical model learning. In Proceedings of the Fourteenth International Conference on Autonomous Agents and Multiagent Systems, Istanbul, Turkey, 4–8 May 2015; Volume 2, pp. 771–779.
- Sironi, C.F.; Liu, J.; Perez-Liebana, D.; Gaina, R.D. Self-Adaptive MCTS for General Video Game Playing. In Proceedings of the International Conference on the Applications of Evolutionary Computation; Springer: Cham, Switzerland, 2018.
- Neufeld, X.; Mostaghim, S.; Perez-Liebana, D. A Hybrid Planning and Execution Approach Through HTN and MCTS. In Proceedings of the IntEx Workshop at ICAPS-2019, London, UK, 23–24 October 2019; Volume 1, pp. 37–49.
- Ontañón, S. Informed Monte Carlo Tree Search for Real-Time Strategy Games. In Proceedings of the IEEE Conference on Computational Intelligence in Games (CIG), New York, NY, USA, 22–25 August 2017.
- Ontañón, S. The combinatorial Multi-armed Bandit problem and its application to real-time strategy games. In Proceedings of the Ninth Artificial Intelligence and Interactive Digital Entertainment Conference (AIIDE), Boston, MA, USA, 14–15 October 2013; pp. 58–64.
- Theodorsson-Norheim, E. Kruskal-Wallis test: BASIC computer program to perform nonparametric one-way analysis of variance and multiple comparisons on ranks of several independent samples. Comput. Methods Programs Biomed. **1986**, 23, 57–62.
- DeGroot, M.H.; Schervish, M.J. Probability and Statistics, 4th ed.; Addison Wesley: Boston, MA, USA, 2011; ISBN 0201524880.

**Figure 4.** The overall task hierarchies of the online planning problem in the asymmetric adversarial environment.

**Figure 6.** The average game time and scores of 50 games for each scenario and each algorithm against two fixed opponents (Random and UCT): (**a**,**b**) the results in scenario 1; (**c**,**d**) the results in scenario 2; (**e**,**f**) the results in scenario 3.

| | Scenario 1 | Scenario 2 | Scenario 3 |
|---|---|---|---|
| Size | 8 × 8 | 10 × 10 | 12 × 12 |
| Units | 23 | 23 | 23 |
| Maximal states | 10^{36} | 10^{41} | 10^{44} |

**Table 2.** The average cumulative reward over 50 games for each scenario and each algorithm against two fixed opponents (Random and UCT).

| Opponent | Scenario | UCT | NaiveMCTS | Informed NaiveMCTS | HMCTS-OP | Smart-HMCTS-OP |
|---|---|---|---|---|---|---|
| Random | 8 × 8 | −38.94 | −57.7 | −153.56 | 109.94 | 276.34 |
| | 10 × 10 | 105.02 | 47.98 | 30.12 | 242.24 | 312.66 |
| | 12 × 12 | −6.64 | 55.72 | 69.88 | 247.52 | 261.38 |
| UCT | 8 × 8 | −929.98 | −987.68 | −995.06 | −354.64 | −288.38 |
| | 10 × 10 | −694.42 | −610.48 | −840.8 | −236.34 | −95.5 |
| | 12 × 12 | −653.12 | −660.16 | −659.54 | −173.58 | −89.34 |

| Opponent | Scenario | UCT | NaiveMCTS | Informed NaiveMCTS | HMCTS-OP | Smart-HMCTS-OP |
|---|---|---|---|---|---|---|
| Random | 8 × 8 | $Group_{R11}$ | $Group_{R12}$ | $Group_{R13}$ | $Group_{R14}$ | $Group_{R15}$ |
| | 10 × 10 | $Group_{R21}$ | $Group_{R22}$ | $Group_{R23}$ | $Group_{R24}$ | $Group_{R25}$ |
| | 12 × 12 | $Group_{R31}$ | $Group_{R32}$ | $Group_{R33}$ | $Group_{R34}$ | $Group_{R35}$ |
| UCT | 8 × 8 | $Group_{U11}$ | $Group_{U12}$ | $Group_{U13}$ | $Group_{U14}$ | $Group_{U15}$ |
| | 10 × 10 | $Group_{U21}$ | $Group_{U22}$ | $Group_{U23}$ | $Group_{U24}$ | $Group_{U25}$ |
| | 12 × 12 | $Group_{U31}$ | $Group_{U32}$ | $Group_{U33}$ | $Group_{U34}$ | $Group_{U35}$ |

| Group | $Group_{R11}$–$Group_{R15}$ | $Group_{U11}$–$Group_{U15}$ | $Group_{R21}$–$Group_{R25}$ | $Group_{U21}$–$Group_{U25}$ | $Group_{R31}$–$Group_{R35}$ | $Group_{U31}$–$Group_{U35}$ |
|---|---|---|---|---|---|---|
| p-value | 1.2883 × 10^{−12} | 1.2570 × 10^{−32} | 7.5282 × 10^{−13} | 3.1371 × 10^{−22} | 3.0367 × 10^{−11} | 3.6793 × 10^{−22} |

| Opponent | Scenario | Group | UCT | NaiveMCTS | Informed NaiveMCTS |
|---|---|---|---|---|---|
| Random | 8 × 8 | HMCTS-OP | 0.4145 | 0.2623 | 0.0115 |
| | | Smart-HMCTS-OP | 1.7033 × 10^{−7} | 4.0795 × 10^{−8} | 9.9303 × 10^{−9} |
| | 10 × 10 | HMCTS-OP | 0.1634 | 0.0331 | 0.0010 |
| | | Smart-HMCTS-OP | 6.7051 × 10^{−7} | 2.6123 × 10^{−8} | 9.9410 × 10^{−9} |
| | 12 × 12 | HMCTS-OP | 6.2448 × 10^{−6} | 4.2737 × 10^{−4} | 6.7339 × 10^{−4} |
| | | Smart-HMCTS-OP | 1.1508 × 10^{−7} | 1.4085 × 10^{−5} | 2.4075 × 10^{−5} |
| UCT | 8 × 8 | HMCTS-OP | 9.9611 × 10^{−9} | 9.9217 × 10^{−9} | 9.9217 × 10^{−9} |
| | | Smart-HMCTS-OP | 9.9230 × 10^{−9} | 9.9217 × 10^{−9} | 9.9217 × 10^{−9} |
| | 10 × 10 | HMCTS-OP | 1.1821 × 10^{−6} | 1.9423 × 10^{−4} | 9.9353 × 10^{−9} |
| | | Smart-HMCTS-OP | 9.9718 × 10^{−9} | 5.6171 × 10^{−8} | 9.9217 × 10^{−9} |
| | 12 × 12 | HMCTS-OP | 7.1822 × 10^{−8} | 3.0630 × 10^{−8} | 1.5334 × 10^{−8} |
| | | Smart-HMCTS-OP | 9.9384 × 10^{−9} | 9.9260 × 10^{−9} | 9.9225 × 10^{−9} |

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite


Lu, L.; Zhang, W.; Gu, X.; Ji, X.; Chen, J.
HMCTS-OP: Hierarchical MCTS Based Online Planning in the Asymmetric Adversarial Environment. *Symmetry* **2020**, *12*, 719.
https://doi.org/10.3390/sym12050719
