Applied Sciences
  • Article
  • Open Access

20 January 2023

Tensor Implementation of Monte-Carlo Tree Search for Model-Based Reinforcement Learning

Faculty of Management Science and Informatics, University of Žilina, Univerzitná 8215/1, 010 26 Žilina, Slovakia
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue Deep Reinforcement Learning for Robots and Agents

Abstract

Monte-Carlo tree search (MCTS) is a widely used heuristic search algorithm. In model-based reinforcement learning, MCTS is often utilized to improve the action selection process. However, model-based reinforcement learning methods need to process a large number of observations during training. If MCTS is involved, one instance of MCTS must be run for each observation in every training iteration. Therefore, an efficient method for processing multiple instances of MCTS is needed. We propose an MCTS implementation that processes a batch of observations in a fully parallel fashion on a single GPU using tensor operations. We demonstrate the efficiency of the proposed approach on the MuZero reinforcement learning algorithm. Empirical results show that our method outperforms other approaches and scales well with an increasing number of observations and simulations.

1. Introduction

Reinforcement learning (RL) is a core machine learning topic concerned with how agents should perform actions in an environment in order to maximize the total cumulative reward. The agent learns by interacting with the environment through trial and error, using feedback from its own actions and experiences. Its purpose is to find an optimal or near-optimal strategy that maximizes the reward function; this strategy is referred to as a policy. RL-based methods have achieved outstanding accomplishments in a number of domains, e.g., games [1,2], autonomous driving [3], UAVs [4,5,6], robotics [7], and traffic signal control [8].
RL may be divided into two fundamental categories: model-free and model-based. Model-free approaches directly learn a value function or a policy by interacting with the environment. Model-based RL uses a model of the environment to perform decision making through planning. This model is commonly represented by a Markov decision process (MDP) [9] consisting of two components: a state transition function and a reward function. MDPs are widely used in artificial intelligence for modeling sequential decision-making scenarios with probabilistic dynamics [10,11,12].
One of the most widely employed planning techniques is Monte-Carlo tree search (MCTS). MCTS combines the precision of tree search with the generality of random sampling [13]. It is a well-established approach to look-ahead search in the context of model-based RL that extends and improves the decision-making process by building and traversing a tree. It was utilized in AlphaGo [1], the first program to achieve superhuman performance in Go, and in many of its successors [2,14,15].
MCTS is very challenging to parallelize due to its inherently sequential nature, where each rollout depends on the statistics computed from previous simulations [16]. The problem is even harder on GPU architectures. One of the reasons is the SIMD execution scheme of the GPU, which causes standard CPU parallelization schemes such as root parallelism to fail [17].
In this paper, we propose a parallel implementation that evaluates multiple unique MCTS trees and is fully implemented on a graphics processing unit (GPU). A common way to use MCTS is to gradually build one large tree that is continuously used and updated for the needs of the task at hand. We focus on a different type of task, namely tasks that need to evaluate a large number of unique trees instead of gradually building a single tree. We show that a GPU tensor implementation of MCTS is suitable for this setting and can outperform CPU and CPU-GPU implementations despite the atomic nature of the operations in MCTS. As an example of such a task, we present model-based RL, which was our prime motivation for developing this implementation. Here, a large number of different states (observations) need to be evaluated in each training iteration. Each state is represented as the root of a unique tree, which is then processed by MCTS. These trees are built in order to obtain the actual training data and are then discarded and rebuilt from new states in the subsequent training iteration. Efficient generation and evaluation of new trees during training is key to the performance of these methods. Therefore, our implementation can dramatically speed up the training process through more efficient evaluation of a large number of observations.
To fully exploit the GPU's capabilities, the environment model should also be implemented on the GPU; this is crucial for an overall GPU implementation of MCTS. Model-based RL can be divided into methods with an explicitly given model and methods with a learned model. For methods with an explicitly given model, the feasibility of implementing the model on the GPU, as well as its complexity, is highly application dependent. In the case of methods with a learned model, the feasibility and efficiency of the implementation depend on the algorithm employed to learn the model dynamics.
Inspired by the recent success of MuZero [2], we demonstrate the efficiency of the proposed MCTS implementation on the MuZero algorithm. MuZero is a model-based RL method that learns a dynamics model within its search. We show that the proposed implementation can be easily integrated with a learned dynamics model represented by a deep neural network (DNN). Although this paper focuses on a parallel implementation of MCTS in conjunction with MuZero, our implementation is broadly applicable to all model-based RL methods that utilize MCTS. The contributions of this work are the following:
  • We propose a fully parallel GPU implementation of MCTS that can simultaneously evaluate MCTS for a large number of observations. We used 50–750 observations in our evaluation and show that the proposed method scales well with an increasing number of observations. The code is available at https://github.com/marrekb/MuZero (accessed on 11 December 2022).
  • We compare our method with existing MCTS implementations on a use case inspired by MuZero and the Atari games domain. The choice of DNN architecture used in the RL agent is application dependent and has a significant impact on overall performance. Therefore, we also report results for a setup without a DNN. The results show a consistent pattern: the proposed method is the most computationally efficient.
The rest of the paper is organized as follows. The upcoming Section 2 contains an overview of related work, in which we discuss the important role of MCTS in RL methods using the example of MuZero, as well as existing parallel approaches to MCTS. In Section 3 we describe our proposed parallel implementation of MCTS. Section 4 is devoted to the experiments and their evaluation. We summarize our conclusions in Section 5.

3. Proposed Implementation

In the Werner implementation, the inefficient parts are the computation of predictions for each tree separately and the higher communication overhead (e.g., sending collected data to the main process and updating the DNNs in child processes). In the case of EfficientZero, the disadvantage is the sequential processing of most MCTS phases in each processed tree.
Our implementation is based on tensor operations because they can be automatically parallelized on the GPU. Therefore, in each phase all trees are processed in parallel. We used the Python deep learning library PyTorch [45]; however, the proposed method can be easily implemented in other libraries such as TensorFlow [46].

3.1. Data Structure and Notation

In this section we introduce the data structure and the notation used throughout the paper. The data structure consists of multiple tensors that store the different attributes of MCTS nodes.
During the design, we were inspired by the structure of the Q table. A Q table is a matrix holding q values, whose rows represent states and whose columns represent the possible actions in the environment. However, whereas a Q table stores all possible states, we need to store a significantly smaller number of states. The MCTS method is applied to a batch consisting of C_T observations (i.e., running environments) with C_S MCTS simulations per observation. Therefore, we need to store C_T + C_S × C_T states. The first C_T states are the roots obtained from the observations via the representation function. The next C_S × C_T states are explored and stored during the MCTS simulations (C_T states per simulation). In our implementation, the zeroth row must be left empty for implementation reasons (explained later), so the total number of rows in each tensor is C_R = C_T + C_S × C_T + 1. For example, with C_T = 750 observations and C_S = 50 simulations, each tensor has C_R = 38,251 rows.
The data structure of the proposed method (shown in Figure 1) consists of the following tensors:
  • Tensor S—all states are held in tensor S. The number of dimensions of tensor S is equal to the number of dimensions of the state + 1. The index of the first dimension represents unique IDs that are shared across all tensors. States are added into tensor S during initialization (roots) and during the simulations (explored states) in the order in which they were visited (roots have indices from 1 to C_T, states explored in the first simulation have indices between C_T + 1 and 2 × C_T, and so on).
  • Tensor Q—q values are stored in tensor Q, similar to the Q table. The first index (the row) is the same unique node ID as in tensor S. The size of the tensor is C_R × |A|. At the beginning of the MCTS method, all values in tensor Q are initialized to zero. During the backpropagation phase, the q values of the traversed nodes are updated.
  • Tensor R—holds the rewards predicted by the dynamics function. It works on the same principle as tensor Q and has the same size C_R × |A|. Rewards are inserted into the tensor during the expansion phase.
  • Tensor P—stores the probabilities computed by the prediction function. Its size is the same as that of the previous two tensors, C_R × |A|.
  • Tensor N—unlike the previous tensors, which hold real numbers, tensor N consists of integers. It stores the number of visits of each node (i.e., of each executed combination of state and action). The size of the tensor is again C_R × |A|.
  • Tensor E—the last tensor E (with size C_R × |A|) holds the IDs of child nodes. For example, if there is an edge from the parent node with ID = i to the child node with ID = j after taking action a, then E(i, a) = j. If there is no edge between two nodes, the value in the tensor is zero. For this reason, the zeroth row of each tensor is left empty.
Figure 1. Data structure—the data of each node is stored in 6 tensors. IDs represent indices of nodes (they are not stored as tensor values). For example, the highlighted rows represent all stored data of a node with ID = 1.
The complete data structure consists of the six tensors described above. The first index of each tensor is the node ID (the row index). In tensor S, index i returns the i-th state (of the node with ID = i), whereas in the other tensors the i-th vector is returned instead. For example, in tensor Q, index i returns the vector of q values Q(i) = [Q(i, 0), Q(i, 1), …, Q(i, |A| − 1)]. On the other hand, index i in combination with an action index a returns the scalar q value Q(i, a).
In our implementation, we often index via vectors (as part of tensor operations). For example, the combination of a node index i and a vector of actions a = [a_0, a_15, a_17], where a_0, a_15, and a_17 are integers, applied to tensor Q results in Q(i, a) = [Q(i, a_0), Q(i, a_15), Q(i, a_17)]. Indexing by a combination of two vectors i = [i_2, i_5, i_1, i_25] and a = [a_10, a_5, a_17, a_10] applied to tensor Q leads to Q(i, a) = [Q(i_2, a_10), Q(i_5, a_5), Q(i_1, a_17), Q(i_25, a_10)].
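To make the storage and the vector indexing concrete, the following PyTorch sketch allocates the six tensors and reproduces the indexing pattern described above. The concrete sizes are illustrative assumptions, not the experimental settings.

import torch

# Minimal sketch of the node storage; variable names follow the paper's notation.
C_T, C_S, num_actions = 4, 3, 18           # trees, simulations per tree, |A|
C_R = C_T + C_S * C_T + 1                  # +1 keeps row 0 as the "no child" sentinel
state_dim = (256, 6, 6)

S = torch.zeros((C_R, *state_dim))                     # states
Q = torch.zeros((C_R, num_actions))                    # q values
R = torch.zeros((C_R, num_actions))                    # predicted rewards
P = torch.zeros((C_R, num_actions))                    # prior probabilities
N = torch.zeros((C_R, num_actions), dtype=torch.long)  # visit counts
E = torch.zeros((C_R, num_actions), dtype=torch.long)  # child IDs (0 = no edge)

# Vector indexing as used throughout the method:
i = torch.tensor([2, 5, 1, 9])             # node IDs
a = torch.tensor([10, 5, 17, 10])          # action indices
print(Q[i, a])                             # -> [Q(2,10), Q(5,5), Q(1,17), Q(9,10)]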

3.2. Initialization and Preprocessing

At the beginning of the method, the local variables are initialized (Algorithm 1, lines 1–9) based on the obtained parameters o, C_S, and |A|. The vector of unique root indices i_R = [1, 2, 3, …, C_T − 1, C_T] is generated. These indices are utilized as unique identifiers across all tensors.
The tensor of root states s is computed from the obtained batch of observations by the representation function. The tensor of probabilities and the vector of values are predicted by the prediction function. Dirichlet noise is added to the tensor of probabilities in order to support MuZero's exploration. The states and the tensor of probabilities are inserted into tensors S and P via the vector of root indices (Algorithm 1, lines 14–15).
Algorithm 1 Initialization
Require: batch of observations o
Require: the number of MCTS simulations C_S
Require: the number of actions |A|
 1: C_T ← |o|
 2: C_N ← C_S + 1
 3: C_R ← 1 + C_T × C_N
 4: S ← tensor of zeros with size C_R × state_dim
 5: P ← tensor of zeros with size C_R × |A|
 6: Q ← tensor of zeros with size C_R × |A|
 7: R ← tensor of zeros with size C_R × |A|
 8: N ← integer tensor of zeros with size C_R × |A|
 9: E ← integer tensor of zeros with size C_R × |A|
10: i_R ← integers from the interval [1, C_T]
11: s ← f_r(o | θ_r)
12: p, v ← f_p(s | θ_p)
13: p ← p + Dirichlet noise
14: S(i_R) ← s
15: P(i_R) ← p
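Assuming the representation and prediction networks are available as callables f_r and f_p (their exact signatures are not specified here and are an assumption), Algorithm 1 maps to PyTorch roughly as follows.

import torch

def init_mcts(o, C_S, num_actions, f_r, f_p, device="cuda"):
    # Sketch of Algorithm 1; o is the batch of observations on the chosen device.
    C_T = o.shape[0]
    C_N = C_S + 1
    C_R = 1 + C_T * C_N

    s = f_r(o)                                        # root states (line 11)
    p, v = f_p(s)                                     # priors and values (line 12)

    S = torch.zeros((C_R, *s.shape[1:]), device=device)
    P = torch.zeros((C_R, num_actions), device=device)
    Q = torch.zeros((C_R, num_actions), device=device)
    R = torch.zeros((C_R, num_actions), device=device)
    N = torch.zeros((C_R, num_actions), dtype=torch.long, device=device)
    E = torch.zeros((C_R, num_actions), dtype=torch.long, device=device)

    i_R = torch.arange(1, C_T + 1, device=device)     # root IDs 1..C_T (line 10)

    # Exploration noise added to the root priors (line 13); the Dirichlet
    # concentration of 0.25 is an assumption taken from common MuZero settings.
    noise = torch.distributions.Dirichlet(
        torch.full((num_actions,), 0.25, device=device)).sample((C_T,))
    p = p + noise

    S[i_R] = s                                        # line 14
    P[i_R] = p                                        # line 15
    return S, P, Q, R, N, E, i_R, v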

3.3. Phase of Selection

The pseudocode of the selection phase is given in Algorithm 2, and the flowchart is shown in Figure 2. During one iteration of MCTS, each tree is traversed to find its leaf node. Let an L-step trajectory consist of the combinations of traversed nodes and edges (actions) in order; then we can write the trajectory as τ = [(ID_0, a_0), (ID_1, a_1), (ID_2, a_2), …, (ID_{L−1}, a_{L−1})]. Tensor I holds the IDs of the traversed nodes and tensor A the indices of the actions chosen by the PUCT method. Both tensors have size C_N × C_T. C_N is the number of simulations increased by one, because the first ID and action in a trajectory belong to the root of the tree. Each tree stores its trajectory in one column whose index corresponds to that tree; therefore, the number of columns is C_T. Vector l stores the index of the last element of each trajectory. The information stored in these three tensors is used in the next phases.
Figure 2. Flowchart of the phase of selection.
Each trajectory starts from the root. Root indices (IDs) are inserted into the zeroth row of tensor I (Algorithm 2, line 4).
The indices of active trajectories are stored in vector i_N. In other words, vector i_N remembers the indices of trees that are still active in the selection phase. At the beginning of each selection phase, this vector is filled with the indices of all trees because each tree takes part in the selection phase.
The loop of the selection phase starts with the condition i_N ≠ ∅, which checks whether the vector of active trajectories is empty. If there is no active trajectory, the selection phase is finished. Otherwise, the IDs of the nodes in active trajectories are assigned to vector i by indexing tensor I with the current step and vector i_N (Algorithm 2, line 8). The PUCT method (Equation (1)) is computed for the current nodes entirely with tensor operations (addition, multiplication, division, etc.) and returns the vector of actions selected in the current nodes.
The indices of the last items in the active trajectories are then updated (Algorithm 2, line 11).
The IDs of the nodes reached by taking actions a in the current nodes i are obtained from tensor E. As mentioned before, if there is no child node for a combination (ID, a), 0 is returned instead. The obtained IDs, including possible zero indices, are inserted into a new row of tensor I.
For simplicity's sake, let the number of trees be C_T = 4, the number of active trajectories |i_N| = 3, the indices of active trajectories i_N = [0, 1, 3], the node IDs i = [5, 7, 8], the selected actions a = [3, 1, 0], and the IDs of child nodes E(i, a) = [E(5, 3) = 15, E(7, 1) = 0, E(8, 0) = 21] = [15, 0, 21]. Since I(step, i_N) = [15, 0, 21], the updated row of the tensor is I(step) = [15, 0, 0, 21]. I(step, 0) = 15 and I(step, 3) = 21 indicate active trajectories. I(step, 1) = 0 represents a trajectory that has been terminated in the current step because the node with ID = 7 has no child for the selected action, i.e., E(7, 1) = 0. I(step, 2) = 0 represents a trajectory that was already inactive.
Finally, we update i_N with the indices of the nonzero values in the current row of I. If there is no nonzero value, the selection phase is completed; otherwise, at least one trajectory is still active and the selection loop continues with the next step.
Algorithm 2 Phase of selection
 1: I ← integer tensor of zeros with size C_N × C_T
 2: A ← integer tensor of zeros with size C_N × C_T
 3: l ← integer vector of zeros with length C_T
 4: I(0) ← i_R
 5: step ← 0
 6: i_N ← integers from the interval [0, C_T − 1]
 7: while i_N ≠ ∅ do
 8:     i ← I(step, i_N)
 9:     a ← apply PUCT on nodes i
10:     A(step, i_N) ← a
11:     l(i_N) ← step
12:     step ← step + 1
13:     I(step, i_N) ← E(i, a)
14:     i_N ← indices of nonzero values from I(step)
15: end while
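A PyTorch sketch of the selection loop is given below. The PUCT computation is a simplified stand-in for Equation (1) (the exploration constant and the exact form are assumptions); the point is that every step of the loop is a batched tensor operation over all active trees.

import torch

def select(Q, P, N, E, i_R, C_N, C_T, c_puct=1.25):
    # Sketch of Algorithm 2: all active trees descend one level per loop iteration.
    device = Q.device
    I = torch.zeros((C_N, C_T), dtype=torch.long, device=device)
    A = torch.zeros((C_N, C_T), dtype=torch.long, device=device)
    l = torch.zeros(C_T, dtype=torch.long, device=device)

    I[0] = i_R
    step = 0
    i_N = torch.arange(C_T, device=device)               # all trees start active

    while i_N.numel() > 0:
        i = I[step, i_N]                                  # current nodes of active trees
        # Simplified vectorized PUCT over the action dimension (stand-in for Eq. (1)).
        n_parent = N[i].sum(dim=1, keepdim=True).float()
        u = c_puct * P[i] * torch.sqrt(n_parent + 1.0) / (1.0 + N[i].float())
        a = torch.argmax(Q[i] + u, dim=1)

        A[step, i_N] = a
        l[i_N] = step
        step += 1
        I[step, i_N] = E[i, a]                            # 0 marks a finished trajectory
        i_N = torch.nonzero(I[step], as_tuple=True)[0]    # trees that are still active
    return I, A, l, step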

3.4. Phase of Expansion and Simulation

We joined the phases of expansion and simulation (Algorithm 3 and Figure 3) because both phases are interconnected and it is efficient to implement them together. Based on the vector of last indices l, the combination of the last node and action is identified for each trajectory (Algorithm 3, lines 1–3). The tensor of new states s and the vector of obtained rewards r are computed by the dynamics function. The new states are then used in the prediction function to predict the tensor of probabilities p and the vector of values v.
Figure 3. Flowchart of the phase of expansion and simulation.
The IDs of the new nodes are computed from the vector of roots i_R. The tensors of child nodes E and rewards R are updated with the new data.
At the end of the expansion phase, the predicted tensors of states s and probability distributions p are added to tensors S and P as the data of the new nodes (Algorithm 3, lines 9–10).
Algorithm 3 Phase of expansion and simulation
Require: current simulation number C_I
 1: k ← integers from the interval [0, C_T − 1]
 2: i ← I(l, k)
 3: a ← A(l, k)
 4: s, r ← f_d(S(i), a | θ_d)
 5: p, v ← f_p(s | θ_p)
 6: i_new ← i_R + C_T × (C_I + 1)
 7: E(i, a) ← i_new
 8: R(i, a) ← r
 9: S(i_new) ← s
10: P(i_new) ← p
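Under the same assumption that the dynamics and prediction networks are available as callables f_d and f_p, Algorithm 3 becomes a single batched forward pass followed by indexed writes.

import torch

def expand(S, P, R, E, I, A, l, i_R, C_T, C_I, f_d, f_p):
    # Sketch of Algorithm 3; C_I is the current simulation number, r is a vector
    # of scalar rewards (one per tree).
    k = torch.arange(C_T, device=I.device)
    i = I[l, k]                        # last node of each trajectory (line 2)
    a = A[l, k]                        # action selected in that node (line 3)

    s, r = f_d(S[i], a)                # one batched dynamics forward for all trees
    p, v = f_p(s)                      # priors and values of the new states

    i_new = i_R + C_T * (C_I + 1)      # fresh IDs for the newly expanded nodes
    E[i, a] = i_new                    # register the new children
    R[i, a] = r
    S[i_new] = s
    P[i_new] = p
    return v                           # backed up in the next phase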

3.5. Phase of Backpropagation

During the backpropagation phase (Algorithm 4 and Figure 4), the trajectories are traversed from the end to the start to update the q values and the numbers of visits. At the beginning of the backpropagation phase, the local variable step stores the length of the longest trajectory obtained in the selection phase; therefore, the backpropagation loop starts from step − 1.
Figure 4. Flowchart of the phase of backpropagation.
We first select the trajectory indices that were active in the selection phase at the current step. Based on the obtained indices, the node IDs and actions at the current step are identified from tensors I and A (Algorithm 4, lines 3–4).
The values of the active trajectories v(i_N) are used to update the q values of their parent nodes (the values of inactive trajectories remain unchanged). Finally, tensors Q and N are updated by the standard formulas (Equation (4) of [2]).
The backpropagation loop ends by updating the q values and the numbers of visits of the root nodes.
Algorithm 4 Phase of backpropagation
1: for step from step − 1 down to 0 do
2:     i_N ← indices of nonzero values from I(step)
3:     i ← I(step, i_N)
4:     a ← A(step, i_N)
5:     v(i_N) ← R(i, a) + γ × v(i_N)
6:     Q(i, a) ← (Q(i, a) × N(i, a) + v(i_N)) / (N(i, a) + 1)
7:     N(i, a) ← N(i, a) + 1
8: end for
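A sketch of the backpropagation loop follows; v holds the leaf values returned by the expansion phase, and the discount factor gamma and its default value are assumptions.

import torch

def backpropagate(Q, N, R, I, A, v, step, gamma=0.997):
    # Sketch of Algorithm 4: all trajectories are updated level by level,
    # from the deepest step back to the roots.
    for s in range(step - 1, -1, -1):
        i_N = torch.nonzero(I[s], as_tuple=True)[0]    # trees active at this depth
        i = I[s, i_N]
        a = A[s, i_N]
        v[i_N] = R[i, a] + gamma * v[i_N]              # discounted return toward the root
        n = N[i, a].float()
        Q[i, a] = (Q[i, a] * n + v[i_N]) / (n + 1.0)   # running-mean update (Eq. (4) of [2])
        N[i, a] = N[i, a] + 1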

3.6. Post Processing

The MCTS method completes after C_S simulations have been executed. At the end of the MCTS method, it is necessary to calculate the probability distribution and the state value of each root. The probability of root action a is computed as N(s_root, a) / Σ_b N(s_root, b). The value of each root state is computed as the weighted arithmetic mean of its root q values, weighted by this probability distribution. All computations are carried out on the GPU using tensor operations for addition, multiplication, and division.
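This post-processing step reduces each root's visit counts to a policy and a value with two tensor reductions; a minimal sketch (the function name is ours):

import torch

def root_policy_and_value(Q, N, i_R):
    # Visit-count policy and weighted root value for every tree in the batch.
    n_root = N[i_R].float()                      # (C_T, |A|) visit counts at the roots
    policy = n_root / n_root.sum(dim=1, keepdim=True)
    value = (policy * Q[i_R]).sum(dim=1)         # weighted arithmetic mean of root q values
    return policy, value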

4. Experiments

The proposed implementation aims at the parallel evaluation of a large number of MCTS instances. We demonstrate this capability with an example from the RL domain where a large number of observations need to be processed at the same time. Each observation represents a state of the environment (the root state of an MCTS tree) in which we want to perform an action. The MCTS algorithm is utilized in RL to improve the action selection process.
We compared the proposed implementation to three MCTS implementations used in model-based RL methods. We measured the performance of the individual implementations in terms of how long it takes to simultaneously evaluate MCTS on a batch of observations. For each experiment, we report the mean and standard deviation calculated from 100 replications. All experiments were carried out on a single NVIDIA GeForce RTX 2080 Ti graphics card and a 16-core Intel(R) Xeon(R) CPU E5-2643 @ 3.30 GHz processor.
The list of compared implementations together with their designation is as follows:
  • GPU—The proposed method fully implemented on GPU using tensor operations and described in Section 3.
  • CPUGPU_P—an implementation inspired by EfficientZero [36], which uses multiple processes. In the original design, each process builds multiple trees and stores a copy of the DNN on the GPU. However, maintaining a copy of the DNN in each child process is memory inefficient, which limits its use, especially in single-GPU scenarios. Therefore, in our modification, only the parent process stores a DNN on the GPU.
    At the beginning of the MCTS method, the batch of observations is passed to the representation and prediction functions (on the GPU) to predict the states, probability distributions, and state values of the roots. The predicted data are split and sent to the child processes, in which the trees are initialized (one tree per root state). MCTS simulations are then executed in each process. The selection phase is executed sequentially on the CPU. The states of the selected nodes and the actions are sent back to the parent process. After receiving data from all child processes, the new states, rewards, probabilities, and values are predicted by a single forward pass of the dynamics and prediction functions (on the GPU). The predicted data are split and sent again to the child processes, in which the expansion and backpropagation phases are performed (also on the CPU). After executing the MCTS simulations, the parent process receives and post-processes the results.
  • CPUGPU_S—a sequential implementation of the CPUGPU_P method without multi-processing. The selection, expansion, and backpropagation phases are performed on the CPU in order. As in the previous implementation, all data are processed as a batch by the DNN on the GPU. Both CPUGPU_P and CPUGPU_S were implemented by our team and used as part of AlphaZero in [47].
  • CPU—the last approach is Werner's MCTS implementation. We used the source code from the author's GitHub repository [35].

4.1. Model of Environment

We chose the MuZero algorithm as an example of a model-based RL method that utilizes MCTS. Our experimental setup was inspired by the domain of Atari games. One observation (a state of the environment) was represented by a 128 × 96 × 96 tensor. The number of possible actions was set to 18, which is the maximum number of possible actions in the Atari games domain. The MuZero algorithm uses DNNs to approximate the representation, dynamics, and prediction functions. We use the original MuZero architecture [2] with a few modifications.
The kernel size is 3 × 3 for all convolution operations, and the padding is set to 1. The representation function is an identical copy of MuZero's original function. It consists of:
  • 1 convolution with stride 2 and 128 kernels, output resolution 48 × 48
  • 2 residual blocks with 128 kernels
  • 1 convolution with stride 2 and 256 kernels, output resolution 24 × 24
  • 3 residual blocks with 256 kernels
  • average pooling with stride 2, output resolution 12 × 12
  • 3 residual blocks with 256 kernels
  • average pooling with stride 2, output resolution 6 × 6
The output size of the representation function, i.e., the state size, is 256 × 6 × 6.
The input to the dynamics function is a 257 × 6 × 6 tensor consisting of the state and the action. The action is represented by a 1 × 6 × 6 tensor filled with the value a_t / |A|, where |A| is the number of actions (this encoding is sketched in code after the layer list below). The dynamics function provides two outputs: the new state and the reward. In the original MuZero implementation, the reward was computed as a linear combination of a categorical output; we formulated reward prediction as a regression task instead. The second change from the original implementation is in the structure of the hidden layers of the reward head, as they were not described in the original documentation.
The layers of the dynamics function are:
  • 1 convolution with 256 kernels, output resolution 6 × 6
  • 8 residual blocks with 256 kernels, output resolution 6 × 6 (the last residual block is also used as a representation of the new state)
  • flattening, number of output neurons is 9216
  • fully connected layer with 512 neurons
  • fully connected layer with 1 neuron (reward output)
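The action-plane encoding described above can be written as a small helper that broadcasts the scalar action over a 1 × 6 × 6 plane and concatenates it with the state; the helper name and exact shapes are assumptions.

import torch

def encode_dynamics_input(state, action, num_actions=18):
    # state: (batch, 256, 6, 6); action: (batch,) integer actions.
    # The action becomes a constant 1 x 6 x 6 plane with value a_t / |A|.
    plane = (action.float() / num_actions).view(-1, 1, 1, 1).expand(-1, 1, 6, 6)
    return torch.cat([state, plane], dim=1)      # -> (batch, 257, 6, 6)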
The outputs of the representation and dynamics functions are used as inputs to the prediction function, which also provides two outputs: the probability distribution and the state value. The common part is composed of flattening and a linear layer with 512 neurons. Each output head consists of one linear layer with 512 neurons. The last linear layer of the probability distribution head has 18 neurons. As in the case of the reward head, the last output layer of the state value head has one neuron; again, the categorical task has been reformulated as a regression task.
Our goal was not to train a MuZero agent but to compare the computational speed of the individual MCTS implementations. Therefore, we used a randomly initialized DNN model to simulate all necessary computations.

4.2. MCTS Parameters

The batch of observations is generated as a random tensor of size C_T. Each of these observations is processed by a unique instance of MCTS with the number of simulations set to C_S.
Although MCTS uses the PUCT formula to select actions, we modified the action selection mechanism to test the edge cases of tree formation based on the following scenarios:
  • Random action—in this scenario, the action is selected randomly. This causes the tree to grow in breadth (the tree resembles a balanced tree). The scenario reflects, for example, the behavior at the beginning of RL agent training, when the DNN produces an approximately uniform probability distribution over actions.
  • Constant action—in this scenario, the selected action is replaced by a constant action. This causes the tree to grow in depth (the tree resembles a linked list). The scenario reflects, for example, the behavior of a trained or overfitted agent, whose DNN produces one dominant action.
The scenarios are implemented by changing the output of the probability distribution according to the given scenario and setting the scalar outputs (e.g., state value and reward) to zero.
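A sketch of how such a scenario can be injected by replacing the prediction outputs; the helper and the exact distributions are our assumptions, not the authors' code.

import torch

def scenario_outputs(scenario, batch_size, num_actions=18, device="cuda"):
    # Replaces the prediction-function outputs for the two edge-case scenarios.
    if scenario == "random":
        # Roughly uniform priors -> actions are effectively chosen at random,
        # so the tree grows in breadth.
        p = torch.full((batch_size, num_actions), 1.0 / num_actions, device=device)
    else:
        # One dominant action -> the same action is selected every time,
        # so the tree grows in depth.
        p = torch.zeros((batch_size, num_actions), device=device)
        p[:, 0] = 1.0
    v = torch.zeros(batch_size, device=device)   # state values (and rewards) set to zero
    return p, v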

4.3. Results

We report results for two sets of experiments. In the first set, we measured the performance of MCTS without the effect of the DNN model used by MuZero: since the DNN architecture utilized by MuZero is strongly application dependent, we first report the performance of the compared implementations without the DNN model. The second set of experiments shows results for the MuZero use case with the DNN model described earlier.
Results for the experiments without the DNN model are shown in Table 1 and Table 2. We fixed the number of simulations C_S to 50. The batch size C_T was set to 50, 100, 250, 500, and 750. For the CPUGPU_P implementation, we report results with the number of processes set to 2, 5, 10, and 25. For the CPU implementation, we report results with the number of processes set to 5, 10, and 15. In both cases, larger and smaller numbers of processes resulted in higher computation times.
Table 1. Effect of the number of observations on computation time. Results of experiments for random action without the DNN model (C_S = 50).
Table 2. Effect of the number of observations on computation time. Results of experiments for constant action without the DNN model (C_S = 50).
In both the random action and constant action scenarios we see a similar pattern. The proposed GPU implementation is the fastest for all tested values of the C_T parameter, and the difference in speed increases with an increasing number of observations. In Figure 5 we show a detailed comparison between the two best performing implementations for both scenarios. For C_T = 750, the proposed GPU implementation is 34.8 times faster in the random action scenario and 16 times faster in the constant action scenario than the second best performing CPUGPU_P implementation (with 10 and 25 processes, respectively). In the constant action scenario (Table 2), the computation time increased for most of the investigated implementations and their settings. This increase is due to the longer trajectories produced by the selection phase of MCTS.
Figure 5. Comparison of execution time of the two best performing implementations with respect to the number of observations. Results for both scenarios without DNN.
We further investigated the influence of the number of MCTS simulations C_S and report results for the two best performing implementations in Table 3, Table 4, and Figure 6. The C_T parameter was set to 100. We can observe that the computation time increases significantly as the number of simulations increases, especially in the constant action scenario. The proposed GPU implementation is 12.1 times faster for C_S = 50 and 8.7 times faster for C_S = 400 in the random action scenario. In the constant action scenario, the GPU implementation is 4.2 times faster for C_S = 50 and 2 times faster for C_S = 400.
Table 3. Effect of the number of simulations on computation time. Results of experiments for random action without the DNN (C_T = 100).
Table 4. Effect of the number of simulations on computation time. Results of experiments for constant action without the DNN (C_T = 100).
Figure 6. Comparison of execution time of the two best performing implementations with respect to the number of simulations. Results for both scenarios without DNN.
The last set of experiments measured performance with the DNN model. We omitted the CPU method from these results due to the high computational requirements associated with executing the DNN model (e.g., 338 s for C_S = 50 and C_T = 50). As in the previous experiments, we fixed C_S to 50 and set C_T to 50, 100, 250, 500, and 750. The results presented in Table 5, Table 6, and Figure 7 show a similar trend as the experiments without the DNN. The proposed GPU implementation performs best for all values of the parameter C_T. Even with the DNN model, which consumes most of the computation time, the GPU implementation is 4.7 times faster in the random action scenario and 7.7 times faster in the constant action scenario than the second best performing CPUGPU_P implementation (with 10 and 25 processes, respectively) for C_T = 750.
Table 5. Effect of the number of observations on computation time. Results of experiments for random action with the DNN model (C_S = 50).
Table 6. Effect of the number of observations on computation time. Results of experiments for constant action with the DNN model (C_S = 50).
Figure 7. Comparison of execution time of the two best performing implementations with respect to the number of observations. Results for both scenarios with DNN.

5. Conclusions

In this paper, we proposed a parallel implementation of MCTS that efficiently evaluates a large number of MCTS trees at once. It utilizes tensor operations and is fully implemented on the GPU. We showed that the atomic nature of MCTS operations can be transformed into vector operations suitable for the GPU. We demonstrated this capability using the example of a MuZero model-based RL agent in the Atari games domain. Model-based RL agents often combine DNN and MCTS approaches to improve action selection. During offline training, these RL agents need to process a huge number of observations in parallel; each observation is represented by a unique root node in MCTS.
We compared our implementation with approaches based on the Werner and EfficientZero implementations. We showed that the proposed approach gives the best results and scales well with the number of observations and the number of simulations. We tested two scenarios of tree formation: random action and constant action. For both scenarios, the proposed implementation yields the best results. In the experiments without the DNN, the proposed implementation is 34.8 times faster for the random action scenario and 16 times faster for the constant action scenario than the second best performing CPUGPU_P implementation (C_T = 750 and C_S = 50).
We further investigated the effect of the DNN in the RL agent. In the experiments with the DNN and the random action scenario, the proposed implementation is 4.7 times faster than the second best performing CPUGPU_P for C_T = 750 and C_S = 50. In the case of the constant action scenario, the difference is 7.7-fold. We therefore observed a performance improvement over the benchmark implementations in both the non-DNN and DNN settings.
Although we report results for MCTS utilized within the MuZero RL agent, the proposed implementation can be used wherever a large number of MCTS instances need to be processed in parallel. The closest example is model-based RL. We showed an example of the model-based MuZero agent, which uses a DNN to learn the model dynamics. The proposed implementation can also be utilized in model-based RL agents with an explicitly given model of the environment; in this case, the overall performance depends on the complexity of the environment and its implementation.
Our implementation was tested on a single GPU. We see no restrictions for deployment on a larger number of GPUs. However, for more complex computing infrastructures, we expect that specifically tailored methods will yield better results, as they can exploit the specifics of a given infrastructure.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/app13031406/s1.

Author Contributions

Conceptualization, M.B. and P.T.; methodology, M.B. and P.T.; software, M.B.; validation, M.B. and P.T.; formal analysis, M.B. and P.T.; investigation, M.B.; resources, M.B. and P.T.; data curation, M.B.; writing—original draft preparation, M.B. and P.T.; writing—review and editing, M.B. and P.T.; visualization, M.B. and P.T.; supervision, P.T.; project administration, M.B. and P.T.; funding acquisition, P.T. All authors have read and agreed to the published version of the manuscript.

Funding

This publication was realized with support of Operational Program Integrated Infrastructure 2014–2020 of the project: Intelligent operating and processing systems for UAVs, code ITMS 313011V422, co-financed by the European Regional Development Fund.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Source codes and raw experiment results are available on https://github.com/marrekb/MuZero (accessed on 11 December 2022) in the folder “proposed_mcts”. Data presented in this study are available in Supplementary Materials.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
MCTS   Monte-Carlo tree search
RL     reinforcement learning
MDP    Markov decision process
GPU    graphics processing unit
CPU    central processing unit
DNN    deep neural network

References

  1. Silver, D.; Huang, A.; Maddison, C.J.; Guez, A.; Sifre, L.; Van Den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; et al. Mastering the game of Go with deep neural networks and tree search. Nature 2016, 529, 484–489. [Google Scholar] [CrossRef] [PubMed]
  2. Schrittwieser, J.; Antonoglou, I.; Hubert, T.; Simonyan, K.; Sifre, L.; Schmitt, S.; Guez, A.; Lockhart, E.; Hassabis, D.; Graepel, T.; et al. Mastering atari, go, chess and shogi by planning with a learned model. Nature 2020, 588, 604–609. [Google Scholar] [CrossRef] [PubMed]
  3. Kiran, B.R.; Sobh, I.; Talpaert, V.; Mannion, P.; Al Sallab, A.A.; Yogamani, S.; Pérez, P. Deep reinforcement learning for autonomous driving: A survey. IEEE Trans. Intell. Transp. Syst. 2021, 23, 4909–4926. [Google Scholar] [CrossRef]
  4. Azar, A.T.; Koubaa, A.; Ali Mohamed, N.; Ibrahim, H.A.; Ibrahim, Z.F.; Kazim, M.; Ammar, A.; Benjdira, B.; Khamis, A.M.; Hameed, I.A.; et al. Drone deep reinforcement learning: A review. Electronics 2021, 10, 999. [Google Scholar] [CrossRef]
  5. Hodge, V.J.; Hawkins, R.; Alexander, R. Deep reinforcement learning for drone navigation using sensor data. Neural Comput. Appl. 2021, 33, 2015–2033. [Google Scholar] [CrossRef]
  6. Munaye, Y.Y.; Juang, R.T.; Lin, H.P.; Tarekegn, G.B.; Lin, D.B. Deep reinforcement learning based resource management in UAV-assisted IoT networks. Appl. Sci. 2021, 11, 2163. [Google Scholar] [CrossRef]
  7. Andrychowicz, O.M.; Baker, B.; Chociej, M.; Jozefowicz, R.; McGrew, B.; Pachocki, J.; Petron, A.; Plappert, M.; Powell, G.; Ray, A.; et al. Learning dexterous in-hand manipulation. Int. J. Robot. Res. 2020, 39, 3–20. [Google Scholar] [CrossRef]
  8. Gregurić, M.; Vujić, M.; Alexopoulos, C.; Miletić, M. Application of deep reinforcement learning in traffic signal control: An overview and impact of open traffic data. Appl. Sci. 2020, 10, 4011. [Google Scholar] [CrossRef]
  9. Puterman, M.L. Markov Decision Processes: Discrete Stochastic Dynamic Programming; John Wiley & Sons: Hoboken, NJ, USA, 2014. [Google Scholar]
  10. Kolobov, A. Planning with Markov decision processes: An AI perspective. Synth. Lect. Artif. Intell. Mach. Learn. 2012, 6, 1–210. [Google Scholar]
  11. Moerland, T.M.; Broekens, J.; Jonker, C.M. Model-based reinforcement learning: A survey. arXiv 2020, arXiv:2006.16712. [Google Scholar]
  12. Duarte, F.F.; Lau, N.; Pereira, A.; Reis, L.P. A survey of planning and learning in games. Appl. Sci. 2020, 10, 4529. [Google Scholar] [CrossRef]
  13. Browne, C.B.; Powley, E.; Whitehouse, D.; Lucas, S.M.; Cowling, P.I.; Rohlfshagen, P.; Tavener, S.; Perez, D.; Samothrakis, S.; Colton, S. A survey of monte carlo tree search methods. IEEE Trans. Comput. Intell. AI Games 2012, 4, 1–43. [Google Scholar] [CrossRef]
  14. Silver, D.; Schrittwieser, J.; Simonyan, K.; Antonoglou, I.; Huang, A.; Guez, A.; Hubert, T.; Baker, L.; Lai, M.; Bolton, A.; et al. Mastering the game of go without human knowledge. Nature 2017, 550, 354–359. [Google Scholar] [CrossRef]
  15. Silver, D.; Hubert, T.; Schrittwieser, J.; Antonoglou, I.; Lai, M.; Guez, A.; Lanctot, M.; Sifre, L.; Kumaran, D.; Graepel, T.; et al. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science 2018, 362, 1140–1144. [Google Scholar] [CrossRef]
  16. Liu, A.; Chen, J.; Yu, M.; Zhai, Y.; Zhou, X.; Liu, J. Watch the unobserved: A simple approach to parallelizing monte carlo tree search. arXiv 2018, arXiv:1810.11755. [Google Scholar]
  17. Rocki, K.; Suda, R. Large-scale parallel Monte Carlo tree search on GPU. In Proceedings of the 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum, Anchorage, AK, USA, 16–20 May 2011; pp. 2034–2037. [Google Scholar]
  18. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 2018. [Google Scholar]
  19. Kaiser, L.; Babaeizadeh, M.; Milos, P.; Osinski, B.; Campbell, R.H.; Czechowski, K.; Erhan, D.; Finn, C.; Kozakowski, P.; Levine, S.; et al. Model-based reinforcement learning for atari. arXiv 2019, arXiv:1903.00374. [Google Scholar]
  20. Racanière, S.; Weber, T.; Reichert, D.; Buesing, L.; Guez, A.; Jimenez Rezende, D.; Puigdomènech Badia, A.; Vinyals, O.; Heess, N.; Li, Y.; et al. Imagination-augmented agents for deep reinforcement learning. arXiv 2017, arXiv:1707.06203. [Google Scholar]
  21. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; Riedmiller, M. Playing atari with deep reinforcement learning. arXiv 2013, arXiv:1312.5602. [Google Scholar]
  22. Mnih, V.; Badia, A.P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In Proceedings of the International Conference on Machine Learning, PMLR, New York, NY, USA, 20–22 June 2016; pp. 1928–1937. [Google Scholar]
  23. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv 2017, arXiv:1707.06347. [Google Scholar]
  24. Badia, A.P.; Piot, B.; Kapturowski, S.; Sprechmann, P.; Vitvitskyi, A.; Guo, Z.D.; Blundell, C. Agent57: Outperforming the atari human benchmark. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 13–18 July 2020; pp. 507–517. [Google Scholar]
  25. Guo, X.; Singh, S.; Lee, H.; Lewis, R.L.; Wang, X. Deep learning for real-time Atari game play using offline Monte-Carlo tree search planning. Adv. Neural Inf. Process. Syst. 2014, 27, 3338–3346. [Google Scholar]
  26. Gabirondo-López, J.; Egaña, J.; Miguel-Alonso, J.; Urrutia, R.O. Towards Autonomous Defense of SDN Networks Using MuZero Based Intelligent Agents. IEEE Access 2021, 9, 107184–107199. [Google Scholar] [CrossRef]
  27. Yilmaz, E.; Sanni, O.; Kotwicz Herniczek, M.; German, B. Deep Reinforcement Learning Approach to Air Traffic Optimization Using the MuZero Algorithm. In Proceedings of the AIAA AVIATION 2021 FORUM, Virtual Event, 2–6 August 2021; p. 2377. [Google Scholar]
  28. Mirsoleimani, S.A.; Plaat, A.; Van Den Herik, J.; Vermaseren, J. Parallel monte carlo tree search from multi-core to many-core processors. In Proceedings of the 2015 IEEE Trustcom/BigDataSE/ISPA, Helsinki, Finland, 20–22 August 2015; Volume 3, pp. 77–83. [Google Scholar]
  29. Chaslot, G.M.B.; Winands, M.H.; Herik, H. Parallel monte-carlo tree search. In Proceedings of the International Conference on Computers and Games, Beijing, China, 29 September–1 October 2008; Springer: Berlin/Heidelberg, Germany; pp. 60–71. [Google Scholar]
  30. Steinmetz, E.; Gini, M. More trees or larger trees: Parallelizing Monte Carlo tree search. IEEE Trans. Games 2020, 13, 315–320. [Google Scholar] [CrossRef]
  31. Soejima, Y.; Kishimoto, A.; Watanabe, O. Evaluating root parallelization in Go. IEEE Trans. Comput. Intell. AI Games 2010, 2, 278–287. [Google Scholar] [CrossRef]
  32. Barriga, N.A.; Stanescu, M.; Buro, M. Parallel UCT search on GPUs. In Proceedings of the 2014 IEEE Conference on Computational Intelligence and Games, Dortmund, Germany, 26–29 August 2014; pp. 1–7. [Google Scholar]
  33. Świechowski, M.; Mańdziuk, J. A hybrid approach to parallelization of Monte Carlo tree search in general game playing. In Challenging Problems and Solutions in Intelligent Systems; Springer: Berlin/Heidelberg, Germany, 2016; pp. 199–215. [Google Scholar]
  34. Liu, A.; Liang, Y.; Liu, J.; Broeck, G.V.d.; Chen, J. On effective parallelization of monte carlo tree search. arXiv 2020, arXiv:2006.08785. [Google Scholar]
  35. Werner Duvaud, A.H. MuZero General: Open Reimplementation of MuZero. 2019. Available online: https://github.com/werner-duvaud/muzero-general (accessed on 11 December 2022).
  36. Ye, W.; Liu, S.; Kurutach, T.; Abbeel, P.; Gao, Y. Mastering atari games with limited data. arXiv 2021, arXiv:2111.00210. [Google Scholar]
  37. Lapan, M. Deep Reinforcement Learning. Das Umfassende Praxis-Handbuch: Moderne Algorithmen für Chatbots, Robotik, Diskrete Optimierung und Web-Automatisierung inkl. Multiagenten-Methoden; MITP-Verlags GmbH & Co. KG: Frechen, Germany, 2020. [Google Scholar]
  38. Scholz, J.; Weber, C.; Hafez, M.B.; Wermter, S. Improving Model-Based Reinforcement Learning with Internal State Representations through Self-Supervision. In Proceedings of the 2021 International Joint Conference on Neural Networks (IJCNN), Shenzhen, China, 18–22 July 2021; pp. 1–8. [Google Scholar]
  39. Koul, A. Muzero-Pytorch. 2020. Available online: https://github.com/koulanurag/muzero-pytorch (accessed on 11 December 2022).
  40. Voskuil, K. Muzero. 2021. Available online: https://github.com/kaesve/muzero (accessed on 11 December 2022).
  41. Gras, J. MuZero. 2019. Available online: https://github.com/johan-gras/MuZero (accessed on 11 December 2022).
  42. Krishnamurthy, Y. Simple-Muzero. 2020. Available online: https://github.com/yamsgithub/simple-muzero (accessed on 11 December 2022).
  43. Sivaraj, M. Muzero. 2020. Available online: https://github.com/madhusivaraj/muzero (accessed on 11 December 2022).
  44. Schaposnik, F. Muzero. 2021. Available online: https://github.com/fidel-schaposnik/muzero (accessed on 11 December 2022).
  45. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32; Wallach, H., Larochelle, H., Beygelzimer, A., D’Alché-Buc, F., Fox, E., Garnett, R., Eds.; Curran Associates Inc.: Sydney, NSW, Australia, 2019; pp. 8024–8035. [Google Scholar]
  46. Abadi, M.; Agarwal, A.; Barham, P.; Brevdo, E.; Chen, Z.; Citro, C.; Corrado, G.S.; Davis, A.; Dean, J.; Devin, M.; et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. arXiv 2016, arXiv:1603.04467. [Google Scholar]
  47. Balaz, M.; Tarabek, P. AlphaZero with Real-Time Opponent Skill Adaptation. In Proceedings of the 2021 International Conference on Information and Digital Technologies (IDT), Zilina, Slovakia, 22–24 June 2021; pp. 194–199. [Google Scholar]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
