Article

MCTS-Based Policy Improvement for Reinforcement Learning

by György Csippán, István Péter, Bálint Kővári and Tamás Bécsi
1 Department of Control for Transportation and Vehicle Systems, Faculty of Transportation Engineering and Vehicle Engineering, Budapest University of Technology and Economics, H-1111 Budapest, Hungary
2 Asura Technologies Ltd., H-1122 Budapest, Hungary
* Author to whom correspondence should be addressed.
Mach. Learn. Knowl. Extr. 2025, 7(3), 98; https://doi.org/10.3390/make7030098
Submission received: 16 July 2025 / Revised: 22 August 2025 / Accepted: 8 September 2025 / Published: 10 September 2025

Abstract

Curriculum Learning (CL) is a potent area of Machine Learning that provides techniques for improving the performance of the training process on the same data points, regardless of the training method used. In this research, we propose a novel Monte Carlo Tree Search (MCTS)-based technique that enhances model performance, demonstrating how MCTS can be employed for Curriculum Learning. The proposed approach leverages MCTS to optimize the sequence of batches used during training. First, we demonstrate the application of our method in Reinforcement Learning, where sparse rewards often slow convergence and degrade performance. By leveraging the strategic planning and exploration capabilities of MCTS, our method systematically identifies and selects trajectories that are more informative and have a higher potential to improve the policy. This MCTS-guided batch optimization focuses the learning process on valuable experiences, accelerating convergence and improving overall performance. We evaluate our approach on standard RL benchmarks, demonstrating that it outperforms conventional batch selection methods in terms of learning speed and policy effectiveness. The results highlight the potential of combining MCTS with CL to optimize batch selection, offering a promising direction for future research in efficient Reinforcement Learning.

1. Introduction

Reinforcement Learning (RL) has gained significant attention for its successes in complex, often unpredictable environments. Through trial-and-error learning, agents try to maximize cumulative rewards, allowing them to make decisions that can have long-term consequences. Thanks to its flexibility and generality, RL has been successfully applied in several fields, from autonomous driving systems [1,2] and robotics [3] to video game playing [4], board games [5], algorithm development [6,7], and resource management [8]. However, one of the key challenges in RL lies in the efficiency of learning; specifically, how to make optimal use of collected experiences to enable rapid policy improvement. Traditional RL algorithms rely heavily on experience replay mechanisms, where past interactions (or trajectories) are stored and later used to update the policy or value functions. This replay, while effective, is often performed with simple batch sampling techniques, such as sampling from a discrete uniform distribution, that do not account for the relative importance or value of each experience.
The following problem arises: informative learning signals can be sparse or unevenly distributed in real-world or complex environments. Some experiences may be more critical for shaping an agent’s behavior than others. By treating all experiences as equally important, random sampling methods may waste valuable computational resources on less informative or redundant data, slowing down convergence and leading to suboptimal policies. The challenge is to develop more intelligent sampling strategies that prioritize the most informative experiences, speeding up the learning process and improving the quality of the learned policy. In terms of sampling, not only is the selection of influential training samples important, but it is also crucial to understand what kind of experiences are required in a given phase of the agent’s training. Unfortunately, training sample prioritization techniques do not tackle this issue directly, since the agent cannot choose from different sampled batches before using them to update its policy.
Monte Carlo Tree Search [9] has emerged as a robust algorithm in domains where decision-making requires balancing exploration (discovering new states) and exploitation (refining known good states). Initially designed for planning in large search spaces like those in strategic games like Go [10], MCTS iteratively builds a search tree by performing random simulations and using the results to guide future explorations. It excels at focusing computational resources on promising areas of the state space while maintaining a balance between exploring new possibilities and exploiting known strategies. This capacity for strategic exploration makes MCTS a compelling candidate for integration with RL, particularly in batch sequence optimization.
In this research, we propose a novel integration of MCTS into the batch sequence optimization of RL algorithms. By leveraging the search capabilities of MCTS, our approach seeks to systematically identify and sample trajectories that hold more significant potential for improving the policy. Rather than using the sampled batches in a random order, our method prioritizes, within a given training phase, the batches that are more likely to result in meaningful policy updates. By guiding the batch selection process with MCTS, we aim to focus learning on the most valuable and informative experiences at a given stage of training, thereby accelerating convergence and improving overall policy performance. This approach contrasts with conventional techniques that result in slower learning and suboptimal performance, especially in complex environments.
To evaluate the efficacy of our proposed method, we perform extensive experiments using standard RL benchmarks. These benchmarks provide a well-established framework for comparing our MCTS-guided batch selection method with the traditional RL training approach. The results of our experiments demonstrate that our method consistently outperforms uniform batch sampling strategies, both in terms of learning speed and the quality of the final learned policies. Specifically, our approach leads to faster convergence rates and more robust policy performance, indicating that MCTS is highly effective in guiding the agent’s learning process.
This work makes a significant contribution to the field of RL by introducing a novel method for batch sequence optimization that combines the strengths of MCTS with traditional RL techniques. Our method not only enhances the efficiency of experience replay, but also opens the door for future research on integrating planning and search algorithms into the experience prioritization process of RL. By focusing on the most valuable batches, our approach demonstrates the potential for improving the scalability and effectiveness of RL algorithms, particularly in environments where informative experiences are sparse.

2. Related Work

Our work draws on several research areas that explore the integration of MCTS with RL, efficient sampling in experience replay, and adaptive exploration strategies.
MCTS has been increasingly adopted in RL for its structured approach to exploration and planning, particularly in domains requiring sequential decision-making. A pivotal example is the algorithm in [10], where MCTS is combined with RL to create a policy capable of outperforming human experts in the game of Go. This success demonstrated how MCTS can guide action selection in complex environments by balancing exploration and exploitation. Following this, the authors of [11] introduced the Thinking Fast and Slow with Deep Learning and Tree Search framework, which combines MCTS with deep RL for strategy games, effectively using MCTS to guide the neural network towards optimal actions. The authors of [12] explored Learning to Search with MCTS, where MCTS is used as a structured exploration mechanism in complex domains. This shows that it can lead to sample efficiency gains by focusing exploration on valuable decision paths.
In parallel, experience prioritization methods like experience replay have become essential to improving learning efficiency in RL. In [13], the researchers introduced Prioritized Experience Replay, which proposed selectively replaying experiences based on a priority metric to increase learning frequency from impactful transitions. This method laid the groundwork for efficient sampling by prioritizing more informative samples in a replay buffer. Building on this, ref. [14] reviewed Experience Replay Optimization, comparing multiple replay techniques that selectively choose samples to improve RL convergence rates and efficiency. These studies illustrate the benefits of sampling experiences based on importance, providing a foundation for our approach to optimizing the batch sequence with MCTS.
Adaptive sampling also aligns with Curriculum Learning (CL), where experiences are presented progressively to enhance learning efficiency. The authors in [15] introduced Curriculum Learning, which organizes training samples to gradually increase task difficulty, helping the model to improve its performance on complex tasks. The work in [16] extended this in Automated Curriculum Learning for Neural Networks, where the sampling strategy is dynamically adjusted to tailor the sequence of experiences based on the model’s progress. This concept parallels our goal of prioritizing more challenging experiences through MCTS-guided sampling.
While finding the right sequence of batches can be seen as an automatic way of laying out a curriculum of optimal learning sample progression, the conventional usage of the term in RL mostly refers to a progression of increasingly difficult tasks and the sequence of experiences (samples) drawn from these tasks, as explained in [17]. Nevertheless, CL and RL are connected in several ways. The authors in [18] proposed a method that enables better real-world performance of RL controllers in robotics by utilizing CL to mitigate the gap between simulation and real-world scenarios. Ref. [19] presents a similar use case in robotics, showing how an automatically generated curriculum can enhance the convergence of RL agents. The method proposed here differs in that it does not require a specific curriculum for the agent, but instead orders the existing training experiences in the best possible way to attain the highest performance.
Work in Supervised Learning (SL) focused on fast initial convergence in resource-constrained settings, such as [20], where the authors use simple heuristics as proxies for sample difficulty to implement a domain-relevant curriculum, indicates that the general meaning of the term, and its applicability to RL, extends beyond task-sequence design. This inspired us to move away from the specific CL definitions used in RL and instead treat the optimal sequence of batches as an emergent curriculum arising from the strategic search capability of MCTS.
Our approach, while building upon the aforementioned foundations, is novel by virtue of its exploration of possible advantages in overlooked aspects of the RL training loop, with the help of established tree search algorithms. The proposed MCTS-guided batch optimization aims to enhance sample efficiency and accelerate policy improvement, offering a new direction for research on efficient RL training methods.

3. Contribution

As detailed above, MCTS has been integrated into RL in several ways, improving the performance and exploration capabilities of the agent. This paper proposes a novel approach of using MCTS in the RL training process that is related to prioritization of experiences, and involves the following:
  • MCTS is used to optimize the batch sequence during training. Several batches are sampled from the memory buffer at the end of every episode, each batch is used individually to conduct a policy update, and this process is handled as a planning task for MCTS. During training, MCTS therefore constructs a tree in which it aims to find the path from the root, the untrained agent, to a leaf where the agent reaches the best possible performance with the given memory buffer.
  • We differentiate our method from established sample prioritization methods by shifting the focus away from a fine-grained assessment of each sample’s contribution to the agent’s policy and towards the sequence of experiences, taking the more coarse-grained approach of treating whole randomly drawn batches as chunks of experience to be ordered into an efficient curriculum.
  • We also differentiate the proposed method from other Curriculum Learning strategies in the literature for RL or SL problems, since our method does not rely on any heuristic or principle for ordering the batches during training, such as presenting simple samples first; instead, it uses the MCTS algorithm, which can yield the optimal sequence given sufficient iterations.
In summary, with this problem formulation, MCTS is used to find a trajectory that ensures continuous policy improvement by optimizing the batch sequence. To our knowledge, we are the first to study this overlooked aspect of the training process.

4. Methodology

4.1. Baseline RL Algorithms

To explore the effect of our MCTS-based batch order optimization algorithm, we chose a widely used and simple yet versatile agent type, the Deep Q-Network (DQN), introduced in [4]. By training the agent as described by the authors, we establish a clear and simple baseline to compare against. This allows us to analyze the effect imposed by the ordering of randomly drawn batches of experience samples.
To the best of our knowledge, we are the first to inspect this aspect of training deep neural networks on RL tasks, with no other baseline methods available where the order of applying batch updates is treated as a sequence optimization problem, with the goal of attaining the fastest convergence given a set of pre-generated batches. We aim to explore the advantage given by batch sequence optimization in isolation; therefore, we present the comparative performance of our method on several RL tasks when measured against standard DQN training.
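For reference, the baseline update can be summarized as follows. This is a minimal sketch assuming a PyTorch implementation with the two-hidden-layer architecture reported later in Section 5.1; all names are illustrative rather than the authors' code.

```python
import torch
import torch.nn as nn

# Minimal sketch of a standard DQN update step (illustrative, not the authors' code).
class QNetwork(nn.Module):
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, x):
        return self.net(x)

def dqn_update(q_net, target_net, optimizer, batch, gamma: float = 0.999) -> float:
    """One gradient step on a batch of (s, a, r, s', done) tensors."""
    s, a, r, s_next, done = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)          # Q(s, a)
    with torch.no_grad():
        target = r + gamma * target_net(s_next).max(dim=1).values * (1.0 - done)
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Our method leaves this update untouched; only the order in which batches are fed to it is optimized.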

4.2. Integration of MCTS into DQN Training

Experience prioritization methods in RL are well known to improve training sample efficiency. The difficulty arises from several factors related to the task itself and the environment model:
  • The problem of credit assignment, which makes it difficult to understand the long-term effects of the action chosen in the current state.
  • The problem of reward sparsity, which makes it difficult to collect truly valuable learning signals.
But what are priority metrics really doing? Priority metrics indirectly focus on composing the next optimal batch for the agent, with the hope that when it is used to update the model, it will improve the agent’s policy to a greater extent than standard uniformly distributed sampling. Unfortunately, stochastic sampling cannot guarantee that the composed batch will improve the policy during the update.
To tackle this issue, we propose an MCTS-based batch sequence optimization that creates a policy update trajectory that reaches the best possible outcome from the given training samples if enough planning is provided to MCTS. In our implementation, each batch was sampled uniformly at random from the replay buffer at expansion, and nodes were evaluated based on the validation reward obtained after updating the agent with that batch (called the core reward, r_core). Node selection during tree traversal followed the Upper Confidence Bound for Trees (UCT) criterion [9], balancing exploitation of high-reward nodes and exploration of less-visited ones. The main parameters were fixed across experiments. The exploration constant was set to the theoretically optimal value c = √2 ≈ 1.4, as introduced in the original UCT paper. The maximum tree depth was always equal to the number of training episodes, and the branching factor was determined by the number of candidate batches evaluated per step, which was set to 2 in order to keep the time complexity of the algorithm as low as possible.
Our goal with regard to integrating the MCTS algorithm into the Reinforcement Learning process was to optimize the agent’s learning trajectory. Conceptually, each node represents a specific agent model state f(x; θ_i). The root node has randomly initialized parameters (θ_0), while subsequent children are created by updating the agent policy based on batch b_j. A linear sequence of such updates, with randomly sampled batches, represents a traditional training loop, while optimizing the sequence of batches to be applied transforms it into a tree search problem, as seen in Figure 1. Such a representation allows us to model different learning trajectories of the agent based on different experiences, looking for the optimal curriculum.
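To make this representation concrete, the following sketch shows one possible node structure for such a tree, in which a node stores a snapshot of the agent's parameters together with the MCTS statistics used later. The class and field names are our illustrative assumptions, not the authors' implementation.

```python
from __future__ import annotations
from dataclasses import dataclass, field
from typing import Any, Optional

@dataclass
class TrainingNode:
    """A node f(x; theta_i): one agent state reached by a sequence of batch updates."""
    epoch: int                       # depth in the tree = number of batch updates applied
    epsilon: float                   # exploration rate of the agent at this node
    r_core: float                    # validation reward right after this node's batch update
    weights: Any                     # snapshot of the agent parameters theta_i
    parent: Optional[TrainingNode] = None
    children: list[TrainingNode] = field(default_factory=list)
    visits: int = 0                  # N(v), visit count
    cum_return: float = 0.0          # Q(v), backed-up cumulative return

# A conventional training loop corresponds to a single root-to-leaf chain of such
# nodes; allowing several children per node (one per candidate batch) turns the
# choice of batch order into a tree search problem.
root = TrainingNode(epoch=0, epsilon=1.0, r_core=0.0, weights=None)
```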
Algorithm 1 shows the pseudocode version of our implementation, while Figure 2 illustrates it, focusing on the four classical steps of Monte Carlo Tree Search: node selection, expansion, simulation or rollout, and backpropagation of the reward signal, which is the validation reward of the agent with parameters corresponding to the given node.
Algorithm 1 MCTS-Guided Batch Sequence Optimization
External parameters: branching factor B, train episodes T, exploration constant c_UCT, environment E.

procedure MCTS_Search(agent, M, E; B, T, c_UCT)
    root ← Node(epoch = 0, ε = agent.ε, r_core = 0, parent = ∅)
    while True do
        v ← TreePolicy(root, B, T, c_UCT)
        if IsTerminal(v, T) then
            break
        end if
        R ← Rollout(v, T, agent, M, E)
        Backpropagate(v, R)
    end while
    return root
end procedure

procedure TreePolicy(node, B, T, c_UCT)
    x ← node
    while True do
        if IsTerminal(x, T) then
            return x
        end if
        if not IsFullyExpanded(x, B) then      ▹ not reached maximum child count B
            return Expand(x)
        end if
        x ← BestChild(x, c_UCT)                ▹ select based on the UCT criterion
        if x = ∅ then
            return node
        end if
    end while
end procedure

procedure Expand(node)
    restore agent from node
    ε ← ε · ε_decay
    batch ← SampleUniform(M)
    Fit(agent, batch)
    r_core ← Validate(agent, E, ε)
    child ← Node(epoch + 1, ε, r_core, parent = node)
    save agent in child
    return child
end procedure

procedure Rollout(node, T, agent, M, E)
    restore agent from node
    Train(agent, start = node.epoch, end = T − 1, M)
    return Validate(agent, E)
end procedure

procedure Backpropagate(node, R)
    while node ≠ ∅ do
        increment visit count in node
        update cumulative return in node with R
        node ← node.parent
    end while
end procedure
At each depth level of the search tree, multiple nodes are explored based on the branching factor, which determines how many different batches of experiences are applied to the model in each episode. This mechanism enables us to examine various potential model states, giving insight into the agent’s learning trajectories across different combinations of batches. For each expansion, a batch is sampled from the replay buffer and applied to the model, and the updated model state is stored in a new node.
After expanding the search tree with new nodes, the agent’s performance is evaluated following the batch update. The resulting performance metric is assigned as a core reward to each corresponding node, indicating the immediate contribution of the batch to the improvement of the agent’s policy. This evaluation is critical for guiding future batch selections, as it forms the basis for decision-making within the MCTS framework.
Once the search tree reaches a predefined depth corresponding to a specific episode, the agent performs a rollout to simulate its performance in the remaining episodes without further MCTS-based sampling. This simulation provides insight into the agent’s expected performance in future episodes, as it would occur under standard RL training. The performance metrics obtained during this rollout are used to update the value of the starting node, giving a more comprehensive assessment of the long-term potential of the batch associated with that node.
Following the rollout, the resulting rewards are backpropagated through the search tree. During backpropagation, nodes contributing to higher rewards are reinforced, while those associated with lower rewards are de-emphasized. This process refines the search tree by updating the value estimates of each node based on their impact on the agent’s overall performance, guiding future batch selections toward more promising trajectories.
When the search tree reaches the target depth, corresponding to the desired number of training episodes, the branch yielding the highest cumulative reward is selected. This optimal branch represents the sequence of batch updates that produced the most effective improvements in the agent’s policy. Following this selected branch, we can observe how the model was progressively updated, achieving optimal performance with fewer training episodes than traditional RL methods.
To evaluate the effectiveness of the MCTS-optimized approach, we compared the agent’s performance against a baseline agent trained with random batch selection. The core rewards from the nodes along the selected branch serve as performance indicators. Our results show that the MCTS-guided method achieves comparable or superior convergence with fewer episodes, reducing redundant or less informative experiences. This optimized batch selection process accelerates the agent’s learning and improves the final policy outcomes.
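The expansion, rollout, and backpropagation steps just described can be summarized in a short Python sketch. The agent is hidden behind generic fit/validate/snapshot/restore callables, and the node statistics (visits, cumulative return, core reward) are those used by the UCT rule in Section 4.3; all names here are illustrative assumptions rather than the authors' code.

```python
import random
from dataclasses import dataclass, field
from typing import Any, Callable, Optional

@dataclass
class Node:
    epoch: int
    epsilon: float
    r_core: float
    weights: Any
    parent: Optional["Node"] = None
    children: list = field(default_factory=list)
    visits: int = 0
    cum_return: float = 0.0

def expand(node: Node, agent, replay: list, epsilon_decay: float,
           fit: Callable, validate: Callable,
           snapshot: Callable, restore: Callable) -> Node:
    """Apply one freshly sampled batch to the agent restored from `node`."""
    restore(agent, node.weights)                              # restore theta_i
    eps = node.epsilon * epsilon_decay                        # decay exploration rate
    batch = random.sample(replay, k=min(128, len(replay)))    # uniform batch from the buffer
    fit(agent, batch)                                         # one gradient update
    r_core = validate(agent, eps)                             # immediate validation reward
    child = Node(node.epoch + 1, eps, r_core, snapshot(agent), parent=node)
    node.children.append(child)
    return child

def rollout(node: Node, agent, replay: list, total_episodes: int,
            fit: Callable, validate: Callable, restore: Callable) -> float:
    """Finish training from `node` with randomly ordered batches; return the final validation reward."""
    restore(agent, node.weights)
    for _ in range(node.epoch, total_episodes):
        fit(agent, random.sample(replay, k=min(128, len(replay))))
    return validate(agent, 0.0)

def backpropagate(node: Optional[Node], ret: float) -> None:
    """Propagate the rollout return up to the root, updating N(v) and Q(v)."""
    while node is not None:
        node.visits += 1
        node.cum_return += ret
        node = node.parent
```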
In summary, compared to a typical MCTS problem formulation for a sequential decision-making task, our formulation differs in the following aspects:
  • The nodes in the tree are not the states of the game played by MCTS, but the trained neural network’s weights.
  • The edges are not actions that change the state of the game, but batches that are used for updating the policy of the agent.
  • In the rollout process, the game is not played with random actions until its conclusion; instead, the steps are individual episodes, with the random actions of the rollout corresponding to fitting the agent with batches in random order.

4.3. Ablation Rationale and Parameter Effects

Our MCTS component has a single free selection parameter, the exploration constant c_UCT. With our node statistics, the tree policy maximizes

$$\mathrm{UCT}(u \to v) \;=\; \underbrace{\frac{Q(v) + r_{\mathrm{core}}(v)}{N(v)}}_{\hat{Q}(v)} \;+\; c_{\mathrm{UCT}} \sqrt{\frac{2 \ln N(u)}{N(v)}},$$

where N(·) denotes visit counts, Q(·) denotes the backed-up cumulative return, and r_core(·) denotes the immediate validation reward after the batch update at v. Setting c_UCT = 0 reduces the policy to greedy exploitation of Q̂(v), while increasing c_UCT emphasizes optimism for rarely visited children, approaching near-uniform exploration among low-N(v) nodes. We therefore fix c_UCT = √2, following the canonical choice of UCT, which balances early exploration with eventual concentration on high-value branches.
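For concreteness, the selection rule can be computed in a few lines; the node fields follow the sketches in Section 4.2, and the function names are our assumptions.

```python
import math

def uct_score(parent_visits: int, visits: int, cum_return: float,
              r_core: float, c_uct: float = math.sqrt(2)) -> float:
    """UCT(u -> v) = Q_hat(v) + c_UCT * sqrt(2 ln N(u) / N(v)); unvisited children win immediately."""
    if visits == 0:
        return float("inf")
    q_hat = (cum_return + r_core) / visits
    return q_hat + c_uct * math.sqrt(2.0 * math.log(parent_visits) / visits)

def best_child(node, c_uct: float = math.sqrt(2)):
    """Pick the child maximizing the UCT criterion; returns None for a leaf node."""
    if not node.children:
        return None
    return max(node.children,
               key=lambda v: uct_score(node.visits, v.visits, v.cum_return, v.r_core, c_uct))
```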
Two other quantities are fixed by design rather than tuned: (i) The tree depth equals the number of training episodes T, because each depth corresponds to one gradient update; varying the depth would change the total update budget and confound comparisons. (ii) The branching factor is set to B = 2 , the minimal value that still allows nontrivial look-ahead. Increasing B primarily increases the computation time (and node count) without altering the learning objective.
Finally, we do not contrast against Prioritized Experience Replay (PER) as an “ablation” of our module: PER prioritizes individual transitions, whereas our method plans over the sequence of batches. These are orthogonal mechanisms: PER could be used to construct batches, over which MCTS then plans their order. To isolate the contribution of planning, we keep batch sampling uniform in all reported experiments and leave PER+MCTS combinations to future work.

5. Experiments

5.1. Environments

For the experiments, we used the well-known OpenAI Gym environments [21] to demonstrate the generality of the proposed methodology. We chose these tasks as a diverse subset of common RL problems, with varying reward sparsity and episode length (maximum number of steps). A summary of the environments can be seen in Table 1, and they are illustrated in Figure 3. For deterministic outcomes, we set the random seed to 42 in each case.
Across all experiments, we used a DQN with two fully connected hidden layers (256 units each, with ReLU activation), optimized with Adam (learning rate 10⁻³). The discount factor was γ = 0.999; the exploration rate ε decayed multiplicatively by 0.995 per episode; and the target network was updated every 10 episodes. The MCTS parameters were set as described in the previous section. We set the memory buffer size to 10,000 samples and sampled batches of 128 at each agent update for every experiment.
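For reference, the settings reported here and in Section 4.2 can be gathered into a single configuration; this is a sketch with variable names of our own choosing.

```python
# Experiment configuration as reported in Sections 4.2 and 5.1 (names are illustrative).
CONFIG = {
    "seed": 42,                    # fixed random seed for all environments
    "hidden_layers": (256, 256),   # two fully connected hidden layers, ReLU activations
    "optimizer": "Adam",
    "learning_rate": 1e-3,
    "gamma": 0.999,                # discount factor
    "epsilon_decay": 0.995,        # multiplicative decay per episode
    "target_update_every": 10,     # episodes between target-network updates
    "replay_buffer_size": 10_000,
    "batch_size": 128,
    # MCTS parameters (Section 4.2)
    "c_uct": 2 ** 0.5,             # exploration constant, sqrt(2)
    "branching_factor": 2,         # candidate batches evaluated per tree depth
    # The tree depth equals the number of training episodes (environment-dependent).
}
```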

5.2. Results

To assess the impact of MCTS-guided batch selection, we conducted a comprehensive set of experiments across multiple environments from the OpenAI Gym toolkit, each chosen for its distinct dynamics, varying levels of complexity, and different reward structures. These environments ranged from simpler tasks like CartPole, which requires balancing a pole on a moving cart, to more complex, high-dimensional tasks such as the Highway environment, which involves continuous control and dynamic obstacles.
By using this diverse set of environments, we rigorously evaluated how effectively MCTS enhances learning efficiency and policy quality. Specifically, we aimed to see whether MCTS could improve upon traditional batch selection strategies, which typically do not optimize the order of the batches sampled from the memory buffer of the agent. These standard methods often struggle in environments with sparse rewards or complex state spaces, where selecting the most informative experiences is crucial to accelerating the learning process.
Through these experiments, we examined the speed of convergence to an optimal policy and the overall stability of learning, particularly in environments where consistently making the right decisions is key to long-term success. Our goal was to determine whether MCTS can guide the learning process in a way that focuses the agent’s attention on the most critical experiences, leading to faster learning, more stable performance, and ultimately better policies than those achieved through traditional batch sampling techniques.
In doing so, we sought to provide a robust comparison between the MCTS-guided method and established techniques, evaluating the strengths and weaknesses of each approach in environments with different reward distributions and dynamics. This range of tests allowed us to determine the broader applicability of MCTS-guided batch selection across various Reinforcement Learning challenges.
Importantly, all experiments were carefully seeded to ensure reproducibility and direct comparability between the MCTS and DQN training runs. Seeding was crucial in this context, as it ensured that both the initialization of environments and randomized decisions (such as action selection during exploration) were identical for both methods. This guaranteed that the experience replay buffers were filled with the same data, thus ensuring that our agent and DQN agents started from the same set of weights and could utilize the same data set. Without seeding, differences in the data collected by each agent during training could introduce variability unrelated to the actual effectiveness of the algorithms themselves, making it challenging to attribute performance improvements to MCTS alone. Seeding provided a controlled environment to accurately compare the performance of MCTS against DQN on an even playing field.
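In practice, this pairing can be enforced by seeding every source of randomness before either run. A minimal sketch, assuming the classic OpenAI Gym API of [21] (the newer gymnasium API passes the seed to env.reset instead), might look as follows:

```python
import random
import numpy as np
import torch
import gym  # classic OpenAI Gym API [21]; gymnasium handles seeding via env.reset(seed=...)

SEED = 42

def make_seeded_env(env_id: str, seed: int = SEED):
    """Seed Python, NumPy, PyTorch, and the environment so paired runs see identical randomness."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    env = gym.make(env_id)
    env.seed(seed)                # available in classic gym (< 0.26)
    env.action_space.seed(seed)   # makes random exploration actions reproducible too
    return env

# Both the MCTS-guided run and the baseline DQN run then start from identical weights,
# identical environment rollouts, and therefore identical replay buffer contents.
env = make_seeded_env("CartPole-v1")
```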
Based on the results, the performance of the MCTS-based batch sequence optimization falls into three distinct categories. In the first case, thanks to batch sequence optimization, the agent navigates between trajectories in the optimization space so effectively that it solves the given control task in far fewer gradient updates than the baseline DQN, which follows the traditional training scheme of fitting randomly ordered batch after batch. The results can be seen in Figure 4. The MCTS-based batch sequence optimization technique demonstrates superior performance in the CartPole and Acrobot environments by solving them in far fewer iterations.
In the second case, the MCTS-based batch sequence optimization technique yields a clear advantage in cumulative reward throughout the whole training process when compared to the classic approach. This demonstrates the strength of the proposed method: by adjusting the order of batches, MCTS can find a better path through the optimization space toward the optimum. Figure 5 shows this effect in the MountainCar and CliffWalking environments.
In the final case, the MCTS-based technique outperforms the classic trial-and-error-based random batch order by only a small margin. However, at the end of the training process, it still achieves better performance, which indicates the potential of the proposed method. Figure 6 shows the convergence of both methods in the Highway-v0 and Taxi-v3 environments.
The results above, underscored by the figures showing the evolution of validation scores over time, clearly demonstrate the performance and potential of the MCTS-based batch selection technique. Specific results in terms of the final cumulative reward are shown in Table 2, summarizing the performance of both methods in each environment. It can be seen that in every case, the proposed method yields better performance. In the case of the CartPole environment, the final cumulative reward is the same since it is a relatively simple control problem, and both methods can solve it perfectly. However, the convergence plots in Figure 4 show that our method reaches the goal criterion in fewer iterations.

5.3. General Observations

Across all the environments tested, the MCTS-integrated approach consistently demonstrated superior performance compared to baseline methods in key metrics, such as convergence speed and overall policy effectiveness. This improvement was especially pronounced in environments where rewards were sparse or unevenly distributed. In these cases, traditional sampling methods struggled to effectively guide the learning process due to the limited availability of informative experiences.
This prioritization allowed the MCTS-guided approach to avoid the inefficiencies commonly associated with random batch sequences, where the agent often wastes time learning from redundant or less valuable experiences. As a result, the agent can accelerate its learning process, converging to an optimal policy with fewer episodes and less computational effort. Additionally, MCTS facilitated a more stable and consistent learning curve across training episodes, reducing the variance typically seen in baseline methods, where the agent may struggle to find informative experiences, particularly in sparse environments.
Overall, by systematically identifying and sampling high-value trajectories, the MCTS-based batch selection improved the learning process’s efficiency and enhanced the quality of the final learned policy. This capability of focusing on the most impactful data underscores the potential of MCTS as a robust tool for improving Reinforcement Learning, especially in challenging environments with complex reward structures.

5.4. Computational Overhead

MCTS introduces a planning cost on top of standard DQN training. In general, a fully expanded MCTS tree grows exponentially with depth, and the fraction of the full tree that is actually expanded depends stochastically on the choice of c_UCT: for lower values, the exploitation term dominates and the tree grows quickly depthwise along already-explored directions, while for larger values it broadens to favor previously unseen regions. The resulting computational overhead unfortunately limits the feasibility of extensive hyperparameter tuning. For this reason, our implementation deliberately uses the branching factor B = 2 in order to limit this complexity as much as possible. Alternatively, one can limit the number of iterations and continue normal training from the most promising node of the partially explored tree once the limit has been reached.
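As a rough illustration of why full expansion is infeasible, the worst-case node count of a tree with branching factor B and depth T is (B^(T+1) − 1)/(B − 1); a quick back-of-the-envelope calculation (our own illustration, not from the paper):

```python
def full_tree_nodes(branching_factor: int, depth: int) -> int:
    """Nodes in a fully expanded tree of the given branching factor and depth (root at depth 0)."""
    b = branching_factor
    return (b ** (depth + 1) - 1) // (b - 1)

# Even with B = 2 and, e.g., 100 training episodes, the full tree has roughly 2.5e30 nodes,
# so only a tiny fraction of it can ever be expanded in practice (or the iteration count
# must be capped and training continued from the most promising node).
print(full_tree_nodes(2, 100))
```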

6. Conclusions

This paper introduced a novel integration of Monte Carlo Tree Search (MCTS) into the batch selection process of Reinforcement Learning (RL) algorithms. By leveraging the strategic exploration and exploitation capabilities of MCTS, we demonstrated how our approach systematically prioritizes the most informative batches, accelerating the learning process and enhancing overall policy performance. Through extensive experimentation on standard RL benchmarks, we showed that MCTS-guided batch selection outperforms conventional random and uniform sampling techniques, achieving faster convergence and more robust policy outcomes.
Our work highlights the potential of combining planning algorithms like MCTS with Reinforcement Learning to address the challenge of sample inefficiency. By focusing learning on valuable trajectories, this method improves the computational efficiency of RL algorithms and opens new avenues for integrating advanced search techniques into experience prioritization. Future research can focus on further integrating MCTS into the training process, since there is nothing specific to Reinforcement Learning about batch sequence optimization; consequently, it can enhance the performance of any technique that relies on gradient descent. Furthermore, MCTS can also serve as a general tool for hyperparameter optimization during training, tailoring the training parameters to the given problem. The learning rate would be a straightforward choice for such an experiment: MCTS would then produce a problem-specific learning-rate schedule instead of a pre-defined functional shape, such as the traditional warm-up with cosine annealing.
Although our experiments target RL, the procedure itself is task-agnostic: a node stores parameters θ , an edge applies a mini-batch update, and MCTS requires only a scalar evaluation functional V ( θ ) after each update. The key practical condition for broader applicability (e.g., supervised or self-supervised learning) is that V be stochastic and sufficiently decoupled from the training batch used for the update, so that repeated planning cannot overfit the validation signal. In RL, this is provided by randomized rollouts; in non-RL settings, the same effect can be achieved with rotating K-fold validation, bootstrapped/held-out streams, or strong data augmentation that yields an i.i.d. evaluation stream. Under these conditions, UCT approximately maximizes the expected generalization improvement (not merely training loss), making the search over batch sequences equally meaningful beyond RL. The computational trade-off mirrors our RL case (depth equals the number of planned updates; the branching factor equals the number of candidate batches per step), while the per-evaluation cost is often lower than that of an RL rollout, which further supports portability to general ML.
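As one possible illustration of such a decoupled evaluation functional in a supervised setting, V(θ) could rotate over validation folds so that repeated planning queries never score the model on the same subset twice in a row. The sketch below rests on our own assumptions (sklearn-style models with a score method) and is not part of the reported experiments.

```python
import numpy as np

def make_rotating_evaluator(X_val: np.ndarray, y_val: np.ndarray, k: int = 5):
    """Return V(theta): each call scores the model on the next of k shuffled validation folds,
    giving the planner a stochastic evaluation signal that is hard to overfit."""
    folds = np.array_split(np.random.permutation(len(X_val)), k)
    state = {"i": 0}

    def evaluate(model) -> float:
        idx = folds[state["i"] % k]
        state["i"] += 1
        return float(model.score(X_val[idx], y_val[idx]))  # e.g., accuracy for sklearn-style models

    return evaluate
```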
Based on the results, it can be seen that the integration of the MCTS algorithm can open a new phase in automated Curriculum Learning since it has the ability to yield the best possible performance from a given dataset, regardless of the Machine Learning problem class. In our opinion, the proposed method does not possess any RL-specific components, which makes it possible to utilize this method for any type of Machine Learning problem that utilizes batches, such as Computer Vision, which struggles with a lack of training data. Consequently, along with the above, our research will focus on validating the proposed method across different types of Machine Learning problem classes.

Author Contributions

Conceptualization, B.K., T.B. and G.C.; methodology, G.C. and B.K.; software development, I.P. and G.C.; validation, G.C., B.K. and I.P.; resource and funding acquisition, T.B.; data curation, G.C.; writing—original draft preparation, G.C. and I.P.; writing—review and editing, B.K. and T.B.; visualization, G.C. and B.K.; supervision, T.B.; project administration, T.B. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the European Union within the framework of the National Laboratory for Autonomous Systems (RRF-2.3.1-21-2022-00002). The research reported in this paper is part of project no. BME-NVA-02, implemented with the support provided by the Ministry of Innovation and Technology of Hungary from the National Research, Development and Innovation Fund, financed under the TKP2021 funding scheme. B.K. was supported by project no. 2024-2.1.1-EKÖP-2024-00003, implemented with the support provided by the Ministry of Culture and Innovation of Hungary from the National Research, Development and Innovation Fund, financed under the EKÖP-24-4-I-BME-150 funding scheme.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Conflicts of Interest

The authors have no competing interests to declare relevant to this article’s content.

References

  1. Kővári, B.; Knáb, I.G.; Esztergár-Kiss, D.; Aradi, S.; Bécsi, T. Distributed highway control: A cooperative reinforcement learning-based approach. IEEE Access 2024, 12, 104463–104472. [Google Scholar] [CrossRef]
  2. Mihály, A.; Vu, V.T.; Do, T.T.; Thinh, K.D.; Vinh, N.V.; Gáspár, P. Linear Parameter Varying and Reinforcement Learning Approaches for Trajectory Tracking Controller of Autonomous Vehicles. Period. Polytech. Transp. Eng. 2025, 53, 94–102. [Google Scholar] [CrossRef]
  3. Ichter, B.; Pavone, M. Robot motion planning in learned latent spaces. IEEE Robot. Autom. Lett. 2019, 4, 2407–2414. [Google Scholar] [CrossRef]
  4. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef] [PubMed]
  5. Schrittwieser, J.; Antonoglou, I.; Hubert, T.; Simonyan, K.; Sifre, L.; Schmitt, S.; Guez, A.; Lockhart, E.; Hassabis, D.; Graepel, T.; et al. Mastering atari, go, chess and shogi by planning with a learned model. Nature 2020, 588, 604–609. [Google Scholar] [CrossRef] [PubMed]
  6. Fawzi, A.; Balog, M.; Huang, A.; Hubert, T.; Romera-Paredes, B.; Barekatain, M.; Novikov, A.; R Ruiz, F.J.; Schrittwieser, J.; Swirszcz, G.; et al. Discovering faster matrix multiplication algorithms with reinforcement learning. Nature 2022, 610, 47–53. [Google Scholar] [CrossRef] [PubMed]
  7. Mankowitz, D.J.; Michi, A.; Zhernov, A.; Gelmi, M.; Selvi, M.; Paduraru, C.; Leurent, E.; Iqbal, S.; Lespiau, J.B.; Ahern, A.; et al. Faster sorting algorithms discovered using deep reinforcement learning. Nature 2023, 618, 257–263. [Google Scholar] [CrossRef] [PubMed]
  8. Anoushee, M.; Fartash, M.; Akbari Torkestani, J. An intelligent resource management method in SDN based fog computing using reinforcement learning. Computing 2024, 106, 1051–1080. [Google Scholar] [CrossRef]
  9. Kocsis, L.; Szepesvári, C. Bandit Based Monte-Carlo Planning. In Machine Learning: ECML 2006; Fürnkranz, J., Scheffer, T., Spiliopoulou, M., Eds.; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2006; Volume 4212, pp. 282–293. [Google Scholar] [CrossRef]
  10. Silver, D.; Huang, A.; Maddison, C.J.; Guez, A.; Sifre, L.; van den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; et al. Mastering the game of Go with deep neural networks and tree search. Nature 2016, 529, 484–489. [Google Scholar] [CrossRef] [PubMed]
  11. Anthony, T.; Tian, Z.; Barber, D. Thinking Fast and Slow with Deep Learning and Tree Search. arXiv 2017, arXiv:1705.08439. [Google Scholar] [CrossRef]
  12. Guez, A.; Weber, T.; Antonoglou, I.; Simonyan, K.; Vinyals, O.; Wierstra, D.; Munos, R.; Silver, D. Learning to Search with MCTSnets. arXiv 2018, arXiv:1802.04697. [Google Scholar] [CrossRef]
  13. Schaul, T.; Quan, J.; Antonoglou, I.; Silver, D. Prioritized Experience Replay. arXiv 2016, arXiv:1511.05952. [Google Scholar] [CrossRef]
  14. Zha, D.; Lai, K.H.; Zhou, K.; Hu, X. Experience Replay Optimization. arXiv 2019, arXiv:1906.08387. [Google Scholar] [CrossRef]
  15. Bengio, Y.; Louradour, J.; Collobert, R.; Weston, J. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, Montreal, QC, Canada, 14–18 June 2009; ACM: New York, NY, USA, 2009; pp. 41–48. [Google Scholar] [CrossRef]
  16. Graves, A.; Bellemare, M.G.; Menick, J.; Munos, R.; Kavukcuoglu, K. Automated Curriculum Learning for Neural Networks. arXiv 2017, arXiv:1704.03003. [Google Scholar] [CrossRef]
  17. Narvekar, S.; Peng, B.; Leonetti, M.; Sinapov, J.; Taylor, M.E.; Stone, P. Curriculum learning for reinforcement learning domains: A framework and survey. J. Mach. Learn. Res. 2020, 21, 7382–7431. [Google Scholar]
  18. Wang, L.; Xu, Z.; Stone, P.; Xiao, X. Grounded curriculum learning. arXiv 2024, arXiv:2409.19816. [Google Scholar] [CrossRef]
  19. Karni, Z.; Simhon, O.; Zarrouk, D.; Berman, S. Automatic curriculum determination for deep reinforcement learning in reconfigurable robots. IEEE Access 2024, 12, 78342–78353. [Google Scholar] [CrossRef]
  20. Irandoust, S.; Durand, T.; Rakhmangulova, Y.; Zi, W.; Hajimirsadeghi, H. Training a Vision Transformer from scratch in less than 24 hours with 1 GPU. In Proceedings of the Has It Trained Yet? NeurIPS 2022 Workshop, New Orleans, LA, USA, 2 December 2022. [Google Scholar]
  21. Brockman, G.; Cheung, V.; Pettersson, L.; Schneider, J.; Schulman, J.; Tang, J.; Zaremba, W. OpenAI Gym. arXiv 2016, arXiv:1606.01540. [Google Scholar] [CrossRef]
Figure 1. Training a Reinforcement Learning agent, reframed as a tree search problem to be solved by Monte Carlo Tree Search.
Figure 2. An overview of the MCTS-based batch sequence optimization algorithm.
Figure 3. Snapshots from the environments that were utilized for evaluation purposes during our experiments.
Figure 4. Environments where the agent with the optimized batch order exhibits significantly faster convergence and solves the problem with fewer gradient updates compared to the baseline training.
Figure 5. Environments where the agent with the optimized batch order shows an advantage in validation episode rewards during the entirety of the training.
Figure 6. Environments where the agent with the optimized batch order shows an improvement over the baseline method, but only by a narrow margin.
Table 1. Properties of the studied environments.

Environment     Reward Sparsity   Termination                  Max Steps
MountainCar     Sparse            Goal or timeout              200
Acrobot         Sparse            Swing-up or timeout          500
CartPole        Dense             Failure or timeout           500
Taxi-v3         Semi-sparse       Drop-off or timeout          200
CliffWalking    Sparse            Goal/cliff/timeout           200
Highway-v0      Dense             Collision/off-road/timeout    40
Table 2. Comparison of our method with model-free RL, in terms of final validation reward over a single episode.

Method   MountainCar   Acrobot   Taxi-v3   Highway-v0   CliffWalking   CartPole
Ours     −168.87       −171.8    16.67     18.89        −177.56        200.0
RL       −181.45       −207.23   7.56      17.23        −251.65        200.0
