Benefits of Using LLM for Long-Term Planning with Distilled Subtask Model Compared to End-to-End Reinforcement Learning in the MiniGrid Simulator

Pluškoski, Aleksandar; Ciganović, Igor; Jovanović, Miloš; Vasiljević, Jelena

doi:10.3390/electronics15091921


by Aleksandar Pluškoski ¹,*, Igor Ciganović ¹, Miloš Jovanović ²,³ and Jelena Vasiljević

¹ School of Computing, University Union, 11000 Belgrade, Serbia

² Technical Test Center, Ministry of Defense, 11000 Belgrade, Serbia

³ Faculty of Information Technology, Belgrade Metropolitan University, 11000 Belgrade, Serbia

* Author to whom correspondence should be addressed.

Electronics 2026, 15(9), 1921; https://doi.org/10.3390/electronics15091921

Submission received: 28 March 2026 / Revised: 20 April 2026 / Accepted: 29 April 2026 / Published: 1 May 2026

(This article belongs to the Special Issue Machine Learning and Cognitive Robotics)


Abstract

Policy learning under delayed reward conditions remains a significant challenge for end-to-end reinforcement learning (RL) agents. The difficulty increases for problems that require long-term planning and the execution of multiple dependent subtasks. As a result, solutions based on a single monolithic policy often suffer from unstable training. One possible solution to this problem could be to delegate the long-term planning to a separate model. This paper presents an implementation comprising two models: a large language model (LLM) for long-term planning and an execution model that solves subtasks. The execution model was trained via distillation from multiple teacher models trained with RL on individual tasks. The results presented in this paper demonstrate the benefits of this approach. By delegating long-term planning to the LLM, the agent can solve more complex problems than end-to-end agents trained with the proximal policy optimization (PPO) algorithm.

Keywords: machine learning (ML); artificial intelligence (AI); deep learning; reinforcement learning; neural networks; MiniGrid; large language model; model distillation

1. Introduction

Reinforcement learning (RL) is a branch of machine learning (ML) concerned with learning optimal behavior through interaction with the environment [1]. In RL, the agent interacts with the environment and chooses actions according to its policy, which is typically represented by a neural network. In return, it receives feedback from the environment in the form of a reward that reflects the quality of the agent’s behavior. Unlike supervised learning, where training relies on a predefined dataset, RL agents generate their own data by interacting with the environment through trial and error. This paradigm has been successfully applied to a wide range of problems. However, RL methods are particularly sensitive to reward design, and their performance degrades significantly in environments with infrequent or delayed rewards [2].

The solution presented in this paper evaluates the advantages of extracting the long-term planning task into a separate model, specifically, a large language model (LLM). The execution of the plan provided by the LLM is then handled by the “execution” model trained using a combination of a standard RL algorithm and distillation. The results are then compared to a classical end-to-end RL model. The proposed architecture separates long-term planning and execution: the LLM operates at a higher level by decomposing a complex task into a series of subtasks, while the execution model handles the execution of each task given by the LLM. Every sub-task is relatively simple and can be solved by a classical RL algorithm. This modular approach is then compared against an end-to-end RL agent in the MiniGrid (v3.0.0) environment. Performance is evaluated in terms of task execution success and robustness in environments that require long-term planning.

LLMs [3] have recently emerged as models capable of capturing complex patterns in sequential data. Built primarily on transformer networks, LLMs are trained on massive amounts of text data and have shown strong performance on tasks such as natural language understanding, reasoning, planning, and instruction following. Recent work has shown that LLMs can go beyond pure language tasks and serve as high-level planners and decision-makers [4]. These capabilities make LLMs a promising candidate for handling long-term planning in sequential decision-making problems, which require reasoning about long-term goals and breaking them down into smaller, simpler steps.

Knowledge distillation is a transfer learning technique that uses a “teacher” model to generate a dataset, which is then used to train a “learner” model in a supervised manner [5]. The learner model is trained to replicate the teacher’s knowledge as closely as possible. Instead of learning from interactions with the environment, the distilled model learns from the teacher’s output. This allows it to inherit the teacher’s knowledge and, in some cases, surpass it while being more efficient and transparent [6]. This is possible because the supervised learning format provides stricter, more precise optimization targets. It has been shown that, by learning from rollouts (datasets) generated by teachers trained on different tasks, a single learner model can learn all of those tasks [7]. This is particularly interesting in situations where achieving the same result with a single RL agent would require a considerably more complex model or be impossible.

The MiniGrid environment is a popular 2D RL benchmark designed to test algorithms that require navigation, interaction with objects on the map, and long-term planning [8]. It provides a simple framework for creating environments of varying complexity. Tasks in a MiniGrid environment often require an agent to perform multiple dependent subtasks, for example, opening a box, finding a key, unlocking a door, and reaching a goal, before receiving any reward. This makes it particularly suitable for studying sparse-reward problems whose solutions require long-term planning. Because of these features, MiniGrid has become a widely used benchmark for evaluating hierarchical reinforcement learning, modular policies, and planning-based approaches. The environment is defined as a grid of fixed dimensions, where each square can contain an object, and each object has a color. Objects and colors are represented by the integer codes shown in Table 1 and Table 2.
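As a rough illustration of this encoding, the sketch below decodes a single grid cell stored as an (object, color, state) triple of integer codes. The mappings are reproduced from what is understood to be MiniGrid's `minigrid.core.constants` module and should be treated as an assumption; Table 1 and Table 2 remain the authoritative reference for the codes used in this paper. The helper `describe_cell` is hypothetical and not part of MiniGrid.

```python
# Assumed object, color, and door-state codes, mirroring minigrid.core.constants.
# Verify against Table 1 and Table 2 before relying on them.
OBJECT_TO_IDX = {
    "unseen": 0, "empty": 1, "wall": 2, "floor": 3, "door": 4, "key": 5,
    "ball": 6, "box": 7, "goal": 8, "lava": 9, "agent": 10,
}
COLOR_TO_IDX = {"red": 0, "green": 1, "blue": 2, "purple": 3, "yellow": 4, "grey": 5}
STATE_TO_IDX = {"open": 0, "closed": 1, "locked": 2}

# Reverse lookups: integer code -> name.
IDX_TO_OBJECT = {idx: name for name, idx in OBJECT_TO_IDX.items()}
IDX_TO_COLOR = {idx: name for name, idx in COLOR_TO_IDX.items()}
IDX_TO_STATE = {idx: name for name, idx in STATE_TO_IDX.items()}


def describe_cell(cell):
    """Turn an (object, color, state) triple into a short readable label."""
    obj_idx, color_idx, state_idx = cell
    name = IDX_TO_OBJECT[obj_idx]
    if name == "door":
        # Only doors carry a meaningful open/closed/locked state.
        return f"{IDX_TO_STATE[state_idx]} {IDX_TO_COLOR[color_idx]} door"
    if name in ("key", "ball", "box"):
        return f"{IDX_TO_COLOR[color_idx]} {name}"
    return name


label = describe_cell((4, 4, 2))
```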

The solution presented in this paper also provides a comparison of different distillation variants. It demonstrates how complex problems can be solved using RL, by delegating long-term planning to the LLM and using the distilled RL model to perform already defined, simpler tasks. By doing this, the stability of the training is increased, and scalability and robustness are improved.

The papers [9,10,11] demonstrate the potential of applying LLMs to reinforcement learning. In accordance with previous work [12], the goal was to minimize the model complexity. This was achieved through problem decomposition, which enabled the design of a modular agent.

The paper [10] presents a solution that uses an LLM to augment the training process. The RL model is guided through training to achieve the global goal by completing an array of sub-goals. The LLM is used to define those sub-goals at the start of the episode. Given the computational cost of using the LLM, the paper [10] proposes implementing a surrogate model trained to approximate the LLM output during RL training. While this removes the LLM from the inference phase, it introduces considerable overhead during the training. Automatic discovery of the sub-goals is an appealing but, in the end, futile pursuit in this example. The RL agent’s neural network requires predictable textual input (in one of its inputs). In the solution presented in [10], the cosine similarity is used to select a predefined sub-goal based on the LLM’s output. This approach could allow for more flexible sub-goals, but at the expense of not fully leveraging the PPO algorithm’s learning potential. In contrast, the solution presented in this paper employs the LLM during the inference phase and thus greatly simplifies training and transfers the cost of invoking the LLM to the phase where it is less impactful.

The solution presented in the paper [9] highlights the differences among open-source LLMs and does not use a reinforcement learning algorithm. The LLM is presented with a detailed scene description, including the precise coordinates of the relevant objects and the current reward. Learning happens through the conversation history: the LLM is tasked with interpreting the success of its actions based on the provided reward. The LLM’s output is essentially a direct command to the environment, and the execution module merely maps the LLM’s textual response to the actual command code. The paper shows significantly better results when using the “lama-2.3” model fine-tuned on the Korean language. The solution proposed in this paper replaces the execution module with an RL agent. Since the RL agent can interpret significantly more complex commands, the LLM is not overburdened by direct control. This, in turn, allows the LLM to focus on high-level planning, where LLMs show the most potential.

The paper [7] presents a mechanism for transferring knowledge from multiple teacher models to a single agent using distillation. This aligns with the goal of keeping the agent relatively simple. Additionally, the same teacher models could be used to create an agent with a mixture-of-experts architecture [13].

2. Related Work

RL addresses sequential decision-making problems by learning a policy that maximizes the total reward through interaction with the environment, typically modeled as a Markov decision process (MDP) [1]. Despite many successes, end-to-end RL algorithms remain unstable when faced with complex multi-step tasks. Since the reward is received after the agent performs a long sequence of actions, learning becomes unstable, often preventing convergence. These limitations have led to the development of alternative paradigms that perform better when long-term planning is required [14,15].

Curriculum learning is one of those paradigms. The curriculum is designed by defining a set of skills the agent should learn to solve the problem effectively. These skills are ordered by difficulty, and the agent is guided to learn them one by one, with difficulty gradually increasing, thus allowing the agent to acquire simple skills before tackling more complex goals [16]. In RL, it has been shown that this type of learning increases efficiency and convergence speed, especially in environments with sparse rewards [17]. Although curriculum learning solves some training problems, it still relies on a single policy that is taught increasingly complex behaviors, and it does not encourage modularity or reuse of learned subskills.

Knowledge distillation in neural networks is another. It is a technique in which a model is trained to replicate the behavior of a teacher model, thereby transferring knowledge more effectively [5]. First, the teacher model is trained using a reinforcement learning algorithm. The teacher is then used to generate a dataset by recording its interactions with the environment. That dataset is later used to train a new model in a supervised manner. Because the optimisation target is much stronger and clearer, the learner can be simpler, with fewer parameters, and achieve better performance. In RL, distillation is used to transfer the policy, improving efficiency and stability by training the agent in a supervised manner rather than a noisy environment [18]. Distillation using multiple teacher models, each trained on a different task, enables knowledge compression by producing a single model that combines all the teachers’ skills [19].

Introducing LLMs into reinforcement learning has recently attracted significant attention as a means of incorporating high-level reasoning and semantic understanding into decision-making systems [9,10,11,20]. Language is a natural tool for reasoning, so it is unsurprising that LLMs complement RL agents well in the domains of planning and reasoning. LLMs are based on the transformer architecture and are trained on large amounts of text using self-supervised learning objectives [3]. Transformers have demonstrated a superior ability to model sequential data compared with earlier techniques. Outside of LLMs, transformers are used in RL when the decision is based on a sequence of states [21,22]. There are also attempts to remove the RL model completely and use only an LLM for reinforcement learning [23]. Here, the LLM acts as an agent and, with specifically crafted prompts, is guided to explore the environment by issuing natural-language commands. The knowledge accumulated during the learning phase is kept in the LLM context.

The solution presented in this paper seeks a middle ground between incorporating the LLM into the RL algorithm itself [10] and issuing commands at maximum granularity [9], where the LLM outputs the same low-level commands that a human controller would. Instead, we rely on an RL agent to learn to solve relatively simple problems on its own and use an LLM only to plan the solution to the more complex problem. The LLM should issue commands such as picking up a specific object or unlocking a door, and the RL agent should be able to find a route to that object, open any unlocked doors along the route, and so on.

3. Methodology

3.1. Environment

This work extends the MiniGrid environment to implement a controllable, custom setup that allows for variable mission complexity, predetermined object-placement rules, and textual scene descriptions for LLM prompt generation. Mission complexity is primarily determined by the number of rooms on the map and the number of objects in each room. The map size was fixed to an 11 × 11 matrix. Due to size constraints, the maximum number of rooms was set to four. The only deviation from the default environment’s behaviour and reward structure was the use of the “done” command to signal the end of sub-task execution instead of the global goal. An example of the running environment is shown in Figure 1.

Doors between the rooms could be configured to always be unlocked or open during training. Doors have a color and, if locked, can be unlocked only with a key of the same color. Keys can be hidden in boxes, so to unlock a door the agent first needs to find the key by opening boxes and then use it on the door. The RL model used proved unable to learn this full sequence, although it did learn to open a door that was already unlocked. Therefore, all doors between rooms were left unlocked during training. This caused a problem later: the agent was confused by locked doors because it had never seen them during training. The issue was solved by introducing randomly placed doors inside the rooms, rather than between them.

The number of objects could be varied. The rules for object placement were developed to avoid blocking the agent, the goal (green tile), or the doors. Also, the keys were placed so that the agent could always reach the goal.

For the LLM to reason about the solution to the problem posed by the environment, a textual representation of the map was needed. This was incorporated into scene generation in the environment itself. An example of the output is presented in Table 3, together with the answer generated by the LLM in Table 4. The aim was to convey all the information needed to devise a plan for a given mission: the number of rooms, the condition of the doors between rooms, the objects in those rooms, and the overall mission. The amount of information was reduced to the essentials, which proved beneficial for the LLM’s performance.
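A minimal sketch of how such a description could be assembled is shown below. The `Room` and `Door` classes, the `describe_scene` function, and the exact wording of the output are all hypothetical; the format actually produced by the environment is the one shown in Table 3.

```python
from dataclasses import dataclass, field


@dataclass
class Room:
    name: str
    objects: list = field(default_factory=list)  # e.g. ["blue box", "goal"]


@dataclass
class Door:
    color: str
    state: str       # "open", "closed", or "locked"
    connects: tuple  # names of the two rooms the door joins


def describe_scene(rooms, doors, agent_room, mission):
    """Build a compact, essentials-only text description for the planner."""
    lines = [
        f"Mission: {mission}",
        f"The map has {len(rooms)} rooms.",
        f"The agent is in {agent_room}.",
    ]
    for room in rooms:
        contents = ", ".join(room.objects) if room.objects else "nothing"
        lines.append(f"{room.name} contains: {contents}.")
    for door in doors:
        first, second = door.connects
        lines.append(f"A {door.state} {door.color} door connects {first} and {second}.")
    return "\n".join(lines)


# Illustrative two-room scene (not taken from the paper's experiments).
rooms = [Room("room 1", ["blue box"]), Room("room 2", ["goal"])]
doors = [Door("yellow", "locked", ("room 1", "room 2"))]
text = describe_scene(rooms, doors, "room 1", "get to the goal")
```

Keeping the description to one short sentence per room and per door follows the observation above that reducing the prompt to essentials helped the LLM.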

3.2. Proximal Policy Optimization

Proximal Policy Optimization (PPO) [24] was used as the reinforcement learning algorithm. Q-learning [25] was also tested, but it performed worse than PPO. The Stable-Baselines3 (v2.7.0) [26] implementation of the algorithm was used with custom extractors and policy networks. PPO is a successor to the trust region policy optimization (TRPO) algorithm [27] and was introduced to address TRPO’s computational complexity. Both algorithms restrict how far the new policy can move from the old one, as measured by the Kullback–Leibler (KL) divergence. TRPO enforces this constraint using second-order (Hessian-based) computation, which is expensive, whereas PPO uses simple clipping to approximate the KL divergence constraint. Both are policy gradient methods: they compute an estimator of the policy gradient and use it in a standard gradient ascent algorithm. The policy gradient estimator has the following form:

\hat{g} = \hat{E}_{t} \left[ \nabla_{θ} \log π_{θ}(a_{t} \mid s_{t}) \, \hat{A}_{t} \right]

(1)

where π_{θ} is the current policy parameterised by θ, a_{t} is the action taken at time step t, s_{t} is the environment state at time step t, \hat{A}_{t} is the value of the advantage function at time step t, and \hat{E}_{t} is the empirical average over a batch of samples. Policy gradients can be arbitrarily large, which can lead to destructively large policy updates. Trust region methods are used to limit this update. PPO simplifies this by using a simple clipping method in the following form:

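L(θ) = \hat{E}_{t} \left[ \min \left( \frac{π_{θ}(a_{t} \mid s_{t})}{π_{θ_{old}}(a_{t} \mid s_{t})} \hat{A}_{t}, \; \operatorname{clip} \left( \frac{π_{θ}(a_{t} \mid s_{t})}{π_{θ_{old}}(a_{t} \mid s_{t})}, 1 - ϵ, 1 + ϵ \right) \hat{A}_{t} \right) \right]

(2)

where ϵ is a small constant representing the clipping factor and π_{θ_{old}} is the policy before the update. Note that the clip is applied to the probability ratio alone, and the clipped ratio is then multiplied by the advantage. PPO is an on-policy algorithm: the policy is synchronized between the actor, which generates trajectories, and the learner, which performs the optimization step. This further reduces the deviation between the policy used to generate the trajectory and the one used during the optimization step.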
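The sketch below computes the standard PPO clipped surrogate for a small batch, in which the clip is applied to the probability ratio and the result is multiplied by the advantage. It uses plain Python rather than the Stable-Baselines3 implementation used in the experiments; the function name `clipped_surrogate` and the default clipping factor of 0.2 are assumptions made for illustration only.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """Batch average of min(r * A, clip(r, 1 - eps, 1 + eps) * A).

    logp_new / logp_old are log-probabilities of the taken actions under the
    current and the old policy; r is their probability ratio.
    eps=0.2 is an assumed, commonly used value, not one taken from the paper.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)
        clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
        total += min(ratio * adv, clipped_ratio * adv)
    return total / len(advantages)


# A ratio of 1.5 with a positive advantage is capped at 1 + eps = 1.2.
capped = clipped_surrogate([math.log(1.5)], [0.0], [1.0])
# With a negative advantage the unclipped (more pessimistic) term is kept.
uncapped = clipped_surrogate([math.log(1.5)], [0.0], [-1.0])
```

Taking the minimum of the two terms means the objective never rewards moving the ratio further outside the interval [1 - ϵ, 1 + ϵ], which is how the clipping limits the size of a policy update.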

3.2.1. Policy

The custom network used for the experiment had three inputs in the extractor stage. This was determined from the environment’s output, as shown in Table 5. The direction input represents the player’s orientation, the image parameter matrix represents the player’s viewport, and the mission vector contains the tokenized string of the current goal.

The extractor consisted of three networks, one for each of the inputs. The outputs of these networks were concatenated and used as a shared input for the policy and value heads. The network architecture is shown in Figure 2. The outputs of the network are policy, a probability distribution over the environment’s action space, and value, the learned value of the current state. Additional model details are shown in Appendix A.3.

3.2.2. Hyperparameter Search

Reinforcement learning algorithms are very sensitive to variations in hyperparameter values. As it is practically impossible to optimize hyperparameters manually, sane defaults were used as a starting point [28] for the “Hydra” (v1.3.2) population-based sweeper [29]. The Population-Based Bandit (PB2) algorithm was used. This algorithm uses a Bayesian model to determine the best mutation for the top-performing agents in the population. This allows for greater sample efficiency and a smaller population than in standard population-based training. The search ranges for the individual parameters used in the sweeper are provided in Appendix A.1. It was observed that training became unstable in the later stages. This was the result of too many update steps during the training. To remedy this issue, the training process was divided into multiple runs, gradually reducing the number of epochs per training step. The parameter values for the different training stages are provided in Appendix A.2.

3.3. Policy Distillation

Policy distillation is a transfer learning method whose goal is to transfer the teachers’ knowledge into a learner model. This is usually done to reduce model size or to compress knowledge from multiple teachers into a single learner [7,30]. The learner is trained to produce a policy that mimics the policy of one or more teachers. The solution presented in this paper uses multiple teacher models, each trained with the PPO algorithm to solve a single sub-task. The sub-tasks were predefined and fixed. The teacher models are used to generate a dataset consisting of environment states and the policies output by the teachers for those states. The collected dataset covers all of the sub-tasks available in the environment. The learner is then trained on this dataset in a supervised manner, using the teachers’ policy distributions, not the sampled actions, as targets. The loss function is the KL divergence between the policies output by the teacher and learner models. The final model is expected to generalize to different, previously unseen environment configurations. Two options were tested: training a new model from randomly initialized weights, and using distillation as a fine-tuning step for the pretrained RL model. The results are presented and discussed in the next section.
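The loss described above can be sketched as follows. This is a plain-Python illustration, not the training code used in the experiments; the function names are hypothetical, and the direction of the divergence, KL(teacher || learner), is an assumption, since the text does not state it.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of action logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


def kl_divergence(teacher_probs, learner_probs, tiny=1e-12):
    """KL(teacher || learner) for one state (assumed direction)."""
    return sum(
        p * math.log(p / max(q, tiny))
        for p, q in zip(teacher_probs, learner_probs)
        if p > 0.0  # terms with zero teacher probability contribute nothing
    )


def distillation_loss(batch_teacher_probs, batch_learner_logits):
    """Mean KL divergence over a batch of (teacher policy, learner logits)."""
    losses = [
        kl_divergence(teacher, softmax(logits))
        for teacher, logits in zip(batch_teacher_probs, batch_learner_logits)
    ]
    return sum(losses) / len(losses)


# A learner that reproduces the teacher's distribution has (near) zero loss.
teacher = softmax([1.0, 2.0, 3.0])
matched = distillation_loss([teacher], [[1.0, 2.0, 3.0]])
mismatched = distillation_loss([[0.5, 0.5]], [[0.0, 2.0]])
```

Because the target is the full distribution over actions, each training example also carries the teacher's relative preference among the actions it did not take, which is information a sampled action alone would not provide.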

Careful consideration was required during the sub-goal design process. The RL teacher models would be trained to solve those sub-goals. They needed to be simple enough to solve using the PPO algorithm and clear enough for the LLM to use. Additionally, the sub-goals had to be robust enough to allow for any possible mission in the given environment to be solved. The PPO models were allowed to learn as much as they could. The point was not to replace the RL model with the LLM, but to augment it with long-term reasoning skills. Through experimentation, we arrived at four sub-goals that proved enough given the earlier considerations. Those sub-goals were “go to <object>”, “go to goal”, “toggle <object>” (used for opening doors and boxes), and “pick up <object>”.
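Because the execution model only understands these four sub-goal forms, it can be useful to validate each line of the LLM's plan before passing it on. The sketch below is one hypothetical way to do this with regular expressions; the function `parse_subgoal` and its return format are illustrative assumptions and are not taken from the published source code.

```python
import re

# The four sub-goal forms listed above. "go to goal" must be checked before
# the generic "go to <object>" pattern, otherwise "goal" is read as an object.
SUBGOAL_PATTERNS = [
    ("go to goal", re.compile(r"^go to goal$")),
    ("go to", re.compile(r"^go to (?P<object>.+)$")),
    ("toggle", re.compile(r"^toggle (?P<object>.+)$")),
    ("pick up", re.compile(r"^pick up (?P<object>.+)$")),
]


def parse_subgoal(text):
    """Return (kind, object) for a valid sub-goal, or raise ValueError."""
    # Normalize case and collapse repeated whitespace.
    normalized = " ".join(text.strip().lower().split())
    for kind, pattern in SUBGOAL_PATTERNS:
        match = pattern.match(normalized)
        if match:
            return kind, match.groupdict().get("object")
    raise ValueError(f"not a known sub-goal: {text!r}")


first = parse_subgoal("pick up yellow key")
last = parse_subgoal("Go to goal")
```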

3.4. Large Language Model

As stated earlier, the LLM was used to decompose a goal set by the environment into an array of predefined sub-goals based on the environment’s scene description. The aim was to create a self-contained solution that did not rely on external services, so an open-source, self-hosted model was used. Several models were tested, and “qwen3:30b” gave the best results. The testing process was relatively simple: the LLM was given a set of predefined prompts (scene descriptions) with user-defined solutions and was expected to provide the same or a comparable response. A comparison of the tested open-source models is shown in Table 6.

During the testing (inference) phase, the environment would output the current state and a scene description. This description would be provided to the LLM to define a plan. This happens only in the first step of the episode. The LLM is not prompted in later steps if the plan already exists. The sub-goals constituting the plan are then executed sequentially by the distilled RL model. The RL model is trained to perform the “done” action when the sub-goal is finished as a signal to switch to the next one. The system prompt used to instruct the LLM is provided in Appendix B, Table A4.
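The control flow described above can be sketched as follows. Everything here is a simplified stand-in: `planner`, `executor`, and `step_env` are hypothetical callables representing the LLM, the distilled RL model, and the environment, and the toy corridor exists only so that the sketch runs. In particular, the sketch intercepts the “done” action in the loop instead of forwarding it to the environment, which is a simplification.

```python
DONE = "done"


def run_episode(scene_description, observation, planner, executor, step_env,
                max_steps=100):
    """Plan once, then execute the sub-goals one after another."""
    plan = planner(scene_description)  # the planner is queried only once
    plan_index = 0
    total_reward = 0.0
    for _ in range(max_steps):
        if plan_index >= len(plan):
            break  # every sub-goal has been signalled as finished
        action = executor(observation, plan[plan_index])
        if action == DONE:
            plan_index += 1  # switch to the next sub-goal
            continue
        observation, reward, finished = step_env(action)
        total_reward += reward
        if finished:
            break
    # plan_index is the sub-goal that was active when the episode ended.
    return total_reward, plan_index


# Toy stand-ins: a corridor in which the agent walks right to target cells.
TARGETS = {"go to key": 2, "go to goal": 4}
state = {"pos": 0}


def toy_planner(scene):
    return ["go to key", "go to goal"]


def toy_executor(observation, subgoal):
    return DONE if observation == TARGETS[subgoal] else "forward"


def toy_step(action):
    state["pos"] += 1
    finished = state["pos"] == 4
    return state["pos"], (1.0 if finished else 0.0), finished


reward, plan_index = run_episode("corridor of 5 cells", 0,
                                 toy_planner, toy_executor, toy_step)
```

Since the planner is called before the loop and never again, the cost of invoking the LLM is paid once per episode regardless of how many environment steps follow.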

Additionally, an end-to-end model was trained solely using the PPO algorithm, without an LLM for planning. The results are compared in the next section.

4. Experimental Results

The PPO agents, which served as experts in the distillation process, were trained first. Additionally, another PPO model (the “all” model) was trained on all of the problems simultaneously, producing a single model capable of solving all subtasks. This was done to provide a potential starting point for the distillation training and to compare its performance with that of the specialized experts. Expert agents were first trained from randomly initialized models; then the “all” model was fine-tuned on a single task. This proved beneficial, as the “pick up” and “toggle” agents showed significant improvement. However, the “go to goal” and “go to object” agents showed no improvement and in some cases performed worse. The improved results are shown in Table 7.

The PPO agents were trained over multiple consecutive runs. Figure 3 shows the average reward over 100 episodes at each training step for the “all” model. Training was repeated until satisfactory results were achieved, with different problems requiring different numbers of runs. Some hyperparameters were gradually reduced to stabilize training in later stages, as shown in Appendix A.2. The “all” model stopped improving after 8 runs. In Figure 3, the smoothed curve for model “all/7” lies above that for “all/6”, but in testing the two achieved identical results.

Figure 4 shows the same type of graph for the fine-tuning of the “pick up” model. As expected, the model starts with some skill at solving the presented problem and continues to improve. It can be observed that generalized knowledge acquired during training across all problems simultaneously allows it to achieve better results after focusing on a single task than when training only on that task. Also, fine-tuning on a single task after training on all of them further improves performance on that task, even though the continuation of the training on all tasks showed no improvement.

Figure 3 and Figure 4 show learning progress through consecutive training runs. Expectedly, it can be observed that the performance gain is larger in the beginning and gradually plateaus in the later runs. The last runs in both examples show almost no improvement. This was used as a signal to stop further training.

Table 8 shows the performance of the trained models across all problems. Expectedly, “go to goal” and “go to object” models score 0% in problems they were never trained on. Interestingly, “pick up” and “toggle” models also show no skill in other problems, even though they were fine-tuned from the “all” model.

After the experts were trained, they were used to generate the dataset for the distillation training. As stated earlier, the dataset contained an environment state and the policy produced by the expert for that given state. The distilled models were trained to imitate the expert policy. The distillation was done in two variants, starting from the randomly initialized model and starting from the “all” model. Table 9 shows the results.

Interestingly, distillation starting from the randomly initialized model yields significantly worse results than PPO alone, whereas distillation starting from the PPO model yields further improvement.

The distilled PPO model was selected as the system’s execution model, with the LLM used for planning. Further testing was conducted on complex problems that required multiple steps to solve. The LLM would decompose the global goal into an array of sub-goals, which the distilled PPO model would then execute. As a baseline, another PPO model was trained to solve complex problems, similarly to the “all” model, but with the environment not limited to missions solvable by a single expert. Table 10 shows the results.

The results shown in Table 10 are, as expected, lower than those in Table 9 because each step in a multi-step solution has some probability of failure, and these probabilities compound when the steps are executed sequentially. Due to the number of testing episodes, a detailed analysis of the failure causes was not feasible. However, in the first 10 episodes (Table 11), which were inspected individually, the cause of failure was always the execution model. It should be noted that a perfect score of 100% is not achievable. The score is calculated using the formula:

R = 1 - 0.9 \left( \frac{\text{num\_steps}}{\text{max\_steps}} \right)

(3)

This reduces the reward for each action the agent performs. Table 10 shows the average reward received over 1000 episodes, while Table 11 presents the rewards received in the first 10 episodes. Here, it can be observed that episodes requiring simple solutions show similar results, but episodes requiring multi-step solutions prove unsolvable by the PPO agent without the LLM.
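Two small calculations make these points concrete. The first function evaluates the score formula for a successful episode; returning zero for a failed episode follows MiniGrid's default behaviour and is an assumption here. The second shows how per-step success rates compound over a multi-step plan, under the simplifying assumption that the steps fail independently. The numbers used are illustrative and are not taken from Table 10 or Table 11.

```python
def episode_reward(num_steps, max_steps, success=True):
    """Score for one episode: 1 - 0.9 * (num_steps / max_steps) on success.

    A failed episode is assumed to score 0 (MiniGrid's default behaviour).
    """
    if not success:
        return 0.0
    return 1.0 - 0.9 * (num_steps / max_steps)


def chained_success(step_success_rates):
    """Probability that every sub-goal in a plan succeeds.

    Assumes the sub-goals fail independently of one another.
    """
    probability = 1.0
    for rate in step_success_rates:
        probability *= rate
    return probability


# Solving a mission in 20 of 100 allowed steps scores 0.82, not 1.0.
score = episode_reward(20, 100)
# Four sub-goals that each succeed 95% of the time succeed together ~81%.
plan_success = chained_success([0.95] * 4)
```

Since any successful episode takes at least one step, the score is always strictly below 1, which is why a perfect result cannot appear in the tables.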

Additionally, Table 10 shows the individual contributions of knowledge distillation and of the modular design. The “PPO” model is the non-distilled variant of the hierarchical model; compared with the “Distilled PPO” model, it shows significantly lower results. The contribution of the modular design can be seen by comparing “Distilled PPO” with “No Language Model”, the latter being the end-to-end, monolithic model. Again, the hierarchical model shows better results.

The source code for the implemented solution can be found at the address given in the Supplementary Materials section. A video demonstrating the execution of the proposed solution can be found at the YouTube link provided in the Supplementary Materials.

5. Discussion

The main insight of this work was the effectiveness of delegating planning tasks to an LLM and the improvement in RL agent performance through knowledge distillation. Further improvement may be possible by combining multiple distillation stages with reinforcement learning. The RL agent would learn basic skills, and distillation would then make the agent more robust, providing a better starting point for further skill acquisition using RL.

Despite all the advantages, the proposed solution has several weaknesses. The performance of the entire system depends on the selection and design of the teacher models (experts). If the required skill is not covered by the teacher agents, the resulting agent would be missing that skill. Similar limitations also exist in the modular and skill-based RL, where generalization is constrained by the available set of skills [31]. Although large language models exhibit strong planning capabilities, they can still generate inconsistent or suboptimal plans in unfamiliar environments, a challenge noted in recent LLM-based research [32].

The new results suggest several directions for further improvement. An important problem to solve is the automatic detection of subtasks, thereby reducing dependence on manual design. Previous work in skill discovery and option learning suggests that sub-tasks can be identified using unsupervised or weakly supervised learning [33]. Integrating these techniques with the solution presented in this paper would improve scalability and generalization.

Another weakness of this solution is the connection between the LLM and the execution model. As it stands, the LLM prompt is defined manually in parallel with sub-task design. Proposed improvements to sub-task discovery could be combined with a learned prompt [23] that would organically evolve during the agent’s training.

This modular approach provides an additional benefit: the ability to easily replace the planning module (the LLM) with a more capable one. Since the LLM is completely removed from the training process, replacing (improving) it is simply a matter of changing a single configuration parameter. One possible improvement would be the use of a multimodal LLM, which would make it unnecessary to modify the environment to provide a textual representation of the current state.

A natural continuation of this line of research would be an application to swarm robotics. In an additional study, experiments with a drone simulator were conducted. However, at the time of writing, the results remain inconclusive. Extending the solution to scenarios involving multiple drones [34] represents a promising direction for future work. In particular, the design of a swarm coordination layer that connects LLM-based planning with the low-level execution of individual drones warrants further investigation.

Another interesting area of research is the application of these techniques to wireless sensor networks (WSNs) [35]. Traditional WSNs operate under rigid protocols that struggle to accommodate the dynamic, heterogeneous conditions characteristic of real-world deployments. By embedding RL agents at the network edge, individual sensor nodes can learn optimal policies for resource allocation, duty cycling, and routing through interaction with their environment. The integration of LLMs introduces a layer of high-level reasoning and contextual understanding that could overcome the limitations of conventional signal processing. LLMs can interpret content from aggregated sensor streams, translate complex environmental observations, and allow for natural-language-driven network reconfiguration. This combined approach allows for a self-organizing WSN architecture in which RL manages low-level, reactive control tasks, while LLMs handle high-level, strategic decision-making.

This work shows that using an LLM for planning and a distilled RL model for sub-task execution offers a scalable and efficient alternative to end-to-end RL. Distillation, as a basic learning mechanism, addresses the challenge of learning multiple independent tasks, and the modular approach addresses the sparse-reward problem, enabling the creation of more robust, interpretable, and efficient RL systems.

6. Conclusions

The solution presented in this paper shows the advantages of decoupling the planning unit from the execution unit. Combining the LLM with the RL model shows significantly better performance than a monolithic model could achieve. Traditional end-to-end RL solutions have proven to struggle in environments with sparse rewards. In contrast, the proposed architecture separates planning and execution, enabling the execution model to learn in a dense-reward environment.

The results show that knowledge distillation further improves RL agents, making them more robust and less susceptible to input noise. Training the teacher policies independently simplifies learning, stabilizes training, and yields better performance on specialized tasks, which in turn improves the final agent’s performance. This allows the system to scale more easily to more complex problems than traditional end-to-end RL solutions.

A comparison with related studies [9,10] revealed their key limitations. The solution in [9] decomposed the problem into highly granular subtasks, thereby placing an unnecessary burden on the LLM and failing to leverage the PPO algorithm’s learning capability. The results reported there show that the LLM cannot effectively solve the problem presented; this was remedied in [9] by using an LLM fine-tuned for the Korean language. The solution presented in this paper achieves better results and is language-independent.

The solution presented in [10] proposed an alternative strategy, incorporating an LLM directly into the RL training process. Although this integration introduces additional flexibility and potential performance gains, it comes at a significantly higher computational cost. Building on prior work [12], the present study instead reduced the complexity of both the agent architecture and the training procedure. This design choice results in substantially faster training cycles and improved accessibility, primarily by reducing hardware requirements, while achieving comparable results.

Taken as a whole, these comparisons show that the key contribution of the proposed approach is not any single component, but how the responsibilities are split between the LLM and the RL agent. Methods such as curriculum learning and knowledge distillation are useful for organizing training and compressing behavior, but they stay entirely within the RL framework and do not address higher-level reasoning or planning. Conversely, approaches that rely heavily on LLMs tend to lose the responsiveness and sample efficiency that make RL effective in dynamic environments.

The proposed approach strikes a balance between these two directions. The LLM is used only for high-level, object-oriented planning, while the actual execution is handled by a distilled RL agent that is both robust and efficient. This separation allows each part to do what it does best, resulting in a system that performs better than either component would on its own. It is this clear division of roles that sets the method apart from previous work and represents its main contribution.

In general, the results show that delegating planning to an LLM and allowing the RL models to focus solely on control and execution yield promising results compared to monolithic RL agents. The hybrid approach improves robustness, performance, and interpretability in complex environments that require structural, multilevel reasoning.

Supplementary Materials

The source code for the solution presented in this paper can be found at https://github.com/Idokorro/MiniGrid-RL (accessed on 20 March 2026), and the video demonstrating the results at https://www.youtube.com/watch?v=EfYVCSxA6Gw (accessed on 20 March 2026).

Author Contributions

Conceptualization, I.C. and A.P.; methodology, I.C. and A.P.; software, I.C. and A.P.; data curation, I.C. and A.P.; investigation, I.C. and A.P.; writing—original draft preparation, I.C. and A.P.; writing—review and editing, I.C., A.P., M.J. and J.V.; supervision, M.J. and J.V. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

A publicly available LLM was used in this study. This LLM can be found at https://ollama.com/library/qwen3 (accessed on 17 December 2025), and the MiniGrid environment can be found at https://minigrid.farama.org/index.html (accessed on 17 December 2025).

Acknowledgments

During the preparation of this manuscript, the authors used “ChatGPT” (GPT-5.2) for the purposes of spelling and grammar correction. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

RL	Reinforcement Learning
LLM	Large Language Model
PPO	Proximal Policy Optimization
ML	Machine Learning
AI	Artificial Intelligence
MDP	Markov Decision Process
TRPO	Trust Region Policy Optimization
KL	Kullback–Leibler
PB2	Population-Based Bandit
WSN	Wireless Sensor Network

Appendix A

Appendix A.1

Table A1 shows the search ranges for the individual parameters used during population-based training to find optimal hyperparameter values.

Table A1. Search space for population-based training.

Parameter	Type	Range
initial_learning_rate	uniform_float	0.000001–0.01
final_learning_rate	uniform_float	0.000001–0.01
batch_size	categorical	[32, 64, 128, 256]
gamma	uniform_float	0.8–0.9997
horizon	categorical	[256, 512, 1024, 2048]
n_epochs	uniform_float	4–10
gae_lambda	uniform_float	0.9–1.0
clip_range	categorical	[0.1, 0.2, 0.3]
clip_range_vf	uniform_float	0.0–1.0
normalize_advantage	categorical	[true, false]
ent_coef	uniform_float	0.0001–0.1
vf_coef	uniform_float	0.5–1.0
max_grad_norm	uniform_float	0.1–1.0
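As an illustration of how the search space in Table A1 can be expressed and sampled, a minimal sketch follows. It only draws one random configuration, as would be done for an initial population member; the bandit-based perturbation step of the hyperparameter search is not shown. The names `SEARCH_SPACE` and `sample_config` are hypothetical, and rounding `n_epochs` to an integer is an assumption, since Table A1 lists it as `uniform_float`.

```python
import random

# Search space transcribed from Table A1.
SEARCH_SPACE = {
    "initial_learning_rate": ("uniform_float", 0.000001, 0.01),
    "final_learning_rate": ("uniform_float", 0.000001, 0.01),
    "batch_size": ("categorical", [32, 64, 128, 256]),
    "gamma": ("uniform_float", 0.8, 0.9997),
    "horizon": ("categorical", [256, 512, 1024, 2048]),
    "n_epochs": ("uniform_float", 4, 10),
    "gae_lambda": ("uniform_float", 0.9, 1.0),
    "clip_range": ("categorical", [0.1, 0.2, 0.3]),
    "clip_range_vf": ("uniform_float", 0.0, 1.0),
    "normalize_advantage": ("categorical", [True, False]),
    "ent_coef": ("uniform_float", 0.0001, 0.1),
    "vf_coef": ("uniform_float", 0.5, 1.0),
    "max_grad_norm": ("uniform_float", 0.1, 1.0),
}


def sample_config(rng: random.Random) -> dict:
    """Draw one configuration from SEARCH_SPACE."""
    config = {}
    for name, spec in SEARCH_SPACE.items():
        if spec[0] == "categorical":
            config[name] = rng.choice(spec[1])
        else:
            config[name] = rng.uniform(spec[1], spec[2])
    # Assumption: the number of epochs is used as an integer by the trainer.
    config["n_epochs"] = int(round(config["n_epochs"]))
    return config


example = sample_config(random.Random(0))
```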

Appendix A.2

Table A2 shows the different parameter values used across the multiple runs. Each run was 20 million steps, with each step consisting of 16,384 samples. The number of epochs in each run is given in Table A2.

Table A2. Parameter variations for multiple runs.

Model	Run	Nb. of Epochs	Initial Learning Rate	Final Learning Rate
Go to goal	1	6	0.001	0.00003
Go to goal	2	4	0.0003	0.000003
Pick up	1	9	0.001	0.00003
Pick up	2	9	0.0003	0.000003
Pick up	3	9	0.0003	0.000003
Pick up	4	9	0.0003	0.000003
Pick up	5	6	0.0002	0.000003
Pick up *	2	4	0.0003	0.000003
Pick up *	3	4	0.0003	0.000003
Pick up *	4	4	0.0003	0.000003
Go to object	1	7	0.001	0.00003
Go to object	2	7	0.0003	0.000003
Go to object	3	4	0.0003	0.000003
Toggle	1	4	0.001	0.00003
Toggle	2	4	0.0003	0.000003
Toggle	3	4	0.0003	0.000003
Toggle	4	4	0.0003	0.000003
Toggle *	4	4	0.0003	0.000003
Toggle *	4	4	0.0003	0.000003
All	1	6	0.001	0.00003
All	2	6	0.0003	0.000003
All	3	6	0.0003	0.000003
All	4	4	0.0003	0.000003
All	5	4	0.0003	0.000003
All	6	4	0.0003	0.000003
All	7	4	0.0002	0.000003
No language model	1	6	0.001	0.00003
No language model	2	6	0.001	0.00003
No language model	3	4	0.0002	0.000003

* Fine-tuning, starting from the “All” model.
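Table A2 lists an initial and a final learning rate for each run, but this section does not state how the rate moves between the two values. The sketch below assumes a linear interpolation, which is an assumption and not confirmed by the source. It follows the convention of Stable-Baselines3 [26], where a learning-rate schedule is a callable that receives the remaining training progress, going from 1 at the start to 0 at the end. The function name `linear_schedule` is introduced here for illustration.

```python
from typing import Callable


def linear_schedule(initial_lr: float, final_lr: float) -> Callable[[float], float]:
    """Build a schedule that moves linearly from initial_lr to final_lr.

    Assumption: linear decay; the source only gives the two end points.
    """

    def schedule(progress_remaining: float) -> float:
        # progress_remaining is 1.0 at the first step and 0.0 at the last one.
        return final_lr + progress_remaining * (initial_lr - final_lr)

    return schedule


# End points of run 1 of the "Go to goal" model in Table A2.
go_to_goal_run1 = linear_schedule(0.001, 0.00003)
```

Under this assumption, the rate halfway through the run would be the midpoint of the two values, 0.000515.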

Appendix A.3

The model consisted of an extractor and policy networks. The extractor would ingest raw input data and provide embeddings for the policy network, which would in turn output a policy distribution for the provided input and an expected value (actor and critic policies).

The extractor network had three sub-modules, one for each of the three inputs. The module for the textual input (the current mission) had two layers: an embedding layer that accepted 32 characters as input and produced an embedding vector with 1024 dimensions (32 per character), followed by a gated recurrent unit with 128 neurons. The direction input module had a single linear layer with 16 neurons. The structure of the image module is shown in Table A3.

The policy network had 2 outputs: policy and value. Both had 1 linear layer with 64 neurons.

Table A3. The image sub-module structure from the extractor network.

Layer	Layer Type	Num. Kernels	Strides
1	Conv2d	16	2,2
2	ReLU	/	/
3	MaxPool2d	/	2
4	Conv2d	32	2,2
5	ReLU	/	/
6	Conv2d	64	2,2
7	ReLU	/	/
8	Flatten	/	/

Appendix B

Table A4 shows the system prompt used by the LLM.

Table A4. LLM system prompt.

You are giving instructions to a robot.
Available instructions are: “go to”, “pick up”, “toggle”, “go to goal”.
Available objects are: “key”, “door”, “ball”, “box”.
Available colors are: “red”, “green”, “yellow”, “blue”, “purple”, “grey”.

Give a simple answer, consisting only of available instructions, colors and objects.
Don’t look for alternative routes.
Give your answer in a step by step format. Try to have as little as possible steps in your answer.
If the door is locked, a key of the same color is needed. Doors are unlocked by toogling them while holding the key.
Keys can only toggle locked doors of the same color. Robot must first pick up a key before toggling the door.
Boxes do not need keys to be toggled.
Consider everything directly accessible unless it is in another room.
Box can contain other objects with the same color. If the object you are looking for is not present but there is a box, try opening the box.
Make sure to reference only objects present in the scene.
Do not tell the robot to go into the room or to open the unlocked door.

Examples:

The scene contains:

-: green ball
-: yellow ball
-: blue key
-: purple door
-: yellow door
-: green key
-: goal

Mission: go to goal’

Answer:
1. go to goal

The scene contains:

-: green ball
-: purple ball
-: red box
-: purple box
-: green box
-: blue key

Mission: pick up purple box’

Answer:
1. pick up purple box

The scene contains:
Two rooms. Left and right.
There is a locked grey door between the rooms
Left room contains:

-: goal
-: green ball
-: yellow key

Right room contains:

-: robot
-: grey box
-: red key
-: grey ball

Mission: go to goal

Answer:
1. toggle grey box
2. pick up grey key
3. toggle grey door
4. go to goal

References

1. Sutton, R.S.; Barto, A.G. Reinforcement Learning. In Reinforcement Learning: An Introduction, 2nd ed.; The MIT Press: Cambridge, MA, USA, 2018; pp. 2–5.
2. Rengarajan, D.; Vaidya, G.; Sarvesh, A.; Kalathil, D.; Shakkottai, S. Reinforcement Learning with Sparse Rewards using Guidance from Offline Demonstration. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual, 25–29 April 2022.
3. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Proceedings of the 31st Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017.
4. Parthasarathy, B.V.; Zafar, A.; Khan, A.; Shahid, A. The Ultimate Guide to Fine-Tuning LLMs from Basics to Breakthroughs: An Exhaustive Review of Technologies, Research, Best Practices, Applied Research Challenges and Opportunities. arXiv 2024.
5. Hinton, G.; Vinyals, O.; Dean, J. Distilling the Knowledge in a Neural Network. arXiv 2015.
6. Li, P.; Siddique, U.; Cao, Y. Symbolic Policy Distillation for Interpretable Reinforcement Learning. In Proceedings of the 39th Conference on Neural Information Processing Systems (NeurIPS 2025), San Diego, CA, USA, 2–7 December 2025.
7. Liu, S.; Lever, G.; Wang, Z.; Merel, J.; Eslami, S.M.A.; Hennes, D.; Czarnecki, W.M.; Tassa, Y.; Omidshafiei, S.; Abdolmaleki, A.; et al. From Motor Control to Team Play in Simulated Humanoid Football. arXiv 2021.
8. Chevalier-Boisvert, M.; Dai, B.; Towers, M.; de Lazcano, R.; Willems, L.; Lahlou, S.; Pal, S.; Castro, P.S.; Terry, J. Minigrid & Miniworld: Modular & Customizable Reinforcement Learning Environments for Goal-Oriented Tasks. arXiv 2023, arXiv:2306.13831.
9. Park, B.-J.; Yong, S.-J.; Hwang, H.-S.; Moon, I.-Y. Optimizing Agent Behavior in the MiniGrid Environment Using Reinforcement Learning Based on Large Language Models. Appl. Sci. 2025, 15, 1860.
10. Ruiz-Gonzalez, U.; Andres, A.; Del Ser, J. Large Language Models for Structured Task Decomposition in Reinforcement Learning Problems with Sparse Rewards. Mach. Learn. Knowl. Extr. 2025, 7, 126.
11. Ahn, M.; Brohan, A.; Brown, N.; Chebotar, Y.; Cortes, O.; David, B.; Finn, C.; Fu, C.; Gopalakrishnan, K.; Hausman, K.; et al. Do As I Can, Not As I Say: Grounding Language in Robotic Affordances. arXiv 2022.
12. Ciganović, I.; Pluškoski, A.; Jovanović, M. Smart Autonomous Vehicle—One Proposed Realisation. Robot. Manag. 2020, 25, 9–14.
13. Ciganović, I.; Pluškoski, A.; Jovanović, M.; Vasiljević, J. Evaluation of Multi-Model Architecture Against Single-Model PPO in the MiniGrid Environment. Electronics 2026, submitted.
14. Arjona-Medina, A.J.; Gillhofer, M.; Widrich, M.; Unterthiner, T.; Brandstetter, J.; Hochreiter, S. RUDDER: Return Decomposition for Delayed Rewards. arXiv 2018.
15. Mu, J.; Zhong, V.; Raileanu, R.; Jiang, M.; Goodman, N.; Rocktäschel, T.; Grefenstette, E. Improving Intrinsic Exploration with Language Abstractions. arXiv 2022.
16. Bengio, Y.; Louradour, J.; Collobert, R.; Weston, J. Curriculum Learning. In Proceedings of the 26th International Conference on Machine Learning, Montreal, QC, Canada, 14–18 June 2009.
17. Narvekar, S. Curriculum Learning in Reinforcement Learning. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI-17), Melbourne, Australia, 19–25 August 2017.
18. Rusu, A.A.; Colmenarejo, G.S.; Gulcehre, C.; Desjardins, G.; Kirkpatrick, J.; Pascanu, R.; Mnih, V.; Kavukcuoglu, K.; Hadsell, R. Policy Distillation. arXiv 2015.
19. Liu, Y.; Zhang, W.; Wang, J. Adaptive Multi-Teacher Multi-level Knowledge Distillation. arXiv 2021.
20. Ewald, D.; Rogowski, F.; Suśniak, M.; Bartkowiak, P.; Blumensztajn, P. Exploring the Cognitive Capabilities of Large Language Models in Autonomous and Swarm Navigation Systems. Electronics 2026, 15, 35.
21. Chen, X.; Lv, H.; Yin, L.; Fang, J. Multi-Agent Collaboration for 3D Human Pose Estimation and Its Potential in Passenger-Gathering Behavior Early Warning. Electronics 2026, 15, 78.
22. Tang, C.; Liu, Y.; Wu, Y.; Han, W.; Yin, Q.; Zheng, X.; Zeng, W.; Zhang, Q. MoE-World: A Mixture-of-Experts Architecture for Multi-Task World Models. Electronics 2025, 14, 4884.
23. Hu, L.; Huo, M.; Zhang, Y.; Yu, H.; Xing, E.P.; Stoica, I.; Rosing, T.; Jin, H.; Zhang, H. LMGAME-BENCH: How Good are LLMs at Playing Games? arXiv 2025.
24. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, A. Proximal Policy Optimization Algorithms. arXiv 2017.
25. Watkins, C.J.C.H.; Dayan, P. Q-learning. Mach. Learn. 1992, 8, 279–292.
26. Raffin, A.; Hill, A.; Gleave, A.; Kanervisto, A.; Ernestus, M.; Dormann, N. Stable-Baselines3: Reliable Reinforcement Learning Implementations. J. Mach. Learn. Res. 2021, 22, 1–8.
27. Schulman, J.; Levine, S.; Moritz, P.; Jordan, M.I.; Abbeel, P. Trust Region Policy Optimization. arXiv 2015.
28. Andrychowicz, M.; Raichuk, A.; Stańczyk, P.; Orsini, M.; Girgin, S.; Marinier, R.; Hussenot, L.; Geist, M.; Pietquin, O.; Michalski, M.; et al. What Matters in On-Policy Reinforcement Learning? A Large-Scale Empirical Study. arXiv 2020.
29. Eimer, T.; Lindauer, M.; Raileanu, R. Hyperparameters in Reinforcement Learning and How to Tune Them. arXiv 2023.
30. Czarnecki, W.M.; Pascanu, R.; Osindero, S.; Jayakumar, S.M.; Swirszcz, G.; Jaderberg, M. Distilling Policy Distillation. arXiv 2019.
31. Andreas, J.; Klein, D.; Levine, S. Modular Multitask Reinforcement Learning with Policy Sketches. arXiv 2016.
32. Wang, G.; Xie, Y.; Jiang, Y.; Mandlekar, A.; Xiao, C.; Zhu, Y.; Fan, L.; Anandkumar, A. Voyager: An Open-Ended Embodied Agent with Large Language Models. arXiv 2023.
33. Florensa, C.; Held, D.; Geng, X.; Abbeel, P. Automatic Goal Generation for Reinforcement Learning Agents. arXiv 2017.
34. Jovanović, M.; Vasiljević, J.; Lazić, D.; Medenica, I.; Nagamalai, D. Swarm Robotics in the Military Operations: Challenges and Opportunities. In Proceedings of the 6th International Conference on NLP & Information Retrieval (NLPI 2025), Vienna, Austria, 15–16 March 2025.
35. Medenica, I.; Jovanović, M.; Vasiljević, J.; Radulović, N.; Lazić, D. Optimization of Delay Time in ZigBee Sensor Networks for Smart Home Systems Using a Smart-Adaptive Communication Distribution Algorithm. Electronics 2025, 14, 3127.

Figure 1. A screenshot of the environment during the experiment. Objects are color coded, and objects with the same color relate to one another: the green key unlocks the green door, and the purple box contains the purple key. The green square is the goal. The red triangle is the player. The area shaded in light gray is the player’s field of view.

Figure 2. The neural network architecture.

Figure 3. The training timeline for the “all” model. The Y axis shows the average reward over one hundred episodes, and the X axis shows the training step.

Figure 4. The training timeline for the “pick up” model fine-tuned from the “all” model. The Y axis shows the average reward over one hundred episodes, and the X axis shows the training step.

Table 1. Color codes in the MiniGrid environment.

Color	Code
red	0
green	1
blue	2
purple	3
yellow	4
grey	5

Table 2. Object codes in the MiniGrid environment.

Object	Code
unseen	0
empty	1
wall	2
floor	3
door	4
key	5
ball	6
box	7
goal	8
lava	9
agent	10
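To illustrate how the codes in Tables 1 and 2 can be turned into the kind of textual scene description passed to the LLM, a minimal sketch follows. The function names, the set of object types that are reported, and the input format (a list of object-code and color-code pairs) are assumptions made for this example; the actual conversion in the published source code may differ.

```python
# Code-to-name mappings transcribed from Table 1 and Table 2.
COLOR_NAMES = {0: "red", 1: "green", 2: "blue", 3: "purple", 4: "yellow", 5: "grey"}
OBJECT_NAMES = {
    0: "unseen", 1: "empty", 2: "wall", 3: "floor", 4: "door", 5: "key",
    6: "ball", 7: "box", 8: "goal", 9: "lava", 10: "agent",
}
# Assumption: only these object types are worth mentioning to the planner.
DESCRIBED = {"door", "key", "ball", "box", "goal"}


def describe_cell(object_code, color_code):
    """Return a phrase such as 'green key', or None for cells that are skipped."""
    name = OBJECT_NAMES[object_code]
    if name not in DESCRIBED:
        return None
    if name == "goal":
        # The goal is referred to without a color in the scene descriptions.
        return "goal"
    return f"{COLOR_NAMES[color_code]} {name}"


def describe_scene(cells):
    """Render (object_code, color_code) pairs in the list style of Table 3."""
    lines = ["The scene contains:", ""]
    for object_code, color_code in cells:
        phrase = describe_cell(object_code, color_code)
        if phrase is not None:
            lines.append(f"-: {phrase}")
    return "\n".join(lines)


scene_text = describe_scene([(6, 1), (2, 5), (5, 4), (8, 1)])
```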

Table 3. Textual representation of the generated scene.

The scene contains:
Two rooms. Left and right.
There is a locked grey door between the rooms
Left room contains:

-: goal
-: green ball
-: yellow key

Right room contains:

-: robot
-: grey box
-: red key
-: grey ball

Mission: go to goal

Table 4. Response provided by the LLM.

1. toggle grey box
2. pick up grey key
3. toggle grey door
4. go to goal
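A response in this format has to be split into individual subtasks before it can be handed to the execution model. The following minimal sketch parses the numbered steps into instruction, color, and object, using the vocabularies from the system prompt in Table A4. The names `Subtask`, `parse_step`, and `parse_plan` are hypothetical, and rejecting any line outside the vocabulary is a design choice of this example, not a documented behavior of the original implementation.

```python
import re
from typing import List, NamedTuple, Optional

# Vocabularies taken from the system prompt (Table A4).
COLORS = ("red", "green", "yellow", "blue", "purple", "grey")
OBJECTS = ("key", "door", "ball", "box")
INSTRUCTIONS_WITH_TARGET = ("go to", "pick up", "toggle")

STEP_RE = re.compile(r"^\s*\d+\.\s*(.+?)\s*$")


class Subtask(NamedTuple):
    instruction: str
    color: Optional[str]
    obj: Optional[str]


def parse_step(line: str) -> Optional[Subtask]:
    """Parse one numbered line; return None if it is not a valid step."""
    match = STEP_RE.match(line)
    if match is None:
        return None
    text = match.group(1).lower()
    if text == "go to goal":
        return Subtask("go to goal", None, None)
    for instruction in INSTRUCTIONS_WITH_TARGET:
        if text.startswith(instruction + " "):
            rest = text[len(instruction):].split()
            if len(rest) == 2 and rest[0] in COLORS and rest[1] in OBJECTS:
                return Subtask(instruction, rest[0], rest[1])
    return None


def parse_plan(response: str) -> List[Subtask]:
    """Parse a whole response, raising ValueError on any unrecognized line."""
    steps = []
    for line in response.splitlines():
        if not line.strip():
            continue
        step = parse_step(line)
        if step is None:
            raise ValueError(f"unrecognized step: {line!r}")
        steps.append(step)
    return steps


plan = parse_plan("1. toggle grey box\n2. pick up grey key\n3. toggle grey door\n4. go to goal")
```

Failing loudly on an unknown line makes it visible when the planner references an instruction or object outside the allowed vocabulary, instead of silently skipping a step.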

Table 5. Single environment state.

Value	Type
direction	uint8
image	array(3, 7, 7), uint8
mission	array(32), int64

Table 6. LLM benchmark.

Model	Match (Out of 14)
Qwen3:30b (selected)	14
llama3:8b	9
deepseek-r1:32b	11
gemma2:27b	13

Table 7. Results achieved by training a PPO agent to solve a specific problem starting from a randomly initialized model and from the model trained on all of the problems simultaneously.

Model	Result
Pick up	57%
Pick up (fine-tune)	68%
Toggle	47%
Toggle (fine-tune)	65%

Table 8. Results achieved by the expert models during training, measured across all problems.

Model	Go to Goal	Go to Object	Pick Up	Toggle
Go to goal	86%	0%	0%	0%
Go to object	0%	72%	0%	0%
Pick up	0%	0%	68%	0%
Toggle	0%	0%	0%	65%
All	75%	65%	59%	58%

Table 9. Results achieved by the different models, measured across all problems simultaneously.

Model	Result
PPO All	65%
Distillation	56%
Distillation from PPO	67%

Table 10. Results achieved by the different models, measured on complex problems.

Model	Result
PPO	33%
Distilled PPO	55%
No Language Model	43%

Table 11. The reward received in the first 10 episodes.

No Language Model	Distillation
0	0.7322314
0.9553719163	0.9479339
0	0.7247934
0	0
0.7247933745	0.7563426
0.8289256096	0.7523325
0	0.57603306
0.8661156893	0
0	0
0.9553719163	0.9628099
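The per-episode rewards in Table 11 can be summarized with a few lines of code. The sketch below computes the mean reward and the number of episodes with a non-zero reward for each column. Treating a non-zero reward as a solved episode is an assumption of this example, and the function name `summarize` is introduced for illustration.

```python
from statistics import mean

# Rewards transcribed from Table 11 (first 10 episodes).
no_language_model = [
    0, 0.9553719163, 0, 0, 0.7247933745,
    0.8289256096, 0, 0.8661156893, 0, 0.9553719163,
]
distillation = [
    0.7322314, 0.9479339, 0.7247934, 0, 0.7563426,
    0.7523325, 0.57603306, 0, 0, 0.9628099,
]


def summarize(rewards):
    """Return the mean reward and the count of episodes with reward > 0."""
    solved = sum(1 for reward in rewards if reward > 0)
    return {"mean_reward": mean(rewards), "solved": solved, "episodes": len(rewards)}


summary_nlm = summarize(no_language_model)
summary_distilled = summarize(distillation)
```

For these ten episodes, the mean rewards are approximately 0.43 without the language model and 0.55 with the distilled model, with 5 and 7 non-zero episodes, respectively. These means coincide with the 43% and 55% reported in Table 10, although this section does not state whether those percentages were computed in this way.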

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
