1. Introduction
Reinforcement learning (RL) is a branch of machine learning (ML) concerned with learning optimal behavior through interaction with the environment [1]. In RL, the agent interacts with the environment and chooses actions according to its policy (a neural network). As a result, it receives feedback from the environment in the form of a reward that represents the quality of the agent’s behavior. Unlike supervised learning, where training relies on a predefined data set, RL agents generate their own data by interacting with the environment through trial and error. This paradigm has been successfully applied to a wide range of problems. However, RL methods are particularly sensitive to reward design, and their performance degrades significantly in environments with infrequent or delayed rewards [2].
This paper evaluates the advantages of extracting the long-term planning task into a separate model, specifically a large language model (LLM). The proposed architecture separates long-term planning from execution: the LLM operates at a higher level by decomposing a complex task into a series of sub-tasks, while an “execution” model, trained using a combination of a standard RL algorithm and distillation, carries out each sub-task given by the LLM. Every sub-task is relatively simple and can be solved by a classical RL algorithm. This modular approach is compared against a classical end-to-end RL agent in the MiniGrid (v3.0.0) environment. Performance is evaluated in terms of task execution success and robustness in environments that require long-term planning.
LLMs [3] have recently emerged as models capable of capturing complex patterns in sequential data. Built primarily on transformer networks, LLMs are trained on massive amounts of text data and have shown strong performance on tasks such as natural language understanding, reasoning, planning, and following instructions. Recent work has shown that LLMs can go beyond pure language tasks, serving as high-level planners and decision-makers [4]. These capabilities make LLMs a promising candidate for handling long-term planning in sequential decision-making problems, which require reasoning about long-term goals and breaking them down into smaller, simpler steps.
Knowledge distillation is a transfer learning technique that uses a “teacher” model to generate a dataset, which is then used to train a “learner” model in a supervised manner [5]. The learner model is trained to replicate the teacher’s knowledge as closely as possible. Instead of learning from interactions with the environment, the distilled model learns from the teacher’s output. This allows it to inherit the teacher’s knowledge and, in some cases, surpass it while being more efficient and transparent [6]. This is possible because the supervised learning format provides stricter, more precise optimization targets. It has been shown that, by learning from rollouts (datasets) generated by teachers trained on different tasks, the learner model can learn all of those tasks [7]. This is particularly interesting in situations where achieving the same result with a single RL agent would require a considerably more complex model or be impossible.
The MiniGrid environment is a popular 2D RL benchmark designed to test algorithms that require navigation, object interaction on the map, and long-term planning [8]. It provides a simple framework for creating environments of varying complexity. Tasks in a MiniGrid environment often require an agent to complete multiple dependent steps, for example, opening a box, finding keys, unlocking a door, and reaching a goal, before eventually receiving any reward. This makes it particularly suitable for studying sparse-reward problems whose solutions require long-term planning. Because of these features, MiniGrid has become a widely used benchmark for evaluating hierarchical reinforcement learning, modular policies, and planning-based approaches. The environment is defined as a grid of fixed dimensions, where each square can contain an object, and each object has a color. Objects and colors are represented by the codes shown in Table 1 and Table 2.
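As an illustration of this encoding, the sketch below decodes a single grid cell given as an (object, color, state) triple. The code tables are an assumption based on the constants shipped with the MiniGrid library and may differ from Table 1 and Table 2, which remain authoritative. The helper `decode_cell` is hypothetical and not part of MiniGrid.

```python
# Assumed code tables, mirroring the constants shipped with MiniGrid.
OBJECT_CODES = {
    0: "unseen", 1: "empty", 2: "wall", 3: "floor", 4: "door", 5: "key",
    6: "ball", 7: "box", 8: "goal", 9: "lava", 10: "agent",
}
COLOR_CODES = {0: "red", 1: "green", 2: "blue", 3: "purple", 4: "yellow", 5: "grey"}
DOOR_STATES = {0: "open", 1: "closed", 2: "locked"}


def decode_cell(cell):
    """Turn an (object, color, state) triple into a short description."""
    obj_code, color_code, state_code = cell
    name = OBJECT_CODES[obj_code]
    if name in ("unseen", "empty"):
        return name  # these cells carry no meaningful color
    color = COLOR_CODES[color_code]
    if name == "door":
        # Only doors use the third channel (open / closed / locked).
        return f"{DOOR_STATES[state_code]} {color} door"
    return f"{color} {name}"
```

For example, under these assumed tables the triple `(4, 4, 2)` would be read as a locked yellow door.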
The solution presented in this paper also provides a comparison of different distillation variants. It demonstrates how complex problems can be solved using RL, by delegating long-term planning to the LLM and using the distilled RL model to perform already defined, simpler tasks. By doing this, the stability of the training is increased, and scalability and robustness are improved.
The papers [9,10,11] demonstrate the potential of applying LLMs to reinforcement learning. In accordance with previous work [12], the goal was to minimize the model complexity. This was achieved through problem decomposition, which enabled the design of a modular agent.
The paper [
10] presents a solution that uses an LLM to augment the training process. The RL model is guided through training to achieve the global goal by completing an array of sub-goals. The LLM is used to define those sub-goals at the start of the episode. Given the computational cost of using the LLM, the paper [
10] proposes implementing a surrogate model trained to approximate the LLM output during RL training. While this removes the LLM from the inference phase, it introduces considerable overhead during the training. Automatic discovery of the sub-goals is an appealing but, in the end, futile pursuit in this example. The RL agent’s neural network requires predictable textual input (in one of its inputs). In the solution presented in [
10], the cosine similarity is used to select a predefined sub-goal based on the LLM’s output. This approach could allow for more flexible sub-goals, but at the expense of not fully leveraging the PPO algorithm’s learning potential. In contrast, the solution presented in this paper employs the LLM during the inference phase and thus greatly simplifies training and transfers the cost of invoking the LLM to the phase where it is less impactful.
The solution presented in the paper [9] highlights the differences among open-source LLMs and does not use a reinforcement learning algorithm. The LLM is presented with a detailed scene description, including precise coordinates of the relevant objects and the current reward. Learning happens through the conversation history: the LLM is tasked with interpreting the success of its actions based on the provided reward. The LLM’s output is essentially a direct command to the environment, and the execution module merely remaps the LLM’s textual response into the actual command code. The paper shows significantly better results when utilizing the “lama-2.3” model fine-tuned on the Korean language. The solution proposed in this paper replaces the execution module with an RL agent. Since the RL agent can interpret significantly more complex commands, the LLM is not overburdened by direct control. This, in turn, allows the LLM to focus on high-level planning, where LLMs show the most potential.
The paper [7] presents a mechanism for transferring knowledge from multiple teacher models to a single agent using distillation. This aligns with the goal of keeping the agent relatively simple. Additionally, it was possible to use those teacher models to create an agent with a mixture-of-experts architecture [13].
2. Related Work
RL addresses sequential decision-making problems by learning a policy that maximizes the total reward through interaction with the environment, typically modeled as a Markov decision process (MDP) [
1]. Despite many successes, end-to-end RL algorithms remain unstable when faced with complex multi-step tasks. Since the reward is received after the agent performs a long sequence of actions, learning becomes unstable, often preventing convergence. These limitations have led to the development of alternative paradigms that perform better when long-term planning is required [
14,
15].
Curriculum learning is one of those paradigms. The curriculum is designed by defining a set of skills the agent should learn to solve the problem effectively. These skills are ordered by difficulty, and the agent is guided to learn them one by one, allowing it to acquire simple skills before tackling more complex goals [16]. In RL, it has been shown that this type of learning increases efficiency and convergence speed, especially in environments with sparse rewards [17]. Although curriculum learning solves some training problems, it still relies on a single policy that learns increasingly complex behaviors and does not encourage modularity or reuse of learned subskills.
Knowledge distillation in neural networks is another such paradigm. It is a technique in which a model is trained to replicate the behavior of a teacher model, thereby transferring knowledge more effectively [5]. First, the teacher model is trained using a reinforcement learning algorithm. The teacher is then used to generate a dataset by recording its interactions with the environment. That dataset is later used to train a new model in a supervised manner. Because the optimization target is much stronger and clearer, the learner can be simpler, with fewer parameters, and still achieve better performance. In RL, distillation is used to transfer the policy, improving efficiency and stability by training the agent in a supervised manner rather than through noisy interaction with the environment [18]. Distillation using multiple teacher models, each trained on a different task, enables knowledge compression by producing a single model that combines all the teachers’ skills [19].
Introducing LLMs into reinforcement learning has recently attracted significant attention as a means of incorporating high-level reasoning and semantic understanding into decision-making systems [9,10,11,20]. Because language is a natural tool for reasoning, LLMs complement RL agents well in the domains of planning and reasoning. LLMs are based on the transformer architecture and trained on large amounts of text using self-supervised learning objectives [3]. Transformers have demonstrated a superior ability to model sequential data compared to previous techniques. Outside of LLMs, transformers are used in RL when the decision is based on a sequence of states [21,22]. There are also attempts to remove the RL model completely and use only an LLM for reinforcement learning [23]. Here, the LLM acts as an agent and, with specifically crafted prompts, is guided to explore the environment by issuing natural-language commands. The knowledge accumulated during the learning phase is kept in the LLM context.
The solution presented in this paper seeks a middle ground between incorporating the LLM into the RL algorithm itself [10] and issuing commands at maximum granularity [9], where the LLM outputs the same low-level commands that a human controller would. Instead, we rely on an RL agent to learn to solve relatively simple problems on its own and use an LLM only to plan the solution to the more complex problem. The LLM issues commands such as picking up a specific object or unlocking a door, while the RL agent is expected to find a route to that object, open any unlocked doors on that route, and so on.
3. Methodology
3.1. Environment
This work extends the MiniGrid environment to implement a controllable, custom setup that allows for variable mission complexity, predetermined object-placement rules, and textual scene descriptions for LLM prompt generation. Mission complexity is primarily determined by the number of rooms on the map and the number of objects in each room. The map size was fixed to an 11 × 11 matrix. Due to size constraints, the maximum number of rooms was set to four. The only deviation from the default environment’s behaviour and reward structure was the use of the “done” command to signal the end of sub-task execution instead of the global goal. An example of the running environment is shown in
Figure 1.
Doors between the rooms could be set to always be unlocked or open during training. Doors have a color and, if locked, can be unlocked only by a key of the same color. The keys can be hidden in boxes, so to unlock a door the agent would first need to find the key by opening boxes and then use it on the door. The RL model used proved unable to solve this task, although it did learn to open a door that was already unlocked. Consequently, all doors between rooms were left unlocked during training. This caused a problem later: the agent was confused by locked doors because it had never seen them during training. The issue was solved by introducing randomly placed doors inside the rooms (not between them).
The number of objects could be varied. The rules for object placement were developed to avoid blocking the agent, the goal (green tile), or the doors. Also, the keys were placed so that the agent could always reach the goal.
For the LLM to reason about the solution to the problem posed by the environment, a textual representation of the map was needed. This was incorporated into the scene generation in the environment itself. The example of the output is presented in
Table 3, together with the answer generated by the LLM in
Table 4. The goal was to convey all the information needed to devise a plan to achieve a given mission. The representation should include information about the number of rooms, the condition of the doors between rooms, the objects in those rooms, and the overall mission. The amount of information was reduced only to the essentials. This proved beneficial for the LLM’s performance.
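To make the idea of a minimal textual representation concrete, the following sketch renders a scene description containing the four pieces of information listed above: the mission, the number of rooms, the objects in each room, and the condition of the doors. The function `describe_scene`, its input structures, and the output wording are all hypothetical; the actual format used in the experiments is the one shown in Table 3.

```python
def describe_scene(rooms, doors, mission):
    """Render a compact textual scene description for an LLM prompt.

    rooms:   dict mapping a room name to a list of object names
    doors:   list of (room_a, room_b, color, state) tuples
    mission: the overall mission string
    """
    lines = [f"Mission: {mission}", f"Rooms: {len(rooms)}"]
    for name, objects in rooms.items():
        contents = ", ".join(objects) if objects else "nothing"
        lines.append(f"{name} contains: {contents}")
    for room_a, room_b, color, state in doors:
        lines.append(f"A {state} {color} door connects {room_a} and {room_b}")
    return "\n".join(lines)
```

Keeping the description to one short line per room and per door reflects the observation above that reducing the prompt to the essentials benefited the LLM’s performance.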
3.2. Proximal Policy Optimization
For the reinforcement learning algorithm, Proximal Policy Optimization (PPO) [24] was used. We also tested Q-learning [25], but it performed worse than PPO. The Stable-Baselines3 (v2.7.0) [26] implementation of the algorithm was used with custom extractors and policy networks. PPO is a successor to the trust region policy optimization (TRPO) algorithm [27] and was introduced to address TRPO’s computational complexity. Both algorithms use the trust-region method to bound the Kullback–Leibler (KL) divergence between the old and new policies. While TRPO computes the Hessian matrix, which is computationally very expensive, PPO uses simple clipping to approximate the KL divergence constraint. Both are policy gradient methods that work by computing an estimator of the policy gradient and using it in a regular gradient ascent algorithm. The policy gradient estimator has the following form:
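$$\hat{g} = \hat{\mathbb{E}}_t\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\hat{A}_t\right]$$

where $\pi_\theta$ is the current policy parameterised by $\theta$, $a_t$ is the action taken at time-step $t$, $s_t$ is the environment state at time-step $t$, $\hat{A}_t$ is the value of the advantage function at time-step $t$, and $\hat{\mathbb{E}}_t$ is the empirical average over a batch of samples. Policy gradients can be arbitrarily large. This can lead to destructively large policy updates. Trust region methods are used to limit this update. PPO simplifies this by using a simple clipping method in the following form:

$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right]$$

where $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{old}}(a_t \mid s_t)$ is the probability ratio between the new and old policies and $\epsilon$ is a small constant representing the clipping factor. PPO is an on-policy algorithm. The policy is synchronized between the actor and the learner. This allows for a further reduction in the deviation between the policy used to generate the trajectory and the one used during the optimization step.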
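The effect of the clipping can be illustrated with a short, framework-free sketch of the clipped surrogate objective. This is not the Stable-Baselines3 implementation. The function name `clipped_surrogate` and the default value of `epsilon` are assumptions made for illustration only.

```python
import math


def clipped_surrogate(new_logp, old_logp, advantages, epsilon=0.2):
    """Average clipped surrogate objective over a batch (to be maximized).

    new_logp / old_logp: log-probabilities of the taken actions under the
    new and old policies; advantages: advantage estimates for those actions.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(new_logp, old_logp, advantages):
        ratio = math.exp(lp_new - lp_old)  # probability ratio new / old
        clipped = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
        # Taking the minimum removes the incentive to move the ratio
        # outside the interval [1 - epsilon, 1 + epsilon].
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

With a positive advantage, a ratio above 1 + ϵ contributes no more than (1 + ϵ) times the advantage, so a single batch cannot push the policy arbitrarily far from the one that generated the data.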
3.2.1. Policy
The custom network used for the experiment had three inputs in the extractor stage. This was determined from the environment’s output, as shown in
Table 5. The direction input represents the player’s orientation, the image parameter matrix represents the player’s viewport, and the mission vector contains the tokenized string of the current goal.
The extractor consisted of three networks, one for each of the inputs. The outputs of these networks were concatenated and used as a shared input for the policy and value heads. The network architecture is shown in
Figure 2. The outputs of the network are policy, a probability distribution over the environment’s action space, and value, the learned value of the current state. Additional model details are shown in
Appendix A.3.
3.2.2. Hyperparameter Search
Reinforcement learning algorithms are very sensitive to variations in hyperparameter values. As it is practically impossible to optimize hyperparameters manually, sane defaults were used as a starting point [
28] for the “Hydra” (v1.3.2) population-based sweeper [
29]. The Population-Based Bandit (PB2) algorithm was used. This algorithm uses a Bayesian model to determine the best mutation for the top-performing agents in the population. This allows for greater sample efficiency and a smaller population than in standard population-based training. The search ranges for the individual parameters used in the sweeper are provided in
Appendix A.1. It was observed that training became unstable in the later stages. This was the result of too many update steps during the training. To remedy this issue, the training process was divided into multiple runs, gradually reducing the number of epochs per training step. The parameter values for the different training stages are provided in
Appendix A.2.
3.3. Policy Distillation
Policy distillation is a transfer learning method whose goal is to transfer the teachers’ knowledge into a learner model. This is usually done to reduce model size or to compress knowledge from multiple teachers into a single learner [7,30]. The learner is trained to mimic the policy of a teacher or of several teachers. The solution presented in this paper uses multiple teacher models, each trained with the PPO algorithm to solve a single task. These teacher models are then used to generate a dataset consisting of environment states and the policies output by the teachers for those states. The sub-tasks were predefined and fixed, and the collected dataset covered all of the sub-tasks available in the environment. The learner is then trained on this dataset in a supervised manner, using the teacher policy distributions rather than the sampled actions as targets. The loss function is the KL divergence between the policies output by the teacher and learner models. The final model is expected to generalize to different, previously unseen environment configurations. We tried two options: training a new model from randomly initialized weights, and using distillation as a fine-tuning step for a pretrained RL model. The results are presented and discussed in the next section.
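A minimal sketch of such a distillation loss is given below, using plain Python lists as action distributions. The direction of the divergence, KL(teacher ‖ learner), is an assumption, since the text does not specify it. The function names and the numerical floor `eps` are likewise hypothetical.

```python
import math


def kl_divergence(teacher_probs, learner_probs, eps=1e-12):
    """KL(teacher || learner) for the action distributions of one state."""
    return sum(
        p * math.log(p / max(q, eps))  # floor q to avoid division by zero
        for p, q in zip(teacher_probs, learner_probs)
        if p > 0.0  # terms with p == 0 contribute nothing
    )


def distillation_loss(teacher_batch, learner_batch):
    """Mean KL divergence over a batch of states."""
    pairs = list(zip(teacher_batch, learner_batch))
    return sum(kl_divergence(t, l) for t, l in pairs) / len(pairs)
```

Because the target is the full teacher distribution rather than a single sampled action, the learner also receives information about how strongly the teacher prefers each alternative action.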
Careful consideration was required during the sub-goal design process. The RL teacher models would be trained to solve those sub-goals. They needed to be simple enough to solve using the PPO algorithm and clear enough for the LLM to use. Additionally, the sub-goals had to be robust enough to allow for any possible mission in the given environment to be solved. The PPO models were allowed to learn as much as they could. The point was not to replace the RL model with the LLM, but to augment it with long-term reasoning skills. Through experimentation, we arrived at four sub-goals that proved enough given the earlier considerations. Those sub-goals were “go to <object>”, “go to goal”, “toggle <object>” (used for opening doors and boxes), and “pick up <object>”.
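Since the sub-goal vocabulary is fixed, a plan produced by the LLM can be checked against it before execution. The sketch below shows one possible check based on the four sub-goal templates listed above. The helper names and the use of regular expressions are assumptions and do not describe the implemented system.

```python
import re

# One pattern per sub-goal template; "<object>" is matched as free text.
SUB_GOAL_PATTERNS = [
    re.compile(r"^go to goal$"),
    re.compile(r"^go to (?P<object>.+)$"),
    re.compile(r"^toggle (?P<object>.+)$"),
    re.compile(r"^pick up (?P<object>.+)$"),
]


def is_valid_sub_goal(text):
    """Return True if the text matches one of the four templates."""
    normalized = text.strip().lower()
    return any(pattern.match(normalized) for pattern in SUB_GOAL_PATTERNS)


def invalid_steps(plan):
    """Return the plan steps that do not match any known sub-goal."""
    return [step for step in plan if not is_valid_sub_goal(step)]
```

A step such as "unlock the door" would be rejected by this check, since unlocking is expressed through the "toggle <object>" sub-goal.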
3.4. Large Language Model
As stated earlier, the LLM was used to decompose a goal set by the environment into an array of predefined sub-goals based on the environment’s scene description. The goal was to create a self-contained solution that did not rely on external services. To achieve that, an open-source, self-hosted model was used. We tested several models, and the “qwen3:30b” gave us the best results. The testing process was relatively simple. The LLM was given a set of predefined prompts (scene descriptions) with the user-defined solutions and was expected to provide the same or a comparable response. Comparison of the tested open source models is shown in
Table 6.
During the testing (inference) phase, the environment would output the current state and a scene description. This description would be provided to the LLM to define a plan. This happens only in the first step of the episode. The LLM is not prompted in later steps if the plan already exists. The sub-goals constituting the plan are then executed sequentially by the distilled RL model. The RL model is trained to perform the “done” action when the sub-goal is finished as a signal to switch to the next one. The system prompt used to instruct the LLM is provided in
Appendix B,
Table A4.
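The interplay between the plan and the execution model can be sketched as a simple loop. The plan is obtained once, and the active sub-goal advances whenever the policy emits the “done” action. All names below are hypothetical, and the action index `DONE = 6` is an assumption based on MiniGrid’s action enumeration. As a simplification, the “done” action is handled inside the loop instead of being passed to the environment.

```python
DONE = 6  # assumed index of the "done" action


def run_episode(plan, policy, env_step, observation, max_steps=100):
    """Execute sub-goals in order and return how many were completed.

    plan:        list of sub-goal strings produced once by the planner
    policy:      callable (observation, sub_goal) -> action index
    env_step:    callable (action) -> (observation, episode_finished)
    observation: initial observation of the episode
    """
    completed = 0
    for _ in range(max_steps):
        if completed == len(plan):
            break  # every sub-goal has been signalled as finished
        action = policy(observation, plan[completed])
        if action == DONE:
            completed += 1  # switch to the next sub-goal
            continue
        observation, finished = env_step(action)
        if finished:
            break  # the environment ended the episode
    return completed
```

In this arrangement the planner is never consulted again during the episode, which keeps the cost of invoking the LLM to a single call per episode.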
Additionally, an end-to-end model was trained solely using the PPO algorithm, without an LLM for planning. The results are compared in the next section.
4. Experimental Results
The PPO agents were used as experts in the distillation process. They were trained first. Additionally, another PPO model was trained on all of the problems simultaneously (the “all” model). One model was capable of solving all subtasks. This was done as a potential starting point for the distillation training and in order to compare the performance to that of the specialized experts. First, we trained expert agents from randomly initialized models; then we fine-tuned the “all” model on a single task. This proved beneficial as “pick up” and “toggle” agents showed significant improvement. However, “go to goal” and “go to object” agents showed no improvement, and in some cases even degraded performance. Improved results are shown in
Table 7.
The PPO agents were trained over multiple consecutive runs. Figure 3 shows the average reward over 100 episodes at each training step for the “all” model. Training was repeated until satisfactory results were achieved, with different problems requiring different numbers of training runs. Some hyperparameters were gradually reduced to stabilize training in later stages, as shown in Appendix A.2. The “all” model stopped improving after 8 runs.
Figure 3 shows the smoothed graph of model “all/7” as higher than “all/6”, but in testing, they achieved identical results.
Figure 4 shows the same type of graph for the fine-tuning of the “pick up” model. As expected, the model starts with some skill at solving the presented problem and continues to improve. It can be observed that generalized knowledge acquired during training across all problems simultaneously allows it to achieve better results after focusing on a single task than when training only on that task. Also, fine-tuning on a single task after training on all of them further improves performance on that task, even though the continuation of the training on all tasks showed no improvement.
Figure 3 and
Figure 4 show learning progress through consecutive training runs. Expectedly, it can be observed that the performance gain is larger in the beginning and gradually plateaus in the later runs. The last runs in both examples show almost no improvement. This was used as a signal to stop further training.
Table 8 shows the performance of the trained models across all problems. Expectedly, “go to goal” and “go to object” models score 0% in problems they were never trained on. Interestingly, “pick up” and “toggle” models also show no skill in other problems, even though they were fine-tuned from the “all” model.
After the experts were trained, they were used to generate the dataset for the distillation training. As stated earlier, the dataset contained an environment state and the policy produced by the expert for that given state. The distilled models were trained to imitate the expert policy. The distillation was done in two variants, starting from the randomly initialized model and starting from the “all” model.
Table 9 shows the results.
Interestingly, distillation from the randomly initialized model yields significantly worse results than PPO alone, whereas distillation starting from the PPO-trained “all” model yields a further improvement.
The distilled PPO model was selected as the system’s execution model, with the LLM used for planning. Further testing was conducted on complex problems that required multiple steps to solve. The LLM would decompose the global goal into an array of sub-goals, which the distilled PPO model would then execute. As a benchmark for the results, another PPO model was trained. It was trained to solve complex problems, similarly to the “all” model, but the environment was not limited to missions solvable by a single expert.
Table 10 shows the results.
The results shown in Table 10 are expectedly lower than those in Table 9 because each step in a multi-step solution has some probability of failure, and these probabilities compound when the steps are executed sequentially. Due to the number of testing episodes, a detailed analysis of the failure causes was not feasible. However, in the first 10 episodes (Table 11), which were inspected individually, the cause of failure was always the execution model. It should be noted that a perfect score of 100% is not achievable. The score is calculated using the formula:
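$$R = 1 - 0.9 \cdot \frac{n_{\text{steps}}}{n_{\max}}$$

where $n_{\text{steps}}$ is the number of steps taken in the episode and $n_{\max}$ is the maximum number of steps allowed, as in MiniGrid’s default reward. This reduces the reward for each action the agent performs.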
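Both effects discussed here, the per-step reward penalty and the compounding of failure probabilities, can be illustrated numerically. The sketch below assumes the default MiniGrid-style reward, in which a successful episode earns 1 minus 0.9 times the fraction of the step budget used, and it assumes that sub-goal failures are independent. The function names and the example numbers are hypothetical and are not taken from the reported results.

```python
def episode_reward(steps_taken, max_steps):
    """Assumed MiniGrid-style success reward, decaying with step count."""
    return 1.0 - 0.9 * (steps_taken / max_steps)


def plan_success_probability(per_step_success, n_sub_goals):
    """Chance that every sub-goal succeeds, assuming independent failures."""
    return per_step_success ** n_sub_goals


# A successful episode that uses half of its step budget earns about 0.55.
half_budget_reward = episode_reward(50, 100)

# Three sub-goals, each solved 90% of the time, succeed together about
# 72.9% of the time.
three_step_success = plan_success_probability(0.9, 3)
```

Under these assumptions, even a fairly reliable execution model loses a noticeable fraction of multi-step episodes, and any successful episode that takes at least one step scores below 100%.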
Table 10 shows the average reward received over 1000 episodes, while
Table 11 presents the rewards received in the first 10 episodes. Here, it can be observed that episodes requiring simple solutions show similar results, but episodes requiring multi-step solutions prove unsolvable by the PPO agent without the LLM.
Additionally,
Table 10 demonstrates individual contributions from the knowledge distillation and the modular design. “PPO” model is the non-distilled variant of the hierarchical model. Compared to the “Distilled PPO” it shows significantly lower results. The contribution of the modular design can be seen by comparing the “Distilled PPO” with “No Language Model”, the latter being the end-to-end, monolithic model. Again, the hierarchical model shows better results.
Source code for the implemented solution can be found at the address stated in the Supplementary Materials section. A video demonstrating the execution of the proposed solution can also be found at the YouTube link provided in the Supplementary Materials.
5. Discussion
The main insight of this work was the effectiveness of delegating planning tasks to an LLM and the improvement in RL agent performance through knowledge distillation. Further improvement may be possible by combining multiple distillation stages with reinforcement learning. The RL agent would learn basic skills, and distillation would then make the agent more robust, providing a better starting point for further skill acquisition using RL.
Despite all the advantages, the proposed solution has several weaknesses. The performance of the entire system depends on the selection and design of the teacher models (experts). If the required skill is not covered by the teacher agents, the resulting agent would be missing that skill. Similar limitations also exist in the modular and skill-based RL, where generalization is constrained by the available set of skills [
31]. Although large language models exhibit strong planning capabilities, they can still generate inconsistent or suboptimal plans in unfamiliar environments, a challenge noted in recent LLM-based research [
32].
The new results suggest several directions for further improvement. An important problem to solve is the automatic detection of subtasks, thereby reducing dependence on manual design. Previous work in skill discovery and option learning suggests that sub-tasks can be identified using unsupervised or weakly supervised learning [
33]. Integrating these techniques with the solution presented in this paper would improve scalability and generalization.
Another weakness of this solution is the connection between the LLM and the execution model. As it stands, the LLM prompt is defined manually in parallel with sub-task design. Proposed improvements to sub-task discovery could be combined with a learned prompt [
23] that would organically evolve during the agent’s training.
This modular approach provides an additional benefit: the ability to easily replace the planning module (the LLM) with a more capable one. Since the LLM is completely removed from the training process, replacing (improving) it is simply a matter of changing a single configuration parameter. One possible improvement would be the use of a multimodal LLM, which would make it unnecessary to modify the environment to provide a textual representation of the current state.
A natural continuation of this line of research would be an application to swarm robotics. In an additional study, experiments with a drone simulator were conducted. However, at the time of writing, the results remain inconclusive. Extending the solution to scenarios involving multiple drones [
34] represents a promising direction for future work. In particular, the design of a swarm coordination layer that connects LLM-based planning with the low-level execution of individual drones warrants further investigation.
Another interesting area of research is the implementation of these techniques in wireless sensor networks (WSNs) [35]. Traditional WSNs operate under rigid protocols that struggle to accommodate the dynamic, heterogeneous conditions characteristic of real-world deployments. By embedding RL agents at the network edge, individual sensor nodes can learn optimal policies for resource allocation, duty cycling, and routing through interaction with their environment. The integration of LLMs introduces a layer of high-level reasoning and contextual understanding that overcomes the limitations of conventional signal processing. LLMs can interpret content from aggregated sensor streams, translate complex environmental observations, and allow for natural-language-driven network reconfiguration. This combined approach allows for a self-organizing WSN architecture in which RL manages low-level, reactive control tasks, while LLMs handle high-level, strategic decision-making.
This work shows that using an LLM for planning and a distilled RL model for sub-task execution offers a scalable and efficient alternative to end-to-end RL. Distillation, as a basic learning mechanism, addresses the challenge of learning multiple independent tasks, and the modular approach addresses the sparse-reward problem, enabling the creation of more robust, interpretable, and efficient RL systems.
6. Conclusions
The solution presented in this paper shows the advantages of decoupling the planning unit from the execution unit. Combining the LLM with the RL model shows significantly better performance than a monolithic model could achieve. Traditional end-to-end RL solutions have proven to struggle in environments with sparse rewards. In contrast, the proposed architecture separates planning and execution, enabling the execution model to learn in a dense-reward environment.
The results show that distilling knowledge further improves RL agents, making them more robust and less susceptible to input noise. Independent training of teacher policies simplifies learning and enables better performance on specialized tasks, with improved training stability, further improving the final agent’s performance. This allows the system to scale more easily to more complex problems than traditional end-to-end RL solutions.
In comparison with related studies [9,10], several key limitations of those approaches were identified. The solution in paper [9] decomposed the problem into highly granular subtasks, thereby unnecessarily placing an additional burden on the LLM and failing to leverage the PPO algorithm’s learning capability. The results in [9] show that the LLM cannot effectively solve the problem presented unless an LLM fine-tuned for the Korean language is used. The solution presented in this paper shows better results and is language-independent.
The solution presented in paper [
10] proposed an alternate strategy, incorporating an LLM directly into the RL training process. Although this integration introduces additional flexibility and potential performance gains, it comes at a significantly higher computational cost. Building on the prior work [
12], it was instead decided to reduce the complexity of both the agent architecture and the training procedure. This design choice results in substantially faster training cycles and improved accessibility, primarily by reducing hardware requirements, while achieving comparable results.
Taken as a whole, these comparisons indicate that the key contribution of the proposed approach is not any single component, but how the responsibilities are split between the LLM and the RL agent. Methods like curriculum learning and knowledge distillation are useful for organizing training and compressing behavior, but they stay entirely within the RL framework and do not address higher-level reasoning or planning. Conversely, approaches that rely heavily on LLMs tend to lose the responsiveness and sample efficiency that make RL effective in dynamic environments.
Our approach tries to strike a balance between these two directions. The LLM is used only for high-level, object-oriented planning, while the actual execution is handled by a distilled RL agent that is both robust and efficient. This separation allows each part to do what it does best, resulting in a system that performs better than either component would on its own. It is this clear division of roles that sets the method apart from previous work and represents its main contribution.
In general, the results show that delegating planning to an LLM and allowing the RL models to focus solely on control and execution yields promising results compared to monolithic RL agents. The hybrid approach improves robustness, performance, and interpretability in complex environments that require structured, multilevel reasoning.