1. Introduction
Planning is a central problem in artificial intelligence. It involves transforming a high-level goal into a sequence of steps that an agent can execute in order to achieve the desired outcome. Reliable planning is especially important when agents interact with unpredictable environments, where mistakes in reasoning or execution may accumulate and cause failure.
Classical planning approaches rely on symbolic world models. They can produce interpretable action sequences and behave predictably when the model matches the environment, but they degrade quickly when assumptions are violated (missing objects, wrong effects, partial observability, unexpected events), which forces manual maintenance of the model.
Reinforcement learning (RL) offers a different approach. Instead of relying on symbolic rules, an RL agent learns policies from interaction with the environment. This makes RL suitable for high-dimensional tasks such as robotics or games. However, RL methods typically require extensive training data and long training times. They are also limited in their ability to generalize: a policy trained in one environment often performs poorly when the task is modified or when new constraints are introduced [1,2].
Large language models (LLMs) have recently been proposed as an alternative tool for planning [3]. Their strength lies in flexible reasoning and their ability to represent tasks in natural language. LLMs can generate action sequences, propose subgoals, or even synthesize domain specifications. Early studies show that they can produce valid plans in simplified settings, but their output is inconsistent [4,5,6,7]. Failures often occur in longer tasks, or in situations where logical constraints must be satisfied exactly. This raises the question of how LLMs can be systematically combined with established planning methods in order to benefit from both flexibility and reliability.
We investigate this question in two settings: MiniGrid/BabyAI grid-world tasks in simulation, and robotic manipulation with a UR5 arm in ROS 2 + Gazebo.
The two environments highlight different aspects of the planning problem: abstract reasoning in simulation, and grounding of plans in real-world actions. The contributions of this work are as follows:
A framework that integrates LLM-based reasoning into a Sense–Plan–Code–Act (SPCA) loop, combining text generation with structured validation.
An experimental study of LLM planning performance in MiniGrid tasks, including a taxonomy of common error types.
A demonstration of how the same framework can be extended to a robotic manipulation setup, showing the portability of the approach.
The remainder of this paper is organized as follows.
Section 2 reviews related work in symbolic planning, reinforcement learning, and LLM-based planning.
Section 3 presents the proposed framework.
Section 4 reports on experiments in MiniGrid and robotics.
Section 5 analyzes the implications of the findings.
Section 6 concludes the paper and suggests directions for future work.
2. Related Work
Classical planning provides the most established foundation for goal-directed reasoning in AI. In this setting, tasks are described through symbolic states and actions, typically encoded in the Planning Domain Definition Language (PDDL). The domain defines action schemas (preconditions/effects) and predicates, while the problem instance specifies objects, initial state, and goal conditions. Standard planners such as Fast Downward search over these state transitions to generate plans that are both sound and complete [9,10]. In practice, the dominant cost is model acquisition: choosing the right abstraction level, enumerating predicates and action effects, and keeping the specification consistent as tasks evolve. This makes classical planning strong in well-modeled settings but expensive to deploy and maintain when the environment or task distribution changes.
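To make the split between action schemas and problem instance concrete, the following minimal sketch encodes STRIPS-style actions as precondition, add, and delete sets and searches the resulting state space breadth-first. It is an illustration only: the key-and-door toy domain, the `Action` class, and `bfs_plan` are hypothetical names introduced here, and the uninformed search merely stands in for the heuristic search that planners such as Fast Downward perform.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """A ground STRIPS-style action (hypothetical helper, not a PDDL parser)."""
    name: str
    preconditions: frozenset
    add_effects: frozenset
    del_effects: frozenset


def bfs_plan(initial, goal, actions):
    """Return a shortest list of action names reaching the goal, or None."""
    start, goal = frozenset(initial), frozenset(goal)
    frontier = deque([(start, [])])
    visited = {start}
    while frontier:
        state, plan = frontier.popleft()
        if goal <= state:  # every goal fact holds in this state
            return plan
        for action in actions:
            if action.preconditions <= state:  # action is applicable
                successor = (state - action.del_effects) | action.add_effects
                if successor not in visited:
                    visited.add(successor)
                    frontier.append((successor, plan + [action.name]))
    return None  # state space exhausted without reaching the goal


# Toy domain (assumed for illustration): fetch a key, then unlock a door.
ACTIONS = [
    Action("pickup-key", frozenset({"at-key"}), frozenset({"has-key"}), frozenset()),
    Action("go-to-key", frozenset({"at-start"}), frozenset({"at-key"}), frozenset({"at-start"})),
    Action("go-to-door", frozenset({"at-key"}), frozenset({"at-door"}), frozenset({"at-key"})),
    Action("unlock-door", frozenset({"at-door", "has-key"}), frozenset({"door-open"}), frozenset()),
]

if __name__ == "__main__":
    print(bfs_plan({"at-start"}, {"door-open"}, ACTIONS))
```

The sketch also hints at the maintenance cost discussed above: omitting a single precondition or delete effect silently changes which plans are considered valid.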
Reinforcement learning (RL) offers an alternative by optimizing policies through interaction with the environment. In grid-based environments such as MiniGrid and BabyAI, discrete-action baseline RL algorithms such as Proximal Policy Optimization (PPO), Advantage Actor-Critic (A2C) and Deep Q-Networks (DQN) have been applied to sparse-reward and partially observable tasks [1,11,12]. Although these methods achieve robust behaviors within a training distribution, they demand large numbers of samples and transfer poorly across task variations. Extensions such as STAP have aimed to sequence skills to handle longer horizons [13], but scalability remains limited.
Large language models (LLMs) are token predictors trained at scale on text corpora, which makes them effective at generating natural language and code [4,5]; while not designed for planning, their generative ability has been adapted to produce symbolic descriptions, action sequences, or control code. Alignment techniques such as instruction tuning and chain-of-thought prompting improve their reasoning ability [14,15,16,17], but they remain prone to hallucinations and logical errors.
Several works explore how LLMs can augment symbolic planning. LLM-Planner and LLM+P translate natural language into PDDL fragments or full problem files, relying on a symbolic planner for validation [18,19]. World-model PDDL builders extend this loop by iteratively refining domains and problems until a solution is found [20]. More advanced systems, such as Planning in the Dark and AutoTAMP, combine LLM proposals with automated critics or task-and-motion checks to ensure executability [21,22]. In these approaches, the LLM acts as a flexible generator, while external validators guarantee correctness.
Other research explores open-loop LLM planning without formal PDDL grounding. Inner Monologue introduces step-level feedback during execution [23], while Embodied Chain-of-Thought produces reasoning traces tied to perceptual input [24]. SayCan grounds language in a robot’s skill library by scoring each skill with two signals: how relevant it is to the goal and whether it is feasible in the current scene [25]. SayCan operates over a fixed, hand-designed skill library and selects among existing skills rather than synthesizing new symbolic operators or long-horizon plans. Feasibility is captured implicitly via learned value functions, which improves grounding but offers limited global constraint checking and weaker inspectability compared to explicit symbolic planning.
Visual–language models (VLMs) pair visual encoders with language models to perceive and reason about scenes in natural language (e.g., open-vocabulary recognition, object–relation descriptions, affordance cues). VoxPoser links language to low-level control by producing 3D value maps over a voxelized scene, which guide a motion planner to generate executable trajectories [26]. In COME-Robot, GPT-4V serves as the VLM inside a closed-loop mobile manipulation system, interpreting RGB observations and proposing corrections during execution [27]. Vision–language–action (VLA) models take a step further by mapping multimodal observations (and language goals) directly to actions, coupling perception with policy generation. PaLM-E demonstrates embodied multimodal language modeling that conditions on images and robot state to emit action-relevant commands for real tasks [28], while OpenVLA provides an open-source VLA that learns end-to-end visuomotor policies from language-conditioned data [29]. VLM and VLA approaches enable flexible perception and control but typically require large-scale demonstration data and offer limited guarantees about constraint satisfaction. Their internal decision processes are difficult to verify, which complicates safety assurance in long-horizon or safety-critical tasks. In addition, running large multimodal models in closed-loop control can be computationally and financially expensive.
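The two-signal skill scoring attributed to SayCan above can be sketched as follows. This is a simplified illustration under stated assumptions: the skill names and all numeric scores are invented, `select_skill` is a hypothetical helper, and combining the two signals by multiplication is one plausible choice rather than a reproduction of the original system.

```python
def select_skill(relevance, feasibility):
    """Pick the skill maximizing relevance * feasibility.

    relevance:   dict skill -> score for "does this skill help the goal?"
    feasibility: dict skill -> score for "can this skill succeed right now?"
    Skills missing from `feasibility` are treated as infeasible (score 0).
    """
    scores = {
        skill: relevance[skill] * feasibility.get(skill, 0.0)
        for skill in relevance
    }
    best = max(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    # Hypothetical values: the most relevant skill is currently hard to execute.
    relevance = {"pick up sponge": 0.6, "go to table": 0.3, "open drawer": 0.1}
    feasibility = {"pick up sponge": 0.2, "go to table": 0.9, "open drawer": 0.8}
    best, scores = select_skill(relevance, feasibility)
    print(best, scores)
```

Because the choice is made one skill at a time from local scores, nothing in this rule checks constraints that span several steps, which is the limitation noted for this family of methods.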
In parallel, program and skill synthesis focuses on composing reusable behaviors. Traditional frameworks such as MoveIt Task Constructor and Tesseract define tasks as sequences of stages with explicit feasibility checks [30,31]. PDDLStream integrates symbolic reasoning with continuous samplers to verify kinematic or geometric constraints during planning [32]. While PDDLStream bridges symbolic and continuous planning, it relies on carefully engineered stream procedures and certificates, and performance is sensitive to sampling efficiency and heuristic guidance. Extending the system to new domains often requires substantial manual effort in defining new streams and validity checks. More recent LLM-based systems produce executable programs directly. Code as Policies translates language instructions into code that calls robot controllers [33], ProgPrompt structures prompts to bias models toward constrained outputs [34], Instruct2Act generates Python loops for language-conditioned control [35], and Voyager demonstrates iterative skill acquisition through program generation in open-world tasks [36]. These systems reduce manual authoring effort but remain vulnerable to runtime errors.
Reliability has therefore become a central focus. Benchmarks show that language-driven robots may execute unsafe or logically invalid instructions if unchecked [37]. To address this, systems incorporate validators [21], runtime guardrails [38], and memory mechanisms. RAG-Modulo retrieves past relevant experiences for reuse [39], Memory³ improves consistency with a structured memory store [40], and multi-agent debate frameworks (MAD) introduce redundancy by cross-checking outputs [41]. These augmentations improve robustness but still depend on external critics to ensure safety.
Recent benchmarks have been introduced to better capture the limitations of current methods. PlanBench measures symbolic executability and reasoning about change [42], LoTa-Bench targets language-oriented task planners for embodied agents [43], and TravelPlanner evaluates feasibility under real-world constraints [44]. ActPlan-1K focuses on procedural planning in household activities with visual–language models [45], while MFE-ETP provides a broader evaluation of multimodal foundation models on embodied planning tasks [46]. Together, these benchmarks reveal recurring gaps: symbolic methods do not scale to open environments, RL requires excessive data, and LLM-based approaches struggle with constraint satisfaction and reliability.
In summary, symbolic planners guarantee validity but require complete models, RL policies achieve robustness at high cost, and LLMs provide flexibility at the expense of correctness. This motivates further exploration of hybrid frameworks where LLMs act as flexible proposers and external verifiers guarantee soundness.
4. Results
4.1. Planning (PDDL Generation)
The planning stage was evaluated independently to compare language model performance on standardized PDDL problems. Each model receives a textual prompt and must produce valid domain and problem files. Twelve models were tested across eleven domains (Blocksworld, Gripper, Depot, Driverlog, Satellite, Rovers, Tyreworld, Storage, Logistics, Termes, Floortile). Smaller local models under 8B parameters (run with Ollama: llama3.2, phi4-mini-reasoning, qwen3, gemma3, deepseek-r1:8b) fail to produce valid PDDL syntax and therefore solve no tasks. Consequently, all reported results are from seven larger OpenAI models.
Figure 6 reports the number of problems solved at attempt k (Solved@k) across five trials per model. All models improve with additional repair attempts, confirming that the fixed retry loop helps recover near-miss generations. No model solves all problems at k = 1, but stronger models fix most issues in early attempts and then stagnate after about seven retries.
Figure 7 presents the percentage of problems solved within 10 retry attempts, together with the average number of attempts when success occurs. Each model–problem pair was repeated five times. These results are indicative rather than conclusive due to the small sample size. The best-performing models are gpt-5-mini, gpt-5, o4-mini, and o3, all reaching above 90% success within 10 attempts, with only gpt-5-mini solving all problems. gpt-5-mini also needed the fewest attempts on average (1.47), followed by gpt-5, o3, and gpt-4.1 (1.65), and o4-mini (2.42).
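The two quantities used in this subsection, Solved@k and the average number of attempts when success occurs, can be computed from per-problem records as in the minimal sketch below. The record format (the 1-based attempt of the first valid solution, or `None` if the retry budget was exhausted), the function names, and the sample data are assumptions made for illustration and are not taken from the experiments.

```python
from statistics import mean


def solved_at_k(first_success, max_k):
    """Return [Solved@1, ..., Solved@max_k] as cumulative problem counts.

    first_success: dict problem -> attempt number (1-based) of the first
                   valid solution, or None if the problem was never solved.
    """
    return [
        sum(1 for attempt in first_success.values()
            if attempt is not None and attempt <= k)
        for k in range(1, max_k + 1)
    ]


def mean_attempts_when_solved(first_success):
    """Average attempt number over solved problems only (None if none solved)."""
    solved = [a for a in first_success.values() if a is not None]
    return mean(solved) if solved else None


if __name__ == "__main__":
    # Hypothetical single trial over four problems.
    trial = {"blocksworld": 1, "gripper": 2, "depot": None, "rovers": 4}
    print(solved_at_k(trial, max_k=5))
    print(mean_attempts_when_solved(trial))
```

Unsolved problems are excluded from the attempt average, so that number should always be read alongside the success rate.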
Note that models marked with a star in Figure 6 and Figure 7 do not support structured output and were prompted to return plain-text PDDL instead of JSON.
4.2. MiniGrid
We evaluated SPCA on 76 MiniGrid/BabyAI levels across 8 categories, drawn from a curriculum of 102 levels across 9 categories.
Table 1 summarizes scope and runtime. The system completes 74.51% of all 102 levels in 325 runs, with an average of
SPCA rounds per level on the evaluated subset. All further statistics are calculated on the 76 evaluated levels.
Figure 8 depicts cumulative levels solved over time. The curve rises steeply at the start, indicating that the early levels are solved efficiently and reused in later tasks. The curve then flattens for categories like Open Door, Unlock Door, and Unlock Pickup, which require more retries and repairs. The Memory Ordering category stabilizes faster, suggesting stronger reuse of existing code and logic. A short execution video of the final MiniGrid agent is provided in the Supplementary Materials.
Planner behavior is summarized in Table 2. Most cases reuse a trusted domain, while syntax errors and fresh drafts occur less often. Table 3 indicates that most corrections happen during coding, with semantic retries dominating. Syntax errors did not occur, indicating strong coding reliability of current LLMs. The low number of new function generations (12 in total) compared to 76 solved levels highlights efficient reuse of modular skills.
A closer inspection of the execution logs provides additional insight into how often repair and reuse mechanisms are required in practice. When a new skill had to be implemented via fresh code generation, only 50.0% of these cases resulted in direct success without further modification. This indicates that semantic repair is frequently necessary to stabilize newly generated skills, even in the MiniGrid setting.
In contrast, reuse of an existing planning domain is strongly associated with efficient convergence. Among successful levels where a previously learned domain was reused, 87.9% completed in a single SPCA round, indicating that reuse typically avoids the need for additional replanning or repair cycles.
Finally, the overall semantic repair burden is highly concentrated. Across all levels, 213 semantic repair attempts were recorded, of which 79.8% were incurred by only ten levels. These cases correspond primarily to structurally more complex tasks, suggesting that most semantic repair effort is driven by a small number of particularly challenging environments rather than being uniformly distributed across the curriculum.
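As a reading aid, 79.8% of 213 attempts corresponds to roughly 170 semantic repairs on the ten hardest levels. A concentration figure of this kind can be computed from per-level repair counts as in the sketch below; the function name and the sample counts are hypothetical and do not reproduce the experimental logs.

```python
def top_k_share(repairs_per_level, k):
    """Fraction of all repair attempts incurred by the k most-repaired levels."""
    counts = sorted(repairs_per_level.values(), reverse=True)
    total = sum(counts)
    if total == 0:
        return 0.0  # no repairs recorded at all
    return sum(counts[:k]) / total


if __name__ == "__main__":
    # Hypothetical counts: two levels dominate the repair effort.
    repairs = {"level_a": 50, "level_b": 30, "level_c": 10, "level_d": 5, "level_e": 5}
    print(top_k_share(repairs, k=2))
```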
To better understand what drives the large share of semantic repairs, we inspected the execution-level failure signals in MiniGrid. The most common semantic failure is goal_not_reached, where the plan executes without runtime errors but ends in a non-final state. The second most common failure, occurring mostly in the early curriculum levels, is stuck, where the agent makes no progress after a fixed number of primitive actions.
The prevalence of goal_not_reached failures is largely explained by the structure of the environment: the space of valid states is dominated by non-final states. Even in a minimal 3 × 3 grid with the agent starting in the top-left and the goal in the bottom-right, only one out of nine positions corresponds to the goal (11.1%), so many action sequences can be valid yet still fail to reach the goal under a bounded execution horizon.
By contrast, stuck failures are primarily linked to partial observability. With the agent-centric 7 × 7 observation window, the agent cannot see the full map layout and may enter dead ends or cycles without recognizing them early. In these cases, SPCA eventually learns simple recovery behaviors, such as retracing recent steps and performing local exploration to escape loops and continue progress toward the goal.
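The two failure signals discussed above can be illustrated with a small deterministic check over the agent's trajectory. This is a sketch under explicit assumptions: "no progress" is approximated here as visiting no previously unseen cell during the last `stuck_window` primitive actions, the window size is arbitrary, and `Outcome` and `classify_episode` are hypothetical names rather than the critic used in the experiments.

```python
from enum import Enum


class Outcome(Enum):
    SUCCESS = "success"
    GOAL_NOT_REACHED = "goal_not_reached"
    STUCK = "stuck"


def classify_episode(positions, goal, stuck_window=4):
    """Classify a finished episode from the agent's visited cells.

    positions:    list of (x, y) cells, starting cell first, one per action after.
    goal:         (x, y) goal cell.
    stuck_window: number of trailing actions inspected for progress (assumed).
    """
    if not positions:
        raise ValueError("positions must contain at least the start cell")
    if positions[-1] == goal:
        return Outcome.SUCCESS
    if len(positions) > stuck_window:
        seen_before = set(positions[:-stuck_window])
        recent = positions[-stuck_window:]
        # Stuck: every recently visited cell had already been visited earlier.
        if all(cell in seen_before for cell in recent):
            return Outcome.STUCK
    # The plan ran without errors but ended in a non-final state.
    return Outcome.GOAL_NOT_REACHED


if __name__ == "__main__":
    oscillating = [(0, 0), (1, 0), (0, 0), (1, 0), (0, 0), (1, 0)]
    print(classify_episode(oscillating, goal=(2, 2)))
```

A richer progress measure would also track orientation and carried objects, which matter in MiniGrid but are omitted here for brevity.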
Figure 9 shows per-category task durations. Long tails appear in Open Door and Memory Ordering, while navigation and hazard avoidance tasks remain shorter. This pattern aligns with the cumulative level success curve, where the same categories required additional retries and repair cycles.
Figure 10 reports token usage per category. Token consumption grows with curriculum depth as the code knowledge base expands, increasing the context size for both PlannerLLM and CoderLLM. The CoderLLM shows higher variance due to longer semantic errors and temporary code fragments appended during repair cycles. All errors are appended to the coder context only within a single SPCA cycle, and the context is reset after either a successful execution or a replan.
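The context policy described above (errors accumulate only within one SPCA cycle and are cleared after a successful execution or a replan) can be sketched as a small container. The class name, method names, and event labels are hypothetical and chosen only for illustration.

```python
class CoderContext:
    """Holds the coder prompt plus error traces from the current cycle only."""

    RESET_EVENTS = {"success", "replan"}  # assumed event labels

    def __init__(self, base_prompt):
        self.base_prompt = base_prompt
        self.errors = []

    def add_error(self, trace):
        """Append an error trace produced during the current repair cycle."""
        self.errors.append(trace)

    def on_event(self, event):
        """Clear accumulated errors when the cycle ends."""
        if event in self.RESET_EVENTS:
            self.errors.clear()

    def render(self):
        """Build the text that would be sent to the coder model."""
        return "\n".join([self.base_prompt, *self.errors])


if __name__ == "__main__":
    ctx = CoderContext("Implement skill: open_door")
    ctx.add_error("goal_not_reached after 40 steps")
    ctx.add_error("stuck near (3, 4)")
    print(len(ctx.render()))
    ctx.on_event("success")
    print(len(ctx.render()))
```

Under this policy the prompt grows with each failed repair inside a cycle, which is consistent with the higher token variance reported for the coder.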
Table 4 reports total input and output token counts together with the corresponding API cost for the MiniGrid experiments, broken down by model. Costs are computed using OpenAI list pricing as of October 22nd (see https://pricepertoken.com/pricing-page/provider/openai, accessed on 20 December 2025). The cost distribution mirrors the token usage trends in Figure 10, where the coding stage dominates overall token consumption. This is consistent with the higher number of coder calls and longer generated responses observed during semantic repair.
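Converting token counts into cost is a simple linear computation, sketched below. The per-million-token prices and the token counts in the example are placeholders chosen for illustration; they are not the list prices or the totals used for Table 4.

```python
def api_cost(input_tokens, output_tokens, price_in_per_million, price_out_per_million):
    """Cost in the currency of the given prices, billed per million tokens."""
    return (
        input_tokens / 1_000_000 * price_in_per_million
        + output_tokens / 1_000_000 * price_out_per_million
    )


if __name__ == "__main__":
    # Placeholder numbers only: 2M input and 0.5M output tokens.
    cost = api_cost(2_000_000, 500_000, price_in_per_million=1.25, price_out_per_million=10.0)
    print(f"{cost:.2f}")
```

Since output tokens are typically priced higher than input tokens, stages that generate long responses can dominate cost even when their input share is moderate.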
Comparison with Traditional RL
To compare SPCA with reinforcement learning methods, we recreated the baseline from the original BabyAI paper [8] using the same architecture (CNN for image processing, GRU for text instruction processing and LSTM for memory) and obtained similar results. Table 5 reports the success rate after 500 episodes, as well as the number of episodes (in thousands) required to reach 70% and 99% success. At 500 episodes, performance remains low on all levels, and achieving near-perfect results requires anywhere from tens of thousands up to one million episodes.
Figure 11 shows the success curve for PickupLoc, where 882 k episodes are needed to reach 99% success. The curve follows a standard logarithmic shape with diminishing returns over time.
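The "episodes to reach a success threshold" statistic can be read off a logged learning curve as in the following sketch. The helper name and the curve values are hypothetical and only mimic the diminishing-returns shape described above; they are not measurements from the baseline.

```python
def episodes_to_threshold(curve, threshold):
    """Return the first episode count whose success rate meets the threshold.

    curve: list of (episodes, success_rate) pairs sorted by episodes.
    Returns None if the threshold is never reached.
    """
    for episodes, success_rate in curve:
        if success_rate >= threshold:
            return episodes
    return None


if __name__ == "__main__":
    # Hypothetical evaluation checkpoints.
    curve = [(1_000, 0.20), (10_000, 0.55), (50_000, 0.72),
             (200_000, 0.93), (900_000, 0.99)]
    print(episodes_to_threshold(curve, 0.70))
    print(episodes_to_threshold(curve, 0.99))
```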
4.3. ROS 2 + Gazebo UR5
In the ROS 2 + Gazebo simulator, we tested 31 scenarios across four task groups. The system completes 70.97% of all levels. On the evaluated subset of 22 levels, the average time per task is
min with
SPCA rounds and 131 total runs.
Table 6 summarizes runtime statistics.
Planning statistics appear in Table 7. Most tasks reuse existing domains, with replanning triggered by feasibility or collision constraints. The relatively high number of syntax repairs indicates that, despite reuse, the planner still rejects a non-negligible share of candidate domains or problems. Together, the four planner categories account for an average of 3.68 API calls per level. Coding results in Table 8 again show no syntax errors and mostly semantic repairs. Only 10 new skills were generated for 22 levels, confirming efficient code reuse. All further statistics are calculated on the 22 evaluated levels. An execution video is available in the Supplementary Materials.
Table 9 reports token usage across SenseLLM, PlannerLLM, and CoderLLM. SenseLLM and CoderLLM account for the majority of tokens, with SenseLLM slightly higher on average because it processes RGB images through the vision–language model, while the accompanying textual prompts with task-specific instructions remain short. This also explains the very small standard deviation. CoderLLM is the second-largest contributor and shows higher variance because error traces and temporary code are appended during repair cycles. PlannerLLM has a smaller spread than in MiniGrid, as it processes only function signatures and docstrings instead of full code. Overall token usage still increases with later levels as the code knowledge base grows, even though this trend is not directly visible in the table.
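Passing only signatures and docstrings to the planner, rather than full function bodies, can be done with Python's standard `inspect` module. The sketch below is one possible way to build such a summary; `skill_summary` and the example skill `pick_object` are hypothetical and are not the functions used in the experiments.

```python
import inspect


def skill_summary(func):
    """Return the function's signature plus the first line of its docstring."""
    signature = inspect.signature(func)
    doc = inspect.getdoc(func) or ""
    first_line = doc.splitlines()[0] if doc else ""
    return f"{func.__name__}{signature}\n    {first_line}"


# Hypothetical skill used only to demonstrate the summary.
def pick_object(name: str, approach_height: float = 0.1) -> bool:
    """Grasp the named object from above.

    The body would contain motion commands, which the planner never sees.
    """
    return True


if __name__ == "__main__":
    print(skill_summary(pick_object))
```

The summary length depends on the interface rather than the implementation, which fits the smaller planner token spread noted above.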
5. Discussion
The standalone PDDL generation experiments indicate that strong LLMs can reliably draft solvable domain/problem pairs when paired with a planner/validator in a short, bounded repair loop, whereas smaller LLMs struggle with syntax. A practical limitation is that PDDL generation currently depends on strong proprietary LLMs: in our tests, all smaller open-source models below 8B failed to produce valid PDDL syntax, which limits accessibility and increases operational cost. One mitigation is to enforce syntax with a PDDL context-free grammar and structured output, reducing unparsable generations. A second mitigation is to fine-tune open models on PDDL to improve formatting and domain/problem consistency, while keeping the planner/validator loop as a semantic check. A third mitigation is to use larger open models that can match proprietary performance, at the cost of higher local compute requirements (GPU) or paid inference services. As seen in Figure 6, the feedback loop almost doubles performance from Solved@1 to Solved@10, confirming that structured error summaries and targeted patches convert many near-misses into solutions. We also observe diminishing returns after roughly seven attempts for the strongest models (e.g., o3, o4-mini), indicating that a budget of k = 10 is sufficient in practice. These trends align with recent systems that pair LLM proposals with formal checks and bounded patching [19,47,48,49].
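The bounded repair loop discussed above can be sketched generically as a generator and a validator exchanging feedback for at most k attempts. Everything in this example is an assumption made for illustration: `repair_loop` is a hypothetical helper, the generator replays canned drafts instead of calling an LLM, and the validator only checks parenthesis balance as a stand-in for a real planner/validator.

```python
def repair_loop(generate, validate, max_attempts=10):
    """Call generate(feedback) until validate accepts or the budget runs out.

    Returns (candidate, attempts_used) on success, (None, max_attempts) otherwise.
    """
    feedback = None
    for attempt in range(1, max_attempts + 1):
        candidate = generate(feedback)
        ok, feedback = validate(candidate)
        if ok:
            return candidate, attempt
    return None, max_attempts


def toy_validate(text):
    """Stand-in validator: accept text whose parentheses are balanced."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False, "unexpected ')'"
    if depth:
        return False, f"{depth} unclosed '('"
    return True, None


def make_toy_generator(drafts):
    """Stand-in for an LLM: replays fixed drafts and ignores the feedback."""
    remaining = iter(drafts)
    return lambda feedback: next(remaining)


if __name__ == "__main__":
    generator = make_toy_generator(["(define (domain d)", "(define (domain d))"])
    print(repair_loop(generator, toy_validate, max_attempts=10))
```

Keeping the budget explicit makes the cost of a failed problem predictable: at most `max_attempts` generator calls.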
The MiniGrid experiments show that symbolic planning stabilizes early in the curriculum. Once a domain has been accepted by the planner and validator, subsequent tasks are often solved with small edits rather than fresh models. This leads to a high share of domain reuse and relatively few syntax rejections. Failures happen primarily during execution, where cyclical exploration and action ordering errors trigger semantic retries in the coder loop. The coding stage therefore carries the main repair load, which is consistent with the design of SPCA. Most adjustments are small patches, such as adding guards, handling map boundaries, or retrying toggles. These patches converge into a reusable library of skills, making later tasks more efficient. The token use analysis confirms this trend: planner calls remain modest, while coder calls grow as the skill library expands. This growth increases cost and occasionally destabilizes behavior, which highlights the need for slimmer prompts that pass only relevant skills instead of full code contexts.
Reproducing the BabyAI baseline with the original CNN–GRU–LSTM architecture yields results consistent with the original paper [8]. The model requires close to one million episodes to reach 99% success on PickupLoc, while performance after 500 episodes remains low, around 14%, and near zero on harder levels such as PutNextLocal. Although later studies report moderate gains in sample efficiency [11,12,50,51,52,53], reinforcement learning still demands large amounts of training and transfers poorly even between tasks within the same environment. In contrast, SPCA achieves comparable performance in only a few simulation runs per level by combining symbolic validation that filters logically invalid plans, targeted code repair that fixes execution errors without discarding successful logic, and reuse of previously verified domains and skills. This difference highlights a practical trade-off: gradient-based exploration learns robust behaviors but at high cost, whereas structured validation and repair enable fast, compositional generalization, an advantage that becomes crucial when simulations are expensive.
In the ROS 2 + Gazebo setting, the same pattern emerges under more demanding conditions. Symbolic domains remain stable once established, with replanning triggered mostly by motion planning faults or by the referee detecting violated collision constraints rather than by syntax errors. As in MiniGrid, most adjustments here are handled by the coding stage, where small patches to skills fix execution problems without altering the higher-level plan. Semantic retries dominate, while syntax errors are filtered out by the compiler and reload checks. Successful patches often adjust approach heights, retreat distances, or gripper timing, which shows that execution-level repair can handle physical margins without discarding the symbolic plan. Token budgets follow the same division as in MiniGrid: SenseLLM remains constant, PlannerLLM is moderate, and CoderLLM grows with the accumulated skill base. The main failure mode is occlusion, which is expected with a single fixed top-down RGB-D camera. Because the camera is mounted above the workspace, the arm sometimes blocks objects, leading to scene descriptions that omit items required by the plan and thereby introduce inconsistencies during execution. From a systems perspective, this limitation is primarily a sensing issue rather than a planning one. Several straightforward upgrades could reduce occlusion-induced instability, including multi-view sensing through an additional side camera or a wrist-mounted camera on the end-effector to provide complementary viewpoints. Stability can also be improved by making execution more perception-aware, for example by re-sensing after large motions, inserting simple guards that verify object visibility before committing to the next substep, or repositioning the arm or camera when visibility is lost.
Maintaining a persistent 3D workspace representation built from successive depth observations would further allow the system to retain object hypotheses across short occlusions, while lightweight instance segmentation can help separate objects from arm geometry and clutter. Together, these measures would reduce the propagation of missing-object descriptions into later stages and improve robustness on real hardware.
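One of the proposed mitigations, a guard that verifies object visibility before each substep and re-senses a bounded number of times, could look like the sketch below. It illustrates the idea only and is not part of the evaluated system: `run_with_visibility_guard`, the substep tuple format, the `sense` callable, and the object names are all hypothetical.

```python
def run_with_visibility_guard(substeps, sense, max_resense=2):
    """Execute substeps only while their required objects are visible.

    substeps:    list of (name, required_objects, action) tuples, where
                 action is a zero-argument callable.
    sense:       callable returning the set of currently visible object names.
    max_resense: extra sensing attempts allowed per substep (assumed budget).
    Returns (True, None, set()) on success, or (False, substep_name, missing).
    """
    for name, required, action in substeps:
        for _ in range(max_resense + 1):
            missing = set(required) - sense()
            if not missing:
                break  # everything needed is visible, proceed
        else:
            # Budget exhausted: stop before acting on an incomplete scene.
            return False, name, missing
        action()
    return True, None, set()


if __name__ == "__main__":
    # Hypothetical readings: the arm occludes cube_a in the first frame.
    readings = iter([{"cube_b"}, {"cube_a", "cube_b"}, {"cube_a", "cube_b"}])
    log = []
    result = run_with_visibility_guard(
        [("grasp cube_a", {"cube_a"}, lambda: log.append("grasp")),
         ("place on cube_b", {"cube_b"}, lambda: log.append("place"))],
        sense=lambda: next(readings),
    )
    print(result, log)
```

In a real setup the retry branch would also be the natural place to reposition the arm or switch to a second camera before sensing again.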
Our evaluation is simulation-only (MiniGrid and ROS 2 + Gazebo); while Gazebo captures geometry and basic physics, it does not reproduce many real-world factors such as sensor noise and calibration drift, contact uncertainty, unmodeled dynamics, and appearance changes. As a result, the reported success rates should be interpreted as proof-of-concept evidence for the SPCA loop under controlled conditions rather than sim-to-real validation. Closing this gap would require hardware experiments or higher-fidelity simulation with explicit noise/perturbation models, more robust perception, and safety-constrained execution.
Across both environments, the results highlight the importance of dual critics. By separating symbolic validation from code execution, the system can localize errors and apply small repairs rather than discarding entire plans. This makes the loop more sample-efficient than reinforcement learning, which often requires many episodes to discover reliable strategies, and more reliable than end-to-end LLM agents, which lack external checks. At the same time, limitations remain. The coder loop is heavy, often requiring multiple retries per level, and prompt growth is a practical bottleneck. The reliance on strong reasoning language models for valid PDDL generation also suggests that lighter models are not yet adequate for planning roles. Finally, the environments used here, although diverse, remain simplified testbeds. Generalization to real-world robotics with higher variability in tasks, objects, and sensing conditions remains an open question. The real world also poses additional challenges, such as reflections and other camera effects under varying lighting conditions, the cost of unbounded path planning (the robot arm can break), the safety of people in the room, and the lack of a deterministic critic (it is not obvious how to determine whether a task has succeeded or failed).
These findings suggest that SPCA occupies a useful position between classical planning, reinforcement learning, and purely generative LLM agents. Its explicit separation of high-level intent from low-level execution, combined with deterministic critics, yields interpretable outcomes and bounded loops. The approach shows promise for long-horizon tasks under partial observability and physical constraints, while also exposing clear avenues for improvement in perception, prompt efficiency, and validation coverage.