Next Article in Journal
Artificial Intelligence for Predicting Treatment Response in Neovascular Age Macular Degeneration with Anti-VEGF: A Systematic Review and Meta-Analysis
Next Article in Special Issue
Assessing Interaction Quality in Human–AI Dialogue: An Integrative Review and Multi-Layer Framework for Conversational Agents
Previous Article in Journal
Scenario-Guided Temporal Prototypes in Reinforcement Learning
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

Robot Planning via LLM Proposals and Symbolic Verification

1
Jozef Stefan International Postgraduate School, Jamova cesta 39, 1000 Ljubljana, Slovenia
2
Faculty of Computer and Information Science, University of Ljubljana, Večna pot 113, 1000 Ljubljana, Slovenia
*
Author to whom correspondence should be addressed.
Mach. Learn. Knowl. Extr. 2026, 8(1), 22; https://doi.org/10.3390/make8010022
Submission received: 20 November 2025 / Revised: 24 December 2025 / Accepted: 12 January 2026 / Published: 16 January 2026

Abstract

Planning in robotics represents an ongoing research challenge, as it requires the integration of sensing, reasoning, and execution. Although large language models (LLMs) provide a high degree of flexibility in planning, they often introduce hallucinated goals and actions and consequently lack the formal reliability of deterministic methods. In this paper, we address this limitation by proposing a hybrid Sense–Plan–Code–Act (SPCA) framework that combines perception, LLM-based reasoning, and symbolic planning. Within the proposed approach, sensory information is first transformed into a symbolic description of the world in Planning Domain Definition Language (PDDL) using an LLM. A heuristic planner is then used to generate a valid plan, which is subsequently converted to code by a second LLM. The generated code is first validated syntactically through compilation and then semantically in simulation. When errors are detected, local corrections can be applied and the process is repeated as necessary. The proposed method is evaluated in the OpenAI Gym MiniGrid reinforcement learning environment and in a Gazebo simulation on a UR5 robotic arm using a curriculum of tasks with increasing complexity. The system successfully completes approximately 71–75% of tasks across environments with a relatively low number of simulation iterations.

1. Introduction

Planning is a central problem in artificial intelligence. It involves transforming a high-level goal into a sequence of steps that an agent can execute in order to achieve the desired outcome. Reliable planning is especially important when agents interact with unpredictable environments, where mistakes in reasoning or execution may accumulate and cause failure.
Classical planning approaches rely on symbolic world models. They can produce interpretable action sequences and behave predictably when the model matches the environment, but they degrade quickly when assumptions are violated (missing objects, wrong effects, partial observability, unexpected events), which forces manual maintenance of the model.
Reinforcement learning (RL) offers a different approach. Instead of relying on symbolic rules, an RL agent learns policies from interaction with the environment. This makes RL suitable for high-dimensional tasks such as robotics or games. However, RL methods typically require extensive training data and long training times. They are also limited in their ability to generalize: a policy trained in one environment often performs poorly when the task is modified or when new constraints are introduced [1,2].
Large language models (LLMs) have recently been proposed as an alternative tool for planning [3]. Their strength lies in flexible reasoning and their ability to represent tasks in natural language. LLMs can generate action sequences, propose subgoals, or even synthesize domain specifications. Early studies show that they can produce valid plans in simplified settings, but their output is inconsistent [4,5,6,7]. Failures often occur in longer tasks, or in situations where logical constraints must be satisfied exactly. This raises the question of how LLMs can be systematically combined with established planning methods in order to benefit from both flexibility and reliability.
We investigate this question in two settings:
  • MiniGrid, a widely used benchmark for goal-directed tasks in a grid-world environment [8]. MiniGrid provides controlled conditions for evaluating planning systems and allows systematic analysis of errors.
  • A robotic manipulation scenario, where a robot must execute high-level goals under physical and temporal constraints.
Both environments highlight different aspects of the planning problem: abstract reasoning in simulation, and grounding of plans in real-world actions. The contributions of this work are as follows:
  • A framework that integrates LLM-based reasoning into a Sense–Plan–Code–Act (SPCA) loop, combining text generation with structured validation.
  • An experimental study of LLM planning performance in MiniGrid tasks, including a taxonomy of common error types.
  • A demonstration of how the same framework can be extended to a robotic manipulation setup, showing the portability of the approach.
The remainder of this paper is organized as follows. Section 2 reviews related work in symbolic planning, reinforcement learning, and LLM-based planning. Section 3 presents the proposed framework. Section 4 reports on experiments in MiniGrid and robotics. Section 5 analyzes the implications of the findings. Section 6 concludes the paper and suggests directions for future work.

2. Related Work

Classical planning provides the most established foundation for goal-directed reasoning in AI. In this setting, tasks are described through symbolic states and actions, typically encoded in the Planning Domain Definition Language (PDDL). The domain defines action schemas (preconditions/effects) and predicates, while the problem instance specifies objects, initial state, and goal conditions. Standard planners such as Fast Downward search over these state transitions to generate plans that are both sound and complete [9,10]. In practice, the dominant cost is model acquisition: choosing the right abstraction level, enumerating predicates and action effects, and keeping the specification consistent as tasks evolve. This makes classical planning strong in well-modeled settings but expensive to deploy and maintain when the environment or task distribution changes.
Reinforcement learning (RL) offers an alternative by optimizing policies through interaction with the environment. In grid-based environments such as MiniGrid and BabyAI, discrete-action baseline RL algorithms such as Proximal Policy Optimization (PPO), Advantage Actor-Critic (A2C) and Deep Q-Networks (DQN) have been applied to sparse-reward and partially observable tasks [1,11,12]. Although these methods achieve robust behaviors within a training distribution, they demand large numbers of samples and transfer poorly across task variations. Extensions such as STAP have aimed to sequence skills to handle longer horizons [13], but scalability remains limited.
Large language models (LLMs) are token predictors trained at scale on text corpora, which makes them effective at generating natural language and code [4,5]; while not designed for planning, their generative ability has been adapted to produce symbolic descriptions, action sequences, or control code. Alignment techniques such as instruction tuning and chain-of-thought prompting improve their reasoning ability [14,15,16,17], but they remain prone to hallucinations and logical errors.
Several works explore how LLMs can augment symbolic planning. LLM-Planner and LLM+P translate natural language into PDDL fragments or full problem files, relying on a symbolic planner for validation [18,19]. World-model PDDL builders extend this loop by iteratively refining domains and problems until a solution is found [20]. More advanced systems, such as Planning in the Dark and AutoTAMP, combine LLM proposals with automated critics or task-and-motion checks to ensure executability [21,22]. In these approaches, the LLM acts as a flexible generator, while external validators guarantee correctness.
Other research explores open-loop LLM planning without formal PDDL grounding. Inner Monologue introduces step-level feedback during execution [23], while Embodied Chain-of-Thought produces reasoning traces tied to perceptual input [24]. SayCan grounds language in a robot’s skill library by scoring each skill with two signals: how relevant it is to the goal and whether it is feasible in the current scene [25]. SayCan operates over a fixed, hand-designed skill library and selects among existing skills rather than synthesizing new symbolic operators or long-horizon plans. Feasibility is captured implicitly via learned value functions, which improves grounding but offers limited global constraint checking and weaker inspectability compared to explicit symbolic planning. Visual–language models (VLMs) pair visual encoders with language models to perceive and reason about scenes in natural language (e.g., open-vocabulary recognition, object–relation descriptions, affordance cues). VoxPoser links language to low-level control by producing 3D value maps over a voxelized scene, which guide a motion planner to generate executable trajectories [26]. In COME-Robot, GPT-4V serves as the VLM inside a closed-loop mobile manipulation system, interpreting RGB observations and proposing corrections during execution [27]. Vision–language–action (VLA) models take a step further by mapping multimodal observations (and language goals) directly to actions, coupling perception with policy generation. PaLM-E demonstrates embodied multimodal language modeling that conditions on images and robot state to emit action-relevant commands for real tasks [28], while OpenVLA provides an open-source VLA that learns end-to-end visuomotor policies from language-conditioned data [29]. VLM and VLA approaches enable flexible perception and control but typically require large-scale demonstration data and offer limited guarantees about constraint satisfaction. Their internal decision processes are difficult to verify, which complicates safety assurance in long-horizon or safety-critical tasks. In addition, running large multimodal models in closed-loop control can be computationally and financially expensive.
In parallel, program and skill synthesis focuses on composing reusable behaviors. Traditional frameworks such as MoveIt Task Constructor and Tesseract define tasks as sequences of stages with explicit feasibility checks [30,31]. PDDLStream integrates symbolic reasoning with continuous samplers to verify kinematic or geometric constraints during planning [32]. More recent LLM-based systems produce executable programs directly. While PDDLStream bridges symbolic and continuous planning, it relies on carefully engineered stream procedures and certificates, and performance is sensitive to sampling efficiency and heuristic guidance. Extending the system to new domains often requires substantial manual effort in defining new streams and validity checks. Code as Policies translates language instructions into code that calls robot controllers [33], ProgPrompt structures prompts to bias models toward constrained outputs [34], Instruct2Act generates Python loops for language-conditioned control [35], and Voyager demonstrates iterative skill acquisition through program generation in open-world tasks [36]. These systems reduce manual authoring effort but remain vulnerable to runtime errors.
Reliability has therefore become a central focus. Benchmarks show that language-driven robots may execute unsafe or logically invalid instructions if unchecked [37]. To address this, systems incorporate validators [21], runtime guardrails [38], and memory mechanisms. RAG-Modulo retrieves past relevant experiences for reuse [39], Memory3 improves consistency with a structured memory store [40], and multi-agent debate frameworks (MAD) introduce redundancy by cross-checking outputs [41]. These augmentations improve robustness but still depend on external critics to ensure safety.
Recent benchmarks have been introduced to better capture the limitations of current methods. PlanBench measures symbolic executability and reasoning about change [42], LoTa-Bench targets language-oriented task planners for embodied agents [43], and TravelPlanner evaluates feasibility under real-world constraints [44]. ActPlan-1K focuses on procedural planning in household activities with visual–language models [45], while MFE-ETP provides a broader evaluation of multimodal foundation models on embodied planning tasks [46]. Together, these benchmarks reveal recurring gaps: symbolic methods do not scale to open environments, RL requires excessive data, and LLM-based approaches struggle with constraint satisfaction and reliability.
In summary, symbolic planners guarantee validity but require complete models, RL policies achieve robustness at high cost, and LLMs provide flexibility at the expense of correctness. This motivates further exploration of hybrid frameworks where LLMs act as flexible proposers and external verifiers guarantee soundness.

3. Materials and Methods

3.1. The Framework: Sense → Plan → Code → Act

We address long-horizon tasks using a four-stage approach, Sense → Plan → Code → Act, SPCA. The central idea is that language models propose symbolic structure or code, while deterministic critics decide validity at critical points. Planning outputs are checked by a symbolic planner and validator, and code outputs are checked by a compiler and a simulator. If errors are found, the loop attempts bounded local repairs before returning to earlier stages. Figure 1 illustrates this pipeline, which we evaluate in both MiniGrid and a Robot Operating System 2 (ROS 2) + Gazebo robotic setup.

3.2. Sense

The Sense stage condenses raw input into a compact description that later stages can consume. For MiniGrid, the snapshot contains mission text, agent orientation, inventory, and a local egocentric grid. A global map is built incrementally to support replanning. For ROS 2 + Gazebo, a calibrated RGB-D camera observes the workspace. A vision–language model (SenseLLM) converts the image into a short scene summary describing objects, colors, and coarse spatial relations. Raw sensor streams remain accessible for actions during execution. All snapshots are checkpointed so that later repairs can resume from the same world state. Figure 2 shows an example of this process.
The vision prompt and image-handling details are described in Appendix D.

3.3. Plan

The Plan stage turns previous stage output together with the task description into a PDDL domain and problem. We use PlannerLLM, which drafts the files. The Unified Planning framework parses them and calls Fast Downward as the solver. TAMER validation then checks the resulting plan. If parsing, solving, or validation fails, the system enters a bounded repair loop with at most eight attempts. Each attempt summarizes the error and requests a targeted patch from the LLM. We use OpenAI GPT-o3 (version o3-2025-04-16) for fresh drafts and repair, and the lighter o4-mini (version o4-mini-2025-04-16) when reusing a trusted domain. Figure 3 shows the complete planning workflow.
Exact prompt templates and planning modes are provided in Appendixes Appendix B and Appendix E.

3.4. Code

The Code stage uses CoderLLM to generate or patch only the missing skills referenced in the plan. The produced small patches are guided by the PDDL schemas. Two critics enforce correctness. The first is compilation and reloading, which catch syntax and interface errors. The second is a test in the target environment (the Act stage), which handles semantic errors. If a patch fails, error traces are provided to the model. The system allows up to five repair attempts per skill. Successful patches are stored for reuse in later tasks. We use OpenAI Codex (codex-mini-latest) across all CoderLLM modes. In MiniGrid, skills are Python methods of an Agent class that emit primitive moves. In ROS 2 + Gazebo, skills are Python functions that interact with MoveIt 2 for motion planning and control.
The full code-generation prompts and repair protocol are detailed in Appendixes Appendix C and Appendix F.

3.5. Act

The Act stage executes a validated plan. Each symbolic step calls its bound skill, which then produces primitives or robot actions. Failure signals trigger repair at the code level. Only when repair budgets are exhausted does the system escalate to replanning.
We evaluate the framework in two environments. In MiniGrid/BabyAI, we use a curriculum of tasks with partial observability, including navigation, pickup, and unlocking. The environment itself provides termination signals, reporting success when goals are achieved and failure when retry budgets expire. Figure 4 shows one such task.
We use a curriculum with the following task groups (categories):
  • Goal Navigation: Move to a target location in simple layouts.
  • Static Obstacle Navigation: Go to a target while avoiding fixed obstacles.
  • Hazard Avoidance: Reach the goal while avoiding dangerous tiles (i.e., lava).
  • Pickup Only: Find and pick up a specified object.
  • Open Door: Open unlocked doors.
  • Memory Ordering: Use memory to perform actions in the right order.
  • Unlock Door: Fetch a key and unlock a locked door.
  • Unlock Pickup: Unlock doors and then pick up the target object.
  • Obstacle Blocking: Move obstacles and open doors to reach the goal.
In ROS 2 + Gazebo, we use a UR5 arm with a Robotiq 2F-85 gripper, a static top-down RGB-D camera, and six colored objects on a table. Tasks include Touch, Pick-up, Pick-and-place, and Stack. Unlike MiniGrid, where the environment itself signals termination, here a Referee node monitors collisions and success conditions defined in YAML. It reports success only if all positive contacts hold and forbidden contacts are absent at the same time. The Referee also checks timeout conditions. The YAML task specification and the collision-based Referee logic are detailed in Appendix A. Figure 5 shows the simulated workspace.
The curriculum consists of the following task groups (categories):
  • Touch: Move the gripper to make contact with a specified cube.
  • Pick-Up: Grip and lift a target cube so it is no longer touching the table.
  • Pick-and-Place: Pick up a cube and place it on a specified plate or next to another cube.
  • Stack: Arrange multiple cubes by stacking them in the correct color order.

3.6. Evaluation

We evaluate the SPCA framework in each of the two environments, using a fixed curricula of tasks.
In MiniGrid we test task completion across categories of increasing difficulty and measure how often symbolic plans are reused versus regenerated. We also profile coding reliability under repair loops and track prompt budgets to estimate LLM cost. In ROS 2 + Gazebo we test transfer to a robotic simulator with realistic kinematics and sensing. Here we measure completion of manipulation tasks (touch, pick-up, pick-and-place, stack) while enforcing collision rules defined externally.
For both environments we report the following metrics:
  • Completion rate (%): fraction of tasks solved within retry budgets.
  • Execution time: wall-clock time per task, measured from sensing to final verdict.
  • Planning diagnostics: counts of fresh (first-time domain), reuse (trusted domain reused), replan (new domain after failed rollout), and syntax (planner or validator rejection).
  • Coding diagnostics: counts of first (success on first attempt), semantic (runtime or environment-level failure), and syntax (compile or import failure).
  • Token budgets: prompt and completion tokens consumed by (SenseLLM), PlannerLLM, and CoderLLM.
  • SPCA rounds: number of outer Sense–Plan–Code–Act cycles attempted per task.
In addition to the end-to-end SPCA evaluation, we also test the Plan stage in isolation. Given a description of the problem in natural language, the language model is tasked with generating a complete PDDL domain and problem pair, that is then parsed and solved by a classical planner. Twelve models are tested across eleven well-known problems. The metrics reported are the share of solved problems within a small repair budget and the average number of repair attempts when a valid plan is found.

4. Results

4.1. Planning (PDDL Generation)

The planning stage was evaluated independently to compare language model performance on standardized PDDL problems. Each model receives a textual prompt and must produce valid domain and problem files. Twelve models were tested across eleven domains (Blocksworld, Gripper, Depot, Driverlog, Satellite, Rovers, Tyreworld, Storage, Logistics, Termes, Floortile). Smaller local models under 8B parameters (run with Ollama: llama3.2, phi4-mini-reasoning, qwen3, gemma3, deepseek-r1:8b) fail to produce valid PDDL syntax and therefore solve no tasks. Consequently, all reported results are from seven larger OpenAI models.
Figure 6 reports the number of problems solved at attempt k (Solved@k) across five trials per model. All models improve with additional repair attempts, confirming that the fixed retry loop helps recover near-miss generations. No model solves all problems at k = 1, but stronger models fix most issues in early attempts and then stagnate after about seven retries.
Figure 7 presents the percentage of problems solved within 10 retry attempts, together with the average number of attempts when success occurs. Each model–problem pair was repeated five times. These results are indicative rather than conclusive due to the small sample size. The best-performing models are gpt-5-mini, gpt-5, o4-mini, and o3, all reaching above 90% success within 10 attempts, with only gpt-5-mini solving all problems. gpt-5-mini also needed the fewest attempts on average (1.47), followed by gpt-5, o3, and gpt-4.1 (1.65), and o4-mini (2.42).
Note that models marked with a star in Figure 6 and Figure 7 do not support structured output and were prompted to return plain-text PDDL instead of JSON.

4.2. MiniGrid

We evaluated SPCA on 76 MiniGrid/BabyAI levels across 8 categories on a curriculum with a total of 102 levels across 9 categories. Table 1 summarizes scope and runtime. The system completes 74.51% of all 102 levels in 325 runs, with an average of 1.38 ± 1.00 SPCA rounds per level on the evaluated subset. All further statistics are calculated on the 76 evaluated levels.
Figure 8 depicts cumulative levels solved over time. The curve rises steeply at the start, indicating that the early levels are solved efficiently and reused in later tasks. The curve then flattens for categories like Open Door, Unlock Door, and Unlock Pickup, which require more retries and repairs. The Memory Ordering category stabilizes faster, suggesting stronger reuse of existing code and logic. A short execution video of the final MiniGrid agent is provided in the Supplementary Materials.
Planner behavior is summarized in Table 2. Most cases reuse a trusted domain, while syntax errors and fresh drafts occur less often.
Table 3 indicates that most corrections happen during coding, with semantic retries dominating. Syntax errors did not occur, indicating strong coding reliability of current LLMs. The low number of new function generations (12 in total) compared to 76 solved levels highlights efficient reuse of modular skills.
A closer inspection of the execution logs provides additional insight into how often repair and reuse mechanisms are required in practice. When a new skill had to be implemented via fresh code generation, only 50.0% of these cases resulted in direct success without further modification. This indicates that semantic repair is frequently necessary to stabilize newly generated skills, even in the MiniGrid setting.
In contrast, reuse of an existing planning domain is strongly associated with efficient convergence. Among successful levels where a previously learned domain was reused, 87.9% completed in a single SPCA round, indicating that reuse typically avoids the need for additional replanning or repair cycles.
Finally, the overall semantic repair burden is highly concentrated. Across all levels, 213 semantic repair attempts were recorded, of which 79.8% were incurred by only ten levels. These cases correspond primarily to structurally more complex tasks, suggesting that most semantic repair effort is driven by a small number of particularly challenging environments rather than being uniformly distributed across the curriculum.
To better understand what drives the large share of semantic repairs, we inspected the execution-level failure signals in MiniGrid. The most common semantic failure is goal_not_reached, where the plan executes without runtime errors but ends in a non-final state. The second most common failure, occurring mostly in the early curriculum levels, is stuck, where the agent makes no progress after a fixed number of primitive actions.
The prevalence of goal_not_reached failures is largely explained by the structure of the environment: the space of valid states is dominated by non-final states. Even in a minimal 3 × 3 grid with the agent starting in the top-left and the goal in the bottom-right, only one out of nine positions corresponds to the goal (11.1%), so many action sequences can be valid yet still fail to reach the goal under a bounded execution horizon.
By contrast, stuck failures are primarily linked to partial observability. With the agent-centric 7 × 7 observation window, the agent cannot see the full map layout and may enter dead ends or cycles without recognizing them early. In these cases, SPCA eventually learns simple recovery behaviors, such as retracing recent steps and performing local exploration to escape loops and continue progress toward the goal.
Figure 9 shows per–category task durations. Long tails appear in Open Door and Memory Ordering, while navigation and hazard avoidance tasks remain shorter. This pattern aligns with the cumulative level success curve, where the same categories required additional retries and repair cycles.
Figure 10 reports token usage per category. Token consumption grows with curriculum depth as the code knowledge base expands, increasing the context size for both PlannerLLM and CoderLLM. The CoderLLM shows higher variance due to longer semantic errors and temporary code fragments appended during repair cycles. All errors are appended to the coder context only within a single SPCA cycle, and the context is reset after either a successful execution or a replan.
Table 4 reports total input and output token counts together with the corresponding API cost for the MiniGrid experiments, broken down by model. Costs are computed using OpenAI list pricing as of October 22nd (see https://pricepertoken.com/pricing-page/provider/openai, accessed on 20 December 2025). The cost distribution mirrors the token usage trends in Figure 10, where the coding stage dominates overall token consumption. This is consistent with the higher number of coder calls and longer generated responses observed during semantic repair.

Comparison with Traditional RL

To compare SPCA with reinforcement learning methods, we recreated the baseline from the original BabyAI paper [8] using the same architecture (CNN for image processing, GRU for text instruction processing and LSTM for memory) and obtained similar results. Table 5 reports the success rate after 500 episodes, as well as the number of episodes (in thousands) required to reach 70% and 99% success. At 500 episodes, performance remains low on all levels, and achieving near-perfect results requires anywhere from tens of thousands up to one million episodes.
Figure 11 shows the success curve for PickupLoc, where 882 k episodes are needed to reach 99% success. The curve follows a standard logarithmic shape with diminishing returns over time.

4.3. ROS 2 + Gazebo UR5

In the ROS 2 + Gazebo simulator, we tested 31 scenarios across four task groups. The system completes 70.97% of all levels. On the evaluated subset of 22 levels, the average time per task is 5.13 ± 1.72 min with 1.38 ± 0.18 SPCA rounds and 131 total runs. Table 6 summarizes runtime statistics.
Planning statistics appear in Table 7. Most tasks reuse existing domains, with replanning triggered by feasibility or collision constraints. The relatively high number of syntax repairs indicates that, despite reuse, the planner still rejects a non–negligible share of candidate domains or problems. Together, the four planner categories account for an average of 3.68 API calls per level. Coding results in Table 8 again show no syntax errors and mostly semantic repairs. Only 10 new skills were generated for 22 levels, confirming efficient code reuse. All further statistics are calculated on the 22 evaluated levels. An execution video is available in the Supplementary Materials.
Table 9 reports token usage across SenseLLM, PlannerLLM, and CoderLLM. SenseLLM and CoderLLM account for the majority of tokens, with SenseLLM slightly higher on average because it processes RGB images through the vision–language model, while the accompanying textual prompts with task specific instructions remain short. This also explains the very small standard deviation. CoderLLM is the second-largest contributor and shows higher variance because error traces and temporary code are appended during repair cycles. PlannerLLM has a smaller spread than in MiniGrid, as it processes only function signatures and docstrings instead of full code. Overall token usage still increases with later levels as the code knowledge base grows, even though this trend is not directly visible in the table.

5. Discussion

The standalone PDDL generation experiments indicate that strong LLMs can reliably draft solvable domain/problem pairs when paired with a planner/validator in a short, bounded repair loop, whereas smaller LLMs struggle with syntax. A practical limitation is that PDDL generation currently depends on strong proprietary LLMs: in our tests, all smaller open-source models below 8B failed to produce valid PDDL syntax, which limits accessibility and increases operational cost. One mitigation is to enforce syntax with a PDDL context-free grammar and structured output, reducing unparsable generations. A second mitigation is to fine-tune open models on PDDL to improve formatting and domain/problem consistency, while keeping the planner/validator loop as a semantic check. A third mitigation is to use larger open models that can match proprietary performance, at the cost of higher local compute requirements (GPU) or paid inference services. As seen in Figure 6, the feedback loop almost doubles performance from Solved@1 to Solved@10, confirming that structured error summaries and targeted patches convert many near-misses into solutions. We also observe diminishing returns after roughly seven attempts for the strongest models (e.g., o3, o4-mini), indicating that a budget of k = 10 is sufficient in practice. These trends align with recent systems that pair LLM proposals with formal checks and bounded patching [19,47,48,49].
The MiniGrid experiments show that symbolic planning stabilizes early in the curriculum. Once a domain has been accepted by the planner and validator, subsequent tasks are often solved with small edits rather than fresh models. This leads to a high share of domain reuse and relatively few syntax rejections. Failures happen primarily during execution, where cyclical exploration and action ordering errors trigger semantic retries in the coder loop. The coding stage therefore carries the main repair load, which is consistent with the design of SPCA. Most adjustments are small patches, such as adding guards, handling map boundaries, or retrying toggles. These patches converge into a reusable library of skills, making later tasks more efficient. The token use analysis confirms this trend: planner calls remain modest, while coder calls grow as the skill library expands. This growth increases cost and occasionally destabilizes behavior, which highlights the need for slimmer prompts that pass only relevant skills instead of full code contexts.
Reproducing the BabyAI baseline with the original CNN–GRU–LSTM architecture yields results consistent with the original paper [8]. The model requires close to one million episodes to reach 99% success on PickupLoc, while performance after 500 episodes remains low, around 14%, and near zero on harder levels such as PutNextLocal. Although later studies report moderate gains in sample efficiency [11,12,50,51,52,53], reinforcement learning still demands large amounts of training and transfers poorly even between tasks within the same environment. In contrast, SPCA achieves comparable performance in only a few simulation runs per level by combining symbolic validation that filters logically invalid plans, targeted code repair that fixes execution errors without discarding successful logic, and reuse of previously verified domains and skills. This difference highlights a practical trade-off: gradient-based exploration learns robust behaviors but at high cost, whereas structured validation and repair enable fast, compositional generalization, an advantage that becomes crucial when simulations are expensive.
In the ROS 2 + Gazebo setting, the same pattern emerges under more demanding conditions. Symbolic domains remain stable once established, with replanning triggered mostly by motion planning faults or referee detecting collision constraints rather than by syntax errors. As in MiniGrid, most adjustments here are handled by the coding stage, where small patches to skills fix execution problems without altering the higher-level plan. Semantic retries dominate, but syntax errors are filtered out by the compiler and reload checks. Successful patches often adjust approach heights, retreat distances, or gripper timing, which shows that execution-level repair can handle physical margins without discarding the symbolic plan. Token budgets follow the same division as in MiniGrid: SenseLLM remains constant, PlannerLLM is moderate, and CoderLLM grows with the accumulated skill base. The main failure mode is occlusion, which is expected with a single fixed top-down RGB-D camera. Because the camera is mounted above the workspace, the arm sometimes blocks objects, leading to scene descriptions that omit items required by the plan and thereby introduce inconsistencies during execution. From a systems perspective, this limitation is primarily a sensing issue rather than a planning one. Several straightforward upgrades could reduce occlusion-induced instability, including multi-view sensing through an additional side camera or a wrist-mounted camera on the end-effector to provide complementary viewpoints. Stability can also be improved by making execution more perception-aware, for example by re-sensing after large motions, inserting simple guards that verify object visibility before committing to the next substep, or repositioning the arm or camera when visibility is lost. Maintaining a persistent 3D workspace representation built from successive depth observations would further allow the system to retain object hypotheses across short occlusions, while lightweight instance segmentation can help separate objects from arm geometry and clutter. Together, these measures would reduce the propagation of missing-object descriptions into later stages and improve robustness on real hardware.
Our evaluation is simulation-only (MiniGrid and ROS 2 + Gazebo); while Gazebo captures geometry and basic physics, it does not reproduce many real-world factors such as sensor noise and calibration drift, contact uncertainty, unmodeled dynamics, and appearance changes. As a result, the reported success rates should be interpreted as proof-of-concept evidence for the SPCA loop under controlled conditions rather than sim-to-real validation. Closing this gap would require hardware experiments or higher-fidelity simulation with explicit noise/perturbation models, more robust perception, and safety-constrained execution.
Across both environments, the results highlight the importance of dual critics. By separating symbolic validation from code execution, the system can localize errors and apply small repairs rather than discarding entire plans. This makes the loop more sample-efficient than reinforcement learning, which often requires many episodes to discover reliable strategies, and more reliable than end-to-end LLM agents, which lack external checks. At the same time, limitations remain. The coder loop is heavy, often requiring multiple retries per level, and prompt growth is a practical bottleneck. The reliance on strong reasoning language models for valid PDDL generation also suggests that lighter models are not yet adequate for planning roles. Finally, the environments used here, although diverse, remain simplified testbeds. Generalization to real-world robotics with higher variability in tasks, objects, and sensing conditions remains an open question. Note, that real world poses additional challenges, such as reflection and other camera effects in different lighting conditions, the cost of unbound path planning (the robot arm can break), safety of subjects in the room, non availability of deterministic critic in the real world (how do we know a task is finished successfully or it failed), etc.
These findings suggest that SPCA occupies a useful position between classical planning, reinforcement learning, and purely generative LLM agents. Its explicit separation of high-level intent from low-level execution, combined with deterministic critics, yields interpretable outcomes and bounded loops. The approach shows promise for long-horizon tasks under partial observability and physical constraints, while also exposing clear avenues for improvement in perception, prompt efficiency, and validation coverage.

6. Conclusions

In this paper, we introduced the Sense–Plan–Code–Act (SPCA) framework, which combines language models with deterministic critics for planning and execution. The system was tested in MiniGrid and in a ROS 2 + Gazebo setup with a UR5 robot. In both domains SPCA solved most tasks within retry budgets. Symbolic plans stabilized quickly, and most corrections were handled by small code patches, keeping the loop efficient and interpretable.
The main limitations are the heavy repair load in the coding stage, the growth of prompts as skills accumulate, and occlusion issues in the robotic setup. Addressing these challenges will require slimmer prompts, stronger validation, and more robust sensing. Overall, the results show that SPCA provides a practical balance between flexibility and reliability, offering a feasible path for building agents that link high-level reasoning with executable skills.

Supplementary Materials

Short videos demonstrating the task execution in each environment https://www.youtube.com/playlist?list=PLk9fPT9e1s8inTbz-8clYHUUjAuEnypyR, accessed on 22 December 2025. (The videos show only the execution of generated codebase).

Author Contributions

Conceptualization, D.P. and J.Ž.; methodology, D.P.; software, D.P.; validation, D.P. and J.Ž.; formal analysis, D.P.; investigation, D.P.; resources, J.Ž.; data curation, D.P.; writing—original draft preparation, D.P.; writing—review and editing, D.P. and J.Ž.; visualization, D.P.; supervision, J.Ž.; project administration, D.P.; funding acquisition, J.Ž. All authors have read and agreed to the published version of the manuscript.

Funding

This study was partially funded by Slovenian Research Agency No. P2-0209.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All code for the SPCA system, including early prototypes, is available at https://github.com/DrejcPesjak/minigrid-crewai, accessed on 23 December 2025 (specifically, commit 6fcfb1f: https://github.com/DrejcPesjak/minigrid-crewai/commit/6fcfb1f7a1ed60598e39ae1b1930dcc6fa015dd8, accessed on 23 December 2025).

Acknowledgments

During the preparation of this manuscript, the authors used ChatGPT from OpenAI, Model GPT-5, Version—2025 October, for the purposes of grammar checking, reference formatting, and manuscript formatting. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

    The following abbreviations are used in this manuscript:
UR5 Universal Robots 5 kg–class 6-DOF arm
2F-85 Robotiq two-finger gripper, 85 mm stroke
ROS 2 Robot Operating System, version 2
RViz ROS visualization tool
PDDL Planning Domain Definition Language
LLM Large Language Model
GPT Generative Pretrained Transformer
SPCA Sense–Plan–Code–Act loop
RLHF Reinforcement Learning with Human Feedback
CoT Chain of Thought (reasoning style)
RAG Retrieval-Augmented Generation
VLMVision–Language Model
VLAVision–Language–Action model
RLReinforcement Learning
PPOProximal Policy Optimization (RL)
A2CAdvantage Actor-Critic
DQNDeep Q-Networks
STAPSequencing Task-Agnostic Policies
RGB-DColor + depth sensing modality
APIApplication Programming Interface
YAMLYAML Ai not Markup Language

Appendix A. YAML Config

This appendix describes how manipulation tasks are specified in YAML and how the Referee node evaluates success and failure during execution in ROS 2 + Gazebo. The YAML files define task metadata and contact-based conditions, while the Referee monitors Gazebo physics contacts and publishes a task verdict. Listing A1 shows one example level file.
Listing A1. Example level YAML for a stacking task.
Make 08 00022 i001

Appendix A.1. YAML Task Specification

Each level file defines a single task instance. The fields task_group and task_id identify the level for logging and curriculum grouping. The fields title and description provide the natural-language task statement and are the only task-specific text passed into the perception and planning prompts. The field time_limit_s sets a hard timeout enforced by the Referee.
Success and failure are encoded as contact constraints in success and fail. Conditions are specified as unordered pairs of Gazebo model names. The reserved name cobot refers to the robot arm and gripper. The list collisions_true denotes contacts that must be present, while collisions_false denotes contacts that must be absent. Under success, all required contacts must be present and all forbidden contacts must be absent at the same time. Under fail, any listed collisions_true contact triggers immediate failure. The optional fail.collisions_false field supports tasks that require maintaining a particular contact and triggers failure if that contact is not observed.

Appendix A.2. Referee Node and Evaluation Semantics

The Referee subscribes to Gazebo’s physics contact stream at /gazebo/default/physics/contacts and maintains a set of currently active contact pairs. Contacts are treated as active only within a short time-to-live window to avoid using stale events. On each evaluation tick, the Referee applies a fixed priority order. It first checks whether any failure condition is satisfied, then checks whether all success constraints hold simultaneously, otherwise it reports a running status. Independently, it reports timeout when time_limit_s is exceeded.
The Referee is implemented as an external environment-level component that provides an RL-style task signal of success, failure, or timeout from simulator-internal contacts. It is not treated as a capability available to SPCA. The SPCA loop does not query or observe ground-truth contacts directly, and it must instead operate from camera observations and the resulting scene descriptions.

Appendix B. PlannerLLM MiniGrid Prompts

This appendix documents the prompts used by PlannerLLM in the MiniGrid/BabyAI setting. While the prompts are the primary artifact, they also define the exact information given to the planner and the required structure of its outputs. We therefore summarize the prompt inputs, outputs, planning modes, and repair behavior here, and then list the exact prompts.

Appendix B.1. Prompt Inputs

Each PlannerLLM call is instantiated from the first-call user template (Prompt Appendix B.8) plus the system prompt (Prompt Appendix B.7). The following fields are inserted into the user template:
  • Environment metadata:
    env_name: the Gymnasium environment ID (e.g., “BabyAI-GoToDoor-v0”).
    level_name: human-readable level name (e.g., “Go To Door”).
    category_name: curriculum category (e.g., “static_obstacle_navigation”).
    skill: short category-level description (e.g., “Navigate around static obstacles to reach a target”).
    level_description: task description text provided by the curriculum.
  • Current state (MiniGrid):
    mission: verbatim mission string from MiniGrid (e.g., “pick up the yellow key”).
    direction: one of {East, South, West, North}, converted from MiniGrid’s numeric orientation.
    inventory: either None or a single object token such as “key blue” or “ball red”.
    visible_grid): a 7 × 7 array of strings representing the agent’s accumulated partial map.
  • Current Agent code:
    agent_code: a stripped view of the current Agent implementation (method signatures and bodies), provided so the planner can reuse existing high-level actions when they already match the needed semantics and parameter counts.

Appendix B.2. Grid Encoding

Each visible-grid cell is encoded as a space-separated string. The object vocabulary includes unseen, empty, wall, floor, door, key, ball, box, goal, lava, and agent. Optional attributes include colors (red, green, blue, purple, yellow, grey) and, for doors only, state tokens (open, closed, locked). Typical examples are “door red locked”, “key blue”, “wall”, “empty”, “goal green”, and “unseen”.
The grid is an accumulated exploration map built from MiniGrid’s egocentric observations. Observed cells overwrite previously unseen cells, while unobserved regions remain explicitly marked as “unseen”. This map reflects the agent’s current belief and may be incomplete; execution always happens in the simulator, and the map only updates as new observations arrive during action execution.

Appendix B.3. Prompt Outputs

PlannerLLM returns a single JSON object containing:
  • domain: a complete PDDL domain specification.
  • problem: a complete PDDL problem specification.
These files are parsed and solved using Unified Planning with Fast Downward. The resulting plan is converted into an ordered list of grounded high-level actions for execution. A typical plan has the form:
[ open _ red _ door ( agent 1 , door _ red ) , go _ to _ target ( agent 1 , ball , room ) ] .

Appendix B.4. Planning Modes and Contextual Headers

PlannerLLM uses the same PDDL-context template (Prompt Appendix B.9) for both reuse and replan. The difference is the value of ctx_header inserted at the top of the prompt:
  • Fresh planning: generate a new domain and problem from scratch (Prompt Appendix B.8).
  • Reuse: provide a previously successful domain/problem and set ctx_header to instruct problem regeneration while keeping the domain structure (Prompt Appendix B.9).
  • Replan: provide the previous domain/problem but set ctx_header to indicate execution-time failure and encourage larger changes (Prompt Appendix B.9 with a different ctx_header string).
A domain is treated as trusted when the previous level succeeded and the new level belongs to the same curriculum category; in that case, reuse mode is enabled to reduce repeated domain drafting.

Appendix B.5. Planner Syntax Repair Loop

If PDDL parsing, solving, or validation fails, PlannerLLM enters a bounded repair loop of up to eight attempts. In each repair attempt, the previous domain and problem together with the planner error log are provided to the model, and the model is required to resend both corrected files as a single JSON object (Prompt Appendix B.10).

Appendix B.6. Models and Structured Output

PlannerLLM uses two OpenAI chat models: a big model (openai/o3) and a small model (openai/o4-mini). The big model is used for fresh planning, replanning after execution failure, and PDDL syntax repair, while the small model is used only when reusing a trusted domain and regenerating the problem. All PlannerLLM calls enforce a fixed JSON response schema with exactly two string fields, domain and problem, using structured parsing against a Pydantic model. Generation is configured using provider defaults, with no explicit specification of temperature, maximum tokens, or sampling parameters.

Appendix B.7. System Prompt

Make 08 00022 i002

Appendix B.8. User Prompt Template (First Call)

Make 08 00022 i003

Appendix B.9. User Prompt Template (PDDL Reuse/Replan)

Make 08 00022 i004

Appendix B.10. User Prompt Template (Repair)

Make 08 00022 i005

Appendix C. CoderLLM MiniGrid Prompts

This appendix documents the prompts used by CoderLLM in the MiniGrid/BabyAI setting. The prompts define what context is provided to the code generator, what form the generated code must take, and how generation is retried when failures occur.

Appendix C.1. Prompt Inputs

Each CoderLLM call is instantiated from the user template (Prompt Appendix C.6) together with the system prompt (Prompt Appendix C.5). The following fields are inserted into the user template:
  • agent_src, which contains the full current Python source of the Agent class and serves as the base into which new code is merged.
  • agent_state, which provides contextual state used during generation, including the current mission, direction, inventory, the history of executed primitive actions (previous_primitives), and the accumulated map (full_grid).
  • schemas_text, which contains the PDDL (:action ...) schemas for all actions that must be implemented in this call.
  • plan_str, which specifies the grounded plan that must execute successfully once the missing skills are implemented.
The prompt targets only missing skills. Action names are extracted from plan_str, normalized to snake_case, and compared against the methods currently defined on the Agent class. Any action whose corresponding method is absent is treated as a missing skill and included in the prompt. A skill is a Python method implementing one PDDL action schema. The function name matches the PDDL action name, and every PDDL parameter appears as a Python argument in the same order. Parameters that are not used inside the method body are preserved and may be prefixed with _.
MiniGrid skills implement high-level behavior by emitting sequences of primitive action codes. The available primitives are turn_left (0), turn_right (1), move_forward (2), pick_up (3), drop (4), and toggle (5). Each primitive executes synchronously in a single simulator step and behaves deterministically. Success or failure is not returned directly but is reflected in subsequent observations and therefore in updates to the accumulated map.

Appendix C.2. Prompt Outputs

CoderLLM outputs raw Python source consisting only of the added or modified top-level def blocks. No markdown or class wrapper is allowed, and each function definition must begin at column zero. Each generated skill must either return a list[int] or be implemented as a generator that yields primitive codes using yield or yield from. Generators are used when the implementation needs to re-check full_grid between moves.
The generated patch is merged into the Agent class using AST-based rewriting. The patch is parsed, the Agent class is located, methods are appended or replaced as needed, any required imports are merged at the module level, and the updated module is written back.

Appendix C.3. Repair and Critics

CoderLLM is validated by two critics. The syntax and interface critic runs before execution and checks that the returned code can be parsed, merged, and reloaded successfully. Failures at this stage include invalid syntax, incorrect indentation, failure to locate the Agent class during merging, missing imports, name errors, and type errors raised during module reload.
The semantic critic runs during plan execution in the simulator after the updated module has been successfully reloaded. Semantic failures include missing methods referenced by the plan, exceptions raised during a skill call, lack of progress or oscillatory behavior classified as stuck, reward-based termination failures classified as reward_failed, completion of the plan without satisfying the mission classified as goal_not_reached, and other uncaught runtime errors.
Whenever a failure occurs, the corresponding error log is provided to CoderLLM using the repair template (Prompt Appendix C.7). The model must then produce a complete replacement for the previously returned code block. The internal syntax and interface repair loop is bounded to five attempts per call. If the code reloads successfully but execution fails, the system re-invokes CoderLLM with the execution trace and retries execution of the same plan up to a bounded semantic retry budget.

Appendix C.4. Models and Decoding

CoderLLM uses OpenAI Codex (codex-mini-latest) for all code generation and repair calls. Generation is configured with provider defaults, and no explicit temperature, maximum token limit, or sampling parameters are specified.

Appendix C.5. System Prompt

Make 08 00022 i006

Appendix C.6. User Prompt Template (First Call)

Make 08 00022 i007

Appendix C.7. User Prompt Template (Repair)

Make 08 00022 i008

Appendix D. SenseLLM ROS 2 + Gazebo Prompts

This appendix documents the prompts used by SenseLLM in the ROS 2 + Gazebo setting. SenseLLM converts a single camera image of the workspace into a short natural-language scene description that is later consumed by PlannerLLM; while only two textual prompts are defined explicitly, each SenseLLM call conceptually consists of three inputs: a system prompt, a user prompt describing the task, and the camera image itself.

Appendix D.1. Prompt Inputs

Each SenseLLM call is instantiated from the system prompt (Prompt Appendix D.4) and the user prompt template (Prompt Appendix D.5), together with a visual input provided to the model. The inputs are:
  • Task context, inserted into the user prompt:
    title: a short task title (e.g., “Touch the blue cube”).
    description: a natural-language task description (e.g., “Move the gripper to make contact with the blue cube.”).
  • Visual input, provided alongside the user prompt:
    a single top-down RGB image of the workbench captured from a fixed camera in Gazebo.
The image is acquired as a BGR array from the ROS 2 camera topic, compressed to JPEG, and encoded as a base64 ASCII string. This base64-encoded image is passed to the vision-language model as an input_image element alongside the textual user prompt, effectively acting as an implicit third prompt component. The base64 encoding serves only as a transport representation for the image and does not imply that the model processes the image as text.

Appendix D.2. Prompt Outputs

SenseLLM produces a plain-text scene description consisting of 4–8 short sentences. The output is not structured as JSON and no schema is enforced. The description is constrained by the system prompt to include only what is visible in the image and to use coarse spatial language such as left, middle, right, and near or far. Typical content includes which colored objects are present, whether objects are touching the table or each other, the gripper’s relative position, and obvious obstacles. When visual information is ambiguous, the model is instructed to state uncertainty explicitly.
The resulting text is passed verbatim to the planning stage and serves as the only perceptual input for symbolic planning in the ROS 2 + Gazebo setup.

Appendix D.3. Models and Decoding

SenseLLM uses a vision-language model (openai/gpt-4o-mini). Generation uses provider defaults, and no explicit temperature, maximum token limit, or sampling parameters are specified.

Appendix D.4. System Prompt

Make 08 00022 i009

Appendix D.5. User Prompt Template

Make 08 00022 i010

Appendix E. PlannerLLM ROS 2 + Gazebo Prompts

This appendix documents the prompts used by PlannerLLM in the ROS 2 + Gazebo setting. The planner follows the same overall design and repair strategy as the MiniGrid PlannerLLM described in Appendix B. The key difference is that planning is driven by natural-language scene descriptions produced by SenseLLM, rather than by structured grid observations. This section therefore focuses on the inputs and constraints specific to the robotic manipulation setup, while omitting details common to both planners.

Appendix E.1. Prompt Inputs

Each PlannerLLM call is instantiated from the system prompt (Prompt Appendix E.6) and the first-call user template (Prompt Appendix E.7), optionally followed by reuse or repair templates. The user prompt contains the following information:
  • Task metadata, provided as plain text:
    task_group and task_id, identifying the curriculum group and level.
    task_title and task_description, describing the manipulation objective in natural language.
  • Scene description (scene_text): a free-form natural-language description generated by SenseLLM from a single camera image. This description is the only perceptual input available to the planner and replaces the structured state used in MiniGrid.
  • Available high-level actions: a stripped outline of the current robot action library, obtained from agent_actions.py and consisting of function signatures with docstrings only. This allows the planner to reason about which actions already exist and when new ones must be introduced.
No numeric coordinates, geometric quantities, or metric information are provided to the planner. All spatial reasoning must be derived from the task text and the scene description using coarse, relational language.

Appendix E.2. Prompt Outputs

PlannerLLM returns a single JSON object containing a complete PDDL domain and problem specification. These files are solved and validated using Unified Planning, and the resulting plan is converted into an ordered list of grounded high-level actions. The plan references only actions present in the available action outline or newly introduced actions defined in the generated domain.

Appendix E.3. Planning Modes and Context Headers

As in the MiniGrid planner, PlannerLLM supports fresh planning, reuse of previously successful PDDL, replanning after execution failure, and syntax repair. Reuse and replan are implemented by providing the previous domain and problem together with a context header (ctx_header) that conditions the model on whether the prior PDDL solved a level successfully or failed during execution (Prompt Appendix E.8). Trusted domains are cached per task group and reused to reduce repeated domain drafting.

Appendix E.4. Planner Repair Loop

If parsing, solving, or validation fails, PlannerLLM enters a bounded repair loop with up to eight attempts. In each repair step, the previous domain and problem together with the planner error log are supplied to the model, which must return a complete replacement for both files as a single JSON object (Prompt Appendix E.9).

Appendix E.5. Models and Decoding

PlannerLLM uses the same model-selection strategy as in the MiniGrid setting. Fresh planning, replanning, and repair use a larger model (openai/o3), while reuse of trusted PDDL uses a smaller model (openai/o4-mini). Outputs are parsed using a structured Pydantic schema enforcing the presence of domain and problem fields. Generation uses provider defaults, and no explicit temperature, maximum token limit, or sampling parameters are specified.

Appendix E.6. System Prompt

Make 08 00022 i011

Appendix E.7. User Prompt Template (First Call)

Make 08 00022 i012

Appendix E.8. User Prompt Template (PDDL Reuse/Replan Context)

Make 08 00022 i013

Appendix E.9. User Prompt Template (Repair)

Make 08 00022 i014

Appendix F. CoderLLM ROS 2 + Gazebo Prompts

This appendix documents the prompts used by CoderLLM in the ROS 2 + Gazebo setting. The role of CoderLLM is analogous to the MiniGrid coder described in Appendix C, but code is generated for a robot action module rather than for an Agent class. The generated code implements the PDDL actions referenced by the plan as top-level Python functions that interact with the robot exclusively through a provided runtime context object (ctx).

Appendix F.1. Prompt Inputs

Each CoderLLM call is instantiated from the system prompt (Prompt Appendix F.5) and the user template (Prompt Appendix F.6), with repair calls using Prompt Appendix F.7. The user prompt contains:
  • agent_src, which is the current actions module source used as the merge base (the temporary agent_actions file that is updated across retries and levels).
  • ctx_src, which is the full definition of the runtime context class and documents the available motion, gripper, perception, and TF utilities that actions may call through ctx.
  • agent_state, which provides brief execution context and typically includes the latest scene description text.
  • schemas_text, which lists the PDDL (:action ...) blocks that must be implemented in this call.
  • plan_str, which specifies the grounded high-level plan that must execute successfully once the missing actions are implemented.
As in the MiniGrid setting, only missing skills are targeted. Action names are extracted from plan_str and compared against the function names currently defined in the temporary actions module. Any missing functions are included in schemas_text. When a semantic retry is triggered, the system re-implements the same set of actions while providing an error trace through the repair template.

Appendix F.2. Prompt Outputs

CoderLLM outputs raw Python source consisting only of added or modified top-level def blocks, optionally preceded by required import statements. No markdown, class wrapper, or additional text is permitted, and each def must begin at column zero. A skill is a top-level function that implements one PDDL action schema. The function name matches the PDDL action name and each PDDL parameter appears as a string argument after ctx in the same order.
Actions follow a callback-based execution style. Public action functions have signature def name(ctx, *string_args, done_callback=None) and do not return a value. They must terminate by calling done_callback(success=True) on success or done_callback(success=False, msg="...") on failure. Private helper functions are distinguished by a leading underscore, may return values, and do not require a callback. Plans pass only symbolic strings such as "cube_blue" or "plate_red" and action signatures are required to avoid numeric coordinates.
The returned patch is merged into the actions module by AST-based rewriting. The merge replaces or appends top-level function definitions by name and merges imports at the module level.

Appendix F.3. Repair and Critics

CoderLLM is validated by a syntax and merge critic and by execution-time feedback. Before execution, the patch and the merged module are parsed as Python AST to ensure syntactic validity and a successful merge into the module. Failures at this stage include invalid syntax, indentation errors, and merge errors caused by malformed top-level definitions.
After a successful merge, the plan is executed and failures are returned as an error trace. Typical execution-time failures include missing actions, exceptions raised inside an action or helper, and explicit done_callback(success=False, ...) reports from an action. On any failure, the corresponding error log is provided to CoderLLM using the repair template (Prompt Appendix F.7) and the model must return a complete replacement for its previously generated code block. The internal syntax repair loop is bounded to five attempts per call, and execution-time retries are bounded at the supervisor level.

Appendix F.4. Models and Decoding

CoderLLM uses OpenAI Codex (codex-mini-latest) for all code-generation and repair calls. Outputs are parsed using a structured schema with a single code field. Generation uses provider defaults, and no explicit temperature, maximum token limit, or sampling parameters are specified.

Appendix F.5. System Prompt

Make 08 00022 i015

Appendix F.6. User Prompt Template (First Call)

Make 08 00022 i016

Appendix F.7. User Prompt Template (Repair)

Make 08 00022 i017

References

  1. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv 2017, arXiv:1707.06347. [Google Scholar] [CrossRef]
  2. Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. In Proceedings of the 4th International Conference on Learning Representations (ICLR), San Juan, Puerto Rico, 2–4 May 2016. [Google Scholar]
  3. Huang, W.; Abbeel, P.; Pathak, D.; Mordatch, I. Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents. In Proceedings of the International Conference on Machine Learning (ICML), Baltimore, MD, USA, 17–23 July 2022; Volume 162, pp. 9118–9147. [Google Scholar]
  4. Bender, E.M.; Gebru, T.; McMillan-Major, A.; Shmitchell, S. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? In Proceedings of the FAccT ’21: 2021 ACM Conference on Fairness, Accountability, and Transparency, Toronto, ON, Canada, 3–10 March 2021; pp. 610–623. [Google Scholar] [CrossRef]
  5. Bommasani, R.; Hudson, D.A.; Adeli, E.; Altman, R.B.; Arora, S.; von Arx, S.; Bernstein, M.S.; Bohg, J.; Bosselut, A.; Brunskill, E.; et al. On the opportunities and risks of foundation models. arXiv 2021, arXiv:2108.07258. [Google Scholar] [CrossRef]
  6. Valmeekam, K.; Marquez, M.; Sreedharan, S.; Kambhampati, S. On the Planning Abilities of Large Language Models - A Critical Investigation. In Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023 (NeurIPS), New Orleans, LA, USA, 10–16 December 2023; Volume 36, pp. 75993–76005. [Google Scholar]
  7. Kambhampati, S.; Valmeekam, K.; Guan, L.; Verma, M.; Stechly, K.; Bhambri, S.; Saldyt, L.; Murthy, A. Position: LLMs Ca not Plan, However, Can Help Planning in LLM-Modulo Frameworks. In Proceedings of the 41st International Conference on Machine Learning (ICML), Vienna, Austria, 21–27 July 2024; Volume 235, pp. 22895–22907. [Google Scholar]
  8. Chevalier-Boisvert, M.; Bahdanau, D.; Lahlou, S.; Willems, L.; Saharia, C.; Nguyen, T.H.; Bengio, Y. BabyAI: A Platform to Study the Sample Efficiency of Grounded Language Learning. In Proceedings of the 7th International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
  9. Fox, M.; Long, D. PDDL2.1: An extension to PDDL for expressing temporal planning domains. J. Artif. Intell. Res. 2003, 20, 61–124. [Google Scholar] [CrossRef]
  10. Helmert, M. The Fast Downward planning system. J. Artif. Intell. Res. 2006, 26, 191–246. [Google Scholar] [CrossRef]
  11. Hui, D.Y.; Chevalier-Boisvert, M.; Bahdanau, D.; Bengio, Y. BabyAI 1.1. arXiv 2020, arXiv:2007.12770. [Google Scholar] [CrossRef]
  12. Cideron, G.; Seurin, M.; Strub, F.; Pietquin, O. HIGhER: Improving instruction following with Hindsight Generation for Experience Replay. In Proceedings of the 2020 IEEE Symposium Series on Computational Intelligence (SSCI), Canberra, Australia, 1–4 December 2020; pp. 225–232. [Google Scholar] [CrossRef]
  13. Agia, C.; Migimatsu, T.; Wu, J.; Bohg, J. STAP: Sequencing Task-Agnostic Policies. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; pp. 7951–7958. [Google Scholar] [CrossRef]
  14. Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.L.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.; et al. Training language models to follow instructions with human feedback. In Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022 (NeurIPS), New Orleans, LA, USA, 28 November–9 December 2022. [Google Scholar]
  15. Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Ichter, B.; Xia, F.; Chi, E.; Le, Q.; Zhou, D. Chain-of-thought prompting elicits reasoning in large language models. arXiv 2022, arXiv:2201.11903. [Google Scholar]
  16. Kojima, T.; Gu, S.S.; Reid, M.; Matsuo, Y.; Iwasawa, Y. Large Language Models are Zero-Shot Reasoners. In Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022 (NeurIPS), New Orleans, LA, USA, 28 November–9 December 2022. [Google Scholar]
  17. Wang, X.; Wei, J.; Schuurmans, D.; Le, Q.V.; Chi, E.H.; Narang, S.; Chowdhery, A.; Zhou, D. Self-Consistency Improves Chain of Thought Reasoning in Language Models. In Proceedings of the Eleventh International Conference on Learning Representations (ICLR), Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  18. Song, C.H.; Sadler, B.M.; Wu, J.; Chao, W.; Washington, C.; Su, Y. LLM-Planner: Few-Shot Grounded Planning for Embodied Agents with Large Language Models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 2986–2997. [Google Scholar] [CrossRef]
  19. Liu, B.; Jiang, Y.; Zhang, X.; Liu, Q.; Zhang, S.; Biswas, J.; Stone, P. LLM+P: Empowering large language models with optimal planning proficiency. arXiv 2023, arXiv:2304.11477. [Google Scholar] [CrossRef]
  20. Guan, L.; Valmeekam, K.; Sreedharan, S.; Kambhampati, S. Leveraging pre-trained large language models to construct and utilize world models for model-based task planning. arXiv 2023, arXiv:2305.14909. [Google Scholar]
  21. Huang, S.; Lipovetzky, N.; Cohn, T. Planning in the Dark: LLM-Symbolic Planning Pipeline Without Experts. In Proceedings of the AAAI’25/IAAI’25/EAAI’25: Proceedings of the Thirty-Ninth AAAI Conference on Artificial Intelligence and Thirty-Seventh Conference on Innovative Applications of Artificial Intelligence and Fifteenth Symposium on Educational Advances in Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; pp. 26542–26550. [Google Scholar] [CrossRef]
  22. Chen, Y.; Arkin, J.; Dawson, C.; Zhang, Y.; Roy, N.; Fan, C. AutoTAMP: Autoregressive Task and Motion Planning with LLMs as Translators and Checkers. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024; pp. 6695–6702. [Google Scholar] [CrossRef]
  23. Huang, W.; Xia, F.; Xiao, T.; Chan, H.; Liang, J.; Florence, P.; Zeng, A.; Tompson, J.; Mordatch, I.; Chebotar, Y.; et al. Inner Monologue: Embodied Reasoning through Planning with Language Models. In Proceedings of the Conference on Robot Learning (CoRL), Auckland, New Zealand, 14–18 December 2022; Volume 205, pp. 1769–1782. [Google Scholar]
  24. Zawalski, M.; Chen, W.; Pertsch, K.; Mees, O.; Finn, C.; Levine, S. Robotic Control via Embodied Chain-of-Thought Reasoning. In Proceedings of the Conference on Robot Learning, Munich, Germany, 6–9 November 2024; Volume 270, pp. 3157–3181. [Google Scholar]
  25. Ahn, M.; Brohan, A.; Brown, N.; Chebotar, Y.; Cortes, O.; David, B.; Finn, C.; Fu, C.; Gopalakrishnan, K.; Hausman, K.; et al. Do as I can, not as I say: Grounding language in robotic affordances. arXiv 2022, arXiv:2204.01691. [Google Scholar] [CrossRef]
  26. Huang, W.; Wang, C.; Zhang, R.; Li, Y.; Wu, J.; Li, F.-F. VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models. In Proceedings of the Conference on Robot Learning (CoRL), Atlanta, GA, USA, 6–9 November 2023; Volume 229, pp. 540–562. [Google Scholar]
  27. Zhi, P.; Zhang, Z.; Zhao, Y.; Han, M.; Zhang, Z.; Li, Z.; Jiao, Z.; Jia, B.; Huang, S. Closed-loop open-vocabulary mobile manipulation with GPT-4V. arXiv 2024, arXiv:2404.10220. [Google Scholar]
  28. Driess, D.; Xia, F.; Sajjadi, M.S.M.; Lynch, C.; Chowdhery, A.; Ichter, B.; Wahid, A.; Tompson, J.; Vuong, Q.; Yu, T.; et al. PaLM-E: An Embodied Multimodal Language Model. In Proceedings of the International Conference on Machine Learning (ICML), Honolulu, HI, USA, 23–29 July 2023; Volume 202, pp. 8469–8488. [Google Scholar]
  29. Kim, M.J.; Pertsch, K.; Karamcheti, S.; Xiao, T.; Balakrishna, A.; Nair, S.; Rafailov, R.; Foster, E.P.; Sanketi, P.R.; Vuong, Q.; et al. OpenVLA: An Open-Source Vision-Language-Action Model. In Proceedings of the Conference on Robot Learning, Munich, Germany, 6–9 November 2024; Volume 270, pp. 2679–2713. [Google Scholar]
  30. MoveIt Task Constructor: Task-Level Motion Planning in MoveIt. Available online: https://moveit.picknik.ai/main/doc/concepts/moveit_task_constructor/moveit_task_constructor.html (accessed on 13 September 2025).
  31. Tesseract Robotics: Motion Planning and Manipulation Framework. Available online: https://github.com/tesseract-robotics/tesseract (accessed on 13 September 2025).
  32. Garrett, C.R.; Lozano-Pérez, T.; Kaelbling, L.P. PDDLStream: Integrating Symbolic Planners and Blackbox Samplers via Optimistic Adaptive Planning. In Proceedings of the Thirtieth International Conference on Automated Planning and Scheduling, Nancy, France, 26–30 October 2020; pp. 440–448. [Google Scholar] [CrossRef]
  33. Liang, J.; Huang, W.; Xia, F.; Xu, P.; Hausman, K.; Ichter, B.; Florence, P.; Zeng, A. Code as Policies: Language Model Programs for Embodied Control. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; pp. 9493–9500. [Google Scholar] [CrossRef]
  34. Singh, I.; Blukis, V.; Mousavian, A.; Goyal, A.; Xu, D.; Tremblay, J.; Fox, D.; Thomason, J.; Garg, A. ProgPrompt: Generating Situated Robot Task Plans using Large Language Models. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; pp. 11523–11530. [Google Scholar] [CrossRef]
  35. Huang, S.; Jiang, Z.; Dong, H.; Qiao, Y.; Gao, P.; Li, H. Instruct2Act: Mapping Multi-modality Instructions to Robotic Actions with Large Language Model. arXiv 2023, arXiv:2305.11176. [Google Scholar]
  36. Wang, G.; Xie, Y.; Jiang, Y.; Mandlekar, A.; Xiao, C.; Zhu, Y.; Fan, L.; Anandkumar, A. Voyager: An Open-Ended Embodied Agent with Large Language Models. Trans. Mach. Learn. Res. 2024, 2024. [Google Scholar]
  37. Yin, S.; Pang, X.; Ding, Y.; Chen, M.; Bi, Y.; Xiong, Y.; Huang, W.; Xiang, Z.; Shao, J.; Chen, S. SafeAgentBench: A benchmark for safe task planning of embodied LLM agents. arXiv 2024, arXiv:2412.13178. [Google Scholar]
  38. GuardrailsAI. Available online: https://www.guardrailsai.com/ (accessed on 13 September 2025).
  39. Jain, A.; Jermaine, C.; Unhelkar, V. RAG-Modulo: Solving sequential tasks using experience, critics, and language models. arXiv 2024, arXiv:2409.12294. [Google Scholar]
  40. Yang, H.; Lin, Z.; Wang, W.; Wu, H.; Li, Z.; Tang, B.; Wei, W.; Wang, J.; Tang, Z.; Song, S.; et al. Memory3: Language modeling with explicit memory. arXiv 2024, arXiv:2407.01178. [Google Scholar]
  41. Smit, A.P.; Grinsztajn, N.; Duckworth, P.; Barrett, T.D.; Pretorius, A. Should we be going MAD? A Look at Multi-Agent Debate Strategies for LLMs. In Proceedings of the 41st International Conference on Machine Learning (ICML), Vienna, Austria, 21–27 July 2024; Volume 235, pp. 45883–45905. [Google Scholar]
  42. Valmeekam, K.; Marquez, M.; Olmo, A.; Sreedharan, S.; Kambhampati, S. PlanBench: An Extensible Benchmark for Evaluating Large Language Models on Planning and Reasoning about Change. In Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023 (NeurIPS), New Orleans, LA, USA, 10–16 December 2023; Volume 36, pp. 38975–38987. [Google Scholar]
  43. Choi, J.; Yoon, Y.; Ong, H.; Kim, J.; Jang, M. LoTa-Bench: Benchmarking Language-oriented Task Planners for Embodied Agents. In Proceedings of the The Twelfth International Conference on Learning Representations (ICLR), Vienna, Austria, 7–11 May 2024. [Google Scholar]
  44. Xie, J.; Zhang, K.; Chen, J.; Zhu, T.; Lou, R.; Tian, Y.; Xiao, Y.; Su, Y. TravelPlanner: A Benchmark for Real-World Planning with Language Agents. In Proceedings of the 41st International Conference on Machine Learning (ICML), Vienna, Austria, 21–27 July 2024; Volume 235, pp. 54590–54613. [Google Scholar]
  45. Su, Y.; Ling, Z.; Shi, H.; Jiayang, C.; Yim, Y.; Song, Y. ActPlan-1K: Benchmarking the Procedural Planning Ability of Visual Language Models in Household Activities. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), Miami, FL, USA, 12–16 November 2024; pp. 14953–14965. [Google Scholar] [CrossRef]
  46. Zhang, M.; Fu, X.; Hao, J.; Han, P.; Zhang, H.; Shi, L.; Tang, H.; Zheng, Y. MFE-ETP: A comprehensive evaluation benchmark for multi-modal foundation models on embodied task planning. arXiv 2024, arXiv:2407.05047. [Google Scholar]
  47. Mahdavi, S.; Aoki, R.; Tang, K.; Cao, Y. Leveraging environment interaction for automated PDDL translation and planning with large language models. In Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024 (NeurIPS), Vancouver, BC, Canada, 10–15 December 2024; Volume 38, pp. 38960–39008. [Google Scholar]
  48. Smirnov, P.; Joublin, F.; Ceravola, A.; Gienger, M. Generating consistent PDDL domains with Large Language Models. arXiv 2024, arXiv:2404.07751. [Google Scholar] [CrossRef]
  49. Zuo, M.; Velez, F.P.; Li, X.; Littman, M.; Bach, S. Planetarium: A Rigorous Benchmark for Translating Text to Structured Planning Languages. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Albuquerque, NM, USA, 29 April–4 May 2025; pp. 11223–11240. [Google Scholar] [CrossRef]
  50. Carta, T.; Romac, C.; Wolf, T.; Lamprier, S.; Sigaud, O.; Oudeyer, P.Y. Grounding large language models in interactive environments with online reinforcement learning. In Proceedings of the International Conference on Machine Learning (ICML), Honolulu, HI, USA, 23–29 July 2023. [Google Scholar]
  51. Pouplin, T.; Kobalczyk, K.; Sun, H.; van der Schaar, M. The Synergy of LLMs & RL Unlocks Offline Learning of Generalizable Language-Conditioned Policies with Low-fidelity Data. In Proceedings of the Forty-Second International Conference on Machine Learning, BC, Canada, 13–19 July 2025. [Google Scholar]
  52. Saghafian, A.; Izadi, A.; Dijujin, N.H.; Baghshah, M.S. CAREL: Instruction-guided reinforcement learning with cross-modal auxiliary objectives. Trans. Mach. Learn. Res. 2025. Available online: https://jmlr.org/tmlr/papers/ (accessed on 13 September 2025).
  53. Yang, F.; Liu, J.; Li, K. LLM-Guided Reinforcement Learning for Interactive Environments. Mathematics 2025, 13, 1932. [Google Scholar] [CrossRef]
Figure 1. The SPCA loop. LLMs generate PDDL and code. Outputs are validated by a planner, compiler, and simulator. Failed artifacts trigger repair, while successful ones are executed and cached.
Figure 1. The SPCA loop. LLMs generate PDDL and code. Outputs are validated by a planner, compiler, and simulator. Failed artifacts trigger repair, while successful ones are executed and cached.
Make 08 00022 g001
Figure 2. SenseLLM example. The model receives an RGB frame and returns a textual scene description.
Figure 2. SenseLLM example. The model receives an RGB frame and returns a textual scene description.
Make 08 00022 g002
Figure 3. Planning workflow. PlannerLLM drafts a PDDL domain and problem. Unified Planning parses, solves and validates. On failure, a bounded repair loop requests minimal patches from the LLM.
Figure 3. Planning workflow. PlannerLLM drafts a PDDL domain and problem. Unified Planning parses, solves and validates. On failure, a bounded repair loop requests minimal patches from the LLM.
Make 08 00022 g003
Figure 4. MiniGrid example. Task: “pick up a ball and go to the red box.” The agent (red triangle) observes only a local 7 × 7 window.
Figure 4. MiniGrid example. Task: “pick up a ball and go to the red box.” The agent (red triangle) observes only a local 7 × 7 window.
Make 08 00022 g004
Figure 5. ROS 2 + Gazebo workspace. RViz visualization (left) and Gazebo simulation (right) with robotic arm, gripper, workbench, and colored objects. A static RGB-D camera observes the table from above.
Figure 5. ROS 2 + Gazebo workspace. RViz visualization (left) and Gazebo simulation (right) with robotic arm, gripper, workbench, and colored objects. A static RGB-D camera observes the table from above.
Make 08 00022 g005
Figure 6. Number of problems solved at attempt k (Solved@k) across 12 models and 11 PDDL problems.
Figure 6. Number of problems solved at attempt k (Solved@k) across 12 models and 11 PDDL problems.
Make 08 00022 g006
Figure 7. Percentage of problems each model solved within k = 10 attempts, including the mean number of attempts. Results are averaged over 5 trials per model–problem pair.
Figure 7. Percentage of problems each model solved within k = 10 attempts, including the mean number of attempts. Results are averaged over 5 trials per model–problem pair.
Make 08 00022 g007
Figure 8. Cumulative levels passed versus simulation runs (episodes).
Figure 8. Cumulative levels passed versus simulation runs (episodes).
Make 08 00022 g008
Figure 9. Per–level duration by category in MiniGrid.
Figure 9. Per–level duration by category in MiniGrid.
Make 08 00022 g009
Figure 10. Token use by category for planner and coder in MiniGrid. Mean with 95% percentile bands.
Figure 10. Token use by category for planner and coder in MiniGrid. Mean with 95% percentile bands.
Make 08 00022 g010
Figure 11. RL success rate on PickupLoc: mean with ±1 sd over 10 trials (validation batch size 512). Reproduced from BabyAI [8].
Figure 11. RL success rate on PickupLoc: mean with ±1 sd over 10 trials (validation batch size 512). Reproduced from BabyAI [8].
Make 08 00022 g011
Table 1. Runtime and dataset statistics for MiniGrid.
Table 1. Runtime and dataset statistics for MiniGrid.
QuantityValue
Curriculum totals
Total levels102
Categories9
Completion rate74.51%
Simulation runs325
Evaluated subset
Evaluated levels76
Evaluated categories8
Runtime/logs
Planner API calls142
Coder API calls225
Active span252.49 min
Time per level 3.12 ± 6.45 min
SPCA rounds per level 1.38 ± 1.00
SPCA rounds (total)112
Table 2. Planning loop statistics in MiniGrid.
Table 2. Planning loop statistics in MiniGrid.
LabelPer–Level Mean ± SDCalls (Share)
fresh 0.10 ± 0.30 8 (5.6%)
reuse 0.85 ± 0.36 69 (48.6%)
replan 0.38 ± 1.00 31 (21.8%)
syntax 0.42 ± 0.94 34 (23.9%)
Table 3. Coding loop statistics in MiniGrid.
Table 3. Coding loop statistics in MiniGrid.
LabelPer–Level Mean ± SDCalls (Share)
first 0.15 ± 0.36 12 (5.3%)
semantic 2.63 ± 6.12 213 (94.7%)
syntax 0.00 ± 0.00 0 (0.0%)
Table 4. Token usage and API cost by model (MiniGrid). Costs are computed from recorded token counts and OpenAI list rates ($/1M tokens) for input and output.
Table 4. Token usage and API cost by model (MiniGrid). Costs are computed from recorded token counts and OpenAI list rates ($/1M tokens) for input and output.
ModelInput Tok.Output Tok.Input $Output $Total $
codex-mini-latest1,834,1151,074,0502.7516.4449.195
o3-2025-04-16557,646188,4801.1151.5082.623
o4-mini-2025-04-16392,107113,7310.4310.5000.932
Total2,783,8681,376,2614.2988.45312.750
Table 5. RL success on selected BabyAI levels (validation batch of 512 episodes). Values are averaged across 10 trials. Last two columns are reported in thousands (min–max).
Table 5. RL success on selected BabyAI levels (validation batch of 512 episodes). Values are averaged across 10 trials. Last two columns are reported in thousands (min–max).
LevelSR@500 eps (%)eps@70% SReps@99% SR
GoToRedBallGrey26.00.8–1.49.8–40.0
GoToRedBall37.20.9–1.49.8–65.1
GoToLocal30.314.7–28.8135.8–353.5
PickupLoc14.852.7–163.9186.3–882.3
Table 6. Runtime and dataset statistics for ROS 2 + Gazebo.
Table 6. Runtime and dataset statistics for ROS 2 + Gazebo.
QuantityValue
Experiment totals
Total levels31
Categories4
Completion rate70.97%
Simulation runs131
Evaluated subset
Evaluated levels22
Evaluated categories3
Runtime/logs
Sense API calls114
Planner API calls81
Coder API calls108
Active span113.0 min
Time per level 5.13 ± 1.72 min
SPCA rounds per level 1.38 ± 0.18
SPCA rounds (total)30
Table 7. Planning loop statistics in ROS 2 + Gazebo.
Table 7. Planning loop statistics in ROS 2 + Gazebo.
LabelPer–Level Mean ± SDCalls (Share)
fresh 0.14 ± 0.19 3 (3.7%)
reuse 1.54 ± 0.55 34 (41.9%)
replan 0.46 ± 0.33 10 (12.3%)
syntax 1.54 ± 1.47 34 (41.9%)
Table 8. Coding loop statistics in ROS 2 + Gazebo.
Table 8. Coding loop statistics in ROS 2 + Gazebo.
LabelPer–Level Mean ± SDCalls (Share)
first 0.46 ± 0.32 10 (9.3%)
semantic 4.46 ± 1.67 98 (90.7%)
syntax 0.00 ± 0.00 0 (0.0%)
Table 9. Token usage in ROS 2 + Gazebo (mean ± sd).
Table 9. Token usage in ROS 2 + Gazebo (mean ± sd).
APITokens
SenseLLM 14 , 448 ± 14
PlannerLLM 2930 ± 998
CoderLLM 13 , 478 ± 2778
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Pesjak, D.; Žabkar, J. Robot Planning via LLM Proposals and Symbolic Verification. Mach. Learn. Knowl. Extr. 2026, 8, 22. https://doi.org/10.3390/make8010022

AMA Style

Pesjak D, Žabkar J. Robot Planning via LLM Proposals and Symbolic Verification. Machine Learning and Knowledge Extraction. 2026; 8(1):22. https://doi.org/10.3390/make8010022

Chicago/Turabian Style

Pesjak, Drejc, and Jure Žabkar. 2026. "Robot Planning via LLM Proposals and Symbolic Verification" Machine Learning and Knowledge Extraction 8, no. 1: 22. https://doi.org/10.3390/make8010022

APA Style

Pesjak, D., & Žabkar, J. (2026). Robot Planning via LLM Proposals and Symbolic Verification. Machine Learning and Knowledge Extraction, 8(1), 22. https://doi.org/10.3390/make8010022

Article Metrics

Back to TopTop