1. Introduction
Behavior Trees (BTs) have become a dominant architecture for robot decision-making due to their modularity, reactivity, and interpretability [1,2]. Unlike finite state machines, BTs provide a hierarchical structure that decomposes complex behaviors into reusable subtrees, making them well suited for multi-task autonomous robots that must navigate, manipulate objects, manage resources, and interact with the environment [3]. However, manual BT design for dynamic multi-competency robotic systems remains labor intensive and scales poorly with increasing task complexity.
Genetic Programming (GP) [4] provides an automated mechanism for BT synthesis by evolving populations of candidate trees through selection, crossover, and mutation [5,6]. GP has demonstrated the ability to generate robot controllers that are difficult to design manually [7]. Nevertheless, GP-based BT evolution frequently suffers from premature convergence, where populations lose behavioral diversity and collapse toward locally optimal solutions that emphasize limited task subsets [8]. This limitation is particularly critical in multi-task robotics, where controllers must integrate diverse behavioral competencies into unified Complete Robots.
Adaptive mutation strategies and diversity preservation techniques have been widely explored to address premature convergence. Mutation rates are commonly increased when fitness stagnation is detected, and mechanisms such as behavioral elitism and behavioral category restoration are used to protect or restore specialized behavioral categories [9,10]. However, these approaches remain fundamentally reactive, responding to scalar stagnation signals rather than proactively guiding the evolutionary trajectory. As a result, mutation rates often remain elevated throughout training, preserving exploration but disrupting promising solutions and delaying convergence.
In parallel, Quality-Diversity (QD) approaches such as Novelty Search [11] and MAP-Elites [12] have been proposed to explicitly promote behavioral diversity during evolution. Novelty Search encourages exploration by rewarding behavioral uniqueness rather than objective performance, while MAP-Elites maintains a structured archive of diverse high-performing solutions across discretized behavior spaces. Although effective in mitigating premature convergence, these methods primarily rely on objective reformulation or archive-based mechanisms and typically require predefined behavior descriptors to guide diversity preservation.
Despite these advances, existing approaches share a common limitation: they rely on predefined heuristics, scalar signals, or manually specified behavior descriptors, which restrict their ability to adaptively interpret complex population dynamics during evolution. In contrast, the proposed framework introduces a supervisory mechanism that performs multi-factor population analysis, jointly considering fitness trends, behavioral distributions, and population quality metrics. This enables phase-aware mutation control and targeted behavioral synthesis, allowing the system to proactively guide the evolutionary trajectory rather than reacting to stagnation signals. As a result, the proposed method facilitates earlier emergence of integrated multi-task behaviors while maintaining diversity under lower mutation pressure.
To address these limitations, we take advantage of recent advances in Large Language Models (LLMs). LLMs can reason over multi-dimensional population states by simultaneously analyzing fitness trends, behavioral distributions, and diversity metrics [13,14]. Related work has explored using LLMs to generate initial BT populations from natural language task descriptions [15]. While such approaches accelerate early-stage performance, they rely heavily on initialization quality and provide limited guidance during later evolutionary stages.
Direct LLM-based BT generation also presents scalability limitations for complex multi-objective robotic control. Single-shot generation makes it difficult to optimize trade-offs among efficiency [16], robustness, and long-horizon coordination [17]. Additionally, decision conditions are typically fixed at design time, limiting the evolution of fine-grained blackboard logic that is critical for context-sensitive execution [18]. In contrast, GP provides a principled framework for progressively refining both BT structure and condition logic through performance-driven selection [19]. Integrating LLMs as supervisory agents rather than direct policy generators preserves evolutionary optimization while enabling strategic exploration guidance.
This paper proposes a hybrid LLM-supervised GP framework for BT evolution that augments a standard GP pipeline with two mechanisms: (1) holistic mutation rate adaptation based on multi-factor population analysis and (2) targeted BT generation to address behavioral gaps. The proposed system maintains the same GP infrastructure as the baseline framework, including adaptive mutation and diversity-preserving mechanisms such as behavioral elitism and behavioral category restoration, ensuring a controlled comparison. The key difference is the replacement of reactive stagnation detection with proactive LLM-based supervision.
Experimental evaluation in a Unity-based multi-task robot simulation demonstrates three key outcomes. The proposed system achieves 28% higher behavioral diversity while reducing mutation rates over time, enables 71.7% faster emergence of Complete Robots ($p_{\mathrm{BH}} = 0.0358$), and introduces only 13% computational overhead. A Complete Robot is defined as a controller that satisfies minimum performance thresholds across all core behavioral domains (garbage collection, humanoid interaction, delivery, and charging), while an Excellent Robot denotes a higher-performing Complete Robot that additionally meets stricter throughput and fitness criteria (see Section 2.3.4 for detailed thresholds). These results indicate that LLM supervision provides an effective and practical enhancement to GP-based BT evolution. The contributions of this paper are summarized as follows:
A hybrid framework integrating LLM supervision into GP-based BT evolution through holistic population analysis and targeted BT generation;
Empirical validation demonstrating accelerated emergence of multi-competency robots with improved behavioral diversity;
Analysis of LLM supervisory decision patterns showing efficient evolutionary guidance with minimal emergency intervention.
The proposed framework introduces a scalable supervisory paradigm in which LLMs operate as strategic evolutionary operators rather than static solution generators. This capability supports automated synthesis of robust multi-task controllers for autonomous robots operating in dynamic and unstructured environments.
2. Methods
2.1. System Overview
Figure 1 illustrates the architecture of the proposed LLM-supervised Genetic Programming (GP) framework for Behavior Tree (BT) evolution. The system extends a conventional GP pipeline for evolving BT-based robot controllers by introducing a supervisory Large Language Model (LLM) that operates at the generation level.
During each generation, candidate BT controllers are evaluated within a Unity-based multi-task robotic simulation to obtain fitness values and behavioral metrics. These include fitness statistics, behavioral diversity measures, specialist distributions, and population health indicators, which are aggregated into a structured population state representation, $S_t$, defined in Section 2.4.1.
This population state is provided to the LLM supervisor once per generation. Based on this multi-dimensional input, the LLM performs holistic analysis to assess the current evolutionary state and outputs two forms of supervisory guidance: (1) an adaptive mutation rate and (2) a targeted BT designed to address underrepresented behavioral competencies. The recommended mutation rate is applied during subsequent GP mutation operations, directly influencing the exploration–exploitation balance. In parallel, the generated BT is validated and inserted into the diversity pool as a semantically guided individual. These injected individuals complement the stochastic variation produced by standard genetic operators and help expand behavioral coverage in the population.
The remainder of the GP pipeline, including selection, crossover, elitism, and niche protection, remains unchanged. By preserving the baseline evolutionary structure while augmenting it with LLM-based supervision, the framework enables context-aware adaptation of evolutionary dynamics without disrupting the self-organizing properties of GP. This integration allows the system to transition from reactive parameter adjustment to proactive evolutionary guidance, improving convergence efficiency and facilitating the emergence of integrated multi-behavior controllers.
2.2. Implementation and Simulation Environment
2.2.1. System Architecture
The system is implemented in Unity [20] using C# and the NPBehave [21] library for BT execution. The architecture consists of three principal components:
A simulation environment that manages robot agents, garbage objects, humanoids, charging stations, and patrol waypoints;
A GP engine responsible for population evolution, fitness evaluation, selection, crossover, and mutation;
An LLM supervision module that interfaces with the Claude API (Anthropic) [22] for adaptive evolutionary control.
All components communicate through a shared GameManager that tracks environmental state, robot statistics, and fitness values. Source code will be provided upon request.
2.2.2. Simulation Environment and Humanoid
The simulation environment in Figure 2 consists of a bounded 300 m × 300 m island terrain populated with dynamically spawned garbage, mobile humanoid agents requesting garbage clearing, fixed charging stations, disposal points, and patrol waypoints supporting exploration. Robots are initialized at the environment center.
Environmental stochasticity is introduced through randomized garbage spawning and humanoid movement. Simulations operate at a 10× time acceleration with a maximum evaluation duration of 420 s. Evaluations terminate early if fitness stagnates for 120 s (or 180 s when battery levels fall below 40%) or remains negative for 60 s. Each robot has a carrying capacity of 30 garbage items and a battery system operating within a 0–100% range, initialized at 70% to encourage charging behavior evolution. Robots are equipped with proximity sensors for detecting garbage, humanoids, charging stations, and obstacles. Operational state is maintained using a blackboard representation storing battery level, carrying capacity, lap count, detection flags, and task execution status.
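The early-termination rules above can be sketched as a single predicate. This is a minimal illustration, not the authors' C# implementation: the `EpisodeStatus` fields and the function name are hypothetical, and the sketch assumes that all durations are measured on the same clock as the 420 s evaluation limit.

```python
from dataclasses import dataclass

# Limits taken from the text; the constant names are illustrative.
MAX_DURATION_S = 420.0
STAGNATION_LIMIT_S = 120.0
LOW_BATTERY_STAGNATION_LIMIT_S = 180.0
LOW_BATTERY_PCT = 40.0
NEGATIVE_FITNESS_LIMIT_S = 60.0


@dataclass
class EpisodeStatus:
    """Hypothetical snapshot of one running evaluation."""
    elapsed_s: float    # time since the evaluation started
    stagnant_s: float   # time since fitness last improved
    negative_s: float   # time fitness has stayed below zero
    battery_pct: float  # battery level in [0, 100]


def should_terminate(status: EpisodeStatus) -> bool:
    """Return True when any of the stopping conditions holds."""
    if status.elapsed_s >= MAX_DURATION_S:
        return True
    # A robot with a low battery gets a longer stagnation allowance.
    if status.battery_pct < LOW_BATTERY_PCT:
        stagnation_limit = LOW_BATTERY_STAGNATION_LIMIT_S
    else:
        stagnation_limit = STAGNATION_LIMIT_S
    if status.stagnant_s >= stagnation_limit:
        return True
    return status.negative_s >= NEGATIVE_FITNESS_LIMIT_S
```

The longer allowance under low battery presumably gives a robot time to reach a charger before being cut off; that reading is an interpretation of the text rather than a stated design goal.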
In addition to environmental interaction, the task setting involves coordination with humanoid agents that accumulate and transfer collected garbage. This introduces a heterogeneous interaction scenario, where effective system performance depends not only on individual task execution but also on inter-agent coordination [23]. Such settings are commonly associated with division of labor and role specialization in multi-agent systems, where different agents or behaviors emerge to handle distinct functional responsibilities.
From an evolutionary perspective, this creates a structured behavioral landscape in which certain policies must specialize in interacting with humanoids, while others prioritize environmental exploration or direct garbage collection. Without explicit mechanisms to preserve such specialization, evolutionary processes may converge toward dominant but incomplete strategies, neglecting critical interaction behaviors.
To address this, the proposed framework incorporates mechanisms that explicitly maintain and recover humanoid-interaction-specialized behaviors during evolution. This ensures that cooperative task components are preserved alongside individual optimization, enabling more robust and functionally diverse solutions.
2.2.3. Behavior Tree Representation
Behavior Trees are implemented using the NPBehave library, where each individual controller is encoded as a BT composed of composite nodes (Selector, Sequence, Parallel), decorator nodes (Inverter, Repeater, BlackboardCondition), and action nodes (MoveToGarbage, PickupGarbage, SeekCharger, ExecuteCharge, MoveToHumanoid, DeliverToClearingPoint, and PatrolWaypoint). Trees are constrained to a maximum depth of 8 and 50 nodes. The blackboard maintains key state variables including batteryLevel, robotCurrentExtraCapacity, lap, foundGarbage, and isHumanoidWantToClear. These structural constraints ensure executability, limit tree bloat, and bound the search space for both GP mutation and LLM-generated BTs.
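The structural constraints above can be checked with a short recursive validator. The sketch below is illustrative only: it represents a tree as nested `(node_type, children)` tuples rather than NPBehave objects, counts the root as depth 1, and assumes decorators take exactly one child and composites at least one. These conventions are assumptions, not details given in the text.

```python
COMPOSITES = {"Selector", "Sequence", "Parallel"}
DECORATORS = {"Inverter", "Repeater", "BlackboardCondition"}
ACTIONS = {
    "MoveToGarbage", "PickupGarbage", "SeekCharger", "ExecuteCharge",
    "MoveToHumanoid", "DeliverToClearingPoint", "PatrolWaypoint",
}
MAX_DEPTH = 8
MAX_NODES = 50


def count_nodes(node, depth=1):
    """Return the node count of a well-formed subtree; raise ValueError otherwise."""
    kind, children = node
    if depth > MAX_DEPTH:
        raise ValueError("tree exceeds maximum depth")
    if kind in ACTIONS:
        if children:
            raise ValueError("action nodes must be leaves")
    elif kind in DECORATORS:
        if len(children) != 1:
            raise ValueError("decorators wrap exactly one child")
    elif kind in COMPOSITES:
        if not children:
            raise ValueError("composites need at least one child")
    else:
        raise ValueError(f"unknown node type: {kind}")
    return 1 + sum(count_nodes(child, depth + 1) for child in children)


def is_valid_tree(root):
    """True when the tree uses only known nodes and respects both size limits."""
    try:
        return count_nodes(root) <= MAX_NODES
    except ValueError:
        return False
```

The same check can serve both GP offspring and LLM-generated trees, which keeps the two sources of individuals inside one bounded search space.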
Agent perception is updated through a virtual sensing module that simulates camera and LiDAR capabilities with a 15 m detection range. Garbage locations, humanoid clearance requests, robot carrying capacity, and battery state are continuously recorded in the blackboard, enabling condition nodes to react to both environmental observations and internal robot status.
2.3. Genetic Programming Framework
The evolutionary framework follows a standard GP formulation [24] for evolving BT-based robot controllers. At each generation, individuals are evaluated in simulation using a scalar fitness score, followed by tournament selection (size 3) [25], subtree crossover (probability 0.8) [26,27], and adaptive mutation [28]. The population size is fixed at 30 individuals, with top-3 elitism applied to preserve high-quality solutions.
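Tournament selection of size 3 and top-3 elitism can be sketched as follows. This is a generic textbook formulation, not the authors' code; the function names are hypothetical, and individuals are treated as opaque objects paired with a parallel list of fitness scores.

```python
import random


def tournament_select(population, fitness, k=3, rng=random):
    """Draw k distinct contenders at random and return the fittest one."""
    contenders = rng.sample(range(len(population)), k)
    winner = max(contenders, key=lambda i: fitness[i])
    return population[winner]


def top_elites(population, fitness, n=3):
    """Return the n individuals with the highest fitness, best first."""
    ranked = sorted(range(len(population)), key=lambda i: fitness[i], reverse=True)
    return [population[i] for i in ranked[:n]]
```

A small tournament keeps selection pressure moderate: a weak individual can still win when it happens to be drawn against weaker contenders, which helps retain diversity in a population of only 30.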
To mitigate premature convergence and maintain behavioral diversity, the framework incorporates several diversity-preserving mechanisms commonly used in multi-behavior evolutionary robotics. These include behavioral elitism to protect high-performing specialists, behavioral resurrection to recover extinct behavioral categories, niche protection to maintain balanced behavioral group distributions, and staged epoch training to progressively emphasize different task competencies. These mechanisms are implemented identically in both the proposed system and the baseline, ensuring that performance differences arise solely from the supervisory strategy.
Population slots are distributed to balance exploitation and exploration, as summarized in Table 1. Behavioral elites and top-ranked individuals occupy approximately 7–15% of the population to maintain selection pressure and preserve high-performing lineages [29]. Tournament-generated offspring constitute the majority of individuals (approximately 60–83%), serving as the primary recombination mechanism. Additional slots are reserved for resurrected individuals to recover underrepresented behavioral archetypes and for diversity individuals, including at least one LLM-generated BT per generation to introduce targeted behavioral variation while maintaining convergence stability.
To ensure reproducibility, a population construction operator, $\Phi$, is defined following a sequential allocation process. First, high-priority individuals are selected from the previous generation, including behavioral elites, resurrected children, and top-ranked elites, with counts determined according to the ranges specified in Table 1. Next, diversity individuals are introduced, consisting of randomly initialized Behavior Trees, with one slot reserved for the LLM-generated BT, $BT_{\mathrm{LLM}}$. Finally, the remaining population slots are filled using tournament selection and subtree crossover until the population size reaches $N$. This procedure ensures that the population size remains fixed while balancing exploitation and exploration across generations.
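The sequential allocation can be sketched as below. The slot counts are placeholders, since the actual ranges are given in Table 1; `random_bt` and `make_offspring` are hypothetical callables standing in for random tree initialization and for tournament selection followed by subtree crossover.

```python
def construct_population(behavioral_elites, resurrected, top_elites, llm_bt,
                         random_bt, make_offspring, size=30, n_diversity=3):
    """Fill a fixed-size population from several sources, in priority order."""
    next_population = []

    # 1. High-priority individuals carried over from the previous generation.
    next_population.extend(behavioral_elites)
    next_population.extend(resurrected)
    next_population.extend(top_elites)

    # 2. Diversity slots: random trees, with one slot given to the LLM-generated BT.
    next_population.extend(random_bt() for _ in range(n_diversity - 1))
    next_population.append(llm_bt)

    # Guard against over-allocation so the population size stays fixed.
    next_population = next_population[:size]

    # 3. Remaining slots: offspring from tournament selection and crossover.
    while len(next_population) < size:
        next_population.append(make_offspring())
    return next_population
```

Because the offspring step runs last, it absorbs whatever slots the earlier categories leave unused, which is how the population size stays constant when the elite and resurrection counts vary between generations.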
2.3.1. Staged Training Schedule
Training is conducted over eight epochs with five generations per epoch (41 total generations). Stage-dependent reward multipliers, inspired by the principles of curriculum learning [30], are applied to progressively shift selective pressure from basic garbage pickup toward energy management, humanoid interaction, and delivery behaviors. Early epochs emphasize pickup behavior to promote productive task engagement and prevent collapse into patrol-only locomotion, while later stages increase weighting on charging and coordination behaviors to encourage multi-task integration. The reward weighting schedule used in this study is summarized in Table 2.
This staged curriculum mitigates premature specialization by dynamically reshaping the fitness landscape across training. Early phases encourage behavioral discovery and task participation, whereas later phases reinforce coordinated task execution and efficiency. In the proposed framework, the staged schedule also complements LLM-supervised mutation control by aligning evolutionary pressure with the desired progression from specialist behaviors toward integrated multi-competency robots.
2.3.2. Fitness Function
The fitness function combines multiple reward components reflecting task performance and operational efficiency, including pickup efficiency, delivery throughput, charging behavior, humanoid interaction, patrol productivity, and survival, together with penalties for idle time, collisions, leftover garbage, and battery depletion. The overall fitness formulation is defined as:
$$F = \sum_{k} w_k \, \tilde{r}_k - \sum_{j} \tilde{p}_j \tag{1}$$
where $\tilde{r}_k$ and $\tilde{p}_j$ are the normalized reward and penalty components and $w_k$ are stage-dependent reward multipliers defined in Table 2. All reward and penalty components are normalized to dimensionless quantities to ensure comparability across heterogeneous metrics. Specifically, each component $c_i$ is scaled as:
$$\tilde{c}_i = \frac{c_i}{c_i^{\max}} \tag{2}$$
where $c_i^{\max}$ denotes the maximum achievable or empirically observed value for that component within a simulation episode. This normalization ensures that all terms contribute proportionally to the overall fitness and prevents dominance of any single component due to scale differences.
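One possible reading of this formulation is a weighted sum of normalized rewards minus normalized penalties. The sketch below follows that reading. The component names, the default multiplier of 1.0, and the treatment of a non-positive maximum are all assumptions made for illustration, and the actual per-component values belong to Tables 2 and 3.

```python
def normalize(value, max_value):
    """Scale a raw component by its maximum so that it becomes dimensionless."""
    if max_value <= 0:
        return 0.0  # assumption: a component with no valid maximum contributes nothing
    return value / max_value


def fitness(rewards, penalties, maxima, multipliers):
    """Weighted sum of normalized rewards minus the sum of normalized penalties."""
    total = 0.0
    for name, raw in rewards.items():
        # Stage-dependent multiplier; 1.0 when the stage does not reweight this term.
        total += multipliers.get(name, 1.0) * normalize(raw, maxima[name])
    for name, raw in penalties.items():
        total -= normalize(raw, maxima[name])
    return total
```

For example, with 10 of a possible 20 pickups weighted by 2.0, 1 of 2 possible charges, and 25 of a possible 100 s of idle time, the score is 2.0 × 0.5 + 0.5 − 0.25 = 1.25.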
The reward components corresponding to pickup, humanoid interaction, clearing, and charging behaviors are directly aligned with the Working, Complete, and Excellent robot classifications used by the LLM supervisor, enabling consistent evaluation and supervisory decision-making. The detailed reward structure, including base rewards and per-action contributions, is summarized in Table 3.
2.3.3. Specialist Classification and Behavioral Diversity
To capture behavioral characteristics beyond scalar fitness, individuals are categorized using rule-based specialist classifications derived from simulation statistics. For each robot, the following behavioral metrics are recorded:
number of direct garbage pickups;
number of humanoid interactions;
number of clearing point deliveries;
number of battery charging attempts;
total distance covered during evaluation;
elapsed time since the last non-movement behavioral action.
Distance traveled is used to differentiate exploratory patrol from stationary or oscillatory movement, while inactivity duration identifies behavioral stalling even when fitness values remain stable. Non-movement actions include pickup, delivery, humanoid interaction, and charging, while locomotion is excluded from inactivity measurement.
Specialist categories are assigned using deterministic rules derived from observed behavior frequencies. Let $n_p$ denote the number of direct garbage pickups and $n_h$ denote the number of humanoid interactions. Individuals are classified as:
Pickup Specialists if ;
Humanoid Specialists if ;
Balanced Robots otherwise.
Additional binary labels are assigned independently of the primary specialist category:
Charging Robots if at least one battery charging attempt is recorded;
Clearing Robots if at least one clearing point delivery is recorded;
Patrol-Only Robots if no direct garbage pickups and no humanoid interactions are observed.
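The three binary labels follow directly from the recorded counts and can be sketched as a small function. The function and label names are hypothetical; the primary specialist category is left out because it depends on thresholds that are not reproduced here.

```python
def binary_labels(pickups, humanoid_interactions, deliveries, charge_attempts):
    """Return the set of secondary labels that apply to one robot."""
    labels = set()
    if charge_attempts >= 1:
        labels.add("Charging")
    if deliveries >= 1:
        labels.add("Clearing")
    if pickups == 0 and humanoid_interactions == 0:
        labels.add("PatrolOnly")
    return labels
```

Returning a set reflects that the labels are assigned independently: a robot that charges but never picks up garbage or meets a humanoid is both a Charging Robot and a Patrol-Only Robot.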
Individuals may belong to multiple categories simultaneously, enabling representation of overlapping behavioral competencies. These classifications support population monitoring and LLM supervision through behavioral diversity estimation and engagement analysis. Behavioral diversity, $D$, is quantified using a dominance-based metric shown in Equation (3).
$$D = 1 - \frac{N_{\max}}{N} \tag{3}$$
where $N$ denotes total population size and $N_{\max}$ represents the number of individuals in the largest specialist category. Diversity approaches zero when the population collapses into a single behavioral type and increases as specialist distributions become balanced. Behavioral diversity is evaluated jointly with traversal distance and inactivity duration to distinguish productive exploration from non-productive wandering or behavioral deadlock, enabling early detection of premature convergence and targeted supervisory intervention.
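A dominance-based diversity score of this kind can be computed in a few lines. The sketch assumes the form one minus the share of the largest category, which matches the stated limiting behavior, and it takes one primary category label per individual. Both points are assumptions made for illustration.

```python
from collections import Counter


def behavioral_diversity(categories):
    """One minus the population share of the most common specialist category."""
    if not categories:
        return 0.0
    largest = max(Counter(categories).values())
    return 1.0 - largest / len(categories)
```

A population of 30 robots that are all Pickup Specialists scores 0, while an even three-way split scores 1 − 10/30 ≈ 0.67. The score depends only on the dominant group, so it reacts immediately when one behavioral type starts to take over.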
2.3.4. Population Quality Metrics
Population quality is evaluated using the Working Robot Percentage, $W$, which measures the proportion of robots performing productive tasks. A robot is classified as working if it performs at least one garbage pickup or humanoid interaction. Let $N$ represent population size and $N_{\mathrm{work}}$ represent working robots. The working ratio is defined in Equation (4) and serves as a coarse indicator of functional competence, helping detect stagnation, unproductive exploration, and locomotion-only behaviors.
$$W = \frac{N_{\mathrm{work}}}{N} \times 100\% \tag{4}$$
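The working ratio can be computed directly from per-robot counts. In this sketch each robot is represented as a hypothetical `(pickups, humanoid_interactions)` pair, and the result is returned as a fraction in [0, 1]; multiply by 100 to express it as a percentage.

```python
def working_ratio(stats):
    """Fraction of robots with at least one pickup or humanoid interaction."""
    if not stats:
        return 0.0
    working = sum(
        1 for pickups, interactions in stats
        if pickups >= 1 or interactions >= 1
    )
    return working / len(stats)
```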
Population quality is further characterized using hierarchical multi-task competence metrics summarized in Table 4. Complete Robots satisfy minimum performance thresholds across four core behavioral domains: garbage pickup, humanoid interaction, clearing-point delivery, and battery charging, thereby representing integrated multi-behavior capability. Excellent Robots correspond to high-performing Complete Robots that additionally meet stricter throughput and fitness thresholds. This hierarchical classification supports quantitative evaluation of convergence speed, behavioral integration, and emergence of high-quality controllers. It should be noted that no explicit “inefficient” class is defined in this taxonomy. Instead, robots that fail to meet the criteria for any specialist or composite category are implicitly considered inefficient and remain unclassified. These population quality descriptors provide high-level performance signals that complement specialist diversity metrics and support supervisory decision-making by the LLM module.
2.4. Hybrid LLM Implementation
2.4.1. LLM-Supervised Mutation Rate Adaptation
Mutation rate adaptation is supervised by an LLM that analyzes population-wide evolutionary states each generation. Unlike traditional scalar stagnation or diversity-based heuristics, mutation control is conditioned on a structured JavaScript Object Notation (JSON) [31] population summary that includes fitness trends, specialist distributions, diversity metrics, working ratio, counts of Complete and Excellent Robots, locomotion coverage, and behavioral inactivity duration, as shown in Figure 3.
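A population summary of this kind might be assembled as follows. The key names and nesting are purely illustrative, since the actual schema is specified in Table 5, and the fitness trend is approximated here as the change in best fitness between the two most recent generations.

```python
import json


def build_population_state(generation, best_fitness_history, specialist_counts,
                           diversity, working_ratio, n_complete, n_excellent,
                           mean_distance_m, mean_inactive_s):
    """Serialize one generation's statistics into a JSON string for the LLM prompt."""
    # Assumption: trend is the latest change in best fitness (0.0 in the first generation).
    if len(best_fitness_history) >= 2:
        trend = best_fitness_history[-1] - best_fitness_history[-2]
    else:
        trend = 0.0
    state = {
        "generation": generation,
        "fitness": {"best": best_fitness_history[-1], "trend": trend},
        "specialists": specialist_counts,
        "diversity": diversity,
        "working_ratio": working_ratio,
        "quality": {"complete_robots": n_complete, "excellent_robots": n_excellent},
        "locomotion": {"mean_distance_m": mean_distance_m},
        "inactivity": {"mean_inactive_s": mean_inactive_s},
    }
    return json.dumps(state, indent=2)
```

Sending named fields rather than a single scalar is what lets the supervisor tell apart cases that look identical in fitness alone, such as a stable best score with collapsing diversity.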
Based on this multi-dimensional representation, the LLM recommends mutation rates that balance exploration and exploitation according to the evolutionary phase. Early training typically triggers elevated mutation rates to promote behavioral discovery, whereas later phases progressively reduce mutation to preserve emerging multi-behavior integration. The LLM also classifies evolutionary patterns (e.g., Healthy Progress, Exploration Phase, Premature Convergence, and Stagnation), enabling context-aware mutation regulation.
To ensure robustness under pathological evolutionary conditions, emergency override logic is incorporated. When behavioral diversity falls below predefined thresholds or stagnation persists beyond specified epoch limits, supervisory intervention is triggered immediately, thus bypassing normal control intervals and enforcing elevated mutation rates. This safeguard prevents latent diversity collapse and behavioral deadlock, even when scalar fitness trends appear stable. The structured population state, $S_t$, provided to the LLM and the associated mutation control logic are summarized in Table 5 and Table 6.
2.4.2. LLM-Guided Targeted Behavior Tree Generation
In addition to mutation supervision, the LLM is employed as an explicit generator of new Behavior Trees that are injected into the evolving population. Instead of replacing GP operators or generating initial populations, the LLM injects targeted individuals based on underrepresented behavioral archetypes identified from the same structured population summary.
At each generation, the LLM receives the same structured population state used for mutation supervision, including specialist category distributions and behavioral engagement statistics. Based on this information, the LLM identifies the most underrepresented or missing behavioral archetype and generates a BT that prioritizes the corresponding capability. For example, if humanoid interaction behaviors are absent, a humanoid-focused BT is generated; if charging behavior is insufficient, a charging-oriented BT is produced; if multi-task integration is limited, the LLM generates BTs coordinating pickup, delivery, and charging behaviors.
The BT generation process follows a two-phase prompting strategy. During the exploration phase, defined by the absence of Complete and Excellent Robots, the LLM generates compact specialist trees with restricted depth and node count. These trees emphasize structural simplicity and behavioral focus, improving robustness under elevated mutation rates. Once high-quality individuals emerge, the system transitions to an exploitation phase in which the LLM generates deeper, multi-behavior BTs that encode task integration logic and long-horizon action sequencing.
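The phase switch and the choice of target archetype can be approximated by simple rules. In the framework the LLM makes the targeting decision from the full population summary, so the sketch below is only a rule-based stand-in with hypothetical function and archetype names.

```python
def prompting_phase(n_complete, n_excellent):
    """Exploration until the first Complete or Excellent Robot appears."""
    if n_complete == 0 and n_excellent == 0:
        return "exploration"
    return "exploitation"


def most_underrepresented(archetype_counts):
    """Pick the archetype with the fewest individuals (first one on ties)."""
    return min(archetype_counts, key=archetype_counts.get)
```

With counts such as `{"pickup": 12, "humanoid": 0, "charging": 3}`, the stand-in targets the humanoid archetype, mirroring the example in which absent humanoid interaction leads to a humanoid-focused BT.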
All LLM-generated BTs conform to a constrained grammar that guarantees syntactic validity and compatibility with the GP execution engine. Generated trees are injected into the population and subjected to the same evaluation, selection, crossover, and mutation operators as GP-derived individuals. No elitism or survival bias is applied to LLM-generated BTs, ensuring that retention is determined solely by evolutionary performance. The targeted BT generation policy is summarized in Table 7, while the two-phase prompting strategy and behavioral gap-driven objectives are incorporated within the LLM supervisory architecture shown in Figure 3. This targeted injection strategy proactively maintains behavioral diversity and corrects population imbalance without disrupting promising lineages, enabling earlier emergence of integrated multi-task behaviors compared to reactive diversity preservation methods.
2.4.3. Integrated Supervision Workflow and Intervention Policy
LLM supervision is performed at every generation and consists of two coupled actions: mutation rate recommendation and targeted Behavior Tree generation. After population evaluation and behavioral statistics extraction, a structured population state is passed to the LLM, which determines an appropriate mutation rate and generates a Behavior Tree addressing the most critical behavioral deficiency. The supervision workflow is formalized in Algorithm 1. The algorithm is presented using a structured pseudocode format consistent with standard Genetic Programming formulations [32], with explicit operator definitions to improve clarity and reproducibility.
| Algorithm 1: LLM-Supervised Genetic Programming for Behavior Tree Evolution |
| | Input: Initial population $P_0$, maximum generations $G$, initial mutation rate $\mu_0$ |
| | Output: Evolved population $P_G$ |
| 1 | Initialize population $P_0$ with randomly generated Behavior Trees |
| 2 | Initialize mutation rate $\mu \leftarrow \mu_0$ |
| 3 | for generation $t = 0$ to $G - 1$ do |
| 4 | | Evaluate all individuals in $P_t$ |
| 5 | | Extract population-level metrics and statistics ($F_t$, $D_t$, $W_t$, trends, distribution) |
| 6 | | Construct structured population state representation $S_t$ (as defined in Section 2.4.1) |
| 7 | | $(\mu_{\mathrm{LLM}}, BT_{\mathrm{LLM}}) \leftarrow \pi_{\mathrm{LLM}}(S_t)$ |
| 8 | | if $C_t$ is true then |
| 9 | | | $\mu_t \leftarrow \mu_{\mathrm{emg}}$ |
| 10 | | else |
| 11 | | | $\mu_t \leftarrow \mu_{\mathrm{LLM}}$ |
| 12 | | end if |
| 13 | | // Construct next population using the multi-source GP allocation strategy defined in Section 2.3 |
| 14 | | $P' \leftarrow \Phi(P_t, BT_{\mathrm{LLM}})$ |
| 15 | | $P_{t+1} \leftarrow \mathrm{Mutate}(P', \mu_t)$ |
| 16 | end for |
| 17 | Return $P_G$ |
The structured population state $S_t$ aggregates multiple metrics, including fitness $F_t$, behavioral diversity $D_t$, and working robot ratio $W_t$, together with temporal trends, specialist distribution, and population health statistics. The LLM supervision process can be formally expressed as a mapping applied to $S_t$:
$$(\mu_{\mathrm{LLM}}, BT_{\mathrm{LLM}}) = \pi_{\mathrm{LLM}}(S_t) \tag{5}$$
where $\mu_{\mathrm{LLM}}$ denotes the mutation rate proposed by the LLM, $BT_{\mathrm{LLM}}$ represents the generated Behavior Tree, and $\pi_{\mathrm{LLM}}$ denotes the LLM-based supervisory policy. The population update is then defined as:
$$P_{t+1} = \mathrm{Mutate}\big(\Phi(P_t, BT_{\mathrm{LLM}}),\ \mu_t\big) \tag{6}$$
where $\Phi$ represents the implementation of the GP policy defined in Section 2.3, ensuring a fixed population size by sequentially incorporating elite, resurrected, diversity (including the LLM-generated BT), and tournament-generated individuals. $\mathrm{Mutate}(\cdot, \mu_t)$ represents the standard mutation operator, which applies mutation with rate $\mu_t$ to all individuals. To ensure robustness under unfavorable evolutionary conditions, a crisis-detection mechanism is incorporated. The crisis indicator is defined as:
$$C_t = (D_t < \theta_D) \lor (\tau_t > \theta_S) \tag{7}$$
where $\tau_t$ is the current stagnation duration, and $\theta_D$ and $\theta_S$ denote diversity and stagnation thresholds, respectively. When a crisis is detected, the mutation rate is overridden by an emergency value, $\mu_{\mathrm{emg}}$:
$$\mu_t = \begin{cases} \mu_{\mathrm{emg}} & \text{if } C_t \text{ is true} \\ \mu_{\mathrm{LLM}} & \text{otherwise} \end{cases} \tag{8}$$
At each generation, the next population is constructed using a multi-source allocation strategy (see Table 1), where individuals are drawn from multiple categories including elites, resurrected individuals, diversity candidates, and tournament-generated offspring. The diversity category consists of randomly initialized Behavior Trees, within which one individual is replaced by the LLM-generated BT, $BT_{\mathrm{LLM}}$. Mutation is applied after population construction, and all individuals, including elites and the injected BT, are subjected to mutation, promoting continued exploration while preserving selection bias.
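The crisis check and emergency override amount to a small decision rule, sketched below. The threshold and emergency-rate values are supplied by the caller because the text does not state them here, and the parameter names are hypothetical.

```python
def select_mutation_rate(llm_rate, diversity, stagnation, diversity_threshold,
                         stagnation_threshold, emergency_rate):
    """Return (mutation_rate, crisis_flag) for the current generation."""
    # Crisis: diversity has collapsed, or stagnation has lasted too long.
    crisis = diversity < diversity_threshold or stagnation > stagnation_threshold
    if crisis:
        return emergency_rate, True
    return llm_rate, False
```

Keeping the override outside the LLM call means the safeguard still applies when the model's recommendation is too conservative for a collapsing population.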
The supervision policy adapts dynamically based on population quality and evolutionary patterns. When Complete or Excellent Robots are absent, supervision prioritizes behavioral discovery through elevated mutation rates and specialist BT injection. As multi-behavior competence emerges, supervision shifts toward consolidation by reducing mutation and generating integrative BTs that encode coordinated task execution. Under stagnation or diversity collapse, supervision increases mutation and biases BT injection toward rare or underrepresented behavioral archetypes, corresponding to the control logic defined in Algorithm 1.
Unlike conventional GP approaches that rely on reactive stagnation detection or fixed scheduling, the proposed framework performs continuous, proactive supervision. The LLM is queried at every generation, and targeted BTs are injected consistently as candidate individuals, enabling diversity maintenance, exploration–exploitation balancing, and behavioral gap correction as primary evolutionary control mechanisms. Importantly, all LLM interventions remain fully embedded within the GP loop: generated individuals receive no preferential selection advantage, mutation control remains fully integrated within standard evolutionary operators, and no direct fitness shaping is introduced. The LLM therefore functions strictly as a supervisory policy layer that modulates evolutionary pressure and seeds candidate structures through population-level reasoning. This formulation highlights that the proposed framework introduces supervision at the policy level rather than modifying the underlying evolutionary dynamics.
2.4.4. Design Rationale
The proposed system is guided by five core principles. First, holistic population supervision was adopted to overcome the limitations of scalar stagnation detectors, which cannot distinguish between benign convergence and pathological behavioral collapse. By conditioning supervision on multi-dimensional behavioral statistics, mutation control and BT injection can be aligned with population structure and evolutionary phase.
Second, dominance-based behavioral diversity was selected over entropy-based metrics due to its interpretability, robustness to noise, and direct sensitivity to specialist collapse. This formulation enables immediate detection of behavioral homogenization, which is the dominant failure mode in multi-task evolutionary robotics.
Third, unconditional BT injection is performed at every generation to support proactive diversity preservation. Continuous introduction of behaviorally targeted individuals mitigates gradual diversity erosion without requiring explicit stagnation or extinction triggers.
Fourth, two-phase prompting separates early behavioral discovery from later integration. Compact specialist BTs improve survival under high mutation during exploration, while deeper multi-behavior BTs support consolidation and throughput optimization during exploitation.
Fifth, constrained grammar enforcement guarantees syntactic validity and runtime safety for LLM-generated BTs. This constraint prevents execution failures while maintaining compatibility with GP operators and allowing controlled structural variation.
2.4.5. LLM Configuration
The LLM supervision module interfaces with the Anthropic API using Claude Haiku 4.5 (claude-haiku-4-5-20251001). This model was selected due to its favorable trade-off between reasoning capability and inference efficiency, which is critical given that the LLM is invoked at every generation (approximately 40 API calls per training run). The model is configured with a maximum output length of 2048 tokens and a sampling temperature of 0.7. This setting introduces controlled stochasticity in the generated responses, enabling variation in recommendations across similar population states while maintaining sufficient stability for structured JSON outputs. To ensure robustness, each request is subject to a 30-s timeout with exponential backoff retry logic (up to three attempts) to mitigate transient API failures.
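The timeout and retry behavior described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `call_with_backoff`, the `request` callable, and the 1 s base delay are hypothetical, and the actual API client call is deliberately left abstract.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_backoff(
    request: Callable[[float], T],
    timeout_s: float = 30.0,        # per-request timeout stated in the text
    max_attempts: int = 3,          # "up to three attempts"
    base_delay_s: float = 1.0,      # assumption: the base delay is not given in the text
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call request(timeout_s), retrying failed attempts with exponential backoff."""
    last_error: Optional[Exception] = None
    for attempt in range(max_attempts):
        try:
            return request(timeout_s)
        except Exception as exc:  # a real client would catch only transient errors
            last_error = exc
            if attempt < max_attempts - 1:
                sleep(base_delay_s * (2 ** attempt))  # delay doubles after each failure
    raise RuntimeError(f"LLM request failed after {max_attempts} attempts") from last_error
```

Passing `sleep` as a parameter keeps the retry schedule testable without real waiting.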
At each generation, the LLM receives a structured snapshot of the population state (defined in Section 2.4.1). This representation includes multi-dimensional metrics covering fitness progression, behavioral diversity, specialist distribution, and overall population health, as well as the current evolutionary parameters. The input structure is summarized as follows:
Fitness Metrics: maximum and average fitness, improvement rate, short-term trends, stagnation indicators, and recent history;
Diversity Metrics: behavioral diversity score, temporal change, and trend information;
Specialist Distribution: counts of behavior-specific specialists, balanced workers, Complete Robots, and Excellent Robots;
Population Health: working robot percentage, dominant and underrepresented behavior types, and a diversity index;
Evolutionary Parameters: mutation rate, crossover rate, and population size.
This multi-factor representation enables the LLM to perform holistic reasoning over the evolutionary state, in contrast to conventional approaches that rely on single scalar indicators such as fitness stagnation. These configuration choices ensure consistent structured outputs while maintaining sufficient variability for adaptive supervision, directly influencing the effectiveness of mutation control and targeted BT generation observed in the experimental results.
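As an illustration of the structured snapshot described above, the following sketch groups the five metric categories into one serializable object. All field names are hypothetical (the actual schema is given in the Supplementary Materials), and the sketch only shows how such a state could be turned into JSON for a prompt.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, List


@dataclass
class PopulationSnapshot:
    # Fitness metrics
    max_fitness: float
    avg_fitness: float
    improvement_rate: float
    stagnation_generations: int
    fitness_history: List[float]
    # Diversity metrics
    diversity_score: float
    diversity_delta: float
    # Specialist distribution
    specialist_counts: Dict[str, int]
    complete_robots: int
    excellent_robots: int
    # Population health
    working_robot_pct: float
    dominant_behavior: str
    underrepresented_behaviors: List[str]
    # Evolutionary parameters
    mutation_rate: float
    crossover_rate: float
    population_size: int

    def to_prompt_json(self) -> str:
        """Serialize the snapshot so it can be embedded in an LLM prompt."""
        return json.dumps(asdict(self), indent=2)
```

Keeping the snapshot as a flat, typed record makes it easy to log each generation's input next to the LLM's response.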
2.4.6. LLM Prompt Design
The mutation rate adaptation prompt is designed based on an observed limitation in multi-objective evolutionary settings: early-generation fitness values are often unreliable indicators of true solution quality. In particular, high fitness scores may arise from single-task specialization rather than robust multi-behavior competence, leading conventional fitness-driven adaptation strategies to prematurely reduce exploration.
To address this issue, the prompt adopts a quality-first decision framework, prioritizing the presence of Complete and Excellent Robots over raw fitness values during early stages of evolution. The LLM is instructed to classify the population state into one of seven evolutionary phases, each associated with a recommended mutation rate range listed in Table 8.
These phase definitions serve as guiding heuristics within the prompt rather than rigid rules, allowing the LLM to adapt its recommendations based on the broader population context. A key design feature of the prompt is the explicit handling of fitness noise. When a large discrepancy exists between maximum and average fitness in the absence of quality individuals, the LLM is instructed to treat high fitness values as unreliable and maintain elevated mutation rates. This mechanism mitigates premature convergence by preventing over-exploitation of non-generalizable solutions. The LLM produces a structured JSON response containing the recommended mutation rate, a reasoning trace, a risk assessment, and an intervention flag.
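A minimal sketch of how such a structured response could be consumed is shown below. The key names, the clamping bounds, and the fallback to the current mutation rate are all assumptions made for illustration; the source specifies only that the response contains a mutation rate, a reasoning trace, a risk assessment, and an intervention flag.

```python
import json
from typing import Any, Dict

# Hypothetical key names for the four fields described in the text.
REQUIRED_KEYS = ("mutation_rate", "reasoning", "risk_assessment", "intervention")


def parse_mutation_response(raw: str, current_rate: float,
                            lo: float = 0.01, hi: float = 0.9) -> Dict[str, Any]:
    """Parse the LLM reply; keep the current rate if the reply is unusable."""
    fallback = {"mutation_rate": current_rate, "reasoning": "fallback",
                "risk_assessment": "unknown", "intervention": False}
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return fallback
    if not isinstance(data, dict) or any(k not in data for k in REQUIRED_KEYS):
        return fallback
    rate = data["mutation_rate"]
    if isinstance(rate, bool) or not isinstance(rate, (int, float)):
        return fallback
    # Clamp to assumed bounds so a malformed recommendation cannot destabilize GP.
    data["mutation_rate"] = min(hi, max(lo, float(rate)))
    return data
```

Falling back to the current rate means a failed or malformed response leaves the evolutionary loop unchanged for that generation.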
On the other hand, BT generation is governed by an adaptive two-phase prompting strategy, reflecting the differing structural requirements of early-stage exploration and later-stage exploitation. In Phase 1 (Exploration), triggered when no excellent robots are present, the prompt enforces strict simplicity constraints. Generated trees are limited to a maximum depth of four and approximately 8–10 nodes, focusing on a single primary behavior with minimal fallback structure. This design is motivated by the observation that simpler trees exhibit greater robustness under tournament selection, as they contain fewer potential failure points.
In Phase 2 (Exploitation), activated once high-quality individuals emerge, these constraints are relaxed to allow deeper and more expressive trees (up to depth six and approximately 25 nodes). The prompt encourages integration of multiple behavioral competencies and prioritizes underrepresented behaviors within the population, with urgency determined dynamically (e.g., when charging and clearing behaviors are absent in the population). Both prompting strategies share a unified structured output format.
To ensure validity, all generated trees undergo a four-stage verification process: JSON syntax validation, node type verification, action name validation, and structural constraint enforcement. Only validated trees are incorporated into the evolutionary population. Additionally, annotated examples of valid BT structures are embedded within the prompt to guide the LLM toward correct schema generation and reduce parsing errors.
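The four verification stages can be sketched as follows. The node types, action names, JSON layout, and the convention that the root sits at depth 1 are hypothetical choices for this illustration; only the stage order and the phase limits (depth 4 and about 10 nodes in Phase 1, depth 6 and about 25 nodes in Phase 2) come from the text.

```python
import json
from typing import Any, Dict, Tuple

COMPOSITES = {"Selector", "Sequence"}                               # hypothetical node types
ACTIONS = {"Patrol", "PickUpGarbage", "ClearCapacity", "Charge"}    # hypothetical action names
LIMITS = {1: (4, 10), 2: (6, 25)}                                   # phase -> (max depth, max nodes)


def measure(node: Dict[str, Any]) -> Tuple[int, int]:
    """Return (depth, node count), counting the root as depth 1."""
    children = node.get("children", [])
    if not children:
        return 1, 1
    stats = [measure(child) for child in children]
    return 1 + max(d for d, _ in stats), 1 + sum(n for _, n in stats)


def nodes_ok(node: Any) -> bool:
    """Stages 2 and 3: check node types and action names recursively."""
    if not isinstance(node, dict):
        return False
    kind = node.get("type")
    if kind in COMPOSITES:
        children = node.get("children")
        return (isinstance(children, list) and len(children) > 0
                and all(nodes_ok(child) for child in children))
    if kind == "Action":
        return node.get("name") in ACTIONS
    return kind == "Condition"


def validate_bt(raw: str, phase: int) -> bool:
    try:
        tree = json.loads(raw)             # stage 1: JSON syntax
    except json.JSONDecodeError:
        return False
    if not nodes_ok(tree):                 # stages 2-3: node types, action names
        return False
    depth, count = measure(tree)           # stage 4: structural constraints
    max_depth, max_nodes = LIMITS[phase]
    return depth <= max_depth and count <= max_nodes
```

Running the cheap syntactic checks first means structural limits are only evaluated on trees that are already well formed.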
To support reproducibility, the full LLM prompts for mutation rate adaptation and BT generation, together with representative examples of LLM input and output interactions, are provided in the Supplementary Materials. These examples, taken from actual experimental logs, illustrate how structured population states are translated into mutation rate recommendations and behavior tree designs under different evolutionary conditions. Example cases include:
Early Exploration: high fitness values are correctly identified as unreliable due to the absence of quality individuals, resulting in increased mutation rates;
Refinement Phase: the emergence of high-quality individuals leads to reduced mutation and increased exploitation;
Phase 1 BT Generation: compact specialist trees optimized for robustness;
Phase 2 BT Generation: more complex multi-behavior structures enabled by a mature population.
3. Experimental Setup and Results
3.1. Experimental Setup and Statistical Analysis
A controlled comparative evaluation was conducted between the proposed LLM-supervised genetic programming system and a baseline GP system. Five independent runs were performed for each method, resulting in ten total experimental runs. Each run consisted of eight training epochs with five generations per epoch, yielding 41 generations including the initial population. The population size was fixed at 30 individuals.
All robots were evaluated in a Unity-based simulation for a maximum of 420 simulated seconds with 10× time acceleration. To reduce computational cost, individuals exhibiting persistently poor performance were terminated early based on fitness stagnation criteria. Both systems shared identical evolutionary infrastructure, including selection strategy, elitism, behavioral resurrection, diversity individual injection, and staged reward shaping. The only experimental variable was the supervision strategy: rule-based stagnation detection in the baseline system versus LLM-based holistic supervision in the proposed framework. The LLM configuration and prompt design, including model parameters, structured inputs, and prompting strategies, are described in Sections 2.4.5 and 2.4.6.
Each run requires approximately 10 h of wall-clock time due to real-time simulation constraints, making large-scale replication impractical within the scope of this study. Given the limited number of independent runs and the potential violation of normality assumptions, statistical analysis was conducted using the Mann–Whitney U test [33], a non-parametric rank-based method suitable for small sample sizes. A significance level of α = 0.05 was adopted for all comparisons. For two independent samples of sizes $n_1$ and $n_2$, the $U$ statistic is defined as in Equations (9) and (10):
$$U_1 = n_1 n_2 + \frac{n_1 (n_1 + 1)}{2} - R_1 \tag{9}$$
$$U_2 = n_1 n_2 - U_1 \tag{10}$$
where $R_1$ denotes the sum of ranks of observations in group 1. The smaller of $U_1$ and $U_2$ is used as the test statistic $U$.
The corresponding two-tailed p-value is obtained from the exact sampling distribution of $U$. Given the small sample sizes (n1 = n2 = 5), exact p-values provide more reliable inference than the asymptotic normal approximation. In cases involving tied observations (e.g., robot emergence timing), asymptotic p-values with tie correction were used, as exact p-values are not defined under ties. The type of p-value used is explicitly indicated in the corresponding tables and figure captions.
To account for multiple statistical comparisons, p-values were further adjusted using the Benjamini–Hochberg (BH) procedure [34], and adjusted values (p_BH) are reported alongside raw p-values. For completeness and effect size estimation, the standardized z-score is computed as:
$$z = \frac{U - n_1 n_2 / 2}{\sqrt{n_1 n_2 (n_1 + n_2 + 1) / 12}}$$
To quantify the magnitude of differences, effect sizes were reported using the rank-biserial correlation, which is computed from the standardized statistic as:
Effect sizes were interpreted using standard thresholds (small ≈ 0.1, medium ≈ 0.3, large ≥ 0.5).
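The statistical procedure described in this subsection can be illustrated with a small standard-library sketch. It is not the authors' analysis code: the function names are hypothetical, the exact p-value is obtained by brute-force enumeration and assumes no ties, and the rank-biserial correlation is written in its common $U$-based form $r = 1 - 2U/(n_1 n_2)$, whose sign convention is an assumption here. In practice a library routine such as `scipy.stats.mannwhitneyu` would normally be used.

```python
from itertools import combinations
from typing import List, Sequence, Tuple


def mann_whitney_u(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Return (U1, U2), using midranks when values are tied."""
    pooled = sorted(list(x) + list(y))

    def midrank(v: float) -> float:
        first = pooled.index(v) + 1
        last = first + pooled.count(v) - 1
        return (first + last) / 2

    n1, n2 = len(x), len(y)
    r1 = sum(midrank(v) for v in x)                 # rank sum of group 1
    u1 = n1 * n2 + n1 * (n1 + 1) / 2 - r1
    return u1, n1 * n2 - u1


def exact_p_two_sided(x: Sequence[float], y: Sequence[float]) -> float:
    """Exact two-sided p-value by enumerating all rank assignments (no ties)."""
    n1, n2 = len(x), len(y)
    u_obs = min(mann_whitney_u(x, y))
    hits = total = 0
    for ranks in combinations(range(1, n1 + n2 + 1), n1):
        u1 = n1 * n2 + n1 * (n1 + 1) / 2 - sum(ranks)
        total += 1
        hits += min(u1, n1 * n2 - u1) <= u_obs
    return hits / total


def rank_biserial(u: float, n1: int, n2: int) -> float:
    """Magnitude of the rank-biserial correlation from the smaller U."""
    return 1 - 2 * u / (n1 * n2)


def benjamini_hochberg(pvals: Sequence[float]) -> List[float]:
    """Return BH-adjusted p-values in the original order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):                    # from largest p to smallest
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

For $n_1 = n_2 = 5$ with completely separated groups, $U = 0$, so the smallest attainable exact two-sided p-value is 2/252 ≈ 0.008 and the effect size magnitude is 1.00; $U = 2$ gives a magnitude of 0.84.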
3.2. Fitness Performance and Evolutionary Dynamics
Figure 4 illustrates the generation-by-generation fitness trajectories of the proposed LLM-supervised system and the baseline GP system. The figure reports mean maximum fitness across five independent runs, with shaded regions representing ±1 standard deviation. Vertical dashed lines indicate epoch boundaries, and annotations mark the staged training phases.
As shown in Figure 4, the proposed system exhibits faster early fitness growth, reaching a mean maximum fitness of approximately 35,000 by Generation 5, while the baseline remains near 26,000. This early separation suggests a potential advantage in the discovery of effective behavioral structures during initial exploration. However, this observation should be interpreted with caution, as the early-stage differences do not reach statistical significance in the subsequent analysis. Therefore, this result should be considered alongside the emergence analysis of Complete and Excellent Robots in Section 3.4, which provides additional insight into the timing and quality of behavioral development.
These results are summarized in Table 9, which reports maximum fitness, mean maximum fitness, and final average population fitness across all runs. No statistically significant difference is observed in absolute maximum fitness. The proposed system achieves higher mean maximum fitness, but this difference is also not statistically significant after correction (p_BH = 0.143). This observation suggests a possible tendency toward earlier and more consistent emergence of strong solutions, though further validation with larger sample sizes would be required to confirm this effect. In contrast, the baseline attains comparable peak performance only in later generations and with greater variability.
During mid-training (Generations 10–25), the performance gap widens, with the proposed system maintaining fitness between approximately 38,000 and 42,000, while the baseline fluctuates between 25,000 and 35,000. The higher variance in the baseline trajectory, also reflected in Table 9, suggests less stable adaptation to the stage-dependent selective pressures introduced by the training curriculum, particularly during delivery behavior introduction. The proposed system exhibits smoother phase transitions, reflecting proactive evolutionary intervention.
In the late training stage, both trajectories converge, and no statistically significant difference is observed between the methods, suggesting that LLM supervision does not alter the achievable fitness ceiling. As supported by Figure 4 and Table 9, the primary advantage lies in accelerating convergence and stabilizing evolutionary dynamics rather than increasing ultimate fitness, highlighting the role of LLM supervision as an efficiency-enhancing mechanism.
3.3. Convergence Speed Analysis
Figure 5 and Table 10 present stage-wise fitness comparisons across training. The proposed system achieves higher fitness during both early (+18.4%) and mid-training (+40.2%) stages. However, after multiple comparison correction, only the mid-training stage shows a statistically significant difference (p_BH = 0.036), while the early-stage improvement does not reach statistical significance (p_BH = 0.057). The corresponding rank-biserial effect sizes (r = −0.84 for early, r = −1.00 for mid) indicate large effects in both stages, although the early-stage result should be interpreted with caution due to the limited sample size and lack of statistical significance. Overall, these results suggest that LLM supervision contributes to improved performance, particularly during the mid-training stage where statistically significant differences are observed. The baseline method, relying on random initialization and standard evolutionary operators, requires more generations to achieve comparable performance.
By the final training stage, no statistically significant difference (+0.2%) was observed between systems, confirming that the proposed method accelerates convergence without affecting final solution quality. The baseline system eventually achieved comparable performance but required substantially more generations to do so. This pattern highlights the role of LLM guidance in shortening the exploratory phase of evolution.
3.4. Complete and Excellent Robot Emergence
Figure 6 and Table 11 report the generation at which Complete and Excellent Robots first emerged, along with the corresponding averages. The proposed system produced Complete Robots significantly earlier than the baseline, corresponding to a 71.7% reduction in time-to-emergence. The rank-biserial effect size was r = 1.00, indicating a robust and consistent advantage with complete separation between groups across all experimental runs. Excellent Robots also emerged substantially earlier under LLM supervision, with a 65.2% reduction in emergence time. While the relative improvement was slightly smaller than for Complete Robots, the result suggests that additional optimization beyond basic multi-competency integration still requires evolutionary refinement. Nonetheless, the proposed system reduced wall-clock training time by an estimated 3.6 h (discussed in Section 3.9), yielding substantial practical benefits.
Run-level analysis in Table 11 further demonstrates the consistency of this improvement. The earliest Complete Robot emergence in the proposed system occurred at Generation 5 (Runs 3 and 4), whereas the earliest baseline emergence occurred at Generation 17 (Run 1). Notably, even the slowest emergence in the proposed system (Generation 15, Run 5) preceded the fastest baseline emergence. This non-overlapping distribution indicates that the observed improvement reflects a systematic shift in convergence behavior rather than isolated outlier performance.
3.5. Behavioral Diversity Analysis
Behavioral diversity results are summarized in Table 12. Across all training phases, the proposed system maintained significantly higher diversity than the baseline, with the largest improvement occurring during early training (+36.0%). This improvement is consistent with the targeted injection of LLM-generated Behavior Trees designed to address underrepresented behavioral archetypes. Notably, the proposed system achieved higher diversity despite progressively reducing mutation rates, whereas the baseline maintained consistently high mutation. This result indicates that diversity is more effectively preserved through targeted, population-aware intervention rather than indiscriminate exploration.
Higher diversity is associated with faster emergence of Complete and Excellent Robots, which require integration of multiple behavioral competencies. Preserving specialist behaviors across the population increases the likelihood that complementary behavioral components are available for recombination. The diversity advantage observed in the proposed system, particularly during early and mid-training, provides a broader pool of behavioral blocks for crossover, thereby facilitating earlier discovery of multi-competency architectures and contributing to reduced emergence time.
3.6. Mutation Rate Adaptation
Figure 7 illustrates the mutation rate trajectories of both systems. Although both systems began with similar mutation rates, their adaptation strategies diverged substantially after early training. The proposed system gradually reduced mutation rates, reflecting a transition from exploration to exploitation informed by holistic population assessment. In contrast, the baseline system increased and sustained high mutation in response to perceived stagnation.
Notably, the proposed system achieved higher diversity while using lower mutation overall. This outcome underscores the importance of timing and targeting in mutation control. Early exploration establishes a diverse set of specialists, after which reduced mutation preserves and refines promising structures. The baseline’s sustained high mutation disrupted high-quality individuals before effective integration could occur.
Figure 7 also reveals the inflection point around Generation 10, where the proposed system’s mutation rate drops below baseline. This corresponds to the emergence of quality robots (Complete/Excellent), triggering the LLM to transition from exploration to exploitation. The baseline, lacking this quality-aware trigger, continues interpreting fitness plateaus as requiring more exploration.
3.7. LLM Decision Pattern Analysis
Analysis of LLM supervision decisions shown in Figure 8 revealed consistent and interpretable patterns. Most decisions corresponded to stable evolutionary progress, during which mutation rates were maintained or gradually reduced. Exploration-oriented decisions (26.3%) clustered around early training and curriculum transitions, while exploitation-oriented decisions (16.2%) followed major fitness breakthroughs.
Only a single diversity crisis was observed across all runs, suggesting that proactive supervision contributed to maintaining population stability. In comparison, the baseline system exhibited frequent diversity warning conditions without coordinated corrective mechanisms. These observations indicate that LLM supervision provides not only adaptive evolutionary control but also an interpretable supervisory signal aligned with population-level dynamics.
3.8. Specialist Distribution and Targeting
Table 13 characterizes specialist distributions within the proposed system, revealing evolutionary biases that guided LLM interventions. Clearing behaviors remained consistently underrepresented (mean, M = 5.6, standard deviation, SD = 2.9), indicating reduced survivability due to higher coordination complexity. The LLM mitigated this imbalance by prioritizing clearing-focused behavior tree generation, preserving competency coverage across the population.
Charging behaviors showed high variability (SD = 5.9, range 0–21), reflecting the anticipatory temporal reasoning required for effective energy management, which emerges inconsistently under conventional GP. In contrast, pickup behaviors dominated due to early curriculum emphasis and lower structural complexity, forming a strong evolutionary attractor. Consequently, LLM targeting of pickup specialists occurred primarily under extinction risk conditions.
3.9. Computational Overhead
The computational overhead introduced by LLM supervision is summarized in Table 14. Approximately 40 LLM calls were executed per run, producing an average overhead of 13.3% relative to simulation time. The reported time savings are computed from empirically measured quantities, including the average number of generations required to produce the first Complete Robot and the measured simulation time per generation (15 min). The proposed system achieves a 71.7% reduction in time to Complete Robot emergence, corresponding to an estimated 297 min (~5 h) of simulation time saved per run (19.8 generations × 15 min).
After accounting for the measured LLM overhead of approximately 80 min per run, the net time savings is estimated at 217 min (~3.6 h), representing a 36% reduction in time to first Complete Robot. Although this value is derived from aggregated experimental measurements rather than directly timed end-to-end execution, it provides a realistic estimate of effective wall-clock savings under the current synchronous implementation.
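The arithmetic behind these estimates can be reproduced with a short sketch. The function and parameter names are hypothetical. The percentage denominator is also an assumption: the 13.3% and 36% figures are both recovered when simulation time per run is taken as 40 generations × 15 min = 600 min, which is consistent with the approximately 10 h run time reported in Section 3.1 but is not stated explicitly in the text.

```python
from typing import Dict


def time_savings(generations_saved: float = 19.8,
                 sim_min_per_generation: float = 15.0,
                 llm_overhead_min: float = 80.0,
                 run_generations: int = 40) -> Dict[str, float]:
    """Recompute gross and net time savings from the reported quantities."""
    gross_min = generations_saved * sim_min_per_generation
    net_min = gross_min - llm_overhead_min
    # Assumed denominator: total simulation time of one run.
    run_sim_min = run_generations * sim_min_per_generation
    return {
        "gross_min": gross_min,
        "net_min": net_min,
        "overhead_pct": 100 * llm_overhead_min / run_sim_min,
        "net_pct_of_run": 100 * net_min / run_sim_min,
    }


if __name__ == "__main__":
    for name, value in time_savings().items():
        print(f"{name}: {value:.1f}")
```

Under this reading, the net saving of 217 min corresponds to roughly 36% of a full run's simulation time.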
Although the current implementation performs LLM supervision synchronously by pausing evolutionary progression during LLM queries, the supervision process is inherently parallelizable with simulation. An asynchronous implementation could execute LLM analysis concurrently with robot evaluation, potentially reducing effective wall-clock overhead substantially while preserving supervisory benefits.
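One way such an asynchronous arrangement could look is sketched below using a background thread. This is a hypothetical design, not the implemented system: `run_generation`, `evaluate`, and `query_llm` are placeholder names, and the sketch assumes the LLM can be given the snapshot from the previous generation, so its advice lags the population by one generation.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, List, Sequence, Tuple


def run_generation(
    population: Sequence[Any],
    evaluate: Callable[[Any], float],
    query_llm: Callable[[Dict[str, Any]], Dict[str, Any]],
    prev_snapshot: Dict[str, Any],
    llm_timeout_s: float = 30.0,
) -> Tuple[List[float], Dict[str, Any]]:
    """Evaluate the population while the LLM analyzes the previous snapshot."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(query_llm, prev_snapshot)       # LLM call runs in the background
        fitness = [evaluate(ind) for ind in population]       # simulation proceeds meanwhile
        advice = pending.result(timeout=llm_timeout_s)        # usually already finished
    return fitness, advice
```

The trade-off is that supervision decisions are based on slightly stale population statistics, which would need to be evaluated empirically.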
3.10. Structural Analysis of Behavior Tree Evolution
To further illustrate the impact of LLM-based supervision on structural evolution, representative behavior trees from different stages of training are compared in Figure 9 and Figure 10.
Figure 9 illustrates a representative LLM-generated Behavior Tree injected during early training (Epoch 1, Generation 2), when no Complete or Excellent Robots had yet emerged. The tree exhibits a compact and relatively shallow structure (22 nodes, depth 5), organized as a priority-ordered Selector with four semantically distinct branches: battery management, capacity clearing, garbage collection, and patrol.
Each branch is governed by a single blackboard condition followed by a fixed action sequence, reflecting the LLM’s deliberate construction of specialist behaviors targeting specific deficiencies identified in the population state. This structured and human-interpretable layout highlights the role of semantic guidance in the proposed framework. Rather than relying on random initialization, the LLM injects targeted behavioral priors that address underrepresented competencies, providing a structured scaffold for subsequent evolutionary refinement.
Figure 10 presents an Excellent Robot Behavior Tree evolved through genetic programming by Epoch 7, Generation 1, visualized at depth 5 (full structure: 70 nodes, depth 11). In contrast to the LLM-generated tree, the evolved tree exhibits substantially greater structural complexity, including nested Selectors and Sequences, redundant condition checks, and deeply entangled substructures (partially truncated for visualization clarity). The top-level branches correspond to a complete multi-task competency profile—battery management, capacity clearing, humanoid interaction, garbage collection, and fallback behaviors—yet their integration emerges organically through crossover and mutation rather than explicit design.
This structural contrast illustrates a key dynamic of the proposed framework: the LLM provides semantically meaningful specialist archetypes that seed targeted behavioral diversity, while genetic programming progressively integrates and elaborates these components into complex, high-performing multi-behavior strategies. This combination demonstrates how LLM-based supervision guides the evolutionary process at a semantic level without constraining the emergence of sophisticated solutions. Both full BTs are provided in the Supplementary Materials.
4. Discussions
The experimental results demonstrate that LLM-supervised genetic programming provides clear advantages over conventional GP for multi-task Behavior Tree evolution. A key observation is the significantly faster emergence of multi-competency robots, with a 71.7% reduction in time to Complete Robot formation. This improvement can be attributed to the combination of targeted Behavior Tree injection and phase-aware mutation control, which reduces reliance on unguided recombination and enables earlier formation of structured multi-behavior policies.
Another important finding is that higher behavioral diversity can be maintained despite lower mutation rates. The proposed framework increases diversity while reducing average mutation pressure, indicating that exploration is guided more effectively rather than driven by persistent stochastic perturbation. Early-stage diversity seeding expands behavioral coverage, while later-stage mutation reduction stabilizes integrated solutions. In contrast, the baseline system exhibits oscillatory dynamics and specialist collapse under sustained mutation pressure, highlighting the limitations of purely reactive diversity mechanisms.
Importantly, the proposed method improves convergence efficiency without altering the achievable performance ceiling. Both systems converge to statistically comparable final fitness levels, suggesting that LLM supervision accelerates the search process rather than biasing the fitness landscape. This distinction reinforces the role of the LLM as a supervisory mechanism that enhances optimization dynamics while preserving the underlying evolutionary objective.
To contextualize these results, it is important to clarify the scope of experimental comparisons. While recent GP variants and Quality-Diversity approaches such as Novelty Search and MAP-Elites have demonstrated strong diversity preservation capabilities, the primary objective of this study is to isolate the effect of LLM-based supervision within a controlled evolutionary framework. Accordingly, both the proposed method and the baseline share identical GP infrastructure, including selection, crossover, mutation, elitism, and diversity preservation mechanisms. This design ensures that observed performance differences can be directly attributed to the supervision strategy rather than confounded by algorithmic variations. Incorporating additional GP variants would introduce multiple interacting factors, making it difficult to disentangle the contribution of LLM supervision. Nevertheless, extending the proposed framework to advanced GP variants and Quality-Diversity methods represents an important direction for future work. The experimental environment also integrates multiple concurrent objectives, including navigation, interaction, resource management, and exploration, providing a sufficiently complex testbed for evaluating multi-task behavioral emergence.
From a computational perspective, LLM supervision introduces approximately 13% overhead relative to simulation time, which is offset by reduced training duration. The 71.7% reduction in generations required for Complete Robot emergence corresponds to an estimated 36% reduction in wall-clock time to the first high-quality solution, providing a practical advantage under limited computational resources. Recent work in [15] employs LLMs to generate entire initial populations through prompt-based initialization. Although effective for early performance gains, such approaches confine evolutionary guidance to a static initialization phase and introduce structural bias tied to prompt quality, limiting adaptability as population dynamics evolve.
In contrast, the proposed framework adopts a continuous supervisory role. Rather than constraining the search with fixed initial structures, the LLM dynamically modulates mutation strategies and injects behaviorally targeted individuals throughout training. This enables adaptive exploration–exploitation balancing and ongoing correction of behavioral gaps as population dynamics evolve.
Overall, the hybrid architecture preserves the self-organizing properties of genetic programming while augmenting them with context-sensitive guidance. This integration accelerates multi-task behavioral emergence, improves diversity retention, and stabilizes convergence dynamics without sacrificing evolutionary autonomy. More broadly, the results suggest that supervisory LLMs can function as adaptive meta-controllers for evolutionary search, supporting hybrid intelligence frameworks that combine symbolic reasoning with population-based optimization.
5. Conclusions
This paper presented a hybrid LLM-supervised genetic programming framework for evolving behavior trees in autonomous robots. The primary contribution is a meta-level supervision mechanism in which a large language model analyzes the holistic population state, including behavioral diversity, specialist distribution, robot quality metrics, and training phase context, to guide mutation rate adaptation and targeted behavior tree injection. Unlike conventional fitness-driven heuristics, this approach enables context-sensitive evolutionary control while preserving core GP operators.
The proposed system was evaluated against a baseline GP framework with identical infrastructure, including behavioral elitism, resurrection mechanisms, and diversity injections. By isolating supervision strategy as the only independent variable, the experimental design enabled a controlled comparison between LLM-based holistic supervision and rule-based stagnation detection.
The results demonstrate three key advantages. First, multi-competency robots emerge significantly faster, with Complete and Excellent Robots appearing 71.7% and 65.2% earlier, respectively. Second, behavioral diversity remains higher despite lower mutation rates, indicating more effective exploration–exploitation balancing. Third, both systems converge to comparable final fitness levels, confirming that LLM supervision improves convergence efficiency without altering ultimate performance.
However, several limitations remain. Experiments were conducted in a single simulated environment with fixed task definitions, which may limit generalization. LLM supervision also relies on manually designed prompts and a specific model configuration. Additionally, although statistically significant trends were observed, the limited number of experimental runs reduces sensitivity to smaller effects.
Future work will focus on validating the proposed framework on physical robotic platforms, including sim-to-real transfer, and extending it to more diverse and dynamic task environments and robotic domains. Further research will explore asynchronous or lightweight supervision mechanisms to improve scalability, as well as automated supervision strategies, automated prompt optimization, and model distillation to reduce dependency on specific LLM configurations. Extending the framework to multi-objective and safety-critical domains also represents a promising direction.
Overall, this work demonstrates that large language models can function as effective supervisory agents for evolutionary computation, accelerating convergence while preserving behavioral diversity. Integrating population-level reasoning with genetic programming enables earlier emergence of high-quality multi-task behaviors without compromising long-term performance, highlighting the potential of combining evolutionary search with foundation-model reasoning as a hybrid paradigm for complex autonomous systems.
Supplementary Materials
The following supporting information can be downloaded at:
https://www.mdpi.com/article/10.3390/robotics15050098/s1, File S1: LLM_JSON_Examples; File S2: Prompt_MutationRate_v1.2; File S3: Prompt_BTGeneration_Simple_Phase1; File S4: Prompt_BTGeneration_Complex_Phase2; File S5: BT_Early_LLM_Clearing_Specialist; File S6: BT_Late_Evolved_Excellent_Robot.
Author Contributions
Conceptualization, C.J.T.; methodology, C.J.T.; software, C.J.T.; validation, C.J.T.; formal analysis, C.J.T.; investigation, C.J.T.; resources, C.J.T.; data curation, C.J.T.; writing—original draft preparation, C.J.T.; writing—review and editing, A.M.; visualization, C.J.T. and W.S.L.; supervision, W.S.L. and E.H.; project administration, E.H. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Data Availability Statement
The data supporting the findings of this study, including simulation configurations and experimental results, are available from the corresponding author upon reasonable request. The full source code of the system is part of an ongoing research project and is therefore not publicly available at this stage.
Acknowledgments
During the preparation of this manuscript, the authors used ChatGPT (GPT-5.5) to assist with language editing and improving the clarity and flow of the text. The authors carefully reviewed and revised the generated content and took full responsibility for the final manuscript.
Conflicts of Interest
The authors declare no conflicts of interest.
Abbreviations
The following abbreviations are used in this manuscript:
| Abbreviation | Definition |
|---|---|
| BTs | Behavior Trees |
| GP | Genetic Programming |
| LLMs | Large Language Models |
| JSON | JavaScript Object Notation |
| SD | Standard Deviation |
References
- Iovino, M.; Scukins, E.; Styrud, J.; Ögren, P.; Smith, C. A survey of behavior trees in robotics and AI. Robot. Auton. Syst. 2022, 154, 104096.
- Millington, I.; Funge, J. Artificial Intelligence for Games, 2nd ed.; CRC Press: Boca Raton, FL, USA, 2009.
- Colledanchise, M.; Almeida, D.; Ögren, P. Towards blended reactive planning and acting using behavior trees. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA); IEEE: Montreal, QC, Canada, 2019.
- Koza, J.R. Genetic Programming: On the Programming of Computers by Means of Natural Selection; MIT Press: Cambridge, MA, USA, 1992.
- O’Reilly, U.-M. Genetic programming II: Automatic discovery of reusable programs. Artif. Life 1994, 1, 439–441.
- Angulo, K.A. Evolutionary systems. In Bio-Inspired Artificial Intelligence: Theories, Methods, and Technologies; MIT Press: Cambridge, MA, USA, 2008.
- Bongard, J.C. Evolutionary robotics. Commun. ACM 2013, 56, 74–83.
- Mertan, A.; Cheney, N. Investigating premature convergence in co-optimization of morphology and control in evolved virtual soft robots. arXiv 2024, arXiv:2402.09231.
- Abulail, R.N. Premature avoidance in genetic algorithm using dynamic mutation probability. J. Wirel. Mob. Netw. Ubiquitous Comput. Dependable Appl. 2025, 16, 528–542.
- Tenne, Y. Accelerating the convergence of evolutionary algorithms by trajectory analysis. J. Phys. Conf. Ser. 2025, 3027, 012037.
- Lehman, J.; Stanley, K.O. Abandoning objectives: Evolution through the search for novelty alone. Evol. Comput. 2011, 19, 189–223.
- Mouret, J.-B.; Clune, J. Illuminating search spaces by mapping elites. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO), Madrid, Spain, 11–15 July 2015; pp. 593–600.
- Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Ichter, B.; Xia, F.; Chi, E.H.; Le, Q.V.; Zhou, D. Chain-of-thought prompting elicits reasoning in large language models. In Proceedings of the Neural Information Processing Systems Conference (NeurIPS); Curran Associates Inc.: New York, NY, USA, 2022.
- Bubeck, S.; Chandrasekaran, V.; Eldan, R.; Gehrke, J.; Horvitz, E.; Kamar, E.; Lee, P.; Lee, Y.T.; Li, Y.; Lundberg, S.; et al. Sparks of artificial general intelligence: Early experiments with GPT-4. arXiv 2023, arXiv:2303.12712.
- Kobilov, A.; Lan, J. Automatic robot task planning by integrating large language model with genetic programming. In Proceedings of the International Conference on Advanced Robotics and Mechatronics (ICARM); IEEE: Portsmouth, UK, 2025.
- Huang, W.; Abbeel, P.; Pathak, D.; Mordatch, I. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. arXiv 2022, arXiv:2201.07207.
- Salimans, T.; Ho, J.; Chen, X.; Sidor, S.; Sutskever, I. Evolution strategies as a scalable alternative to reinforcement learning. arXiv 2017, arXiv:1703.03864.
- Verma, P.; Diab, M.; Rosell, J. Automatic generation of behavior trees for the execution of robotic manipulation tasks. In Proceedings of the IEEE International Conference on Emerging Technologies and Factory Automation (ETFA); IEEE: Västerås, Sweden, 2021.
- Perez, D.; Nicolau, M.; O’Neill, M.; Brabazon, A. Evolving behaviour trees for the Mario AI competition using grammatical evolution. In Proceedings of the Applications of Evolutionary Computation Conference; Springer: Berlin, Germany, 2011.
- Unity Technologies. Unity Game Engine. Available online: https://unity.com/ (accessed on 23 October 2025).
- Meniku. NPBehave—An Event-Driven Behavior Tree Library for Code-Based AIs in Unity. Available online: https://github.com/meniku/NPBehave (accessed on 23 October 2025).
- Anthropic. Claude Haiku 4.5. Available online: https://www.anthropic.com/ (accessed on 23 October 2025).
- Bejarano, G. Perception–Awareness–Decision: Socially-Aware Robot Navigation and Interaction. In HRI ’26: 21st ACM/IEEE International Conference on Human-Robot Interaction; Association for Computing Machinery: Edinburgh, UK, 2026.
- Bernardo, R.; Sousa, J.M.C.; Botto, M.A.; Gonçalves, P.J.S. A novel control architecture based on behavior trees for an omni-directional mobile robot. Robotics 2023, 12, 170.
- Hussain, A.; Riaz, S.; Amjad, M.S.; Haq, E.U. Genetic algorithm with a new round-robin based tournament selection: Statistical properties analysis. PLoS ONE 2022, 17, e0274456.
- He, P.; Zhang, L. Rapid prototype development approach for genetic programming. J. Comput. Commun. 2024, 12, 67–79.
- Papadimitriou, K.D.; Murasovs, N.; Giannaccini, M.E.; Aphale, S. Genetic algorithm-based control of a two-wheeled self-balancing robot. J. Intell. Robot. Syst. 2025, 111, 34.
- Möller, F.J.D.; Bernardino, S.; Soares, S.S.R.F.; Souza, L.A.M. An adaptive mutation for Cartesian genetic programming using an ε-greedy strategy. Appl. Intell. 2023, 53, 27290–27303.
- Poli, R.; McPhee, N.F.; Vanneschi, L. Elitism reduces bloat in genetic programming. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO); Association for Computing Machinery: New York, NY, USA, 2008.
- Bengio, Y.; Louradour, J.; Collobert, R.; Weston, J. Curriculum learning. In Proceedings of the International Conference on Machine Learning (ICML); Association for Computing Machinery: New York, NY, USA, 2009.
- Crockford, D.; Bray, T. The JavaScript Object Notation (JSON) Data Interchange Format; Internet Engineering Task Force (IETF): Fremont, CA, USA, 2017.
- Poli, R.; Langdon, W.B.; McPhee, N.F. A Field Guide to Genetic Programming; Lulu Press: Morrisville, NC, USA, 2008.
- Mann, H.B.; Whitney, D.R. On a test of whether one of two random variables is stochastically larger than the other. Ann. Math. Stat. 1947, 18, 50–60.
- Benjamini, Y.; Hochberg, Y. Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B 1995, 57, 289–300.
Figure 1. Overall System Illustration.
Figure 2. Simulation Environment.
Figure 3. LLM-Supervised Evolutionary Control Loop.
Figure 4. Fitness Performance Comparison.
Figure 5. Convergence Comparison.
Figure 6. Robot Emergence Comparison: (a) Box plot of first emergence of complete robot and excellent robot in V2 (proposed system) and Baseline; (b) Mean emergence generation of complete robot and excellent robot in V2 (proposed system) and Baseline.
Figure 7. Mutation Rate Adaptation Strategy.
Figure 8. LLM Decision Pattern Analysis.
Figure 9. Example of an LLM-generated behavior tree during early evolution, exhibiting a compact and specialized structure focused on a single behavioral objective.
Figure 10. Example of an evolved behavior tree from later generations, demonstrating increased depth and integration of multiple behavioral strategies, including exploration, humanoid interaction, and resource management.
Table 1. Population Allocation per Generation.
| Individual Type | Population Slots (count) |
|---|---|
| Behavioral Elites | 0–5 |
| Resurrected Children | 0–2 |
| Elite Individuals | 0–2 |
| Tournament Children | 18–25 |
| Diversity Individuals | 2–4 (1 from LLM) |
Table 2. Staged Epoch Multiplier Configuration.
| Phase | Epoch | Garbage | Battery | Humanoid | Delivery | Leftover |
|---|---|---|---|---|---|---|
| 1 | 1–2 | 2.0 | 1.0 | 1.5 | 1.0 | 0.8 |
| 2 | 3–4 | 1.8 | 2.5 | 1.5 | 1.2 | 1.0 |
| 3 | 5–6 | 1.6 | 2.0 | 1.8 | 2.5 | 1.5 |
| 4 | 7–8 | 1.5 | 1.8 | 2.0 | 2.0 | 2.0 |
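The staged configuration in Table 2 amounts to a lookup from the current epoch to a set of per-category multipliers. The sketch below only encodes that lookup; the names `STAGED_MULTIPLIERS` and `multipliers_for_epoch` are hypothetical, and how the multipliers enter the fitness computation is not shown here.

```python
# Per-phase multipliers copied from Table 2, keyed by the epochs each phase covers.
STAGED_MULTIPLIERS = [
    (range(1, 3), {"garbage": 2.0, "battery": 1.0, "humanoid": 1.5, "delivery": 1.0, "leftover": 0.8}),
    (range(3, 5), {"garbage": 1.8, "battery": 2.5, "humanoid": 1.5, "delivery": 1.2, "leftover": 1.0}),
    (range(5, 7), {"garbage": 1.6, "battery": 2.0, "humanoid": 1.8, "delivery": 2.5, "leftover": 1.5}),
    (range(7, 9), {"garbage": 1.5, "battery": 1.8, "humanoid": 2.0, "delivery": 2.0, "leftover": 2.0}),
]


def multipliers_for_epoch(epoch: int) -> dict:
    """Return the multiplier set for an epoch (hypothetical helper)."""
    for epochs, weights in STAGED_MULTIPLIERS:
        if epoch in epochs:
            return weights
    raise ValueError(f"epoch {epoch} is outside the configured range 1-8")
```

For example, `multipliers_for_epoch(3)` returns the Phase 2 set, in which battery management carries the largest weight.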
Table 3. Reward and Penalty Table.
| Action | Base Reward | Per Action/Item Reward/Penalty |
|---|---|---|
| Collect garbage from ground | +200 | +100 per garbage |
| Gather garbage from humanoid | +360 | +45 per garbage |
| Charging Battery | +800 if robot battery < 15%; +400 if robot battery < 25%; +200 if robot battery < 45%; +20 + robot battery if battery ≥ 45% | −3000 if robot shuts down completely due to 0% battery |
| Delivery to Clearing Station | +100 | +30 per garbage, −150 if delivery with empty capacity |
| Patrolling | +13 for every 20 m with discovery | +10 per new discovery (garbage/humanoid) |
| Idle | 0 | −10 per second after 20 s idle |
| Collision | 0 | −500 per collision with humanoid |
| Leftover Garbage on the map | 0 | −20 per leftover garbage |
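The tiered charging reward in Table 3 can be read as a piecewise function of the battery level when charging starts. The following is a minimal sketch under two assumptions: the battery level is expressed in percent, and "+20 + robot battery" adds the percentage value itself. The function name `charging_reward` is hypothetical and is not taken from the authors' code.

```python
def charging_reward(battery_pct: float) -> float:
    """Reward for a charging action, following the tiers in Table 3.

    Assumes battery_pct is the battery level in percent (0-100).
    """
    if not 0.0 <= battery_pct <= 100.0:
        raise ValueError("battery_pct must be within [0, 100]")
    if battery_pct < 15:
        return 800.0
    if battery_pct < 25:
        return 400.0
    if battery_pct < 45:
        return 200.0
    # At or above 45%: small base reward plus the battery level (assumed reading).
    return 20.0 + battery_pct


# Penalty listed in Table 3 for a complete shutdown at 0% battery.
SHUTDOWN_PENALTY = -3000.0
```

Under this reading the reward drops sharply once the battery is above 45%, which discourages robots from charging when they do not need to.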
Table 4. Hierarchical multi-task competence metrics for population quality evaluation. Robots that do not satisfy any defined category are implicitly considered inefficient and are not assigned to a separate class.
| Tier | Threshold Criteria |
|---|---|
| Excellent Robots | ≥4 pickups, ≥2 humanoid interactions, ≥2 clearing deliveries, total gathered garbage ≥6, fitness >30,000 |
| Complete Robots | ≥2 pickups, ≥1 humanoid interaction, ≥1 clearing delivery, ≥1 charging attempt |
| Full-Service Robots | ≥1 in all pickup, humanoid, clearing, and charging behaviors |
| Balanced Elite Robots | Total gathered garbage > 8, fitness > 15,000 |
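The thresholds in Table 4 can be expressed as a small classifier. This sketch assumes the tiers are checked in the order listed and that the first match wins; the table does not state a precedence rule, so that ordering is an assumption. The `RobotStats` fields and `classify_tier` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RobotStats:
    pickups: int
    humanoid_interactions: int
    clearing_deliveries: int
    charging_attempts: int
    total_gathered: int
    fitness: float


def classify_tier(s: RobotStats) -> Optional[str]:
    """Return the first Table 4 tier whose thresholds are met, else None."""
    if (s.pickups >= 4 and s.humanoid_interactions >= 2
            and s.clearing_deliveries >= 2 and s.total_gathered >= 6
            and s.fitness > 30_000):
        return "Excellent"
    if (s.pickups >= 2 and s.humanoid_interactions >= 1
            and s.clearing_deliveries >= 1 and s.charging_attempts >= 1):
        return "Complete"
    if min(s.pickups, s.humanoid_interactions,
           s.clearing_deliveries, s.charging_attempts) >= 1:
        return "Full-Service"
    if s.total_gathered > 8 and s.fitness > 15_000:
        return "Balanced Elite"
    # No tier satisfied: treated as inefficient, with no separate class.
    return None
```

A robot with 2 pickups, 1 humanoid interaction, 1 clearing delivery, and 1 charging attempt is labeled `"Complete"` regardless of fitness, matching the table, which sets a fitness threshold only for the Excellent and Balanced Elite tiers.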
Table 5. LLM Supervision Inputs (Structured Population State).
| Category | Feature | Description |
|---|---|---|
| Fitness Dynamics | Fitness trend | Mean and best fitness change over recent generations |
| Behavioral Coverage | Specialist counts | Pickup, humanoid, balanced, patrol-only counts |
| Behavioral Diversity | Dominance diversity | Collapse-sensitive diversity metric |
| Functional Engagement | Working robot percentage | Fraction of productive individuals |
| Population Quality | Complete/Excellent counts | Multi-task competence indicators |
| Motion Activity | Distance covered | Average traversal distance |
| Behavioral Activity | Time since last non-movement action | Inactivity and stalling indicator |
| Convergence Signals | Stagnant epochs | Fitness plateau duration |
| Crisis Signals | Diversity threshold breaches | Premature convergence detection |
Table 6. Evolutionary Pattern Classification Label and LLM Mutation Control Logic.
| Pattern Label | Mutation Policy |
|---|---|
| Exploration Phase | High mutation (structural discovery) |
| Healthy Progress | Moderate mutation |
| Rapid Improvement | Low mutation (stabilization) |
| Premature Convergence | High mutation (diversity recovery) |
| Stagnation | Maximum mutation + diversity injection |
Table 7. Targeted Behavior Tree Generation Policy.
| Population Condition | LLM BT Objective | Tree Type |
|---|---|---|
| No Complete/Excellent Robots | Discover missing behaviors | Compact specialist BT |
| Missing humanoid interactions | Introduce humanoid behavior | Humanoid-focused BT |
| Missing charging behavior | Introduce battery management | Charging-focused BT |
| Missing clearing deliveries | Introduce clearing behavior | Clearing-focused BT |
| Low diversity | Recover extinct archetypes | Rare-category BT |
| Emergent multi-task competence | Integrate behaviors | Multi-behavior BT |
| Stagnation or diversity crisis | Force exploration | Highly diverse BT |
Table 8. LLM-Guided Mutation Adaptation Phases and Corresponding Strategies.
| Phase | Typical Conditions | Mutation Rate |
|---|---|---|
| Early Exploration | No high-quality individuals; unreliable fitness signals | 0.40–0.50 |
| Exploitation | Strong fitness improvement with multiple high-quality individuals | 0.10–0.20 |
| Refinement | Emerging quality individuals; stable improvement | 0.15–0.25 |
| Balanced Optimization | High-quality population with stagnating fitness | 0.18–0.28 |
| Balanced Exploration | Moderate quality but declining diversity or stagnation | 0.30–0.40 |
| Aggressive Exploration | Low diversity or unstable population | 0.35–0.45 |
| Crisis Intervention | Severe collapse in diversity and solution quality | 0.45–0.50 |
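One plausible way to use the ranges in Table 8 is as a guardrail that keeps an LLM-proposed mutation rate inside the band for the identified phase. Whether the authors' system clamps rates this way is not stated; the sketch below is an illustration only, and `MUTATION_RANGES` and `clamp_mutation_rate` are hypothetical names.

```python
# Mutation-rate bands copied from Table 8 (phase -> (low, high)).
MUTATION_RANGES = {
    "Early Exploration": (0.40, 0.50),
    "Exploitation": (0.10, 0.20),
    "Refinement": (0.15, 0.25),
    "Balanced Optimization": (0.18, 0.28),
    "Balanced Exploration": (0.30, 0.40),
    "Aggressive Exploration": (0.35, 0.45),
    "Crisis Intervention": (0.45, 0.50),
}


def clamp_mutation_rate(phase: str, proposed_rate: float) -> float:
    """Clamp a proposed rate into the Table 8 band for the given phase."""
    if phase not in MUTATION_RANGES:
        raise KeyError(f"unknown phase: {phase}")
    low, high = MUTATION_RANGES[phase]
    return min(max(proposed_rate, low), high)
```

With this guardrail, a proposal of 0.50 during Exploitation would be reduced to 0.20, while a proposal of 0.20 during Refinement would pass through unchanged.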
Table 9. Fitness Performance Comparison.
| Metric | Proposed (Mean ± SD) | Baseline (Mean ± SD) | U | p (exact) | p (BH) | r |
|---|---|---|---|---|---|---|
| Max Fitness Ever | 47,815 ± 3311 | 48,161 ± 6653 | 12.0 | 1.000 | 1.000 | 0.04 |
| Final Max Fitness | 39,322 ± 5108 | 42,622 ± 4859 | 8.0 | 0.421 | 0.541 | 0.36 |
| Mean Max Fitness | 36,889 ± 3369 | 31,412 ± 3250 | 21.0 | 0.095 | 0.143 | −0.68 |
| Final Average Fitness | 14,699 ± 3260 | 20,221 ± 1916 | 1.0 | 0.016 | 0.036 | 0.92 |
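The U, exact p, and r columns of Table 9 are mutually consistent for two groups of five runs. The sketch below assumes n1 = n2 = 5 (the five runs per system listed in Table 11), no tied observations, and the rank-biserial convention r = 1 − 2U/(n1·n2); these assumptions reproduce the tabulated values, but the formula is not stated in this part of the paper. The function names are hypothetical.

```python
from itertools import combinations
from math import comb


def rank_biserial(u: float, n1: int, n2: int) -> float:
    """Rank-biserial effect size under the assumed sign convention."""
    return 1.0 - 2.0 * u / (n1 * n2)


def exact_two_sided_p(u: float, n1: int, n2: int) -> float:
    """Exact two-sided Mann-Whitney p-value, assuming no ties."""
    n = n1 + n2
    offset = n1 * (n1 + 1) // 2
    # U statistic for every possible assignment of n1 ranks to the first group.
    us = [sum(ranks) - offset for ranks in combinations(range(1, n + 1), n1)]
    total = comb(n, n1)
    lower = sum(1 for v in us if v <= u) / total
    upper = sum(1 for v in us if v >= u) / total
    return min(1.0, 2.0 * min(lower, upper))
```

For the Final Average Fitness row, `exact_two_sided_p(1, 5, 5)` evaluates to 4/252 and `rank_biserial(1, 5, 5)` to 0.92, in line with the reported p = 0.016 and r = 0.92.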
Table 10. Stage-Wise Max Fitness Comparison.
| Training Stage | Proposed (Mean ± SD) | Baseline (Mean ± SD) | Difference | p (exact) | p (BH) | r |
|---|---|---|---|---|---|---|
| Early (Gen 1–10) | 31,723 ± 3415 | 26,784 ± 780 | +18.4% | 0.032 | 0.057 | −0.84 |
| Mid (Gen 11–25) | 38,992 ± 3238 | 27,811 ± 3902 | +40.2% | 0.008 | 0.036 | −1.00 |
| Late (Gen 26–41) | 38,565 ± 3807 | 38,502 ± 5404 | +0.2% | 0.548 | 0.616 | −0.28 |
Table 11. Robot Emergence Timing (generation of first emergence).
| Run | Proposed: Complete Robot | Proposed: Excellent Robot | Baseline: Complete Robot | Baseline: Excellent Robot |
|---|---|---|---|---|
| 1 | 6 | 11 | 17 | 17 |
| 2 | 8 | 10 | 27 | 27 |
| 3 | 5 | 5 | 36 | 36 |
| 4 | 5 | 5 | 30 | 30 |
| 5 | 15 | 17 | 28 | 28 |
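The 71.7% and 65.2% figures quoted in the conclusions follow from the per-run values in Table 11 when "earlier" is read as the relative reduction in mean emergence generation. A short check under that assumed definition (the variable and function names are illustrative):

```python
from statistics import mean

# First-emergence generations per run, copied from Table 11.
proposed_complete = [6, 8, 5, 5, 15]
proposed_excellent = [11, 10, 5, 5, 17]
baseline_complete = [17, 27, 36, 30, 28]
baseline_excellent = [17, 27, 36, 30, 28]


def percent_earlier(proposed, baseline):
    """Relative reduction in mean emergence generation, in percent."""
    return 100.0 * (mean(baseline) - mean(proposed)) / mean(baseline)


complete_gain = round(percent_earlier(proposed_complete, baseline_complete), 1)
excellent_gain = round(percent_earlier(proposed_excellent, baseline_excellent), 1)
print(complete_gain, excellent_gain)
```

The means are 7.8 versus 27.6 generations for Complete Robots and 9.6 versus 27.6 for Excellent Robots, which gives reductions of 71.7% and 65.2%.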
Table 12. Behavioral Diversity Comparison by Training Phase.
| Training Phase | Proposed (Behavioral Diversity) | Baseline (Behavioral Diversity) | Difference |
|---|---|---|---|
| Early (Gen 1–10) | 0.571 | 0.420 | +36.0% |
| Mid (Gen 11–25) | 0.664 | 0.507 | +31.0% |
| Late (Gen 26–41) | 0.690 | 0.591 | +16.8% |
| Overall | 0.650 | 0.508 | +28.0% |
Table 13. Specialist Distribution in the Proposed System.
| Specialist Type | Mean Count | Standard Deviation (SD) | Range |
|---|---|---|---|
| Pickup | 12.2 | 3.5 | 3–18 |
| Humanoid | 9.6 | 3.8 | 1–18 |
| Charging | 8.7 | 5.9 | 0–21 |
| Clearing | 5.6 | 2.9 | 1–13 |
Table 14. Computation Overhead Analysis.
| Metric | Value |
|---|---|
| LLM API calls per run (count) | 40 |
| BTs generated per run (count) | 39.8 ± 0.4 |
| Mean API response time (s) | 120 |
| Total LLM overhead per run (min) | 80 |
| Total simulation time per run (h) | 10 |
| Overhead percentage (%) | 13.3 |
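The overhead figures in Table 14 are consistent with a simple calculation: 40 calls at a mean of 120 s each give 80 min, and 80 min relative to 10 h of simulation gives 13.3%. The sketch below assumes the percentage is taken relative to simulation time rather than to the combined total; the function name `llm_overhead` is hypothetical.

```python
def llm_overhead(calls: int, mean_response_s: float, simulation_h: float):
    """Return (overhead in minutes, overhead as % of simulation time)."""
    overhead_min = calls * mean_response_s / 60.0
    overhead_pct = 100.0 * overhead_min / (simulation_h * 60.0)
    return overhead_min, overhead_pct


minutes, pct = llm_overhead(calls=40, mean_response_s=120, simulation_h=10)
print(minutes, round(pct, 1))
```

If the percentage were instead computed against the combined 680 min of simulation plus LLM time, the result would be about 11.8%, so the tabulated 13.3% matches the simulation-time reading.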