LLM-TOC: LLM-Driven Theory-of-Mind Adversarial Curriculum for Multi-Agent Generalization
Abstract
1. Introduction
- 1.
- We propose the LLM-TOC framework, a novel bi-level training paradigm that leverages the code generation capabilities of LLMs to construct an infinitely scalable and semantically diverse adversarial curriculum for MARL. Diverging from existing LLM-assisted MARL methods, this framework directly generates executable rule-based policies in a Turing-complete code space, effectively circumventing the computational inefficiency and instability caused by the repeated RL inner loop training in traditional reward-shaping methods.
- 2.
- We introduce the Gradient Saliency Feedback mechanism, which bridges the modality gap between numerical RL signals and LLM semantic reasoning. This mechanism transforms pixel-level value fluctuations into semantically meaningful causal cues, enabling the LLM to perform targeted semantic gradient ascent in the abstract policy space without relying on hand-crafted prompts or manual feature engineering.
- 3.
- We provide a motivating theoretical analysis of the proposed framework within the PAC-Bayes setting. We formally characterize the convergence of LLM-TOC to a robust equilibrium, and present a qualitative analysis showing that our method yields a tighter generalization error bound than conventional parameter-space exploration methods under reasonable preconditions, providing a theoretical rationale for the superior zero-shot robustness of our approach.
2. Related Work
2.1. Zero-Shot Coordination and Generalization in MARL
2.2. Large Language Models for Autonomous Agents
2.3. Automated Curriculum Learning and Environment Design
2.4. Stackelberg Games in MARL
3. Problem Formulation
3.1. Partially Observable Markov Games
3.2. Zero-Shot Generalization as Minimax Regret
3.3. The Challenge of Semantic Coverage
4. Methodology
4.1. The Bi-Level Optimization Framework
4.2. Gradient Saliency Feedback Mechanism
4.2.1. Identifying Critical Moments via Value Surprise
4.2.2. Visual Attribution via Jacobian Saliency
4.2.3. Semantic Mapping
4.3. Policy Generation and Curriculum Evolution
- Prompt Regeneration Rule: The structured prompt is regenerated from scratch at each outer-loop iteration k. The fixed components (the role instruction and the API constraints) are reused across all iterations to keep the generation format and objective consistent, while the core semantic-diagnosis component is rebuilt entirely from the student agent’s latest failure modes in the current iteration, with no cross-iteration reuse of diagnosis content. This ensures that the generated adversarial policies always target the student’s current weaknesses rather than outdated historical vulnerabilities.
- Closed-Loop Translation Pipeline: The semantic feedback derived from gradient saliency is translated into executable policies in three sequential steps: (i) the pixel-level saliency map is mapped to object-level attention scores via Equation (11); (ii) the attention scores are converted into a natural-language causal diagnosis describing the student’s failure mode; (iii) the diagnosis is embedded into the prompt, and the LLM generates Python policy code that directly exploits the identified vulnerability, with no manual intervention at any point. Note that our framework modifies the interactive adversarial partner policies in the multi-agent system, not the physical layout or dynamics of the environment itself.
- Policy Validation and Filtering Mechanism: After generation, we perform a two-step validation on each candidate policy: first, a syntactic validation to ensure the code compiles and executes in the Melting Pot environment; second, a functional validation that measures the policy’s ability to reduce the student agent’s return over 5 evaluation rollouts. Only policies that pass both validations are added to the opponent population, preventing invalid or irrelevant strategies from polluting the curriculum.
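The two-step validation above can be sketched as follows. This is an illustrative sketch, not the paper's implementation: `run_rollout`, `make_policy`, and the acceptance rule (mean return strictly below the baseline over 5 rollouts) are hypothetical stand-ins for the Melting Pot evaluation harness.

```python
def syntactic_check(code_str):
    """Step 1: reject generated code that does not even compile."""
    try:
        compile(code_str, "<generated_policy>", "exec")
        return True
    except SyntaxError:
        return False

def functional_check(policy, run_rollout, baseline_return, n_rollouts=5):
    """Step 2: keep only policies that actually reduce the student's return."""
    returns = [run_rollout(policy) for _ in range(n_rollouts)]
    mean_return = sum(returns) / n_rollouts
    return mean_return < baseline_return  # an adversary must hurt the student

def validate_and_add(code_str, population, make_policy, run_rollout, baseline):
    """Run both checks; only a policy that passes both joins the population."""
    if not syntactic_check(code_str):
        return False
    policy = make_policy(code_str)
    if not functional_check(policy, run_rollout, baseline):
        return False
    population.append(policy)
    return True
```

In this sketch a rejected policy is simply discarded; the paper's mechanism additionally allows one regeneration attempt per iteration (Section 5.4).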
4.4. Algorithm Summary
- 1.
- Initialization: We instantiate the student agent with random weights, and initialize the opponent population with heuristic-based policies to provide initial training signals.
- 2.
- Inner-Loop Student Optimization: For each outer-loop iteration, the student agent optimizes its policy against the current opponent population until convergence, to learn a robust best response to the existing curriculum.
- 3.
- Gradient Saliency Diagnosis: After convergence, we evaluate the student against the current population to identify the worst-case opponent that induces maximum regret. We then extract failure trajectories, compute value surprise to locate critical failure frames, and generate semantic causal descriptions of the student’s vulnerabilities via Jacobian saliency map analysis.
- 4.
- Outer-Loop Curriculum Evolution: The semantic diagnosis is embedded into a structured prompt to guide the LLM to generate a new executable adversarial policy targeting the identified vulnerability. The generated policy is validated for executability and effectiveness, then added to the opponent population to expand the training curriculum.
| Algorithm 1 LLM-TOC: LLM-driven theory-of-mind adversarial curriculum. |
| Input: max outer iterations K; inner-loop steps per iteration; value-surprise threshold; learning rate. |
| Initialize: student policy with random parameters; opponent population with heuristic-based policies. |
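The four steps of Algorithm 1 can be condensed into a single outer loop. The sketch below is a structural outline only; every callable (`train_against`, `expected_return`, `saliency_diagnosis`, `generate_policy`) is a hypothetical stand-in for the corresponding component described in Sections 4.1-4.3.

```python
def llm_toc(student, population, llm, K=10):
    """Sketch of Algorithm 1: alternate inner-loop RL training with
    outer-loop LLM-driven curriculum evolution."""
    for k in range(K):
        # Step 2 (inner loop): best-respond to the current opponent population.
        student.train_against(population)
        # Step 3 (diagnosis): find the worst-case opponent, i.e. the one
        # against which the student's return is lowest, and explain the failure.
        worst = min(population, key=student.expected_return)
        diagnosis = student.saliency_diagnosis(worst)
        # Step 4 (outer loop): the LLM writes a new adversary targeting
        # the diagnosed weakness; None models a failed/filtered generation.
        candidate = llm.generate_policy(diagnosis)
        if candidate is not None:
            population.append(candidate)
    return student
```

Note that the population only grows: earlier adversaries remain in the curriculum, so the student keeps training against the full history of discovered exploits.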
4.5. Motivating Theoretical Analysis
4.5.1. Core Foundational Assumptions
- Practical Implications: This assumption ensures that the regret of the student agent, defined as the gap between the oracle optimal return and the actual achieved return, is always bounded. It eliminates the possibility of infinite positive/negative returns that would break the monotonic improvement property of our iterative bi-level optimization, and is a necessary precondition for the convergence of the algorithm to a stable robust equilibrium.
- Applicable Conditions in Realistic Environments: This assumption holds for almost all standard episodic MARL environments, including the Melting Pot benchmark used in our experiments as well as most realistic open-ended multi-agent interaction scenarios. Specifically, it is satisfied when three conditions are met: (1) the per-step reward of the environment is constrained to a fixed finite range; (2) the maximum length of each interaction episode is finite; and (3) the discount factor satisfies $\gamma < 1$, which ensures the cumulative discounted return is bounded regardless of the interaction horizon.
- Inherent Limitations: This assumption does not hold for non-episodic, infinite-horizon tasks with unbounded per-step rewards. For such extreme scenarios, the convergence guarantee of our framework needs to be re-derived with additional constraints on the reward growth rate, which we leave for future work.
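Under conditions (1)-(3) above, boundedness follows from a one-line geometric-series argument; writing $r_{\max}$ for the per-step reward bound and $T$ for the episode horizon (notation introduced here for the sketch):

```latex
\left| J(\pi_E, \pi_{-E}) \right|
  = \Big| \mathbb{E}\Big[ \sum_{t=0}^{T} \gamma^{t} r_t \Big] \Big|
  \le \sum_{t=0}^{T} \gamma^{t}\, r_{\max}
  = \frac{1 - \gamma^{T+1}}{1 - \gamma}\, r_{\max}
  \le \frac{r_{\max}}{1 - \gamma}.
```

The final bound is independent of $T$, which is why the assumption extends from fixed-length episodes to any finite-horizon interaction.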
- Practical Implications: This assumption requires that the LLM can consistently generate adversarial strategies that approximate the worst-case attack on the student agent, with a bounded approximation error. It is the core premise that ensures our outer-loop curriculum generation can effectively tighten the constraint set of the inner-loop optimization, thereby driving the algorithm to converge to a robust equilibrium rather than a narrow, brittle Nash equilibrium subset.
- Applicable Conditions in Realistic Environments: This assumption holds when the adopted LLM has sufficient capabilities in three critical dimensions: (1) strong logical reasoning ability to understand the causal diagnosis of the student agent’s failure modes derived from gradient saliency analysis; (2) reliable code generation ability to translate the adversarial strategy into syntactically valid, executable Python code that is fully compatible with the environment interface; (3) sufficient domain knowledge of multi-agent game theory to design strategically effective adversarial behaviors that can exploit the student’s vulnerabilities. Our empirical results in Section 5.4 verify that this assumption is well satisfied in our implementation, with a 92.5% generation success rate and stable policy quality across 4 independent random seeds.
- Inherent Limitations: The approximation error is directly determined by the capability of the adopted LLM. Smaller open-source LLMs with limited reasoning and coding capabilities may fail to generate valid targeted adversarial strategies, leading to a large approximation error that breaks the convergence guarantee. In addition, for extremely complex multi-agent environments with obscure failure modes that are hard to describe in natural language, the LLM may also fail to approximate the worst-case strategy effectively. Our two-step policy validation and filtering mechanism (described in Section 4.3) automatically excludes invalid generations, ensuring that the convergence of the algorithm is not affected by occasional generation failures.
4.5.2. Convergence Rate Analysis
- 1.
- Convergence Rate: LLM-TOC converges to a robust equilibrium as the number of outer-loop iterations K grows, at a rate matching the optimal rate of classic fictitious-play algorithms in zero-sum games. This shows that our semantic-space curriculum evolution carries the same rigorous convergence guarantee as traditional gradient-based optimization methods, despite operating in a discrete, non-differentiable code space.
- 2.
- Equilibrium Robustness: The algorithm converges to a neighborhood of a Nash equilibrium whose radius is the approximation error of the semantic oracle, so the student agent’s worst-case regret is bounded by that error. This means the student cannot be further exploited by any unseen strategy in the semantic space beyond this bounded error, which directly underpins the zero-shot generalization robustness of the learned policy.
4.5.3. Generalization Bound Analysis via PAC-Bayes Framework
- 1.
- Traditional Parameter-Space Exploration: For conventional MARL methods, the PAC-Bayes prior is an isotropic Gaussian distribution over the high-dimensional neural-network weights. The set of weight configurations that exactly represent a specific semantic logical strategy is a measure-zero subset of the continuous weight space. The prior therefore assigns almost zero probability to the valid semantic strategies that constitute the real-world test distribution, resulting in an extremely large KL divergence when fitting complex social behaviors.
- 2.
- LLM-TOC Semantic-Space Exploration: In our framework, the prior is induced by the LLM and its pre-trained knowledge of code, logical reasoning, and human social behaviors. This prior assigns significantly higher probability to syntactically valid, logically consistent policy code that aligns with the distribution of real-world human-like social strategies. As a result, the KL divergence is strictly smaller than that of parameter-space exploration: $D_{\mathrm{KL}}(Q \,\|\, P_{\mathrm{LLM}}) < D_{\mathrm{KL}}(Q \,\|\, P_{\theta})$.
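As a reference point for the comparison above, recall the standard McAllester-style PAC-Bayes bound, stated here in its generic form (not a result specific to our setting): $L$ is the expected loss, $\hat{L}_n$ the empirical loss over $n$ samples, $Q$ the posterior over learned policies, $P$ the prior, and $\delta$ the confidence parameter:

```latex
\mathbb{E}_{h \sim Q}\big[L(h)\big]
  \;\le\; \mathbb{E}_{h \sim Q}\big[\hat{L}_n(h)\big]
  + \sqrt{\frac{D_{\mathrm{KL}}(Q \,\|\, P) + \ln \frac{2\sqrt{n}}{\delta}}{2n}}
```

Only the KL term depends on the prior, so a prior that concentrates mass on semantically valid code directly tightens the generalization bound relative to an isotropic prior over weights.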
4.5.4. Computational Complexity Analysis
- 1.
- Inner-Loop MARL Training Complexity: The core of the inner loop is the PPO update for the student agent; its per-iteration time complexity is identical to the standard MAPPO/IPPO baselines, as we use the same network architecture and PPO update rule. Critically, the targeted adversarial curriculum generated by our framework significantly reduces the total number of outer-loop iterations K and inner-loop steps required to reach the target performance, yielding a total training-step reduction of over 60% compared to mainstream baselines, as verified in our experiments.
- 2.
- Gradient Saliency Calculation Complexity: The saliency-map computation involves a single backpropagation of the value function per critical frame. Since we perform only 5 evaluation rollouts per iteration and the observation tensors are small in our implementation, this overhead is negligible compared to the inner-loop RL training cost.
- 3.
- LLM Query Complexity: We perform exactly one LLM query per outer-loop iteration. Unlike online LLM-based methods that require LLM inference at every environment step, our framework invokes the LLM only offline during training. In our experiments, we set K = 10 total outer-loop iterations, resulting in only 10 LLM queries for the entire training process. The cumulative LLM overhead is less than 0.5% of the total training computational cost, which is negligible.
5. Experiments
5.1. Experimental Setup
5.1.1. Evaluation Benchmark: Melting Pot
- 1.
- collaborative_cooking_asymmetric: A coordination task requiring role specialization and synchronized action sequences to complete recipes.
- 2.
- prisoners_dilemma_in_the_matrix_repeated: A classic social dilemma testing the agent’s ability to maintain cooperation against defection risks over repeated interactions.
- 3.
- running_with_scissors_in_the_matrix_arena: A spatially complex, cyclical resource competition game (Rock–Paper–Scissors dynamics) where agents must identify and counter opponent strategies.
- 4.
- running_with_scissors_in_the_matrix_repeated: A repeated version of the arena task, emphasizing long-term memory and reciprocity.
5.1.2. Baselines
5.1.3. Implementation Details
5.1.4. Evaluation Metrics
- 1.
- Expected Cumulative Collective Return: This is the primary metric for evaluating the zero-shot generalization performance of agents, and the standard evaluation indicator for the Melting Pot benchmark. It is defined as the expected discounted cumulative return of the student agent in the multi-agent test scenario:
$$J(\pi_E, \pi_{-E}) = \mathbb{E}_{\tau \sim (\pi_E, \pi_{-E})}\Big[\sum_{t=0}^{T} \gamma^{t}\, r_E(s_t, a_t)\Big],$$
where $\pi_E$ is the policy of the student agent, $\pi_{-E}$ is the policy of the unseen OOD test partners, $\gamma$ is the discount factor, and $r_E$ is the environment-defined reward function of the student agent. This metric directly reflects the overall task performance of the agent in the target multi-agent scenario, and is consistent with the evaluation protocol of mainstream MARL works on the Melting Pot benchmark. For each test scenario, we compare the performance of LLM-TOC with all baselines under the exact same environment and reward settings to ensure fairness. All results are averaged over 4 random seeds, with standard deviation reported as the uncertainty interval in the learning curves.
- 2.
- Relative Performance to Oracle PPO: This metric eliminates the impact of inconsistent reward scales across the different Melting Pot substrates, enabling fair cross-scenario performance comparison. It is defined as the ratio of the agent’s expected cumulative collective return to that of the Oracle PPO agent in the same test scenario:
$$\mathrm{RP} = \frac{J_{\text{agent}}}{J_{\text{Oracle}}},$$
where $J_{\text{agent}}$ is the expected cumulative collective return of the evaluated agent, and $J_{\text{Oracle}}$ is the expected cumulative collective return of the Oracle PPO agent (trained directly on the test scenario, serving as the performance upper bound). The reward scales of the four selected Melting Pot substrates differ significantly, making direct comparison of raw collective returns across scenarios impossible.
- 3.
- Training Steps to Target Performance: This metric quantifies the training efficiency and computational cost of the algorithm, defined as the minimum number of environment-interaction steps required for the agent to reach a predefined target relative-performance threshold in the held-out test scenarios. It directly reflects the sample efficiency and convergence speed of the algorithm, and is the core quantitative indicator behind our claim of a training-cost reduction of more than 60%. It is also a standard efficiency metric widely used in MARL curriculum-learning works.
- 4.
- Relative Performance Drop in Ablation Study: This metric evaluates the independent contribution of each core component in LLM-TOC, defined as the percentage performance decline of an ablation variant relative to the full LLM-TOC framework:
$$\Delta = \frac{\mathrm{RP}_{\text{full}} - \mathrm{RP}_{\text{ablation}}}{\mathrm{RP}_{\text{full}}} \times 100\%,$$
where $\mathrm{RP}_{\text{full}}$ is the relative performance to Oracle PPO of the full LLM-TOC framework, and $\mathrm{RP}_{\text{ablation}}$ is that of the corresponding ablation variant. This metric quantifies the impact of removing each core module on the final zero-shot performance.
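The two derived metrics above reduce to simple ratios; a minimal sketch (function names are ours, not the paper's):

```python
def relative_performance(agent_return, oracle_return):
    """Relative Performance to Oracle PPO: J_agent / J_Oracle."""
    return agent_return / oracle_return

def relative_performance_drop(rp_full, rp_ablation):
    """Percentage decline of an ablation variant vs. the full framework:
    (RP_full - RP_ablation) / RP_full * 100."""
    return (rp_full - rp_ablation) / rp_full * 100.0
```

For example, an agent earning 8.0 against an Oracle PPO return of 10.0 has a relative performance of 0.8; an ablation that drops to 0.6 corresponds to a 25% relative performance drop.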
5.2. Results and Analysis
5.3. Ablation Study
5.4. Supplementary Empirical Analysis of the Semantic Oracle Assumption
- 1.
- Compilation Failure Rate: The average syntactic compilation failure rate of the generated code is 4.2% ± 1.8% across all iterations. All syntactically invalid code is automatically filtered out by our validation mechanism and is never added to the training population.
- 2.
- Generation Attempts per Iteration: We perform at most two generation attempts per outer-loop iteration. In 95.8% of iterations, the first generation attempt produces syntactically valid code that passes compilation; only 4.2% of iterations require a second generation attempt, with no iterations requiring more than two attempts.
- 3.
- Proportion of Non-Trivial Valid Policies: Among all syntactically valid policies, 88.3% ± 2.5% are non-trivial adversarial policies that can significantly reduce the student agent’s return (functional validation passed), and are added to the training population. The remaining 11.7% of valid but ineffective policies are filtered out by our functional validation step.
6. Conclusions, Limitations and Future Work
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
| AGI | Artificial General Intelligence |
| OOD | out-of-distribution |
| MAS | multi-agent systems |
| MARL | Multi-Agent Reinforcement Learning |
| SP | self-play |
| PPO | Proximal Policy Optimization |
| GAE | Generalized Advantage Estimation |
| LLM | Large Language Model |
| ToM | Theory of Mind |
| ZSC | Zero-Shot Coordination |
| IPPO | Independent Proximal Policy Optimization |
| MAPPO | Multi-Agent Proximal Policy Optimization |
| QMIX | Monotonic Value Decomposition for Multi-Agent Reinforcement Learning |
| VDN | Value Decomposition Networks |
| OP | Other-Play |
| FCP | Fictitious Co-Play |
| TrajeDi | Trajectory Diversity |
| LIPO | Latent Space Optimization |
| GITM | Ghost in the Minecraft |
| ACL | Automatic Curriculum Learning |
| UED | Unsupervised Environment Design |
| QD | Quality-Diversity |
| PLR | Prioritized Level Replay |
| OEL | Open-Ended Learning |
| POMG | Partially Observable Markov Game |
Appendix A. List of Symbols
| Symbol | Definition |
|---|---|
| Partially Observable Markov Game (POMG) | |
| The tuple defining the POMG | |
| Global state space | |
| N | Set of agents, partitioned into student (E) and others () |
| Joint action space () | |
| Reward function for the student agent | |
| Set of partial observations for agent i | |
| Observation function | |
| Discount factor | |
| Trajectory of states and actions | |
| Policies and Strategies | |
| Policy of the student agent (Follower) | |
| Parameters of the student policy (e.g., neural network weights) | |
| Joint policy of the other agents (Teammates/Opponents) | |
| Neural network parameter space | |
| Semantic strategy space (discrete, code-based) | |
| Expected return of student against | |
| Regret of student against | |
| Oracle performance (maximal achievable return) against | |
| Relative Performance to Oracle PPO | |
| Relative Performance Drop | |
| LLM-TOC Framework | |
| Semantic Oracle (Large Language Model) | |
| Population of opponent strategies at iteration k | |
| Value surprise (absolute TD error) at timestep t | |
| Set of critical frames where the value surprise exceeds the threshold | |
| Value function of the student agent | |
| Input observation tensor (pixels/features) at timestep t | |
| Gradient map of the value function w.r.t. input | |
| Aggregated 2D saliency map | |
| Binary mask for a specific object class | |
| Attention score for a specific object class | |
| Semantic diagnosis derived from gradient saliency | |
| Structured prompt input to the LLM | |
| Executable policy code generated by the LLM | |
| Module | Hyperparameter | Value |
|---|---|---|
| Observation Processing | Grid size | 11 × 11 |
| Frame stack | 4 | |
| Number of object channels | 12 | |
| Network Architecture | CNN layers | 2 (32, 64 channels) |
| CNN kernel size | 3 × 3, stride 1, padding 1 | |
| LSTM hidden dimension | 256 | |
| MLP hidden dimension | 64 | |
| Action space dimension | 8 | |
| PPO Training | Learning rate | (linear decay to ) |
| Clip parameter | 0.2 | |
| Discount factor | 0.99 | |
| GAE lambda | 0.95 | |
| Batch size | 1024 timesteps | |
| PPO epochs per batch | 4 | |
| Mini-batch size | 256 | |
| Entropy coefficient | 0.01 | |
| Gradient clipping norm | 0.5 | |
| Training Process | Number of outer-loop iterations K | 10 |
| Inner-loop steps per iteration | ||
| Total maximum training steps | ||
| Evaluation rollouts per policy | 5 | |
| Maximum episode length | 1000 timesteps | |
| Gradient saliency threshold | 90th percentile of value surprise | |
| LLM Configuration | Model version | qwen-plus-2025-12-01 |
| Sampling temperature | 0.7 | |
| top_p | 0.9 | |
| Max generation tokens | 1024 | |
| Repetition penalty | 1.05 | |
| Max generation attempts per iteration | 2 | |
| Reproducibility | Random seeds | 1, 42, 100, 2026 |
Appendix B. Environment Details and Evaluation Protocols

Appendix B.1. Collaborative Cooking: Asymmetric
- Scenario (collaborative_cooking__asymmetric_0): Bot Agents: In the test scenario _0, the focal agent is paired with a specialized partner bot that rigidly adheres to a specific sub-policy (e.g., only fetching tomatoes or only delivering).
Appendix B.2. Prisoner’s Dilemma in the Matrix: Repeated
- Scenario (prisoners_dilemma_in_the_matrix__repeated_0): Bot Agents: The _0 scenario populates the background with agents executing classic game-theoretic strategies that were likely not present in the self-play training distribution, such as Grim Trigger (cooperate until the partner defects once, then defect forever) or Tit-for-Tat.
Appendix B.3. Running with Scissors in the Matrix: Arena
- Scenario (running_with_scissors_in_the_matrix__arena_0): Bot Agents: The background population in _0 consists of bots with fixed policy biases or specific “pure strategies” (e.g., bots that exclusively collect Rock).
Appendix B.4. Running with Scissors in the Matrix: Repeated
- Scenario (running_with_scissors_in_the_matrix__repeated_0): Bot Agents: The test partner in _0 typically follows a sequence-based strategy or a sophisticated exploitation policy that changes based on the focal agent’s history.
Appendix C. Prompt Engineering for Semantic Oracle
Appendix C.1. Role Instruction
- Persona Definition: The LLM is conditioned to act as an “Expert Game Designer” and “Adversarial Strategist.”
- Objective Specification: The instruction explicitly states that the goal is to minimize the expected return of the focal student agent (regret maximization). It emphasizes finding “corner cases” or “blind spots” in the student’s current behavior.
- Chain-of-Thought (CoT) Trigger: We include instructions such as “Think step-by-step” to encourage the model to first analyze the semantic diagnosis and then derive the logical counter-strategy before writing code.
Appendix C.2. API Constraints and Environment Interface
- Code Skeleton: A predefined Python class structure (e.g., class OpponentPolicy(Policy):) is provided, requiring the LLM to implement specific methods like step(self, observation).
- Action Space Definition: A precise mapping of integer action IDs to their semantic meanings (e.g., 0: NOOP, 1: MOVE_FORWARD, 5: TURN_LEFT, 7: ZAP_BEAM) is listed to ensure valid outputs.
- Observation Space Specification: The prompt clarifies that the input observation is not raw pixels but a processed semantic feature tensor, allowing the LLM to write logic based on object presence (e.g., if ’apple’ in view).
- Library Restrictions: Explicit constraints are added to prevent the use of undefined external libraries, ensuring the code runs in the sandboxed environment.
Appendix C.3. Semantic Diagnosis via Gradient Saliency
- Critical Moment Description: Based on the value surprise signal, the system selects specific frames where the agent failed (e.g., “At step 450, a sudden drop in value occurred”).
- Visual Attention Summary: Using the computed saliency map, the prompt lists objects with high attention scores (what the agent focused on) and relevant objects with low attention scores (what the agent ignored).
- Causal Attribution: The numerical attention scores are converted into a natural language hypothesis. For example, if attention on “Opponent” is high but “Goal” is low, the diagnosis might state: “The agent was distracted by the opponent and neglected the objective.”
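The three steps above can be sketched as a small templating function. This is an illustrative sketch: the function name, the high/low attention thresholds (0.5 and 0.2), and the exact wording are our assumptions, not the paper's implementation.

```python
def semantic_diagnosis(attention, relevant_objects, step, hi=0.5, lo=0.2):
    """Turn object-level attention scores into a natural-language diagnosis.
    `attention` maps object names to scores in [0, 1]; `relevant_objects`
    lists objects the agent *should* attend to; `hi`/`lo` are illustrative
    thresholds for "focused on" vs. "ignored"."""
    focused = [o for o, s in attention.items() if s >= hi]
    ignored = [o for o in relevant_objects if attention.get(o, 0.0) <= lo]
    lines = [f"Context: the student agent failed at step T={step}."]
    if focused:
        lines.append("High attention: " + ", ".join(sorted(focused)) + ".")
    if ignored:
        lines.append("Ignored relevant objects: " + ", ".join(sorted(ignored)) + ".")
    if focused and ignored:
        # Causal attribution: hypothesize a distraction failure mode.
        lines.append(
            f"Hypothesis: the agent was distracted by {sorted(focused)[0]} "
            f"and neglected {sorted(ignored)[0]}."
        )
    return " ".join(lines)
```

For the worked example in the text, `semantic_diagnosis({"Opponent": 0.85, "Goal": 0.10}, ["Goal"], 142)` produces a report naming the Opponent as the distractor and the Goal as the neglected objective.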
Appendix C.4. Full Prompt Template Example
| Algorithm A1 Prompt template for adversarial strategy generation. |
# --- [PART 1: ROLE INSTRUCTION] ---
You are an expert Multi-Agent Game Designer. Your goal is to design a Python policy for an opponent agent that exploits the weaknesses of the current “Student Agent”. The Student Agent is playing the game “Running with Scissors”. Your strategy must be competitive and aim to minimize the Student’s score.
Think step-by-step:
1. Analyze the “Diagnosis Report” to understand the Student’s vulnerability.
2. Devise a logic-based strategy to exploit this specific weakness.
3. Implement the strategy in Python.

# --- [PART 2: API CONSTRAINTS] ---
You must implement the following class structure:

    class AdversarialPolicy(object):
        def __init__(self):
            self.memory = {}  # Use for state tracking

        def step(self, observation):
            """
            Input: observation (11x11 grid of object IDs).
            Returns: action (int).

            Action Space:
            0: NOOP, 1: FORWARD, 2: RIGHT, 3: BACKWARD,
            4: LEFT, 5: TURN_L, 6: TURN_R, 7: FIRE_ZAP (Interact)

            Object IDs in observation:
            1: Wall, 2: Apple (Goal), 3: Agent (Student)
            """
            # YOUR CODE HERE
            return action

Constraint: Do not import external libraries like numpy or torch. Use standard Python logic.

# --- [PART 3: SEMANTIC DIAGNOSIS (Dynamic)] ---
DIAGNOSIS REPORT (Derived from Gradient Saliency):
--------------------------------------------------
> Context: The Student Agent failed at step T=142.
> Visual Attention Analysis:
- High Attention (Score 0.85): “Red Opponent” (located at relative pos [-2, 0])
- Low Attention (Score 0.10): “Green Resource” (located at relative pos [0, 3])
> Causal Inference: The agent is highly reactive to the opponent’s presence (“Distracted”) and fails to collect resources when threatened.
--------------------------------------------------
INSTRUCTION: Write a policy that exploits this “Distraction” weakness. For example, create an agent that feints an attack to freeze the student, then steals the resource.
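A policy the oracle might return for this prompt could look like the following. The feint logic is purely illustrative (our own sketch, not an actual model output); it follows the class skeleton, action IDs, and object IDs defined in Part 2 of the template.

```python
class AdversarialPolicy(object):
    """Illustrative 'feint' adversary for the template above: approach/zap
    once to freeze a distraction-prone student, then head for the resource.
    IDs as in the prompt: actions 0 NOOP, 1 FORWARD, 7 FIRE_ZAP;
    object IDs 2 Apple (Goal), 3 Agent (Student)."""

    def __init__(self):
        self.memory = {"feinted": False}  # state tracking across steps

    def _find(self, observation, object_id):
        """Scan the 11x11 grid for the first cell holding `object_id`."""
        for r, row in enumerate(observation):
            for c, cell in enumerate(row):
                if cell == object_id:
                    return (r, c)
        return None

    def step(self, observation):
        student = self._find(observation, 3)
        apple = self._find(observation, 2)
        if student is not None and not self.memory["feinted"]:
            self.memory["feinted"] = True
            return 7  # feint: fire the zap beam to freeze the student
        if apple is not None:
            return 1  # then move toward the resource
        return 0      # nothing visible: wait
```

A real generated policy would additionally need pathfinding toward the resource; this sketch keeps only the feint-then-steal structure the diagnosis calls for.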
Appendix D. Theoretical Analysis and Proofs for LLM-TOC
Appendix D.1. Problem Definition and Notation
- $E$: The student agent, with policy $\pi_E$ parameterized by neural network weights $\theta$.
- $-E$: The other agents (opponents or teammates), with joint policy $\pi_{-E}$ drawn from the semantic strategy space $\Pi_{\mathrm{sem}}$, comprising all logically expressible behaviors, such as Turing-complete code.
- $J(\pi_E, \pi_{-E})$: The expected discounted return for the student agent: $J(\pi_E, \pi_{-E}) = \mathbb{E}_{\tau \sim (\pi_E, \pi_{-E})}\big[\sum_{t=0}^{T} \gamma^{t} r_E(s_t, a_t)\big]$.
Appendix D.2. Detailed Convergence Analysis of Bi-Level Optimization
Appendix D.3. Detailed Derivation of Generalization Bounds
- Case 1: Traditional RL ($P_{\theta}$). The prior $P_{\theta}$ is a continuous distribution over neural weights $\theta$. The probability of a neural network exactly representing a crisp logical rule, such as a specific Python if-else block, without noise is negligible: $P_{\theta}(\pi_{\theta} \equiv \pi_{\text{rule}}) \approx 0$.
- Case 2: LLM Generation ($P_{\mathrm{LLM}}$). The prior $P_{\mathrm{LLM}}$ is a distribution over tokens or code, and it assigns high probability to valid logical structures: $P_{\mathrm{LLM}}(\pi_{\text{rule}}) \gg 0$.
Boundary Analysis of KL Divergence Terms
Appendix D.4. Efficiency Derivation for Gradient Saliency
- Likelihood: If a policy B is a valid attacker (e.g., a “Backstabber”), the probability that it aligns with the prompt “Attack from behind” is high due to causal consistency.
- Evidence: The probability that arbitrary random code matches the specific prompt “Attack from behind” is very low, as most random programs exhibit unrelated behaviors.
References
- Wang, L.; Ma, C.; Feng, X.; Zhang, Z.; Yang, H.; Zhang, J.; Chen, Z.; Tang, J.; Chen, X.; Lin, Y.; et al. A survey on large language model based autonomous agents. Front. Comput. Sci. 2024, 18, 186345.
- Xi, Z.; Chen, W.; Guo, X.; He, W.; Ding, Y.; Hong, B.; Zhang, M.; Wang, J.; Jin, S.; Zhou, E.; et al. The rise and potential of large language model based agents: A survey. arXiv 2025, arXiv:2309.07864.
- Vinyals, O.; Babuschkin, I.; Czarnecki, W.M.; Mathieu, M.; Dudzik, A.; Chung, J.; Choi, D.H.; Powell, R.; Ewalds, T.; Georgiev, P.; et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 2019, 575, 350–354.
- Berner, C.; Brockman, G.; Chan, B.; Cheung, V.; Dębiak, P.; Dennison, C.; Farhi, D.; Fischer, Q.; Hashme, S.; Hesse, C.; et al. Dota 2 with large scale deep reinforcement learning. arXiv 2019, arXiv:1912.06680.
- Lanctot, M.; Zambaldi, V.; Gruslys, A.; Lazaridou, A.; Tuyls, K.; Pérolat, J.; Silver, D.; Graepel, T. A unified game-theoretic approach to multiagent reinforcement learning. In Advances in Neural Information Processing Systems 30; Curran Associates, Inc.: Red Hook, NY, USA, 2017.
- Agapiou, J.P.; Vezhnevets, A.S.; Duéñez-Guzmán, E.A.; Matyas, J.; Mao, Y.; Sunehag, P.; Köster, R.; Madhushani, U.; Kopparapu, K.; Comanescu, R.; et al. Melting Pot 2.0. arXiv 2022, arXiv:2211.13746.
- Leibo, J.Z.; Dueñez-Guzman, E.A.; Vezhnevets, A.; Agapiou, J.P.; Sunehag, P.; Koster, R.; Matyas, J.; Beattie, C.; Mordatch, I.; Graepel, T. Scalable evaluation of multi-agent reinforcement learning with Melting Pot. In Proceedings of the International Conference on Machine Learning; PMLR, 2021; pp. 6187–6199.
- Gorsane, R.; Mahjoub, O.; de Kock, R.J.; Dubb, R.; Singh, S.; Pretorius, A. Towards a standardised performance evaluation protocol for cooperative MARL. In Advances in Neural Information Processing Systems 35; Curran Associates, Inc.: Red Hook, NY, USA, 2022; pp. 5510–5521.
- Hu, H.; Lerer, A.; Peysakhovich, A.; Foerster, J. “Other-play” for zero-shot coordination. In Proceedings of the International Conference on Machine Learning; PMLR, 2020; pp. 4399–4410.
- Strouse, D.; McKee, K.; Botvinick, M.; Hughes, E.; Everett, R. Collaborating with humans without human data. In Advances in Neural Information Processing Systems 34; Curran Associates, Inc.: Red Hook, NY, USA, 2021; pp. 14502–14515.
- Jaderberg, M.; Czarnecki, W.M.; Dunning, I.; Marris, L.; Lever, G.; Castaneda, A.G.; Beattie, C.; Rabinowitz, N.C.; Morcos, A.S.; Ruderman, A.; et al. Human-level performance in 3D multiplayer games with population-based reinforcement learning. Science 2019, 364, 859–865.
- Tekin, S.F.; Ilhan, F.; Huang, T.; Hu, S.; Yahn, Z.; Liu, L. Multi-Agent Reinforcement Learning with Focal Diversity Optimization. arXiv 2025, arXiv:2502.04492.
- Jaech, A.; Kalai, A.; Lerer, A.; Richardson, A.; El-Kishky, A.; Low, A.; Helyar, A.; Madry, A.; Beutel, A.; Carney, A.; et al. OpenAI o1 system card. arXiv 2024, arXiv:2412.16720.
- Team, G.; Anil, R.; Borgeaud, S.; Alayrac, J.B.; Yu, J.; Soricut, R.; Schalkwyk, J.; Dai, A.M.; Hauth, A.; Millican, K.; et al. Gemini: A family of highly capable multimodal models. arXiv 2023, arXiv:2312.11805.
- Cross, L.; Xiang, V.; Bhatia, A.; Yamins, D.L.; Haber, N. Hypothetical minds: Scaffolding theory of mind for multi-agent tasks with large language models. arXiv 2024, arXiv:2407.07086.
- Zhang, C.; Yang, K.; Hu, S.; Wang, Z.; Li, G.; Sun, Y.; Zhang, C.; Zhang, Z.; Liu, A.; Zhu, S.C.; et al. ProAgent: Building proactive cooperative agents with large language models. Proc. AAAI Conf. Artif. Intell. 2024, 38, 17591–17599.
- Li, L.; Yuan, L.; Liu, P.; Jiang, T.; Yu, Y. LLM-Assisted Semantically Diverse Teammate Generation for Efficient Multi-agent Coordination. In Proceedings of the Forty-Second International Conference on Machine Learning, Vancouver, BC, Canada, 13–19 July 2025.
- Xie, T.; Zhao, S.; Wu, C.H.; Liu, Y.; Luo, Q.; Zhong, V.; Yang, Y.; Yu, T. Text2Reward: Automated dense reward function generation for reinforcement learning. In Proceedings of the International Conference on Learning Representations (ICLR), Vienna, Austria, 7–11 May 2024.
- Treutlein, J.; Dennis, M.; Oesterheld, C.; Foerster, J. A new formalism, method and open issues for zero-shot coordination. In Proceedings of the International Conference on Machine Learning; PMLR, 2021; pp. 10413–10423.
- De Witt, C.S.; Gupta, T.; Makoviichuk, D.; Makoviychuk, V.; Torr, P.H.; Sun, M.; Whiteson, S. Is independent learning all you need in the StarCraft multi-agent challenge? arXiv 2020, arXiv:2011.09533.
- Yu, C.; Velu, A.; Vinitsky, E.; Gao, J.; Wang, Y.; Bayen, A.; Wu, Y. The surprising effectiveness of PPO in cooperative multi-agent games. In Advances in Neural Information Processing Systems 35; Curran Associates, Inc.: Red Hook, NY, USA, 2022; pp. 24611–24624.
- Rashid, T.; Samvelyan, M.; De Witt, C.S.; Farquhar, G.; Foerster, J.; Whiteson, S. Monotonic value function factorisation for deep multi-agent reinforcement learning. J. Mach. Learn. Res. 2020, 21, 1–51.
- Hu, H.; Foerster, J.N. Simplified action decoder for deep multi-agent reinforcement learning. arXiv 2019, arXiv:1912.02288.
- Lupu, A.; Cui, B.; Hu, H.; Foerster, J. Trajectory diversity for zero-shot coordination. In Proceedings of the International Conference on Machine Learning; PMLR, 2021; pp. 7204–7213.
- Mahajan, A.; Samvelyan, M.; Gupta, T.; Ellis, B.; Sun, M.; Rocktäschel, T.; Whiteson, S. Generalization in cooperative multi-agent systems. arXiv 2022, arXiv:2202.00104.
- Yuan, S.; Song, K.; Chen, J.; Tan, X.; Li, D.; Yang, D. Evoagent: Towards automatic multi-agent generation via evolutionary algorithms. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers); Association for Computational Linguistics: Albuquerque, NM, USA, 2025; pp. 6192–6217. [Google Scholar]
- Bettini, M.; Kortvelesy, R.; Prorok, A. Controlling behavioral diversity in multi-agent reinforcement learning. arXiv 2024, arXiv:2405.15054. [Google Scholar] [CrossRef]
- Carroll, M.; Shah, R.; Ho, M.K.; Griffiths, T.; Seshia, S.; Abbeel, P.; Dragan, A. On the utility of learning about humans for human-ai coordination. In Advances in Neural Information Processing Systems 32; Curran Associates, Inc.: Red Hook, NY, USA, 2019. [Google Scholar]
- Wang, G.; Xie, Y.; Jiang, Y.; Mandlekar, A.; Xiao, C.; Zhu, Y.; Fan, L.; Anandkumar, A. Voyager: An open-ended embodied agent with large language models. arXiv 2023, arXiv:2305.16291. [Google Scholar] [CrossRef]
- Zhu, X.; Chen, Y.; Tian, H.; Tao, C.; Su, W.; Yang, C.; Huang, G.; Li, B.; Lu, L.; Wang, X.; et al. Ghost in the minecraft: Generally capable agents for open-world environments via large language models with text-based knowledge and memory. arXiv 2023, arXiv:2305.17144. [Google Scholar]
- Hong, S.; Zhuge, M.; Chen, J.; Zheng, X.; Cheng, Y.; Wang, J.; Zhang, C.; Wang, Z.; Yau, S.K.S.; Lin, Z.; et al. MetaGPT: Meta programming for a multi-agent collaborative framework. In Proceedings of the Twelfth International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024. [Google Scholar]
- Chen, W.; Su, Y.; Zuo, J.; Yang, C.; Yuan, C.; Chan, C.M.; Yu, H.; Lu, Y.; Hung, Y.H.; Qian, C.; et al. AgentVerse: Facilitating Multi-Agent Collaboration and Exploring Emergent Behaviors. In Proceedings of the ICLR, Vienna, Austria, 7–11 May 2024. [Google Scholar]
- Ma, Y.J.; Liang, W.; Wang, G.; Huang, D.A.; Bastani, O.; Jayaraman, D.; Zhu, Y.; Fan, L.; Anandkumar, A. Eureka: Human-level reward design via coding large language models. arXiv 2023, arXiv:2310.12931. [Google Scholar]
- Team, G.; Georgiev, P.; Lei, V.I.; Burnell, R.; Bai, L.; Gulati, A.; Tanzer, G.; Vincent, D.; Pan, Z.; Wang, S.; et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv 2024, arXiv:2403.05530. [Google Scholar] [CrossRef]
- Portelas, R.; Colas, C.; Weng, L.; Hofmann, K.; Oudeyer, P.Y. Automatic curriculum learning for deep rl: A short survey. arXiv 2020, arXiv:2003.04664. [Google Scholar] [CrossRef]
- Jiang, M.; Grefenstette, E.; Rocktäschel, T. Prioritized level replay. In Proceedings of the International Conference on Machine Learning. PMLR; ML Research Press: Cambridge, MA, USA, 2021; pp. 4940–4950. [Google Scholar]
- Dennis, M.; Jaques, N.; Vinitsky, E.; Bayen, A.; Russell, S.; Critch, A.; Levine, S. Emergent complexity and zero-shot transfer via unsupervised environment design. In Advances in Neural Information Processing Systems 33; Curran Associates, Inc.: Red Hook, NY, USA, 2020; pp. 13049–13061. [Google Scholar]
- Parker-Holder, J.; Jiang, M.; Dennis, M.; Samvelyan, M.; Foerster, J.; Grefenstette, E.; Rocktäschel, T. Evolving curricula with regret-based environment design. In Proceedings of the International Conference on Machine Learning. PMLR; ML Research Press: Cambridge, MA, USA, 2022; pp. 17473–17498. [Google Scholar]
- Mouret, J.B.; Clune, J. Illuminating search spaces by mapping elites. arXiv 2015, arXiv:1504.04909. [Google Scholar] [CrossRef]
- Jiang, M.; Dennis, M.; Parker-Holder, J.; Foerster, J.; Grefenstette, E.; Rocktäschel, T. Replay-guided adversarial environment design. In Advances in Neural Information Processing Systems 34; Curran Associates, Inc.: Red Hook, NY, USA, 2021; pp. 1884–1897. [Google Scholar]
- Team, O.E.L.; Stooke, A.; Mahajan, A.; Barros, C.; Deck, C.; Bauer, J.; Sygnowski, J.; Trebacz, M.; Jaderberg, M.; Mathieu, M.; et al. Open-ended learning leads to generally capable agents. arXiv 2021, arXiv:2107.12808. [Google Scholar] [CrossRef]
- Fan, L.; Wang, G.; Jiang, Y.; Mandlekar, A.; Yang, Y.; Zhu, H.; Tang, A.; Huang, D.A.; Zhu, Y.; Anandkumar, A. Minedojo: Building open-ended embodied agents with internet-scale knowledge. In Advances in Neural Information Processing Systems 35; Curran Associates, Inc.: Red Hook, NY, USA, 2022; pp. 18343–18362. [Google Scholar]
- Simonyan, K.; Vedaldi, A.; Zisserman, A. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv 2013, arXiv:1312.6034. [Google Scholar]
- Atrey, A.; Clary, K.; Jensen, D. Exploratory not explanatory: Counterfactual analysis of saliency maps for deep reinforcement learning. arXiv 2019, arXiv:1912.05743. [Google Scholar]
- Narvekar, S.; Peng, B.; Leonetti, M.; Sinapov, J.; Taylor, M.E.; Stone, P. Curriculum learning for reinforcement learning domains: A framework and survey. J. Mach. Learn. Res. 2020, 21, 1–50. [Google Scholar]
- Li, C.; Chen, Y.H.; Zhao, H.; Sun, H. Stackelberg game theory-based optimization of high-order robust control for fuzzy dynamical systems. IEEE Trans. Syst. Man Cybern. Syst. 2020, 52, 1254–1265. [Google Scholar] [CrossRef]
- Yang, C.; Wang, X.; Lu, Y.; Liu, H.; Le, Q.V.; Zhou, D.; Chen, X. Large language models as optimizers. In Proceedings of the Twelfth International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024. [Google Scholar]
- Sengupta, S.; Kambhampati, S. Multi-agent reinforcement learning in bayesian stackelberg markov games for adaptive moving target defense. arXiv 2020, arXiv:2007.10457. [Google Scholar] [CrossRef]
- Cheng, C.; Zhu, Z.; Xin, B.; Chen, C. A multi-agent reinforcement learning algorithm based on Stackelberg game. In Proceedings of the 2017 6th Data Driven Control and Learning Systems (DDCLS); IEEE: Princeton, NJ, USA, 2017; pp. 727–732. [Google Scholar]
- Shen, H.; Zhao, H.; Zhang, Z.; Yang, X.; Song, Y.; Liu, X. Network-wide traffic signal control based on marl with hierarchical Nash-Stackelberg game model. IEEE Access 2023, 11, 145085–145100. [Google Scholar] [CrossRef]
- Fiscko, C.; Yin, H.; Sinopoli, B. Hierarchical MARL with Stackelberg Games. In Proceedings of the 2025 American Control Conference (ACC); IEEE: Princeton, NJ, USA, 2025; pp. 4115–4122. [Google Scholar]
- Qu, H.; Li, X.; Liu, C.Z.; Zhang, J.; Zhuang, S.; Ma, M. Stackelberg Game-Theoretic Safe MARL with Bilevel Control for Autonomous Driving. IEEE Access 2026, 14, 17506–17524. [Google Scholar] [CrossRef]
- Littman, M.L. Markov games as a framework for multi-agent reinforcement learning. In Machine Learning Proceedings 1994; Elsevier: Amsterdam, The Netherlands, 1994; pp. 157–163. [Google Scholar]
- Lowe, R.; Wu, Y.I.; Tamar, A.; Harb, J.; Pieter Abbeel, O.; Mordatch, I. Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems 30; Curran Associates, Inc.: Red Hook, NY, USA, 2017. [Google Scholar]






| Method | Training Complexity | Test-Time Inference Complexity |
|---|---|---|
| LLM-TOC (Ours) | | |
| MAPPO/IPPO (Self-Play) | | |
| Population-Based Training (PBT) | | |
| Hypothetical Minds (Online LLM) | | per step |
| Substrate | LLM-TOC | Hypothetical Minds | MAPPO | IPPO | Oracle PPO |
|---|---|---|---|---|---|
| Collaborative Cooking (Asymmetric) | | | | | |
| Prisoner’s Dilemma (Repeated) | | | | | |
| Running with Scissors (Arena) | | | | | |
| Running with Scissors (Repeated) | | | | | |
| Metric | Full LLM-TOC | w/o Saliency | Improvement |
|---|---|---|---|
| Steps to reach 0.6 rel. perf. | | | 51.4% steps ↓ |
| Final rel. perf. to Oracle PPO | 75–85% | 61–70% | 17.8% avg. gain |
| Convergence speed (per 1e6 steps) | 0.21 | 0.09 | 133.3% faster |
| Metric | Mean Value | Standard Deviation |
|---|---|---|
| Generation Success Rate | 92.5% | 3.2% |
| Policy Quality Consistency (CV) | 7.8% | 2.1% |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Wang, C.; Yuan, J.; Yu, T.; Jiang, X.; Xiang, L.; Zhang, J.; He, Z. LLM-TOC: LLM-Driven Theory-of-Mind Adversarial Curriculum for Multi-Agent Generalization. Mathematics 2026, 14, 915. https://doi.org/10.3390/math14050915

