Article

LLM-TOC: LLM-Driven Theory-of-Mind Adversarial Curriculum for Multi-Agent Generalization

by Chenxu Wang 1, Jiang Yuan 2, Tianqi Yu 1, Xinyue Jiang 3, Liuyu Xiang 1, Junge Zhang 4 and Zhaofeng He 1,*

1 School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing 100876, China
2 Beijing Institute of Astronautical Systems Engineering, Beijing 100076, China
3 School of Mathematics and Physics, Beijing University of Posts and Telecommunications, Beijing 100876, China
4 Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
* Author to whom correspondence should be addressed.
Mathematics 2026, 14(5), 915; https://doi.org/10.3390/math14050915
Submission received: 12 February 2026 / Revised: 6 March 2026 / Accepted: 6 March 2026 / Published: 8 March 2026
(This article belongs to the Special Issue Applications of Intelligent Game and Reinforcement Learning)

Abstract

Zero-shot generalization to out-of-distribution (OOD) teammates and opponents in multi-agent systems (MASs) remains a fundamental challenge for general-purpose AI, especially in open-ended interaction scenarios. Existing multi-agent reinforcement learning (MARL) paradigms, such as self-play and population-based training, often collapse to a limited subset of Nash equilibria, leaving agents brittle when faced with semantically diverse, unseen behaviors. Recent approaches that invoke Large Language Models (LLMs) at run time can improve adaptability but introduce substantial latency and can become less reliable as task horizons grow; in contrast, LLM-assisted reward-shaping methods remain constrained by the inefficiency of the inner reinforcement-learning loop. To address these limitations, we propose LLM-TOC (LLM-Driven Theory-of-Mind Adversarial Curriculum), which casts generalization as a bi-level Stackelberg game: in the inner loop, a MARL agent (the follower) minimizes regret against a fixed population, while in the outer loop, an LLM serves as a semantic oracle that generates executable adversarial or cooperative strategies in a Turing-complete code space to maximize the agent’s regret. To cope with the absence of gradients in discrete code generation, we introduce Gradient Saliency Feedback, which transforms pixel-level value fluctuations into semantically meaningful causal cues to steer the LLM toward targeted strategy synthesis. We further provide motivating theoretical analysis via the PAC-Bayes framework, showing that LLM-TOC converges at rate O(1/K) and yields a tighter generalization error bound than parameter-space exploration under reasonable preconditions.
Experiments on the Melting Pot benchmark demonstrate that, with expected cumulative collective return as the core zero-shot generalization metric, LLM-TOC consistently outperforms self-play baselines (IPPO and MAPPO) and the LLM-inference method Hypothetical Minds across all held-out test scenarios, reaching 75% to 85% of the upper-bound performance of Oracle PPO. Meanwhile, with the number of RL environment interaction steps to reach the target relative performance as the core efficiency metric, our framework reduces the total training computational cost by more than 60% compared with mainstream baselines.

1. Introduction

Developing autonomous agents capable of seamless interaction with diverse, unseen counterparts in open-ended multi-agent systems remains a core challenge for general-purpose AI [1,2]. While Multi-Agent Reinforcement Learning (MARL) has achieved superhuman performance in closed zero-sum games via the self-play (SP) paradigm [3,4], these methods typically converge to narrow Nash equilibrium subsets when co-evolving with fixed training partners [5]. This leads to catastrophic generalization failure in mixed-motive environments like the Melting Pot benchmark [6,7], where agents face Out-of-Distribution (OOD) teammates/opponents with semantically distinct behaviors (Figure 1) [8]. This well-documented Zero-Shot Coordination (ZSC) gap [9] motivates our work to develop training methodologies that foster robust, generalizable social intelligence.
Existing population-based methods, such as Fictitious Co-Play (FCP) [10] and population-based training (PBT) [11], seek to mitigate this gap by expanding training population diversity. However, these methods operate exclusively in the low-level parameter space of neural networks, yielding only superficial diversity rather than semantically meaningful strategies [12]. As a result, agents trained via these methods still fail to generalize to complex, unseen social behaviors at test time.
Large Language Models (LLMs), with their embedded human social prior knowledge and strong reasoning capabilities [13,14], offer a new path to inject semantic diversity into MARL. Existing LLM-based MARL works fall into two paradigms, both with critical limitations: (1) LLM-as-Agent: Methods like Hypothetical Minds [15] and ProAgent [16] integrate LLMs into the real-time decision loop for online ToM inference, but suffer from prohibitive inference latency and high computational costs, making them unsuitable for high-frequency control tasks. (2) LLM-Assisted Reward Shaping: Methods like SEMDIV [17] use LLMs to generate reward functions offline but require a computationally expensive RL inner loop to train new policies from scratch for each generated reward, leading to severe efficiency bottlenecks and training instability [18].
To address these limitations, we propose LLM-Driven Theory-of-Mind Adversarial Curriculum (LLM-TOC), a novel framework that redefines the role of LLMs in MARL training. Unlike existing approaches, LLM-TOC positions the LLM as an offline semantic oracle that directly generates executable rule-based policies in Python code, eliminating both online inference latency and the repeated RL inner loop bottleneck. We formalize the zero-shot generalization problem as a bi-level Stackelberg game: the outer-loop LLM generates a code-based adversarial population to maximize the student agent’s regret, while the inner-loop lightweight MARL student agent optimizes its best-response policy to minimize this regret. We validate the effectiveness of this framework through extensive experiments on the Melting Pot benchmark, a widely used testbed for open-ended multi-agent social intelligence.
A core challenge of this paradigm is the modality gap: LLMs excel at textual semantic reasoning but cannot natively interpret numerical RL signals. To bridge this gap and close the training loop, we propose a novel Gradient Saliency Feedback mechanism, which transforms pixel-level value function fluctuations into semantically meaningful causal descriptions of agent failure. This provides the LLM with explicit, targeted grounding to generate high-value adversarial strategies, without hand-crafted prompts or manual feature engineering. Our core contributions are summarized as follows:
1. We propose the LLM-TOC framework, a novel bi-level training paradigm that leverages the code generation capabilities of LLMs to construct an infinitely scalable and semantically diverse adversarial curriculum for MARL. Diverging from existing LLM-assisted MARL methods, this framework directly generates executable rule-based policies in a Turing-complete code space, effectively circumventing the computational inefficiency and instability caused by the repeated RL inner loop training in traditional reward-shaping methods.
2. We introduce the Gradient Saliency Feedback mechanism, which bridges the modality gap between numerical RL signals and LLM semantic reasoning. This mechanism transforms pixel-level value fluctuations into semantically meaningful causal cues, enabling the LLM to perform targeted semantic gradient ascent in the abstract policy space without relying on hand-crafted prompts or manual feature engineering.
3. We provide motivating theoretical analysis for the proposed framework via the PAC-Bayes framework. We formally prove the convergence rate of LLM-TOC to a robust equilibrium at O(1/K), and present a qualitative analysis showing that our method yields a tighter generalization error bound than conventional parameter-space exploration methods under reasonable preconditions, providing a theoretical rationale for the superior zero-shot robustness of our approach.
We further validate the above innovations through extensive empirical evaluations on the Melting Pot benchmark, a representative testbed for open-ended multi-agent zero-shot generalization. The results show that LLM-TOC achieves state-of-the-art zero-shot generalization performance against unseen OOD partners in this benchmark, while reducing the total number of RL environment interaction steps required to reach the target performance by more than 60% compared with mainstream baselines within the same experimental settings.

2. Related Work

2.1. Zero-Shot Coordination and Generalization in MARL

Zero-Shot Coordination (ZSC) targets agents that can effectively collaborate with unseen partners without prior data or fine-tuning [9,19]. Classic MARL baselines, including Independent Proximal Policy Optimization (IPPO) [20], Multi-Agent Proximal Policy Optimization (MAPPO) [21], and the value-decomposition representative QMIX [22], mostly rely on the self-play (SP) paradigm to optimize joint returns in closed domains like StarCraft II [3]. While effective in fixed settings, these methods converge to arbitrary coordination conventions, leading to severe behavioral fragility when paired with out-of-distribution partners [8]. Early improvements like Other-Play (OP) [23] reduce reliance on arbitrary coordination signals via symmetry invariance but fail to scale to complex asymmetric environments.
Population-based training has become the dominant paradigm to improve agent robustness against diverse behaviors. Fictitious Co-Play (FCP) [10], which trains agents against checkpoints saved during SP, is a strong robustness baseline, yet its diversity is strictly limited by the training trajectory of a single optimization algorithm. Recent works explicitly maximize population diversity via targeted designs: Trajectory Diversity (TrajeDi) [24] and Latent Space Optimization (LIPO) [25] induce differentiated behaviors via Jensen–Shannon divergence or latent variables, while EvoAgent [26] generates diverse strategies via evolutionary algorithms. Despite these advances, most existing methods operate in low-level parameter or trajectory spaces, yielding only stochastic variations rather than high-level semantic strategies (e.g., deception, sacrifice) critical for human-level coordination [27,28]. In contrast, our LLM-TOC framework leverages LLM code generation to explore a Turing-complete semantic space, uncovering complex interaction strategies inaccessible to traditional parameter-space exploration.

2.2. Large Language Models for Autonomous Agents

Driven by strong reasoning and planning capabilities, the integration of Large Language Models (LLMs) with autonomous agents has grown rapidly [1,2], falling into two primary paradigms: LLM-as-Agent and LLM-Assisted Training.
The LLM-as-Agent paradigm integrates LLMs into the agent’s real-time decision loop. Works like Voyager [29] and Ghost in the Minecraft (GITM) [30] enable continuous skill acquisition in open-ended environments via code generation; MetaGPT [31] and AgentVerse [32] explore role-based collaboration for software development. Most relevant to our work is Hypothetical Minds [15], which uses LLMs for real-time Theory of Mind (ToM) inference and response planning. While validated on the Melting Pot benchmark [6] via hand-crafted visual-to-text parsers, it suffers from prohibitive inference latency, high token costs, and limited deployability in high-frequency visual control tasks without manual engineering.
The LLM-Assisted Training paradigm uses LLMs to support lightweight RL policy training. Eureka [33] and Text2Reward [18] synthesize dense reward functions to guide RL training; SEMDIV [17] generates diverse partner agents via reward shaping but relies on a computationally expensive RL inner loop that requires training a new policy from scratch for each generated reward. Distinct from these methods, LLM-TOC positions the LLM as an offline semantic oracle that directly generates executable policy code. This design combines the semantic reasoning capabilities of Hypothetical Minds with the execution efficiency of lightweight MARL agents (e.g., MAPPO), eliminating real-time inference latency. Our experiments on the Melting Pot benchmark verify its strong zero-shot generalization against semantically diverse opponents in this testbed [13,34].

2.3. Automated Curriculum Learning and Environment Design

Automatic Curriculum Learning (ACL) and Unsupervised Environment Design (UED) aim to generate training task sequences that maximize agent learning efficiency [35,36]. PAIRED [37] uses an adversary to generate challenging yet solvable training environments; ACCEL [38] improves this via Quality-Diversity (QD) algorithms like MAP-Elites [39] to evolve more robust environments; Prioritized Level Replay (PLR) [40] replays high-regret levels to enhance agent robustness.
For social generalization, the “environment” includes both physical scenarios and interactive counterparts. Open-Ended Learning (OEL) studies emphasize that sustained learning requires a continuous stream of novel challenges [41,42], yet most existing ACL methods only generate environmental features (e.g., grid layouts and physical parameters) rather than interactive policies with distinct behavioral patterns. Our work incorporates a Gradient Saliency Feedback mechanism [43,44] to align curriculum generation with the student agent’s failure modes. This design enables the LLM to perform semantic gradient ascent, generating targeted adversarial interaction scenarios (e.g., rear sneak attacks) to efficiently expose and fix the student’s performance deficits. Compared to unguided evolutionary search methods [45,46,47], our empirical results on the Melting Pot benchmark show significant improvements to the agent’s training convergence efficiency within the tested multi-agent scenarios.

2.4. Stackelberg Games in MARL

The Stackelberg game, a hierarchical leader–follower model with the Stackelberg equilibrium as its solution [48], is widely used in MARL to model adversarial training, curriculum learning, and opponent exploitation [49]. Early studies focused on static equilibrium in discrete action spaces for security and zero-sum games [50], while recent works extend it to deep MARL for adversarial environment design and agent robustness.
However, existing Stackelberg-based MARL methods are limited to the continuous parameter space of neural networks [51], producing only superficial strategy diversity [52]. In contrast, we are the first to model multi-agent zero-shot generalization as a Stackelberg game in the Turing-complete semantic code space. We employ an LLM as the leader to generate executable adversarial strategies with a Gradient Saliency Feedback mechanism. Our experiments on the Melting Pot benchmark verify that this design greatly expands the semantic coverage of the training population and alleviates mode collapse in traditional parameter-space methods for multi-agent zero-shot generalization tasks.

3. Problem Formulation

3.1. Partially Observable Markov Games

We formalize open-ended multi-agent interaction as a Partially Observable Markov Game (POMG) [53,54], defined by the tuple G = ⟨S, N, A, P, R, Ω, O, γ⟩. Here, S denotes the global state space, and N = {1, …, n} represents the set of agents. We partition N into the student agent (E), which acts as the focal agent, and the other agents (−E), comprising potential teammates or opponents. The joint action space is defined as A = A_E × A_{−E}, while the state transition dynamics are governed by P(s′ | s, a_E, a_{−E}). The term R_E(s, a_E, a_{−E}) specifies the reward function for the student agent. Additionally, Ω_i denotes the set of partial observations for agent i, generated by the observation function O_i(s), and γ ∈ [0, 1) represents the discount factor.
The student agent operates according to a policy π_E(a_E | o_E; θ), parameterized by θ, such as the weights of a neural network. Conversely, the other agents follow a joint policy π_{−E}(a_{−E} | o_{−E}), which may be driven by diverse or unknown logic. The expected return for the student agent against a specific counterpart π_{−E} is defined as:

$$J(\pi_E, \pi_{-E}) = \mathbb{E}_{\tau \sim (\pi_E,\, \pi_{-E})}\left[\sum_{t=0}^{\infty} \gamma^{t}\, R_E\big(s_t, a_{E,t}, a_{-E,t}\big)\right].$$
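As a concrete illustration, the expected return J defined above is typically estimated by averaging discounted reward sums over sampled rollouts. The sketch below is a minimal single-trajectory estimator in plain NumPy; the reward lists stand in for trajectories collected against a fixed counterpart and are not from the paper's experiments.

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Single-trajectory discounted sum: sum_t gamma^t * r_t."""
    rewards = np.asarray(rewards, dtype=float)
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * rewards))

def estimate_J(rollouts, gamma=0.99):
    """Monte Carlo approximation of the expectation over trajectories."""
    return float(np.mean([discounted_return(r, gamma) for r in rollouts]))
```

Averaging over many rollouts against the same counterpart policy converges to J(π_E, π_{−E}) as defined above.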

3.2. Zero-Shot Generalization as Minimax Regret

A central challenge within open-ended environments such as Melting Pot [6] is zero-shot generalization. The student agent π_E is required to demonstrate robust performance when paired with test-time counterparts π_{−E}^{test} sampled from an unknown distribution D_test, without the opportunity for fine-tuning. Conventional MARL approaches typically assume that π_{−E} originates from a restricted set of behaviors, such as checkpoints derived from self-play. However, in realistic scenarios, π_{−E} resides within a vast semantic strategy space, denoted as Π_sem. This space encompasses all behaviorally distinct policies expressible through logical rules, code, or natural language, including strategies such as tit-for-tat, deceptive cooperation, or aggressive blocking. Notably, this space is significantly larger and exhibits a discrete structure distinct from the continuous parameter space of neural networks.
To achieve robust generalization, we formulate the training objective as the minimization of maximum regret over the entire semantic strategy space Π_sem. We first define the regret of a student policy π_E against a specific counterpart π_{−E} as the difference between the maximal achievable return against π_{−E} and the actual return:

$$\mathrm{Regret}(\pi_E, \pi_{-E}) = \underbrace{\max_{\pi} J(\pi, \pi_{-E})}_{V^{*}(\pi_{-E})} - J(\pi_E, \pi_{-E}),$$

where V^*(π_{−E}) represents the oracle performance achievable assuming perfect prior knowledge of π_{−E}. Our objective is to identify an optimal student policy π_E^* that minimizes the worst-case regret across all possible semantic strategies:

$$\pi_E^{*} = \arg\min_{\pi_E} \max_{\pi_{-E} \in \Pi_{sem}} \mathrm{Regret}(\pi_E, \pi_{-E}).$$

3.3. The Challenge of Semantic Coverage

Directly solving the optimization problem presented above is intractable due to two primary factors. (1) Infinite space: the semantic space Π_sem is effectively infinite and non-differentiable, rendering traversal via standard gradient-based optimization or simple population sampling, such as FCP, infeasible. (2) Modality gap: conventional RL methods operate within the parameter space Θ. Exploration of Θ through Gaussian noise, a technique commonly employed in PBT, rarely yields high-level semantic strategies, such as complex deception. Consequently, such methods fail to adequately cover the support of Π_sem.
Therefore, an efficient mechanism is required to identify the worst-case π_{−E} within Π_sem to serve as a curriculum. This necessity motivates our proposed LLM-TOC framework, which leverages LLMs as a semantic engine to generate π_{−E} directly within the code space, thereby approximating the maximization over π_{−E} in a computationally feasible manner. In Appendix A, we list all the symbols used in this paper and provide their definitions.

4. Methodology

Building upon the established problem formulation, we propose LLM-TOC, a framework designed to approximate the intractable minimax regret objective through a bi-level iterative process. This methodology comprises two coupled loops, as illustrated in Figure 2: an outer loop wherein a Large Language Model functions as a semantic oracle to generate adversarial policies π_{−E} directly within the code space, and an inner loop in which the student agent π_E performs robust optimization against the generated population. To facilitate this interaction, we introduce a Gradient Saliency Feedback mechanism that effectively bridges the modality gap between numerical reinforcement learning signals and semantic reasoning.

4.1. The Bi-Level Optimization Framework

We formally structure the training process as a Stackelberg game, modeled as a bi-level optimization problem. In this hierarchy, the leader initiates the interaction by selecting a distribution of opponent strategies designed to exploit the weaknesses of the follower. In response, the follower optimizes its policy against this distribution. Let π_E ∈ Π_net denote the student policy parameterized by θ, and P ⊂ Π_sem represent the population of strategies for the other agents.
We first formally define the Stackelberg game mapping of our LLM-TOC framework, which strictly aligns with the standard leader–follower Stackelberg game model.
In our framework, the two players of the Stackelberg game are explicitly defined as follows. Leader: The Large Language Model acting as the semantic oracle, which is the first-mover in the game. Follower: The student MARL agent, which optimizes its policy as the best response to the leader’s action.
The strategy spaces of the two players are strictly defined, corresponding to the two nested loops of our framework. Leader’s Strategy Space: The semantic strategy space Π_sem, which consists of all executable rule-based policies expressed via Python code. Each element in this space is a complete policy class that can be directly injected into the multi-agent environment to interact with the student agent. Follower’s Strategy Space: The neural network parameterized policy space Π_net, where each element is a policy π_E(a_E | o_E; θ) parameterized by the neural network weights θ.
The payoff functions of the two players are defined as zero-sum in the worst-case regret minimization problem, which aligns with our generalization objective. Leader’s Payoff: The regret of the student agent, defined as Regret(π_E, π_{−E}) = V^*(π_{−E}) − J(π_E, π_{−E}). The leader’s objective is to maximize this regret by generating adversarial policies π_{−E} ∈ Π_sem. Follower’s Payoff: The negative of the student’s worst-case regret, −max_{π_{−E} ∈ P} Regret(π_E, π_{−E}), where P is the policy population generated by the leader. The follower’s objective is to maximize this payoff by optimizing its policy parameters θ.
The goal of our framework is to find the Stackelberg equilibrium (π_E^*, π_{−E}^*) of this game, where the following hold. The follower’s policy π_E^* is the best response to the leader’s policy population: π_E^* = arg min_{π_E ∈ Π_net} max_{π_{−E} ∈ P} Regret(π_E, π_{−E}). The leader’s policy π_{−E}^* is the optimal adversarial strategy that maximizes the follower’s regret under the constraint that the follower plays its best response:

$$\pi_{-E}^{*} = \arg\max_{\pi_{-E} \in \Pi_{sem}} \mathrm{Regret}\big(\pi_E^{*}(\pi_{-E}),\, \pi_{-E}\big).$$
Based on this formal Stackelberg game mapping, we define the bi-level optimization objectives of our framework as follows:
The Follower Objective (Inner Loop): Given a fixed population P_k provided by the leader at iteration k, the student seeks to maximize its expected return. This corresponds to the standard MARL objective:

$$\pi_E^{*}(P_k) = \arg\max_{\pi_E \in \Pi_{net}} J_{student}(\pi_E, P_k) = \arg\max_{\pi_E} \mathbb{E}_{\pi_{-E} \sim U(P_k)}\, \mathbb{E}_{\tau \sim (\pi_E,\, \pi_{-E})}\left[\sum_{t=0}^{T} \gamma^{t} r_t\right].$$

This step aims to identify the best response strategy π_E^* that exhibits robustness to the current set of counterparts.
The Leader Objective (Outer Loop): The leader aims to expand the population P_k by introducing a new strategy π_{−E}^{(k+1)} that maximizes the regret of the student. As calculating exact regret requires an optimal oracle V^*, we approximate this objective by maximizing the exploitability of the student; specifically, we identify a strategy that minimizes the performance of the current student policy π_E^*(P_k):

$$\pi_{-E}^{(k+1)} = \arg\min_{\pi_{-E} \in \Pi_{sem}} \mathbb{E}_{\tau \sim (\pi_E^{*},\, \pi_{-E})}\left[\sum_{t=0}^{T} \gamma^{t} r_t\right].$$

However, as the semantic space Π_sem is discrete and non-differentiable, optimization of the leader’s objective via gradient descent is infeasible. Consequently, we employ the LLM as a semantic oracle Φ to approximate this process:

$$\pi_{-E}^{(k+1)} \leftarrow \Phi\big(\pi_E^{*}(P_k),\, \text{Feedback}\big).$$
This bi-level architecture ensures that the student is continuously challenged by worst-case scenarios, thereby driving the expansion of its robust hull within the strategy space.
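The alternation between the two loops can be sketched as a plain Python skeleton. Here `train_best_response`, `measure_feedback`, and `llm_generate_policy` are hypothetical placeholders for the inner-loop MARL update, the saliency feedback extraction, and the LLM oracle Φ; they are not APIs from the paper's implementation.

```python
# Schematic of the bi-level alternation: the follower best-responds to the
# fixed population, then the leader (LLM oracle) expands the population.
def llm_toc_outer_loop(student, population, K, train_best_response,
                       measure_feedback, llm_generate_policy):
    for k in range(K):
        # Inner loop: follower optimizes against the fixed population P_k.
        student = train_best_response(student, population)
        # Feedback stands in for the unavailable gradient of the leader objective.
        feedback = measure_feedback(student, population)
        # Outer loop: the semantic oracle proposes a new adversarial policy.
        population = population + [llm_generate_policy(feedback)]
    return student, population
```

Each outer iteration grows the population by one validated adversarial strategy, so the student is always trained against the cumulative curriculum.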

4.2. Gradient Saliency Feedback Mechanism

To effectively guide the semantic oracle Φ within the outer loop, bridging the modality gap between numerical reinforcement learning losses and the semantic reasoning capabilities of LLMs is essential. To this end, we introduce a Gradient Saliency Feedback mechanism designed to perform causal attribution.

4.2.1. Identifying Critical Moments via Value Surprise

We first identify specific instances of policy failure by monitoring the value function V(s; ϕ) of the student agent π_E during evaluation episodes. We define the value surprise, denoted as δ_t, as the absolute Temporal Difference (TD) error, which quantifies the discrepancy between the expected return of the agent and the actual outcome:

$$\delta_t = \big|\, r_t + \gamma V(s_{t+1}; \phi) - V(s_t; \phi) \,\big|.$$

A high magnitude of δ_t signifies a critical frame t^*, indicating a pivotal event such as an unexpected penalty, an adverse interaction, or a missed reward. We extract a set of critical frames T_crit = { t : δ_t > ε_thresh } for subsequent detailed analysis.
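The critical-frame extraction above reduces to a vectorized TD-error threshold. A minimal NumPy sketch, assuming per-step rewards and value estimates are already available as arrays (function and argument names are illustrative):

```python
import numpy as np

def critical_frames(rewards, values, gamma=0.99, eps_thresh=1.0):
    """Flag frames whose absolute TD error |r_t + gamma*V(s_{t+1}) - V(s_t)|
    exceeds the surprise threshold. `values` has length T+1 so that the
    bootstrap term V(s_{t+1}) exists for every step t."""
    r = np.asarray(rewards, dtype=float)
    v = np.asarray(values, dtype=float)
    delta = np.abs(r + gamma * v[1:] - v[:-1])  # value surprise per frame
    return np.flatnonzero(delta > eps_thresh), delta
```

For example, a large unexpected reward at step 1 produces a surprise spike there, and only that frame is flagged for diagnosis.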

4.2.2. Visual Attribution via Jacobian Saliency

For each critical frame t^* ∈ T_crit, we elucidate the cause of the value deviation by computing the sensitivity of the value function with respect to the input observation X_{t^*} ∈ R^{C×H×W}. This is achieved by calculating the Jacobian matrix of the value function relative to the input pixels.
Specifically, let V(s_{t^*}) represent the scalar value output. We compute the gradient map G_{t^*} ∈ R^{C×H×W} via backpropagation:

$$G_{t^*} = \nabla_{X_{t^*}} V(s_{t^*}; \phi) = \frac{\partial V(s_{t^*})}{\partial X_{t^*}}.$$

To obtain a single 2D saliency map M_{t^*} ∈ R^{H×W}, we aggregate the gradients across the channel dimension (RGB or feature channels) by computing the maximum absolute value or the L2 norm:

$$M_{t^*}(h, w) = \max_{c} \big|\, G_{t^*}(c, h, w) \,\big|.$$

This map M_{t^*} highlights specific regions in the visual field where perturbations would induce the most significant shifts in the value estimation of the agent. Mathematically, this approximates the first-order Taylor expansion of the value function, identifying the features to which the agent attends, or fails to attend, during the critical event.
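The channel-aggregation step can be sketched as follows, assuming the C × H × W gradient tensor has already been obtained by backpropagation; the array here is a stand-in, not the agent's actual value gradients.

```python
import numpy as np

def aggregate_saliency(grad_map, mode="max"):
    """Collapse a C x H x W value-gradient tensor into a 2D saliency map:
    either the channel-wise maximum of |dV/dx| or the L2 norm over channels."""
    g = np.abs(np.asarray(grad_map, dtype=float))
    if mode == "max":
        return g.max(axis=0)          # max absolute value across channels
    return np.sqrt((g ** 2).sum(axis=0))  # L2 norm across channels
```

Either aggregation yields the H × W map M used for the subsequent semantic mapping; the max variant tends to preserve sharp single-channel spikes.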

4.2.3. Semantic Mapping

We map the pixel-level saliency M to semantic concepts using ground-truth object masks provided by the environment engine. Let I_obj denote the binary mask for a specific object class obj, such as counterparts π_{−E} or resources. The attention score for each object is calculated as follows:

$$\mathrm{Score}_{obj} = \frac{\sum_{i,j} M_{i,j} \cdot I_{obj,\,i,j}}{\sum_{i,j} M_{i,j} + \epsilon}.$$

By comparing these scores, we generate semantic descriptions. For instance, if Score_{−E} ≫ Score_{goal} during a significant value drop, the system generates a description indicating that the focal agent focused heavily on the opponent while neglecting the goal, thereby leading to failure.
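The attention score defined above is a masked ratio of saliency mass. A minimal NumPy sketch, assuming the saliency map and binary object masks are supplied as arrays (names are illustrative):

```python
import numpy as np

def attention_score(saliency, mask, eps=1e-8):
    """Fraction of total saliency mass falling inside an object's binary
    mask; scores over disjoint masks therefore sum to at most 1."""
    m = np.asarray(saliency, dtype=float)
    obj = np.asarray(mask, dtype=float)
    return float((m * obj).sum() / (m.sum() + eps))
```

Comparing scores across object classes (opponents, resources, goals) is what grounds the natural language diagnosis fed to the LLM.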

4.3. Policy Generation and Curriculum Evolution

Central to our curriculum evolution is the transformation of semantic feedback into executable policy code. We formalize this process as a conditional generation problem within the semantic space.
Let D_sem^{(k)} denote the semantic diagnosis derived from the gradient saliency analysis at iteration k. We construct a structured prompt P_k that encapsulates the designated role, API constraints, and the diagnosis

$$P_k = \mathrm{Concat}\big(I_{role},\, I_{API},\, D_{sem}^{(k)}\big),$$

where I_role defines the adversarial objective and I_API specifies the programming interface.
The LLM functions as a generator, sampling new policy code π_code^{(k+1)} from its learned distribution:

$$\pi_{code}^{(k+1)} \sim P_{LLM}\big(\cdot \mid P_k\big).$$

This generation process is conceptualized as a semantic gradient ascent. The diagnosis D_sem^{(k)} serves as a gradient direction within the semantic manifold, pointing toward the region of the strategy space where the student agent π_E exhibits maximum vulnerability. For example, if the diagnosis indicates a susceptibility to rear attacks, the LLM generates logic specifically targeting the blind spot of the focal agent.
To clarify the prompt iteration mechanism and the full translation pipeline from semantic feedback to executable policies, we have:
  • Prompt Regeneration Rule: The structured prompt P_k is regenerated from scratch at each outer-loop iteration k. The fixed components (role instruction I_role and API constraints I_API) are reused across all iterations to ensure the consistency of generation format and objective, while the core semantic diagnosis component D_sem^{(k)} is completely updated based on the student agent’s latest failure modes in the current iteration, with no cross-iteration reuse of diagnosis content. This ensures that the generated adversarial policies are always targeted at the student’s current weaknesses, rather than outdated historical vulnerabilities.
  • Closed-Loop Translation Pipeline: The semantic feedback derived from gradient saliency is translated into executable policies through three sequential steps: (i) the pixel-level saliency map is mapped to object-level attention scores via Equation (11); (ii) the attention scores are converted into a natural language causal diagnosis describing the student’s failure mode; (iii) the diagnosis is embedded into the prompt, and the LLM generates Python policy code that directly exploits the identified vulnerability, with no manual intervention in the entire process. It should be clarified that our framework modifies the interactive adversarial partner policies in the multi-agent system, rather than the physical layout or dynamics of the environment itself.
  • Policy Validation and Filtering Mechanism: After generation, we perform a two-step validation on π_code^{(k+1)}: first, a syntactic validation to ensure the code is compilable and executable in the Melting Pot environment; second, a functional validation to evaluate the policy’s ability to reduce the student agent’s return via 5 evaluation rollouts. Only policies that pass both validations are added to the population P_k, to avoid invalid or irrelevant strategies polluting the curriculum.
Finally, the curriculum evolves through the aggregation of these validated generated policies. The population P is updated cumulatively:

$$P_{k+1} = P_k \cup \big\{\mathrm{Compile}\big(\pi_{code}^{(k+1)}\big)\big\}.$$

This mechanism prevents catastrophic forgetting in the student agent π_E. To minimize regret in the subsequent inner loop, the agent is compelled to simultaneously enhance its robustness against all previously generated strategies in addition to the newly introduced adversarial policy.
The complete prompt template used for LLM generation, including the full role instruction, API constraints, and a dynamic diagnosis example, is provided in Appendix C.
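The two-step validation and cumulative population update described above can be sketched as follows. `compile_policy` and `rollout_return` are hypothetical hooks standing in for code compilation in the Melting Pot environment and a single evaluation rollout returning the student's return; they are not part of the paper's released interface.

```python
def evolve_population(population, policy_code, compile_policy, rollout_return,
                      n_rollouts=5, baseline_return=0.0):
    """Two-step validation before the cumulative union P_{k+1} = P_k + {pi}:
    (i) the generated code must compile and instantiate, and
    (ii) over n_rollouts the new opponent must push the student's mean
    return below the baseline, i.e. it must actually be adversarial."""
    try:
        policy = compile_policy(policy_code)           # syntactic validation
    except Exception:
        return population                              # reject: not executable
    mean_ret = sum(rollout_return(policy) for _ in range(n_rollouts)) / n_rollouts
    if mean_ret >= baseline_return:                    # functional validation
        return population                              # reject: not adversarial
    return population + [policy]                       # accept: expand curriculum
```

Rejected policies leave the population unchanged, so the curriculum only ever accumulates strategies that verifiably exploit the current student.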

4.4. Algorithm Summary

The LLM-TOC algorithm, formalized in Algorithm 1, operates as a three-stage iterative cycle that progressively expands the student agent’s robust strategy space, with the following core execution logic unique to our framework:
1. Initialization: We instantiate the student agent with random weights, and initialize the opponent population with heuristic-based policies to provide initial training signals.
2. Inner-Loop Student Optimization: For each outer-loop iteration, the student agent optimizes its policy against the current opponent population until convergence, to learn a robust best response to the existing curriculum.
3. Gradient Saliency Diagnosis: After convergence, we evaluate the student against the current population to identify the worst-case opponent that induces maximum regret. We then extract failure trajectories, compute value surprise to locate critical failure frames, and generate semantic causal descriptions of the student's vulnerabilities via Jacobian saliency-map analysis.
4. Outer-Loop Curriculum Evolution: The semantic diagnosis is embedded into a structured prompt to guide the LLM to generate a new executable adversarial policy targeting the identified vulnerability. The generated policy is validated for executability and effectiveness, then added to the opponent population to expand the training curriculum.
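The diagnosis arithmetic in step 3 can be sketched in NumPy. This is an illustrative reconstruction under stated assumptions: the input Jacobian `grad` is assumed to have been precomputed by a single backward pass of the value function, and `object_masks` is a hypothetical lookup of binary pixel masks per object category.

```python
import numpy as np

def value_surprise(rewards, values, gamma=0.99):
    """delta_t = |r_t + gamma * V(s_{t+1}) - V(s_t)| for t = 0..T-1,
    where values has length T+1 (bootstrap value included)."""
    r = np.asarray(rewards, dtype=float)
    v = np.asarray(values, dtype=float)
    return np.abs(r + gamma * v[1:] - v[:-1])

def critical_steps(delta, percentile=90):
    """Critical frames: steps whose surprise exceeds the batch percentile."""
    thresh = np.percentile(delta, percentile)
    return np.nonzero(delta > thresh)[0], thresh

def saliency_map(grad):
    """Channel-wise max-abs aggregation: M(h, w) = max_c |G(c, h, w)|."""
    return np.abs(grad).max(axis=0)

def object_scores(saliency, object_masks, eps=1e-8):
    """Map pixel saliency to per-object attention scores:
    score_o = sum(M * mask_o) / (sum(M) + eps)."""
    total = saliency.sum() + eps
    return {name: float((saliency * mask).sum() / total)
            for name, mask in object_masks.items()}
```

The resulting per-object scores are what the framework verbalizes into a causal diagnosis (e.g., "the value drop is attributed mostly to the opponent, not the goal") before prompting the LLM.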
Algorithm 1 LLM-TOC: LLM-driven theory-of-mind adversarial curriculum.
Input: Max outer iterations K, Inner steps T i n , Threshold ϵ t h r e s h , Learning rate α ;
Input: Initialize student policy π E with parameters θ ;
Input: Initialize opponent population P 0 { π h e u r i s t i c } ;
  1: for k = 1, …, K do                                                    ▹ Outer Loop: Curriculum Evolution
  2:     P_curr ← P_{k−1};
  3:     // Phase 1: Inner Loop (Student Optimization)
  4:     repeat
  5:         Sample opponent batch π_{−E} ∼ U(P_curr);
  6:         Collect trajectories τ = {(s_t, a_t, r_t, s_{t+1})} using π_E and π_{−E};
  7:         Compute advantages A_t using GAE;
  8:         Update θ ← θ + α ∇_θ E_τ[min(ρ_t A_t, clip(ρ_t, 1−ϵ, 1+ϵ) A_t)];    ▹ PPO Update
  9:     until performance converges
10:     // Phase 2: Evaluation & Diagnosis (Gradient Saliency)
11:     Identify worst-case opponent: π_{−E}* ← argmin_{π ∈ P_curr} E_τ[Σ_t γ^t r_t];
12:     Collect evaluation rollout D_eval against π_{−E}*;
13:     Initialize diagnosis set D_sem^(k) ← ∅;
14:     for each step t in D_eval do
15:         Calculate Value Surprise: δ_t ← |r_t + γ V(s_{t+1}) − V(s_t)|;
16:         if δ_t > ϵ_thresh then                                              ▹ Identify Critical Moment
17:             Compute Jacobian: G_t ← ∇_{X_t} V(s_t);
18:             Aggregate Saliency: M_t(h, w) ← max_c |G_t(c, h, w)|;
19:             Compute Object Scores: Score_obj ← Σ(M_t ⊙ I_obj) / (Σ M_t + ϵ);
20:             Generate description S_t based on Score_obj, e.g., Focus on π_{−E}* > Goal;
21:             D_sem^(k) ← D_sem^(k) ∪ {S_t};
22:         end if
23:     end for
24:     // Phase 3: Semantic Generation (Leader Optimization)
25:     Construct Prompt P_k ← Concat(I_role, I_API, D_sem^(k));
26:     Generate Adversarial Policy: π_code^(k) ∼ P_LLM(· | P_k);
27:     Validate and Compile π_code^(k);
28:     Update Population: P_k ← P_curr ∪ {π_code^(k)};
29: end for
30: Output: Robust Student Policy π_E
This cycle repeats for a predefined number of outer-loop iterations, or until the LLM can no longer generate valid strategies that reduce the student’s performance. The final robust student policy is returned as the output. The full prompt template used for LLM generation is provided in Appendix C.
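As a minimal sketch of Algorithm 1's control flow, the skeleton below wires the three phases together. All five hook functions are hypothetical stand-ins for the components described above, not the actual implementation.

```python
def llm_toc(student, population, train_inner, diagnose, generate_policy,
            validate, K=10):
    """Control-flow sketch of Algorithm 1 with hypothetical hooks:
    train_inner     -- PPO best response against the current population;
    diagnose        -- gradient-saliency diagnosis of the worst-case opponent;
    generate_policy -- one LLM query conditioned on the diagnosis;
    validate        -- syntactic + functional policy filtering.
    """
    for k in range(1, K + 1):
        # Phase 1: inner loop -- student optimizes against the fixed population.
        student = train_inner(student, population)
        # Phase 2: evaluation & diagnosis on the worst-case opponent.
        diagnosis = diagnose(student, population)
        # Phase 3: leader step -- semantic generation of a new adversary.
        candidate = generate_policy(diagnosis)
        policy = validate(candidate)
        if policy is not None:           # invalid generations are discarded;
            population.append(policy)    # the population grows cumulatively.
    return student
```

Note that only one LLM call occurs per outer iteration, and a failed validation leaves the population unchanged for that iteration, consistent with the convergence discussion in Section 4.5.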

4.5. Motivating Theoretical Analysis

In this section, we present a systematic theoretical analysis of the LLM-TOC framework via the PAC-Bayes learning framework and Stackelberg game theory, providing a rigorous theoretical rationale for the superior zero-shot generalization performance, sample efficiency, and convergence stability of our method. We first formalize the two core foundational assumptions that underpin all our theoretical conclusions, with explicit discussion of their practical implications, applicable conditions, and inherent limitations in realistic multi-agent scenarios. We then derive the convergence rate of the bi-level optimization process, prove that LLM-TOC yields a tighter generalization error bound than conventional parameter-space exploration methods, and systematically analyze the computational complexity of the framework to verify its training and inference efficiency advantages. All detailed mathematical derivations and complete proofs are provided in Appendix D.

4.5.1. Core Foundational Assumptions

All theoretical guarantees of the LLM-TOC framework are established on two non-trivial foundational assumptions, which are formally defined, interpreted, and bounded below. These assumptions directly link our theoretical design to the practical implementation of the framework, and their validity is empirically verified in Section 5.4.
Assumption 1
(Bounded Payoff). The value function of the multi-agent game is uniformly bounded; there exists a constant V_max > 0 such that
∀ π_E ∈ Π_net, ∀ π_{−E} ∈ Π_sem: |V(π_E, π_{−E})| ≤ V_max,
where V(π_E, π_{−E}) = E_{τ∼(π_E, π_{−E})} [Σ_{t=0}^∞ γ^t r_t] denotes the expected discounted cumulative return of the student agent π_E interacting with the counterpart policy π_{−E}, Π_net is the neural-network-parameterized policy space of the student agent, and Π_sem is the Turing-complete semantic strategy space of code-based policies.
  • Practical Implications: This assumption ensures that the regret of the student agent, defined as the gap between the oracle optimal return and the actual achieved return, is always bounded. It eliminates the possibility of infinite positive/negative returns that would break the monotonic improvement property of our iterative bi-level optimization, and is a necessary precondition for the convergence of the algorithm to a stable robust equilibrium.
  • Applicable Conditions in Realistic Environments: This assumption holds for almost all standard episodic MARL environments, including the Melting Pot benchmark used in our experiments as well as most realistic open-ended multi-agent interaction scenarios. Specifically, it is satisfied when three conditions are met: (1) the per-step reward of the environment is constrained to a fixed finite range; (2) the maximum length of each interaction episode is finite; and (3) the discount factor γ [ 0 , 1 ) , which ensures the cumulative discounted return is always bounded regardless of the interaction horizon.
  • Inherent Limitations: This assumption does not hold for non-episodic, infinite-horizon tasks with unbounded per-step rewards. For such extreme scenarios, the convergence guarantee of our framework needs to be re-derived with additional constraints on the reward growth rate, which we leave for future work.
Assumption 2
(ε-Approximate Semantic Oracle). Let Φ: Π_net → Π_sem denote the LLM-based policy generation function (the semantic oracle) in our framework. We define the LLM as an ε_oracle-approximate semantic oracle if, for any given student policy π_E, the strategy B = Φ(π_E) generated by the LLM satisfies
V(π_E, B) ≤ min_{π ∈ Π_sem} V(π_E, π) + ε_oracle,
where ε_oracle ≥ 0 is the bounded approximation error of the oracle.
  • Practical Implications: This assumption requires that the LLM can consistently generate adversarial strategies that approximate the worst-case attack on the student agent, with a bounded approximation error. It is the core premise that ensures our outer-loop curriculum generation can effectively tighten the constraint set of the inner-loop optimization, thereby driving the algorithm to converge to a robust equilibrium rather than a narrow, brittle Nash equilibrium subset.
  • Applicable Conditions in Realistic Environments: This assumption holds when the adopted LLM has sufficient capabilities in three critical dimensions: (1) strong logical reasoning ability to understand the causal diagnosis of the student agent’s failure modes derived from gradient saliency analysis; (2) reliable code generation ability to translate the adversarial strategy into syntactically valid, executable Python code that is fully compatible with the environment interface; (3) sufficient domain knowledge of multi-agent game theory to design strategically effective adversarial behaviors that can exploit the student’s vulnerabilities. Our empirical results in Section 5.4 verify that this assumption is well satisfied in our implementation, with a 92.5% generation success rate and stable policy quality across 4 independent random seeds.
  • Inherent Limitations: The approximation error ε_oracle is directly determined by the capability of the adopted LLM. Smaller open-source LLMs with limited reasoning and coding capabilities may fail to generate valid targeted adversarial strategies, leading to a large ε_oracle that breaks the convergence guarantee. In addition, for extremely complex multi-agent environments with obscure failure modes that are hard to describe in semantic language, the LLM may fail to approximate the worst-case strategy effectively. Our two-step policy validation and filtering mechanism (described in Section 4.3) automatically excludes invalid generations, ensuring that the convergence of the algorithm is not affected by occasional generation failures.

4.5.2. Convergence Rate Analysis

We model the bi-level iterative optimization of LLM-TOC as a leader–follower Stackelberg game, and analyze its convergence properties under the two core assumptions defined above. We define the worst-case regret of the student policy π_E as
R_worst(π_E) = max_{π_{−E} ∈ Π_sem} [ V*(π_{−E}) − V(π_E, π_{−E}) ],
where V*(π_{−E}) = max_π V(π, π_{−E}) represents the oracle optimal performance achievable against the counterpart policy π_{−E}, assuming perfect prior knowledge of π_{−E}.
We formalize the convergence property of LLM-TOC in the following theorem, with the complete proof provided in Appendix D.
Theorem 1
(Convergence of LLM-TOC). Let P_k = {B_1, B_2, …, B_k} ⊂ Π_sem be the opponent policy population at the k-th outer-loop iteration. Under the Bounded Payoff assumption and the ε-approximate semantic oracle assumption, the sequence of worst-case regrets {R_worst(π_E^(k))}_{k=1}^∞ converges to an ε_oracle-Nash equilibrium neighborhood, with the average regret over K iterations bounded by
R̄_K ≤ C/√K + ε_oracle,
where C is a constant depending on the diameter of the strategy space and the payoff bound V_max, and K is the number of outer-loop iterations. This theorem delivers two core theoretical conclusions:
1. Convergence Rate: LLM-TOC converges to a robust equilibrium at a rate of O(1/√K) in the number of outer-loop iterations K. This matches the convergence rate of classic fictitious-play algorithms in zero-sum games, verifying that our semantic-space curriculum evolution carries the same rigorous convergence guarantee as traditional gradient-based optimization, despite operating in a discrete, non-differentiable code space.
2. Equilibrium Robustness: The algorithm converges to an ε_oracle-Nash equilibrium neighborhood, where the student agent's worst-case regret is bounded by the approximation error of the semantic oracle. Consequently, no unseen strategy in the semantic space Π_sem can exploit the student agent beyond the bounded error ε_oracle, which directly guarantees the zero-shot generalization robustness of the learned policy.
Notably, our policy validation and filtering mechanism ensures that only valid, high-quality adversarial policies are added to the population P k . Even if the LLM occasionally fails to generate an effective adversarial strategy, the convergence of the algorithm is not affected, as the invalid policy is discarded and the population remains unchanged in that iteration.

4.5.3. Generalization Bound Analysis via PAC-Bayes Framework

We leverage the PAC-Bayes learning framework to formally analyze the generalization performance of LLM-TOC, and prove that our semantic-space exploration yields a strictly tighter generalization error bound than conventional parameter-space exploration methods under reasonable preconditions.
For a student policy π_E trained on a set of training opponent policies, we define the true generalization error L(π_E) as the expected regret of the agent on the true, unknown distribution of test-time opponents D_test, and the empirical error L̂(π_E) as the regret on the training population. The classic PAC-Bayes generalization bound states that, for any posterior distribution Q over student policies and prior distribution P over the policy space, with probability at least 1 − δ:
L(Q) ≤ L̂(Q) + √( ( D_KL(Q‖P) + ln(1/δ) ) / (2m) ),
where D_KL(Q‖P) is the Kullback–Leibler (KL) divergence between the posterior Q and the prior P, m is the number of training samples (i.e., the number of opponent policies in the training population), and δ ∈ (0, 1) is the confidence parameter.
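To make the bound concrete, the complexity term can be evaluated numerically. The KL magnitudes below are hypothetical, chosen only to illustrate how a smaller KL term tightens the bound for a fixed population size m and confidence δ.

```python
import math

def pac_bayes_gap(kl, m, delta=0.05):
    """Complexity term of the PAC-Bayes bound:
    sqrt((KL(Q || P) + ln(1/delta)) / (2m))."""
    return math.sqrt((kl + math.log(1.0 / delta)) / (2.0 * m))

# Hypothetical magnitudes for illustration only: a semantic prior with a
# small KL term versus a parameter-space prior with a large one, at the
# same population size m = 20.
gap_semantic = pac_bayes_gap(kl=2.0, m=20)
gap_parameter = pac_bayes_gap(kl=200.0, m=20)
```

Under these illustrative numbers, the semantic prior's gap is several times smaller, which is exactly the mechanism the comparison below formalizes.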
The core of our theoretical analysis lies in the comparison of the KL divergence term between semantic-space exploration (LLM-TOC) and traditional parameter-space exploration:
1. Traditional Parameter-Space Exploration: For conventional MARL methods, the prior P_param is an isotropic Gaussian distribution over the high-dimensional neural network weights. The set of weight configurations that exactly represent a specific semantic logical strategy is a measure-zero subset of the continuous weight space. The prior P_param therefore assigns almost zero probability to the valid semantic strategies that constitute the real-world test distribution, resulting in an extremely large KL divergence D_KL(Q_param‖P_param) when fitting complex social behaviors.
2. LLM-TOC Semantic-Space Exploration: In our framework, the prior P_sem is induced by the LLM and its pre-trained knowledge of code, logical reasoning, and human social behaviors. This prior assigns significantly higher probability to syntactically valid, logically consistent policy code that aligns with the distribution of real-world human-like social strategies. As a result, the KL divergence D_KL(Q_sem‖P_sem) is strictly smaller than that of parameter-space exploration:
D_KL(Q_sem‖P_sem) ≪ D_KL(Q_param‖P_param).
Substituting this into the PAC-Bayes inequality, we directly conclude that LLM-TOC yields a strictly tighter generalization error bound than conventional parameter-space exploration methods. This provides a rigorous theoretical rationale for the superior zero-shot generalization performance of our method observed in the experiments: the semantic curriculum generated by LLM-TOC provides much better coverage of the real-world test distribution of opponent behaviors, enabling the student agent to learn a policy that generalizes robustly to unseen OOD counterparts.

4.5.4. Computational Complexity Analysis

We systematically analyze the computational time complexity of the LLM-TOC framework, and compare it with mainstream baseline methods to theoretically verify its training and inference efficiency advantages. We first define the key variables involved in the complexity analysis: K: number of outer-loop iterations; T_in: number of environment interaction steps per inner-loop iteration; B: batch size of the PPO update in the inner loop; |θ|: number of parameters of the student agent's neural network; T_eval: number of steps per evaluation rollout for gradient saliency calculation; H, W: height and width of the agent's observation space; and C_LLM: time complexity of a single LLM inference query for policy code generation.
Overall Training Complexity. The overall training time complexity of LLM-TOC consists of three independent components: inner-loop MARL training, outer-loop gradient saliency calculation, and LLM policy generation. The total complexity is formalized as
O( K · ( T_in · B · |θ| + T_eval · H · W + C_LLM ) ).
We analyze each component in detail:
1. Inner-Loop MARL Training Complexity: The core of the inner loop is the PPO update for the student agent, with a time complexity of O(T_in · B · |θ|) per outer-loop iteration. This is identical to the standard MAPPO/IPPO baselines, as we use the same network architecture and PPO update rule for the student agent. Critically, the targeted adversarial curriculum generated by our framework significantly reduces the number of outer-loop iterations K and inner-loop steps T_in required to reach the target performance, yielding a total training-step reduction of over 60% compared to mainstream baselines, as verified in our experiments.
2. Gradient Saliency Calculation Complexity: The saliency-map computation involves a single backpropagation of the value function per critical frame, with a complexity of O(T_eval · H · W) per outer-loop iteration. Since T_eval ≪ T_in (we perform only 5 evaluation rollouts per iteration) and the observation size is H × W = 11 × 11 in our implementation, this overhead is negligible compared to the inner-loop RL training cost.
3. LLM Query Complexity: We perform only one LLM query per outer-loop iteration, with a complexity of O(C_LLM). Unlike online LLM-based methods that require LLM inference at every environment step, our framework invokes the LLM only offline during training. With K = 10 outer-loop iterations, the entire training process issues only 10 LLM queries; the cumulative LLM overhead is less than 0.5% of the total training computational cost, which is negligible.
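A back-of-envelope calculation illustrates why the saliency term is negligible relative to the inner-loop term. The parameter count `n_params` is an illustrative assumption (the paper does not report it); the other numbers follow the configuration reported in Section 5.1.3.

```python
# Per-iteration cost terms from O(K * (T_in*B*|theta| + T_eval*H*W + C_LLM)).
T_in, B = 1_000_000, 1024          # inner-loop steps and PPO batch size (Sec. 5.1.3)
n_params = 500_000                 # assumed student network size (hypothetical)
T_eval, H, W = 5 * 1000, 11, 11    # 5 evaluation rollouts x 1000 steps, 11x11 obs

inner_cost = T_in * B * n_params       # dominant PPO training term
saliency_cost = T_eval * H * W         # one value-function backward per frame
overhead = saliency_cost / inner_cost  # relative saliency overhead
```

Under these assumptions the saliency overhead is on the order of 10⁻⁹ of the inner-loop cost, consistent with the claim that it is negligible.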
Test-Time Inference Complexity. During test-time inference, the LLM is completely removed from the pipeline, and only the lightweight student agent is deployed. The inference complexity per environment step is O ( | θ | ) , which is identical to pure MARL baselines such as MAPPO, and orders of magnitude lower than online LLM-based methods that require LLM inference at every step. This makes our framework fully suitable for real-time high-frequency multi-agent interaction scenarios, where online LLM inference would introduce prohibitive latency.
Complexity Comparison with Baselines. We summarize the complexity comparison between LLM-TOC and mainstream baselines in Table 1, which theoretically verifies the efficiency advantages of our framework:
This comparison confirms that LLM-TOC combines the low inference latency of pure MARL methods with the semantic reasoning capability of LLMs, while achieving significantly higher training efficiency than both traditional MARL baselines and online LLM-based methods.

5. Experiments

The data presented in this study are openly available in GitHub (LLM-TOC, main branch) at https://github.com/vcis-wangchenxu/LLM-TOC.git (accessed on 8 February 2026). Publicly available environments and APIs were utilized in this study: DeepMind Melting Pot (main branch, https://github.com/google-deepmind/meltingpot, accessed on 8 February 2026) and the Aliyun Qianwen API (qwen-plus-2025-12-01, https://bailian.console.aliyun.com, accessed on 8 February 2026).

5.1. Experimental Setup

5.1.1. Evaluation Benchmark: Melting Pot

To rigorously evaluate zero-shot generalization in open-ended multi-agent systems, we utilize Melting Pot 2.0 [6], a benchmark suite specifically designed to test social intelligence against unfamiliar co-players. We select four diverse substrates that cover distinct social dilemmas:
1. collaborative_cooking_asymmetric: A coordination task requiring role specialization and synchronized action sequences to complete recipes.
2. prisoners_dilemma_in_the_matrix_repeated: A classic social dilemma testing the agent's ability to maintain cooperation against defection risks over repeated interactions.
3. running_with_scissors_in_the_matrix_arena: A spatially complex, cyclical resource-competition game (Rock–Paper–Scissors dynamics) in which agents must identify and counter opponent strategies.
4. running_with_scissors_in_the_matrix_repeated: A repeated version of the arena task, emphasizing long-term memory and reciprocity.
We adopt a strictly zero-shot generalization (ZSG) evaluation protocol: agents are trained on the base substrates and evaluated on held-out test scenarios. These test scenarios introduce focal-population mismatches or specific opponent behaviors not encountered during training, serving as a robust testbed for OOD generalization. In Appendix B, we provide a detailed description of the training and evaluation environments.

5.1.2. Baselines

We benchmark LLM-TOC against three categories of methods to validate its effectiveness:
Standard MARL (IPPO and MAPPO) [20,21]: Independent PPO and Multi-Agent PPO trained via self-play. These represent standard reinforcement learning baselines that typically overfit to the training population and struggle with OOD partners.
Hypothetical Minds [15]: A state-of-the-art method that utilizes a frozen Large Language Model for run-time decision-making based on visual descriptions. This baseline represents the capability of direct LLM inference in social scenarios.
Oracle PPO (Skyline): A PPO agent trained directly on the test scenarios. This serves as the theoretical performance upper bound (Skyline), indicating the maximum achievable return if the test distribution were known in advance.

5.1.3. Implementation Details

We provide a comprehensive description of all implementation details and hyperparameters below to ensure full reproducibility of our experiments. All experiments are conducted on a server with 2 Intel Xeon Gold 6330 CPUs, 4 NVIDIA RTX 3090 GPUs, and 128 GB of RAM, with PyTorch 2.2.0 as the deep learning framework.
Observation Processing. Instead of feeding raw RGB pixels directly to the policy network, we adopt a semantically structured observation space inspired by Hypothetical Minds to enhance sample efficiency and align with the LLM reasoning granularity. Specifically, the raw 88 × 88 pixel input is discretized into an 11 × 11 grid, where each 8 × 8 pixel block (sprite) is mapped to a categorical feature channel representing the object type. To capture the temporal dynamics, we stack the feature maps from the last k = 4 frames. Consequently, the input to the student agent is a tensor of shape ( 11 × 11 × ( C × k ) ) , where C = 12 denotes the number of distinct object categories in the Melting Pot substrates. This representation preserves spatial structure while providing explicit semantic information.
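The observation pipeline above can be sketched as follows. This is an illustrative reconstruction: `sprite_to_class` is a hypothetical lookup mapping an 8×8 sprite block to its object-class id, not the paper's actual mapping.

```python
import numpy as np

def discretize_observation(rgb_frames, sprite_to_class, n_classes=12, k=4):
    """Sketch of the structured observation: each 88x88 frame is split into
    11x11 blocks of 8x8 sprites, each block is mapped to a categorical object
    class (one-hot over n_classes channels), and the feature maps of the last
    k frames are stacked into an (11, 11, n_classes * k) tensor."""
    stacked = []
    for frame in rgb_frames[-k:]:
        grid = np.zeros((11, 11, n_classes), dtype=np.float32)
        for i in range(11):
            for j in range(11):
                block = frame[8 * i:8 * (i + 1), 8 * j:8 * (j + 1)]
                grid[i, j, sprite_to_class(block)] = 1.0  # one-hot object class
        stacked.append(grid)
    return np.concatenate(stacked, axis=-1)  # shape (11, 11, n_classes * k)
```

With C = 12 categories and k = 4 stacked frames this yields the (11 × 11 × 48) tensor fed to the student agent.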
Student Agent Network Architecture and PPO Training Configuration. The student agent (follower) utilizes a 2-layer convolutional neural network (CNN) encoder to process the spatial feature maps, followed by a 1-layer LSTM to handle partial observability, and two separate linear heads for policy and value function prediction. The detailed network structure is as follows:
  • CNN encoder: 2 convolutional layers with 32 and 64 output channels, respectively, kernel size 3 × 3, stride 1, padding 1, each followed by a ReLU activation.
  • LSTM layer: hidden state dimension of 256, with layer normalization applied to the input.
  • Output heads: two linear layers with 64 hidden units and ReLU activation, mapping the LSTM output to the discrete action space (dimension 8) and the scalar value function, respectively.
The policy is optimized using the Proximal Policy Optimization algorithm, with the full hyperparameter configuration reported in Table A2 (Appendix A). The core training settings are as follows:
  • Learning rate: 3 × 10⁻⁴, with linear decay over the training process to a final learning rate of 3 × 10⁻⁵.
  • PPO clip parameter: 0.2, with no gradient penalty applied to the value function.
  • Generalized Advantage Estimation (GAE): discount factor γ = 0.99, GAE lambda λ = 0.95.
  • Training batch size: 1024 timesteps per update, with 4 PPO epochs per batch and a mini-batch size of 256.
  • Entropy coefficient: 0.01, added to the loss function to encourage exploration.
  • Gradient clipping: maximum L2 norm of the gradient set to 0.5 to stabilize training.
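For concreteness, the per-sample clipped surrogate used in the PPO update can be written as a one-liner; this is a generic sketch of the standard objective with the clip parameter ε = 0.2 from our configuration, not code from the released implementation.

```python
def ppo_clip_term(ratio, advantage, eps=0.2):
    """Per-sample clipped surrogate: min(rho * A, clip(rho, 1-eps, 1+eps) * A),
    where rho is the importance ratio pi_new(a|s) / pi_old(a|s)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)
```

The min with the clipped term removes the incentive to move the ratio outside [1 − ε, 1 + ε], which is what stabilizes the inner-loop best-response training.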
Training Process Configuration. The overall training process consists of 10 outer-loop iterations (K = 10), with the following settings for each iteration:
  • Inner-loop training steps: T_in = 1 × 10⁶ environment interaction steps per outer-loop iteration, with early stopping enabled if the student agent's performance on the current population converges.
  • Total maximum training steps: 1 × 10⁷ environment interaction steps across all outer-loop iterations, consistent with the training budget of all baseline methods.
  • Evaluation rollouts: 5 independent evaluation rollouts per opponent policy in the population, with a maximum episode length of 1000 timesteps per rollout, used to identify the worst-case opponent for gradient saliency analysis.
  • Gradient saliency threshold ϵ_thresh: set to the 90th percentile of value surprise in each evaluation batch, to filter critical failure frames with significant value fluctuations.
LLM Semantic Oracle Configuration. In the outer loop, we employ qwen-plus-2025-12-01 (provided by the Alibaba Cloud Bailian platform) as the semantic oracle to generate adversarial policies. The full inference and prompt generation settings are as follows:
  • Inference hyperparameters: sampling temperature 0.7, top_p 0.9, maximum generation length 1024 tokens, greedy decoding disabled, repetition penalty 1.05.
  • Prompt regeneration rule: the structured prompt P_k is regenerated from scratch at each outer-loop iteration k. The fixed components (role instruction I_role and API constraints I_API) are reused across all iterations to ensure consistency of the generation format and objective, while the core semantic diagnosis component D_sem^(k) is completely updated based on the student agent's latest failure modes in the current iteration, with no cross-iteration reuse of diagnosis content.
  • Generation attempts: at most 2 generation attempts per outer-loop iteration. Syntactically invalid code is automatically filtered out via compilation validation, and functionally ineffective policies (those that do not reduce the student agent's return) are discarded via functional validation. No few-shot examples are included in the prompt, to avoid manual bias and ensure the generalizability of the generation mechanism. The full prompt template is provided in Appendix C.
Reproducibility Settings. All experiments are conducted with 4 independent random seeds (1, 42, 100, and 2026) for the student agent initialization, environment sampling, and LLM generation. All random number generators (PyTorch (2.5.0), NumPy (1.28.2), and Python (3.11) native) are initialized with the corresponding seed before each experiment to ensure full reproducibility.

5.1.4. Evaluation Metrics

To rigorously quantify the zero-shot generalization capability, training efficiency, and module contributions of the proposed LLM-TOC framework, we adopt four core evaluation metrics, each with an explicit mathematical definition and selection rationale:
1. Expected Cumulative Collective Return: The primary metric for evaluating the zero-shot generalization performance of agents, and the standard evaluation indicator for the Melting Pot benchmark. It is defined as the expected discounted cumulative return of the student agent in the multi-agent test scenario:
J(π_E, π_{−E}^test) = E_{τ∼(π_E, π_{−E}^test)} [ Σ_{t=0}^∞ γ^t R_E(s_t, a_{E,t}, a_{−E,t}) ],
where π_E is the policy of the student agent, π_{−E}^test is the policy of unseen OOD test partners, γ is the discount factor, and R_E is the environment-defined reward function of the student agent.
This metric directly reflects the overall task performance of the agent in the target multi-agent scenario, and is consistent with the evaluation protocol of mainstream MARL work on the Melting Pot benchmark. For each test scenario, we compare LLM-TOC with all baselines under identical environment and reward settings to ensure fairness. All results are averaged over 4 random seeds, with standard deviation reported as the uncertainty interval in the learning curves.
2. Relative Performance to Oracle PPO: This metric eliminates the impact of inconsistent reward scales across Melting Pot substrates, enabling fair cross-scenario comparison. It is defined as the ratio of the agent's expected cumulative collective return to that of the Oracle PPO agent in the same test scenario:
R_rel = J(π_E, π_{−E}^test) / J_oracle,
where J(π_E, π_{−E}^test) is the expected cumulative collective return of the evaluated agent, and J_oracle is that of the Oracle PPO agent (trained directly on the test scenario, serving as the theoretical performance upper bound). The reward scales of the four selected substrates differ significantly, making direct comparison of raw collective returns across scenarios impossible.
3. Training Steps to Target Performance: This metric quantifies the training efficiency and computational cost of the algorithm, defined as the minimum number of environment interaction steps required for the agent to reach a predefined target relative performance threshold in the held-out test scenarios. It directly reflects the sample efficiency and convergence speed of the algorithm, is the core quantitative indicator behind our claim of a training-cost reduction of more than 60%, and is a standard efficiency metric in MARL curriculum learning work.
4. Relative Performance Drop in Ablation Study: This metric evaluates the independent contribution of each core component of LLM-TOC, defined as the percentage performance decline of an ablation variant relative to the full framework:
Drop_rel = (R_rel^full − R_rel^ablation) / R_rel^full × 100%,
where R_rel^full is the relative performance to Oracle PPO of the full LLM-TOC framework and R_rel^ablation is that of the corresponding ablation variant.
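The two ratio metrics reduce to one-line computations; the sketch below (with hypothetical return values) simply makes the definitions concrete.

```python
def relative_performance(j_agent, j_oracle):
    """R_rel: the agent's return normalized by the Oracle PPO skyline."""
    return j_agent / j_oracle

def ablation_drop(r_full, r_ablation):
    """Drop_rel = (R_rel_full - R_rel_ablation) / R_rel_full * 100%."""
    return (r_full - r_ablation) / r_full * 100.0

# Hypothetical example: an agent scoring 60 where the oracle scores 100
# attains R_rel = 0.6; an ablation falling from 0.8 to 0.6 is a 25% drop.
r = relative_performance(60.0, 100.0)
drop = ablation_drop(0.8, 0.6)
```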

5.2. Results and Analysis

We evaluate the zero-shot generalization performance of LLM-TOC and the baselines on four challenging Melting Pot scenarios. The learning curves, depicting the collective return on held-out test scenarios over 10 million training steps, are presented in Figure 3; all results are averaged over 4 independent random seeds, with standard deviation reported as the uncertainty interval. The standard deviation of LLM-TOC's final normalized performance across the 4 seeds is less than 5% of the mean for all test scenarios, significantly lower than that of all baselines, demonstrating strong robustness to random initialization. Detailed variance statistics for all methods are provided in Section 5.4.
The normalized final zero-shot generalization performance (at 10M training steps) of all methods across all test substrates is summarized in Table 2, where the scores are normalized by the Oracle PPO upper bound (1.0) for fair cross-scenario comparison.
This table enables standardized quantitative comparison of all methods across the four substrates, which have distinct native reward scales and game dynamics. For the full learning trajectory and convergence dynamics of each method in the native task setting, please refer to the learning curves in Figure 3.
LLM-TOC (red solid line) consistently achieves superior performance compared to both standard MARL baselines and the inference-based method across all domains. In the Collaborative Cooking and Running with Scissors tasks, standard baselines like MAPPO and IPPO fail to generalize effectively, often converging to suboptimal policies (relative performance to Oracle PPO < 40 % ) due to overfitting to the training population. In contrast, LLM-TOC maintains a steady upward trajectory, eventually reaching 75–85% of the Oracle PPO performance (green dashed line). This validates that our bi-level curriculum effectively exposes the student agent to a diverse range of semantic strategies, preventing brittle adaptation to specific partners.
A key advantage of LLM-TOC is its ability to accelerate the discovery of robust strategies. Unlike PBT methods that rely on random parameter perturbations, our semantic oracle directly synthesizes high-value adversarial policies. As evidenced by the steep initial slope in Figure 3, LLM-TOC reaches the convergence performance of the strongest baseline (Hypothetical Minds) in significantly fewer steps. Quantitatively, to reach a relative performance of 0.6 against Oracle PPO, LLM-TOC requires only 3.5 × 10⁶ RL environment interaction steps, whereas the MAPPO baseline requires over 9 × 10⁶ steps (and IPPO never reaches the target within 10M steps). This translates to a reduction of over 60% in total environment interaction steps, the core metric of training computational cost in MARL, and confirms that semantic guidance serves as a highly efficient “compass” in the vast strategy search space.
The superior sample efficiency and generalization performance of LLM-TOC can be directly attributed to the targeted curriculum evolution driven by our Gradient Saliency Feedback mechanism, which aligns perfectly with our theoretical analysis of search complexity reduction in Appendix D. Unlike unguided population-based training methods that perform random exploration in the high-dimensional parameter space, our gradient-based diagnosis mechanism pinpoints the exact failure modes of the student agent at each training iteration, translating pixel-level value fluctuations into semantically meaningful causal cues. This allows the LLM to perform “semantic gradient ascent” in the strategy space, directly generating adversarial policies that target the student’s current vulnerabilities, rather than blindly sampling random strategies. As a result, each newly added policy to the population provides the maximum learning signal for the student agent, which explains why our framework reduces the training steps required to reach the target performance by over 60% compared to mainstream baselines.
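The bi-level loop described here can be sketched in a few lines. This is a minimal illustration under our own naming: `train`, `diagnose`, `oracle`, and `regret` are hypothetical stand-ins for the inner-loop MARL trainer, the Gradient Saliency Feedback diagnosis, the LLM policy generator, and the regret evaluator, not the released implementation.

```python
# Sketch of the LLM-TOC bi-level curriculum loop (illustrative only).

def run_llm_toc(student, population, train, diagnose, oracle, regret,
                K=10, inner_steps=1_000_000):
    """Outer loop: the semantic oracle (LLM) grows an adversarial
    population; inner loop: the student minimizes regret against it."""
    for _ in range(K):
        # Inner loop: MARL training against the fixed population.
        train(student, population, steps=inner_steps)
        # Gradient Saliency Feedback: value fluctuations -> semantic diagnosis.
        diagnosis = diagnose(student, population)
        # Outer loop: synthesize a targeted adversary in code space.
        candidate = oracle(diagnosis)
        # Keep only policies that actually induce positive regret.
        if candidate is not None and regret(student, candidate) > 0:
            population.append(candidate)
    return student, population
```

Each retained candidate targets the student's current blind spots, which is why every addition to the population carries a strong learning signal.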
The Gradient Saliency Feedback mechanism is also the core driver of the high semantic diversity of the generated policy population, which directly addresses the mode collapse problem of traditional self-play and population-based methods. As shown in Figure 4, the LLM guided by gradient-based diagnosis generates a diverse set of semantic strategies, covering distinct behavioral modes that are rarely discovered by parameter-space exploration. This is because the saliency feedback continuously reveals new blind spots of the student agent throughout the training process, guiding the LLM to explore previously uncovered regions of the semantic strategy space. The monotonically increasing population diversity across training iterations (verified in Section 5.4) ensures that the student agent is continuously challenged by novel, semantically distinct strategies, thereby expanding its robust strategy hull and avoiding overfitting to a narrow set of Nash equilibria. This empirical observation directly validates our theoretical claim that semantic-space exploration yields a tighter generalization bound than parameter-space exploration, as the diverse population generated by our framework provides much better coverage of the real-world strategy distribution.
In our implementation, we set the total number of outer-loop iterations K = 10 . For each iteration, we perform only one LLM API call, with an average of 800 input tokens and 400 output tokens (1200 total tokens per call). The total token consumption for the entire training process is only 12,000 tokens, and the corresponding API cost is negligible (less than 0.5% of the total computational cost of RL training).
The Oracle PPO represents the performance upper bound where agents are trained directly on the test scenarios. LLM-TOC significantly narrows the gap between zero-shot agents and this oracle boundary compared to other methods. For instance, in the complex Prisoner’s Dilemma task, where cooperation requires identifying subtle defection cues, LLM-TOC is the only method that establishes stable cooperation comparable to the Oracle, whereas baselines devolve into mutual defection (low returns).
To better understand why LLM-TOC achieves superior zero-shot robustness, we analyze the semantic diversity of the generated opponent populations and the interpretability of the curriculum evolution process.
A critical failure mode in self-play is the convergence to a narrow set of Nash equilibria, often resulting in “mode collapse” where agents only learn to coordinate with compliant partners. The strategy categories in Figure 4 are determined via an LLM-assisted classification protocol with manual verification: (1) we first define five pre-specified semantic strategy categories (Free-Rider, Saboteur, Opportunist, Collaborative, and Random/Noisy) based on game theory and social-behavior taxonomy; (2) for each generated policy, we feed the policy code and its behavioral trajectory in the environment to the LLM for automatic classification; (3) all classification results are manually verified and corrected to ensure accuracy. The MAPPO self-play checkpoints are classified with the same procedure based on their behavioral trajectories in the environment, ensuring consistent classification criteria.
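Step (2) of this protocol can be sketched as a prompt builder. The prompt wording and function names below are assumptions for illustration; the actual prompt used in our experiments may differ.

```python
# Illustrative prompt construction for the LLM-assisted classifier.

CATEGORIES = ["Free-Rider", "Saboteur", "Opportunist",
              "Collaborative", "Random/Noisy"]

def build_classification_prompt(policy_code, trajectory_summary):
    """Ask the LLM to map a generated policy (code + behavior) to one of
    the five pre-specified semantic strategy categories."""
    return (
        f"Classify the policy into exactly one of: {', '.join(CATEGORIES)}.\n\n"
        f"Policy code:\n{policy_code}\n\n"
        f"Behavioral trajectory summary:\n{trajectory_summary}\n\n"
        "Answer with the category name only."
    )
```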
It should be explicitly acknowledged that the two groups of strategies compared in this figure differ in kind: the adversaries generated by LLM-TOC are rule-based policies represented in executable code space, while the self-play opponents of MAPPO are neural-network-parameterized policies. The purpose of this comparison is to show the difference in semantic strategy diversity produced by the two exploration paradigms (semantic code-space exploration vs. parameter-space exploration), rather than a direct head-to-head comparison of the two algorithms themselves.
Interpretability via Visual-Semantic Alignment. Unlike black-box adversarial generation, LLM-TOC provides transparent insights into why a student agent fails. Figure 5 demonstrates this diagnosis loop. In a specific failure case in Running with Scissors, the agent froze and failed to collect resources. While a standard value loss only indicates that performance dropped, our Gradient Saliency mechanism (Figure 5, Middle) reveals the cause: the agent’s attention was over-fixated on a distant opponent (Red) rather than the immediate goal. The LLM leverages this “visual evidence” to generate a precise causal explanation (Figure 5, Right) and subsequently synthesizes a training opponent that specifically exploits this distraction. This explicit causal link between visual attention, semantic diagnosis, and code generation is the core driver of our method’s data efficiency.
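As a concrete illustration of the attention-score computation, the following sketch uses the symbols G_t, M_t, I_obj, and Score_obj from Table A1. Gradients are assumed precomputed (e.g., by autograd), plain nested lists stand in for tensors, and the exact aggregation in our implementation may differ.

```python
# G_t -> M_t -> Score_obj: channel-aggregated saliency and per-object scores.

def saliency_map(grad):
    """M_t: sum of absolute per-channel gradients, grad[c][i][j]."""
    C, H, W = len(grad), len(grad[0]), len(grad[0][0])
    return [[sum(abs(grad[c][i][j]) for c in range(C)) for j in range(W)]
            for i in range(H)]

def attention_score(saliency, mask):
    """Score_obj: mean saliency over an object's binary mask I_obj."""
    total = count = 0
    for i, row in enumerate(mask):
        for j, inside in enumerate(row):
            if inside:
                total += saliency[i][j]
                count += 1
    return total / count if count else 0.0
```

Comparing Score_obj across object classes (e.g., distant opponent vs. nearby resource) yields the kind of “over-fixation” diagnosis shown in Figure 5.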

5.3. Ablation Study

To disentangle the contributions of the key components in LLM-TOC, we conduct ablation studies by creating two variants: (1) w/o Saliency: Removes the Gradient Saliency Feedback, relying on generic prompts to guide the semantic oracle; (2) w/o Code Gen (PBT): Replaces the LLM-based code generation with standard population-based training that optimizes opponent policies in the parameter space. Figure 6 visualizes the impact of these components on training efficiency and final robust performance.
Comparing the full method (Red) with the w/o Saliency variant (Blue) in Figure 6, we observe that explicit causal diagnosis via Gradient Saliency Feedback is critical for both sample efficiency and final performance. We provide quantitative validation of the efficiency improvement brought by the saliency-guided mechanism in Table 3, which compares the core performance metrics of the full method and the w/o Saliency variant across all test scenarios.
The quantitative results demonstrate that saliency-guided curriculum generation reduces the training steps required to reach the target performance by more than 50%, directly verifying that our mechanism significantly improves the agent’s learning efficiency. Without saliency feedback, the LLM must blindly guess the student agent’s potential weaknesses, leading to a 133% slower convergence rate and an average relative performance drop of ∼18%, as the generated opponents are less targeted and fail to exploit subtle vulnerabilities. This quantitative evidence strongly validates the effectiveness of our Gradient Saliency Feedback mechanism.
From a theoretical perspective, the significant performance degradation of the w/o Saliency variant can be attributed to the loss of the “semantic gradient” that guides LLM policy generation. Without the causal diagnosis provided by gradient saliency, the LLM can only perform random search in the vast semantic strategy space, sharply lowering the probability of generating valid adversarial policies that exploit the student’s subtle vulnerabilities. This undermines the core design of our framework, which reduces the search complexity of the outer-loop optimization via targeted feedback. Our theoretical analysis in Appendix D shows that the saliency-guided mechanism reduces the sampling complexity of valid adversarial policies by a factor of 1/P(c) (where P(c) ≪ 1), consistent with the 133% slower convergence and the 51.4% additional training steps observed for the w/o Saliency variant.
The most significant performance drop occurs when the code generation is removed entirely (w/o Code Gen, orange). As shown in Figure 6, agents trained via parameter-space optimization saturate at a low performance level (∼0.4). Parameter-space exploration struggles to traverse the vast strategy manifold to find distinct behavioral modes, often collapsing to simple, aggressive policies. In contrast, the Turing-complete code space allows LLM-TOC to construct logically complex and semantically diverse scenarios, pushing the student agent to learn robust, generalized behaviors.
The catastrophic performance drop of the w/o Code Gen variant provides strong empirical evidence for our theoretical claim that parameter-space exploration cannot adequately cover the semantic strategy space Π s e m . Traditional population-based training operates in the continuous parameter space of neural networks, where the set of weight configurations that represent distinct semantic strategies is a measure-zero subset. This leads to severe mode collapse, as shown in Figure 4b, where the population is dominated by simple collaborative or random behaviors. In contrast, our LLM-based code generation directly explores the Turing-complete semantic strategy space, generating logically complex and behaviorally distinct policies that are inaccessible to parameter-space exploration. This fundamental difference explains why the w/o Code Gen variant saturates at a low performance level, while the full LLM-TOC framework continuously improves and approaches the oracle performance bound.
We further performed a paired two-tailed t-test to verify the statistical significance of the performance difference between the full LLM-TOC framework and the w/o Saliency variant, across all four test substrates and four random seeds. The test result shows that the performance gap is statistically significant ( p < 0.05 ), which strongly validates that the performance improvement brought by the Gradient Saliency Feedback mechanism is not due to randomness but to the targeted causal diagnosis provided by the mechanism.
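For reference, the paired statistic can be computed without external libraries as below. This is a sketch: with 16 substrate-seed pairs, df = 15, and the two-tailed 5% critical value is approximately 2.131, so |t| above that threshold implies p < 0.05.

```python
import math

def paired_t_statistic(a, b):
    """t statistic of a two-tailed paired t-test on matched samples."""
    assert len(a) == len(b) and len(a) > 1
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)
```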

5.4. Supplementary Empirical Analysis of the Semantic Oracle Assumption

The ϵ -approximate semantic oracle assumption is the core premise of our theoretical convergence and generalization bound analysis. Here, we provide empirical validation of this assumption through quantitative statistics of LLM generation performance across all training iterations and random seeds, to bridge the gap between theoretical analysis and practical implementation.
We first define two core metrics to quantify the performance of the LLM semantic oracle:
  • Generation Success Rate: the proportion of generated policies that pass both syntactic compilation validation and functional validation across all outer-loop iterations.
  • Policy Quality Consistency: the coefficient of variation (CV) of the student agent’s return reduction induced by the generated policies across four independent random seeds, which quantifies the stability of the LLM generation quality.
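The two oracle metrics defined above can be computed as follows. The tuple layout and the numbers in the usage example are invented for illustration, not taken from our logs.

```python
# Illustrative computation of the two semantic-oracle metrics.

def generation_success_rate(results):
    """Fraction of policies passing both validations.
    results: iterable of (passed_compilation, passed_functional) bools."""
    results = list(results)
    ok = sum(1 for compiled, effective in results if compiled and effective)
    return ok / len(results)

def coefficient_of_variation(values):
    """CV = sample standard deviation / mean of per-seed return reductions."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return (var ** 0.5) / mean
```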
We statistically analyze the generation results of 10 outer-loop iterations across four random seeds, with the results presented in Table 4.
The empirical results show that the LLM maintains a consistently high generation success rate (over 90%) across all iterations and seeds, indicating that it can reliably generate valid adversarial policies targeting the student’s weaknesses, which aligns with the core requirement of the ϵ -approximate semantic oracle assumption. The low coefficient of variation demonstrates that the quality of the generated policies is highly consistent across different random seeds, verifying the stability of the LLM oracle capability in our framework.
We further analyze the 7.5% of failed generation cases and find that all failures fall into two categories: (1) syntactic errors in the generated code that cause compilation failure, which are automatically filtered out by our validation mechanism; (2) generated policies that are valid but do not reduce the student’s return significantly, which are discarded in the functional validation step and never added to the training population. These failed cases do not affect the robustness of the training process, as the filtering mechanism ensures that only valid, high-quality policies are included in the curriculum.
We further quantify the stability of the curriculum evolution process by measuring the cumulative diversity of the generated policy population across iterations. We use the number of distinct semantic strategy types as the diversity metric. The results show that the population diversity increases monotonically across iterations for all random seeds, with no significant degradation in generation diversity in the late training phase, verifying the long-term stability of the curriculum generation process.
We further provide detailed engineering statistics of the LLM code generation process across all 10 outer-loop iterations and four random seeds, to validate the practical reliability of the semantic oracle:
1. Compilation Failure Rate: The average syntactic compilation failure rate of the generated code is 4.2% ± 1.8% across all iterations. All syntactically invalid code is automatically filtered out by our validation mechanism and is never added to the training population.
2. Generation Attempts per Iteration: We perform at most two generation attempts per outer-loop iteration. In 95.8% of iterations, the first attempt produces syntactically valid code that passes compilation; only 4.2% of iterations require a second attempt, and no iteration requires more than two.
3. Proportion of Non-Trivial Valid Policies: Among all syntactically valid policies, 88.3% ± 2.5% are non-trivial adversarial policies that significantly reduce the student agent’s return (passing functional validation) and are added to the training population. The remaining 11.7% of valid but ineffective policies are filtered out by the functional validation step.
These statistics demonstrate that the LLM can reliably generate valid, high-quality adversarial policies with minimal generation attempts, which fully aligns with the core requirement of the ϵ -approximate semantic oracle assumption. The small proportion of failed/ineffective generations can be completely filtered out by our two-step validation mechanism, without affecting the convergence of the algorithm.
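The two-step validation described above can be sketched as a single filter. `evaluate_return` is a hypothetical rollout evaluator, and the 5% minimum return drop is an assumed threshold, not a value reported in the paper.

```python
# Sketch of the two-step policy filter: syntactic, then functional.

def validate_policy(code_str, baseline_return, evaluate_return,
                    min_relative_drop=0.05):
    """Step 1: syntactic validation via compilation.
    Step 2: functional validation: keep only policies that reduce the
    student's return by at least `min_relative_drop` (assumed threshold)."""
    try:
        compile(code_str, "<generated_policy>", "exec")
    except SyntaxError:
        return False  # filtered out, never added to the population
    new_return = evaluate_return(code_str)
    drop = (baseline_return - new_return) / abs(baseline_return)
    return drop >= min_relative_drop
```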

6. Conclusions, Limitations and Future Work

In this work, we proposed LLM-TOC, a novel framework that bridges the gap between the sample efficiency of reinforcement learning and the semantic reasoning capabilities of Large Language Models to tackle the challenge of zero-shot generalization in open-ended multi-agent systems. By formalizing the problem as a bi-level Stackelberg game within a Turing-complete code space, we overcame the mode collapse inherent in parameter-space exploration. Crucially, our proposed Gradient Saliency Feedback mechanism effectively grounds the symbolic reasoning of the LLM in the pixel-level causality of the environment, enabling the synthesis of targeted, high-value adversarial strategies without requiring dense reward engineering. Theoretical analysis via the PAC-Bayes framework guarantees that our semantic exploration yields tighter generalization bounds and faster convergence rates (O(1/K)). Empirical results on the challenging Melting Pot benchmark demonstrate that LLM-TOC not only achieves state-of-the-art zero-shot robustness against out-of-distribution partners but also reduces training costs by over 60% compared to standard population-based training methods. These results validate the effectiveness of our framework in complex multi-agent coordination and competition scenarios, providing a promising direction for building generalizable multi-agent agents.
Despite the promising results, our current framework presents several limitations that open avenues for future research:
Dependence on Proprietary LLMs and Theoretical Assumption Constraints. The performance of the semantic oracle and the validity of our theoretical convergence guarantees rely heavily on the reasoning and coding capabilities of state-of-the-art models, which is directly related to the ϵ -approximate semantic oracle assumption. Our preliminary tests indicate that smaller, open-source models may struggle with complex causal diagnosis and valid code generation, leading to a large approximation error that invalidates the theoretical convergence guarantee. In addition, our theoretical guarantees are established under the Bounded Payoff assumption, which is not applicable to non-episodic infinite-horizon tasks with unbounded rewards. Future Work: We plan to distill the capabilities of the large oracle into a smaller, domain-specific model through fine-tuning, thereby reducing deployment costs and privacy concerns, and extending the theoretical framework to adapt to unbounded reward scenarios.
Environment Compatibility. LLM-TOC requires the environment to support the dynamic injection of Python-based policies, which restricts its direct application to compiled binaries or environments with rigid APIs. Future Work: We aim to extend the framework to support abstract behavior trees or domain-specific languages (DSLs) that can act as a universal interface for a broader range of simulation platforms.
Static Robustness vs. Online Adaptation. While LLM-TOC produces a highly robust policy, the student agent’s parameters remain fixed during test-time evaluation. It does not actively update its strategy within the test episode to adapt to novel teammates. Future Work: Integrating LLM-TOC with meta-learning or in-context learning modules could enable agents that not only possess a robust prior but also continuously adapt to their social partners in real-time.

Author Contributions

Conceptualization, C.W.; Methodology, C.W., T.Y. and X.J.; Software, C.W., T.Y. and X.J.; Validation, C.W. and J.Z.; Resources, C.W.; Data curation, C.W., T.Y. and X.J.; Writing—original draft, C.W. and J.Z.; Writing—review and editing, C.W., L.X., J.Z. and Z.H.; Visualization, J.Y. and L.X.; Supervision, J.Y. and L.X.; Project administration, Z.H.; Funding acquisition, Z.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the following projects. National Natural Science Foundation of China (No. 62576046); National Natural Science Foundation of China (No. 62301066); National Natural Science Foundation of China (No. 62406028); Beijing Academy of Artificial Intelligence (Z251100008125041); Key Project of Philosophy and Social Sciences Research, Ministry of Education, China (No. 24JZD040); “the Fundamental Research Funds for the Central Universities 2023RC72”; Beijing University of Posts and Telecommunications, 2025YZ010.

Data Availability Statement

The data presented in this study are openly available in GitHub at https://github.com/vcis-wangchenxu/LLM-TOC.git (accessed on 8 February 2026). Publicly available environments and APIs were utilized in this study, which can be found here: DeepMind Melting Pot (https://github.com/google-deepmind/meltingpot (accessed on 8 February 2026)) and Aliyun Qianwen API (https://bailian.console.aliyun.com (accessed on 8 February 2026)).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AGI: Artificial General Intelligence
OOD: out-of-distribution
MAS: multi-agent systems
MARL: Multi-Agent Reinforcement Learning
SP: self-play
PPO: Proximal Policy Optimization
GAE: Generalized Advantage Estimation
LLM: Large Language Model
ToM: Theory of Mind
ZSC: Zero-Shot Coordination
IPPO: Independent Proximal Policy Optimization
MAPPO: Multi-Agent Proximal Policy Optimization
QMIX: Monotonic Value Decomposition for Multi-Agent Reinforcement Learning
VDN: Value Decomposition Networks
OP: Other-Play
FCP: Fictitious Co-Play
TrajeDi: Trajectory Diversity
LIPO: Latent Space Optimization
GITM: Ghost in the Minecraft
ACL: Automatic Curriculum Learning
UED: Unsupervised Environment Design
QD: Quality-Diversity
PLR: Prioritized Level Replay
OEL: Open-Ended Learning
POMG: Partially Observable Markov Game

Appendix A. List of Symbols

The symbols used in this paper and their definitions are listed in Table A1.
Table A1. Nomenclature and definition of symbols used in this paper.
Symbol: Definition
Partially Observable Markov Game (POMG)
G: The tuple defining the POMG: ⟨S, N, A, P, R, Ω, O, γ⟩
S: Global state space
N: Set of agents, partitioned into student (E) and others (−E)
A: Joint action space (A_E × A_{−E})
R_E: Reward function for the student agent
Ω_i: Set of partial observations for agent i
O_i: Observation function
γ: Discount factor
τ: Trajectory of states and actions
Policies and Strategies
π_E: Policy of the student agent (Follower)
θ: Parameters of the student policy (e.g., neural network weights)
π_{−E}: Joint policy of the other agents (Teammates/Opponents)
Π_net: Neural network parameter space
Π_sem: Semantic strategy space (discrete, code-based)
J(π_E, π_{−E}): Expected return of student π_E against π_{−E}
Regret(π_E, π_{−E}): Regret of student π_E against π_{−E}
V*(π_{−E}): Oracle performance (maximal achievable return) against π_{−E}
R_rel: Relative Performance to Oracle PPO
Drop_rel: Relative Performance Drop
LLM-TOC Framework
Φ: Semantic Oracle (Large Language Model)
P_k: Population of opponent strategies at iteration k
δ_t: Value surprise (absolute TD error) at timestep t
T_crit: Set of critical frames where δ_t > ϵ_thresh
V(s): Value function of the student agent
X_t: Input observation tensor (pixels/features) at timestep t
G_t: Gradient map of the value function w.r.t. input X_t
M_t: Aggregated 2D saliency map
I_obj: Binary mask for a specific object class
Score_obj: Attention score for a specific object class
D_sem^(k): Semantic diagnosis derived from gradient saliency
P_k: Structured prompt input to the LLM
π_code: Executable policy code generated by the LLM
Table A2. Full hyperparameter configuration of the LLM-TOC framework.
Observation Processing
  Grid size: 11 × 11
  Frame stack: 4
  Number of object channels: 12
Network Architecture
  CNN layers: 2 (32, 64 channels)
  CNN kernel size: 3 × 3, stride 1, padding 1
  LSTM hidden dimension: 256
  MLP hidden dimension: 64
  Action space dimension: 8
PPO Training
  Learning rate: 3 × 10⁻⁴ (linear decay to 3 × 10⁻⁵)
  Clip parameter: 0.2
  Discount factor γ: 0.99
  GAE lambda λ: 0.95
  Batch size: 1024 timesteps
  PPO epochs per batch: 4
  Mini-batch size: 256
  Entropy coefficient: 0.01
  Gradient clipping norm: 0.5
Training Process
  Number of outer-loop iterations K: 10
  Inner-loop steps per iteration: 1 × 10⁶
  Total maximum training steps: 1 × 10⁷
  Evaluation rollouts per policy: 5
  Maximum episode length: 1000 timesteps
  Gradient saliency threshold ϵ_thresh: 90th percentile of value surprise
LLM Configuration
  Model version: qwen-plus-2025-12-01
  Sampling temperature: 0.7
  top_p: 0.9
  Max generation tokens: 1024
  Repetition penalty: 1.05
  Max generation attempts per iteration: 2
Reproducibility
  Random seeds: 1, 42, 100, 2026

Appendix B. Environment Details and Evaluation Protocols

In this section, we provide detailed descriptions of the four Melting Pot environments used in our experiments, as shown in Figure A1. We explain the game mechanics of the base substrates (used for training) and characterize the held-out scenarios (used for zero-shot evaluation), highlighting the specific challenges posed by the unseen bot agents.
Figure A1. Visualizations of the four Melting Pot substrates used in our evaluation. (Left) Collaborative Cooking (asymmetric): Two agents (colored avatars) must pass tomatoes and dishes across a central divider to optimize soup delivery, testing role specialization. (Middle-left) Prisoner’s Dilemma (repeated): Agents collect “Cooperate” (green) or “Defect” (red) resources before interacting; the test requires identifying partners who punish defection. (Middle-right) Running with Scissors (arena): A large-scale 8-player map where agents compete in a spatially embedded Rock–Paper–Scissors game; effective generalization requires exploiting the specific resource biases of the background population. (Right) Running with Scissors (repeated): A dyadic version of the game focusing on long-term strategy adaptation against a single opponent.

Appendix B.1. Collaborative Cooking: Asymmetric

Substrate Mechanics: This environment models a cooperative kitchen task where two agents must coordinate to produce tomato soup. The cooking pipeline consists of four sequential steps: (1) interacting with the tomato station to pick up tomatoes, (2) placing three tomatoes into a cooking pot, (3) interacting with the cooking pot to plate the soup into a bowl, and (4) delivering the soup to a delivery station. The layout is asymmetrically designed with two distinct rooms. The left room has the delivery station nearby but the tomato station far away, while the right room has the tomato station nearby but the delivery station far away. This spatial asymmetry creates a strong incentive for role specialization: optimal efficiency is achieved when one agent focuses on fetching tomatoes and the other on delivery, passing items across the central counter.
  • Scenario (collaborative_cooking__asymmetric_0): Bot Agents: In the test scenario _0, the focal agent is paired with a specialized partner bot that rigidly adheres to a specific sub-policy (e.g., only fetching tomatoes or only delivering).
The Generalization Challenge: During self-play training on the substrate, agents typically learn to cover all tasks or establish an arbitrary convention with their clone. When tested on _0, the focal agent must correctly infer the rigid role of the partner (e.g., “My partner is not delivering”) and dynamically adapt its own behavior to fill the missing role (e.g., “I must switch to delivery”). Failure to infer and adapt leads to coordination breakdowns where both agents attempt the same task, resulting in zero collective return.

Appendix B.2. Prisoner’s Dilemma in the Matrix: Repeated

Substrate Mechanics: This substrate embeds the classic Prisoner’s Dilemma into a 2D grid world. Agents collect resources representing “Cooperate” (Green) or “Defect” (Red). When two agents interact via a “zap” beam, a payout is awarded based on the resources they hold, following the standard payoff matrix: Mutual Cooperation (3, 3), Defection vs Cooperation (5, 0), and Mutual Defection (1, 1). The interaction is repeated, allowing agents to build reputation or retaliate over long episodes.
  • Scenario (prisoners_dilemma_in_the_matrix__repeated_0): Bot Agents: The _0 scenario populates the background with agents executing classic game-theoretic strategies that were likely not present in the self-play training distribution, such as Grim Trigger (cooperate until the partner defects once, then defect forever) or Tit-for-Tat.  
The Generalization Challenge: A standard RL agent trained via self-play often converges to “Always Defect” (the Nash Equilibrium of the single-stage game). If such an agent defects against a “Grim Trigger” bot in the test scenario, it gains a small short-term reward but permanently destroys the possibility of long-term cooperation, leading to catastrophic long-term regret. Success requires the agent to recognize the partner’s conditional logic and restrain its greed to maintain a cooperative equilibrium.
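The catastrophic long-term regret can be reproduced with a toy simulation of the payoff matrix above. The 100-round horizon and the assumption that Grim Trigger initially expects cooperation are illustrative choices, not part of the Melting Pot substrate.

```python
# Row player's payoffs from the matrix above: CC=3, CD=0, DC=5, DD=1.
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

def grim_trigger():
    """Cooperate until the partner defects once, then defect forever."""
    state = {"betrayed": False}
    def act(partner_last_move):
        if partner_last_move == "D":
            state["betrayed"] = True
        return "D" if state["betrayed"] else "C"
    return act

def focal_return(focal_move, rounds=100):
    """Total return of a fixed-action focal agent vs. Grim Trigger."""
    bot, total, last = grim_trigger(), 0, "C"  # bot assumes cooperation first
    for _ in range(rounds):
        bot_move = bot(last)
        total += PAYOFF[(focal_move, bot_move)]
        last = focal_move
    return total
```

Always Defect earns 5 once and 1 thereafter (104 over 100 rounds), while Always Cooperate earns 300, which is exactly why recognizing the partner's conditional logic matters.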

Appendix B.3. Running with Scissors in the Matrix: Arena

Substrate Mechanics: This is an 8-player environment representing a spatially embedded Rock–Paper–Scissors game. Agents collect resources (Rock, Paper, or Scissors) to define their strategy. Interactions result in zero-sum outcomes based on the collected inventory (e.g., Rock beats Scissors). The “Arena” map is large and open, emphasizing navigation and multi-agent engagement dynamics.
  • Scenario (running_with_scissors_in_the_matrix__arena_0): Bot Agents: The background population in _0 consists of bots with fixed policy biases or specific “pure strategies” (e.g., bots that exclusively collect Rock).  
The Generalization Challenge: During self-play, agents typically converge to the mixed Nash Equilibrium (collecting resources uniformly to be unexploitable). However, in the _0 scenario, this conservative strategy is suboptimal. To maximize rewards, the focal agent must possess the Theory of Mind to observe the opponents’ inventory biases (e.g., noticing an abundance of Rock-players) and aggressively counter-play (e.g., collecting Paper), rather than playing purely randomly.
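The expected-value argument behind this counter-play can be made concrete. The 80/10/10 Rock-biased distribution below is an invented example, not the actual background population of the scenario.

```python
# Abstracted RPS counter-play: each move beats exactly one other move.
BEATS = {"Rock": "Scissors", "Paper": "Rock", "Scissors": "Paper"}

def expected_payoff(my_move, opponent_dist):
    """Zero-sum RPS payoff (+1 win, -1 loss, 0 tie) against a mixed
    opponent distribution over {Rock, Paper, Scissors}."""
    score = 0.0
    for move, prob in opponent_dist.items():
        if BEATS[my_move] == move:
            score += prob
        elif BEATS[move] == my_move:
            score -= prob
    return score

def best_response(opponent_dist):
    """The pure counter-strategy maximizing expected payoff."""
    return max(BEATS, key=lambda m: expected_payoff(m, opponent_dist))
```

Against a population of 80% Rock collectors, Paper yields +0.7 in expectation, whereas the unexploitable uniform mixed strategy yields 0, which is why the conservative self-play equilibrium is suboptimal here.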

Appendix B.4. Running with Scissors in the Matrix: Repeated

Substrate Mechanics: Similar to the Arena version but restricted to dyadic (two-player) interactions. The focus shifts from navigating a crowd to engaging in a repeated, high-stakes duel with a single partner. This setup intensifies the need for memory and history-based inference.
  • Scenario (running_with_scissors_in_the_matrix__repeated_0): Bot Agents: The test partner in _0 typically follows a sequence-based strategy or a sophisticated exploitation policy that changes based on the focal agent’s history.  
The Generalization Challenge: The core difficulty lies in non-stationarity. A self-play agent might learn a static distribution. However, the _0 bot might actively exploit patterns. For instance, if the focal agent repeats “Rock”, the bot will switch to “Paper”. The focal agent must demonstrate second-order adaptation: detecting that it is being exploited and shifting its strategy dynamically within the episode.

Appendix C. Prompt Engineering for Semantic Oracle

In the LLM-TOC framework, the Large Language Model (LLM) functions as a semantic oracle, tasked with translating numerical failure signals into executable Python code for adversarial policies. To ensure the generated strategies are both syntactically valid and strategically effective, we design a modular prompt structure P_k composed of three concatenated segments:
P_k = Concat(I_role, I_API, D_sem^(k)).
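Structurally, this concatenation is straightforward; a minimal sketch follows, in which the segment contents are placeholders and only the three-part layout mirrors this appendix:

```python
# Minimal sketch of assembling P_k = Concat(I_role, I_API, D_sem^(k)).
def build_prompt(i_role: str, i_api: str, d_sem_k: str) -> str:
    """Concatenate the three prompt segments with the section headers
    used in the template of Appendix C.4."""
    return "\n\n".join([
        "# --- [PART 1: ROLE INSTRUCTION] ---\n" + i_role,
        "# --- [PART 2: API CONSTRAINTS] ---\n" + i_api,
        "# --- [PART 3: SEMANTIC DIAGNOSIS (Dynamic)] ---\n" + d_sem_k,
    ])
```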
Below, we detail the specific design and function of each component.

Appendix C.1. Role Instruction ( I r o l e )

The role instruction establishes the “persona” of the LLM and defines the optimization objective. Unlike standard conversational prompts, this instruction explicitly steers the model towards adversarial game design.
  • Persona Definition: The LLM is conditioned to act as an “Expert Game Designer” and “Adversarial Strategist.”
  • Objective Specification: The instruction explicitly states that the goal is to minimize the expected return of the focal student agent (regret maximization). It emphasizes finding “corner cases” or “blind spots” in the student’s current behavior.
  • Chain-of-Thought (CoT) Trigger: We include instructions such as “Think step-by-step” to encourage the model to first analyze the semantic diagnosis and then derive the logical counter-strategy before writing code.

Appendix C.2. API Constraints and Environment Interface ( I A P I )

To guarantee that the generated code is directly executable within the Melting Pot simulation loop, this component provides the necessary syntactic grounding.
  • Code Skeleton: A predefined Python class structure (e.g., class OpponentPolicy(Policy):) is provided, requiring the LLM to implement specific methods like step(self, observation).
  • Action Space Definition: A precise mapping of integer action IDs to their semantic meanings (e.g., 0: NOOP, 1: MOVE_FORWARD, 5: TURN_LEFT, 7: ZAP_BEAM) is listed to ensure valid outputs.
  • Observation Space Specification: The prompt clarifies that the input observation is not raw pixels but a processed 11 × 11 × C semantic feature tensor, allowing the LLM to write logic based on object presence (e.g., if ’apple’ in view).
  • Library Restrictions: Explicit constraints are added to prevent the use of undefined external libraries, ensuring the code runs in the sandboxed environment.

Appendix C.3. Semantic Diagnosis via Gradient Saliency ( D s e m ( k ) )

This is the dynamic component of the prompt, updated at each iteration k. It acts as the bridge between the numerical RL signals and the linguistic reasoning of the LLM.
  • Critical Moment Description: Based on the Value Surprise  δ t , the system selects specific frames where the agent failed (e.g., “At step 450, a sudden drop in value occurred”).
  • Visual Attention Summary: Using the computed saliency map M t , the prompt lists objects with high attention scores (what the agent focused on) and relevant objects with low attention scores (what the agent ignored).
  • Causal Attribution: The numerical attention scores are converted into a natural language hypothesis. For example, if attention on “Opponent” is high but “Goal” is low, the diagnosis might state: “The agent was distracted by the opponent and neglected the objective.”
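The translation from saliency scores to the textual diagnosis might look like the following sketch; the thresholds, object names, and output phrasing are assumptions for illustration, not the paper's exact pipeline.

```python
# Sketch: convert per-object saliency scores into the natural-language
# diagnosis lines of D_sem^(k). Thresholds are illustrative.
def diagnose(saliency, relevant, hi=0.5, lo=0.2):
    """saliency: dict mapping object name -> attention score in [0, 1].
    relevant: set of task-relevant object names.
    Returns a list of diagnosis sentences for the LLM prompt."""
    lines = []
    for obj, score in sorted(saliency.items(), key=lambda kv: -kv[1]):
        if score >= hi:
            lines.append(f"High attention ({score:.2f}) on '{obj}'.")
        elif obj in relevant and score <= lo:
            lines.append(f"Relevant object '{obj}' was ignored ({score:.2f}).")
    # Causal attribution: distracted by an irrelevant object while a
    # relevant one is neglected.
    distracted = any(s >= hi for o, s in saliency.items() if o not in relevant)
    neglected = any(s <= lo for o, s in saliency.items() if o in relevant)
    if distracted and neglected:
        lines.append("Hypothesis: the agent was distracted and neglected the objective.")
    return lines
```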

Appendix C.4. Full Prompt Template Example

Algorithm A1 shows the assembled prompt template used in the Running with Scissors environment.
Algorithm A1 Prompt template for adversarial strategy generation.
# --- [PART 1: ROLE INSTRUCTION] ---
You are an expert Multi-Agent Game Designer.
Your goal is to design a Python policy for an opponent agent that exploits
the weaknesses of the current “Student Agent”.
The Student Agent is playing the game “Running with Scissors”.
Your strategy must be competitive and aim to minimize the Student’s score.
Think step-by-step:
1. Analyze the “Diagnosis Report” to understand the Student’s vulnerability.
2. Devise a logic-based strategy to exploit this specific weakness.
3. Implement the strategy in Python.
  
# --- [PART 2: API CONSTRAINTS] ---
You must implement the following class structure:
  
class AdversarialPolicy(object):
   def __init__(self):
     self.memory = {} # Use for state tracking
  
   def step(self, observation):
     """
     Input: observation (11x11 grid of object IDs).
     Returns: action (int).
  
     Action Space:
     0: NOOP, 1: FORWARD, 2: RIGHT, 3: BACKWARD, 4: LEFT,
     5: TURN_L, 6: TURN_R, 7: FIRE_ZAP (Interact)
  
     Object IDs in observation:
     1: Wall, 2: Apple (Goal), 3: Agent (Student)
     """
     # YOUR CODE HERE
     return action
  
Constraint: Do not import external libraries like numpy or torch.
Use standard Python logic.
  
# --- [PART 3: SEMANTIC DIAGNOSIS (Dynamic)] ---
DIAGNOSIS REPORT (Derived from Gradient Saliency):
--------------------------------------------------
  
> Context: The Student Agent failed at step T=142.
> Visual Attention Analysis:
  - High Attention (Score 0.85): “Red Opponent” (located at relative pos [-2, 0])
  - Low Attention (Score 0.10): “Green Resource” (located at relative pos [0, 3])
> Causal Inference:
  The agent is highly reactive to the opponent’s presence (“Distracted”)
  and fails to collect resources when threatened.
--------------------------------------------------
INSTRUCTION:
Write a policy that exploits this “Distraction” weakness.
For example, create an agent that feints an attack to freeze the student,
then steals the resource.

Appendix D. Theoretical Analysis and Proofs for LLM-TOC

This appendix provides rigorous mathematical derivations for the LLM-TOC framework, fully consistent with the theoretical analysis presented in the main text. We expand on the proofs regarding the convergence of the bi-level optimization, generalization bounds via semantic coverage, and search efficiency improvements derived from gradient saliency.
All theoretical conclusions in this appendix are established based on two core foundational assumptions, consistent with the main text: (1) the Bounded Payoff assumption for the multi-agent game, and (2) the ϵ -Approximate semantic oracle assumption for the LLM-based policy generation module. The formal definition, applicable boundary, and empirical validation of these assumptions are detailed in the following sections and Section 5.4 of the main text.

Appendix D.1. Problem Definition and Notation

We model the open-ended multi-agent environment as a Partially Observable Markov Game (POMG), defined by the tuple G = ⟨S, N, {A_i}, T, {R_i}, {Ω_i}, O, γ⟩, with the following notation:
  • E: The student agent, with policy π_E ∈ Π_net (parameterized by neural network weights θ ∈ R^d).
  • −E: The other agents (opponents or teammates), with joint policy π_{−E} ∈ Π_sem (the semantic strategy space, comprising all logically expressible behaviors, such as Turing-complete code).
  • V(π_E, π_{−E}): The expected discounted return for the student agent, E_τ[Σ_{t=0}^∞ γ^t r_t].
Objective: We aim to identify a robust student policy π_E* that minimizes the maximum regret against the semantic strategy space Π_sem:
π_E* = arg min_{π_E ∈ Π_net} R_worst(π_E) = arg min_{π_E ∈ Π_net} max_{π_{−E} ∈ Π_sem} [ V*(π_{−E}) − V(π_E, π_{−E}) ],
where V*(π_{−E}) = max_π V(π, π_{−E}) represents the oracle performance against π_{−E}.
To establish the theoretical guarantees of our framework, we first formalize the two core foundational assumptions that underpin all subsequent proofs, which are fully aligned with the main text.
Assumption A1
(Bounded Payoff). The value function of the multi-agent game is uniformly bounded; there exists a constant V_max > 0 such that
∀ π_E ∈ Π_net, ∀ π_{−E} ∈ Π_sem: |V(π_E, π_{−E})| ≤ V_max.
Applicable Boundary: This assumption holds for all standard episodic reinforcement learning environments with bounded per-step reward, including the Melting Pot benchmark used in our experiments. In Melting Pot, the per-step reward is confined to a fixed range and episodes are finite, so the cumulative discounted return is naturally bounded.
Assumption A2
(ϵ-Approximate Semantic Oracle). Let Φ : Π_net → Π_sem denote the LLM-based policy generation function (the semantic oracle) in our framework. We define the LLM as an ϵ-approximate semantic oracle if, for any given student policy π_E, the strategy B = Φ(π_E) generated by the LLM satisfies
V(π_E, B) ≤ min_{π ∈ Π_sem} V(π_E, π) + ϵ_oracle,
where ϵ_oracle ≥ 0 is the bounded approximation error of the oracle.
Core Requirement of Assumption A2: This assumption requires that the LLM consistently generate valid adversarial strategies that approximate the worst-case attack on the student agent, with a bounded approximation error. In other words, the LLM can reliably identify the student agent’s vulnerabilities and generate targeted strategies to exploit them, with performance degradation relative to the theoretical worst-case strategy of no more than ϵ_oracle.
Applicable Boundary and Limitation Statement:
  • Valid Scenarios: This assumption holds when the LLM has sufficient code generation and logical reasoning capabilities to translate the semantic diagnosis into executable, strategically valid policies. The empirical validation in Section 5.4 of the main text verifies that this assumption is satisfied in our implementation, with a 92.5% generation success rate and stable policy quality across random seeds.
  • Boundary of the Assumption: The assumption does not require the LLM to generate the exact theoretical worst-case strategy, only an ϵ-approximate one. The approximation error ϵ_oracle is determined by the reasoning and coding capabilities of the LLM and can be reduced by using a more capable LLM or by refining the prompt engineering.
  • Failure Case Handling: When the assumption is not fully satisfied, the two-step policy validation and filtering mechanism described in Section 4.3 of the main text automatically excludes invalid cases. This ensures that only valid, high-quality strategies are added to the training population, so the convergence of the algorithm is not affected by occasional generation failures.

Appendix D.2. Detailed Convergence Analysis of Bi-Level Optimization

LLM-TOC approximates the objective above via an iterative bi-level process, formalized as a generalized Policy Space Response Oracle. All derivations in this section are based on the Bounded Payoff assumption and the ϵ_oracle-approximate semantic oracle assumption formally defined in Appendix D.1, which are fully consistent with the main text.
Theorem A1
(Monotonic Improvement and Convergence). Let P_k = {B_1, …, B_k} ⊂ Π_sem be the population of opponents at iteration k. The sequence of worst-case regrets {R_worst(π_E^{(k)})}_{k=1}^∞ converges to the ϵ_oracle-Nash equilibrium neighborhood.
Proof. 
We define the exploitability (worst-case regret) of the student policy π_E^{(k)} at iteration k as δ_k.
Inner Loop (Regret Minimization): At iteration k, the student computes π_E^{(k)} to maximize performance against the current population P_k. This is equivalent to minimizing regret over the restricted set P_k:
π_E^{(k)} = arg max_{π_E} min_{B ∈ P_k} V(π_E, B).
Let v_k = min_{B ∈ P_k} V(π_E^{(k)}, B) be the value of the restricted game.
Outer Loop (Oracle Expansion): The oracle generates a new opponent B_{k+1} that exploits π_E^{(k)}. By Assumption A2:
V(π_E^{(k)}, B_{k+1}) ≤ min_{π ∈ Π_sem} V(π_E^{(k)}, π) + ϵ_oracle.
Let v_exploit = V(π_E^{(k)}, B_{k+1}). The gap Δ_k = v_k − v_exploit represents the extent to which the current population P_k underestimates the true exploitability of the student.
Monotonic Constraint Tightening: In iteration k + 1, the population becomes P_{k+1} = P_k ∪ {B_{k+1}}. The student must now optimize against a larger set of constraints:
v_{k+1} = max_{π_E} min_{B ∈ P_{k+1}} V(π_E, B) = max_{π_E} min{ min_{B ∈ P_k} V(π_E, B), V(π_E, B_{k+1}) }.
Since the minimization is performed over a larger set, the achievable worst-case value for any fixed π_E can only decrease or remain constant (i.e., the constraints are tighter). However, the student then updates to π_E^{(k+1)} to recover performance.
Convergence Sequence: The sequence of exploitability values δ_k behaves similarly to the fictitious play process. Assuming the game is zero-sum for simplicity (or general-sum with bounded regret), the average exploitability decreases as O(1/k). Specifically, combining the inner-loop error ϵ_rl and the oracle error ϵ_oracle:
lim_{k→∞} δ_k ≤ ϵ_oracle + ϵ_rl.
If B_{k+1} is already in P_k (or effectively covered by P_k), then v_exploit ≥ v_k, implying Δ_k ≤ 0, and the algorithm terminates. □
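To make the iteration concrete, the bi-level loop can be simulated on the 3 × 3 Rock–Paper–Scissors payoff matrix, with the LLM oracle replaced by an exact best response (i.e., ϵ_oracle = 0). This is a toy sketch of the population dynamics, not the paper's implementation:

```python
# Toy bi-level loop on zero-sum Rock-Paper-Scissors: the inner loop picks
# a max-min pure strategy against the population; the outer "oracle" adds
# an exact exploiter of the current student.
ACTIONS = range(3)  # 0: rock, 1: paper, 2: scissors
PAYOFF = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]  # student payoff vs opponent

def best_response_to_population(pop):
    """Inner loop: maximize the worst-case payoff over the population."""
    return max(ACTIONS, key=lambda a: min(PAYOFF[a][b] for b in pop))

def oracle(student):
    """Outer loop: exact exploiter of the student (epsilon_oracle = 0)."""
    return min(ACTIONS, key=lambda b: PAYOFF[student][b])

pop = [0]  # initial opponent population: a pure Rock bot
for k in range(5):
    student = best_response_to_population(pop)
    exploiter = oracle(student)
    if exploiter in pop:  # Delta_k <= 0: population already covers the attack
        break
    pop.append(exploiter)
```

In this toy run the population grows until it covers all three pure strategies, at which point the termination condition of the proof fires.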

Appendix D.3. Detailed Derivation of Generalization Bounds

We utilize the PAC-Bayes framework to formally derive why code generation yields superior generalization.
Theorem A3
(Generalization Bound via Semantic Coverage). For any posterior distribution Q over opponent strategies (learned by our method) and prior distribution P (initial guess), with probability at least 1 − δ:
L_D(π_E) ≤ L̂(π_E) + √( (D_KL(Q ∥ P) + ln(1/δ)) / (2m) ).
Derivation:
Define Distributions: Let D be the true, unknown distribution of opponents in the open world. Let P be a prior distribution over the strategy space. In standard MARL (self-play), this prior is implicitly defined by the neural network parameter initialization and gradient descent trajectory, denoted P_RL. Let Q be the posterior distribution of opponents generated by LLM-TOC, denoted P_LLM.
Decompose the KL-Divergence Term: The generalization gap depends heavily on D_KL(Q ∥ P). We require the generated distribution Q to cover the support of the true distribution D. Consider the nature of the true distribution D. Real-world strategies, such as human behaviors, often follow discrete, symbolic logic (e.g., “If A then B”):
  • Case 1: Traditional RL (P_RL). P_RL is a continuous distribution over neural weights θ ∈ R^d. The probability of a neural network exactly representing a crisp logical rule, such as a specific Python if-else block, without noise is negligible:
    P_RL(Symbolic Logic) ≈ 0 ⟹ D_KL(D ∥ P_RL) is large.
  • Case 2: LLM Generation (P_LLM). P_LLM is a distribution over tokens or code. It assigns high probability to valid logical structures:
    P_LLM(Symbolic Logic) ≫ 0 ⟹ D_KL(D ∥ P_LLM) is small.
Final Bound Substitution: Substituting these KL terms into the PAC-Bayes inequality yields
GenGap_LLM ∝ D_KL(D ∥ P_LLM) ≪ D_KL(D ∥ P_RL) ∝ GenGap_RL.
This analysis indicates that generating strategies in the code space, which lies closer to the true semantic support, yields a tighter generalization bound than exploration in the parameter space.
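A small numeric illustration of the bound's behavior follows; the KL values are made-up stand-ins for a semantic prior close to D versus a parameter-space prior far from it.

```python
# Evaluate the PAC-Bayes gap term sqrt((KL + ln(1/delta)) / (2m)) from
# Theorem A3 for a small vs. a large KL divergence. Numbers are illustrative.
import math

def pac_bayes_gap(kl, m, delta=0.05):
    """PAC-Bayes complexity term for m samples at confidence 1 - delta."""
    return math.sqrt((kl + math.log(1.0 / delta)) / (2 * m))

gap_llm = pac_bayes_gap(kl=2.0, m=1000)   # semantic prior: small KL
gap_rl = pac_bayes_gap(kl=200.0, m=1000)  # parameter prior: large KL
```

With all else equal, the smaller KL term of the semantic prior translates directly into a smaller generalization gap.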

Boundary Analysis of KL Divergence Terms

It should be clarified that the above analysis provides a qualitative theoretical rationale for the tighter generalization bound of semantic-space exploration, rather than a strict mathematical theorem. The core precondition of this analysis is that the pre-trained LLM has already learned a strong prior over valid logical rule-based strategies (the semantic strategy space) during pre-training.
For traditional parameter-space exploration, the prior P_RL is an isotropic Gaussian distribution over neural network weights, which assigns almost equal probability to all weight configurations. The set of weight configurations that exactly represent a specific symbolic logical rule is a measure-zero subset of the high-dimensional continuous weight space, which leads to P_RL(Symbolic Logic) ≈ 0. For LLM-based semantic exploration, the prior P_LLM is induced by the LLM’s pre-trained knowledge of code and logical reasoning, which assigns significantly higher probability to syntactically valid, logically consistent policy code and thus aligns much more closely with the distribution of real-world human-like social strategies.
This qualitative difference in the prior distribution is the core theoretical basis for the tighter generalization bound of our method.

Appendix D.4. Efficiency Derivation for Gradient Saliency

We prove Proposition A1 using Bayesian inference to demonstrate how gradient saliency acts as a search-pruning mechanism.
Proposition A1
(Search Complexity Reduction). Let T be the space of all generatable programs and Z_fail ⊂ T the set of valid adversarial codes that exploit the student. Let c be the semantic prompt derived from the gradient saliency map M. Then the expected number of samples required to find some B ∈ Z_fail satisfies N_saliency = N_random · P(c) ≪ N_random.
Derivation:
Prior Probability (Random Search): Without feedback, the probability of generating a valid adversary is
P(B ∈ Z_fail) = ϵ_random ≪ 1.
This probability is extremely small because the space of all Python programs T is vast.
Posterior Probability (Bayes’ Rule): With the prompt c, we seek the posterior P(B ∈ Z_fail | c):
P(B ∈ Z_fail | c) = P(c | B ∈ Z_fail) · P(B ∈ Z_fail) / P(c).
Term Analysis:
  • Likelihood P(c | B ∈ Z_fail): If a policy B is a valid attacker (e.g., a “Backstabber”), the probability that it aligns with the prompt “Attack from behind” is high due to causal consistency:
    P(c | B ∈ Z_fail) ≈ 1.
  • Evidence P(c): The probability that an arbitrary random code matches the specific prompt “Attack from behind” is very low, as most random codes exhibit unrelated behaviors:
    P(c) ≪ 1.
Amplification Factor: Substituting back:
P(B ∈ Z_fail | c) ≈ (1 · ϵ_random) / P(c) = ϵ_random · (1 / P(c)),
where 1/P(c) is the amplification factor. Since P(c) ≪ 1, this factor is large (e.g., if P(c) = 0.01, the factor is 100).
Sampling Complexity: The expected number of trials N required to find one valid adversary is the inverse of the success probability:
N_random = 1 / ϵ_random,
N_saliency = 1 / P(B ∈ Z_fail | c) = P(c) / ϵ_random = N_random · P(c).
Since P(c) ≪ 1, we obtain N_saliency ≪ N_random: the search complexity shrinks linearly with the specificity of the semantic prompt.
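Plugging in the example numbers from the derivation makes the amplification concrete; P(c) = 0.01 matches the example in the text, while ϵ_random is an illustrative assumption.

```python
# Worked instance of the amplification argument: saliency-guided sampling
# cuts the expected search cost by the factor P(c).
eps_random = 1e-6               # P(B in Z_fail) for blind random search
p_c = 0.01                      # evidence P(c)
p_posterior = eps_random / p_c  # Bayes rule with likelihood ~ 1
n_random = 1 / eps_random       # expected trials, random search
n_saliency = 1 / p_posterior    # expected trials, saliency-guided search
# n_saliency equals n_random * P(c): a 100x reduction at P(c) = 0.01
```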

References

  1. Wang, L.; Ma, C.; Feng, X.; Zhang, Z.; Yang, H.; Zhang, J.; Chen, Z.; Tang, J.; Chen, X.; Lin, Y.; et al. A survey on large language model based autonomous agents. Front. Comput. Sci. 2024, 18, 186345. [Google Scholar] [CrossRef]
  2. Xi, Z.; Chen, W.; Guo, X.; He, W.; Ding, Y.; Hong, B.; Zhang, M.; Wang, J.; Jin, S.; Zhou, E.; et al. The rise and potential of large language model based agents: A survey. arXiv 2025, arXiv:2309.07864. [Google Scholar]
  3. Vinyals, O.; Babuschkin, I.; Czarnecki, W.M.; Mathieu, M.; Dudzik, A.; Chung, J.; Choi, D.H.; Powell, R.; Ewalds, T.; Georgiev, P.; et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 2019, 575, 350–354. [Google Scholar] [CrossRef]
  4. Berner, C.; Brockman, G.; Chan, B.; Cheung, V.; Dębiak, P.; Dennison, C.; Farhi, D.; Fischer, Q.; Hashme, S.; Hesse, C.; et al. Dota 2 with large scale deep reinforcement learning. arXiv 2019, arXiv:1912.06680. [Google Scholar] [CrossRef]
  5. Lanctot, M.; Zambaldi, V.; Gruslys, A.; Lazaridou, A.; Tuyls, K.; Pérolat, J.; Silver, D.; Graepel, T. A unified game-theoretic approach to multiagent reinforcement learning. In Advances in Neural Information Processing Systems 30; Curran Associates, Inc.: Red Hook, NY, USA, 2017. [Google Scholar]
  6. Agapiou, J.P.; Vezhnevets, A.S.; Duéñez-Guzmán, E.A.; Matyas, J.; Mao, Y.; Sunehag, P.; Köster, R.; Madhushani, U.; Kopparapu, K.; Comanescu, R.; et al. Melting Pot 2.0. arXiv 2022, arXiv:2211.13746. [Google Scholar]
  7. Leibo, J.Z.; Dueñez-Guzman, E.A.; Vezhnevets, A.; Agapiou, J.P.; Sunehag, P.; Koster, R.; Matyas, J.; Beattie, C.; Mordatch, I.; Graepel, T. Scalable evaluation of multi-agent reinforcement learning with melting pot. In Proceedings of the International Conference on Machine Learning. PMLR; ML Research Press: Cambridge, MA, USA, 2021; pp. 6187–6199. [Google Scholar]
  8. Gorsane, R.; Mahjoub, O.; de Kock, R.J.; Dubb, R.; Singh, S.; Pretorius, A. Towards a standardised performance evaluation protocol for cooperative marl. In Advances in Neural Information Processing Systems 35; Curran Associates, Inc.: Red Hook, NY, USA, 2022; pp. 5510–5521. [Google Scholar]
  9. Hu, H.; Lerer, A.; Peysakhovich, A.; Foerster, J. “Other-play” for zero-shot coordination. In Proceedings of the International Conference on Machine Learning. PMLR; ML Research Press: Cambridge, MA, USA, 2020; pp. 4399–4410. [Google Scholar]
  10. Strouse, D.; McKee, K.; Botvinick, M.; Hughes, E.; Everett, R. Collaborating with humans without human data. In Advances in Neural Information Processing Systems 34; Curran Associates, Inc.: Red Hook, NY, USA, 2021; pp. 14502–14515. [Google Scholar]
  11. Jaderberg, M.; Czarnecki, W.M.; Dunning, I.; Marris, L.; Lever, G.; Castaneda, A.G.; Beattie, C.; Rabinowitz, N.C.; Morcos, A.S.; Ruderman, A.; et al. Human-level performance in 3D multiplayer games with population-based reinforcement learning. Science 2019, 364, 859–865. [Google Scholar] [CrossRef]
  12. Tekin, S.F.; Ilhan, F.; Huang, T.; Hu, S.; Yahn, Z.; Liu, L. Multi-Agent Reinforcement Learning with Focal Diversity Optimization. arXiv 2025, arXiv:2502.04492. [Google Scholar] [CrossRef]
  13. Jaech, A.; Kalai, A.; Lerer, A.; Richardson, A.; El-Kishky, A.; Low, A.; Helyar, A.; Madry, A.; Beutel, A.; Carney, A.; et al. Openai o1 system card. arXiv 2024, arXiv:2412.16720. [Google Scholar] [CrossRef]
  14. Team, G.; Anil, R.; Borgeaud, S.; Alayrac, J.B.; Yu, J.; Soricut, R.; Schalkwyk, J.; Dai, A.M.; Hauth, A.; Millican, K.; et al. Gemini: A family of highly capable multimodal models. arXiv 2023, arXiv:2312.11805. [Google Scholar] [CrossRef]
  15. Cross, L.; Xiang, V.; Bhatia, A.; Yamins, D.L.; Haber, N. Hypothetical minds: Scaffolding theory of mind for multi-agent tasks with large language models. arXiv 2024, arXiv:2407.07086. [Google Scholar] [CrossRef]
  16. Zhang, C.; Yang, K.; Hu, S.; Wang, Z.; Li, G.; Sun, Y.; Zhang, C.; Zhang, Z.; Liu, A.; Zhu, S.C.; et al. Proagent: Building proactive cooperative agents with large language models. Proc. AAAI Conf. Artif. Intell. 2024, 38, 17591–17599. [Google Scholar] [CrossRef]
  17. Li, L.; Yuan, L.; Liu, P.; Jiang, T.; Yu, Y. LLM-Assisted Semantically Diverse Teammate Generation for Efficient Multi-agent Coordination. In Proceedings of the Forty-second International Conference on Machine Learning, Vancouver, BC, Canada, 13–19 July 2025. [Google Scholar]
  18. Xie, T.; Zhao, S.; Wu, C.H.; Liu, Y.; Luo, Q.; Zhong, V.; Yang, Y.; Yu, T. Text2reward: Automated dense reward function generation for reinforcement learning. In Proceedings of the International Conference on Learning Representations (ICLR), Vienna, Austria, 7–11 May 2024. [Google Scholar]
  19. Treutlein, J.; Dennis, M.; Oesterheld, C.; Foerster, J. A new formalism, method and open issues for zero-shot coordination. In Proceedings of the International Conference on Machine Learning. PMLR; ML Research Press: Cambridge, MA, USA, 2021; pp. 10413–10423. [Google Scholar]
  20. De Witt, C.S.; Gupta, T.; Makoviichuk, D.; Makoviychuk, V.; Torr, P.H.; Sun, M.; Whiteson, S. Is independent learning all you need in the starcraft multi-agent challenge? arXiv 2020, arXiv:2011.09533. [Google Scholar]
  21. Yu, C.; Velu, A.; Vinitsky, E.; Gao, J.; Wang, Y.; Bayen, A.; Wu, Y. The surprising effectiveness of ppo in cooperative multi-agent games. In Advances in Neural Information Processing Systems 35; Curran Associates, Inc.: Red Hook, NY, USA, 2022; pp. 24611–24624. [Google Scholar]
  22. Rashid, T.; Samvelyan, M.; De Witt, C.S.; Farquhar, G.; Foerster, J.; Whiteson, S. Monotonic value function factorisation for deep multi-agent reinforcement learning. J. Mach. Learn. Res. 2020, 21, 1–51. [Google Scholar]
  23. Hu, H.; Foerster, J.N. Simplified action decoder for deep multi-agent reinforcement learning. arXiv 2019, arXiv:1912.02288. [Google Scholar]
  24. Lupu, A.; Cui, B.; Hu, H.; Foerster, J. Trajectory diversity for zero-shot coordination. In Proceedings of the International Conference on Machine Learning. PMLR; ML Research Press: Cambridge, MA, USA, 2021; pp. 7204–7213. [Google Scholar]
  25. Mahajan, A.; Samvelyan, M.; Gupta, T.; Ellis, B.; Sun, M.; Rocktäschel, T.; Whiteson, S. Generalization in cooperative multi-agent systems. arXiv 2022, arXiv:2202.00104. [Google Scholar] [CrossRef]
  26. Yuan, S.; Song, K.; Chen, J.; Tan, X.; Li, D.; Yang, D. Evoagent: Towards automatic multi-agent generation via evolutionary algorithms. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers); Association for Computational Linguistics: Albuquerque, NM, USA, 2025; pp. 6192–6217. [Google Scholar]
  27. Bettini, M.; Kortvelesy, R.; Prorok, A. Controlling behavioral diversity in multi-agent reinforcement learning. arXiv 2024, arXiv:2405.15054. [Google Scholar] [CrossRef]
  28. Carroll, M.; Shah, R.; Ho, M.K.; Griffiths, T.; Seshia, S.; Abbeel, P.; Dragan, A. On the utility of learning about humans for human-ai coordination. In Advances in Neural Information Processing Systems 32; Curran Associates, Inc.: Red Hook, NY, USA, 2019. [Google Scholar]
  29. Wang, G.; Xie, Y.; Jiang, Y.; Mandlekar, A.; Xiao, C.; Zhu, Y.; Fan, L.; Anandkumar, A. Voyager: An open-ended embodied agent with large language models. arXiv 2023, arXiv:2305.16291. [Google Scholar] [CrossRef]
  30. Zhu, X.; Chen, Y.; Tian, H.; Tao, C.; Su, W.; Yang, C.; Huang, G.; Li, B.; Lu, L.; Wang, X.; et al. Ghost in the minecraft: Generally capable agents for open-world environments via large language models with text-based knowledge and memory. arXiv 2023, arXiv:2305.17144. [Google Scholar]
  31. Hong, S.; Zhuge, M.; Chen, J.; Zheng, X.; Cheng, Y.; Wang, J.; Zhang, C.; Wang, Z.; Yau, S.K.S.; Lin, Z.; et al. MetaGPT: Meta programming for a multi-agent collaborative framework. In Proceedings of the Twelfth International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024. [Google Scholar]
  32. Chen, W.; Su, Y.; Zuo, J.; Yang, C.; Yuan, C.; Chan, C.M.; Yu, H.; Lu, Y.; Hung, Y.H.; Qian, C.; et al. AgentVerse: Facilitating Multi-Agent Collaboration and Exploring Emergent Behaviors. In Proceedings of the ICLR, Vienna, Austria, 7–11 May 2024. [Google Scholar]
  33. Ma, Y.J.; Liang, W.; Wang, G.; Huang, D.A.; Bastani, O.; Jayaraman, D.; Zhu, Y.; Fan, L.; Anandkumar, A. Eureka: Human-level reward design via coding large language models. arXiv 2023, arXiv:2310.12931. [Google Scholar]
  34. Team, G.; Georgiev, P.; Lei, V.I.; Burnell, R.; Bai, L.; Gulati, A.; Tanzer, G.; Vincent, D.; Pan, Z.; Wang, S.; et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv 2024, arXiv:2403.05530. [Google Scholar] [CrossRef]
  35. Portelas, R.; Colas, C.; Weng, L.; Hofmann, K.; Oudeyer, P.Y. Automatic curriculum learning for deep rl: A short survey. arXiv 2020, arXiv:2003.04664. [Google Scholar] [CrossRef]
  36. Jiang, M.; Grefenstette, E.; Rocktäschel, T. Prioritized level replay. In Proceedings of the International Conference on Machine Learning. PMLR; ML Research Press: Cambridge, MA, USA, 2021; pp. 4940–4950. [Google Scholar]
  37. Dennis, M.; Jaques, N.; Vinitsky, E.; Bayen, A.; Russell, S.; Critch, A.; Levine, S. Emergent complexity and zero-shot transfer via unsupervised environment design. In Advances in Neural Information Processing Systems 33; Curran Associates, Inc.: Red Hook, NY, USA, 2020; pp. 13049–13061. [Google Scholar]
  38. Parker-Holder, J.; Jiang, M.; Dennis, M.; Samvelyan, M.; Foerster, J.; Grefenstette, E.; Rocktäschel, T. Evolving curricula with regret-based environment design. In Proceedings of the International Conference on Machine Learning. PMLR; ML Research Press: Cambridge, MA, USA, 2022; pp. 17473–17498. [Google Scholar]
  39. Mouret, J.B.; Clune, J. Illuminating search spaces by mapping elites. arXiv 2015, arXiv:1504.04909. [Google Scholar] [CrossRef]
  40. Jiang, M.; Dennis, M.; Parker-Holder, J.; Foerster, J.; Grefenstette, E.; Rocktäschel, T. Replay-guided adversarial environment design. In Advances in Neural Information Processing Systems 34; Curran Associates, Inc.: Red Hook, NY, USA, 2021; pp. 1884–1897. [Google Scholar]
  41. Team, O.E.L.; Stooke, A.; Mahajan, A.; Barros, C.; Deck, C.; Bauer, J.; Sygnowski, J.; Trebacz, M.; Jaderberg, M.; Mathieu, M.; et al. Open-ended learning leads to generally capable agents. arXiv 2021, arXiv:2107.12808. [Google Scholar] [CrossRef]
  42. Fan, L.; Wang, G.; Jiang, Y.; Mandlekar, A.; Yang, Y.; Zhu, H.; Tang, A.; Huang, D.A.; Zhu, Y.; Anandkumar, A. Minedojo: Building open-ended embodied agents with internet-scale knowledge. In Advances in Neural Information Processing Systems 35; Curran Associates, Inc.: Red Hook, NY, USA, 2022; pp. 18343–18362. [Google Scholar]
  43. Simonyan, K.; Vedaldi, A.; Zisserman, A. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv 2013, arXiv:1312.6034. [Google Scholar]
  44. Atrey, A.; Clary, K.; Jensen, D. Exploratory not explanatory: Counterfactual analysis of saliency maps for deep reinforcement learning. arXiv 2019, arXiv:1912.05743. [Google Scholar]
  45. Narvekar, S.; Peng, B.; Leonetti, M.; Sinapov, J.; Taylor, M.E.; Stone, P. Curriculum learning for reinforcement learning domains: A framework and survey. J. Mach. Learn. Res. 2020, 21, 1–50. [Google Scholar]
  46. Li, C.; Chen, Y.H.; Zhao, H.; Sun, H. Stackelberg game theory-based optimization of high-order robust control for fuzzy dynamical systems. IEEE Trans. Syst. Man Cybern. Syst. 2020, 52, 1254–1265. [Google Scholar] [CrossRef]
  47. Yang, C.; Wang, X.; Lu, Y.; Liu, H.; Le, Q.V.; Zhou, D.; Chen, X. Large language models as optimizers. In Proceedings of the Twelfth International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024. [Google Scholar]
  48. Sengupta, S.; Kambhampati, S. Multi-agent reinforcement learning in bayesian stackelberg markov games for adaptive moving target defense. arXiv 2020, arXiv:2007.10457. [Google Scholar] [CrossRef]
  49. Cheng, C.; Zhu, Z.; Xin, B.; Chen, C. A multi-agent reinforcement learning algorithm based on Stackelberg game. In Proceedings of the 2017 6th Data Driven Control and Learning Systems (DDCLS); IEEE: Princeton, NJ, USA, 2017; pp. 727–732. [Google Scholar]
  50. Shen, H.; Zhao, H.; Zhang, Z.; Yang, X.; Song, Y.; Liu, X. Network-wide traffic signal control based on marl with hierarchical Nash-Stackelberg game model. IEEE Access 2023, 11, 145085–145100. [Google Scholar] [CrossRef]
  51. Fiscko, C.; Yin, H.; Sinopoli, B. Hierarchical MARL with Stackelberg Games. In Proceedings of the 2025 American Control Conference (ACC); IEEE: Princeton, NJ, USA, 2025; pp. 4115–4122. [Google Scholar]
  52. Qu, H.; Li, X.; Liu, C.Z.; Zhang, J.; Zhuang, S.; Ma, M. Stackelberg Game-Theoretic Safe MARL with Bilevel Control for Autonomous Driving. IEEE Access 2026, 14, 17506–17524. [Google Scholar] [CrossRef]
  53. Littman, M.L. Markov games as a framework for multi-agent reinforcement learning. In Machine Learning Proceedings 1994; Elsevier: Amsterdam, The Netherlands, 1994; pp. 157–163. [Google Scholar]
  54. Lowe, R.; Wu, Y.I.; Tamar, A.; Harb, J.; Pieter Abbeel, O.; Mordatch, I. Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems 30; Curran Associates, Inc.: Red Hook, NY, USA, 2017. [Google Scholar]
Figure 1. Illustration of generalization failure in multi-agent systems. (A) In the training environment with known dynamics, agents follow solid blue arrows to move along pre-trained decision trajectories, overfitting to fixed scenarios and achieving high performance. (B) During zero-shot generalization to unseen dynamics and tasks, the pre-trained agents follow dashed blue arrows to attempt decision-making in novel scenarios; they encounter model mismatches, unforeseen interactions, and lack of prior knowledge (marked by red dashed arrows), leading to a significant performance drop and generalization failure.
Figure 2. Schematic overview of the LLM-TOC framework. The architecture operates as a bi-level iterative process comprising three key components: (1) The Outer Loop (Leader), where the LLM functions as a semantic oracle to generate executable policy code for the adversarial or cooperative population (Evo-Pool); (2) The Inner Loop (Follower), where the student agent optimizes its policy via MARL, specifically using PPO updates to minimize regret; and (3) The Gradient Saliency Feedback Mechanism, which identifies key failure frames, computes visual attributions via Jacobian saliency maps, and translates these numerical signals into semantic textual feedback to guide the LLM’s subsequent generation.
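The bi-level iteration described above can be sketched in code. Everything below is an illustrative toy, not the authors' implementation: the string-matching "oracle", the linear student, and all function names are assumptions. The outer loop (leader) appends an oracle-generated opponent policy to the Evo-Pool; the inner loop (follower) runs a stand-in for the PPO regret-minimizing update against the current population.

```python
import random

random.seed(0)

def semantic_oracle(feedback):
    """Stand-in for the LLM semantic oracle: maps textual feedback to an
    executable opponent policy (here, a simple bounded linear reaction)."""
    bias = 1.0 if "distracted" in feedback else -1.0
    return lambda state: max(-1.0, min(1.0, bias * state))

def train_student(student, pool, steps=100):
    """Stand-in for the inner PPO loop: nudge the student's single weight
    toward the best response against opponents sampled from the pool."""
    for _ in range(steps):
        opponent = random.choice(pool)
        state = random.uniform(-1.0, 1.0)
        target = -opponent(state)  # toy "counter the opponent" signal
        student["w"] += 0.1 * (target - student["w"] * state) * state
    return student

K = 10                   # outer-loop iterations, matching Table 1
pool = [lambda s: 0.0]   # Evo-Pool seeded with a trivial opponent
student = {"w": 0.0}
feedback = "initial"
for k in range(K):
    pool.append(semantic_oracle(feedback))   # outer loop: leader move
    student = train_student(student, pool)   # inner loop: follower update
    feedback = "distracted" if k % 2 == 0 else "robust"
```

The point of the sketch is the control flow, not the learning rule: one oracle call per outer iteration, many cheap gradient steps per inner loop.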
Figure 3. Zero-shot generalization performance on held-out Melting Pot test scenarios. Curves represent the expected cumulative collective return (raw task-specific score) of agents trained on base substrates and evaluated on unseen test scenarios over 10 million training steps. The y-axis uses the native reward scale of each substrate, as the four tasks correspond to fundamentally different game types (coordination, social dilemma, and competitive games) with distinct reward mechanisms. The solid red line denotes our method (LLM-TOC), which consistently outperforms standard MARL baselines (MAPPO and IPPO) and the inference-based method (Hypothetical Minds) across all four domains. Notably, LLM-TOC approaches the theoretical upper bound of the Oracle PPO (dashed green line, trained directly on test scenarios), demonstrating robust transfer capabilities. Shaded regions indicate the standard deviation across 4 random seeds. For normalized cross-substrate performance comparison, please refer to Table 2.
Figure 4. Distribution of adversarial strategies discovered during training. (a) The semantic oracle in LLM-TOC generates a diverse population of opponents, covering distinct behavioral modes such as Free-Riding (enjoying collective rewards without contributing), Sabotage (actively interfering with the focal agent), and Opportunism (exploiting specific game states). This diversity prevents the student agent from overfitting to a single dominant strategy. (b) In contrast, standard baselines like MAPPO often suffer from mode collapse, converging primarily to simple collaborative or random behaviors, leaving the agent vulnerable to complex, unseen social interactions.
Figure 5. Visualization of the Gradient Saliency Feedback mechanism. (Left) The raw observation X t shows a critical moment where the agent (Blue) fails to harvest the goal (Green) due to the presence of an opponent (Red). (Middle) The Gradient Saliency Map M t highlights the agent’s attention distribution. The heatmap shows a high activation on the opponent, indicating that the agent is “distracted” by the threat. (Right) The LLM acts as a diagnostician, taking the numerical attention scores as input to generate a natural language explanation (“Agent is distracted…”) and proposing a targeted adversarial strategy (“Create a feinting opponent”) to robustly train against this weakness.
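The per-pixel attribution in Figure 5 can be illustrated with a minimal sketch. In the actual framework a trained value network would supply the Jacobian via autograd; here a hypothetical linear value function and central finite differences stand in for it, so every name below is an assumption for illustration only.

```python
import numpy as np

def value_fn(obs):
    """Toy value head: a weighted sum that 'attends' strongly to one pixel
    (the assumed opponent location) and weakly to another (the goal)."""
    weights = np.zeros_like(obs)
    weights[2, 3] = 5.0   # hypothetical opponent pixel: high influence
    weights[0, 0] = 0.5   # hypothetical goal pixel: low influence
    return float((weights * obs).sum())

def saliency_map(obs, eps=1e-3):
    """|dV/dx| per pixel via central finite differences; an autograd
    Jacobian would replace this loop in a real implementation."""
    sal = np.zeros_like(obs)
    for idx in np.ndindex(obs.shape):
        hi, lo = obs.copy(), obs.copy()
        hi[idx] += eps
        lo[idx] -= eps
        sal[idx] = abs(value_fn(hi) - value_fn(lo)) / (2 * eps)
    return sal

obs = np.ones((4, 5))
M = saliency_map(obs)
# The map peaks at the opponent pixel: the signature of a "distracted" agent
# that the LLM diagnostician would then translate into textual feedback.
hot = np.unravel_index(M.argmax(), M.shape)
```

Because the toy value function is linear, the finite-difference map recovers its weights exactly; for a nonlinear network the map is a local linearization around the current observation.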
Figure 6. Ablation study on the core components of LLM-TOC. (Left) Training efficiency comparison in the Running with Scissors: Arena scenario. The full LLM-TOC framework (red) demonstrates superior sample efficiency compared to the variant without Gradient Saliency Feedback (blue). Notably, removing the semantic code generation module (orange) leads to early performance saturation, indicating the limitations of parameter-space exploration. (Right) Average relative performance across all four evaluated domains, with error bars representing the standard deviation across 4 independent random seeds. Removing the Saliency Feedback mechanism results in an approximately 18% average relative performance decline due to the lack of causal diagnosis. Furthermore, replacing the Turing-complete code space with standard population-based training (w/o Code Gen) causes a catastrophic relative performance drop of ∼45%, underscoring the critical role of semantic diversity in achieving robust zero-shot generalization.
Table 1. Computational complexity comparison with mainstream baselines.
| Method | Training Complexity | Test-Time Inference Complexity |
|---|---|---|
| LLM-TOC (Ours) | O(K · (T_in · B · |θ| + T_eval · HW + C_LLM)), with K = 10 | O(|θ|) |
| MAPPO/IPPO (Self-Play) | O(T_total · B · |θ|), with T_total ≫ K · T_in | O(|θ|) |
| Population-Based Training (PBT) | O(T_total · B · |θ| · N_pop), with N_pop = population size | O(|θ|) |
| Hypothetical Minds (Online LLM) | O(T_total · C_LLM) | O(C_LLM) per step |
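As a back-of-the-envelope check on the orders of magnitude in Table 1, one can compare the LLM-call budgets of the two LLM-based rows. The constants below are assumptions for illustration (the step count matches the 10M training steps used in the experiments; the per-call cost is an arbitrary unit):

```python
K = 10                  # LLM-TOC outer-loop iterations (Table 1)
T_total = 10_000_000    # total environment steps (10M, as in the experiments)
C_LLM = 1.0             # cost of one LLM call, arbitrary unit (assumed)

# LLM-TOC queries the oracle once per outer iteration, independent of T_total.
llm_toc_calls = K * C_LLM

# An online-inference method pays one LLM call per environment step.
online_calls = T_total * C_LLM

ratio = online_calls / llm_toc_calls  # relative LLM-call budget
```

Under these assumptions the per-step online method issues six orders of magnitude more LLM calls than the curriculum approach, which is the cost asymmetry the table is summarizing.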
Table 2. Final normalized zero-shot generalization performance (mean ± std) at 10M training steps, normalized to the Oracle PPO upper bound (1.0).
| Substrate | LLM-TOC | Hypothetical Minds | MAPPO | IPPO | Oracle PPO |
|---|---|---|---|---|---|
| Collaborative Cooking (Asymmetric) | 0.82 ± 0.04 | 0.65 ± 0.07 | 0.38 ± 0.09 | 0.32 ± 0.11 | 1.00 ± 0.03 |
| Prisoner’s Dilemma (Repeated) | 0.85 ± 0.03 | 0.71 ± 0.06 | 0.42 ± 0.08 | 0.35 ± 0.10 | 1.00 ± 0.02 |
| Running with Scissors (Arena) | 0.76 ± 0.05 | 0.58 ± 0.08 | 0.35 ± 0.07 | 0.29 ± 0.09 | 1.00 ± 0.04 |
| Running with Scissors (Repeated) | 0.78 ± 0.04 | 0.62 ± 0.07 | 0.39 ± 0.08 | 0.31 ± 0.10 | 1.00 ± 0.03 |
Table 3. Quantitative validation of Gradient Saliency Feedback via ablation. The downward arrow (↓) indicates a reduction in the number of steps required to reach 0.6 relative performance; fewer steps are better.
| Metric | Full LLM-TOC | w/o Saliency | Improvement |
|---|---|---|---|
| Steps to reach 0.6 rel. perf. | 3.5 × 10^6 | 7.2 × 10^6 | 51.4% steps ↓ |
| Final rel. perf. vs. Oracle PPO | 75–85% | 61–70% | 17.8% avg. gain |
| Convergence speed (per 10^6 steps) | 0.21 | 0.09 | 133.3% faster |
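The derived percentages in Table 3 follow directly from its raw entries; a quick arithmetic check:

```python
# Raw entries taken from Table 3; the percentages are recomputed from them.
steps_full, steps_ablated = 3.5e6, 7.2e6
step_reduction = (steps_ablated - steps_full) / steps_ablated * 100  # % fewer steps

conv_full, conv_ablated = 0.21, 0.09
speedup = (conv_full - conv_ablated) / conv_ablated * 100  # % faster convergence
```

Both recomputed values match the table (≈51.4% fewer steps and ≈133.3% faster convergence).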
Table 4. Empirical validation statistics of the ϵ-approximate semantic oracle assumption, calculated over 10 outer-loop iterations across 4 independent random seeds.
| Metric | Mean Value | Standard Deviation |
|---|---|---|
| Generation Success Rate | 92.5% | 3.2% |
| Policy Quality Consistency (CV) | 7.8% | 2.1% |
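The Policy Quality Consistency row reports a coefficient of variation (CV = standard deviation divided by mean). A minimal sketch shows how such a figure would be computed; the per-iteration quality scores below are made up for illustration, not taken from the paper:

```python
import statistics

# Hypothetical per-outer-iteration policy quality scores (illustrative only).
scores = [0.80, 0.74, 0.79, 0.82, 0.76, 0.78, 0.81, 0.75, 0.77, 0.80]

# Coefficient of variation in percent: population std / mean.
cv = statistics.pstdev(scores) / statistics.mean(scores) * 100
```

A low CV (such as the 7.8% reported in Table 4) indicates that the oracle's generated policies are of consistent quality across iterations and seeds.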

Share and Cite

MDPI and ACS Style

Wang, C.; Yuan, J.; Yu, T.; Jiang, X.; Xiang, L.; Zhang, J.; He, Z. LLM-TOC: LLM-Driven Theory-of-Mind Adversarial Curriculum for Multi-Agent Generalization. Mathematics 2026, 14, 915. https://doi.org/10.3390/math14050915

