LLM-Augmented Multi-Agent Reinforcement Learning for Cross-Scenario Knowledge Transfer

Li, Chao; Liu, Yanfei; Wang, Jieling; Wang, Zhong; Lu, Kewei; Wang, Chengjin

doi:10.3390/e28050525

Open AccessArticle

LLM-Augmented Multi-Agent Reinforcement Learning for Cross-Scenario Knowledge Transfer

by

Chao Li

¹

,

Yanfei Liu

^1,*,

Jieling Wang

¹,

Zhong Wang

¹,

Kewei Lu

¹ and

Chengjin Wang

^1,2

¹

Department of Basic Courses, Rocket Force University of Engineering, Xi’an 710025, China

²

Department of Optical Communication Networks, Information Support Force Engineering University, Wuhan 430035, China

^*

Author to whom correspondence should be addressed.

Entropy 2026, 28(5), 525; https://doi.org/10.3390/e28050525

Submission received: 11 February 2026 / Revised: 28 April 2026 / Accepted: 2 May 2026 / Published: 6 May 2026

Download

Browse Figures

Versions Notes

Abstract

Multi-agent reinforcement learning (MARL) relies on trial-and-error interactions to update policies. However, trial-and-error learning typically requires extensive interactions to achieve satisfactory performance, resulting in low sample efficiency, which limits its application in the real world. To reduce the trial-and-error costs of MARL and accelerate the convergence of multi-agent collaborative policies, we propose a MARL policy transfer method named LoLM-MARL, based on fine-tuning large language models (LLMs). First, leveraging the general world knowledge and reasoning capabilities of LLMs, low-rank adaptation (LoRA) is employed to fine-tune the pre-trained model on source tasks, thereby providing general decision-making knowledge for cross-scenario policy transfer. Second, a dynamic prompt construction method for LLMs is designed. By dynamically eliminating the state information of ineffective agents from the prompts, the method provides denser observation data for the large language model, thereby enhancing its policy representation capability in specific complex collaborative scenarios. Meanwhile, the dynamic prompt design concept enriches the training sub-scenarios for the algorithm, thereby laying the foundation for the model to learn more general decision-making knowledge. Finally, a Kullback–Leibler (KL) divergence regularization method based on an annealing strategy is constructed to ensure consistency between the policy distributions of the fine-tuned model and the pre-trained model, effectively mitigating the catastrophic forgetting problem during the fine-tuning process of the pre-trained model. Experimental results show that in zero-shot transfer tasks, LoLM-MARL achieves a maximum improvement of 101.4% in average win rate compared to existing state-of-the-art (SOTA) methods. In six few-shot transfer tasks, our method consistently achieves better generalization performance than traditional SOTA methods, and improves the convergence speed by 4 to 30 times compared to the training-from-scratch approach, providing a new solution paradigm for efficient policy transfer in complex dynamic environments.

Keywords:

multi-agent reinforcement learning; large language models; low-rank adaptation; knowledge transfer; dynamic prompt; annealing Kullback–Leibler divergence

1. Introduction

Multi-agent reinforcement learning (MARL), drawing inspiration from human trial-and-error exploration mechanisms, has demonstrated significant potential in addressing complex multi-agent decision-making problems [1]. In recent years, it has been widely applied in fields such as embodied intelligence [2], autonomous driving [3], and distributed systems [4]. Despite important progress in MARL, it still faces numerous challenges. Among these, the low sample utilization efficiency of algorithms is a core challenge [5], primarily reflected in the following: reinforcement learning relies on continuous trial-and-error to obtain environmental feedback signals, which are then used to guide policy updates. However, trial-and-error learning, without the guidance of prior knowledge, requires vast amounts of interactions to learn a satisfactory optimal policy, resulting in generally low exploration efficiency and sample utilization. This phenomenon becomes more pronounced in environments with sparse rewards [6], high-dimensional state spaces [7], and non-stationary environments [8], severely constraining the training quality and convergence efficiency of algorithms while increasing deployment costs and difficulties in practical applications.

To improve the sample efficiency of algorithms, that is, to reduce the amount of environment interaction data required for the algorithm to achieve threshold performance, researchers have proposed a series of targeted measures in recent years, among which knowledge reuse methods have received widespread attention [9,10]. Based on the concept of transfer learning, this approach applies the optimal collaborative policies learned by the algorithm in source tasks to unseen target tasks, enabling the algorithm to fully leverage previously acquired experience and policies for rapid initialization in new tasks, thereby reducing training costs. A core challenge in studying cross-scenario transfer for MARL lies in ensuring the effectiveness of knowledge transfer from source to target tasks. Currently, existing methods addressing the issue of effective knowledge transfer primarily focus on three aspects: online multi-task representation learning [11,12,13,14], offline multi-task universal skill learning [15,16,17], and universal subtask decomposition [18]. Online multi-task representation learning trains multiple source tasks in parallel to learn cross-task general knowledge, thereby improving the algorithm’s performance on a single unseen task. While this method helps enhance sample utilization efficiency and transfer performance, it also suffers from issues such as gradient conflicts [14] and training imbalance [11,14] caused by differences between tasks. Furthermore, online multi-task representation learning requires frequent interactions with the environment to simultaneously learn or fine-tune policies for different tasks, a process that is typically costly. In contrast, offline multi-task universal skill learning can learn more general collaborative skills only relying on static multi-task offline data, effectively reducing the computational and time overhead of online representation learning. Nevertheless, its performance highly depends on the quality of the training dataset. When the dataset lacks sufficient optimal trajectories or diversity, agents struggle to learn general skills and optimal policies from source tasks, which further limits their adaptability in new tasks. Different from multi-task MARL, the universal subtask decomposition method decomposes source tasks into several task-independent subtasks and endows them with cross-task general semantics to achieve effective knowledge transfer in target tasks. However, when facing complex situations such as significant task differences or high task diversity, the extracted “task-independent” subtasks often fail to meet the policy requirements of target tasks, thereby impairing the effectiveness of cross-task knowledge transfer.

In recent years, large language models (LLMs) have played a crucial role in various fields, including energy [19], medical [20], chemistry [21], and robot control [22], owing to their extensive world knowledge and powerful reasoning capabilities. Currently, several studies have combined LLMs with reinforcement learning (RL) and achieved preliminary results [23,24,25], yet the cross-disciplinary exploration of LLMs and MARL remains relatively limited. Inspired by this, this paper proposes a cross-scenario transfer method for MARL based on LLMs, utilizing the robust semantic understanding and task-reasoning capabilities of LLMs along with the low-rank adaptation (LoRA) [26] fine-tuning technique. This method employs the lightweight LoRA fine-tuning technique to preserve the general knowledge of the pre-trained model while enabling training on specific decision-making tasks and efficient cross-scenario transfer through fine-tuning low-rank parameter matrices on a small scale. It aims to address the performance bottlenecks of traditional MARL in policy transfer across complex scenarios, with the core objective of enhancing the cross-scenario generalization ability of the algorithm.

However, adapting LLMs to MARL primarily faces the following two challenges: (1) Although LLMs possess extensive prior knowledge, the type of knowledge may not be well-adapted to specific cross-scenario transfer decision-making tasks, thereby limiting their generalization ability across scenarios; and (2) during the fine-tuning process for complex decision-making tasks, LLMs often suffer from catastrophic forgetting. Even when only small-scale updates are made to the low-rank adaptation matrix parameters via LoRA, the model still cannot fundamentally avoid forgetting the existing knowledge of the frozen pre-trained model, thereby affecting the algorithm’s stable convergence in the source task. To address these challenges, the main contributions and solutions of this paper are as follows:

(1) To address the insufficient generalization of traditional MARL in cross-scenario transfer, we propose LoLM-MARL, a cross-scenario transfer method for MARL based on lightweight LoRA fine-tuning of large language models. It establishes a semantics-driven multi-agent collaborative decision-making mechanism in complex dynamic environments, representing an exploratory effort to enhance cross-scenario generalization in MARL with the assistance of LLMs.

(2) A dynamic prompt design scheme for LLMs is proposed to enhance LLMs’ feature extraction capability and generalization. This scheme dynamically eliminates redundant information in prompts, providing an LLM with state information of higher density to improve its semantic understanding and feature extraction effectiveness. Furthermore, this design naturally exposes the LLM to more diverse training scenarios during fine-tuning, facilitating the adaptation of general world knowledge to specific decision-making tasks and laying the foundation for zero-shot or few-shot policy transfer.

(3) An annealing-strategy-based KL divergence regularization method is proposed to address the catastrophic forgetting problem during LLM fine-tuning. This method constrains the action probability distributions between the pre-trained LLM and the fine-tuned LLM via KL divergence, ensuring that the fine-tuned model maintains a consistent action distribution with the pre-trained model in the early stage of policy exploration, thereby promoting stable convergence of LoLM-MARL. Meanwhile, the annealing strategy dynamically adjusts the constraint strength to flexibly balance policy exploration and exploitation.

2. Related Work

2.1. Generalization Challenges in MARL

When tackling complex multi-agent sequential decision-making problems, cross-scenario generalization ability is crucial [27]. This ability not only serves as a prerequisite for efficient exploration and high sample efficiency but also significantly reduces the cost of policy transfer, providing strong support for the practical deployment and engineering applications of MARL. For this reason, determining how to effectively enhance the generalization performance of MARL has gradually attracted increasing attention from scholars in recent years.

Currently, existing works mainly focus on two aspects: multi-task learning and subtask decomposition. Multi-task multi-agent reinforcement learning (MT-MARL) improves sample efficiency and transferability by training agents to simultaneously learn multiple related tasks. Depending on the mode of policy learning, MT-MARL can be categorized into online learning and offline training. Wang et al. [11] employed online multi-task pre-training to learn general cooperative policies. The core idea involves decoupling the perception and decision-making modules and extracting shared decision logic from similar tasks to enable fast adaptation to new tasks. Li et al. [12] explicitly modeled inter-agent dependencies via interactive value decomposition and utilized local trajectories of each agent to learn task representations for effectively evaluating task similarity, thereby achieving targeted cross-task knowledge transfer. Liang et al. [13] proposed a two-stage curriculum learning framework that refines policies learned in previous curricula into general policies through a gradient-based adaptive meta-learning algorithm, thus effectively alleviating the persistent adaptation problem in MARL multi-task training. Zhu et al. [14] tackled task difficulty disparity and imbalanced training in online multi-task learning by using pre-trained language models to encode task semantics. They employed Transformers to dynamically integrate observations with task representations and introduced a task-regret-based adaptive weighting mechanism, enabling efficient parallel training and cross-scenario generalization. Furthermore, Zhu et al. [28] addressed policy transfer across unrelated tasks with a hierarchical multi-task learning framework based on a skill graph. Their method comprises a high-level module for skill graph construction, selection, and combination via a scoring mechanism, and a low-level module for policy execution. This approach effectively enhances knowledge generalization across unrelated tasks. Although online multi-task learning plays an important role in enhancing the generalization of MARL, such methods inevitably incur high training costs and tend to suffer from gradient conflicts and training imbalance when task differences are substantial.

To overcome the inherent limitations of online MT-MARL, Zhang et al. [15] proposed an offline MT-MARL method, which learns task-agnostic general skills solely from offline datasets of multiple tasks and selects optimal skills under the centralized training with decentralized execution paradigm, finally achieving promising zero-shot generalization on unseen target tasks. To further improve generalization on unseen tasks, Liu et al. [16] proposed a hierarchical decoupled skill discovery framework. This framework learns effective skill representations from limited multi-task offline data by jointly learning general task skills and task-specific skills in parallel. Although offline MT-MARL substantially reduces training costs, its asymptotic performance in source tasks heavily depends on the quality of the offline data [29]. If the coverage of multi-agent trajectories is insufficient or lacks adequate optimal trajectories, agents will struggle to learn general decision-making skills, thereby affecting cross-scenario generalization ability.

Different from the above two approaches, Tian et al. [18] explored an alternative approach by decomposing source tasks into a series of task-agnostic general subtasks. Through a scalable subtask encoder and an adaptive semantic module, they effectively ensured the cross-scenario semantic consistency of subtasks, thereby achieving effective cross-scenario transfer. However, when facing complex tasks or tasks with significant differences, the adaptability of these subtasks is limited, consequently reducing cross-scenario transfer performance.

2.2. LLMs for Decision-Making

In recent years, with the rapid development of LLMs, they have shown great potential in decision-making tasks that require complex scenario understanding and reasoning capabilities [30,31]. Their powerful semantic understanding ability can more accurately understand the dependencies between agents and dynamic decision-making scenarios than traditional models. However, most existing studies focus on LLM-assisted single-agent reinforcement learning, and there is still a lack of systematic study on how to effectively leverage the advanced semantic understanding and reasoning abilities of LLMs in MARL. Li et al. [32] investigated sample efficiency and generalization in autonomous driving tasks, proposing an LLM-based hierarchical reinforcement learning framework. In this framework, the LLM serves as a high-level planner, providing semantic understanding and long-term goals. The low level promotes efficient policy learning through goal-conditioned skill transfer, and its generalization capability in unseen scenarios was validated within an autonomous driving simulation system. Similarly, reference [33] leveraged LLMs to enhance the performance of hierarchical reinforcement learning in complex long-horizon sequential decision-making tasks. This approach employs the LLM as a teacher agent to guide exploration in high-level policy, while the low-level adopts a policy library-based selective execution mechanism. The synergistic collaboration between the two effectively addresses the exploration challenge in long-horizon decision-making under sparse rewards. Zhou et al. [34] addressed path planning for heterogeneous agents by integrating the local exploration advantages of Q-learning with the global understanding capability of LLMs for unknown environments. By designing a problem-modeling-oriented prompting strategy, they transformed complex scheduling tasks into semantic descriptions understandable by LLMs, significantly improving path planning efficiency and algorithm convergence. Zhang et al. [35] utilized the reasoning and analysis capabilities of LLMs to optimize the iterative process of RL, effectively solving the joint optimization problem of energy cost control and carbon emission reduction in energy systems. Wang et al. [36] introduced LLMs to optimize the training process of RL and neural networks, achieving high-precision target tracking control of a bionic robotic arm. Dalal et al. [37] used LLMs to generate high-level expert motion plans, decompose complex tasks into phased subgoals, and guide a single reinforcement learning policy to execute step-by-step. This method significantly improves the ability of robots to complete long-horizon tasks and achieves a success rate of over 85% across benchmark tests.

In summary, existing research demonstrates that integrating LLMs with RL, particularly within hierarchical reinforcement learning architectures, has become an effective paradigm for addressing key challenges such as long-horizon decision-making, sparse rewards, and cross-scenario generalization in single-agent settings. However, current successful applications remain largely confined to single-agent scenarios. The rich world knowledge and powerful reasoning capabilities inherent in LLMs also hold significant potential to enable efficient learning and generalization for MARL in complex decision-making environments. Inspired by this potential, this paper introduces LLMs into the domain of MARL, focusing specifically on addressing the challenge of limited cross-scenario generalization in traditional MARL. This work represents an exploratory effort at the intersection of LLMs and MARL.

3. Preliminaries

3.1. Dec-POMDPs

Multi-agent reinforcement learning can be viewed as a decentralized partially observable Markov decision process (Dec-POMDPs) [38], typically defined by

〈S, A, T, R, O, n, γ〉

.

S

is the state space,

A

is the joint action space,

T : S \times A \times S^{'} \to [0, 1]

is the state transition probability, representing the probability of transitioning from state

S

to the next state

S^{'}

given the joint action

A = (a_{1}, \dots, a_{n})

.

R : S \times A \times S^{'} \to R

denotes the instant reward after transitioning from state

S

to state

S^{'}

following the joint action.

O

is the set of local observations for all agents in state

S

,

n

is the number of agents, and

γ \in [0, 1]

is the discount factor used to balance instant rewards and future returns. At each decision time step, each agent

i

selects an action

a_{i}

based on its local observation

o_{i}

and its policy

π_{θ} (a_{i} | o_{i})

, forming a joint action

a \in A_{n}

. The joint action transitions the multi-agent system to the next state

S^{'}

according to the state transition function

T

. The environment then provides an immediate reward

r \in R (s, a)

to the agents based on the joint action. Each agent learns a policy to maximize the future discounted cumulative reward

J (θ) = E [\sum_{t = 0}^{\infty} γ^{t} R (s^{t}, a^{t})]

.

3.2. Multi-Agent Proximal Policy Optimization (MAPPO)

Multi-Agent Proximal Policy Optimization (MAPPO) [39] is the multi-agent extension of the PPO. Its strong performance, simplicity, and excellent compatibility with the centralized training with decentralized execution (CTDE) framework have made it widely adopted in MARL research. Based on the Actor–Critic framework, MAPPO trains individual policy networks for each agent while sharing a common value network across all agents. During policy learning, the objective is to maximize the following function:

L (θ) = [\frac{1}{B n} \sum_{i = 1}^{B} \sum_{k = 1}^{n} \min (r_{θ, i}^{(k)} A_{i}^{(k)}, c l i p (r_{θ, i}^{(k)}, 1 - ϵ, 1 + ϵ) A_{i}^{(k)})] + σ \frac{1}{B n} \sum_{i = 1}^{B} \sum_{k = 1}^{n} S [π_{θ} (o_{i}^{(k)})]

(1)

where

r_{θ, i}^{(k)} = π_{θ} (a_{i}^{(k)} | o_{i}^{(k)}) / π_{θ_{o l d}} (a_{i}^{(k)} | o_{i}^{(k)})

,

A_{i}^{(k)}

denotes the advantage function,

S

represents the policy entropy,

σ

is the entropy coefficient hyperparameter, and

B

denotes the batch size.

The value network is updated by minimizing the following function:

L (ϕ) = \frac{1}{B n} \sum_{i = 1}^{B} \sum_{k = 1}^{n} m a x [C, D]

(2)

where

C = V_{ϕ} {(s_{i}^{(k)} - {\hat{R}}_{i})}^{2}

,

D = {(c l i p (V_{ϕ} (s_{i}^{(k)}), V_{ϕ_{o l d}} (s_{i}^{(k)}) - ε, V_{ϕ_{o l d}} (s_{i}^{(k)}) + ε) - {\hat{R}}_{i})}^{2}

,

{\hat{R}}_{i}

denotes the cumulative discounted reward.

3.3. Low-Rank Adaptation (LoRA)

Low-rank adaptation (LoRA) [26], as a mainstream parameter-efficient fine-tuning method, is inspired by the findings of Aghajanyan et al. [40]: while the weight matrices of pre-trained models are full-rank, weight updates during task-specific adaptation often exhibit low-rank characteristics. Building on this insight, LoRA freezes the original pre-trained weights and decomposes the weight update matrix into two low-rank parameter matrices. By updating only these two low-dimensional matrices, LoRA enables efficient fine-tuning in target domains. Its mathematical principle is as follows:

For a pre-trained weight matrix

W_{0} \in R^{d \times k}

, it is kept frozen during fine-tuning on downstream tasks, while the weight update matrix

Δ W

is decomposed into the product of two low-rank matrices

A \in R^{r \times k}

and

B \in R^{d \times r}

:

Δ W = B A

(3)

And then, the updated model weight becomes:

W = W_{0} + Δ W = W_{0} + B A

(4)

During forward propagation, assuming the input is

x

, the model output after incorporating the low-rank parameter matrices can be expressed as:

h = W_{0} x + Δ W x = W_{0} x + B A x

(5)

Figure 1 shows the framework of LoRA. Here, matrix

A

is initialized with a random gaussian distribution, and matrix

B

is initialized as a zero matrix to ensure consistency between the model behavior and the initial performance of the pre-trained model. During forward propagation, the input

x

is fed in parallel into the frozen original weight branch and the trainable low-rank weight branch. The final output

h = W_{0} x + B A x

is the sum of the original output and the low-rank updated output.

4. Methodology

In this section, we will provide a detailed introduction to the LoLM-MARL, which primarily comprises the following three components: (1) dynamic prompt construction for LLM; (2) mitigation of catastrophic forgetting in the LLM during fine-tuning; and (3) network architecture design for adaptive agent numbers in the transfer process. Accordingly, the section begins with an overview of the overall algorithm framework, followed by a detailed exposition of the specific implementation details for each component.

4.1. Overview of LoLM-MARL

The overall framework of LoLM-MARL is shown in Figure 2. The algorithm mainly consists of two components: the LLM policy network fine-tuned via LoRA, and the value network based on the Transformer. For the policy network, we first perform semantic mapping on the raw numerical observation matrix and construct dynamic LLM input according to the prompt template. To ensure training stability, the hidden state features of both the frozen pre-trained LLM and the fine-tuned LLM are fed into the shared-attention action head of the fine-tuned LLM. The action probability distributions of the two are constrained by KL divergence with an annealing strategy to alleviate catastrophic forgetting during fine-tuning. The final action output of the policy network is generated by the LoRA-fine-tuned LLM, while the frozen pre-trained LLM only contributes to the KL loss calculation and does not directly participate in action generation. Furthermore, to enable effective adaptation to dynamic changes in the number of agents during cross-scenario transfer, we have designed specialized structures for both the policy network and the value network. Detailed implementation specifics are provided in Section 4.4.

4.2. Dynamic Prompt Design for LLM

In reinforcement learning, the data is derived from the agent’s observations of the interactive environment, and such observations are typically represented as structured numerical tensors, whereas LLM-based reinforcement learning requires semantically rich natural language sequences as input. To bridge this gap, this section introduces a customized dynamic prompt framework for the StarCraft Multi-Agent Challenge (SMAC) environment. To enhance the cross-scenario transferability of the fine-tuned model, we dynamically filter out information of deceased agents from the prompts, thereby reducing irrelevant state information that could interfere with LLMs’ decision-making. This dynamic processing naturally covers various types of multi-agent adversarial scenarios, ensuring the algorithm’s transfer capability across different environments. Furthermore, the adaptive adjustment of prompt sequence length reduces the time required for tokenization and forward inference in the LLM, thereby improving the overall efficiency of model fine-tuning.

Before constructing the LLM prompts, the raw numerical observation matrix must first undergo semantic mapping. To illustrate the transformation more clearly, Table 1 and Table 2 detail the composition of agent observations and actions in SMAC, respectively.

Taking the 3s5z_vs_3s6z scenario as an example, Figure 3 illustrates the format transformation of the observation data for agent 0 before and after mapping. Figure 3a shows the observation matrix of 8 agents, where each row vector integrates observation information, including allies, enemies, and the agent itself. Each position in the observation vector is composed of elements listed in Table 1 and Table 2. Among them, numerical values are uniformly rounded to two decimal places, and the entity data mapping follows the order in the original observation vector (allies → enemies → self). By mapping the raw observation information of all agents into the natural language descriptions shown in Figure 3b, the observation data becomes more semantically expressive, laying the foundation for the subsequent construction of dynamic prompts.

Based on the above semantic observation data, we proceed to construct the LLM prompt. To provide the LLM with denser prompt information, we adopt a dynamic prompt construction method to timely remove the observation information of dead agents, thereby reducing interference with its decision-making. Moreover, the dynamic prompt design naturally covers more adversarial training scenarios, thus ensuring the transfer performance of the fine-tuned model and promoting the generalization of the algorithm across different scenarios. Figure 4 illustrates the design idea of dynamic prompt. Figure 4a shows the prompt for agent 0 in the 3s5z_vs_3s6z scenario, which consists of four components: LLM instruction, goal, current game state, and available actions. The LLM instruction specifies the functional requirements and constraints for the model; the goal defines the long-term and short-term tasks of the agent; the current game state includes observations of the agent itself, allied units, and enemy units. As shown in Figure 4b, the enemy formation consists of 6 combat units (the green circle), corresponding to only 6 semantic observation entries for enemy units in the left figure. Similarly, it can be seen from the available actions that there are 6 attackable agents at present. The available actions cover all valid actions available to agent 0 at the current timestep. Figure 4 only shows part of the prompt information for agent 0. For detailed information, see Appendix A.

4.3. Policy Annealing Alignment for Pre-Trained and Fine-Tuned LLMs

During the fine-tuning of LLMs, catastrophic forgetting is commonly observed. Although LoRA achieves efficient task-specific adaptation by freezing the pre-trained parameters of LLMs and only training low-rank adaptation matrices, the model still cannot fundamentally avoid catastrophic forgetting when handling complex decision-making tasks, which adversely affects the stable convergence of the algorithm. To address this issue, this section adopts the KL divergence regularization method to constrain the action probability distributions of the frozen pre-trained model and the fine-tuned model. Additionally, an annealing strategy is introduced to dynamically adjust the regularization strength, effectively balancing exploration and exploitation of the policy. The specific implementation details are as follows.

The loss function of the LoLM-MARL policy network is shown in Equation (6), which consists of two components: the policy loss of MAPPO and the KL divergence loss:

L_{t o t a l} (θ) = L_{M A P P O} (θ) + λ \cdot L_{K L} (θ)

(6)

Here,

L_{M A P P O} (θ)

consists of two terms: the clipped policy loss and the entropy regularization loss, as shown in Equation (7), and

L_{K L} (θ)

is given by Equation (8):

L_{M A P P O} (θ) = [- \frac{1}{B n} \sum_{i = 1}^{B} \sum_{k = 1}^{n} \min (r_{θ, i}^{(k)} A_{i}^{(k)}, c l i p (r_{θ, i}^{(k)}, 1 - ϵ, 1 + ϵ) A_{i}^{(k)})] - σ \frac{1}{B n} \sum_{i = 1}^{B} \sum_{k = 1}^{n} S [π_{θ} (o_{i}^{(k)})]

(7)

\begin{array}{l} L_{K L} (θ) = \frac{1}{B n} \sum_{i = 1}^{B} \sum_{k = 1}^{n} D_{K L} (\underset{F r o z e n L L M}{\underset{⏟}{π_{θ_{0}} (o_{i}^{(k)})}} ∥ \underset{L o R A L L M}{\underset{⏟}{π_{θ} (o_{i}^{(k)})}}) \\ = \frac{1}{B n} \sum_{i = 1}^{B} \sum_{k = 1}^{n} π_{θ_{0}} (o_{i}^{(k)}) l o g \frac{π_{θ_{0}} (o_{i}^{(k)})}{π_{θ} (o_{i}^{(k)})} \end{array}

(8)

where

B

is the batch size,

n

is the number of agents, and

σ

is the entropy coefficient hyperparameter.

Unlike typical KL divergence designs, to ensure the diversity of policy learning in the fine-tuned model, this paper adopts an annealing design for the regularization parameter

λ

of KL divergence. Its core idea is to dynamically adjust the constraint strength imposed by the frozen LLM on the fine-tuned LLM at different stages of policy learning, thereby balancing alignment between the two policies and the diversity of exploration. In the initial learning stage, due to the high randomness of the model’s exploration mechanism, a stronger constraint should be applied. As the network is continuously updated and the model gradually learns the optimal policy, the constraint strength should decay gradually with the training timesteps. The detailed parameter design of

λ

is given as follows:

λ_{t} = \{\begin{cases} λ_{0} - \frac{λ_{0}}{t_{1}} * t i f t < t_{1} \\ 0 e l s e \end{cases}

(9)

λ_{0} = \frac{L_{M A P P O} (θ)}{L_{K L} (θ)}

(10)

At the beginning of training, to balance the MAPPO loss and the KL divergence loss and avoid a significant gap between them,

λ_{0}

is set according to Equation (10). As the timestep

t

increases,

λ

decreases linearly at a decay rate

λ_{0} / t_{1}

until it reaches zero at training step

t_{1}

. Here,

t_{1}

is a manually tuned hyperparameter. It should be noted that the KL divergence loss is only applied during the LoRA fine-tuning and is not involved in either few-shot transfer or zero-shot transfer.

Since LoRA is initialized to zero, the resulting issue is that the KL divergence loss may be close to zero at the beginning of training, which makes Equation (10) numerically unstable. To ensure the stability of algorithm training during the early stage of policy exploration, we adopt the following safeguard mechanism to avoid the aforementioned problem. First, we apply smoothing to the denominator of Equation (10) by using

\max (KL, 1 \times 10^{- 4})

as the denominator to avoid numerical instability caused by division by zero. Second, based on the loss values obtained from multiple cross-validations of

L_{K L} (θ)

and

L_{M A P P O} (θ)

, we reasonably set the hyperparameter

λ_{0}

to avoid gradient explosion caused by an excessively large

λ_{0}

resulting from over-pursuing the balance between

L_{K L} (θ)

and

L_{M A P P O} (θ)

.

4.4. Adaptive-Scale Transfer Network Architecture

Cross-scenario transfer in MARL inevitably faces challenges caused by dynamic changes in the number of agents. Specifically, deep reinforcement learning approximates the optimal policy via neural networks, whose weight dimensions are usually fixed, resulting in mismatches between input–output dimensions and network weights during transfer. In response to these challenges, this section will focus on the targeted design of policy network and value network to achieve flexible cross-scenario transfer.

For the policy network, state feature extraction relies on the LLM. The Transformer architecture [41] of the LLM inherently supports the processing of variable-length sequence inputs. Its core parameter matrix dimensions are only related to the hidden feature dimension and are completely decoupled from sequence length. Changes in the number of agents only affect the size of intermediate computation matrices rather than the network structure itself. However, for the action head of the policy network, the action output dimension of the traditional linear layer is related to the number of agents and remains fixed. To enable the feature dimension of the action head to dynamically adapt to variations in the number of agents, this section proposes an attention action head by considering the characteristics of the interactive environment. Specifically, we decouple the action head into two components: a base action head and an attack action head (as shown in Figure 2). The action head takes the hidden state

h \in R^{d}

output by the LLM as input, where

d

is the hidden dimension of the LLM, and uses this hidden state as the state feature query for the action head. The action head maintains two learnable parameter matrices: a base action embedding

W_{base} \in R^{6 \times d}

, corresponding to six fixed actions (no operation, stop, move north, move south, move west, move east), and an attack action embedding

W_{attack} \in R^{d}

, which is dynamically replicated

N_{enemy}

times according to the current number of enemy units in the environment to form the attack action matrix. The base action embedding and the attack action embedding are concatenated to obtain a complete and adaptively adjustable action matrix

Q \in R^{(6 + N_{enemy}) \times d}

. Subsequently, the action logits are obtained through the dot product operation

logits = h \cdot Q^{T}

, where the output dimension is

R^{(6 + N_{enemy}) \times B}

, with

B

denoting the batch size. Finally, unavailable actions are masked out, and a softmax function is applied to obtain the final action probability distribution.

The architecture of the Transformer-based value network is shown in the lower right part of Figure 2. The network takes the global observation as input and leverages the Transformer architecture to model variable-length observation sequences. The specific implementation is as follows: First, the global observation data is expanded into a third-order tensor and linearly mapped to a high-dimensional state space, which is then fed into the Transformer encoder layers. Subsequently, the high-dimensional state features extracted by the Transformer encoder are mean-pooled to achieve an effective representation of variable-length observations. Finally, the state value is output through a fully connected layer. This state value, together with the immediate reward from the environment, is used to compute the advantage function, which in turn guides the parameter update of the policy network.

5. Experiments

In this section, we will validate the cross-scenario generalization performance of LoLM-MARL in the StarCraft Multi-Agent Challenge (SMAC) environment [42]. First, we provide a brief introduction to each scenario used in the experiments. Next, we detail the baselines, experimental parameter configurations, and performance evaluation metrics. Subsequently, we evaluate the algorithm’s performance in single-task, zero-shot transfer, and few-shot transfer settings, and conduct an in-depth analysis of its cross-scenario generalization capabilities using commonly adopted core evaluation metrics for transfer learning [43]. Finally, we perform ablation studies to assess the contribution of each module to the algorithm’s overall performance.

5.1. Environments

The SMAC is a multi-agent reinforcement learning algorithm testing platform developed based on the real-time strategy game StarCraft II. It is one of the most widely used standard testing environments in the field of MARL. This testing environment consists of a series of collaborative multi-agent adversarial tasks, where the algorithm is required to control friendly agent combat units to engage with enemy units controlled by the built-in game AI, with the goal of achieving victory by eliminating all enemy units. Table 3 provides detailed information on the allocation of enemy and friendly combat unit numbers, as well as scenario characteristics, for both single-task and cross-scenario transfer used in the experiments. Among these, symmetric scenarios refer to scenarios where both sides have equal combat strength, while in asymmetric scenarios, there is a numerical disparity in the forces between the two sides.

5.2. Methods and Metric

5.2.1. Baselines

To validate the advantages of the LoLM-MARL, we selected three classic cross-scenario transfer methods for comparative analysis on three types of tasks: single-task learning, zero-shot transfer, and few-shot transfer. The core concepts of each algorithm are as follows:

DT2GS [18]: The core concept of this method lies in decomposing complex tasks into general subtasks that are independent of specific tasks. By utilizing scalable subtask encoders and adaptive subtask semantic modules, it aims to reduce the risk of overfitting to the source task and enhance the generalizability of the subtasks, thereby achieving effective cross-scenario transfer.
ASN-Attention [44]: This method constructs a general action semantic network and decomposes the decision-making network into multiple sub-modules, each responsible for processing observation information corresponding to a specific action type, thereby eliminating interference from irrelevant information. This concept provides important insights for the development of general multi-agent decision models. For a fair comparison, we incorporate an attention mechanism into this method to enable cross-task transfer capability.
UPDeT [45]: UPDeT decouples the observations and actions of agents, constructs specific entity-action mappings, and decomposes the policy learning task into a series of entity-centric, independently computable sub-decision modules. These modules are then flexibly coordinated through an attention mechanism to enhance adaptability to diverse tasks.

5.2.2. Hyperparameter Settings

The hyperparameters used in the experiment consist of two parts: the baseline MAPPO parameter settings for LoLM-MARL and the LoRA-related parameter settings. All fixed hyperparameter configurations for the task scenarios are shown in Table 4. The pre-trained LLM employed in the experiment is Qwen3-0.6b. The number of training episodes and the annealing steps for the KL divergence are flexibly adjusted according to the difficulty level of each map. Both the representation dimensions for basic actions and attack actions are set to 1024. The number of basic actions is fixed at six, while the number of attack actions varies depending on the quantity of enemy units in the task.

All experiments were conducted on a high-performance computing server equipped with a 104-core Intel Xeon Platinum 8470Q CPU and an NVIDIA RTX PRO 6000 GPU with 96 GB memory. The software environment included PyTorch 2.7.0 and Python 3.10.18.

5.2.3. Performance Metrics

To comprehensively evaluate the performance of LoLM-MARL, we adopt two types of evaluation metrics according to different evaluation objectives. Specifically, for single-task, zero-shot, and few-shot transfer experiments, we focus on the cooperative capability of the algorithm in specific tasks. For generalization performance analysis, we focus on the quality and efficiency of cross-scenario transfer. The specific metrics are as follows:

(1) Single-task, zero-shot, and few-shot transfer experiments: Following the common practice in SMAC [42], we adopt the average test win rate across multiple random seeds as the core evaluation metric to measure the cooperative performance of the algorithm in multi-agent combat tasks.

(2) Generalization performance analysis: Following the evaluation framework for transfer learning performance in reference [43], we assess the cross-scenario generalization capability of the algorithm from the following three dimensions:

Jumpstart: the win rate at the first training timestep after transferring to the target task, reflecting the effectiveness of knowledge transfer from the source task.
Asymptotic performance: the final win rate of the algorithm upon convergence, reflecting the upper bound of its capability.
The number of training steps required to reach asymptotic performance: the number of interaction steps required to achieve asymptotic performance reflects the learning speed of the algorithm on the target task.

5.3. Performance on Single-Task

This section will provide a detailed comparative validation of LoLM-MARL’s performance on single-task learning. The experiment selects eight different scenarios covering both symmetric and asymmetric tasks, tested across three levels of difficulty: easy tasks (3s_vs_4z), hard tasks (3s5z, 8m_vs_9m, 5m_vs_6m, 10m_vs_11m, 3s_vs_5z), and super-hard tasks (3s5z_vs_3s6z, MMM2).

Figure 5 presents the comparative experimental results of LoLM-MARL on single-task learning. The horizontal axis represents the number of training steps, while the vertical axis records the algorithm’s test win rate in each scenario. The red curve corresponds to the proposed method in this paper, where the solid line indicates the average of three runs with different random seeds, and the shaded area represents the standard deviation across these three runs. As shown in the figure, for the easy task 3s_vs_4z, the performance of all algorithms except ASN-Attention is relatively similar. Notably, LoLM-MARL converges around 1 × 10⁶ training steps, demonstrating significantly faster convergence compared to DT2GS and UPDeT. In hard tasks, LoLM-MARL shows absolute advantages in both asymptotic performance and convergence efficiency on 5m_vs_6m and 3s_vs_5z. For super-hard tasks, UPDeT slightly outperforms the proposed method, yet the LoLM-MARL achieves the fastest convergence speed. This indicates that the integration of LLM effectively enhances decision-making in complex MARL tasks.

5.4. Zero-Shot Generalization Across Scenarios

In the last section, a detailed comparative experiment and analysis were conducted on the algorithm’s collaborative performance in single tasks. This section will perform zero-shot cross-scenario transfer experiments on the policies learned by LoLM-MARL in single-task scenarios. Specifically, we deploy the collaborative policy learned by the algorithm in one scenario directly into other unseen scenarios without additional training, to verify the algorithm’s transfer reasoning capability across different scenarios. To this end, this section designs six cross-scenario zero-shot transfer experiments, namely

2 s 3 z \to 1 c 3 s 5 z

,

3 s_v s_4 z \to 3 s_v s_5 z

,

8 m_v s_9 m \to 5 m_v s_6 m

,

3 s 5 z \to 3 s 5 z_v s_3 s 6 z

,

8 m \to 10 m_v s_11 m

and

5 m_v s_6 m \to 10 m_v s_11 m

. These six transfer experiments cover three distinct levels of cross-scenario transfer forms. All experiments follow the design principle of transferring from a simple source task to a complex target task, with the increased complexity mainly reflected in the following three aspects: transfer from symmetric to asymmetric scenarios, an increase in the disparity between enemy and friendly forces in asymmetric scenarios, and transfer between entirely different scenarios.

Table 5 presents the comparative experimental results of zero-shot transfer. The results indicate that the proposed method possesses strong cross-scenario transfer capability, particularly evident in transfers between entirely different types of collaborative scenarios (

2 s 3 z \to 1 c 3 s 5 z

) and transfers between asymmetric challenging collaborative scenarios (

8 m_v s_9 m \to 5 m_v s_6 m

,

5 m_v s_6 m \to 10 m_v s_11 m

).

In contrast, in transfer task

2 s 3 z \to 1 c 3 s 5 z

, UPDeT shows weaker adaptability to completely unfamiliar scenarios, while LoLM-MARL achieves a transfer win rate on the target task that is twice that of the suboptimal algorithm DT2GS. This indicates that the proposed method can learn more general and transferable policies from the source task, exhibiting a lower rate of knowledge forgetting compared to other algorithms. For transfers between asymmetric challenging collaborative scenarios, Figure 6 clearly illustrates the significant advantages of LoLM-MARL in transfer tasks

8 m_v s_9 m \to 5 m_v s_6 m

and

5 m_v s_6 m \to 10 m_v s_11 m

. In this figure, dots represent the average win rate, thick vertical bars indicate the standard deviation of the data, and thin vertical bars show the extreme values. It can also be observed from the figure that the ASN-Attention generally exhibits weak cross-scenario transfer capability. Combining with the performance of each algorithm in single-task scenarios shown in Figure 5, a conclusion can be drawn: strong collaborative capability in a single task is a crucial foundation for cross-scenario transfer. If an algorithm fails to learn robust and effective cooperative policies in a single task, it is also unlikely to achieve stable transfer performance during cross-scenario adaptation.

5.5. Few-Shot Generalization Across Scenarios

In Section 5.4, extensive experiments were conducted to validate the zero-shot cross-scenario transfer performance of LoLM-MARL. To further verify the cross-scenario transfer capability of the proposed method under few-shot conditions, this section similarly follows the experimental design of the previous section to carry out comparative experiments across six cross-scenario tasks. Different from the experimental setup in the previous section, few-shot cross-scenario transfer refers to fine-tuning the parameters of the algorithm trained on the source task using a small number of samples in the target task, thereby enabling the algorithm to quickly adapt to the new scenario. Due to the varying difficulty levels of the tasks, the number of training timesteps for each transfer task is not exactly the same. Based on the convergence observed in multiple cross-validation experiments of the algorithm, this section sets the interaction timesteps for the four tasks

2 s 3 z \to 1 c 3 s 5 z

,

8 m_v s_9 m \to 5 m_v s_6 m

,

3 s 5 z \to 3 s 5 z_v s_3 s 6 z

and

8 m \to 10 m_v s_11 m

to 3 × 10⁵, for task

3 s_v s_4 z \to 3 s_v s_5 z

to 1 × 10⁶, and for task

5 m_v s_6 m \to 10 m_v s_11 m

to 6 × 10⁵.

Figure 7 presents the transfer experimental results of the fine-tuned algorithms and their corresponding original algorithms under few-shot data. Overall, LoLM-MARL demonstrates clear advantages in each of the few-shot transfer tasks. Specifically, it achieves win-rate advantages of approximately 20% and 10% over the suboptimal methods in transfer tasks

8 m_v s_9 m \to 5 m_v s_6 m

and

5 m_v s_6 m \to 10 m_v s_11 m

. Meanwhile, LoLM-MARL attains nearly 100% win rates in the other transfer tasks, effectively illustrating the generalization capability of LLM-based multi-agent reinforcement learning in cross-scenario transfer. In contrast, ASN-Attention struggles to achieve effective transfer to unseen scenarios. We analyze that a possible reason is the distributional shift in task semantics during scenario switching. Relying solely on predefined static primitive action semantics fails to adequately match the action requirements of new tasks, thereby limiting the algorithm’s transfer capability.

The learning speed of an algorithm on the target task largely reflects its transfer capability. How to effectively and rapidly apply knowledge learned from the source task to the target task is one of the core metrics for evaluating algorithm transfer performance. Based on the experimental results shown in Figure 7, it can be observed that LoLM-MARL_finetune achieves a high win rate within only 6400 training steps, particularly in transfer tasks

3 s 5 z \to 3 s 5 z_v s_3 s 6 z

and

8 m \to 10 m_v s_11 m

, where it nearly reaches the asymptotic performance of algorithmic convergence. This clearly demonstrates that the fine-tuning approach can fully leverage prior knowledge acquired from the source task to enable rapid learning in the target task.

To demonstrate the necessity and effectiveness of cross-scenario transfer research, we also compared the performance of the original algorithm and the fine-tuned algorithm on the target task. As shown in Figure 7, in most test scenarios, only the original LoLM-MARL algorithm achieved a certain win rate within limited training steps, but its stability (shaded area in the figure) was significantly weaker than that of the fine-tuning-based method (LoLM-MARL_finetune). This indicates that without guidance from prior knowledge of the source task, the model struggles to learn effective and stable cooperative policies in a short time. In contrast, the fine-tuning method introduces collaborative experience learned from the source task, providing the model with reasonable initialization directions. This approach significantly reduces the policy exploration space and accelerates the convergence speed of the algorithm on the target task.

5.6. Generalization Analysis

Evaluating the cross-scenario generalization performance of MARL primarily involves two aspects: first, whether the algorithm can leverage policies learned from the source task to guide its learning in similar but distinct target tasks; and second, how well the algorithm utilizes existing knowledge. This section will address these two questions and systematically conduct an in-depth analysis of the generalization capability of LoLM-MARL by integrating performance metrics [42] from transfer reinforcement learning. The analysis will proceed in two steps: first, by comparing the performance of the fine-tuned algorithm LoLM-MARL_finetune and the original algorithm LoLM-MARL on the target task to verify whether the algorithm possesses cross-scenario generalization ability; and second, by comparing the LoLM-MARL_finetune algorithm with other transfer algorithms to demonstrate the generalization capability of the proposed method.

Regarding the first question, we compare and analyze the performance of the fine-tuning method and the original learning-from-scratch method on the target task. Based on transfer learning performance metrics, this section focuses on three commonly used indicators: jumpstart, asymptotic performance, and the number of training steps required to reach asymptotic performance. Table 6 details the experimental results across these metrics. It can be seen that the learning-from-scratch method LoLM-MARL exhibits almost no learning capability in the early training stage. In contrast, the fine-tuned algorithm LoLM-MARL_finetune demonstrates higher initial learning performance, which helps the algorithm explore a more optimal policy space and provides important support for rapid convergence on unseen tasks. In terms of asymptotic performance, LoLM-MARL_finetune achieves slight improvements across all six transfer tasks. We analyze the possible reasons as follows: the integration of LLM and MARL significantly enhances LoLM-MARL’s performance in single-task scenarios, nearly achieving the optimal test win rate. Therefore, in terms of algorithm limitations, the single-task performance itself already approaches the theoretical upper limit of the environment, leaving relatively limited room for potential improvement through transfer learning. Regarding asymptotic convergence speed, Table 6 shows that LoLM-MARL_finetune achieves a 4–30× speedup, primarily attributed to the general decision-making knowledge learned by the LLM from the source task. This provides an effective initial exploration policy for the algorithm, significantly reducing exploration costs in the target task and minimizing the probability of random and ineffective exploration during the early learning stage.

Next, we will explore the cross-scenario generalization capability of the proposed method by comparing it with other transfer methods. Figure 8 illustrates the transfer performance of four methods on the target task, where the red line represents jumpstart on the target task, and the blue line represents the asymptotic performance. As can be clearly seen from the figure, except for a slightly lower jumpstart than the UPDeT_finetune algorithm in the transfer task

3 s_v s_4 z \to 3 s_v s_5 z

, LoLM-MARL_finetune demonstrates significant advantages in both initial and asymptotic performance across all other transfer tasks. It is worth noting that although UPDeT_finetune outperforms LoLM-MARL_finetune in initial performance in transfer task

3 s_v s_4 z \to 3 s_v s_5 z

, its asymptotic performance falls short of the proposed method. This indirectly reflects the learning potential of our method in cross-scenario transfer tasks. In terms of convergence speed, based on the few-shot transfer experimental results in Figure 7, the proposed method exhibits faster convergence, with the advantages being more pronounced in challenging tasks. Furthermore, combining the visual results from Figure 8 reveals that jumpstart performance serves as the foundation for asymptotic performance, with a clear positive correlation between the two. Strong initial learning performance helps the algorithm explore more promising solution spaces early in training, facilitating subsequent cumulative reward improvement and final policy convergence.

5.7. Ablation Studies

To thoroughly validate the effectiveness of each module in LoLM-MARL, this section conducts ablation analyses on the LLM dynamic prompt module and the KL divergence regularization module based on the annealing strategy. For the dynamic prompt module, we primarily analyze its impact on transfer performance; for the KL divergence regularization module, the focus is on its influence on stability during source task training.

Taking transfer task

8 m_v s_9 m \to 5 m_v s_6 m

as an example, Figure 9a presents a comparison of transfer performance on the target task between LoLM-MARL and LoLM-MARL w/o dynamic prompt, which employs a fixed prompt scheme. The figure clearly shows that the dynamic prompt design significantly improves both the initial performance and the asymptotic performance of the algorithm on the target task, while also conferring higher transfer stability. This indicates that dynamic prompts enable the algorithm to cover a wider variety of training scenarios during source task learning, thereby facilitating the acquisition of more general collaborative policies and ultimately enhancing its cross-scenario generalization capability.

The ablation experimental results for KL divergence are shown in Figure 9b. It can be observed that LoLM-MARL w/o KL Divergence exhibits a performance decline after approximately 2.35 × 10⁶ training steps. This is due to the lack of KL divergence constraints during the early stages of LoRA fine-tuning, which causes the fine-tuned model to deviate too rapidly from the prior distribution of the pre-trained LLM. This increases the randomness of policy exploration, leads the algorithm to converge prematurely to suboptimal solutions, and gradually manifests catastrophic forgetting as training progresses. In contrast, the LoLM-MARL leverages KL divergence to regularize the fine-tuned LLM, ensuring that the policy distribution remains aligned with that of the pre-trained LLM during the initial stage of fine-tuning. This guides the algorithm toward exploring a more optimal policy space.

To further verify that the improvements of LoLM-MARL in single-task performance and cross-scenario generalization capability are mainly attributed to the pre-trained prior knowledge and reasoning abilities of the LLM, rather than the gain from parameter scaling, we designed an ablation experiment: replacing the pre-trained LLM in the policy network with a randomly initialized Transformer model (0.6B) of the same parameter scale. The experimental results are shown in Figure 10.

Figure 10a presents the comparative transfer experiment results between the randomly initialized Transformer model (0.6B) and LoLM-MARL. It can be observed that LoLM-MARL outperforms Transformer (0.6B) in both initial performance and convergence performance, fully demonstrating that the cross-scenario generalization capability of LoLM-MARL is not merely a result of increasing model parameters. Figure 10b also shows the performance comparison results on the single-task setting. It can be observed that the algorithm using Transformer (0.6B) as the baseline performs relatively poorly. Around 1.75 × 10⁶ training steps, it exhibits significant fluctuations, and its final convergence performance is substantially lower than that of LoLM-MARL. These results further demonstrate that the performance advantage of the proposed method on single tasks mainly stems from the pre-trained prior knowledge and reasoning capabilities of the LLM.

To more thoroughly examine the influence of the LLM on the exploration mechanism in MARL, we compared the policy entropy of the vanilla MAPPO algorithm and the LLM-assisted LoLM-MARL algorithm during training on task

8 m_v s_9 m

, thereby providing deeper insight into how the LLM affects collaborative agent behavior.

As shown in Figure 11, in the early stage of policy exploration, the policy entropy of LoLM-MARL is significantly higher than that of MAPPO, indicating that the prior knowledge of the LLM does not force the agent to prematurely converge to a deterministic policy. Instead, it encourages more diverse initial exploration, which helps the agent discover better policy spaces in the early stage. As training progresses, the policy entropy of LoLM-MARL rapidly decreases and stabilizes at a low level, while the entropy of MAPPO decreases more slowly and remains at a relatively high level. This demonstrates that LoLM-MARL can converge more quickly to deterministic policies after sufficient exploration, reflecting the positive role of the LLM in balancing exploration and exploitation. Consequently, this promotes rapid convergence of the algorithm and enhances its ability to explore optimal policies.

6. Conclusions

In this work, we propose a MARL policy transfer method, LoLM-MARL, based on large language model fine-tuning to solve the problem that the traditional MARL algorithm is difficult to generalize when facing complex decision tasks. This method leverages lightweight and efficient parameter fine-tuning through LoRA, combined with the rich prior knowledge and strong reasoning capabilities of LLMs, to realize a semantics-driven multi-agent collaborative decision-making scheme. Moreover, to further enhance the adaptability of LLMs in specific collaborative decision-making tasks, we skillfully designed a dynamic prompt construction method that provides LLMs with broader training scenarios and denser state information, effectively ensuring few-shot and zero-shot transfer capabilities across scenarios. Additionally, to mitigate the potential catastrophic forgetting problem during the early stages of fine-tuning, this paper introduces a KL divergence regularization method based on an annealing strategy, which dynamically constrains the action probability distributions of both the pre-trained model and the fine-tuned model. This work represents an exploratory attempt at the intersection of LLMs and MARL. Extensive experimental results demonstrate that, across similar but not identical scenarios, the proposed method achieves significantly superior generalization performance in both zero-shot and few-shot transfer tasks compared to traditional SOTA methods, thereby providing a new technical pathway for reducing the deployment cost of multi-agent systems in real-world settings and accelerating the rapid adaptation of agents to unseen environments.

Limitations and Future Work

Although LoLM-MARL outperforms traditional methods in terms of generalization performance, it still has the following limitations. First, due to the large number of parameters of the LLM, even with LoRA fine-tuning, its training and inference overhead remain significantly higher than those of traditional MARL methods, making it difficult to directly apply to scenarios with high real-time requirements. This, to some extent, limits its applicability in resource-constrained environments. Second, LoLM-MARL requires manually designing semantic mapping rules from numerical observations to natural language. When switching to a completely different task for retraining (e.g., from StarCraft to MPE), the prompt template needs to be redesigned according to the new task and environment characteristics, thereby increasing the complexity of algorithm design.

To address the above limitations, future work will focus on two main aspects. First, we will explore lightweight LLM backbones and inference acceleration strategies, such as model compression or sparse attention mechanisms, to reduce computational overhead and meet the requirements of real-time applications. Second, we will investigate automated generation methods for prompt templates to reduce the manual design cost caused by task switching, thereby improving the generalization and deployment efficiency of the algorithm across different environments.

Author Contributions

Conceptualization, C.L. and Y.L.; methodology, C.L.; software, C.L. and J.W.; validation, J.W. and Z.W.; formal analysis, K.L.; investigation, C.W.; resources, K.L.; data curation, Z.W.; writing—original draft preparation, C.L.; writing—review and editing, C.L.; visualization, J.W.; supervision, Y.L.; project administration, Z.W.; funding acquisition, J.W. and Z.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the National Natural Science Foundation of China under Grant No. U23B2064 and 62176263, and in part by the Natural Science Basic Research Program of Shaanxi Province, under Grant No. 2022KJXX-99.

Data Availability Statement

The data presented in this work are available on request from the corresponding authors.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

The complete semantic information of the dynamic prompt in Figure 4 is shown in Table A1 and Table A2. From the information in the Tables combined with Figure 4, it can be seen that the information of dead agents is promptly filtered out. This design philosophy of prompt construction effectively reduces the interference of redundant information on LLM decision-making and improves its decision-making capability in specific scenarios.

Table A1. Semantic information of agent 0′s observations of allies.

Entity	Visibility	Type	Distance	Relative Position	Health	Shield	Last Action
Ally 1	is visible	stalker	0.39	(0.31, −0.23)	1.00	1.00	attack enemy 3
Ally 2	is visible	stalker	0.85	(0.67, −0.53)	1.00	1.00	move south
Ally 3	is visible	zealot	0.96	(0.81, 0.51)	0.80	0.00	attack enemy 1
Ally 4	is visible	zealot	1.24	(1.16, −0.43)	0.60	0.00	move north
Ally 5	is visible	zealot	1.25	(1.18, 0.40)	1.00	0.60	move west
Ally 6	is visible	zealot	1.04	(0.95, 0.42)	0.80	0.00	attack enemy 1
Ally 7	is visible	zealot	1.17	(1.13, 0.32)	1.00	0.20	move east

Table A2. Semantic information of agent 0′s observations of enemies.

Entity	Attackable	Type	Distance	Relative Position	Health	Shield
Enemy 0	is available attack	stalker	1.99	(1.79, 0.87)	1.00	1.00
Enemy 1	is available attack	stalker	1.15	(0.92, 0.69)	1.00	0.29
Enemy 2	is available attack	stalker	1.48	(1.40, 0.48)	1.00	1.00
Enemy 3	is available attack	zealot	1.37	(1.25, 0.57)	1.00	0.80
Enemy 4	is available attack	zealot	1.35	(1.01, 0.90)	1.00	1.00
Enemy 6	is available attack	zealot	1.69	(1.51, 0.76)	1.00	1.00

References

Mishra, M.; Poddar, P.; Agrawal, R.; Chen, J.; Tokekar, P.; Sujit, P.B. Multi-Agent Deep Reinforcement Learning for Persistent Monitoring With Sensing, Communication, and Localization Constraints. IEEE Trans. Autom. Sci. Eng. 2025, 22, 2831–2843. [Google Scholar] [CrossRef]
Kang, L.; Song, X.; Zhou, H.; Qin, Y.; Yang, J.; Liu, X.; Torr, P.; Bai, L.; Yin, Z. VIKI-R: Coordinating Embodied Multi-Agent Cooperation via Reinforcement Learning. arXiv 2026, arXiv:2506.09049. [Google Scholar]
Chen, L.; Deng, P.; Li, L.; Hu, X. Mixed Motivation Driven Social Multi-Agent Reinforcement Learning for Autonomous Driving. IEEE-CAA J. Autom. Sin. 2025, 12, 1272–1282. [Google Scholar] [CrossRef]
Jayanetti, A.; Halgamuge, S.; Buyya, R. Multi-Agent Deep Reinforcement Learning Framework for Renewable Energy-Aware Workflow Scheduling on Distributed Cloud Data Centers. IEEE Trans. Parallel Distrib. Syst. 2024, 35, 604–615. [Google Scholar] [CrossRef]
Zhang, Z.; Cheng, B.; He, B. EALLMs: Environment-Aligned LLMs for Enhanced Exploration and Communication in Multi-Agent Reinforcement Learning. IEEE Trans. Autom. Sci. Eng. 2026, 23, 1009–1020. [Google Scholar] [CrossRef]
Liu, B.; Pu, Z.; Pan, Y.; Yi, J.; Liang, Y.; Zhang, D. Lazy Agents: A New Perspective on Solving Sparse Reward Problem in Multi-agent Reinforcement Learning. In Proceedings of the 40th International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; pp. 21937–21950. [Google Scholar]
Nie, L.; Qi, D.; Liu, B.; Li, P.; Bao, H.; He, H. CMRM: Collaborative Multi-Agent Reinforcement Learning for Multi-Objective Traffic Signal Control. IEEE Trans. Consum. Electron. 2025, 71, 2793–2805. [Google Scholar] [CrossRef]
Selamet, E.T.; Tümer, B. Context Detection and Identification in Multi-Agent Reinforcement Learning with Non-Stationary Environment. In Proceedings of the 30th Signal Processing and Communications Applications Conference (SIU), Safranbolu, Turkey, 15–18 May 2022; pp. 1–4. [Google Scholar]
Shi, D.; Tong, J.; Liu, Y.; Fan, W. Knowledge Reuse of Multi-Agent Reinforcement Learning in Cooperative Tasks. Entropy 2022, 24, 470. [Google Scholar] [CrossRef]
Da Silva, F.L.; Costa, A.H.R. A survey on transfer learning for multiagent reinforcement learning systems. J. Artif. Int. Res. 2019, 64, 645–703. [Google Scholar] [CrossRef]
Wang, J.; Xu, L.; Sun, C. Learning general multi-agent decision model through multi-task pre-training. Neurocomputing 2025, 627, 129524. [Google Scholar] [CrossRef]
Li, C.; Dong, S.; Yang, S.; Hu, Y.; Ding, T.; Li, W.; Gao, Y. Multi-Task Multi-Agent Reinforcement Learning With Interaction and Task Representations. IEEE Trans. Neural Netw. Learn. Syst. 2025, 36, 13431–13445. [Google Scholar] [CrossRef] [PubMed]
Liang, W.; Wang, J.; Bao, W.; Zhu, X.; Wang, Q.; Han, B. Continuous self-adaptive optimization to learn multi-task multi-agent. Complex Intell. Syst. 2022, 8, 1355–1367. [Google Scholar] [CrossRef]
Zhu, Y.; Huang, S.; Zuo, B.; Zhao, D.; Sun, C. Multi-Task Multi-Agent Reinforcement Learning With Task-Entity Transformers and Value Decomposition Training. IEEE Trans. Autom. Sci. Eng. 2025, 22, 9164–9177. [Google Scholar] [CrossRef]
Zhang, F.; Jia, C.; Li, Y.C.; Yuan, L.; Yu, Y.; Zhang, Z. Discovering Generalizable Multi-agent Coordination Skills from Multi-task Offline Data. In Proceedings of the 11th International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023; pp. 1–24. [Google Scholar]
Liu, S.; Shu, Y.; Guo, C.; Yang, B. Learning Generalizable Skills from Offline Multi-Task Data for Multi-Agent Cooperation. arXiv 2025, arXiv:2503.21200. [Google Scholar]
Meng, L.; Wen, M.; Le, C.; Li, X.; Xing, D.; Zhang, W.; Wen, Y.; Zhang, H.; Wang, J.; Yang, Y.; et al. Offline Pre-trained Multi-agent Decision Transformer. Mach. Intell. Res. 2023, 20, 233–248. [Google Scholar] [CrossRef]
Tian, Z.; Chen, R.; Hu, X.; Li, L.; Zhang, R.; Wu, F.; Peng, S.; Guo, J.; Du, Z.; Guo, Q.; et al. Decompose a task into generalizable subtasks in multi-agent reinforcement learning. In Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023; pp. 1–19. [Google Scholar]
Tan, H.; Guo, Z.; Lin, Z.; Chen, Y.; Huang, D.; Yuan, W.; Zhang, H.; Yan, J. General generative AI-based image augmentation method for robust rooftop PV segmentation. Appl. Energy 2024, 368, 123554. [Google Scholar] [CrossRef]
Thirunavukarasu, A.J.; Ting, D.S.J.; Elangovan, K.; Gutierrez, L.; Tan, T.F.; Ting, D.S.W. Large language models in medicine. Nat. Med. 2023, 29, 1930–1940. [Google Scholar] [CrossRef] [PubMed]
Boiko, D.A.; MacKnight, R.; Kline, B.; Gomes, G. Autonomous chemical research with large language models. Nature 2023, 624, 570–578. [Google Scholar] [CrossRef]
Liang, J.; Huang, W.; Xia, F.; Xu, P.; Hausman, K.; Ichter, B. Code as Policies: Language Model Programs for Embodied Control. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; IEEE: New York, NY, USA, 2023; pp. 9493–9500. [Google Scholar] [CrossRef]
Li, S.; Puig, X.; Paxton, C.; Du, Y.; Wang, C.; Fan, L.; Chen, T.; Huang, D.A.; Akyürek, E.; Anandkumar, A.; et al. Pre-trained language models for interactive decision-making. In Proceedings of the 36th International Conference on Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022; pp. 31199–31212. [Google Scholar]
Zitkovich, B.; Yu, T.; Xu, S.; Xu, P.; Xiao, T.; Xia, F. RT-2: Vision-language-action models transfer web knowledge to robotic control. In Proceedings of the 7th Conference on Robot Learning, Atlanta, GA, USA, 6–9 November 2023; pp. 2165–2183. [Google Scholar]
Shi, R.; Liu, Y.; Ze, Y.; Du, S.S.; Xu, H. Unleashing the Power of Pre-trained Language Models for Offline Reinforcement Learning. arXiv 2024, arXiv:2310.20587. [Google Scholar]
Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-rank adaptation of large language models. In Proceedings of the 10th International Conference on Learning Representations, Virtual, 25–29 April 2022; pp. 1–17. [Google Scholar]
Chai, J.; Zhao, Z.; Zhu, Y.; Zhao, D. A Survey of Cooperative Multi-Agent Reinforcement Learning for Multi-Task Scenarios. Artif. Intell. Sci. Eng. 2025, 1, 98–121. [Google Scholar] [CrossRef]
Zhu, G.; Zhou, R.; Ji, W.; Zhang, H.; Wang, D.; Zhao, S. Multi-Task Multi-Agent Reinforcement Learning via Skill Graphs. IEEE Robot. Autom. Lett. 2025, 10, 8650–8657. [Google Scholar] [CrossRef]
Zhang, M.; Su, S.; He, C.; Sartoretti, G. Hybrid Training for Enhanced Multi-task Generalization in Multi-agent Reinforcement Learning. arXiv 2025, arXiv:2408.13567. [Google Scholar]
Wu, X. Enhancing Q-Learning with Large Language Model Heuristics. arXiv 2024, arXiv:2405.03341. [Google Scholar] [CrossRef]
Chen, L.; Sinavski, O.; Hünermann, J.; Karnsund, A.; Willmott, A.J.; Birch, D. Driving with LLMs: Fusing Object-Level Vector Modality for Explainable Autonomous Driving. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024; IEEE: New York, NY, USA, 2024; pp. 14093–14100. [Google Scholar] [CrossRef]
Li, L.; Tan, R.; Fang, J.; Xue, J.; Lv, C. LLM-augmented hierarchical reinforcement learning for human-like decision-making of autonomous driving. Expert Syst. Appl. 2025, 294, 128736. [Google Scholar] [CrossRef]
Li, Q.; Pang, B.; Song, Y.; Fu, H.; Xu, Q.; Yuan, X.; Xu, X.; Zhang, C. Large language model assisted hierarchical reinforcement learning training. Inf. Sci. 2026, 723, 122688. [Google Scholar] [CrossRef]
Zhou, Q.; Wu, J.; Zhu, M.; Zhou, Y.; Xiao, F.; Zhang, Y. LLM-QL: A LLM-Enhanced Q-Learning Approach for Scheduling Multiple Parallel Drones. IEEE Trans. Knowl. Data Eng. 2025, 37, 5393–5406. [Google Scholar] [CrossRef]
Zhang, L.; Yue, D.; Hancke, G.P.; Dou, C.; Yu, L.; Chen, Z. Optimization of Energy and Carbon Emissions in Integrated Energy System Based on Deep Reinforcement Learning Assisted by Large Language Model. IEEE Trans. Ind. Inform. 2025, 21, 8186–8197. [Google Scholar] [CrossRef]
Wang, Y.; Wang, Q.; Fan, J. Target Tracking for Muscle-Skeleton Robotic Arm Using Large Language Model-Enhanced Reinforcement Learning. IEEE Trans. Instrum. Meas. 2025, 74, 2526511. [Google Scholar] [CrossRef]
Dalal, M.; Chiruvolu, T.; Chaplot, D.; Salakhutdinov, R. Plan-Seq-Learn: Language Model Guided RL for Solving Long Horizon Robotics Tasks. arXiv 2024, arXiv:2405.01534. [Google Scholar]
Bernstein, D.S.; Givan, R.; Immerman, N.; Zilberstein, S. The Complexity of Decentralized Control of Markov Decision Processes. Math. Oper. Res. 2002, 27, 819–840. [Google Scholar] [CrossRef]
Yu, C.; Velu, A.; Vinitsky, E.; Gao, J.; Wang, Y.; Bayen, A.; Wu, Y. The surprising effectiveness of PPO in cooperative multi-agent games. In Proceedings of the 36th International Conference on Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022; pp. 24611–24624. [Google Scholar]
Aghajanyan, A.; Gupta, S.; Zettlemoyer, L. Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics (ACL-IJCNLP 2021), Online, 1–6 August 2021; pp. 7319–7328. [Google Scholar] [CrossRef]
Qin, C.; Ran, X. Efficient unsupervised image stitching using attention mechanism with deep homography estimation. Comput. Mater. Contin. 2024, 79, 1319–1334. [Google Scholar] [CrossRef]
Samvelyan, M.; Rashid, T.; de Witt, C.S.; Farquhar, G.; Nardelli, N.; Rudner, T.G.J.; Hung, C.M.; Torr, P.H.S.; Foerster, J.; Whiteson, S. The StarCraft Multi-Agent Challenge. In Proceedings of the 18th International Conference on Autonomous Agents and Multiagent Systems (AAMAS ‘19), Montreal, QC, Canada, 13–17 May 2019; pp. 2186–2188. [Google Scholar]
Taylor, M.E.; Stone, P. Transfer Learning for Reinforcement Learning Domains: A Survey. J. Mach. Learn. Res. 2009, 10, 1633–1685. [Google Scholar]
Wang, W.; Yang, T.; Liu, Y.; Hao, J.; Hao, X.; Hu, Y.; Chen, Y.; Fan, C.; Gao, Y. Action semantics network: Considering the effects of actions in multiagent systems. In Proceedings of the 8th International Conference on Learning Representations, Addis Ababa, Ethiopia, 26–30 April 2020; pp. 1–18. [Google Scholar]
Hu, S.; Zhu, F.; Chang, X.; Liang, X. UPDeT: Universal Multi-agent Reinforcement Learning via Policy Decoupling with Transformers. arXiv 2021, arXiv:2101.08001. [Google Scholar]

Figure 1. LoRA technical framework.

Figure 2. Overall framework of LoLM-MARL.

Figure 3. Form changes in agent 0′s observation data before and after mapping. (a) original data format; (b) mapped data format.

Figure 4. Design of dynamic prompts for agent 0 in 3s5z_vs_3s6z map. (a) the composition of agent 0’s prompt; (b) current battlefield situation at this timestep.

Figure 5. Comparative experimental results of LoLM-MARL on single-task learning.

Figure 6. Visualized comparison results of zero-shot transfer for each algorithm. Dots represent the average win rate, thick vertical bars indicate the standard deviation of the data, and thin vertical bars show the extreme values.

Figure 7. Comparative experimental results of few-shot transfer for each algorithm.

Figure 8. Comparison results of jumpstart and asymptotic performance of the fine-tuning method across six types of transfer tasks.

Figure 9. Ablation studies results of each module. (a) Impact of dynamic prompts on transfer performance; (b) impact of KL divergence on source task training stability.

Figure 10. Ablation study results of the policy network baseline. (a) Comparison of transfer performance between Transformer (0.6B) and LoLM-MARL; (b) comparison of experimental results between Transformer (0.6B) and LoLM-MARL on the source task.

Figure 11. Ablation study results of policy entropy.

Table 1. Composition of raw observation space for SMAC.

Observed Entities	Observation Space
Ally	Visible/Not visible
	Distance
	Relative position along the X-axis
	Relative position along the Y-axis
	Health
	Shield
	Unit type
	Last action
Enemy	Attack availability
	Distance
	Relative position along the X-axis
	Relative position along the Y-axis
	Health
	Shield
	Unit type
Self	Health
	Shield
	Unit type
	Last action
	Agent ID

Table 2. Composition of primitive action space for SMAC.

Action	Description
No Operation	Do nothing (only applicable to dead agents)
Stop	Stop all actions: the agent remains in stationary state
Move Direction	Move a fixed distance in four directions (North, South, East, and West)
Attack ID	Attack the enemy with the specified ID

Table 3. Configuration information for each scenario in SMAC.

Scenario	Ally	Enemy	Scenario Characteristics
3s5z	3 Stalkers & 5 Zealots	3 Stalkers & 5 Zealots	Symmetric
8m_vs_9m	8 Marines	9 Marines	Asymmetric
5m_vs_6m	5 Marines	6 Marines	Asymmetric
10m_vs_11m	10 Marines	11 Marines	Asymmetric
3s_vs_4z	3 Stalkers	4 Zealots	Asymmetric
3s_vs_5z	3 Stalkers	5 Zealots	Asymmetric
3s5z_vs_3s6z	3 Stalkers & 5 Zealots	3 Stalkers & 6 Zealots	Asymmetric
MMM2	1 Medivac, 2 Marauders & 7 Marines	1 Medivac, 3 Marauders & 8 Marines	Asymmetric
2s3z	2 Stalkers & 3 Zealots	2 Stalkers & 3 Zealots	Symmetric
1c3s5z	1 Colossus, 3 Stalkers & 5 Zealots	1 Colossus, 3 Stalkers & 5 Zealots	Symmetric
8m	8 Marines	8 Marines	Symmetric

Table 4. Experimental hyperparameter settings.

Hyper-Parameters	Value
Discount factor	0.99
Prompt batch size	64
LoRA learning rate	5 × 10⁻⁶
Action head learning rate	5 × 10⁻⁵
Value network learning rate	5 × 10⁻⁵
Maximum episode length	200
LoRA rank	8
LoRA scaling coefficient	16
LoRA dropout	0.1
LoRA target modules	q, k, v, o attention projection layer
Number of value network Transformer encoder layers	1
Number of self-attention heads in Transformer encoder	4
$λ_{0}$	20

Table 5. Zero-shot transfer win rate comparison results.

Transfer Task	DT2GS	UPDeT	ASN-Attention	LoLM-MARL (Ours)
$2 s 3 z \to 1 c 3 s 5 z$	11.25 ± 4.36	3.15 ± 2.37	8.63 ± 1.69	27.17 ± 3.35
$3 s_v s_4 z \to 3 s_v s_5 z$	32.46 ± 5.36	49.86 ± 6.64	0.00 ± 0.00	43.89 ± 4.79
$8 m_v s_9 m \to 5 m_v s_6 m$	25.89 ± 7.28	18.62 ± 3.26	7.93 ± 2.21	47.54 ± 3.98
$3 s 5 z \to 3 s 5 z_v s_3 s 6 z$	91.59 ± 2.06	89.64 ± 4.36	35.26 ± 9.38	94.38 ± 3.25
$8 m \to 10 m_v s_11 m$	61.25 ± 3.42	77.52 ± 4.26	33.26 ± 11.78	83.84 ± 5.27
$5 m_v s_6 m \to 10 m_v s_11 m$	15.48 ± 3.06	53.67 ± 2.02	10.56 ± 2.35	66.87 ± 4.23

Table 6. Performance comparison results between the fine-tuning method LoLM-MARL_finetune and the original method LoLM-MARL on the target task.

Transfer Task	Jumpstart (%)		Asymptotic Performance (%)		The Training Step Required for Asymptotic Performance
Transfer Task	LoLM-MARL	LoLM-MARL Finetune	LoLM-MARL	LoLM-MARL Finetune	LoLM-MARL	LoLM-MARL Finetune
$2 s 3 z \to 1 c 3 s 5 z$	0.00 ± 0.00	22.62 ± 2.55	97.64 ± 1.50	97.92 ± 1.47	4 × 10⁶	3.26 × 10⁵
$3 s_v s_4 z \to 3 s_v s_5 z$	0.00 ± 0.00	42.04 ± 2.86	1.00 ± 0.00	1.00 ± 0.00	4 × 10⁶	1 × 10⁶
$8 m_v s_9 m \to 5 m_v s_6 m$	0.00 ± 0.00	44.75 ± 3.95	79.54 ± 5.48	81.96 ± 2.04	1 × 10⁷	3.26 × 10⁵
$3 s 5 z \to 3 s 5 z_v s_3 s 6 z$	0.00 ± 0.00	93.54 ± 2.30	97.92 ± 1.47	98.96 ± 1.47	8 × 10⁶	3.26 × 10⁵
$8 m \to 10 m_v s_11 m$	0.00 ± 0.00	84.17 ± 3.08	92.71 ± 3.90	1.00 ± 0.00	8 × 10⁶	3.26 × 10⁵
$5 m_v s_6 m \to 10 m_v s_11 m$	0.00 ± 0.00	63.54 ± 5.91	92.71 ± 3.90	93.29 ± 3.03	8 × 10⁶	6.46 × 10⁵

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Li, C.; Liu, Y.; Wang, J.; Wang, Z.; Lu, K.; Wang, C. LLM-Augmented Multi-Agent Reinforcement Learning for Cross-Scenario Knowledge Transfer. Entropy 2026, 28, 525. https://doi.org/10.3390/e28050525

AMA Style

Li C, Liu Y, Wang J, Wang Z, Lu K, Wang C. LLM-Augmented Multi-Agent Reinforcement Learning for Cross-Scenario Knowledge Transfer. Entropy. 2026; 28(5):525. https://doi.org/10.3390/e28050525

Chicago/Turabian Style

Li, Chao, Yanfei Liu, Jieling Wang, Zhong Wang, Kewei Lu, and Chengjin Wang. 2026. "LLM-Augmented Multi-Agent Reinforcement Learning for Cross-Scenario Knowledge Transfer" Entropy 28, no. 5: 525. https://doi.org/10.3390/e28050525

APA Style

Li, C., Liu, Y., Wang, J., Wang, Z., Lu, K., & Wang, C. (2026). LLM-Augmented Multi-Agent Reinforcement Learning for Cross-Scenario Knowledge Transfer. Entropy, 28(5), 525. https://doi.org/10.3390/e28050525

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

LLM-Augmented Multi-Agent Reinforcement Learning for Cross-Scenario Knowledge Transfer

Abstract

1. Introduction

2. Related Work

2.1. Generalization Challenges in MARL

2.2. LLMs for Decision-Making

3. Preliminaries

3.1. Dec-POMDPs

3.2. Multi-Agent Proximal Policy Optimization (MAPPO)

3.3. Low-Rank Adaptation (LoRA)

4. Methodology

4.1. Overview of LoLM-MARL

4.2. Dynamic Prompt Design for LLM

4.3. Policy Annealing Alignment for Pre-Trained and Fine-Tuned LLMs

4.4. Adaptive-Scale Transfer Network Architecture

5. Experiments

5.1. Environments

5.2. Methods and Metric

5.2.1. Baselines

5.2.2. Hyperparameter Settings

5.2.3. Performance Metrics

5.3. Performance on Single-Task

5.4. Zero-Shot Generalization Across Scenarios

5.5. Few-Shot Generalization Across Scenarios

5.6. Generalization Analysis

5.7. Ablation Studies

6. Conclusions

Limitations and Future Work

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

Appendix A

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI