Article

Multi-Agent Transfer Learning Based on Contrastive Role Relationship Representation

Computer Application Research Center, School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen 518055, China
*
Author to whom correspondence should be addressed.
Submission received: 5 November 2025 / Revised: 31 December 2025 / Accepted: 1 January 2026 / Published: 6 January 2026
(This article belongs to the Section AI in Autonomous Systems)

Abstract

This paper presents Multi-Agent Transfer Learning Based on Contrastive Role Relationship Representation (MCRR), a framework centered on the unique function of role mechanisms in cross-task knowledge transfer. The framework employs contrastive learning-driven role representation modeling to capture the differences and commonalities of agent behavior patterns across tasks. We generate generalizable role representations and embed them into transfer policy networks, enabling agents to efficiently share role assignment knowledge during source task training and to achieve policy transfer through precise role adaptation in unseen tasks. Unlike traditional methods that rely on the generalization ability of neural networks, MCRR breaks through the coordination bottleneck of dynamic team collaboration in multi-agent systems by explicitly modeling role dynamics across tasks and constructing a cross-task role contrast model. On the SMAC benchmark task series, including mixed formations and quantity variations, MCRR significantly improves win rates in both source and unseen tasks. By outperforming mainstream baselines such as MATTAR and UPDeT, MCRR validates the effectiveness of roles as a bridge for knowledge transfer.

1. Introduction

In recent years, artificial intelligence has made important progress in machine gaming. Standard test environments, such as Atari video games [1] and MuJoCo (a physics simulation platform) [2], provide common validation tools for single-agent reinforcement learning algorithms due to their fixed action spaces and predictable state transitions. However, real-world decision-making problems often require dynamic teamwork and competition among multiple agents, creating major challenges for traditional single-agent learning methods. Multi-Agent Reinforcement Learning (MARL) [3] shows particular strengths in areas such as game AI [4], robot control [5], smart military systems [6], and mobility [7]. The fundamental concept is to emulate the operational mechanisms of real-world collective intelligence in order to optimize multi-agent cooperative strategies within complex and evolving environments. In this context, a “task” is defined as a sequence of practical, real-world jobs that require algorithmic execution and completion.
Most multi-agent reinforcement learning methods adopt the centralized training with decentralized execution (CTDE) [8] framework. This framework utilizes global states during training to optimize collaborative policies while enabling agents to make autonomous decisions based on local observations during execution, effectively addressing the policy coordination challenges of fully decentralized approaches. However, existing CTDE methods often use parameter sharing mechanisms [9], which reduce the training complexity of complex systems and accelerate policy convergence but inevitably lead to homogeneous agent behaviors. Inspired by the differentiated division of labor in real-world teams, a role denotes an agent’s functional positioning within multi-agent collaboration; its core purpose is to guide agents toward differentiated behavioral patterns aligned with team objectives. Methods like ROMA [10] learn implicit role representations from agents’ local observations and introduce them into Q-value networks to generate heterogeneous behaviors. RODE [11] further associates roles with fixed subsets of the joint action space so that specific roles perform certain actions. Recent work such as ACORM [12] has advanced role-based learning by introducing contrastive learning to dynamically cluster agents into roles based on behavioral patterns and leveraging attention mechanisms to enhance coordination, though it remains focused on improving single-task performance. However, existing role-based methods suffer from three fundamental limitations. First, they are inherently confined to single-task scenarios and cannot adapt to dynamic changes across multiple tasks, requiring complete retraining when faced with new task structures. Second, they fail to adequately model the dynamic emergence of roles in complex scenarios, ignoring how agents’ roles evolve as they interact with the environment, adjust goals, and respond to peer behaviors during task execution. Third, they lack a role knowledge reuse mechanism: roles are not explicitly modeled as carriers of collaborative knowledge, so the collaboration experience of similar roles cannot be shared across different tasks, resulting in severe waste of training resources. These limitations prevent existing methods from adapting to dynamic task changes in real-world scenarios, greatly restricting their application scope.
Moreover, as swarm intelligence technology evolves, the scale of multi-agent systems continues to expand and task complexity grows exponentially. If agents have poor generalization ability and cannot adapt to dynamic task environments, they must spend substantial time training from scratch whenever new tasks appear. To address this, researchers have introduced transfer learning (TL) [13], which transfers strategies learned in simple tasks to complex ones. Reusing knowledge from source tasks speeds up learning and improves performance, promoting the application of related algorithms in complex real-world tasks. UPDeT [14] introduces a transformer-based policy decoupling framework that decomposes agent decision-making into independent modules, enabling cross-agent collaboration through contextual modeling. REFIL [15] leverages randomized entity group decomposition to identify task commonalities, combining counterfactual reasoning to recognize subgroup behavior patterns and reconstruct value functions.
However, compared to the strong adaptability of human collaboration in complex dynamic environments and the ability to seamlessly transfer experience-based strategies across tasks, multi-agent systems still face two core, unresolved challenges. The first is how to achieve efficient cross-task knowledge transfer that goes beyond superficial feature reuse to truly accelerate the training of new tasks. The second is how to establish dynamic role division mechanisms deeply coupled with cross-task transfer capabilities. These mechanisms should enable agents to complete sophisticated collaboration through role division within a specific task while achieving strategy transfer via cross-task generalization of role representations. Existing algorithms fail to address these two challenges synergistically, as they tend to focus on a single aspect and lack integrated modeling of task differences and role dynamics. Specifically, cross-task transfer methods overemphasize building universal task representations to enhance generalization but ignore the role division that is the core of multi-agent collaboration. This leads to overly vague transferred knowledge that cannot support refined teamwork in complex scenarios. Dynamic role mechanisms, by contrast, only focus on refining differentiated behaviors within a specific task and lack cross-task adaptability. Their role representations are tied to specific task settings and cannot be reused across related tasks. Both approaches neglect the collaborative modeling of task differences and, more critically, the explicit use of roles as bridges for knowledge transfer. This fundamental gap makes it difficult to balance cross-task collaboration efficiency and role-specific behavioral differentiation, severely limiting the practical application of multi-agent systems in real-world dynamic scenarios. To address this shortcoming, we propose Multi-Agent Transfer Learning Based on Contrastive Role Relationship Representation (MCRR). Our research integrates contrastive learning-driven role mechanisms with multi-agent transfer frameworks by constructing a cross-task role contrast model to achieve collaborative optimization of role strategies and transfer learning. Specifically, contrastive learning is employed to capture the differences and commonalities of role behaviors across tasks, generating role representations with task generalization capabilities that are embedded into the policy network of the transfer framework. This enables agents to retain collaborative experience from source tasks while dynamically adjusting role division according to target tasks during cross-task transfer. This integration fills the gap left by traditional transfer methods that lack role dynamics and endows dynamic role mechanisms with cross-task adaptability.
On the StarCraft Multi-Agent Challenge (SMAC) [16] benchmarks, our proposed MCRR method achieves state-of-the-art performance in most tasks. MATTAR [17], the current SOTA method, enables cross-task transfer via task relationship modeling but lacks a role mechanism, leading to limited generalization in unseen tasks, with win rates often below 70% in complex scenarios. In contrast, MCRR significantly enhances policy generalization by deeply integrating role contrastive learning and task relationship modeling. Its average win rate is 4.75–12% higher than MATTAR in source tasks and 5.08–16.4% higher in unseen tasks, performing particularly well in high-difficulty tasks. It comprehensively outperforms existing baselines such as MATTAR, fully verifying the role mechanism’s effectiveness as a bridge for knowledge transfer. In summary, our contributions are threefold:
  • This study constructs a role relationship representation and proposes the MCRR framework. By dynamically clustering roles and extracting behavioral representations via contrastive learning, and enhancing cross-role interaction with attention mechanisms, it achieves collaborative optimization of role strategies and transfer learning.
  • We leverage role representations to realize more expressive credit assignment via an attention mechanism, promoting strategic coordination in a sophisticated role space. This mechanism enhances transfer learning by embedding cross-task role generalization into the hybrid network, thereby optimizing the coordination capabilities of agent roles in different tasks.
  • In SMAC benchmark experiments, MCRR significantly outperforms baseline methods such as MATTAR and UPDeT in win rates for source tasks and unseen tasks across task series like mixed formations and quantity variations, verifying that the role mechanism, as a bridge for knowledge transfer, enhances cross-task generalization ability.

2. Preliminaries

2.1. Problem Formulation

The multi-agent reinforcement learning problem can be modeled as a Dec-POMDP [18], defined by the tuple $G = \langle I, S, A, P, R, \Omega, O, n, \gamma \rangle$, where $I$ is a finite set of $n$ agents and $s \in S$ is the global state. At each time step, each agent $i$ receives a local observation $o_i \in O$ according to the observation function $\Omega(s, i)$ and then selects an action $a_i \in A$. The individual actions form a joint action $\mathbf{a} = [a_1, \ldots, a_n]^{\top} \in A^n$, the environment transitions to the next state according to the transition function $P(s' \mid s, \mathbf{a})$, and all agents receive a shared reward $r = R(s, \mathbf{a})$. The discount factor $\gamma \in (0, 1)$ weighs future rewards against the immediate reward.
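As a reading aid, the following minimal Python sketch expresses the Dec-POMDP tuple as an interface; the class and field names are illustrative assumptions, not part of any released implementation.

```python
# Minimal sketch of the Dec-POMDP tuple G = <I, S, A, P, R, Omega, O, n, gamma>.
# All names here are illustrative; they are not taken from the paper's code.
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple
import numpy as np

@dataclass
class DecPOMDP:
    n_agents: int                                                    # |I|
    gamma: float                                                     # discount factor in (0, 1)
    transition: Callable[[np.ndarray, Sequence[int]], np.ndarray]    # samples s' ~ P(. | s, a)
    reward: Callable[[np.ndarray, Sequence[int]], float]             # shared reward R(s, a)
    observe: Callable[[np.ndarray, int], np.ndarray]                 # local observation Omega(s, i)

    def step(self, state: np.ndarray, joint_action: Sequence[int]
             ) -> Tuple[np.ndarray, float, List[np.ndarray]]:
        """One step: next global state, shared team reward, per-agent local observations."""
        next_state = self.transition(state, joint_action)
        shared_reward = self.reward(state, joint_action)
        observations = [self.observe(next_state, i) for i in range(self.n_agents)]
        return next_state, shared_reward, observations
```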

2.2. Task Relationship

MATTAR [17] is a multi-agent policy transfer framework based on task relationship modeling; it captures the common structure of different tasks by modeling the similarity of their state transitions and reward functions. Specifically, task representation learning is integrated into a forward model (predicting next states, observations, and rewards), with each task associated with a representation vector $z_i$. The forward model parameters are generated by a shared hypernetwork (the representation explainer). In the transfer phase, the learned representation explainer is fixed, and new task representations are learned by minimizing the forward model prediction loss on new tasks, modeled as linear combinations of the source task representations:
$$z_{\mathrm{new}} = \sum_{i=1}^{N_{\mathrm{src}}} w_i z_i \quad \text{s.t.} \quad w_i \geq 0, \;\; \sum_{i=1}^{N_{\mathrm{src}}} w_i = 1$$
where $N_{\mathrm{src}}$ denotes the number of source tasks involved in training and $w_i$ represents the similarity to source task $i$, quantitatively characterizing the contribution of each source task to the new task representation $z_{\mathrm{new}}$. Through these task relationships, agents can flexibly combine knowledge learned in the past to solve new tasks.
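As a hedged illustration of the equation above, the sketch below fits the simplex weights $w$ by parameterizing them with a softmax and minimizing a generic forward-model prediction loss on the new task. Here `forward_model_loss` is a placeholder for the frozen representation explainer’s loss, and all names are assumptions rather than the MATTAR reference implementation.

```python
# Hedged sketch: learn a new-task representation as a convex combination of fixed
# source-task representations. `forward_model_loss` is a placeholder for the frozen
# forward model / representation explainer's prediction loss on the new task.
import torch
import torch.nn.functional as F

def fit_new_task_representation(source_z: torch.Tensor,   # (N_src, d), kept fixed
                                forward_model_loss,        # callable: z_new -> scalar loss
                                steps: int = 1000,
                                lr: float = 1e-2) -> torch.Tensor:
    logits = torch.zeros(source_z.size(0), requires_grad=True)   # unconstrained parameters
    optimizer = torch.optim.Adam([logits], lr=lr)
    for _ in range(steps):
        w = F.softmax(logits, dim=0)        # enforces w_i >= 0 and sum_i w_i = 1
        z_new = w @ source_z                # convex combination of source representations
        loss = forward_model_loss(z_new)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    with torch.no_grad():
        return F.softmax(logits, dim=0) @ source_z
```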

3. Method

In this section, we introduce the MCRR framework. The framework is inspired by the way the same role shares knowledge across different real-world tasks, and it builds its advantages in multi-agent transfer learning around the role concept. Specifically, we use contrastive learning to analyze the differences and similarities of role behaviors in different tasks. This helps generate representations with strong generalization ability for role-specific knowledge transfer. By embedding these representations into the agent’s policy network, we can use roles as a bridge for knowledge transfer, fully reusing the proven collaborative experience from source tasks. Further, we introduce these roles into the multi-agent policy mixing network using an attention mechanism. This helps each agent adopt more effective collaboration strategies based on its role, thereby enhancing overall performance.
MCRR consists of individual Q-networks in Figure 1c and a population invariant (PIN) mixing network [15] in Figure 1a. Its algorithm structure primarily includes the role relationship representation module and the attention collaboration mechanism: The former dynamically clusters agent behavior patterns via contrastive learning and leverages the InfoNCE [19] function, a classic contrastive learning objective, to optimize the agent embedding vectors $\{e_i^t\}_{i=1}^{n}$. This objective enhances representation discriminability. It pulls embeddings of positive samples closer, as these samples are agents with similar behavioral patterns. It also pushes those of negative samples farther apart, since these are agents with dissimilar patterns. It ultimately generates task-generalizable role representations $\{z_i^t\}_{i=1}^{n}$ (as shown in Figure 1d). In the attention collaboration mechanism, an attention module is introduced to decompose the global state into environmental, ally, and enemy states, extract key representations from each part via an encoder, and calculate attention weights for different role states using role representations as dynamic indices (as shown in Figure 1b). The two modules work synergistically to enable efficient sharing of division-of-labor knowledge in source tasks and precise transfer of role strategies in unseen tasks. It should be noted that while this paper employs terminology such as “allies” and “enemies” in subsequent algorithmic descriptions, these terms are primarily derived from the adversarial benchmarks used in our experiments.

3.1. Contrastive Role Representation

Existing role-based methods lack cross-task role transfer capability, and multi-agent transfer learning lacks an efficient mechanism for coupling role dynamics with cross-task transfer, which makes it difficult to balance agent collaboration efficiency across tasks. To address these challenges, we propose a novel approach: using contrastive learning to dynamically cluster agents with similar behavioral patterns into distinct roles while distinguishing agents with dissimilar strategies. Specifically, the method focuses on two core objectives: (1) quantifying the similarity of agent behavioral strategies across different tasks and constructing role representations; (2) optimizing role representations to enhance knowledge transferability.
For the first objective, we extract high-dimensional behavioral representations $e_i^t = f_\phi(o_i^t, a_i^{t-1}, e_i^{t-1})$ from each agent’s observation-action trajectory, where $\phi$ is an encoder composed of an attention mechanism [20] and a shared GRU [21], $o_i^t$ denotes the observation of agent $i$ at time $t$, $a_i^{t-1}$ is the action executed by agent $i$ at the previous time step, and $e_i^{t-1}$ is the hidden state of the GRU. This agent embedding captures the behavioral strategy information of the agent, so the distance between embeddings can be used to characterize the behavioral differences among agents.
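A minimal PyTorch sketch of such a trajectory encoder is shown below; the layer sizes, the single-entity attention step, and all module names are assumptions rather than the authors’ exact architecture.

```python
# Sketch of an agent-embedding encoder e_i^t = f_phi(o_i^t, a_i^{t-1}, e_i^{t-1}):
# an input projection, a (simplified) attention step, and a shared GRU cell.
import torch
import torch.nn as nn

class AgentEmbedder(nn.Module):
    def __init__(self, obs_dim: int, n_actions: int, hidden_dim: int = 64):
        super().__init__()
        self.proj = nn.Linear(obs_dim + n_actions, hidden_dim)
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads=4, batch_first=True)
        self.gru = nn.GRUCell(hidden_dim, hidden_dim)   # shared across agents

    def forward(self, obs, prev_action_onehot, prev_embed):
        # obs: (B, obs_dim); prev_action_onehot: (B, n_actions); prev_embed: (B, hidden_dim)
        x = torch.relu(self.proj(torch.cat([obs, prev_action_onehot], dim=-1)))
        # Self-attention over a length-1 sequence here; with per-entity observation
        # features this would attend across the observed entities instead.
        x_seq = x.unsqueeze(1)
        x_att, _ = self.attn(x_seq, x_seq, x_seq)
        return self.gru(x_att.squeeze(1), prev_embed)   # new embedding e_i^t
```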
To address the second objective, we need more accurate role characteristics as the basis for transfer. Following [15], for the individual Q-network we break down the observation $o_i$ of agent $i$ into three parts: the part related to the environment $o_i^{env}$, the part related to agent $i$ itself $o_i^{own}$, and the part related to other entities $o_i^{other}$. We then combine the environmental observation $o_i^{env}$, the self-observation $o_i^{own}$, and the previous action $a_i^{t-1}$ of agent $i$ as a representation of the core behavioral strategy of agent $i$. This combined input is fed into an MLP to generate a fixed-dimensional independent representation of the agent. Meanwhile, we use the attention mechanism to extract the core associated representations between agent $i$ and other agents. These two representations are then concatenated and serve as the key basis for the subsequent network to distinguish agent roles. After obtaining agents’ core discriminative representations, we apply contrastive learning, which aims to make representations of similar samples more alike and those of dissimilar samples more distinct. Ideal role representations should depend solely on behavioral patterns rather than agent identities. To achieve this, we introduce mutual information to quantify dependence and propose maximizing it to reduce role uncertainty and filter out irrelevant information. The role encoder is formalized as $z^t \sim f_\theta(z^t \mid e^t)$, where $z^t$ represents the role characteristics of the agent at time $t$ and $e^t$ is the representation of the agent’s core behavioral strategy. The role $R$ follows the distribution $P(R)$, and the distribution of the agent embedding $e$ is determined by its role. The learning objective of the role encoder is to maximize the mutual information $I(z; R)$:
$$\max I(z; R) = \mathbb{E}_{z, R}\left[\log \frac{p(R \mid z)}{p(R)}\right]$$
Directly optimizing this mutual information is difficult in practice, as it demands full knowledge of probability distributions that are often unknown or high-dimensional in complex tasks. Inspired by InfoNCE in contrastive learning [22,23], we construct a loss function that quantifies role similarity for dynamic clustering by pulling agents of the same role closer and pushing agents of different roles farther apart in the representation space. This mechanism directly supports role clustering by enhancing compactness within roles and separability between roles. It avoids the complexity of high-dimensional distribution estimation and improves the discriminative ability of role representations. The method explicitly clusters representations of the same role together and separates representations of different roles into distinct clusters, reducing individual noise irrelevant to roles and learning cross-agent role commonalities. This better adapts to dynamic multi-agent tasks and improves transferability and generalization. Our goal is to minimize the following loss:
$$\mathcal{L}_{\mathrm{role}} = -\log \frac{\sum_{i' \in A_r} \exp\!\left(f_\theta(i)^{\top} W f_\theta(i')\right)}{\sum_{i' \in A_r} \exp\!\left(f_\theta(i)^{\top} W f_\theta(i')\right) + \sum_{k \notin A_r} \exp\!\left(f_\theta(i)^{\top} W f_\theta(k)\right)}$$
$\mathcal{L}_{\mathrm{role}}$ is the InfoNCE loss for contrastive role learning, used to measure the discriminative ability of role representations. $A_r$ is the set of agents belonging to the same role $r$ (positive sample pairs), and $i' \in A_r$ indicates that agent $i'$ shares this role with agent $i$. $f_\theta(i) = f_\theta((o_i^{env}, o_i^{own}, o_i^{other}), a_i)$ is the role representation vector of agent $i$ generated by the encoder $\theta$ from the agent’s observation $o_i$ and action $a_i$, used to capture the agent’s behavioral representation. $W$ is a learnable similarity matrix for adjusting the similarity calculation between role representations. For positive pairs, the summation over $i' \in A_r$ enforces similarity between same-role agents, while the summation over $k \notin A_r$ in negative pairs maximizes dissimilarity across roles, directly optimizing for cluster separation. After training, the contrastive representations clearly distinguish behavioral patterns between same and different roles. Task relationship representations model unseen tasks as linear combinations of source tasks through forward models and hypernetworks, capturing the dynamic similarities between tasks. The synergy of the two enables agents to: (1) learn discriminative role behaviors in the representation space based on role representations; (2) establish cross-task mappings through task relationship modeling, reuse source task knowledge by virtue of the cross-task generalization ability of role representations, and rapidly migrate role requirements in new tasks through the dynamic adaptation of role representations. Experimental results on the SMAC benchmark demonstrate that our method significantly outperforms baseline approaches in win rates on unseen tasks, confirming its ability to achieve efficient adaptation and cross-task generalization through role knowledge transfer. Fundamentally, the MCRR framework is structured as a versatile model capable of addressing diverse multi-agent collaborative challenges beyond specific benchmarks. In non-confrontational scenarios, “enemy states” can be flexibly remapped as environmental features or interaction targets, while “ally states” represent collaborative observations among team members. By leveraging this generalized mapping of role representations, the model extends its applicability from simulated combat to a wide spectrum of complex coordination scenarios, including real-world rescue operations.
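The loss above can be implemented directly; the sketch below is a hedged version in which `role_ids` comes from the periodic clustering of agent embeddings (see Algorithm A1 in Appendix A.2) and the bilinear similarity uses the learnable matrix $W$. The shapes and the per-agent loop are illustrative simplifications, not the authors’ code.

```python
# Hedged sketch of the role InfoNCE loss with a learnable bilinear similarity W.
# z: (n_agents, d) role representations; role_ids: (n_agents,) current role labels.
import torch

def role_infonce_loss(z: torch.Tensor, role_ids: torch.Tensor, W: torch.Tensor) -> torch.Tensor:
    sim = torch.exp(z @ W @ z.t())                         # pairwise exp-similarities
    n = z.size(0)
    eye = torch.eye(n, dtype=torch.bool, device=z.device)
    same_role = (role_ids.unsqueeze(0) == role_ids.unsqueeze(1)) & ~eye
    losses = []
    for i in range(n):
        positives = sim[i][same_role[i]].sum()             # same-role agents
        negatives = sim[i][~same_role[i] & ~eye[i]].sum()  # different-role agents
        if positives > 0:                                  # skip agents alone in their role
            losses.append(-torch.log(positives / (positives + negatives)))
    return torch.stack(losses).mean() if losses else sim.new_zeros(())

# Example usage with toy tensors:
# z = torch.randn(5, 16); W = torch.nn.Parameter(torch.eye(16))
# loss = role_infonce_loss(z, torch.tensor([0, 0, 1, 1, 2]), W)
```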

3.2. Attention Role Collaboration

To better coordinate the behavioral strategies of multiple agents across multi-tasks and multi-roles, we derive role representations from agents’ local behavioral strategies and introduce them into the value decomposition network. In the value decomposition framework, VDN [24] (Value-Decomposition Network) simplifies the global Q-value as the sum of local Q-values of individual agents, while QMIX [25] combines local Q-values more flexibly through a monotonic mixing network to achieve complex coordination strategies. Addressing the challenge of input/output dimension variations across tasks in multi-agent transfer learning, we employ a population-invariant network, which is based on the value decomposition framework and consists of an agent-shared individual Q-network and a monotonic mixing network that learns global Q-values by combining local ones.
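For context, a minimal QMIX-style monotonic mixer is sketched below: hypernetworks conditioned on the global state output non-negative weights (via an absolute value), which guarantees $\partial Q_{tot} / \partial Q_i \geq 0$. The dimensions are illustrative, and this is the generic QMIX construction cited above, not the exact PIN mixing network used by MCRR.

```python
# Generic monotonic (QMIX-style) mixing network sketch; sizes are illustrative.
import torch
import torch.nn as nn

class MonotonicMixer(nn.Module):
    def __init__(self, n_agents: int, state_dim: int, embed_dim: int = 32):
        super().__init__()
        self.n_agents, self.embed_dim = n_agents, embed_dim
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed_dim), nn.ReLU(),
                                      nn.Linear(embed_dim, 1))

    def forward(self, q_locals: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        # q_locals: (B, n_agents) local Q-values; state: (B, state_dim) global state.
        B = q_locals.size(0)
        w1 = torch.abs(self.hyper_w1(state)).view(B, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).view(B, 1, self.embed_dim)
        hidden = torch.relu(torch.bmm(q_locals.unsqueeze(1), w1) + b1)  # (B, 1, embed_dim)
        w2 = torch.abs(self.hyper_w2(state)).view(B, self.embed_dim, 1)
        b2 = self.hyper_b2(state).view(B, 1, 1)
        q_tot = torch.bmm(hidden, w2) + b2                              # (B, 1, 1)
        return q_tot.view(B)                                            # Q_tot per sample
```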
Based on this, a self-attention mechanism is introduced to enhance cross-dimensional coordination efficiency. First, we decompose the global state $s$ into environment-related states $s^{env}$, ally states $S_{ally}$, and enemy states $S_{enemy}$. An encoder and attention mechanism are then employed to fix the representation dimensions of the variable-size inputs ($S_{ally}$ and $S_{enemy}$) in multi-task transfer, enhancing the dynamic correlation and adaptive coordination between different role states by extracting key representations. This enables the multi-agent system to capture long-range correlations between agents in different positions and with different functions in complex tasks, to avoid repeatedly transmitting or indiscriminately processing similar ally/enemy state information, and to reduce unnecessary interaction overhead. As a result, agents’ strategies remain more stable and coherent during task execution and adapt better to scenarios such as changes in unit numbers and enemy configurations, improving strategic coherence and generalization ability.
$$q = \mathrm{MLP}_q(S_{ally}, S_{enemy})$$
$$k = \mathrm{MLP}_K(S_{ally}, S_{enemy})$$
$$v = \mathrm{MLP}_V(S_{ally}, S_{enemy})$$
$$h = \mathrm{softmax}\!\left(q k^{\top} / \sqrt{d_e}\right) v$$
$S_{ally}$ and $S_{enemy}$, respectively, represent the states of ally and enemy agents, corresponding to the cooperative and adversarial representations in multi-role tasks. $\mathrm{MLP}_q$, $\mathrm{MLP}_K$, and $\mathrm{MLP}_V$ are multi-layer perceptrons that encode ally states into query vectors $q$ through nonlinear transformations and generate key vectors $k$ and value vectors $v$ by concatenating ally and enemy states; these are used for representation retrieval, similarity calculation, and information aggregation, respectively. $h$ is the cross-role interaction representation vector output by the attention mechanism, which fuses key information through scaled dot-product operations, where $d_e$ is the agent embedding dimension.
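A hedged sketch of this state-attention step is given below. It assumes that ally and enemy entities share a common per-entity feature dimension, takes queries from the ally states and keys/values from the concatenated ally and enemy entities, and pools the result to a fixed size regardless of how many units a task contains; the layer sizes and the pooling choice are assumptions.

```python
# Hedged sketch of attention over ally/enemy entity states (Section 3.2).
import math
import torch
import torch.nn as nn

class StateAttention(nn.Module):
    def __init__(self, entity_dim: int, d_e: int = 32):
        super().__init__()
        self.d_e = d_e
        self.mlp_q = nn.Linear(entity_dim, d_e)
        self.mlp_k = nn.Linear(entity_dim, d_e)
        self.mlp_v = nn.Linear(entity_dim, d_e)

    def forward(self, s_ally: torch.Tensor, s_enemy: torch.Tensor) -> torch.Tensor:
        # s_ally: (B, n_ally, entity_dim); s_enemy: (B, n_enemy, entity_dim)
        entities = torch.cat([s_ally, s_enemy], dim=1)         # keys/values over all entities
        q = self.mlp_q(s_ally)                                 # queries from ally states
        k, v = self.mlp_k(entities), self.mlp_v(entities)
        attn = torch.softmax(q @ k.transpose(1, 2) / math.sqrt(self.d_e), dim=-1)
        h = attn @ v                                           # (B, n_ally, d_e)
        return h.mean(dim=1)                                   # fixed-size summary across allies
```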
Throughout policy learning, we fix the task representation $z$ and introduce role representations to enhance the agents’ ability to capture cross-task commonalities and differences. By encoding the differences in agents’ local behavioral strategies, the framework explicitly models the strategic contrast (e.g., action pattern differences) between different roles (such as allies and enemies) in similar tasks, thereby improving the generalization of the task representation $z$. The contrastive learning module is continuously optimized during the training phase to dynamically adapt to the role representation distributions of different tasks and to sharpen agents’ ability to capture differences between role representations. The task representation is jointly learned with a representation explainer responsible for representation mapping, using 50,000 samples containing role interaction dynamics collected over 50,000 timesteps. When transferring to new tasks, the parameters of the contrastive learning module and the representation explainer are fixed, and the trained individual Q-network and PIN mixing network from the source tasks are reused. The new task representation $z$, trained over 50,000 timesteps, is then deeply fused with role representations to form composite representations containing cross-task role dependencies, which are fed into the input layer of the individual Q-network. This enables agents to execute policies in a decentralized manner based on the Q-network fused with role representations, optimizing cross-agent collaboration using global contrastive information while maintaining local decision-making autonomy and significantly improving policy transfer efficiency and environmental generalization in complex tasks.
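The sketch below illustrates decentralized execution after transfer as described above: the frozen new-task representation is fused with each agent’s current role representation and concatenated to the Q-network input. The function and module names (`q_net`, `agent_embedder`, `role_encoder`) and the concatenation-based fusion are placeholders for illustration only.

```python
# Hedged sketch: decentralized action selection with a frozen task representation
# z_task fused with per-agent role representations at the Q-network input.
import torch

@torch.no_grad()
def select_actions(q_net, agent_embedder, role_encoder, z_task,
                   obs_list, prev_actions_onehot, prev_embeds, avail_actions):
    actions, new_embeds = [], []
    for i, obs in enumerate(obs_list):                                    # obs: (B, obs_dim)
        e = agent_embedder(obs, prev_actions_onehot[i], prev_embeds[i])   # e_i^t
        z_role = role_encoder(e)                                          # z_i^t
        task = z_task.unsqueeze(0).expand(obs.size(0), -1)                # broadcast z
        q_values = q_net(torch.cat([obs, z_role, task], dim=-1))          # (B, n_actions)
        q_values[~avail_actions[i]] = float("-inf")                       # mask invalid actions
        actions.append(q_values.argmax(dim=-1))
        new_embeds.append(e)
    return actions, new_embeds
```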
To maintain the conciseness and logical flow of the discussion, the detailed execution steps of the MCRR framework are provided in Algorithm A1 in Appendix A.2. This pseudocode completes the methodological logic by outlining the iterative details of representation training and policy learning.

4. Experiments

This section presents the experiments designed to evaluate our method’s capabilities in two aspects: (1) the role mechanism’s function in multi-task training, which enhances overall performance by efficiently acquiring common role knowledge across multiple tasks; (2) its generalization ability to unseen tasks, demonstrating that the algorithm can extract role knowledge from multiple source tasks and transfer it to unseen ones using roles as the foundation.
We evaluated MCRR on the SMAC [16] benchmarks. SMAC, short for the StarCraft Multi-Agent Challenge, is built on the StarCraft II engine. It includes cooperative multi-agent tasks ranging from simple to complex, covering scenarios such as collaborative combat and target destruction. As a recognized authoritative experimental environment in the multi-agent field, SMAC provides standardized interfaces and unified evaluation metrics. It serves as a core benchmark for verifying the performance of algorithms such as multi-agent reinforcement learning and collaborative decision-making, ensuring the fairness and comparability of experimental results.
In the experimental design, we adapted the MCRR framework to multi-task training by systematically expanding the classic SMAC benchmark maps and constructing three task series of graded difficulty. Specifically, the first series centers on mixed formations of Stalkers and Zealots, emphasizing the agents’ ability to generalize collaborative strategies across different roles; the second series introduces complex combat units including Marines, Marauders, and Medivacs, highlighting the integrated coordination of multi-arms cooperation and resource management; the third series focuses on varying the number of Marines, covering tasks from small-scale confrontations to large-scale team battles. The overall task design provides a hierarchical learning environment for multi-agent systems by carefully controlling the types and proportions of agents, effectively supporting the validation of cross-task knowledge transfer. For more detailed definitions, naming conventions, and specific configurations of each experimental task, readers may refer to Appendix A, which provides a comprehensive task mapping table for reference.

4.1. Role Generalization in Unknown Tasks

Algorithms that adapt to tasks of multiple scales are scarce; MCRR accurately characterizes agents’ division-of-labor and collaboration logic through its role mechanism, offering differentiated advantages in its underlying design compared with MATTAR, UPDeT (single-source task transfer), and REFIL. For a fair comparison, we transfer knowledge from each source task to unseen tasks, report the best (UPDeT-b) and mean (UPDeT-m) performance for each unseen task under the UPDeT framework, and compare MCRR, the “w/o Role” variant (with the role mechanism removed from the PIN mixing network), and the current state-of-the-art algorithm in this field, MATTAR.
The SMAC benchmark results in Table 1, Table 2 and Table 3 make the role mechanism’s key value in source-task role knowledge sharing and new-task role knowledge transfer clear: in source task training, it helps agents quickly follow role-division logic, reuse teamwork experience, and speed up strategy convergence and knowledge accumulation. After transferring to new tasks, it accurately invokes the pre-stored role collaboration patterns to adapt to unit interactions, significantly outperforming methods like MATTAR in cross-task knowledge reuse and task generalization.
The experimental results show that MCRR significantly outperforms traditional methods in multi-task transfer learning on the SMAC benchmark. In the Stalkers/Zealots task series, MCRR achieved win rates of 90% and 94% on the unseen tasks 1s8z and 1s9z, respectively, far exceeding MATTAR’s 79% and 60%, verifying the role mechanism’s enhancement of cross-task strategy generalization. In the complex multi-unit tasks with Marines, Marauders, and Medivacs (Table 2), MCRR maintained a 100% win rate in medium-scale tasks (such as MMM0 and MMM1), but its win rate dropped to 62% and 26% in the large-scale multi-unit tasks MMM5 and MMM6 due to the fixed role classification, reflecting the current framework’s limitation in runtime dynamic role adjustment. In the tasks with varying numbers of Marines, MCRR achieved a 100% win rate in tasks like 3m and 4m, and outperformed MATTAR with an 89% win rate in competitive tasks like 7m8m, where MATTAR’s win rate was 83%, demonstrating the collaborative effectiveness of role representation and the attention mechanism.

4.2. Role-Aided Good Initialization for Policy Fine-Tuning

The value of role-aided good initialization in policy fine-tuning is particularly prominent in difficult tasks. When addressing high-dimensional multi-agent scenarios involving dynamic adversarial interactions, complex goal hierarchies, or frequent role-dependent coordination demands, random or naive initialization often leads to inefficient exploration: agents may struggle to even establish basic collaboration patterns and get stuck in behavioral chaos. By contrast, role-aided initialization seeds the policy with pre-learned role-specific heuristics, creating a structured starting point that eliminates the need to re-learn foundational role-based coordination from scratch. This allows the fine-tuning process to focus directly on adapting stable role patterns to the unique complexities of the task. As a result, the strategy avoids costly detours in early learning stages and converges faster toward effective, role-coherent strategies even in demanding environments. As shown in Figure 2, on highly challenging complex multi-agent tasks, the MCRR fine-tuning strategy with role-aided initialization exhibits a median test win rate curve that significantly outperforms other methods. This empirical result confirms the crucial value of role-aided initialization in policy fine-tuning.

4.3. Dynamic Evolution of Role Representations

To further investigate the internal decision-making mechanism of the MCRR framework, we used t-SNE to visualize the role embeddings $z^t$ in the 2s3z scenario with $K = 3$ roles. The progression from Figure 3a–c illustrates the evolution of the role clusters from the beginning of the task to the final stage. At the initial stage (Figure 3a), agents form three distinct and separated clusters in the feature space, as indicated by the dashed circles, based on a preliminary division of the units’ original characteristics, reflecting the model’s ability to identify fundamental functional divisions from early observations. As the task progresses toward the end (Figure 3c), the shapes and memberships of these clusters undergo significant dynamic migration, providing a direct visualization of how agents flexibly switch their role positioning in response to real-time feedback. Notably, at the final stage, the learned embeddings clearly distinguish between defeated units and those still actively engaged in combat, demonstrating the representation space’s high sensitivity to individual status changes. This highlights the core advantages of MCRR: the system avoids assigning rigid labels and instead allows agents to move smoothly within the role semantic space to adapt to changing environments, while the role representation captures behavioral characteristics across phases to keep team coordination consistent with macro strategies.
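For reference, a t-SNE projection like the one in Figure 3 can be produced with a few lines of scikit-learn and matplotlib; the array shapes, perplexity, and color mapping below are illustrative assumptions, not the exact plotting code used for the figure.

```python
# Illustrative sketch: project collected role embeddings z_t to 2-D and color by cluster.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_role_embeddings(role_embeddings: np.ndarray, cluster_ids: np.ndarray, title: str):
    # role_embeddings: (n_points, d); cluster_ids: (n_points,).
    # Note: perplexity must be smaller than the number of points.
    xy = TSNE(n_components=2, perplexity=5, init="pca", random_state=0).fit_transform(role_embeddings)
    plt.scatter(xy[:, 0], xy[:, 1], c=cluster_ids, cmap="tab10", s=40)
    plt.title(title)
    plt.show()
```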

5. Related Works

Role Concept: Researchers have long acknowledged the utility of role concepts in multi-agent reinforcement learning [26,27,28]. Analogous to roles, concepts like skills [29] and sub-tasks [30] also enable differentiated behavioral strategies by classifying agents during execution, fostering more efficient coordination. RODE [11] associates each role with a fixed subset of the full action space to mitigate learning complexity, while SIRD [31] extends this by transforming role discovery into hierarchical action space clustering. COPA [32] achieves dynamic role allocation by periodically distributing global team compositions to agents even during execution.
Transfer Learning: Transfer learning enables knowledge transfer from source to target tasks to improve training efficiency, which has been widely applied in single-agent scenarios but remains limited in multi-agent domains. Early studies achieved knowledge transfer by explicitly calculating task similarity, while recent works like Liu et al. [33] extended MDP similarity definitions, designing scalable transfer methods based on N-step returns to accelerate multi-agent learning. Qin et al. [17] proposed the MATTAR framework, capturing common task structures by modeling similarities in state transitions and reward functions to enhance policy transfer and generalization. In multi-agent transfer, transferable knowledge manifests as experience, policies, and representations. Niu et al. [34] proposed an experience transfer method based on environmental differences, defining similarity via reward prediction errors to guide sample sampling. Bo et al. [35] developed an isomorphic task transfer algorithm utilizing knowledge distillation to enable effective policy transfer across collaborative agents. Shi et al. [36] introduced the MALT algorithm, leveraging attention mechanisms to measure representation importance and horizontal connections for cross-task representation transfer, supporting knowledge reuse across heterogeneous agents.

6. Conclusions

This paper presents the MCRR framework, which integrates contrastive learning-driven role mechanisms with multi-agent transfer frameworks to construct cross-task role contrast models for the collaborative optimization of role strategies and transfer learning. The method offers key advantages: dynamic clustering of agent behavior patterns via contrastive learning generates task-generalizable role representations, enabling agents to efficiently share role-division knowledge during source task training and to achieve policy transfer through the dynamic adaptation of role representations in unseen tasks. The combination of the attention collaboration mechanism and role representations enhances cross-dimensional coordination efficiency in complex scenarios, significantly improving win rates in SMAC benchmark tasks such as mixed formations and quantity variations, which validates roles as bridges for knowledge transfer. The current framework relies on human prior knowledge to predefine the number of role categories (e.g., a fixed 3-role classification), which may limit adaptability to complex divisions of labor in large-scale multi-agent tasks. Future work will explore dynamic role adjustment mechanisms to enhance practicality.

Author Contributions

Methodology: Z.W., J.W. and J.Z.; Writing–original draft: Z.W.; Writing–review and editing: Z.W. and J.Z.; Formal analysis: Z.W.; Validation: Z.W. and J.Z.; Visualization: Z.W.; Supervision: Z.W. and J.Z.; Investigation: Z.W., J.W. and J.Z.; Software: Z.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Shenzhen Science and Technology Program under Grant No. KJZD20230923114213027.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Appendix A.1. Experimental Detail

SMAC (the StarCraft Multi-Agent Challenge) includes combat scenarios focused on StarCraft II unit micromanagement and is a popular benchmark in the field of multi-agent reinforcement learning. We adopt a partially observable setting, where an agent can only observe a circular area centered on itself with a radius equal to its sight range (the default sight range is 9). We train ally units using MCRR to fight against enemy units controlled by the built-in AI. At the start of each episode, ally and enemy units spawn in pre-defined regions on the map. Each agent selects actions from a discrete action space, which includes four types of actions: no-op (no operation), move[direction], attack[enemyid], and stop. Under the control of these actions, agents can move and attack on a continuous map. At each timestep, agents receive a shared reward equal to the total damage dealt to enemy units. Additionally, an extra bonus of 10 is awarded for killing a single enemy unit, and a bonus of 200 is awarded for winning the combat (killing all enemy units). We consider three series of SMAC tasks, each containing multiple maps, with detailed descriptions provided in Table A1, Table A2 and Table A3.
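The shared reward described above can be summarized by the following one-line sketch; the function signature is an illustrative assumption that mirrors SMAC’s default reward shaping.

```python
# Minimal sketch of the per-timestep shared reward: damage dealt, +10 per enemy kill,
# and +200 when the battle is won.
def step_reward(damage_dealt: float, enemies_killed: int, battle_won: bool) -> float:
    return damage_dealt + 10.0 * enemies_killed + (200.0 if battle_won else 0.0)
```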
Table A1. Settings of tasks in the SZ series. The bolded names indicate the source tasks.
Map Name | Ally Units | Enemy Units | Difficulty
1s8z | 1 Stalkers, 8 Zealots | 1 Stalkers, 8 Zealots | Easy
1s9z | 1 Stalkers, 9 Zealots | 1 Stalkers, 9 Zealots | Easy
2s3z | 2 Stalkers, 3 Zealots | 2 Stalkers, 3 Zealots | Easy
2s8z | 2 Stalkers, 8 Zealots | 2 Stalkers, 8 Zealots | Easy
2s9z | 2 Stalkers, 9 Zealots | 2 Stalkers, 9 Zealots | Easy
3s5z | 3 Stalkers, 5 Zealots | 3 Stalkers, 5 Zealots | Easy
3s5z_vs_3s6z | 3 Stalkers, 5 Zealots | 3 Stalkers, 6 Zealots | Super Hard
7s3z | 7 Stalkers, 3 Zealots | 7 Stalkers, 3 Zealots | Easy
3s5z_vs_3s7z | 3 Stalkers, 5 Zealots | 3 Stalkers, 7 Zealots | Extremely Hard
Table A2. Settings of tasks in the M series. The bolded names indicate the source tasks.
Map Name | Ally Units | Enemy Units | Difficulty
3m | 3 Marines | 3 Marines | Easy
4m | 4 Marines | 4 Marines | Easy
4m_vs_5m | 4 Marines | 5 Marines | Hard
5m | 5 Marines | 5 Marines | Easy
5m_vs_6m | 5 Marines | 6 Marines | Hard
6m | 6 Marines | 6 Marines | Easy
6m_vs_7m | 6 Marines | 7 Marines | Hard
7m | 7 Marines | 7 Marines | Easy
7m_vs_8m | 7 Marines | 8 Marines | Hard
8m | 8 Marines | 8 Marines | Easy
8m_vs_9m | 8 Marines | 9 Marines | Easy
9m | 9 Marines | 9 Marines | Easy
9m_vs_10m | 9 Marines | 10 Marines | Easy
10m | 10 Marines | 10 Marines | Easy
10m_vs_11m | 10 Marines | 11 Marines | Easy
10m_vs_12m | 10 Marines | 12 Marines | Super Hard
Table A3. Settings of tasks in the MMM series. The bolded names indicate the source tasks.
Map Name | Ally Units | Enemy Units | Difficulty
MMM0 | 1 Medivac, 2 Marauders, 5 Marines | 1 Medivac, 2 Marauders, 5 Marines | Easy
MMM | 1 Medivac, 2 Marauders, 7 Marines | 1 Medivac, 2 Marauders, 7 Marines | Easy
MMM1 | 1 Medivac, 1 Marauder, 7 Marines | 1 Medivac, 2 Marauders, 7 Marines | Hard
MMM2 | 1 Medivac, 2 Marauders, 7 Marines | 1 Medivac, 3 Marauders, 8 Marines | Super Hard
MMM3 | 1 Medivac, 2 Marauders, 8 Marines | 1 Medivac, 3 Marauders, 9 Marines | Super Hard
MMM4 | 1 Medivac, 3 Marauders, 8 Marines | 1 Medivac, 4 Marauders, 9 Marines | Super Hard
MMM5 | 1 Medivac, 3 Marauders, 8 Marines | 1 Medivac, 4 Marauders, 10 Marines | Super Hard
MMM6 | 1 Medivac, 3 Marauders, 8 Marines | 1 Medivac, 4 Marauders, 11 Marines | Super Hard

Appendix A.2. Pseudo-Code

The pseudo-code of the entire algorithm is shown in Algorithm A1. This pseudo-code presents the complete logical chain of the algorithm, from initialization and iterative optimization to termination, covering key steps such as data preprocessing, the core modules (role-based representation modeling, MCRR-based policy fine-tuning, and contrastive learning-driven knowledge transfer), and the iteration termination criteria.
Algorithm A1 MCRR
  • Input: $\{S_i\}_{i=1}^{N_{src}}$: source tasks
  •        $\{B_i\}_{i=1}^{N_{src}}$: replay buffers for all source tasks
  •        $K$: number of clusters
  •        $n$: number of agents
  •        $T_d$: time interval for updating the contrastive loss
  •        $f_\theta$: role encoder
  • 1:  Initialize the task representations $\{z_i\}_{i=1}^{N_{src}}$ for all source tasks $\{S_i\}_{i=1}^{N_{src}}$ as orthogonal unit vectors. Below we train the representation explainer network
  • 2:  Initialize the replay buffers $\{B_i\}_{i=1}^{N_{src}}$ for storing agent trajectories
  • 3:  Set $t = 0$
  • 4:  while $t <$ task_rep_max_iteration do
  • 5:      for each source task $S_i$ do
  • 6:          Store the trajectory into the buffer $B_i$
  • 7:          Sample a batch of trajectories from buffer $B_i$
  • 8:          Get the partial observations $\{o_i^t\}_{i=1}^{n}$ of each agent, the joint action $\mathbf{a}^t$, and the global state $s^t$
  • 9:          Compute the model prediction loss $J_{S_i}(\theta)$ for source task $S_i$
  • 10:         Update the representation explainer network parameters
  • 11:     end for
  • 12:     Set $t = t + 1$
  • 13: end while
  • 14: Empty all the buffers $\{B_i\}_{i=1}^{N_{src}}$. Next, we train the agent policy network
  • 15: for episode $= 1, 2, \ldots$ do
  • 16:     for each source task $S_i$ do
  • 17:         Get the partial observations $\{o_i^t\}_{i=1}^{n}$ of each agent and the global state $s^t$
  • 18:         for agent $i = 1, 2, \ldots, n$ do
  • 19:             Calculate the agent embedding $e_i^t = f_\phi(o_i^t, a_i^{t-1}, e_i^{t-1})$
  • 20:             Calculate the role representation $z_i^t = f_\theta(e_i^t)$
  • 21:             Select the local action $a_i^t$ according to the Q-function $Q_i(e_i^t, a_i^t)$
  • 22:         end for
  • 23:         Execute the joint action $\mathbf{a}^t = [a_1^t, a_2^t, \ldots, a_n^t]$ and obtain the global reward $r^t$
  • 24:         Store the trajectory to $B_i$
  • 25:         Sample a batch of trajectories from $B_i$
  • 26:         if episode mod $T_d == 0$ then
  • 27:             Partition the agent embeddings $\{e_i^t\}_{i=1}^{n}$ into $K$ clusters $\{C_j\}_{j=1}^{K}$ using K-means
  • 28:             for agent $i = 1, 2, \ldots, n$ do
  • 29:                 Construct positive keys $\{z_r\}_{r \in C_j}$ and negative keys $\{z_r\}_{r \notin C_j}$ for the query $z_i$, $i \in C_j$
  • 30:             end for
  • 31:             Update the contrastive learning loss
  • 32:             Update the momentum role encoder
  • 33:         end if
  • 34:         Compute the TD loss $\mathcal{L}^{TD}_{S_i}(\psi)$
  • 35:         Update the PIN network parameters $\psi \leftarrow \psi - \alpha \cdot \nabla_\psi \mathcal{L}^{TD}_{S_i}(\psi)$
  • 36:     end for
  • 37: end for
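A hedged sketch of the periodic clustering step in Algorithm A1 (steps 27–29) is given below: K-means over the current agent embeddings assigns role clusters, and same-cluster versus different-cluster agents provide the positive and negative keys for the contrastive update. scikit-learn’s KMeans stands in for whatever clustering routine the implementation actually uses.

```python
# Hedged sketch of role clustering and positive/negative key construction.
import numpy as np
from sklearn.cluster import KMeans

def build_contrastive_pairs(agent_embeddings: np.ndarray, K: int):
    # agent_embeddings: (n_agents, d) detached embeddings e_i^t.
    labels = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(agent_embeddings)
    pairs = {}
    for i, cluster in enumerate(labels):
        positives = [j for j, c in enumerate(labels) if c == cluster and j != i]
        negatives = [j for j, c in enumerate(labels) if c != cluster]
        pairs[i] = (positives, negatives)
    return labels, pairs
```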

References

  1. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; Riedmiller, M. Playing atari with deep reinforcement learning. arXiv 2013, arXiv:1312.5602. [Google Scholar] [CrossRef]
  2. Todorov, E.; Erez, T.; Tassa, Y. Mujoco: A physics engine for model-based control. In Proceedings of the 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, Vilamoura-Algarve, Portugal, 7–12 October 2012. [Google Scholar]
  3. Terven, J. Deep reinforcement learning: A chronological overview and methods. AI 2025, 6, 46. [Google Scholar] [CrossRef]
  4. Vinyals, O.; Babuschkin, I.; Czarnecki, W.M.; Mathieu, M.; Dudzik, A.; Chung, J.; Choi, D.H.; Powell, R.; Ewalds, T.; Georgiev, P.; et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 2019, 575, 350–354. [Google Scholar] [CrossRef]
  5. Chen, D.; Chen, K.; Li, Z.; Chu, T.; Yao, R.; Qiu, F.; Lin, K. Powernet: Multi-agent deep reinforcement learning for scalable powergrid control. IEEE Trans. Power Syst. 2021, 37, 1007–1017. [Google Scholar] [CrossRef]
  6. Altun, H.O.; Ceran, H.F.; Metin, K.K.; Erol, T.; Fişne, E. Strategic Implementation of Super-Agents in Heterogeneous Multi-Agent Training for Advanced Military Simulation Adaptability. IEEE Access 2025, 13, 96544–96563. [Google Scholar] [CrossRef]
  7. Fereidooni, Z.; Palesi, L.I.; Nesi, P. Multi-Agent Optimizing Traffic Light Signals Using Deep Reinforcement Learning. IEEE Access 2025, 13, 106974–106988. [Google Scholar] [CrossRef]
  8. Foerster, J.; Assael, I.A.; De Freitas, N.; Whiteson, S. Learning to communicate with deep multi-agent reinforcement learning. Adv. Neural Inf. Process. Syst. 2016, 29, 2145–2153. [Google Scholar]
  9. Gupta, J.K.; Egorov, M.; Kochenderfer, M. Cooperative multi-agent control using deep reinforcement learning. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems, São Paulo, Brazil, 8–12 May 2017; Springer International Publishing: Cham, Switzerland, 2017. [Google Scholar]
  10. Wang, T.; Dong, H.; Lesser, V.; Zhang, C. Roma: Multi-agent reinforcement learning with emergent roles. arXiv 2020, arXiv:2003.08039. [Google Scholar] [CrossRef]
  11. Wang, T.; Gupta, T.; Mahajan, A.; Peng, B.; Whiteson, S.; Zhang, C. Rode: Learning roles to decompose multi-agent tasks. arXiv 2020, arXiv:2010.01523. [Google Scholar] [CrossRef]
  12. Hu, Z.; Zhang, Z.; Li, H.; Chen, C.; Ding, H.; Wang, Z. Attention-guided contrastive role representations for multi-agent reinforcement learning. arXiv 2023, arXiv:2312.04819. [Google Scholar]
  13. Tan, C.; Sun, F.; Kong, T.; Zhang, W.; Yang, C.; Liu, C. A survey on deep transfer learning. In Proceedings of the International Conference on Artificial Neural Networks, Rhodes, Greece, 4–7 October 2018; Springer International Publishing: Cham, Switzerland, 2018. [Google Scholar]
  14. Hu, S.; Zhu, F.; Chang, X.; Liang, X. Updet: Universal multi-agent reinforcement learning via policy decoupling with transformers. arXiv 2021, arXiv:2101.08001. [Google Scholar]
  15. Iqbal, S.; De Witt, C.A.; Peng, B.; Böhmer, W.; Whiteson, S.; Sha, F. Randomized entity-wise factorization for multi-agent reinforcement learning. In Proceedings of the International Conference on Machine Learning PMLR 2021, Virtual, 18–24 July 2021. [Google Scholar]
  16. Samvelyan, M.; Rashid, T.; De Witt, C.S.; Farquhar, G.; Nardelli, N.; Rudner, T.G.; Hung, C.M.; Torr, P.H.; Foerster, J.; Whiteson, S. The starcraft multi-agent challenge. arXiv 2019, arXiv:1902.04043. [Google Scholar]
  17. Qin, R.; Chen, F.; Wang, T.; Yuan, L.; Wu, X.; Kang, Y.; Zhang, Z.; Zhang, C.; Yu, Y. Multi-agent policy transfer via task relationship modeling. Sci. China Inf. Sci. 2024, 67, 182101. [Google Scholar] [CrossRef]
  18. Oliehoek, F.A.; Amato, C. A Concise Introduction to Decentralized POMDPs; Springer International Publishing: Cham, Switzerland, 2016; Volume 1. [Google Scholar]
  19. Oord, A.V.; Li, Y.; Vinyals, O. Representation learning with contrastive predictive coding. arXiv 2018, arXiv:1807.03748. [Google Scholar]
  20. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar]
  21. Chung, J.; Gulcehre, C.; Cho, K.; Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv 2014, arXiv:1412.3555. [Google Scholar] [CrossRef]
  22. Laskin, M.; Srinivas, A.; Abbeel, P. Curl: Contrastive unsupervised representations for reinforcement learning. In Proceedings of the International Conference on Machine Learning PMLR 2020, Virtual, 13–18 July 2020. [Google Scholar]
  23. Yuan, H.; Lu, Z. Robust task representations for offline meta-reinforcement learning via contrastive learning. In Proceedings of the International Conference on Machine Learning PMLR 2022, Baltimore, MD, USA, 17–23 July 2022. [Google Scholar]
  24. Sunehag, P.; Lever, G.; Gruslys, A.; Czarnecki, W.M.; Zambaldi, V.; Jaderberg, M.; Lanctot, M.; Sonnerat, N.; Leibo, J.Z.; Tuyls, K.; et al. Value-decomposition networks for cooperative multi-agent learning. arXiv 2017, arXiv:1706.05296. [Google Scholar]
  25. Rashid, T.; Samvelyan, M.; De Witt, C.S.; Farquhar, G.; Foerster, J.; Whiteson, S. Monotonic value function factorisation for deep multi-agent reinforcement learning. J. Mach. Learn. Res. 2020, 21, 1–51. [Google Scholar]
  26. Lhaksmana, K.M.; Murakami, Y.; Ishida, T. Role-based modeling for designing agent behavior in self-organizing multi-agent systems. Int. J. Softw. Eng. Knowl. Eng. 2018, 28, 79–96. [Google Scholar]
  27. Xia, Y.; Zhu, J.; Zhu, L. Dynamic role discovery and assignment in multi-agent task decomposition. Complex Intell. Syst. 2023, 9, 6211–6222. [Google Scholar] [CrossRef]
  28. Cao, J.; Yuan, L.; Wang, J.; Zhang, S.; Zhang, C.; Yu, Y.; Zhan, D.C. LINDA: Multi-agent local information decomposition for awareness of teammates. Sci. China Inf. Sci. 2023, 66, 182101. [Google Scholar] [CrossRef]
  29. Yang, J.; Borovikov, I.; Zha, H. Hierarchical cooperative multi-agent reinforcement learning with skill discovery. arXiv 2019, arXiv:1912.03558. [Google Scholar]
  30. Yuan, L.; Wang, C.; Wang, J.; Zhang, F.; Chen, F.; Guan, C.; Zhang, Z.; Zhang, C.; Yu, Y. Multi-Agent Concentrative Coordination with Decentralized Task Representation. In Proceedings of the IJCAI 2022, Vienna, Austria, 23–29 July 2022. [Google Scholar]
  31. Zeng, X.; Peng, H.; Li, A. Effective and stable role-based multi-agent collaboration by structural information principles. Proc. AAAI Conf. Artif. Intell. 2023, 37, 11772–11780. [Google Scholar] [CrossRef]
  32. Liu, B.; Liu, Q.; Stone, P.; Garg, A.; Zhu, Y.; Anandkumar, A. Coach-player multi-agent reinforcement learning for dynamic team composition. In Proceedings of the International Conference on Machine Learning PMLR 2021, Virtual, 18–24 July 2021. [Google Scholar]
  33. Liu, Y.; Hu, Y.; Gao, Y.; Chen, Y.; Fan, C. Value Function Transfer for Deep Multi-Agent Reinforcement Learning Based on N-Step Returns. In Proceedings of the IJCAI 2019, Macao, China, 10–16 August 2019. [Google Scholar]
  34. Niu, L.; Liang, W.; Tao, J.; Zhou, W.; Yan, H. Multi-agent reinforcement learning policy transfer by buffer. In Proceedings of the 2021 7th International Conference on Big Data and Information Analytics (BigDIA), Chongqing, China, 29–31 October 2021. [Google Scholar]
  35. Bo, C.; Liu, S.; Liu, Y.; Guo, Z.; Wang, J.; Xu, J. Research on Isomorphic Task Transfer Algorithm Based on Knowledge Distillation in Multi-Agent Collaborative Systems. Sensors 2024, 24, 4741. [Google Scholar] [CrossRef] [PubMed]
  36. Shi, H.; Li, J.; Mao, J.; Hwang, K.S. Lateral transfer learning for multiagent reinforcement learning. IEEE Trans. Cybern. 2021, 53, 1699–1711. [Google Scholar] [CrossRef] [PubMed]
Figure 1. The MCRR framework based on QMIX. (a) The overall architecture. (b) The attention module that incorporates learned role representations into the mixing network’s input for better value decomposition. (c) The structure of a shared individual Q-network. (d) The details of contrastive role representation learning.
Figure 2. On the unseen extremely hard task 3s5z_vs_3s7z, role representations provide a good initialization.
Figure 3. Visualization of role embedding evolution in 2s3z. Numbers in different colors correspond to the units shown in the upper panel. (a) Role division at the initial stage of the task. (b) Role division at the middle stage of the task. (c) Role division at the end of the task.
Table 1. Transfer performance (mean win rates with variance) on the Stalkers/Zealots series of SMAC maps.
Task | Our | w/o Role | MATTAR | UPDeT-b | UPDeT-m | REFIL
Source tasks
2s3z | 1.00 ± 0.00 | 1.00 ± 0.00 | 1.00 ± 0.00 | 0.94 ± 0.04 | 0.60 ± 0.11 | 0.75 ± 0.09
3s5z | 1.00 ± 0.00 | 0.96 ± 0.04 | 0.99 ± 0.01 | 0.86 ± 0.13 | 0.47 ± 0.15 | 0.43 ± 0.13
3s5z3s6z | 0.83 ± 0.07 | 0.73 ± 0.13 | 0.48 ± 0.13 | 0.09 ± 0.08 | 0.03 ± 0.03 | 0.01 ± 0.01
Unseen tasks
1s8z | 0.90 ± 0.06 | 0.86 ± 0.07 | 0.79 ± 0.09 | 0.16 ± 0.11 | 0.08 ± 0.06 | 0.08 ± 0.04
1s9z | 0.94 ± 0.01 | 0.83 ± 0.11 | 0.60 ± 0.12 | 0.11 ± 0.10 | 0.04 ± 0.04 | 0.03 ± 0.01
2s8z | 0.96 ± 0.03 | 0.98 ± 0.04 | 0.93 ± 0.09 | 0.29 ± 0.22 | 0.14 ± 0.12 | 0.08 ± 0.05
2s9z | 0.92 ± 0.03 | 0.85 ± 0.13 | 0.84 ± 0.04 | 0.15 ± 0.13 | 0.06 ± 0.05 | 0.05 ± 0.04
7s3z | 0.42 ± 0.12 | 0.35 ± 0.08 | 0.16 ± 0.12 | 0.02 ± 0.04 | 0.01 ± 0.01 | 0.06 ± 0.04
The best results on each map are bolded. Besides, 3s5z3s6z is short for 3s5z_vs_3s6z, and 3s5z_3s7z in the later text is short for 3s5z_vs_3s7z similarly.
Table 2. Transfer performance (mean win rates with variance) on the Marines, Marauders, Medivacs (MMM) series of SMAC maps.
Task | Our | w/o Role | MATTAR | UPDeT-b | UPDeT-m | REFIL
Source tasks
MMM | 1.00 ± 0.00 | 0.96 ± 0.03 | 1.00 ± 0.00 | 1.00 ± 0.00 | 0.48 ± 0.03 | 0.97 ± 0.01
MMM2 | 0.96 ± 0.03 | 0.83 ± 0.10 | 0.92 ± 0.20 | 0.78 ± 0.04 | 0.15 ± 0.19 | 0.04 ± 0.02
MMM6 | 0.27 ± 0.03 | 0.15 ± 0.04 | 0.09 ± 0.02 | 0.00 ± 0.00 | 0.00 ± 0.00 | 0.00 ± 0.00
Unseen tasks
MMM0 | 1.00 ± 0.00 | 1.00 ± 0.00 | 0.98 ± 0.02 | 0.73 ± 0.21 | 0.30 ± 0.16 | 0.93 ± 0.02
MMM1 | 1.00 ± 0.00 | 1.00 ± 0.00 | 0.97 ± 0.04 | 0.84 ± 0.07 | 0.27 ± 0.13 | 0.38 ± 0.06
MMM3 | 0.95 ± 0.14 | 0.75 ± 0.13 | 0.86 ± 0.10 | 0.57 ± 0.15 | 0.28 ± 0.08 | 0.12 ± 0.04
MMM4 | 1.00 ± 0.00 | 0.80 ± 0.13 | 0.93 ± 0.12 | 0.41 ± 0.14 | 0.20 ± 0.07 | 0.06 ± 0.03
MMM5 | 0.62 ± 0.08 | 0.54 ± 0.03 | 0.47 ± 0.15 | 0.00 ± 0.00 | 0.00 ± 0.00 | 0.00 ± 0.00
The best results on each map are bolded.
Table 3. Transfer performance (mean win rates with variance) on the Marines series of SMAC maps.
Task | Our | w/o Role | MATTAR | UPDeT-b | UPDeT-m | REFIL
Source tasks
5m | 1.00 ± 0.00 | 1.00 ± 0.00 | 1.00 ± 0.00 | 1.00 ± 0.00 | 0.77 ± 0.09 | 0.73 ± 0.03
5m6m | 0.76 ± 0.07 | 0.77 ± 0.11 | 0.72 ± 0.05 | 0.93 ± 0.05 | 0.32 ± 0.03 | 0.00 ± 0.00
8m9m | 0.86 ± 0.04 | 0.96 ± 0.04 | 0.83 ± 0.05 | 0.81 ± 0.19 | 0.35 ± 0.05 | 0.01 ± 0.01
10m11m | 0.93 ± 0.03 | 0.91 ± 0.09 | 0.81 ± 0.09 | 0.94 ± 0.04 | 0.43 ± 0.02 | 0.03 ± 0.02
Unseen tasks
3m | 1.00 ± 0.00 | 1.00 ± 0.00 | 0.94 ± 0.27 | 0.81 ± 0.08 | 0.36 ± 0.04 | 0.68 ± 0.06
4m | 1.00 ± 0.00 | 1.00 ± 0.00 | 0.97 ± 0.02 | 0.95 ± 0.06 | 0.57 ± 0.03 | 0.74 ± 0.02
4m5m | 0.18 ± 0.05 | 0.14 ± 0.03 | 0.04 ± 0.05 | 0.29 ± 0.17 | 0.10 ± 0.06 | 0.00 ± 0.00
6m | 1.00 ± 0.00 | 1.00 ± 0.00 | 1.00 ± 0.00 | 1.00 ± 0.00 | 0.91 ± 0.09 | 0.71 ± 0.02
6m7m | 0.90 ± 0.03 | 0.86 ± 0.05 | 0.74 ± 0.15 | 0.78 ± 0.05 | 0.35 ± 0.10 | 0.01 ± 0.00
7m | 1.00 ± 0.00 | 1.00 ± 0.00 | 1.00 ± 0.00 | 0.99 ± 0.01 | 0.92 ± 0.03 | 0.66 ± 0.03
7m8m | 0.89 ± 0.09 | 0.82 ± 0.07 | 0.83 ± 0.04 | 0.73 ± 0.11 | 0.38 ± 0.05 | 0.01 ± 0.01
8m | 1.00 ± 0.00 | 1.00 ± 0.00 | 1.00 ± 0.00 | 0.99 ± 0.02 | 0.83 ± 0.05 | 0.63 ± 0.05
9m | 1.00 ± 0.00 | 1.00 ± 0.00 | 1.00 ± 0.00 | 0.99 ± 0.01 | 0.66 ± 0.11 | 0.55 ± 0.05
9m10m | 0.89 ± 0.05 | 1.00 ± 0.00 | 0.84 ± 0.09 | 0.80 ± 0.16 | 0.33 ± 0.09 | 0.01 ± 0.00
10m | 1.00 ± 0.00 | 1.00 ± 0.00 | 1.00 ± 0.00 | 0.99 ± 0.01 | 0.17 ± 0.08 | 0.46 ± 0.02
10m12m | 0.18 ± 0.07 | 0.12 ± 0.02 | 0.07 ± 0.01 | 0.07 ± 0.04 | 0.03 ± 0.02 | 0.00 ± 0.00
The best results on each map are bolded. Besides, xmym is short for xm_vs_ym (e.g., 5m6m is short for 5m_vs_6m).
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
