Article

A Multi-Agent Cooperative Group Game Model Based on Intention-Strategy Optimization

1 College of Aerospace Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China
2 College of Intelligent Manufacturing, Yangzhou Polytechnic Institute, Yangzhou 225127, China
3 College of Information Engineering, Yangzhou University, Yangzhou 225009, China
* Author to whom correspondence should be addressed.
Algorithms 2026, 19(1), 22; https://doi.org/10.3390/a19010022
Submission received: 2 December 2025 / Revised: 14 December 2025 / Accepted: 23 December 2025 / Published: 24 December 2025

Abstract

With the rapid advancement of artificial intelligence technology, multi-agent systems are being widely applied in fields such as autonomous driving and robotic collaboration. However, existing methods often suffer from the disconnection between intention recognition and strategy optimization, leading to inefficiencies in group collaboration. This paper proposes a multi-agent cooperative group game model based on Intention-Strategy Optimization (ISO-MAGCG). The model establishes a two-layer optimization framework encompassing intention and strategy, enabling dynamic adaptation through the co-evolution of upper-layer intention recognition and lower-layer strategy optimization. A Group Attention-based Intention Recognition Network (GAIN) is designed to efficiently capture complex interactions among agents. Furthermore, an Adaptive Group Evolution Algorithm (AGEA) is proposed to ensure the stability of large-scale cooperative endeavors. Experiments conducted in navigation, resource collection, and defense collaboration scenarios validate the effectiveness of the proposed method. Compared with mainstream algorithms such as QMIX, MADDPG, and MAPPO, ISO-MAGCG demonstrates significant superiority in metrics including task success rate and cooperative efficiency, achieving an average improvement of 8.4% in task success rate, a 12% enhancement in cooperative efficiency, and an intention recognition accuracy of 94.3%. The results indicate notable performance advantages and favorable scalability.

1. Introduction

With the rapid development of artificial intelligence technology, Multi-Agent Systems (MASs) have found extensive applications in diverse fields such as autonomous driving [1], robotic collaboration, intelligent transportation, resource allocation, and distributed computing. In these complex application scenarios, multiple agents need to collaborate and compete effectively within dynamic environments to optimize the overall system performance. However, traditional multi-agent collaboration methods often rely on pre-defined coordination mechanisms or simplistic communication protocols, struggling to adapt to complex and changing environmental conditions [2]. Furthermore, as the number of agents increases and environmental complexity grows, achieving effective intention recognition, strategy optimization, and group coordination among agents has become a critical challenge. Particularly in partially observable environments, agents must infer the intentions of others through observed behaviors and adjust their own strategies accordingly, posing new challenges to traditional game theory and reinforcement learning methods.
Current research on multi-agent cooperation primarily focuses on reinforcement learning, game theory, and intention recognition, yet several limitations persist. In Multi-Agent Reinforcement Learning (MARL), the QMIX algorithm proposed by Rashid et al. (2020) enables centralized training with decentralized execution via monotonic value function factorization [3], but its performance degrades significantly in non-monotonic environments. QTRAN, developed by Son et al. (2019), relaxes the monotonicity assumption [4] but suffers from poor training stability. The QPLEX algorithm, introduced by Kong et al. (2025), enhances expressive capacity through dueling decomposition [5], albeit at the cost of high computational complexity. In the field of intention recognition, Li et al. (2025) designed an intention inference method based on Variational Autoencoders (VAEs) [6], which lacks adaptability in dynamic settings. Zhang et al. (2025) incorporated attention mechanisms within a multi-agent actor-critic framework [7], but omitted explicit intention modeling. Chernyavskiy et al. (2025) investigated the role of opponent modeling in multi-agent learning [8], yet the computational overhead is substantial. Regarding cooperative game theory, Xu et al. (2025) studied dynamic coalition formation [9] but did not account for real-time intention changes. Asadi et al. (2021) proposed deep learning-based solutions for cooperative games [10], but theoretical guarantees for convergence are lacking. Juan et al. (2013) explored coordination mechanisms in large-scale multi-agent systems [11], yet scalability remains limited. Yun et al. (2015) analyzed cooperative strategies under incomplete information [12], but the intention modeling is oversimplified. Li et al. (2022) introduced a graph neural network-based approach for multi-agent collaboration [13] but neglected the temporal characteristics of intentions. 
More recently, transformer-based architectures have been introduced to multi-agent learning. Pang et al. (2022) proposed the Multi-Agent Transformer (MAT) that leverages sequential modeling and cross-agent attention for coordination [14]. Zhao et al. (2020) introduced ROMA, which enables emergent role discovery through a role encoder that clusters agents with similar behaviors [15]. While these methods demonstrate improved coordination capabilities, they primarily focus on learning implicit coordination patterns rather than explicit intention modeling, and the integration of interpretable intention recognition with strategy optimization remains underexplored. In summary, existing methods predominantly suffer from the following issues: a lack of synergistic mechanisms integrating intention recognition and strategy optimization; low computational efficiency in large-scale groups; insufficient adaptability and robustness in dynamic environments; and inadequate theoretical convergence analysis.
To address the aforementioned challenges, this paper proposes a multi-agent cooperative group game model based on Intention-Strategy Optimization (ISO-MAGCG). The main contributions of this work are as follows. Firstly, an intention-strategy bi-level optimization framework is constructed, enabling dynamic adaptation of intentions and strategies through the co-evolution of upper-level intention recognition and lower-level strategy optimization. Secondly, a Group Attention-based Intention recognition Network (GAIN) is designed, which efficiently captures complex interaction relationships and intention dependencies among agents through dynamic graph modeling and multi-head attention mechanisms. Finally, an Adaptive Group Evolution Algorithm (AGEA) is proposed, which ensures the stability and convergence of large-scale group cooperation via dynamic weight allocation and distributed solution methods. The primary innovations of this paper lie in establishing a unified theoretical framework integrating intention recognition and strategy optimization; proposing an efficient method for modeling group intentions; and designing an adaptive equilibrium algorithm with theoretical guarantees.

2. Fundamental Theory

2.1. Multi-Agent Reinforcement Learning Theory

Multi-Agent Reinforcement Learning (MARL) represents an extension of reinforcement learning to multi-agent environments [16], with its theoretical foundation rooted in Markov Games. Unlike single-agent reinforcement learning, the optimal policy of each agent in multi-agent environments depends not only on the environmental state but is also influenced by the strategies of other agents, rendering the learning process significantly more complex [17]. In multi-agent systems, agents exhibit intricate interaction patterns, including cooperative, competitive, and mixed relationships, which directly impact both the overall system performance and individual rewards [18].
Formally, an n-agent Markov game can be defined as a tuple, as shown in Equation (1), with the multi-agent Markov game illustrated in Figure 1:
G = ⟨N, S, {A_i}_{i=1}^n, P, {R_i}_{i=1}^n, γ⟩
where the parameters are defined as follows: N = {1, 2, …, n} represents the agent set, encompassing all decision-making agents in the system; S denotes the global state space, describing all possible environmental states; Ai represents the action space of agent i, defining all executable actions for that agent; P: S × A × S → [0, 1] denotes the state transition probability function, where A = A1 × A2 × … × An is the joint action space; Ri: S × A→R represents the reward function of agent i; and γ ∈ [0, 1) represents the discount factor, which balances the importance of immediate rewards versus future rewards.
Each agent’s policy is defined as a mapping from states to probability distributions over actions, as shown in Equation (2):
π_i : S → Δ(A_i)
where Δ(Ai) represents the probability distribution over the action space. The joint policy is denoted as π = (π1, π2, …, πn), which describes the strategy combination of all agents.
The state value function of agent i under joint policy π is defined as the expected cumulative discounted reward obtainable from state s while following joint policy π, as shown in Equation (3):
V_i^π(s) = E_π[ Σ_{t=0}^∞ γ^t r_i^{t+1} | s_0 = s ]
where r_i^{t+1} represents the immediate reward obtained by agent i at time t, and E_π denotes the expectation under joint policy π.
Correspondingly, the state-action value function (Q-function) is defined as shown in Equation (4):
Q_i^π(s, a) = E_π[ Σ_{t=0}^∞ γ^t r_i^{t+1} | s_0 = s, a_0 = a ]
where a represents the joint action.
Nash equilibrium is an important solution concept in multi-agent games, describing a stable strategy combination where no agent can obtain higher rewards by unilaterally changing their strategy. A strategy combination π* constitutes a Nash equilibrium if and only if:
V_i^{π*}(s) ≥ V_i^{(π_i, π*_{−i})}(s), ∀i ∈ N, ∀π_i ∈ Π_i, ∀s ∈ S
where π*_{−i} = (π_1^*, …, π_{i−1}^*, π_{i+1}^*, …, π_n^*) denotes the equilibrium strategies of all agents except agent i, and π_i ∈ Π_i represents any alternative strategy that agent i could adopt. This condition ensures that at Nash equilibrium, no agent has an incentive to unilaterally deviate from the current strategy.
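The unilateral-deviation condition can be made concrete with a minimal sketch that checks pure-strategy Nash equilibria in a two-agent, two-action coordination game; the payoff matrices below are illustrative and not taken from the paper.

```python
import numpy as np

# Hypothetical 2x2 coordination game: payoffs[i][a1][a2] is agent i's reward.
payoffs = np.array([
    [[4, 0], [0, 2]],   # agent 1
    [[4, 0], [0, 2]],   # agent 2
])

def is_pure_nash(a1, a2):
    """Check the condition V_i(pi*) >= V_i(pi_i, pi*_{-i}) for every deviation."""
    no_gain_1 = all(payoffs[0, a1, a2] >= payoffs[0, d, a2] for d in range(2))
    no_gain_2 = all(payoffs[1, a1, a2] >= payoffs[1, a1, d] for d in range(2))
    return no_gain_1 and no_gain_2

# Both coordinated profiles are equilibria; mis-coordination is not.
print(is_pure_nash(0, 0), is_pure_nash(1, 1), is_pure_nash(0, 1))
```

In this toy game, both coordinated action profiles satisfy the inequality, while the mis-coordinated profile admits a profitable unilateral deviation.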
While the Nash equilibrium originates from non-cooperative game theory, its application in cooperative multi-agent systems requires clarification. In this work, we adopt the framework of potential games [19], where a global potential function Φ exists such that:
V_i(π_i, π_{−i}) − V_i(π′_i, π_{−i}) = Φ(π_i, π_{−i}) − Φ(π′_i, π_{−i}), ∀i, ∀π_i, π′_i
In potential games, the Nash equilibrium corresponds to local maxima of the potential function Φ. Combined with the superadditivity property of cooperative games, this ensures that the Nash equilibrium sought by AGEA represents a stable cooperative configuration where individual and group incentives are naturally aligned.

2.2. Intention Recognition Theory

Intention recognition refers to the process of inferring the internal goals and intentions of agents by observing their behavioral sequences. In multi-agent environments, accurate intention recognition is crucial for achieving effective cooperation, as it enables agents to predict the future behaviors of other agents and thereby formulate better cooperative strategies. The challenge of intention recognition lies in the fact that the mapping relationship between behaviors and intentions is typically many-to-one, meaning that identical behaviors may correspond to different intentions, while different behaviors may also stem from the same intention [20].

2.2.1. Bayesian Intention Inference

Bayesian intention inference is based on Bayes’ theorem, updating beliefs about agent intentions through observed behavioral sequences. The core idea of this approach is to transform the intention recognition problem into a probabilistic inference problem, gradually approximating true intentions by continuously updating posterior probabilities [21].
Let the intention space of an agent be I = {I_1, I_2, …, I_m}, where each I_j represents a possible intention, and let the observed behavioral sequence be o_{1:t} = {o_1, o_2, …, o_t}, where o_k represents the behavior observed at time k. Then, the posterior probability of intention I_j is:
P(I_j | o_{1:t}) = P(o_{1:t} | I_j) P(I_j) / Σ_{k=1}^m P(o_{1:t} | I_k) P(I_k)
where the terms are defined as follows: P(I_j) is the prior probability of intention I_j, reflecting the initial belief about that intention before observing any behaviors; P(o_{1:t} | I_j) is the likelihood of observing behavioral sequence o_{1:t} under intention I_j, representing the possibility that an agent with intention I_j produces the observed sequence; the denominator is a normalization constant ensuring that the posterior probabilities over all intentions sum to 1 [22].
For temporal behavioral sequences, the likelihood probability can be further decomposed as:
P(o_{1:t} | I_j) = Π_{k=1}^t P(o_k | o_{1:k−1}, I_j)
where o1:k−1 represents the observation sequence before time k.
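This sequential belief update can be sketched in a few lines; the intention set, behavior alphabet, and likelihood table below are hypothetical and serve only to illustrate the recursion.

```python
import numpy as np

# Hypothetical setup: 3 intentions, 4 observable behaviors.
# likelihood[j, o] = P(o | I_j); each row sums to 1.
likelihood = np.array([
    [0.7, 0.1, 0.1, 0.1],      # I_1: mostly emits behavior 0
    [0.1, 0.7, 0.1, 0.1],      # I_2: mostly emits behavior 1
    [0.25, 0.25, 0.25, 0.25],  # I_3: uninformative
])
prior = np.full(3, 1.0 / 3.0)

def posterior(obs_sequence):
    """Sequentially apply Bayes' rule over an observation sequence."""
    belief = prior.copy()
    for o in obs_sequence:
        belief = belief * likelihood[:, o]  # P(o | I_j) * P(I_j | o_{1:k-1})
        belief = belief / belief.sum()      # normalize over intentions
    return belief

b = posterior([0, 0, 1, 0])
print(b, b.argmax())  # repeated behavior 0 concentrates belief on I_1
```

Because the observations here are conditionally independent given the intention, the product decomposition above reduces to multiplying per-step likelihoods and renormalizing.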

2.2.2. Recursive Reasoning Model

In multi-agent environments, agents not only need to reason about the intentions of other agents but also need to consider other agents’ reasoning about their own intentions, forming a multi-level recursive reasoning structure. The depth of recursive reasoning reflects the rationality level and computational capacity of agents. This recursive reasoning process can be described using Theory of Mind, i.e., agents possess the ability to understand and predict the mental states of other agents [23].
Define the strategy of a k-th order recursive reasoning agent as πi(k), where k represents the depth of reasoning hierarchy:
0th-order agent: Does not consider the existence of other agents, directly optimizing its own expected reward.
π_i^{(0)} = argmax_{π_i} E[ R_i(π_i, s) ]
k-th order agent: assumes other agents are (k − 1)-th order and optimizes strategy accordingly.
π_i^{(k)} = argmax_{π_i} E[ R_i(π_i, π_{−i}^{(k−1)}, s) ]
where R_i represents agent i’s expected reward function, and π_{−i}^{(k−1)} represents the (k − 1)-th order strategies of the other agents.
The convergence of recursive reasoning can be analyzed through fixed-point theorems. Under certain conditions, the recursive reasoning sequence converges to a stable strategy:
lim_{k→∞} π_i^{(k)} = π_i^*
where π i * is the fixed point of recursive reasoning.
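The level-k recursion can be sketched on a toy symmetric two-action game; the payoff matrix and the assumption that a 0th-order agent best-responds to uniform opponent play are illustrative choices, not details from the paper.

```python
import numpy as np

# Hypothetical symmetric 2-action game; R[a_i, a_j] is agent i's reward
# when it plays a_i and the other agent plays a_j.
R = np.array([[3.0, 0.0],
              [1.0, 2.2]])

def level_k_action(k):
    """Level-0 best-responds to a uniform opponent; level-k to level-(k-1)."""
    if k == 0:
        return int(np.argmax(R.mean(axis=1)))  # expected reward vs uniform play
    opp = level_k_action(k - 1)                # assume the opponent is level k-1
    return int(np.argmax(R[:, opp]))

actions = [level_k_action(k) for k in range(5)]
print(actions)  # once two consecutive levels agree, the recursion is at a fixed point
```

Here the sequence stabilizes immediately, illustrating the fixed-point convergence described above; in less benign games the best-response recursion can cycle, which is why convergence requires additional conditions.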

2.3. Cooperative Game Theory

Cooperative game theory studies how agents form coalitions and allocate cooperative rewards. Unlike non-cooperative games [24], cooperative games allow binding agreements between agents to jointly pursue higher overall rewards. In multi-agent systems, cooperative game theory provides theoretical foundations for coalition formation, stability analysis, and reward allocation among agents [25].

2.3.1. Coalition Game Fundamentals

A coalition game can be represented as a tuple ⟨N, v⟩, where N is the agent set and v : 2^N → R is the characteristic function, with v(S) representing the total reward that coalition S ⊆ N can obtain. The characteristic function must satisfy the boundary condition [26]:
v(∅) = 0
i.e., the empty coalition has zero reward.
Superadditivity is an important property of cooperative games, indicating that the reward of merging two disjoint coalitions is no less than the sum of their individual rewards:
v(S ∪ T) ≥ v(S) + v(T), ∀S, T ⊆ N, S ∩ T = ∅
This property ensures the incentive for cooperation, i.e., agents have motivation to form larger coalitions.

2.3.2. Shapley Value

The Shapley value is a fair reward allocation scheme that distributes rewards based on each agent’s marginal contribution to all possible coalitions [27]. The core idea of the Shapley value is to consider the average marginal contribution of an agent across all possible coalition formation orders.
The Shapley value of agent i is defined as:
φ_i(v) = Σ_{S ⊆ N∖{i}} [ |S|! (|N| − |S| − 1)! / |N|! ] · [ v(S ∪ {i}) − v(S) ]
where the terms are defined as follows: |S| represents the size of coalition S; |N| represents the total number of agents; v(S ∪ {i}) − v(S) represents the marginal contribution brought by agent i joining coalition S; and |S|! (|N| − |S| − 1)! / |N|! is the weight coefficient, representing the proportion of orderings, among all |N|! possible agent arrangements, in which agent i joins after exactly |S| agents.
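For small games the definition can be evaluated exactly by enumerating every coalition S ⊆ N∖{i}; the three-agent characteristic function below is hypothetical.

```python
from itertools import combinations
from math import factorial

def shapley_values(n, v):
    """Exact Shapley values for a characteristic function v over frozensets."""
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for r in range(len(others) + 1):
            for S in combinations(others, r):
                S = frozenset(S)
                weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                phi[i] += weight * (v(S | {i}) - v(S))  # weighted marginal contribution
    return phi

# Hypothetical 3-agent superadditive game: any pair earns 60, the grand coalition 90.
def v(S):
    return {0: 0, 1: 0, 2: 60, 3: 90}[len(S)]

print(shapley_values(3, v))  # symmetry implies each agent gets 90 / 3 = 30
```

Because the game is symmetric, the Shapley value splits the grand-coalition reward equally, which is a useful sanity check for any implementation.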

2.3.3. Core Solution

The core solution is a set of stable reward allocation schemes that ensures no sub-coalition has an incentive to deviate from the current allocation scheme [28]. The core solution is defined as:
Core(v) = { x ∈ R^n : Σ_{i∈N} x_i = v(N), Σ_{i∈S} x_i ≥ v(S), ∀S ⊆ N }
where x = (x1, x2, …, xn) represents the reward allocation vector, and xi is the reward obtained by agent i. The first constraint condition ensures allocation efficiency, i.e., the sum of all agents’ rewards equals the total reward of the grand coalition; the second constraint condition ensures allocation stability, i.e., the total reward obtained by members of any sub-coalition is no less than the reward that sub-coalition could obtain by acting independently.
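Checking whether an allocation lies in the core reduces to verifying these two constraint families over all proper sub-coalitions; the sketch below uses a hypothetical three-agent game in which any pair earns 60 and the grand coalition earns 90.

```python
from itertools import combinations

def in_core(x, n, v):
    """Check efficiency and coalition rationality for allocation x."""
    if abs(sum(x) - v(frozenset(range(n)))) > 1e-9:  # efficiency constraint
        return False
    for r in range(1, n):                            # every proper sub-coalition S
        for S in combinations(range(n), r):
            if sum(x[i] for i in S) < v(frozenset(S)) - 1e-9:
                return False                         # S would profit by deviating
    return True

# Hypothetical game: singletons earn 0, pairs earn 60, grand coalition earns 90.
def v(S):
    return {0: 0, 1: 0, 2: 60, 3: 90}[len(S)]

print(in_core([30, 30, 30], 3, v))   # equal split satisfies every coalition
print(in_core([50, 30, 10], 3, v))   # coalition {1, 2} gets 40 < 60 and deviates
```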

3. Intention-Strategy Optimization Based Multi-Agent Cooperative Group Game Model

3.1. Overall Framework Design

The Intention-Strategy Optimization based Multi-Agent Cooperative Group Game Model (ISO-MAGCG) proposed in this paper adopts a bi-level optimization architecture that organically combines intention recognition with strategy optimization. The core idea of this framework is to capture complex interaction relationships and intention dependencies among agents through an upper-level intention recognition network, while the lower-level strategy optimization module makes collaborative decisions based on identified intention information. The two levels achieve dynamic adaptation and co-evolution through feedback mechanisms.
The overall framework comprises three core components: the Group Attention-based Intention Recognition Network (GAIN), the Intention-Strategy Optimization Framework (ISO Framework), and the Adaptive Group Game Equilibrium Algorithm (AGEA). The system inputs are environmental states and agent observation sequences, with outputs being optimized cooperative strategies. The overall architecture of the framework is shown in Figure 2.

3.2. Group Attention-Based Intention Recognition Network

3.2.1. Dynamic Graph Construction

In multi-agent environments, the interaction relationships among agents are dynamically changing, and traditional static graph structures cannot effectively capture such time-varying characteristics. The GAIN network proposed in this paper first constructs dynamic graphs to represent interaction relationships among agents.
At time t, let the agent set be N = {1, 2, …, n} and each agent’s observation be o_i^t ∈ R^{d_o}, where d_o is the observation dimension. The dynamic graph construction process is as follows:
The node feature matrix is defined as:
H^t = [h_1^t, h_2^t, …, h_n^t]^T
where h_i^t is the node feature of agent i at time t, obtained through an observation encoder:
h_i^t = GRU(o_i^t, h_i^{t−1})
Here, GRU denotes a gated recurrent unit encoder used to capture temporal information.
The edge weight matrix is computed through the attention mechanism:
e_{ij}^t = a^T tanh(W [h_i^t ∥ h_j^t])
where W is the weight matrix, a is the attention vector, and ∥ denotes the vector concatenation operation.

3.2.2. Multi-Head Attention Mechanism

To capture multiple types of interaction relationships among agents, the GAIN network employs a multi-head attention mechanism. The computation process for the k-th attention head is:
α_{ij}^k = exp(LeakyReLU(a_k^T [W_k h_i ∥ W_k h_j])) / Σ_{l∈N_i} exp(LeakyReLU(a_k^T [W_k h_i ∥ W_k h_l]))
where α_{ij}^k is the attention weight of the k-th head, W_k is the weight matrix of the k-th head, and LeakyReLU is the activation function.
The output of multi-head attention is obtained through concatenation and linear transformation:
h_i′ = W^O [ ∥_{k=1}^K Σ_{j∈N_i} α_{ij}^k W_k h_j ]
where K is the number of attention heads, ∥ denotes concatenation across heads, and W^O are the output layer parameters.
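The per-head computation can be sketched in NumPy; a fully connected neighborhood N_i, random parameters, and the standard GAT-style decomposition of the score a_k^T[W_k h_i ∥ W_k h_j] into a left and right part are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, K = 4, 8, 2          # agents, feature dim, attention heads (assumed sizes)

H = rng.normal(size=(n, d))            # node features h_i
W = rng.normal(size=(K, d, d)) * 0.1   # per-head transforms W_k
a = rng.normal(size=(K, 2 * d)) * 0.1  # per-head attention vectors a_k
W_O = rng.normal(size=(K * d, d)) * 0.1

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

heads = []
for k in range(K):
    Wh = H @ W[k].T                          # W_k h_i for every agent i
    # score[i, j] = a_k^T [W_k h_i || W_k h_j], split into left/right halves of a_k
    scores = leaky_relu(Wh @ a[k, :d][:, None] + (Wh @ a[k, d:])[None, :])
    alpha = np.exp(scores)
    alpha /= alpha.sum(axis=1, keepdims=True)  # softmax over neighbors j
    heads.append(alpha @ Wh)                   # attention-weighted aggregation
H_out = np.concatenate(heads, axis=1) @ W_O    # concatenate heads, then project

print(H_out.shape)  # (4, 8): updated node features h_i'
```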

3.2.3. Intention Prediction

Based on the updated node features, the intention distribution prediction for agent i is:
p_i^t = Softmax(W_intent h_i^t + b_intent)
where p_i^t ∈ R^{|I|} is the intention probability distribution of agent i at time t, and |I| is the size of the intention space.
The current formulation assumes a discrete intention space I. For continuous intention representations, the prediction module can output Gaussian parameters:
(μ_i^t, Σ_i^t) = f_intent(h_i^{t+1}; θ_intent)
where μ_i^t ∈ R^{d_I} and Σ_i^t ∈ R^{d_I × d_I} parameterize a continuous intention distribution. For latent intention discovery without predefined categories, a Variational Autoencoder architecture can be employed:
L_VAE = E_{q(z|o)}[ log p(o|z) ] − β D_KL( q(z|o) ∥ p(z) )
where z represents the latent intention. These extensions remain important directions for future research.
The complete GAIN network architecture diagram is shown in Figure 3.

3.2.4. Intention Label Generation

To enable supervised training of the GAIN network, ground-truth intention labels are required for each agent at each timestep. This section describes our hybrid labeling approach combining rule-based heuristics with trajectory clustering.
(1)
Rule-Based Labeling for Structured Scenarios
For scenarios with well-defined objectives, intention labels are derived from observable goal-directed behaviors using domain-specific rules. The ground-truth intention is determined as:
I_i^{gt}(t) = argmax_{I_k ∈ I} P(I_k | g_i, s_t, a_{i, t−H:t})
where g_i represents the assigned goal of agent i, s_t is the current environmental state, and a_{i, t−H:t} denotes the action history over horizon H = 5.
Table 1 summarizes the intention space and labeling rules for each experimental scenario.
For the navigation task, specific rules include:
  • Navigate: Agent’s velocity vector points toward its target with cosine similarity > 0.8.
  • Avoid: Agent deviates from shortest path while another agent is within dsafe = 2.0 units.
  • Wait: Agent remains stationary (speed < 0.1) for more than 3 consecutive timesteps.
  • Replan: Agent changes movement direction by more than 90° within 5 timesteps.
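These rules can be sketched as a simple labeling function. Only the thresholds (cosine similarity 0.8, d_safe = 2.0, speed 0.1, 3 stationary steps, 90°) come from the rules above; the exact kinematic inputs and the default fall-through to Navigate are illustrative assumptions.

```python
import numpy as np

D_SAFE, SPEED_EPS = 2.0, 0.1  # thresholds taken from the labeling rules

def label_intention(vel, goal_dir, nearest_dist, stationary_steps, turn_angle):
    """Assign a navigation intention label from one agent's recent kinematics.

    vel: current velocity vector; goal_dir: unit vector toward the target;
    nearest_dist: distance to the closest other agent;
    stationary_steps: consecutive steps with speed < SPEED_EPS;
    turn_angle: heading change (degrees) over the last 5 steps.
    """
    speed = np.linalg.norm(vel)
    if speed < SPEED_EPS and stationary_steps > 3:
        return "Wait"
    if turn_angle > 90.0:
        return "Replan"
    if speed > 0 and np.dot(vel / speed, goal_dir) > 0.8:
        return "Navigate"
    if nearest_dist < D_SAFE:
        return "Avoid"
    return "Navigate"  # assumed default when no rule fires decisively

print(label_intention(np.array([1.0, 0.1]), np.array([1.0, 0.0]), 5.0, 0, 10.0))
```

Note that the rules are evaluated in a fixed priority order here (Wait, Replan, Navigate, Avoid); the paper does not specify how overlapping rules are resolved, so this ordering is one possible choice.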
(2)
Trajectory Clustering for Complex Scenarios
For the defense cooperation scenario with continuous action spaces, we employed unsupervised trajectory clustering. Agent trajectories are segmented into windows of Tw = 20 timesteps, and pairwise distances are computed using Dynamic Time Warping (DTW):
DTW(τ_i, τ_j) = min_{π ∈ A} Σ_{(k,l) ∈ π} ‖a_k^i − a_l^j‖^2
Hierarchical agglomerative clustering with Ward linkage groups similar trajectories, and cluster centroids are manually labeled by domain experts.
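The DTW distance can be computed with the standard dynamic program; the sketch below uses the squared Euclidean point cost from the formula above and omits the clustering step.

```python
import numpy as np

def dtw_distance(traj_a, traj_b):
    """Dynamic Time Warping distance between two trajectories (arrays of points)."""
    na, nb = len(traj_a), len(traj_b)
    D = np.full((na + 1, nb + 1), np.inf)
    D[0, 0] = 0.0
    for k in range(1, na + 1):
        for l in range(1, nb + 1):
            cost = np.sum((traj_a[k - 1] - traj_b[l - 1]) ** 2)  # squared point cost
            D[k, l] = cost + min(D[k - 1, l], D[k, l - 1], D[k - 1, l - 1])
    return D[na, nb]

# The same path traversed at a different pace: DTW can never exceed the cost of
# the diagonal (pointwise) alignment, since that alignment is always admissible.
t = np.linspace(0.0, 1.0, 20)
path = np.stack([t, np.sin(2 * np.pi * t)], axis=1)
warped = np.stack([t ** 1.5, np.sin(2 * np.pi * t ** 1.5)], axis=1)

print(dtw_distance(path, path))  # 0.0 for identical trajectories
print(dtw_distance(path, warped) <= ((path - warped) ** 2).sum())
```

The O(T_w^2) cost per pair is why trajectories are segmented into short windows before clustering.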
(3)
Label Quality Assurance
To ensure labeling quality, we implemented a two-stage verification:
Automatic Consistency Check: Labels are validated against expected behavioral patterns.
Manual Sampling: 10% of labels (~5000 samples per scenario) are verified by two independent annotators.
Table 2 presents the label statistics and quality metrics.
The achieved labeling accuracy exceeded 95% across all scenarios, providing reliable supervision for GAIN training.

3.3. Intention-Strategy Bi-Level Optimization Framework

3.3.1. Upper-Level Intention Optimization

The objective of upper-level intention optimization is to maximize the overall reward of group cooperation while maintaining the accuracy of intention recognition. The group intention consistency loss is defined as:
L_intent = Σ_{i=1}^n [ CE(y_i, p_i) + λ KL(p_i ∥ p̄) ]
where y_i is the true intention label of agent i, p_i is the predicted intention probability, CE is the cross-entropy loss, KL is the KL divergence, p̄ is the average intention distribution, and λ is the regularization parameter.
The upper-level optimization problem can be formulated as:
θ*_intent = argmin_{θ_intent} L_intent(θ_intent)
where θ*_intent denotes the optimized parameters of the intention recognition network.

3.3.2. Lower-Level Strategy Optimization

Lower-level strategy optimization is based on identified intention information, employing an improved Actor-Critic algorithm. The policy network parameter update for agent i is:
θ_{π_i} ← θ_{π_i} + α_π ∇_{θ_{π_i}} J_{π_i}(θ_{π_i})
where the policy objective function is defined as:
J_{π_i}(θ_{π_i}) = E_{s∼ρ^π, a∼π_i}[ Q_i(s, a | p) ]
Here, p = [p1, p2, …, pn] is the intention distribution of all agents, and ρπ is the state-action distribution induced by the policy.
The value network update employs temporal difference learning:
θ_{Q_i} ← θ_{Q_i} − α_Q ∇_{θ_{Q_i}} L_{Q_i}(θ_{Q_i})
where the value loss function is:
L_{Q_i}(θ_{Q_i}) = E[ (y_i^{target} − Q_i(s, a))^2 ]
The target value is defined as:
y_i^{target} = r_i + γ E_{a′∼π}[ Q_i(s′, a′ | p) ]
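As a sanity check on this target, a tabular sketch (a hypothetical three-state, two-action critic with a fixed self-loop transition; none of these numbers come from the paper) shows the temporal-difference update converging to the discounted return:

```python
import numpy as np

gamma, alpha_q = 0.95, 0.1

# Hypothetical tabular critic: Q[s, a] for 3 states and 2 joint actions.
Q = np.zeros((3, 2))

def td_update(s, a, r, s_next, pi_next):
    """One TD step toward the target y = r + gamma * E_{a'~pi}[Q(s', a')]."""
    y = r + gamma * np.dot(pi_next, Q[s_next])  # expectation under the policy
    Q[s, a] += alpha_q * (y - Q[s, a])          # gradient step on (y - Q)^2
    return y

# A self-loop that always pays reward 1 drives Q toward 1 / (1 - gamma).
for _ in range(2000):
    td_update(s=0, a=0, r=1.0, s_next=0, pi_next=np.array([1.0, 0.0]))

print(round(Q[0, 0], 2))  # approaches the discounted return 1 / (1 - 0.95) = 20.0
```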

3.3.3. Bi-Level Collaborative Optimization

The key to bi-level optimization lies in information transfer and collaborative updates between upper and lower levels. The bi-level objective function is defined as:
L_total = α_1 L_intent + α_2 Σ_{i=1}^n L_{Q_i}
where α1 and α2 are balancing parameters.
The intent-policy two-layer optimization framework process is shown in Figure 4.

3.4. Adaptive Group Game Equilibrium Algorithm

3.4.1. Dynamic Weight Allocation Mechanism

To handle the heterogeneity and contribution differences of agents in large-scale groups, the AGEA algorithm employs a contribution-based dynamic weight allocation mechanism. The contribution of agent i is defined as:
c_i = [ V(N) − V(N∖{i}) ] / [ V(N) + ϵ ]
where N∖{i} represents the set of all agents except agent i, V(N) is the group value of the full agent set under the current joint strategy, V(N∖{i}) is the group value after removing agent i, and ϵ is a small constant to prevent division by zero.
The contribution-based dynamic weight is computed as:
w_i = exp(c_i / τ) / Σ_{j=1}^n exp(c_j / τ)
where τ is the temperature parameter controlling the concentration degree of weight allocation.
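The two formulas above compose into a short routine; the group values passed in below are hypothetical, and the helper name is chosen for illustration.

```python
import numpy as np

def contribution_weights(values_without, value_full, tau=0.5, eps=1e-8):
    """Marginal-contribution scores c_i and their softmax weights w_i.

    values_without[i] = V(N \ {i}): group value with agent i removed.
    value_full = V(N): group value under the current joint strategy.
    """
    c = (value_full - np.asarray(values_without)) / (value_full + eps)
    z = np.exp(c / tau)
    return c, z / z.sum()  # temperature tau controls weight concentration

# Hypothetical: removing agent 0 hurts the team most, so it earns the top weight.
c, w = contribution_weights(values_without=[6.0, 9.0, 9.5], value_full=10.0)
print(np.round(w, 3), w.argmax())
```

Lowering tau sharpens the allocation toward the highest contributor, while a large tau approaches a uniform weighting.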

3.4.2. Distributed Nash Equilibrium Solving

The AGEA algorithm employs a distributed approach to solve the Nash equilibrium of group games. Each agent maintains a local strategy and estimates of other agents’ strategies.
The best response strategy update for agent i is:
π_i^{t+1} = argmax_{π_i} E[ R_i(π_i, π̂_{−i}^t) | p_i ]
The strategy estimate update employs the exponential moving average:
π̂_i^{t+1} = (1 − β) π̂_i^t + β π_i^t
where β is the update rate.

3.4.3. Convergence Guarantee

To ensure algorithm convergence, a potential function is defined:
Φ(π) = Σ_{i=1}^n w_i V_i(π) − (μ/2) Σ_{i=1}^n ‖π_i − π_i^*‖^2
where μ is the regularization parameter and π i * is the optimal strategy of agent i.
Proposition 1 (Convergence of AGEA).
Under the following assumptions, AGEA converges to a neighborhood of Nash equilibrium: (A1) The value function Vi is L-Lipschitz continuous;
(A2) Learning rates satisfy απ < 1/(2L) and β > 0;
(A3) The potential function Φ is bounded below.
Proof Sketch. Define the Lyapunov function:
V^t = Φ(π^*) − Φ(π^t) + (η/2) Σ_{i=1}^n ‖π̂_i^t − π_i^t‖^2
where π^* is a Nash equilibrium and η > 0 is a weighting constant.
Step 1 (Strategy Update Descent).  By the best response property:
Φ(π^{t+1}) − Φ(π^t) ≥ α_π Σ_{i=1}^n ∇_{π_i}Φ · (BR_i(π̂_{−i}^t) − π_i^t) − (α_π^2 L / 2) Σ_i ‖BR_i − π_i^t‖^2
Step 2 (Estimation Error Decay).  From the exponential moving average update (Equation (32)), the estimation error decays exponentially:
‖π̂_i^{t+1} − π_i^{t+1}‖ ≤ (1 − β) ‖π̂_i^t − π_i^t‖ + O(α_π)
Step 3 (Combined Descent).  Choosing η sufficiently small yields:
V^{t+1} − V^t ≤ −c V^t + ϵ
for constants c > 0 and small ϵ dependent on the learning rates, establishing convergence to an ϵ-neighborhood of equilibrium.

3.5. Complete Algorithm Flow

The pseudocode for the ISO-MAGCG algorithm is shown in Algorithm 1.
Algorithm 1 ISO-MAGCG
Input: Environment E, Agent set N, Episodes T, Learning rates απ, αQ, αintent
Output: Optimized cooperative strategies π*
1    Initialize GAIN, Actor, Critic networks and replay buffer D
2    for episode = 1 to T do
3        Reset environment, get initial state s0
4        for step t = 0 to max_steps do
5            // Upper layer: Intent recognition
6            Construct dynamic graph Gt and compute node features Ht
7            Apply multi-head attention: Ht+1 = MultiHeadAttention (Ht)
8            Predict intent distributions: p_i^t = Softmax(W_intent h_i^{t+1})
9            
10          // Lower layer: Strategy optimization
11          for agent i = 1 to n do
12              Select action: a i t = πi(st, It) + εt
13          end for
14          Execute actions, observe rewards and next state
15          Store experience (st, at, rt, st+1, It) to buffer D
16
17          if |D| > batch_size then
18              Sample batch from D
19              // Update networks
20              Update Critic: θ_i^Q ← θ_i^Q − α_Q ∇L_Q
21              Update Actor: θ_{π_i} ← θ_{π_i} + α_π ∇J_i
22              Update Intent: θ_intent ← θ_intent − α_intent ∇L_intent
23              
24              // AGEA equilibrium solving
25              Compute contributions: c i t = ContributionCalculation()
26              Update weights: w_i^t = exp(c_i^t / τ) / Σ_j exp(c_j^t / τ)
27              Update strategy estimations and compute best responses
28              Soft update target networks
29          end if
30      end for
31      Check convergence, break if converged
32 end for
33 return π* = { π 1 * , π 2 * , …, π n * }

3.6. Algorithm Complexity Analysis

3.6.1. Time Complexity

The time complexity of the GAIN network is primarily determined by the multi-head attention mechanism. For n agents and K attention heads, the complexity of a single forward pass is:
O(K·n^2·d + K·n·d^2)
where the first term comes from attention weight computation and the second term from feature transformation.
In the ISO framework, the upper-level intention optimization has complexity O(n·d_intent), while the lower-level strategy optimization employs Actor-Critic methods with complexity O(n·(d_s·d_a + d_a·d_π)), where d_s, d_a, and d_π represent the dimensions of the state, action, and policy networks, respectively.
The AGEA algorithm’s complexity mainly comes from contribution computation and weight updates, which is O (n2).
The overall time complexity is:
O(K·n^2·d + n·d_intent + n·d_s·d_a + n^2)

3.6.2. Space Complexity

The space complexity consists of the following components:
  • GAIN network parameters: O(K·d^2 + n·d);
  • Actor-Critic network parameters: O(n·(d_s·d_a + d_a·d_π));
  • Dynamic graph storage: O(n^2);
  • Experience replay buffer: O(B·(d_s + d_a + d_Q)).
where B is the buffer size and dQ is the value network dimension.
The total space complexity is:
O(K·d^2 + n·d + n·d_s·d_a + n^2 + B·d_s)
Through the above analysis, it can be seen that the algorithm’s complexity grows quadratically with the number of agents, which may become a bottleneck in large-scale multi-agent systems. To address this, hierarchical optimization and approximation algorithms can be employed to reduce computational complexity.

4. Experimental Design and Results Analysis

4.1. Experimental Environment Setup

4.1.1. Experimental Platform Configuration

The experiments in this paper were conducted under the hardware and software environment shown in Table 3. The hardware configuration adopts a high-performance computing platform to ensure the computational requirements of large-scale multi-agent simulation; the software environment is built on mainstream deep learning frameworks to guarantee experimental reproducibility and standardization.

4.1.2. Experimental Scenario Design

To comprehensively validate the effectiveness of the ISO-MAGCG model, this paper designed three multi-agent cooperation scenarios with different complexity levels, covering various situations from discrete to continuous action spaces and from static to dynamic environments. Table 4 details the configuration parameters for each scenario.
The multi-agent navigation task tests agents’ path planning and conflict avoidance capabilities under spatial constraints; the resource collection task evaluates agents’ cooperation abilities in resource allocation and task coordination; and the defense cooperation task verifies agents’ real-time collaboration and strategy adaptation capabilities in dynamic adversarial environments.

4.1.3. Network Architecture Parameters

The ISO-MAGCG model contains multiple neural network modules, with each module’s architecture carefully optimized to adapt to the special requirements of multi-agent learning. Table 5 provides detailed parameter configurations.
The GAIN network adopts moderate hidden layer dimensions to balance expressive capability and computational efficiency; the Actor and Critic networks employ a funnel-shaped structure with progressively narrower layers; the learning rates were determined through extensive preliminary experiments.

4.2. Baseline Method Comparison

4.2.1. Baseline Algorithm Selection

To objectively evaluate the performance of the ISO-MAGCG model, this paper selected five representative algorithms from the current multi-agent reinforcement learning field as baselines. These algorithms cover different technical approaches and theoretical foundations, comprehensively reflecting the development level of existing technologies. Table 6 details the core characteristics and main parameter configurations of each baseline algorithm.
QMIX is a typical value function decomposition method, realizing centralized training and distributed execution via monotonicity assumptions; MADDPG adopts centralized critics to tackle continuous action space problems; COMA addresses credit assignment issues through counterfactual baselines; QTRAN relaxes QMIX’s monotonicity constraints; MAPPO extends Proximal Policy Optimization to multi-agent scenarios. For comprehensive comparison with recent advances, two state-of-the-art methods (MAT and ROMA) were additionally included. MAT reformulates multi-agent reinforcement learning as a sequence modeling task, leveraging a transformer with cross-agent attention to capture coordination patterns, and we adopted its official encoder–decoder implementation (d_model = 128, 4 attention heads, 2 layers). ROMA introduces emergent role discovery, enabling agents to cluster into functional roles via a role encoder, configured with 4 latent roles and 64-dimensional role embeddings following the original paper’s cooperative task recommendations. These two methods represent distinct multi-agent coordination paradigms: MAT implicitly models agent interactions via the transformer’s attention mechanism, while ROMA explicitly discovers agent roles to facilitate labor division. Both differ fundamentally from ISO-MAGCG’s explicit intention modeling: MAT learns implicit coordination patterns without interpretable intermediate representations, and ROMA focuses on role differentiation rather than intention recognition. In contrast, our GAIN network specifically models agent intentions, which is complementary yet distinct from ROMA’s role modeling.

4.2.2. Evaluation Metrics

This paper established a multi-dimensional evaluation metric system to comprehensively assess algorithm performance from perspectives of task completion effectiveness, learning efficiency, and cooperation quality. The evaluation metrics are defined as follows:
Task Success Rate (TSR):
TSR = (N_success / N_total) × 100%
This metric represents the proportion of episodes where agents successfully complete tasks, serving as the most intuitive performance indicator reflecting the algorithm’s task completion capability.
Average Cumulative Reward (ACR):
ACR = (1/T) · Σ_{t=1}^{T} Σ_{i=1}^{n} R_i^t
where T is the total number of episodes, n is the number of agents, and R_i^t is the reward obtained by agent i in episode t. This metric comprehensively reflects the overall performance of the agent group.
Cooperation Efficiency (CE):
CE = ((R_coop − R_indep) / R_indep) × 100%
where R_coop is the average reward under the cooperative strategy and R_indep is the average reward under independent action. This metric measures the performance improvement of inter-agent cooperation relative to independent action.
Convergence Speed (CS):
CS = arg min_t { t : R_t ≥ 0.9 × R_final }
where R_t is the average reward at episode t and R_final is the final stable performance. This metric reflects learning efficiency, measured as the number of episodes required to reach 90% of final performance.
Together, these four metrics cover task completion, overall group performance, cooperation quality, and learning efficiency.
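Under the definitions above, the four metrics can be computed from logged episode data roughly as follows. This is a minimal numpy sketch with assumed input conventions, not the evaluation code used in the experiments:

```python
import numpy as np

def evaluate(episode_rewards, successes, r_coop, r_indep):
    """Compute TSR, ACR, CE, and CS from logged data.

    episode_rewards: (T, n) array, reward of agent i in episode t
    successes:       length-T boolean array, task completed in episode t
    r_coop, r_indep: average rewards with and without cooperation
    """
    episode_rewards = np.asarray(episode_rewards, dtype=float)
    successes = np.asarray(successes, dtype=bool)
    T = successes.size

    tsr = 100.0 * successes.sum() / T            # Task Success Rate
    acr = episode_rewards.sum() / T              # Average Cumulative Reward
    ce = 100.0 * (r_coop - r_indep) / r_indep    # Cooperation Efficiency

    # Convergence Speed: first episode whose mean reward reaches 90% of
    # the final stable performance (here: mean over the last 100 episodes)
    mean_r = episode_rewards.mean(axis=1)
    r_final = mean_r[-100:].mean()
    cs = int(np.argmax(mean_r >= 0.9 * r_final))
    return tsr, acr, ce, cs
```

Note that ACR sums over both episodes and agents before dividing by T, matching the double summation in the formula.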

4.3. Comparative Experimental Results

4.3.1. Overall Performance Comparison

Table 7 shows the comprehensive performance comparison of each algorithm in the three experimental scenarios. The data are averaged over 5 independent runs of 1000 episodes per scenario, and 95% confidence intervals are reported.
Statistical Significance Testing
The Kruskal–Wallis test confirmed statistically significant differences across all algorithms for every performance metric (H > 28.7, p < 0.001). Post hoc pairwise comparisons via the Wilcoxon signed-rank test with Bonferroni correction further verified that ISO-MAGCG outperformed all baseline algorithms significantly (p < 0.001) across all experimental scenarios. Cohen’s d effect sizes ranged from 1.4 to 2.9, indicating large practical significance. The 95% confidence intervals for ISO-MAGCG’s performance gains over the top baseline (MAPPO) were [6.8%, 10.2%] for TSR, [13.1%, 18.5%] for ACR, and [10.2%, 15.8%] for CE.
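The reported comparisons can be reproduced in spirit with a paired sign-flip permutation test (a numpy-only stand-in for the Wilcoxon signed-rank test) combined with a Bonferroni correction and Cohen’s d. The per-run scores below are synthetic placeholders, not the paper’s measurements:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic per-run TSR samples for two algorithms (NOT the paper's data)
iso = rng.normal(89.6, 1.4, size=30)
mappo = rng.normal(81.2, 1.7, size=30)
diff = iso - mappo

# Paired sign-flip permutation test: under H0 each paired difference
# is equally likely to be positive or negative
obs = diff.mean()
flips = rng.choice([-1.0, 1.0], size=(20000, diff.size))
null = (flips * diff).mean(axis=1)
p_raw = (1 + np.sum(np.abs(null) >= abs(obs))) / (1 + null.size)

# Bonferroni correction over the 7 baseline comparisons
p_adj = min(1.0, 7 * p_raw)

# Cohen's d effect size with pooled standard deviation
pooled = np.sqrt((iso.std(ddof=1) ** 2 + mappo.std(ddof=1) ** 2) / 2)
d = (iso.mean() - mappo.mean()) / pooled
print(f"adjusted p = {p_adj:.2e}, Cohen's d = {d:.2f}")
```

With gaps this large relative to run-to-run variance, the adjusted p-value stays far below 0.001 and the effect size lands in the “large” range, mirroring the pattern reported above.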
Comparison with Transformer-based and Role-based Methods
Among baselines, MAT attained the strongest performance (TSR: 83.5%, 79.1%, 76.8% across three scenarios), validating the efficacy of transformer-driven cross-agent attention for multi-agent coordination. Nevertheless, ISO-MAGCG outperformed MAT by 6.1%, 5.1%, and 5.3% in TSR across the three scenarios, respectively. This performance disparity stems from two core limitations of MAT:
  • MAT’s attention mechanism operates on the observation-action space without explicit intention modeling, yielding implicit coordination patterns but lacking the interpretability and goal-directedness afforded by intention recognition. In contrast, ISO-MAGCG’s GAIN module explicitly predicts agent intentions to guide strategy optimization.
  • ISO-MAGCG’s bi-level optimization framework enables co-evolution of intention recognition and strategy learning, whereas MAT frames coordination as a single-level sequence prediction task; the mutual feedback between ISO-MAGCG’s intention and strategy modules fosters more coherent cooperative behaviors.
ROMA delivered competitive results (TSR: 82.1%, 77.5%, 75.2%), particularly in role-differentiation-centric scenarios, but its capability-based role clustering mechanism does not support goal inference. In the defense cooperation scenario—where agents must dynamically switch between offensive and defensive behaviors based on situational intentions—ROMA’s static role assignments became restrictive, and ISO-MAGCG’s dynamic intention recognition yielded a 6.9% higher TSR.
Notably, the cooperation efficiency (CE) metric highlighted ISO-MAGCG’s most pronounced advantage over transformer and role-based methods: it achieved CE improvements of 9.2%, 9.1%, and 9.4% over MAT across the three scenarios, demonstrating that explicit intention modeling enables more effective cooperation than implicit attention-driven coordination.

4.3.2. Convergence Analysis

Convergence analysis was conducted in the multi-agent navigation task with 8 agents in a 20 × 20 grid environment. We measured the number of training episodes each algorithm required to reach 90% of its final performance (Table 8). Each algorithm was independently executed 5 times; results are reported as mean ± standard deviation, and the convergence curves are shown in Figure 5.
Convergence Statistical Analysis: The Mann–Whitney U test confirmed that ISO-MAGCG achieved significantly faster convergence than all baseline algorithms (p < 0.001). The algorithm demonstrated 22.4% faster convergence compared to the best baseline (MAPPO) with 95% confidence interval [18.7%, 26.1%]. Training stability analysis using coefficient of variation showed ISO-MAGCG maintained the most consistent performance (CV = 0.021) compared to baselines (CV range: 0.048–0.064).
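The stability index used here is the coefficient of variation of final performance across independent runs; a minimal sketch (hypothetical helper, assuming per-run final rewards are logged):

```python
import numpy as np

def coefficient_of_variation(final_scores) -> float:
    """Training stability as used in Section 4.3.2: sample standard
    deviation divided by the mean of final performance across
    independent runs (lower = more stable training)."""
    s = np.asarray(final_scores, dtype=float)
    return float(s.std(ddof=1) / s.mean())

# Example with hypothetical final-reward values from five runs
print(coefficient_of_variation([186.1, 188.0, 187.5, 186.9, 188.0]))
```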

4.4. Ablation Study

4.4.1. Component Effectiveness Validation

Ablation experiments were performed across three standardized test scenarios: navigation (8 agents in a 20 × 20 grid environment), resource collection (6 agents in a 15 × 15 environment with 10 randomly distributed resource points), and defense collaboration (10 agents defending against 5 invasion targets).
Each variant was trained for 2000 episodes in each scenario, with performance evaluated as the average task success rate over the final 100 episodes. The ablation results are shown in Table 9.
The ablation study revealed each component’s contribution:
GAIN Network (−11.7%): Removing the entire GAIN network caused the largest performance drop, confirming intention recognition’s critical role.
Attention Mechanism (−7.6%): The variant “w/o Attention (GAIN w/MLP)” replaces the graph attention layers with a standard MLP while preserving the GRU encoder and intention prediction head. This isolates the attention mechanism’s contribution: 7.6% of the drop stemmed from attention, while the remaining 4.1% (11.7% − 7.6%) came from other GAIN components (temporal encoding, dynamic graph construction).
ISO Framework (−8.4%): Removing bi-level optimization and training intention recognition independently caused significant degradation, demonstrating the importance of co-evolution.
AGEA (−6.0%): Replacing AGEA with uniform weight allocation reduced performance, confirming contribution-based dynamic weighting’s value.
Multi-Head Attention (−4.3%): Reducing from K = 4 to K = 1 attention head caused moderate loss, suggesting multiple heads capture diverse interaction patterns.

4.4.2. Hyperparameter Sensitivity Analysis

Hyperparameter sensitivity testing was conducted in the multi-agent navigation task (8 agents). With other parameters fixed at their optimal values, each target parameter was varied individually. Each configuration was trained for 1500 episodes, with performance measured as the average over the final 200 episodes. The results of the sensitivity analysis are shown in Table 10.
The results show that ISO-MAGCG is robust to most hyperparameters, exhibiting only moderate sensitivity to the learning rate, which provides useful guidance for parameter selection in practical applications.
Temperature Parameter Analysis. The temperature parameter τ in Equation (30) controls weight concentration in dynamic allocation. Figure 6 illustrates performance sensitivity across different τ values.
Three regimes are observable:
(1) Under-differentiation (τ < 0.5): Low temperatures produce near-uniform weights regardless of contribution, failing to leverage agent heterogeneity. Performance degrades by 8–12%.
(2) Optimal range (0.5 ≤ τ ≤ 2.0): Balanced weights appropriately reward high contributors while maintaining collective optimization. Performance remains within 2% of optimal.
(3) Over-concentration (τ > 5.0): Excessive focus on top contributors neglects supporting agents, causing coordination failures and 10–15% degradation.
Practical Guidance: Initialize τ = 1.0. For homogeneous teams, use lower values (τ ≈ 0.5); for heterogeneous teams with specialist roles, use higher values (τ ≈ 2.0).
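Assuming Equation (30) is a softmax over contribution scores scaled by τ (so that larger τ concentrates weight, matching the regimes described above), the allocation can be sketched as:

```python
import numpy as np

def group_weights(contributions, tau):
    """Temperature-scaled softmax over agent contribution scores.
    Small tau -> near-uniform weights (under-differentiation);
    large tau -> weight concentrates on top contributors."""
    z = tau * np.asarray(contributions, dtype=float)
    z -= z.max()              # shift for numerical stability
    w = np.exp(z)
    return w / w.sum()

scores = [1.0, 2.0, 4.0]
print(group_weights(scores, 0.1))   # nearly uniform
print(group_weights(scores, 1.0))   # balanced
print(group_weights(scores, 10.0))  # concentrated on top contributor
```

The exact functional form in the paper may differ; the sketch only illustrates how τ interpolates between uniform and winner-take-most weighting.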

4.5. Scalability Analysis

Agent Number Scalability

Scalability was evaluated in a standardized multi-agent navigation task in which the environment grows with the agent count (a square grid whose side length equals 2.5 × the number of agents). Each scale configuration was trained for 2000 episodes, with the average task success rate and per-episode computation time measured over the final 100 episodes. The scalability results are shown in Table 11.
The results indicate that ISO-MAGCG maintains good performance as the number of agents increases, with a performance retention rate of 86.2% even with 20 agents, demonstrating excellent scalability.

4.6. Experimental Results Discussion

4.6.1. Performance Advantage Analysis

Experimental results demonstrate that ISO-MAGCG significantly outperformed the baseline algorithms across all test scenarios, with core advantages including: (1) high intention recognition accuracy (94.3%) via dynamic graph modeling in the GAIN network, laying the foundation for effective cooperation; (2) pronounced collaborative optimization effects enabled by the bi-level framework, which fosters mutual promotion between intention recognition and strategy learning, addressing the separation issue in traditional methods; (3) accelerated convergence (30% faster than the top baseline) to enhance learning efficiency; and (4) superior training stability (stability index = 0.94, smaller standard deviation), ensuring algorithm reliability.
Further comparisons with MAT (transformer-based) and ROMA (role-based) revealed key design distinctions and advantages of ISO-MAGCG:
  • Explicit vs. Implicit Coordination: Unlike MAT, which learns implicit coordination patterns from rewards via transformer attention, ISO-MAGCG’s explicit intention modeling enables faster convergence (via structured intermediate supervision) and interpretable predictions—critical for safety-critical applications (e.g., autonomous driving, medical robotics) requiring action explainability.
  • Dynamic Intentions vs. Static Roles: ROMA’s capability-based static role clustering is limited in dynamic scenarios, where agents need to switch behavioral modes (e.g., Navigate → Avoid → Cooperate) within episodes. ISO-MAGCG’s dynamic intention recognition adapts to such temporal shifts, overcoming ROMA’s rigidity.
  • Task-Specific Attention Design: While both MAT and GAIN utilize attention mechanisms, MAT focuses on observation–action pairs for sequence-based action prediction, whereas GAIN’s graph attention targets agent interaction graphs for intention inference—enabling more relevant feature extraction for cooperative tasks.
The consistent 5–7% performance advantage of ISO-MAGCG over MAT/ROMA across all metrics underscores that explicit intention modeling provides fundamental benefits that cannot be fully offset by advanced architectures (transformers) or alternative abstractions (roles).

4.6.2. Limitations and Improvement Directions

Despite achieving good results, ISO-MAGCG has the following limitations with proposed solutions:
  • Computational Complexity and Large-Scale Scalability. The O(n²) complexity poses challenges for systems with more than 100 agents. Three approximation strategies are proposed:
    • Hierarchical Attention: Partition agents into k groups; compute intra-group attention in O((n/k)²) per group and inter-group attention in O(k²). The overall complexity reduces to O(n²/k + k²), i.e., O(n^1.5) when k = √n. Preliminary tests with k = 4 groups on 16 agents showed only 3.1% degradation.
    • Sparse Attention: Sample edges based on spatial proximity within a threshold d_connect. For uniformly distributed agents with density ρ, the expected number of edges becomes O(n·ρ·d_connect²), significantly reducing computation in sparse scenarios.
    • Low-Rank Factorization: Approximate the n × n attention matrix with a rank-r decomposition, reducing complexity to O(n·r).
  • Intention Space Design. Current intention spaces are predefined. Future work will explore unsupervised intention discovery using VAE architectures.
  • Environmental Adaptability. Adaptability in highly dynamic environments needs further validation through domain randomization and curriculum learning.
  • Communication Overhead. Distributed design requires information exchange that may be affected in bandwidth-constrained scenarios. Federated learning approaches warrant investigation.
  • Simulation-to-Real Transfer. All experiments were conducted in simulation. Deploying to physical systems (UAV swarms, robotic teams) faces challenges:
    (a) Observation Noise: Preliminary robustness tests with Gaussian noise (σ = 0.1) showed only 3.2% degradation, suggesting reasonable tolerance. Systematic evaluation under realistic sensor models is needed.
    (b) Communication Constraints: Limited bandwidth may necessitate fully decentralized variants with local intention estimation.
    (c) Actuation Dynamics: Continuous actions must respect physical limits not captured in kinematic simulations.
    (d) Domain Shift: Real-world intention distributions may differ, requiring online adaptation capabilities.
Potential applications include: (a) UAV swarm coordination for search and rescue; (b) warehouse multi-robot picking systems; (c) autonomous vehicle platoons for highway driving.
Systematic sim-to-real validation through hardware-in-the-loop testing represents a critical direction.

5. Conclusions and Future Work

5.1. Research Summary

This paper addresses key issues in multi-agent systems, such as the separation of intention recognition and strategy optimization, and low group cooperation efficiency, by proposing the Intention-Strategy Optimization based Multi-Agent Cooperative Group Game Model (ISO-MAGCG). The main work and contributions include:
  • Theoretical contributions: Established an intention-strategy bi-level optimization theoretical framework, integrating intention recognition and strategy optimization into a unified framework for collaborative evolution; proposed group intention recognition theory based on graph attention, capturing complex intention dependencies through dynamic graph modeling and multi-head attention mechanisms; constructed an intention-aware multi-agent group game model, extending traditional game theory applications by using intention states as prior information for game strategies.
  • Methodological contributions: Designed the Graph Attention Intention Network (GAIN) achieving 94.3% intention recognition accuracy; developed the Adaptive Group Evolution Algorithm (AGEA) combining global search and local optimization for efficient joint optimization; proposed a distributed training architecture supporting parallel training of large-scale multi-agent systems.
  • Experimental validation: Constructed a multi-scenario evaluation system covering navigation, resource collection, and defense cooperation. Compared with seven baseline algorithms (five mainstream methods plus MAT and ROMA), ISO-MAGCG significantly outperforms existing methods in task success rate, cumulative reward, cooperation efficiency, and other metrics, with an average task success rate improvement of 8.4%, a cooperation efficiency improvement of approximately 12%, and a convergence speed improvement of 30%, demonstrating excellent performance advantages and scalability.

5.2. Future Research Directions

Future research will be extended in the following directions:
  • Theoretical aspects: Research adaptive intention space learning methods to reduce dependence on expert knowledge, automatically discovering intention patterns based on unsupervised learning; extend convergence theory under non-stationary environments, considering the impact of environmental dynamics, observation noise, and other practical factors; develop multi-level intention modeling theory, establishing relationships between different abstraction levels such as short-term intentions and long-term goals; investigate continuous and latent intention space learning using VAE architectures while maintaining interpretability.
  • Methodological aspects: Develop efficient approximation algorithms based on sampling, compression, and related techniques to reduce the O(n²) computational complexity; research federated learning architectures to reduce communication overhead while protecting agent privacy; explore multi-modal intention recognition methods, integrating information sources such as behavioral trajectories, linguistic communication, and visual information to improve recognition accuracy.
  • Application aspects: Validate method effectiveness in real scenarios such as UAV swarms, robot cooperation, and intelligent transportation; research human–robot collaboration systems, considering human behavioral uncertainty and subjectivity; address challenges of system scale, heterogeneity, and dynamics for large-scale distributed systems like smart cities and IoT.
With the rapid development of artificial intelligence technology, multi-agent cooperation will play increasingly important roles in more domains. This research has laid a solid foundation for further development in this field, and future work will continue to advance multi-agent cooperation technology toward greater intelligence, efficiency, and practicality.

Author Contributions

T.M.: Conceptualization, Investigation, Methodology, Software, Writing—original draft, Formal analysis, Funding acquisition. C.R.: Investigation, Software, Writing—review and editing. Z.J.: Investigation, Writing—review and editing. All authors have read and agreed to the published version of the manuscript.

Funding

This work was jointly supported by the Jiangsu Vocational Education Smart Scene Application “Double-qualified” Master Studio and the Yangzhou Science and Technology Plan Project—Research on Key Technologies of Smart Water Big Data Analysis and Platform Management (Grant No. YZ2023202).

Data Availability Statement

The datasets generated and/or analyzed during the current study are available from the corresponding author on reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. Multi-Agent Markov Game Framework Diagram.
Figure 2. ISO-MAGCG Overall Framework Architecture.
Figure 3. GAIN Network Structure Diagram.
Figure 4. Intention-Strategy Bi-Level Optimization Framework Flowchart.
Figure 5. Convergence Curve Analysis.
Figure 6. Temperature parameter (τ) sensitivity analysis. Shaded region indicates optimal range.
Table 1. Intention Space Definition and Labeling Methods.

| Scenario | Intention Space | Size | Labeling Method |
| Multi-Agent Navigation | {Navigate, Avoid, Wait, Replan} | 4 | Rule-based: velocity direction toward target (Navigate), proximity deviation (Avoid), stationary > 3 steps (Wait), direction change > 90° (Replan) |
| Resource Collection | {Collect, Transport, Cooperate, Explore} | 4 | Rule-based + proximity: agents within d_coop = 3.0 moving toward same resource (Cooperate) |
| Defense Cooperation | {Intercept, Patrol, Support, Retreat, Hold} | 5 | DTW trajectory clustering + manual semantic labeling |
Table 2. Intention Label Statistics and Quality Metrics.

| Scenario | Total Labels | Distribution | Auto-Check Pass | Manual Accuracy | Agreement |
| Navigation | 48,000 | Navigate 42%, Avoid 28%, Wait 18%, Replan 12% | 97.3% | 96.8% | 97.2% |
| Resource Collection | 52,000 | Collect 35%, Transport 25%, Cooperate 24%, Explore 16% | 96.1% | 95.4% | 96.1% |
| Defense Cooperation | 45,000 | Intercept 22%, Patrol 28%, Support 20%, Retreat 15%, Hold 15% | 94.8% | 95.1% | 95.7% |
Table 3. Experimental Platform Configuration.

| Configuration Type | Item | Specific Configuration |
| Hardware | CPU | Intel Core i9-12900K |
| Hardware | GPU | NVIDIA RTX 4090 (24 GB VRAM) |
| Hardware | Memory | 64 GB DDR4 |
| Software | Operating System | Ubuntu 20.04 LTS |
| Software | Programming Language | Python 3.9 |
| Software | Deep Learning Framework | PyTorch 1.12.0 |
| Software | GPU Acceleration | CUDA 11.6 |
| Software | Simulation Environment | OpenAI Gym + PettingZoo |
Table 4. Experimental Scenario Configuration.

| Scenario | Environment Scale | Agent Number | State Space | Action Space | Reward Design |
| Multi-Agent Navigation | 20 × 20 grid | n ∈ {4, 6, 8, 10} | R^(4n) | {0, 1, 2, 3, 4} | Reach target +10, collision −5, time −0.1 |
| Resource Collection | 15 × 15 grid | n ∈ {6, 8, 10, 12} | R^(6n+25) | {0, 1, 2, 3, 4, 5} | Collection +5, cooperation +2, efficiency reward |
| Defense Cooperation | 25 × 25 continuous | n ∈ {8, 10, 12, 15} | R^(8n+4m) | [−1, 1]^2 | Defense +8, cooperation +3, area control |
Table 5. Network Architecture Parameter Settings.

| Network Module | Parameter | Value |
| GAIN Network | Hidden layer dimension | 128 |
| GAIN Network | Number of attention heads | 4 |
| GAIN Network | Number of network layers | 3 |
| Actor Network | Hidden layer dimensions | [256, 128, 64] |
| Actor Network | Activation function | ReLU |
| Actor Network | Output layer activation | Tanh |
| Critic Network | Hidden layer dimensions | [256, 128, 64] |
| Critic Network | Activation function | ReLU |
| Training Parameters | Actor learning rate | 3 × 10⁻⁴ |
| Training Parameters | Critic learning rate | 1 × 10⁻³ |
| Training Parameters | Intent learning rate | 5 × 10⁻⁴ |
| Training Parameters | Discount factor | 0.99 |
| Training Parameters | Soft update coefficient | 0.005 |
| Training Parameters | Batch size | 256 |
Table 6. Baseline Algorithm Configuration.

| Algorithm | Type | Core Features | Main Parameters | Year |
| QMIX | Value decomposition | Monotonicity assumption, centralized training with distributed execution | Mixing network [64, 32], learning rate 5 × 10⁻⁴ | 2018 |
| MADDPG | Policy gradient | Centralized critic, continuous actions | Actor/Critic [400, 300], α = 1 × 10⁻² | 2017 |
| COMA | Actor-Critic | Counterfactual baseline, credit assignment | GRU hidden 64, entropy coefficient 1 × 10⁻³ | 2018 |
| QTRAN | Value decomposition | Relaxed monotonicity constraint | Transform network [64, 64], λ = 5 × 10⁻³ | 2019 |
| MAPPO | Policy optimization | Parameter sharing, centralized value | Network [64, 64], ε = 0.2, λ = 0.95 | 2021 |
| MAT | Transformer-based | Multi-agent transformer, sequential modeling | d_model = 128, heads = 4, layers = 2 | 2022 |
| ROMA | Role-based | Emergent role discovery, role encoder | Role dim = 64, n_roles = 4, α = 5 × 10⁻⁴ | 2020 |
Table 7. Algorithm Overall Performance Comparison.

| Algorithm | Navigation TSR (%) | Navigation ACR | Navigation CE (%) | Resource Collection TSR (%) | Resource Collection ACR | Resource Collection CE (%) | Defense Cooperation TSR (%) | Defense Cooperation ACR | Defense Cooperation CE (%) |
|---|---|---|---|---|---|---|---|---|---|
| QMIX | 72.3 ± 2.1 | 145.2 ± 8.3 | 23.4 ± 3.2 | 68.5 ± 2.8 | 89.7 ± 5.1 | 18.7 ± 2.9 | 65.8 ± 3.1 | 112.4 ± 6.7 | 21.2 ± 3.5 |
| MADDPG | 75.8 ± 1.9 | 152.6 ± 7.1 | 28.1 ± 2.8 | 71.2 ± 2.3 | 95.3 ± 4.8 | 22.4 ± 2.1 | 69.4 ± 2.7 | 118.9 ± 5.9 | 25.3 ± 2.8 |
| COMA | 78.1 ± 2.3 | 158.9 ± 6.8 | 31.7 ± 3.1 | 73.6 ± 2.6 | 98.1 ± 4.2 | 25.8 ± 2.7 | 71.7 ± 2.9 | 125.3 ± 6.2 | 28.6 ± 3.2 |
| QTRAN | 76.4 ± 2.0 | 149.8 ± 7.5 | 26.9 ± 2.9 | 70.9 ± 2.4 | 92.6 ± 4.9 | 20.3 ± 2.5 | 68.2 ± 3.0 | 115.7 ± 6.4 | 23.1 ± 2.9 |
| MAPPO | 81.2 ± 1.7 | 165.4 ± 5.9 | 35.2 ± 2.4 | 76.8 ± 2.1 | 103.2 ± 3.8 | 29.6 ± 2.3 | 74.5 ± 2.5 | 131.8 ± 5.1 | 32.4 ± 2.6 |
| MAT | 83.5 ± 1.6 | 172.8 ± 5.4 | 38.6 ± 2.2 | 79.1 ± 1.9 | 108.5 ± 3.5 | 33.2 ± 2.1 | 76.8 ± 2.3 | 138.4 ± 4.8 | 35.7 ± 2.4 |
| ROMA | 82.1 ± 1.8 | 168.3 ± 5.7 | 36.4 ± 2.5 | 77.5 ± 2.0 | 105.1 ± 3.9 | 31.1 ± 2.4 | 75.2 ± 2.6 | 134.2 ± 5.3 | 33.8 ± 2.7 |
| ISO-MAGCG | **89.6 ± 1.4** * | **187.3 ± 4.2** * | **47.8 ± 1.9** * | **84.2 ± 1.8** * | **125.6 ± 3.1** * | **42.3 ± 2.0** * | **82.1 ± 2.0** * | **156.2 ± 4.6** * | **45.1 ± 2.2** * |

Note: * indicates p < 0.001 compared to all baseline algorithms (Wilcoxon signed-rank test). Values are mean ± standard deviation over 10 independent runs. Bold indicates the best result in each column.
Table 8. Algorithm Convergence Performance Comparison.

| Algorithm | Convergence Episodes | Final Performance (%) | Training Stability | Computation Time (min) | p-Value * |
|---|---|---|---|---|---|
| QMIX | 1850 ± 120 | 68.9 ± 2.8 | 0.78 ± 0.05 | 8.7 ± 0.6 | <0.001 |
| MADDPG | 1720 ± 95 | 72.1 ± 2.3 | 0.82 ± 0.04 | 12.4 ± 0.8 | <0.001 |
| COMA | 1680 ± 110 | 74.5 ± 2.6 | 0.85 ± 0.03 | 15.6 ± 1.1 | <0.001 |
| QTRAN | 1780 ± 105 | 71.8 ± 2.4 | 0.80 ± 0.04 | 9.8 ± 0.7 | <0.001 |
| MAPPO | 1520 ± 85 | 77.5 ± 2.1 | 0.89 ± 0.02 | 10.3 ± 0.5 | <0.001 |
| MAT | 1380 ± 75 | 80.2 ± 1.9 | 0.91 ± 0.02 | 14.8 ± 0.6 | <0.001 |
| ROMA | 1450 ± 80 | 78.6 ± 2.0 | 0.88 ± 0.03 | 11.2 ± 0.5 | <0.001 |
| ISO-MAGCG | 1180 ± 65 | 85.3 ± 1.8 | 0.94 ± 0.02 | 13.2 ± 0.4 | – |

Note: * p-values from Mann–Whitney U test comparing each algorithm to ISO-MAGCG. Training stability is measured via the coefficient of variation of performance over the final 200 episodes.
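The table's stability metric can be sketched as follows. Since the note defines stability via the coefficient of variation (CV, where lower means more stable) yet the tabulated values are near 1 with higher being better, this sketch assumes the reported number is 1 − CV; that interpretation, and the function itself, are assumptions rather than the authors' stated formula.

```python
from statistics import mean, pstdev

def training_stability(returns, window=200):
    """Stability over the final `window` episodes, sketched as 1 - CV.

    CV (coefficient of variation) = population std. dev. / mean.
    Assumes the paper's "higher is better" stability value is 1 - CV.
    """
    tail = returns[-window:]
    cv = pstdev(tail) / mean(tail)
    return 1.0 - cv
```

For example, a run whose final 200 episode returns alternate between 90 and 110 has mean 100 and standard deviation 10, giving CV = 0.1 and stability 0.9 under this reading.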
Table 9. Ablation Study Results.

| Model | Navigation | Resource Collection | Defense Cooperation | Average | Decrease |
|---|---|---|---|---|---|
| Complete | 89.6 ± 1.4 | 84.2 ± 1.8 | 82.1 ± 2.0 | 85.3% | – |
| w/o GAIN | 78.4 ± 2.3 | 72.6 ± 2.7 | 69.8 ± 3.2 | 73.6% | −11.7% |
| w/o Attention (GAIN w/MLP) | 82.1 ± 2.0 | 76.8 ± 2.3 | 74.2 ± 2.7 | 77.7% | −7.6% |
| w/o ISO | 81.2 ± 2.1 | 76.1 ± 2.4 | 73.5 ± 2.8 | 76.9% | −8.4% |
| w/o AGEA | 83.7 ± 1.9 | 78.9 ± 2.2 | 75.2 ± 2.6 | 79.3% | −6.0% |
| w/o Multi-Head (K = 1) | 85.1 ± 1.8 | 80.3 ± 2.0 | 77.6 ± 2.4 | 81.0% | −4.3% |
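The "Average" and "Decrease" columns of Table 9 are consistent with a simple unweighted mean over the three scenarios and its difference from the complete model's mean. A small sketch verifying that reading (the helper name is illustrative; the numbers are the table's mean values with standard deviations omitted):

```python
def ablation_summary(complete_scores, variant_scores):
    """Average success rate across scenarios and the drop relative to
    the complete model, matching the last two columns of Table 9."""
    avg_complete = sum(complete_scores) / len(complete_scores)
    avg_variant = sum(variant_scores) / len(variant_scores)
    return round(avg_variant, 1), round(avg_variant - avg_complete, 1)

# Mean values from Table 9:
complete = [89.6, 84.2, 82.1]   # averages to 85.3
wo_gain = [78.4, 72.6, 69.8]    # averages to 73.6, a decrease of 11.7
```

For instance, `ablation_summary(complete, wo_gain)` reproduces the w/o GAIN row: average 73.6%, decrease −11.7%.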
Table 10. Hyperparameter Sensitivity Analysis Results.

| Hyperparameter | Test Range | Optimal Value | Performance Range | Sensitivity |
|---|---|---|---|---|
| Learning rate | [1 × 10⁻⁵, 1 × 10⁻²] | 3 × 10⁻⁴ | [76.2%, 89.6%] | Medium |
| Number of attention heads | [1, 8] | 4 | [81.3%, 89.6%] | Low |
| Weight parameter | [0.01, 1.0] | 0.1 | [82.7%, 89.6%] | Low |
| Batch size | [64, 512] | 256 | [85.1%, 89.6%] | Very low |
| Temperature | [0.1, 10] | 1.0 | [77.2%, 89.6%] | Medium-High |
| Discount factor | [0.9, 0.999] | 0.99 | [84.8%, 89.6%] | Low |
Table 11. Agent Number Scalability Analysis.

| Agent Number | ISO-MAGCG | MAPPO | QMIX | Performance Retention Rate | Computation Time (min) |
|---|---|---|---|---|---|
| 4 | 92.1 ± 1.2 | 85.3 ± 1.8 | 78.6 ± 2.1 | 100% | 2.1 |
| 8 | 89.6 ± 1.4 | 81.2 ± 1.7 | 72.3 ± 2.3 | 97.3% | 8.5 |
| 12 | 85.2 ± 1.8 | 76.1 ± 2.1 | 65.8 ± 2.8 | 92.5% | 19.1 |
| 16 | 82.6 ± 2.0 | 72.8 ± 2.4 | 61.4 ± 3.1 | 89.7% | 35.2 |
| 20 | 79.4 ± 2.3 | 68.5 ± 2.7 | 56.9 ± 3.4 | 86.2% | 58.7 |
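The "Performance Retention Rate" in Table 11 is consistent with ISO-MAGCG's mean score at each team size divided by its score with 4 agents. A sketch of that reading (the function and dictionary are illustrative, not from the paper; the scores are the ISO-MAGCG means in the table):

```python
# ISO-MAGCG mean scores by agent number, from Table 11
scores = {4: 92.1, 8: 89.6, 12: 85.2, 16: 82.6, 20: 79.4}

def retention_rate(n_agents: int, baseline_n: int = 4) -> float:
    """Performance retained at n_agents relative to the smallest team,
    as a percentage rounded to one decimal (one plausible reading of
    Table 11's retention column)."""
    return round(100.0 * scores[n_agents] / scores[baseline_n], 1)
```

Under this reading, `retention_rate(8)` gives 97.3 and `retention_rate(20)` gives 86.2, matching the table.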
Share and Cite

Tang, M.; Chen, R.; Zhu, J. A Multi-Agent Cooperative Group Game Model Based on Intention-Strategy Optimization. Algorithms 2026, 19, 22. https://doi.org/10.3390/a19010022