Sparse Communication for Policy Shaping in Multi-Agent Reinforcement Learning

Li, Jiahao; Li, Renjie; Wang, Nan

doi:10.3390/s26113413


by Jiahao Li, Renjie Li and Nan Wang *

Department of Electronic Engineering, Faculty of Information Science and Engineering, Ocean University of China, Qingdao 266000, China

* Author to whom correspondence should be addressed.

Sensors 2026, 26(11), 3413; https://doi.org/10.3390/s26113413

Submission received: 20 April 2026 / Revised: 19 May 2026 / Accepted: 20 May 2026 / Published: 28 May 2026

(This article belongs to the Special Issue Advances in Multi-Agent Systems: Cooperative and Intelligent Control Strategies for Complex Applications)


Abstract

Efficient coordination under limited communication is a central challenge in multi-agent reinforcement learning (MARL). Existing approaches often focus on message exchange without explicitly modeling how communication affects policy learning, leading to redundant interactions and limited coordination gains. In this paper, we propose a threshold-gated sparse communication framework built upon QMIX, a monotonic value-decomposition method that mixes individual agent action values into a global team action value. In the proposed framework, communication is integrated into the agent utility function to directly influence policy learning. Each agent encodes local observations into structured representations and activates communication through a learned trigger mechanism. Messages are aggregated via neighbor-constrained attention and incorporated into utility estimation for decentralized decision-making. Experimental results on the StarCraft Multi-Agent Challenge (SMAC) benchmark show that the proposed method improves coordination quality and training stability while significantly reducing communication frequency. On MMM, the Marine–Marauder–Medivac heterogeneous scenario, the communication rate is reduced to approximately 30–38% while achieving up to 96.6% win rate, compared to 92.1% for QMIX. On 10m_vs_11m, a homogeneous scenario where ten allied Marines fight against eleven enemy Marines, communication remains within 28–37% while reaching 88.4% win rate, compared to 85.6% for QMIX. Moreover, on the same task, varying communication thresholds induce clearly differentiated policy behaviors, indicating that sparse communication not only reduces overhead but also plays a critical role in shaping coordination policies. These results demonstrate that selective communication enables efficient coordination while explicitly regulating policy formation.

Keywords: multi-agent reinforcement learning; communication efficiency; sparse communication; threshold-gated mechanism

1. Introduction

Multi-agent reinforcement learning (MARL) has become a key framework for solving cooperative decision-making problems in complex environments, with applications ranging from multi-robot systems [1,2] and autonomous driving to large-scale resource scheduling and distributed control [3,4]. In these settings, agents must achieve effective coordination under partial observability and decentralized execution constraints, while training can leverage global information under the centralized training with decentralized execution (CTDE) paradigm [5,6].

In addition to local decision-making, communication is often introduced to alleviate partial observability by enabling agents to exchange complementary information. Early studies demonstrated that learnable communication can significantly improve cooperation in complex tasks [7,8]. Subsequent works have further explored methods to make communication more selective and efficient, for example by learning when to communicate or whom to communicate with [9,10,11]. These advances suggest that communication plays an important role in enabling coordinated behaviors in MARL.

Despite these developments, learning effective coordination remains challenging. Value decomposition methods such as Value-Decomposition Networks (VDN) and QMIX have shown strong performance and training stability in cooperative MARL [5,6,12]. VDN decomposes the team value into a sum of individual utilities, whereas QMIX uses a monotonic mixing network to combine individual action values into a global team action value. However, these approaches do not explicitly model inter-agent information exchange during execution. As a result, agents rely primarily on local observations, which may limit their ability to form structured cooperative behaviors such as coordinated targeting, spatial formation, and fine-grained micro-control, especially in complex environments such as the StarCraft Multi-Agent Challenge (SMAC) [13,14].

A natural direction is to incorporate sparse communication to improve efficiency. Existing studies on sparse communication primarily focus on reducing communication frequency or bandwidth consumption, often by learning gating or scheduling mechanisms [11,15,16]. While these approaches successfully reduce communication overhead, they typically treat communication as an auxiliary information channel and do not explicitly analyze how communication influences the resulting policy. As a consequence, communication may become less frequent but still fails to induce meaningful changes in coordination patterns or policy structure. Recent works also highlight that effective communication should be adaptive and task-relevant, rather than merely sparse [17]. However, how communication affects the emergence of coordinated behaviors remains underexplored.

In this paper, we propose a threshold-gated sparse communication framework built upon QMIX, which explicitly leverages communication to shape policy behavior rather than merely transmitting information. Each agent encodes local observations using a lightweight convolutional neural network (CNN)-based spatial representation and activates communication based on structured deviations in local states. The transmitted messages are aggregated under a neighbor-constrained attention mechanism and integrated into the agent utility network, enabling communication-aware decision-making.

We evaluate the proposed approach on the StarCraft Multi-Agent Challenge (SMAC). The experimental results show that the proposed method improves coordination quality and stabilizes learning while substantially reducing communication frequency. More importantly, the results indicate that sparse and selective communication leads to more structured cooperative behaviors, including more consistent focus-fire patterns and improved micro-control decisions. These findings suggest that communication can play an active role in shaping cooperative strategies, rather than merely serving as a channel for information exchange.

Based on this motivation, we formulate and investigate three verifiable research questions concerning communication efficiency, behavior-level policy changes, and the role of communication in utility-based cooperative policy learning. By doing so, the proposed framework is connected to testable hypotheses, rather than being presented only as an architectural modification to existing value-decomposition methods.

Our main contributions are summarized as follows:

(1): We propose a threshold-gated sparse communication mechanism that enables agents to selectively communicate based on structured changes in local observations. Unlike dense communication baselines that broadcast messages at every decision step, the proposed trigger suppresses redundant transmissions and activates communication only when the learned local state deviation indicates potential coordination demand.
(2): We introduce a communication-aware QMIX framework that explicitly links communication to policy formation. Specifically, activated neighbor messages are aggregated through a neighborhood-constrained attention module and fused into the recurrent individual utility-estimation pathway, so communication context can influence decentralized action selection and centralized value mixing rather than serving as an auxiliary message channel.
(3): We demonstrate that sparse communication can improve coordination and policy structure while maintaining competitive performance on SMAC benchmarks. Beyond win rate and communication ratio, we further analyze behavior-level coordination indicators such as focus-fire consistency, target distribution entropy, spatial separation, and attack–move transitions, showing how threshold-controlled communication affects learned cooperative behaviors.

2. Related Works

2.1. MARL for Collaboration (Value-Based CTDE)

Multi-Agent Reinforcement Learning (MARL) has become a prominent paradigm for learning cooperative behaviors in multi-agent teams, enabling emergent coordination such as focus-fire, formation control, and synchronized maneuvers in complex partially observable environments [13,14,18,19]. A widely adopted training paradigm is centralized training with decentralized execution (CTDE), where additional global information is used during training to improve stability and credit assignment, while each agent executes using only local information at test time [3,5,6]. Within CTDE, value decomposition methods are particularly effective for cooperative tasks. VDN factorizes the team action value as a sum of individual utilities, providing a simple yet scalable mechanism for credit assignment [6]. QMIX further generalizes value decomposition by learning a mixing network that combines individual action values into a joint

Q_{tot}

under a monotonicity constraint, enabling decentralized greedy action selection while optimizing for the team return [5]. Subsequent value-based approaches improve representation learning, temporal credit assignment, and robustness under partial observability, leading to strong performance on standard cooperative benchmarks [12,13,20,21,22]. Despite this progress, cooperative MARL remains challenged by partial observability, non-stationarity induced by simultaneously learning agents, and the need for effective coordination mechanisms beyond implicit coupling through shared rewards [13,23,24]. In particular, when local observations are insufficient to infer teammates’ intents or to align on shared tactical objectives, explicit inter-agent communication becomes a key ingredient for reliable group strategy emergence [7,9,13].

2.2. Communication in MARL

To facilitate coordination beyond what value decomposition alone can provide, a line of work studies differentiable communication mechanisms that allow agents to exchange messages end-to-end with the task objective [7,8,17]. Early approaches learn continuous or discrete messages and backpropagate gradients through the communication channel, enabling agents to share hidden representations that encode intent or local context [8]. Later, attention-based communication frameworks introduce content-based addressing to decide “who to listen to” and “what to aggregate,” improving scalability and enabling selective information sharing in larger teams [9,10]. Graph neural networks and message-passing architectures similarly model inter-agent interactions as relational aggregation over dynamically defined neighborhoods, which is well-suited for variable-sized agent populations and structured cooperation [1,2,9,10]. While these methods demonstrate that explicit information exchange can substantially improve coordination, many of them assume frequent communication and do not explicitly optimize the trade-off between message cost and cooperative performance, which can be critical in bandwidth- or latency-constrained settings [11,16,22]. This limitation motivates the study of communication-efficient and sparse communication mechanisms, which aim to reduce communication overhead while preserving coordination quality.

2.3. Communication Efficiency and Sparse/Event-Triggered Communication

In practical multi-agent systems, communication is often constrained by bandwidth, latency, energy, or privacy, motivating research on communication-efficient MARL [4,9,11,16]. A representative direction is to learn when to communicate via scheduling or gating, where each agent decides whether to broadcast (or which subset of agents should communicate) under a limited budget [9,11,16]. Other approaches introduce explicit regularizers or bottlenecks on messages to reduce communication frequency, message dimensionality, or overall information flow [9,10,11]. Sparse or event-triggered communication is particularly appealing because it encourages agents to communicate only at critical decision points, which often correspond to tactical events such as target switching, engagement initiation, or coordinated retreat [9,11,16]. However, learning sparse communication is non-trivial: Naive gating can collapse to always-on or always-off behaviors, discrete triggers can lead to high-variance gradients, and excessive sparsification may remove the very signals needed for intent alignment, degrading group strategies even if individual behaviors remain locally reasonable [25,26,27]. Therefore, achieving an effective balance between communication sparsity and coordination quality remains an ongoing challenge, especially for value-based CTDE methods in partially observable cooperative tasks where emergent group strategies critically depend on timely and selective information exchange [3,5,9,10].

2.4. Gaps in Existing Research and the Positioning of This Work

The studies reviewed above establish three important foundations. Value-based CTDE methods provide stable joint value learning and effective credit assignment in cooperative MARL [28,29,30]. Differentiable and attention-based communication methods show that message exchange and attention-guided state abstraction can improve coordination under partial observability [31,32]. Sparse, targeted, hypergraph-based, and event-triggered methods further demonstrate that communication cost can be reduced by improving message specificity, suppressing irrelevant information, or triggering information exchange only when it is useful [33,34,35]. However, these lines of work still leave an important gap: The communication decision is often separated from the formation of the cooperative policy. In many existing approaches, communication is optimized mainly as a message-passing or scheduling layer. The downstream policy may receive messages, but the method does not explicitly show how communication changes individual utility estimation, joint value aggregation, or the resulting coordination pattern. Dense communication is expressive but can introduce redundant and task-irrelevant context. Selective or sparse communication reduces the number of messages, but its main objective is often communication efficiency rather than policy shaping. Event-triggered communication can activate messages around state changes, but the trigger is not always coupled with a communication-aware value-decomposition pathway or behavior-level analysis.

Consequently, current sparse and selective communication models do not fully explain how the retained messages are incorporated into a shared cooperative policy under CTDE [30,33,34]. They may reduce the frequency of communication, but message sparsity alone does not guarantee that the remaining information contributes to target selection, formation control, attack timing, or other coordinated behaviors. This limitation is particularly important for SMAC-style micromanagement tasks, where coordination quality depends not only on whether agents exchange information, but also on whether the exchanged information affects the individual utilities that drive decentralized action selection.

This work addresses the above gap by positioning sparse communication as a policy-shaping component within a value-decomposition framework. Specifically, the proposed method first uses a learned threshold-gated trigger to suppress redundant transmissions, then applies a neighborhood-constrained attention mechanism to aggregate only activated and relevant neighbor messages. The resulting communication context is fused into the recurrent utility-estimation pathway, and the QMIX mixer aggregates these communication-aware individual utilities into the team value. Compared with dense communication, the proposed framework avoids indiscriminate message broadcasting. Compared with sparse, selective, or event-triggered schemes that mainly emphasize communication reduction [33,34,35], our method explicitly connects communication to utility estimation and further evaluates its effect on behavior-level coordination patterns.

3. Method

This section presents the proposed communication framework, including the overall communication architecture (Section 3.1), the visual-encoding-based trigger module (Section 3.2), and the communication-aware QMIX formulation (Section 3.3). The detailed network configurations are summarized in Table A1 of Appendix A.

3.1. Communication Setting and Overall Framework

In cooperative multi-agent settings, agents operate under partial observability, where each agent only has access to local observations. Communication enables agents to exchange complementary information and improve coordination. However, unrestricted communication often introduces redundant interactions and unnecessary communication overhead. To address this issue, we design a neighbor-sparse communication mechanism with trigger-based message transmission and threshold-based regularization, so that nearby agents communicate selectively while excessive communication is discouraged during training. As illustrated in Figure 1, the proposed framework integrates sparse communication, message aggregation, recurrent utility estimation, and value decomposition into a unified architecture.

Before introducing the neural communication architecture, we first define the underlying communication setting. Consider a cooperative system with N agents. At time step t, the agents form a directed communication graph

G^{t} = (V, E^{t})

, where each node corresponds to an agent and each directed edge indicates that one agent is allowed to transmit information to another. A dense communication baseline can be represented by the full adjacency mask

A_{i j}^{full} = I (i \neq j),

(1)

where every agent can send a message to every other agent at each decision step. Under this setting, all potential communication links are active, which maximizes information exchange but may also introduce redundant and task-irrelevant messages.

In contrast, the proposed framework decomposes communication availability into two factors: a physical neighborhood constraint and a learned trigger decision. The neighborhood constraint determines whether two agents are spatially close enough to communicate, while the trigger decision determines whether the sender actually transmits a message. Therefore, the effective communication link from agent j to agent i is active only when both conditions are satisfied. This design can be viewed as applying a structured sparse mask on top of the dense communication graph, rather than assuming unrestricted message broadcasting.

Figure 1. Overall architecture of the proposed communication-aware QMIX framework. The pipeline consists of local observation encoding, communication mask construction, trigger-based message gating, neighbor-constrained attention aggregation, recurrent individual utility estimation, and QMIX value mixing. Figure 2 further details the internal process from fused trigger representation to binary communication activation.

Figure 2. Trigger module architecture. Spatial occupancy and local vector observations are encoded and fused into a trigger representation. The fused representation is mapped to a continuous activation score, which is converted into a binary communication trigger through a hard threshold function. During training, the non-differentiable threshold operation is optimized using a straight-through estimator.

Formally, the proposed effective communication mask is factorized as

{\hat{M}}_{i j}^{t} = A_{i j}^{full} M_{i j}^{t} T_{j}^{t},

(2)

where

M_{i j}^{t}

denotes the spatial neighborhood mask and

T_{j}^{t}

is the binary trigger of sender agent j. The dense baseline corresponds to the special case where

M_{i j}^{t} = 1

and

T_{j}^{t} = 1

for all

i \neq j

. By contrast, our method suppresses communication either when the receiver and sender are outside the local communication range or when the sender’s trigger is inactive. This formulation makes explicit the difference between dense communication and the proposed threshold-gated sparse communication mechanism.

At each time step t, each agent i maintains an action–observation history

τ_{i}^{t} = (o_{i}^{1 : t}, u_{i}^{0 : t - 1}),

where

o_{i}^{t}

denotes the local observation vector and

u_{i}^{t}

denotes the executed action. For the first decision step, the previous action input is initialized as a zero vector. Based on its local history, each agent estimates an individual action value function for decentralized decision-making.

Unless otherwise stated, all agents share the parameters of the local encoder, message encoder, trigger-based communication module, recurrent utility network, and action value head. This parameter-sharing scheme improves sample efficiency and is standard in cooperative MARL.

3.1.1. Neighbor-Based Sparse Communication

To reduce communication overhead and avoid redundant interactions, communication is restricted to a dynamic local neighbor set determined by spatial proximity. Specifically, at each time step t, agents within a predefined communication radius of agent i are considered its neighbors. We define a binary adjacency mask

M_{i j}^{t} = I (d_{i j}^{t} \leq r, i \neq j),

(3)

where

d_{i j}^{t}

is the spatial distance between agents i and j at time t, r is the communication radius, and

M_{i j}^{t} = 1

indicates that agent i can receive information from agent j at time t. Here,

I (\cdot)

denotes the indicator function, which equals 1 if the condition is satisfied and 0 otherwise.

Each agent first encodes its current local observation and previous action into a feature representation as

z_{i}^{t} = f_{emb} ([o_{i}^{t}, u_{i}^{t - 1}]),

(4)

where

f_{emb} (\cdot)

denotes the local feature encoder and

z_{i}^{t} \in R^{d_{z}}

is the resulting feature vector.

Based on this representation, a message embedding is generated as

m_{i}^{t} = f_{msg} (z_{i}^{t}; θ_{msg}),

(5)

where

f_{msg} (\cdot)

is the message encoder and

m_{i}^{t} \in R^{d_{m}}

denotes the outgoing message vector.

To enable adaptive communication, we introduce a binary trigger variable

T_{i}^{t} \in \{0, 1\}

, which is described in detail in Section 3.2. It determines whether agent i transmits its message. The gated transmitted message is defined as

{\tilde{m}}_{i}^{t} = T_{i}^{t} \cdot m_{i}^{t} .

(6)

Combining neighborhood sparsity and trigger gating yields the effective communication mask

{\hat{M}}_{i j}^{t} = M_{i j}^{t} \cdot T_{j}^{t},

(7)

where

{\hat{M}}_{i j}^{t} = 1

indicates that agent i receives the message from agent j at time t. Accordingly, the effective sender set for receiver i is

N_{i}^{t} = \{j ∣ {\hat{M}}_{i j}^{t} = 1\} .

(8)
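The mask construction in Equations (3), (7), and (8) can be sketched in a few lines of plain Python. This is only an illustrative sketch: the function names, the 2-D positions, the radius, and the trigger values are assumptions made for the example and are not taken from the paper.

```python
import math

def neighbor_mask(positions, radius):
    """Eq. (3): M[i][j] = 1 if agent j is within `radius` of agent i and i != j."""
    n = len(positions)
    mask = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and math.dist(positions[i], positions[j]) <= radius:
                mask[i][j] = 1
    return mask

def effective_mask(m, triggers):
    """Eq. (7): M_hat[i][j] = M[i][j] * T[j]; sender j must have an active trigger."""
    n = len(m)
    return [[m[i][j] * triggers[j] for j in range(n)] for i in range(n)]

def sender_sets(m_hat):
    """Eq. (8): for each receiver i, the senders j whose message arrives."""
    return [[j for j, v in enumerate(row) if v == 1] for row in m_hat]

# Hypothetical positions and triggers, chosen only for illustration.
positions = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
triggers = [1, 0, 1]
m = neighbor_mask(positions, radius=2.0)
m_hat = effective_mask(m, triggers)
senders = sender_sets(m_hat)
```

In this toy case agents 0 and 1 are neighbors, but only agent 0 has an active trigger, so agent 1 is the only receiver with a non-empty sender set. Setting every trigger to 1 and using a very large radius recovers the dense baseline of Equation (1).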

3.1.2. Message Aggregation and Fusion

For each receiver agent i, incoming messages from effective neighbors are aggregated into a communication context vector

c_{i}^{t}

. Since different neighbors may provide information of different relevance, we adopt an attention-based aggregation mechanism as

e_{i j}^{t} = ψ ([z_{i}^{t}, {\tilde{m}}_{j}^{t}]),

(9)

α_{i j}^{t} = \frac{exp (e_{i j}^{t}) {\hat{M}}_{i j}^{t}}{\sum_{k \neq i} exp (e_{i k}^{t}) {\hat{M}}_{i k}^{t}},

(10)

c_{i}^{t} = \sum_{j \neq i} α_{i j}^{t} {\tilde{m}}_{j}^{t},

(11)

where

ψ (\cdot)

is a scoring network that maps the concatenated feature

[z_{i}^{t}, {\tilde{m}}_{j}^{t}]

to a scalar attention logit, and

α_{i j}^{t}

denotes the normalized attention weight. The scalar

e_{i j}^{t}

represents the unnormalized attention score measuring the relevance of the message from agent j to agent i at time t. The aggregation is performed only over the valid incoming senders indicated by

{\hat{M}}_{i j}^{t}

. When at least one valid incoming sender exists, the attention weights are normalized as above; otherwise, we directly set

c_{i}^{t} = 0,

which represents the null communication context.
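The masked attention of Equations (10) and (11), including the null-context fallback, might be implemented along the following lines. The sketch assumes the attention logits have already been produced by the scoring network; the function name `aggregate` and the example values are hypothetical.

```python
import math

def aggregate(logits_i, messages, m_hat_i):
    """Masked softmax aggregation for one receiver i (Eqs. (10)-(11)).

    logits_i[j] : attention logit e_ij, assumed to come from the scoring network
    messages[j] : gated message vector of sender j
    m_hat_i[j]  : effective mask entry for the link j -> i
    """
    dim = len(messages[0])
    valid = [j for j, v in enumerate(m_hat_i) if v == 1]
    if not valid:
        return [0.0] * dim  # null communication context
    # Subtracting the largest logit leaves the softmax unchanged but avoids overflow.
    top = max(logits_i[j] for j in valid)
    weights = {j: math.exp(logits_i[j] - top) for j in valid}
    total = sum(weights.values())
    context = [0.0] * dim
    for j in valid:
        alpha = weights[j] / total  # normalized attention weight
        for k in range(dim):
            context[k] += alpha * messages[j][k]
    return context

# Two valid senders with equal logits receive equal weight.
ctx = aggregate([0.0, 0.0, 0.0], [[0.0, 0.0], [2.0, 0.0], [0.0, 4.0]], [0, 1, 1])
```

Restricting the softmax to valid senders is equivalent to multiplying each exponential by the mask entry, as written in Equation (10), and it avoids a division by zero when no sender is active.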

The agent recurrent state is then updated by

h_{i}^{t} = GRU (h_{i}^{t - 1}, [z_{i}^{t}, c_{i}^{t}]),

(12)

where

h_{i}^{t} \in R^{d_{h}}

denotes the hidden state vector. The individual utility network outputs the Q-values over the discrete action space as

q_{i}^{t} = f_{Q} (h_{i}^{t}; θ_{Q}),

(13)

and the utility corresponding to the executed action

u_{i}^{t}

is

Q_{i} (τ_{i}^{t}, u_{i}^{t}) = q_{i}^{t} [u_{i}^{t}] .

(14)

During decentralized execution, each agent computes its local feature representation, trigger activation, communicated message, aggregated communication context, and action values using only its local observation history and the messages received from neighboring agents.

3.2. Trigger-Based Communication Module

To further reduce redundant communication, we introduce a trigger-based communication module together with an excess-activation regularization mechanism, so that message transmission is learned adaptively while excessive communication is explicitly discouraged during training.

Each agent constructs a multi-modal trigger representation based on (i) a spatial occupancy map of nearby allies and (ii) its local vector observation. Specifically, the occupancy of neighboring allied agents is encoded as a binary map

X_{i}^{t} \in \{0, 1\}^{31 \times 31}

, where the map is centered at agent i, and each grid cell indicates whether a neighboring ally occupies the corresponding relative spatial location. The map size

31 \times 31

is chosen to provide a fixed-resolution coverage of the agent-centered local perceptual region in SMAC while keeping the visual encoder lightweight. If the local perceptual region exceeds the map boundary, the valid observable area is clipped by the fixed

31 \times 31

boundary.
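One possible way to build such an agent-centered occupancy map is sketched below. The paper does not state the grid resolution or the discretization rule, so the `cell_size` parameter, the rounding to the nearest cell, and the choice to drop allies that fall outside the grid are all assumptions made for illustration.

```python
def occupancy_map(rel_positions, cell_size=1.0, size=31):
    """Binary ally-occupancy grid centered on the observing agent.

    rel_positions : (dx, dy) offsets of allies relative to the agent
    cell_size     : hypothetical side length of one grid cell
    """
    half = size // 2  # index of the center cell (15 for a 31 x 31 map)
    grid = [[0] * size for _ in range(size)]
    for dx, dy in rel_positions:
        col = half + round(dx / cell_size)
        row = half + round(dy / cell_size)
        # Allies outside the fixed boundary are clipped, i.e. not drawn.
        if 0 <= row < size and 0 <= col < size:
            grid[row][col] = 1
    return grid

# One ally inside the window and one far outside it.
grid = occupancy_map([(2.0, -3.0), (40.0, 0.0)])
```

The resulting grid has a single active cell at row 12, column 17; the distant ally leaves no trace, which mirrors the clipping behavior described above.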

A CNN encoder extracts spatial features as

g_{vis, i}^{t} = f_{cnn} (X_{i}^{t}; θ_{cnn}),

(15)

where

g_{vis, i}^{t} \in R^{d_{vis}}

is the visual feature vector. In parallel, the local vector observation is encoded as

g_{vec, i}^{t} = f_{mlp} (o_{i}^{t}; θ_{vec}),

(16)

where

g_{vec, i}^{t} \in R^{d_{vec}}

is the vector feature representation. The two representations are fused into a unified trigger feature as

g_{i}^{t} = ϕ ([g_{vec, i}^{t}, g_{vis, i}^{t}]),

(17)

where

ϕ (\cdot)

is a linear projection layer.

The trigger head first maps the fused representation to a scalar logit:

a_{i}^{t} = f_{tri} (g_{i}^{t}),

(18)

where

a_{i}^{t}

denotes the unnormalized trigger logit. The logit is then converted into a continuous activation score through a sigmoid function:

p_{i}^{t} = σ (a_{i}^{t}),

(19)

where

p_{i}^{t} \in (0, 1)

represents the communication activation score. A binary communication trigger is obtained by applying a hard threshold function:

T_{i}^{t} = H (p_{i}^{t} - 0.5),

(20)

where

H (\cdot)

is the Heaviside step function, defined as

H (x) = \begin{cases} 1, & x > 0, \\ 0, & x \leq 0 . \end{cases}

(21)

Thus,

T_{i}^{t} = 1

means that agent i transmits its message at time step t, while

T_{i}^{t} = 0

means that the message is suppressed. We adopt

0.5

as the decision boundary because

p_{i}^{t}

is produced by a sigmoid function and therefore has a natural midpoint between inactive and active communication states.

The hard threshold function is non-differentiable and has zero gradients almost everywhere. To enable end-to-end training, we adopt a straight-through estimator (STE). In the forward pass, the model uses the hard binary trigger

T_{i}^{t} = H (p_{i}^{t} - 0.5)

. In the backward pass, the gradient of the hard threshold is approximated by the identity mapping:

\frac{\partial T_{i}^{t}}{\partial p_{i}^{t}} \approx 1 .

(22)

Equivalently, this can be interpreted as using the discrete trigger for message transmission while allowing gradients to flow through the continuous activation score during optimization. As a result, the trigger module can receive gradients from the temporal-difference loss through the communication-dependent utility estimation pathway:

\frac{\partial L_{TD}}{\partial θ_{tri}} \approx \frac{\partial L_{TD}}{\partial T_{i}^{t}} \frac{\partial p_{i}^{t}}{\partial θ_{tri}} .

(23)
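The forward and backward behavior of the trigger can be illustrated with a hand-written gradient. This is a minimal sketch rather than the authors' implementation: the function names are hypothetical, and in practice an automatic differentiation framework would handle the backward pass.

```python
import math

def trigger_forward(logit):
    """Eqs. (19)-(20): sigmoid activation score p and hard binary trigger T."""
    p = 1.0 / (1.0 + math.exp(-logit))
    t = 1 if p - 0.5 > 0 else 0  # Heaviside step with H(0) = 0
    return p, t

def trigger_backward(grad_t, p):
    """Straight-through backward pass for one trigger.

    grad_t is dL/dT coming from the utility pathway. Eq. (22) treats dT/dp
    as 1, so dL/dp equals grad_t; the sigmoid then contributes
    dp/da = p * (1 - p).
    """
    grad_p = grad_t * 1.0
    return grad_p * p * (1.0 - p)

p, t = trigger_forward(0.0)            # p = 0.5, so the trigger stays off
grad_logit = trigger_backward(2.0, p)  # 2.0 * 0.5 * 0.5
```

Even though the trigger at a zero logit is inactive in the forward pass, the logit still receives a non-zero gradient, which is what allows the TD loss to move the activation score across the decision boundary.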

It is important to distinguish the binary decision boundary from the regularization threshold. The value

0.5

is used only to convert the sigmoid activation score into a binary trigger. By contrast, the parameter

δ

is not used for hard trigger generation. Instead,

δ

appears in the communication regularization term and determines the activation level above which excessive communication is penalized. Therefore, the proposed design decouples message transmission from communication regularization:

T_{i}^{t}

controls whether communication occurs in the forward pass, whereas

δ

controls how strongly high activation scores are discouraged during training.

The trigger network is optimized through two coupled pathways. First, the temporal-difference loss backpropagates through the communication-dependent utility estimation path, since the binary trigger affects message transmission, message aggregation, and ultimately the individual utilities and joint action value. Second, the excess-activation regularizer directly penalizes large activation scores

p_{i}^{t}

. The STE enables gradients from the TD loss to pass through the non-differentiable binary trigger, while the regularization term provides an additional continuous supervision signal for controlling communication tendency.

3.3. Communication-Augmented QMIX

We adopt a QMIX-based value decomposition framework under centralized training and decentralized execution (CTDE) [5]. The joint action value function is defined as

Q_{tot} (τ^{t}, u^{t}) = f_{mix} (\{Q_{i} (τ_{i}^{t}, u_{i}^{t})\}_{i = 1}^{N}, s^{t}; θ_{mix}),

(24)

where

s^{t}

denotes the global state vector available only during training, and

f_{mix} (\cdot)

is the mixing network constrained to satisfy the monotonicity condition

\frac{\partial Q_{tot}}{\partial Q_{i}} \geq 0, \forall i .

(25)
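The monotonicity condition can be illustrated with a deliberately simplified mixer. QMIX itself uses state-conditioned hypernetworks and a hidden layer; the single-layer form below, the function name, and the numeric values are assumptions chosen to keep the example short.

```python
def monotonic_mix(agent_qs, raw_weights, bias):
    """Single-layer mixer: Q_tot = sum_i |w_i| * Q_i + b.

    Taking absolute values makes every dQ_tot/dQ_i non-negative.
    raw_weights and bias stand in for hypernetwork outputs.
    """
    return sum(abs(w) * q for w, q in zip(raw_weights, agent_qs)) + bias

# A negative raw weight is still mixed with a non-negative coefficient.
q_low = monotonic_mix([1.0, 2.0], [-0.5, 2.0], 0.25)
q_high = monotonic_mix([2.0, 2.0], [-0.5, 2.0], 0.25)
```

Raising the first agent's utility from 1.0 to 2.0 raises the mixed value from 4.75 to 5.25, so an agent can never lower the team value by improving its own utility.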

Due to the monotonic mixing constraint, the maximizing joint action can be obtained by independently selecting the greedy action of each agent. Let

u_{i}^{⋆} = arg max_{u_{i}} Q_{i} (τ_{i}^{t + 1}, u_{i}),

(26)

and denote the corresponding joint greedy action as

u^{⋆} = (u_{1}^{⋆}, \dots, u_{N}^{⋆})

. Then the TD target is written as

y^{t} = r^{t} + γ (1 - d^{t}) Q_{tot}^{-} (τ^{t + 1}, u^{⋆}),

(27)

where

r^{t}

is the team reward,

γ

is the discount factor,

d^{t} \in \{0, 1\}

is the episode termination indicator, and

Q_{tot}^{-}

denotes the target network.
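The target computation in Equations (26) and (27) can be sketched as follows. Equation (26) writes the greedy action with the utilities Q_i, while Equation (27) evaluates it with the target network; the sketch follows that reading, which is an assumption, and it takes the target mixer as an arbitrary callable. All names and numbers are hypothetical.

```python
def greedy_actions(next_agent_qs):
    """Eq. (26): each agent independently picks its own argmax action."""
    return [max(range(len(q)), key=q.__getitem__) for q in next_agent_qs]

def td_target(reward, done, gamma, online_next_qs, target_next_qs, target_mixer):
    """Eq. (27): y = r + gamma * (1 - d) * Q_tot^-(tau', u*)."""
    u_star = greedy_actions(online_next_qs)
    # Evaluate the selected actions with the target utilities.
    chosen = [q[a] for q, a in zip(target_next_qs, u_star)]
    return reward + gamma * (1 - done) * target_mixer(chosen)

online_next = [[0.1, 0.9], [0.7, 0.2]]
target_next = [[1.0, 2.0], [3.0, 4.0]]
y = td_target(1.0, 0, 0.5, online_next, target_next, sum)
y_terminal = td_target(1.0, 1, 0.5, online_next, target_next, sum)
```

Here the built-in `sum` plays the role of a trivially monotonic target mixer. On a terminal transition the bootstrap term vanishes and the target reduces to the reward.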

The temporal-difference loss is

L_{TD} = E_{B} [{(y^{t} - Q_{tot} (τ^{t}, u^{t}))}^{2}],

(28)

where

B

denotes the replay batch.

To encourage communication efficiency without over-suppressing useful message exchange, we introduce a communication regularization term based on excess trigger activation as

L_{comm} = E_{B} [\frac{1}{N H} \sum_{t = 1}^{H} \sum_{i = 1}^{N} max (0, p_{i}^{t} - δ)],

(29)

where H is the episode horizon (not to be confused with the Heaviside function of Equation (21)), N is the number of agents, and

δ \in (0, 1)

is a prescribed activation threshold. Under this design, trigger activations below

δ

incur no additional penalty, while activations above

δ

are penalized proportionally to their excess magnitude. The threshold

δ

is a scenario-dependent hyperparameter controlling the activation level at which communication is penalized. Its concrete values and selection strategy are described in detail in Section 4. Since the regularization is applied to the continuous activation score

p_{i}^{t}

rather than the binary trigger

T_{i}^{t}

, it provides a smoother optimization signal for controlling communication intensity. Therefore, a higher

δ

corresponds to a looser communication constraint, since a larger activation range is exempt from regularization.
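The excess-activation penalty of Eq. (29) can be sketched for a single episode as follows. This is a minimal Python illustration, not the training code: the expectation over the replay batch is omitted, and the function name and the activation scores are hypothetical.

```python
def comm_penalty(p, delta):
    """Eq. (29) for one episode: mean over steps and agents of max(0, p - delta).

    p is a list of H time steps, each a list of N activation scores in [0, 1].
    """
    horizon, n_agents = len(p), len(p[0])
    excess = sum(max(0.0, score - delta) for step in p for score in step)
    return excess / (n_agents * horizon)

# Hypothetical activation scores: H = 2 steps, N = 3 agents.
scores = [[0.2, 0.5, 0.9], [0.3, 0.4, 0.1]]
low = comm_penalty(scores, delta=0.3)   # tighter constraint
high = comm_penalty(scores, delta=0.4)  # looser constraint
print(low, high)
```

For the same scores, the larger threshold yields the smaller penalty, which matches the reading of a higher δ as a looser communication constraint.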

The empirical communication rate is computed as the average proportion of active triggers over all agents and time steps:

ρ_{comm} = \frac{1}{N H} \sum_{t = 1}^{H} \sum_{i = 1}^{N} T_{i}^{t} .

(30)

This metric measures the realized frequency of message transmission during execution, whereas L_{comm} regularizes the continuous activation scores during training. It is used to assess communication efficiency in Sections 4.1 and 4.3.
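A minimal Python sketch of Eq. (30) is given below. The hard binarization with a fixed cut of 0.5 is an assumption made only to produce binary triggers for the example; the function names and scores are hypothetical.

```python
def comm_rate(triggers):
    """Eq. (30): fraction of active binary triggers over H steps and N agents."""
    horizon, n_agents = len(triggers), len(triggers[0])
    return sum(sum(step) for step in triggers) / (n_agents * horizon)

def binarize(scores, cut=0.5):
    """Hypothetical hard trigger (assumption): T = 1 when the score reaches the cut."""
    return [[1 if s >= cut else 0 for s in step] for step in scores]

# Hypothetical activation scores: H = 2 steps, N = 3 agents.
scores = [[0.2, 0.5, 0.9], [0.3, 0.4, 0.1]]
rate = comm_rate(binarize(scores))
print(rate)
```

The sketch makes the distinction in the text concrete: the rate counts binary transmissions, while the penalty in Eq. (29) acts on the continuous scores.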

The overall training objective is

L = L_{TD} + λ_{comm} L_{comm},

(31)

where λ_{comm} controls the trade-off between task performance and communication efficiency.

4. Experiments

We evaluate the proposed method on the StarCraft Multi-Agent Challenge (SMAC) benchmark, based on StarCraft II version 4.10. SMAC is a cooperative micromanagement benchmark in which each learning agent controls one allied unit and coordinates with teammates to defeat enemy units under partial observability. The experiments are conducted on four representative scenarios: 2s3z, where two allied Stalkers and three allied Zealots fight against an enemy team with the same unit composition; 10m_vs_11m, where ten allied Marines fight against eleven enemy Marines; MMM, where a heterogeneous team composed of Marines, Marauders, and a Medivac fights against a comparable enemy group; and 1c3s5z, where one Colossus, three Stalkers, and five Zealots form a heterogeneous combat group.

These scenarios cover homogeneous and heterogeneous unit compositions, symmetric and asymmetric combat settings, and different coordination difficulties. The 2s3z scenario is a symmetric mixed-unit task, where agents must coordinate ranged Stalkers and melee Zealots, making it suitable for evaluating formation control, target selection, and cooperation between units with different roles. The 10m_vs_11m scenario tests a homogeneous Marine team under numerical disadvantage, where successful policies require concentrated fire, spacing control, and effective attack–move transitions. The MMM and 1c3s5z scenarios further emphasize heterogeneous role coordination: MMM requires cooperation between damage-dealing Marines and Marauders and the healing Medivac, while 1c3s5z evaluates front-line protection, long-range damage support, and coordinated engagement among units with different attack ranges and tactical functions.

Learning curves are obtained by evaluating the policy on test episodes every 10k environment steps, and the shaded areas in the figures denote the standard deviation across runs. Depending on the scenario, the total training budget ranges from 1.2M to 2.05M environment steps. For scalar summaries, we report the average over the last 80 evaluations. To improve reproducibility, the main simulation settings and hyperparameters, including random seeds, software version, communication radius, message dimension, learning rate, batch size, replay buffer size, and the communication coefficient λ_{comm}, are summarized in Table A1 of Appendix A.

To study both task-level performance and behavior-level policy variation under communication control, we carry out three groups of experiments: multi-map performance comparison, communication effect evaluation, and ablation analysis. The results are reported using both task-level metrics and behavior-level metrics.

For task-level evaluation, we use win rate as the primary performance metric. For behavior-level evaluation, we consider several interpretable metrics that characterize coordination and micro-control behaviors. Unless otherwise stated, all reported behavior metrics are computed from test episodes run every 10k training steps and summarized over the last 80 evaluation records.

Minimum separation mean measures local spatial coordination:

{minsep}_{mean} = \frac{1}{T} \sum_{t = 1}^{T} \frac{1}{A} \sum_{i = 1}^{A} min_{j \neq i} {∥ x_{i}^{t} - x_{j}^{t} ∥}_{2} .

(32)

Here, T denotes the total number of time steps, A is the number of agents, and x_{i}^{t} is the position vector of agent i at time t. Larger values indicate more dispersed formations.
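Eq. (32) can be computed from logged positions as in the following Python sketch. It assumes 2D coordinates and at least two agents per step; the function name and the trajectory are hypothetical.

```python
import math

def minsep_mean(positions):
    """Eq. (32): positions[t][i] is the (x, y) position of agent i at step t."""
    total = 0.0
    for step in positions:
        # Distance from each agent to its nearest teammate at this step.
        nearest = [
            min(math.dist(xi, xj) for j, xj in enumerate(step) if j != i)
            for i, xi in enumerate(step)
        ]
        total += sum(nearest) / len(step)
    return total / len(positions)

# Hypothetical trajectory: T = 2 steps, A = 3 agents.
traj = [
    [(0.0, 0.0), (3.0, 4.0), (0.0, 1.0)],
    [(0.0, 0.0), (0.0, 2.0), (2.0, 0.0)],
]
value = minsep_mean(traj)
print(value)
```

The sketch does not handle dead units; in practice only agents alive at step t would presumably enter the inner average, which is an assumption not stated in the text.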

Focus-fire quantifies coordinated targeting:

{FocusFire}_{mean} = \frac{1}{| T_{attack} |} \sum_{t \in T_{attack}} max_{j} \sum_{i = 1}^{A} I (a_{i}^{t} = j) .

(33)

Here, T_{attack} is the set of attack time steps, and a_{i}^{t} denotes the action executed by agent i at time t. When a_{i}^{t} = j, agent i attacks enemy j at time t. Higher values indicate stronger coordination.
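The focus-fire metric of Eq. (33) can be sketched in Python as follows. The sketch assumes that non-attack actions are logged as None and that T_{attack} contains every step in which at least one agent attacks; both conventions, the function name, and the log are hypothetical.

```python
from collections import Counter

def focus_fire_mean(targets):
    """Eq. (33): targets[t][i] is the enemy index attacked by agent i, or None.

    Only steps with at least one attack are counted (assumed T_attack).
    """
    peaks = []
    for step in targets:
        attacked = [j for j in step if j is not None]
        if attacked:
            # Largest number of agents attacking the same enemy at this step.
            peaks.append(Counter(attacked).most_common(1)[0][1])
    return sum(peaks) / len(peaks) if peaks else 0.0

# Hypothetical log: 3 steps, 4 agents.
log = [
    [0, 0, 1, None],
    [None, None, None, None],
    [2, 2, 2, 1],
]
value = focus_fire_mean(log)
print(value)
```

The value is bounded above by the number of agents, so scores are comparable only between scenarios with the same team size.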

Target entropy evaluates target diversity:

{Entropy}_{mean} = \frac{1}{| T_{attack} |} \sum_{t \in T_{attack}} - \sum_{j = 1}^{E} ρ_{j}^{t} log ρ_{j}^{t} / log (E) .

(34)

Here, E is the number of enemies, and ρ_{j}^{t} is the proportion of agents attacking enemy j at time t. Lower values indicate more concentrated targeting. In our logs, this metric is reported as bucket_entropy.
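A minimal Python sketch of Eq. (34) follows. It assumes that ρ_{j}^{t} is normalized over the agents that attack at step t, that non-attack actions are logged as None, and that E > 1; the function name and inputs are hypothetical.

```python
import math
from collections import Counter

def target_entropy_mean(targets, n_enemies):
    """Eq. (34): normalized entropy of the attack-target distribution."""
    values = []
    for step in targets:
        attacked = [j for j in step if j is not None]
        if not attacked:
            continue  # step is not in T_attack
        probs = [c / len(attacked) for c in Counter(attacked).values()]
        entropy = -sum(p * math.log(p) for p in probs)
        values.append(entropy / math.log(n_enemies))
    return sum(values) / len(values) if values else 0.0

concentrated = target_entropy_mean([[0, 0, 0, 0]], n_enemies=4)
uniform = target_entropy_mean([[0, 1, 2, 3]], n_enemies=4)
split = target_entropy_mean([[0, 0, 1, 1]], n_enemies=4)
print(concentrated, uniform, split)
```

Enemies that receive no attacks contribute nothing to the sum, consistent with the convention 0 log 0 = 0.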

Attack_move ratio measures micro-control behavior:

AttackMoveRatio = \frac{\sum_{t = 2}^{T} \sum_{i = 1}^{A} I ({Attack}_{i}^{t - 1} \land {Move}_{i}^{t})}{\sum_{t = 2}^{T} \sum_{i = 1}^{A} I ({Attack}_{i}^{t - 1}) + ϵ} .

(35)

Here, ϵ is a small constant to avoid division by zero, and {Attack}_{i}^{t} and {Move}_{i}^{t} are indicator variables of whether agent i executes an attack action or a movement action at time t, respectively. Higher values indicate that attack actions are more often followed by movement, i.e., more frequent attack-to-move transitions.
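Eq. (35) can be sketched in Python as follows, assuming that each logged action has already been mapped to a coarse label; the labels, the function name, and the log are hypothetical.

```python
def attack_move_ratio(actions, eps=1e-8):
    """Eq. (35): actions[t][i] is 'attack', 'move', or another label."""
    followed, attacks = 0, 0
    # Pair each step t-1 with its successor t, per agent.
    for prev, curr in zip(actions, actions[1:]):
        for a_prev, a_curr in zip(prev, curr):
            if a_prev == "attack":
                attacks += 1
                followed += a_curr == "move"
    return followed / (attacks + eps)

# Hypothetical log: 3 steps, 2 agents.
log = [
    ["attack", "move"],
    ["move", "attack"],
    ["attack", "attack"],
]
ratio = attack_move_ratio(log)
print(ratio)
```

Attacks at the final step are not counted in the denominator because they have no successor step, which mirrors the summation range t = 2, ..., T.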

To study communication under controlled regimes, we introduce a threshold-based excess-activation penalty calibrated from the steady-state statistics of trigger activations. Specifically, for each scenario, we first run a reference setting to estimate the stable-phase trigger activation level after convergence, and denote the corresponding mean activation rate by μ_{s}. We use the steady-state activation statistics only as a calibration reference for selecting scenario-specific thresholds, rather than as part of training. Based on this calibration anchor, we define three threshold settings:

δ_{low} = μ_{s}, δ_{mid} = μ_{s} + 0.05, δ_{high} = μ_{s} + 0.1 .

(36)

Under the proposed excess-activation penalty design, communication is penalized only when the trigger activation exceeds the prescribed threshold. Therefore, a higher threshold corresponds to a looser communication constraint and typically leads to a higher realized communication rate.

Table 1 summarizes the communication threshold statistics used in the experiments. Across the four SMAC scenarios, the estimated stable-phase mean activation rate μ_{s} remains relatively concentrated, ranging from 0.298 on 2s3z to 0.308 on MMM. Based on these scenario-specific calibration anchors, we construct three comparable communication regimes, namely δ_{low}, δ_{mid}, and δ_{high}, which are used throughout the subsequent experiments.
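As a worked example of Eq. (36), the following Python sketch derives the three thresholds from the stable-phase means quoted above for 2s3z and MMM. The function name and the rounding to three decimals are illustrative assumptions.

```python
def threshold_settings(mu_s, offsets=(0.0, 0.05, 0.10)):
    """Eq. (36): low, mid, and high thresholds anchored at mu_s."""
    low, mid, high = (round(mu_s + o, 3) for o in offsets)
    return {"low": low, "mid": mid, "high": high}

# Stable-phase mean activation rates quoted in the text for two scenarios.
anchors = {"2s3z": 0.298, "MMM": 0.308}
settings = {name: threshold_settings(mu) for name, mu in anchors.items()}
print(settings)
```

Because the offsets are shared and only the anchor changes, the three regimes stay comparable across scenarios.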

To ensure meaningful separation between communication regimes, we further carry out a preliminary sensitivity analysis around μ_{s}. In this analysis, the threshold offset is applied relative to μ_{s}, and ally minimum separation (ally_minsep) is used as the primary diagnostic metric because it most directly reflects formation-level coordination and spatial control.

As shown in Table 2, offsets around 0.05 and 0.10 yield distinguishable yet stable changes in formation behavior, while larger offsets do not consistently provide further separation. Therefore, we adopt 0.05 and 0.10 as the two positive threshold offsets. We do not claim that these offsets are universally optimal; rather, they are used to define reproducible communication regimes for comparative analysis.

We then assess our method under these three regimes and compare it with the original QMIX baseline without explicit communication control, enabling a systematic analysis of the trade-off between communication cost and task performance.

4.1. Performance on Multi-Map Tasks

To study whether communication control can preserve performance while reducing communication, we first carry out a multi-map comparison across the four SMAC scenarios using win rate as the primary evaluation metric. Figure 3 illustrates the training dynamics under different communication thresholds, while Table 3 reports the final win rate averaged over the last 80 evaluation records.

The results show that all threshold settings converge to competitive performance across the four scenarios. While the learning curves differ in stability, the final win rates remain comparable, indicating that communication control does not degrade overall task performance. As shown in Table 3, several threshold-controlled settings outperform the QMIX baseline.

Specifically, on 2s3z, the proposed method achieves a best win rate of 0.974 under δ_{high}, slightly improving over the QMIX baseline (0.972) and outperforming the mid-threshold setting (0.966). On MMM, the improvement is more pronounced: The low-threshold setting reaches 0.966, higher than the QMIX baseline (0.921), corresponding to an absolute gain of 4.5 percentage points (a relative gain of approximately 4.9%). On 10m_vs_11m, the best performance is achieved at δ_{mid} with a win rate of 0.884, compared to 0.856 for QMIX. Similarly, on 1c3s5z, the highest win rate (0.955) is obtained under δ_{high}, exceeding the baseline performance (0.941). Overall, these results show that the proposed communication mechanism maintains competitive performance across thresholds and, at its best threshold, improves over the baseline on all four scenarios, with the largest gains on the more challenging MMM and 10m_vs_11m tasks.

In addition, high-threshold settings (corresponding to higher communication frequency under the proposed excess-activation penalty design) tend to exhibit more pronounced fluctuations during training. This effect is especially visible in maps such as MMM and 2s3z, where the corresponding curves show noticeable oscillations before convergence. We attribute this phenomenon to the presence of redundant or noisy information under high communication intensity, which can interfere with stable policy updates. In contrast, lower-threshold settings impose stronger communication suppression, often leading to smoother training dynamics and more stable convergence behavior. These observations further support that effective coordination benefits from controlled, rather than excessive, communication.

To further study communication efficiency under the same threshold settings, we report the realized communication rate in Figure 4. The results show that the proposed threshold-based excess-activation penalty produces clearly distinguishable communication patterns across all scenarios.

Specifically, on MMM, the communication rate increases from 0.298 to 0.384 across different threshold settings; on 2s3z, it ranges from 0.277 to 0.378; on 10m_vs_11m, from 0.285 to 0.376; and on 1c3s5z, from 0.311 to 0.397. Since a higher threshold delays the onset of regularization, these results are consistent with the intended behavior of the proposed penalty design. Despite variations across tasks, the overall communication level remains consistently moderate (approximately 0.27–0.40), indicating that the proposed mechanism effectively avoids excessive message passing.

Importantly, when compared to the reference activation levels (shown by the background bars), the effective communication rate is reduced by a substantial margin. Here, the retained communication ratio is defined as the realized communication rate normalized by the corresponding reference activation level. The retained communication ratio is typically around 56% to 76%, demonstrating that a significant portion of potential communication is filtered out by the trigger mechanism.

Combining these observations with the win rate results, we find that reducing communication does not degrade performance. On the contrary, several settings with lower communication frequency achieve equal or better performance than the baseline. This suggests that the proposed threshold mechanism successfully filters out redundant interactions while preserving critical information exchange, leading to more efficient and stable multi-agent coordination.

4.2. Communication Effect Evaluation

To study how communication control changes the learned policy beyond task-level win rate, we carry out a behavior-level analysis using focus-fire, bucket entropy, ally minimum separation, and attack_move ratio as complementary evaluation metrics. Figure 5 reports the corresponding results across the four SMAC scenarios.

4.2.1. Threshold-Dependent Policy Shaping Analysis

The results show that communication control leads to distinguishable policy variations across multiple behavioral dimensions. To provide a more compact quantitative summary, Table 4 reports the converged behavior-level statistics under different communication thresholds. The values are estimated from the stable segments of Figure 5 and summarized as mean ± standard deviation.

The metrics in Table 4 are used to interpret how threshold-controlled sparse communication affects the learned coordination policy rather than merely reducing the number of transmitted messages. Specifically, the attack_move ratio reflects the agents’ tendency to engage in combat actions, focus-fire measures whether multiple agents concentrate attacks on common targets, ally minimum separation reflects formation spacing and local collision-avoidance behavior, and bucket entropy measures the dispersion of target selection. Therefore, changes in these metrics provide behavioral evidence of how different communication thresholds reshape cooperative policies.

It should be noted that the threshold parameter δ controls the strength of excess-activation regularization rather than the hard binary decision boundary of the trigger. A smaller δ imposes a stronger penalty on high communication activation scores, thereby encouraging more selective and sparse communication. In this regime, agents are forced to rely more heavily on locally salient and behaviorally necessary messages. A larger δ relaxes this regularization and allows more communication signals to enter the utility-estimation pathway. As a result, different threshold levels change the amount and type of information incorporated into the recurrent hidden state, which further affects individual utility estimation and the resulting decentralized action choices.

The statistical trends in Table 4 show that the effect of threshold selection is scenario-dependent. In the relatively simple 2s3z scenario, the three thresholds produce very similar attack–move and focus-fire values, with focus-fire remaining around 2.35–2.36. The small variation across thresholds indicates that this scenario has relatively low sensitivity to communication sparsity, since the required cooperative behavior can be learned with limited additional communication. This suggests that sparse communication mainly reduces redundant information exchange in simple coordination tasks without substantially changing the learned policy structure.

In contrast, more complex scenarios exhibit clearer threshold-dependent behavioral changes. In 10m_vs_11m, δ_{high} achieves the highest focus-fire score (3.36 ± 0.17), while δ_{mid} obtains the lowest bucket entropy (0.26 ± 0.02). This indicates that relaxing communication regularization can strengthen shared target information and improve concentrated attack behavior, whereas a moderate threshold can produce more stable target-selection concentration. In MMM, δ_{low} yields the highest attack_move ratio (0.94 ± 0.10) and focus-fire score (3.47 ± 0.25), suggesting that stronger communication sparsity can encourage agents to preserve only highly task-relevant signals and execute more decisive local combat behaviors. However, δ_{high} produces the lowest bucket entropy (0.24 ± 0.03), indicating that a more relaxed communication regime can help align target selection across heterogeneous agents. In 1c3s5z, δ_{high} improves several coordination indicators, including focus-fire (3.25 ± 0.25), ally minimum separation (0.0082 ± 0.0010), and bucket entropy (0.33 ± 0.04), implying that this heterogeneous scenario benefits from richer communication context when coordinating different unit types.

These observations suggest a mechanism-level interpretation of policy shaping. Lower thresholds impose stronger communication regularization and encourage agents to transmit only behaviorally salient information, which can promote compact and decisive local coordination. Middle thresholds provide a compromise between communication suppression and information sharing, often yielding stable target-selection behavior. Higher thresholds allow more communication information to be incorporated into the utility-estimation process, which can improve global target alignment and heterogeneous unit coordination in more difficult scenarios. Therefore, sparse communication affects the policy not only by changing communication frequency, but also by changing which information is available when agents estimate their utilities and select decentralized actions.

Overall, the results support the claim that threshold-controlled sparse communication reshapes coordination behavior in a scenario-dependent manner. The behavioral differences across attack tendency, focus-fire, formation spacing, and target-selection entropy provide empirical evidence that the communication threshold influences the learned policy structure. However, the effect is not strictly monotonic across all metrics or scenarios. Instead, different thresholds emphasize different dimensions of coordination, indicating that the threshold should be regarded as a behavior-shaping factor that balances communication efficiency, local decision decisiveness, and global cooperative alignment.

4.2.2. Effect of Task Complexity Across Maps

To study whether the benefit of communication control depends on task difficulty, we compare the performance and behavioral spread across maps of different complexity. The results show that the impact of communication becomes more pronounced as task complexity increases.

In 2s3z, the performance gap between the best and worst threshold settings is small (0.974 vs. 0.966, gap 0.008), whereas in 10m_vs_11m, the gap is substantially larger (0.884 vs. 0.856, gap 0.028). Behavioral differences follow the same trend. In 2s3z, focus-fire varies within a relatively narrow range of approximately 2.420–2.507, while in 10m_vs_11m it spans roughly 3.126–3.414. Similarly, the attack–move ratio exhibits a visibly larger spread in the more complex scenarios.

These results indicate that communication control is increasingly critical in complex tasks, where improper communication can more easily degrade coordination quality. In contrast, selective communication enables more stable and effective policies as coordination demands grow.

4.3. Ablation Study

4.3.1. Effect of Disabling Communication

To study the contribution of the communication mechanism itself, we introduce a static no-communication control, where the communication module remains in the architecture but its output is fixed to zero by forcing the trigger variable to be inactive for all agents at all time steps. This keeps the network structure unchanged and ensures a fair comparison, while fully disabling observation-dependent communication.
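The static no-communication control can be pictured with the following Python sketch, in which every trigger is forced to zero so that the gated messages vanish while their shapes are preserved. This is a schematic illustration only: the gating function, its `force_off` flag, and the message values are hypothetical and do not reproduce the actual communication module.

```python
def gate_messages(messages, triggers, force_off=False):
    """Zero out each agent's message unless its trigger is active.

    With force_off=True every trigger is treated as inactive, mimicking the
    static no-communication control while keeping message shapes unchanged.
    """
    gated = []
    for msg, trig in zip(messages, triggers):
        active = 0 if force_off else trig
        gated.append([active * m for m in msg])
    return gated

# Hypothetical messages for two agents with two-dimensional payloads.
msgs = [[0.5, -1.0], [2.0, 0.25]]
trigs = [1, 0]
normal = gate_messages(msgs, trigs)
ablated = gate_messages(msgs, trigs, force_off=True)
print(normal, ablated)
```

Because the downstream network receives an all-zero input of the same dimension, parameter counts and architecture are identical between the two conditions.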

Figure 6 presents the results on the MMM scenario. At the task level, disabling communication leads to only a marginal degradation in final performance, with the win rate decreasing from approximately 0.954 to 0.935. This indicates that agents can still learn a reasonably effective policy without communication.

The results show that substantial differences emerge in behavior-level metrics even though the win-rate drop is limited. Specifically, the attack–move ratio drops markedly from approximately 0.824 to about 0.521, indicating a clear reduction in attack–move micro-control. At the same time, the focus-fire value increases from around 3.355 to 3.653, suggesting that agents tend to concentrate fire more aggressively on a single target. In addition, the episode length decreases from roughly 68 to 55, reflecting shorter and less sustained engagements.

These results reveal a meaningful shift in the learned policy structure. Without communication, agents tend to adopt a simplified strategy characterized by more static positioning and direct damage exchange, leading to faster but less adaptive interactions. In contrast, communication enables agents to coordinate movement and timing, resulting in more refined micro-control and prolonged engagements.

Overall, although communication brings only limited gains in final win rate on MMM, it plays a critical role in shaping coordinated behaviors. In this ablation setting, communication mainly improves micro-control and policy quality rather than merely increasing the final task-level metric.

4.3.2. Applicability Across MARL Backbones

To study whether the proposed communication framework is specific to QMIX or can be transferred to other MARL algorithms, we further integrate it into multiple backbones, including parameter-sharing Deep Q-Network (DQN) [36], Multi-Agent Proximal Policy Optimization (MAPPO) [37], QMIX [5], and Multi-Agent Deep Deterministic Policy Gradient (MADDPG) [23]. In all cases, the same communication module is inserted before action selection, providing a unified interface across value-based and actor–critic methods.

Figure 7 shows the win-rate comparison on 2s3z and MMM. Importantly, all curves in this figure correspond to backbone methods equipped with the proposed communication framework. The results show that, across all methods, the communication-enabled variants remain trainable under identical training settings, indicating that the proposed framework is broadly compatible with different MARL backbones.

However, the effectiveness of communication-aware learning varies substantially across different backbones. In particular, QMIX consistently achieves the highest performance. On 2s3z, QMIX reaches a win rate of approximately 0.95–1.00, outperforming DQN (about 0.75–0.85), MADDPG (about 0.45–0.65), and MAPPO (about 0.10–0.30). A similar trend is observed on MMM, where QMIX maintains a win rate close to 0.90–1.00, while the other methods remain substantially lower.

These results indicate that the proposed communication framework is broadly compatible with different MARL algorithms, but its effectiveness is not uniform across learning paradigms. A plausible explanation is that QMIX, with explicit cooperative value decomposition, is better aligned with shared communication signals and therefore better able to exploit additional coordination information. By contrast, actor–critic methods such as MAPPO and MADDPG may face more difficult credit assignment and optimization instability under multi-agent interaction noise, which can limit the benefit obtained from communication. Similarly, DQN lacks an explicit coordination mechanism, which may further constrain its performance.

Overall, the results demonstrate that the proposed framework is broadly applicable, but achieves the strongest empirical gains when combined with cooperative value-based methods such as QMIX.

4.3.3. Comparison with Mainstream Communication Methods

To further evaluate the effectiveness of the proposed trigger-based communication mechanism, we additionally compare our method with several representative communication-based MARL approaches, including ATOC [9], SchedNet [38], and G2ANet [39]. Following the same experimental protocol, all methods are trained under identical QMIX backbones, replay-buffer settings, training budgets, and evaluation procedures. The comparison focuses on two aspects: task performance (battle win rate) and communication efficiency (communication rate).

From the results on the 2s3z map shown in Figure 8, all methods eventually converge to high win rates, but their convergence behaviors and communication costs differ significantly. ATOC achieves rapid early-stage improvement and reaches a battle win rate of approximately 0.75 at 0.1 × 10^{6} environment steps, outperforming SchedNet (≈0.28) and our method (≈0.45) during the initial exploration stage. However, ATOC continuously increases its communication activity throughout training, with the communication rate rising from nearly 0.10 to around 0.40 after convergence. SchedNet maintains an even higher communication rate between 0.35 and 0.50 over the entire training process. In contrast, the proposed method gradually reduces communication usage from approximately 0.48 in the early stage to around 0.20 after convergence, while still maintaining a final battle win rate close to 0.99. Compared with ATOC and SchedNet, our approach therefore achieves similar asymptotic performance using nearly half the communication frequency.

The comparison on the more challenging MMM scenario further demonstrates the robustness and communication efficiency of the proposed method. As illustrated in Figure 8, ATOC and SchedNet exhibit obvious instability during later training stages. In particular, the win rate of ATOC drops sharply to nearly 0.10 around 0.8 × 10^{6} environment steps, while SchedNet decreases to below 0.20 near 1.0 × 10^{6} steps. By contrast, our method maintains a stable win rate above 0.95 after approximately 0.35 × 10^{6} steps and finally converges close to 1.0. Meanwhile, the communication rate of our method decreases from approximately 0.52 to around 0.24 during training, whereas G2ANet continuously increases its communication rate to more than 0.55 in the later stage. Although G2ANet achieves competitive final performance, it requires substantially denser communication interactions. Overall, the results indicate that the proposed trigger-based communication mechanism can effectively suppress redundant communication while preserving stable cooperative behaviors and strong task performance across different SMAC scenarios.

5. Discussion

The results suggest that strong coordination in cooperative MARL does not strictly rely on dense or continuous communication. In practice, communication that is only triggered when necessary is often sufficient to reach comparable—if not better—performance, while noticeably reducing communication overhead. This, to some extent, implies that excessive communication may do more harm than good: Redundant or noisy signals can accumulate and disrupt stable policy learning. By contrast, selectively activated communication appears to coincide more naturally with moments that genuinely require coordination. Beyond efficiency considerations, communication also seems to play a deeper role in shaping how policies evolve. Rather than acting purely as an information channel, it influences the structure of agent behaviors. When communication is maintained at an appropriate level of sparsity, coordination becomes more organized: Focus-fire behavior is more consistent, attack–move execution appears smoother, and formation control exhibits greater stability. On the other hand, when communication occurs too frequently, training tends to become less stable, often accompanied by larger performance fluctuations. This pattern suggests that an excess of signals may interfere with the gradual formation of effective coordination strategies.

The introduction of a lightweight CNN-based spatial encoder further refines this process. By incorporating local neighborhood information—such as agent density and relative positioning—the trigger module becomes sensitive to spatial context that is directly interpretable. Consequently, communication decisions are no longer based solely on latent features but are grounded in observable structural cues. Empirically, this shift is reflected in clearer behavioral differences across communication regimes, indicating that communication can, in effect, modulate how coordination patterns take shape.

Another aspect worth noting is the explicit trade-off between communication cost and coordination quality enabled by the excess-activation penalty threshold. When the threshold is set lower, communication is more tightly constrained, often leading to a substantial reduction in communication frequency without degrading performance—and in some cases even improving it. In contrast, looser thresholds tend to allow unnecessary communication to persist, which can undermine training stability. Taken together, these observations point to a broader implication: In MARL systems, it is not merely the presence of sparse communication that matters, but how carefully that sparsity is regulated to support structured policy development.

Limitations

Several limitations of the proposed framework should be acknowledged. To begin with, the trigger mechanism depends on manually defined hyperparameters, such as threshold values and bias offsets. These parameters may not transfer seamlessly across environments and often require additional tuning as task complexity changes. In addition, the current approach primarily captures local interactions and does not explicitly account for long-range dependencies or global coordination signals, which could become important in scenarios involving large-scale synchronization. Therefore, the conclusions of this study should be interpreted within the evaluated SMAC micromanagement scenarios. Further validation is needed before generalizing the observed benefits of threshold-gated sparse communication to cooperative MARL tasks with different communication topologies, reward structures, reward sparsity levels, or agent scales.

Finally, although the empirical results highlight a clear connection between communication and policy structure, the underlying mechanisms remain insufficiently understood. A more rigorous theoretical account of how communication influences emergent behaviors is still lacking. Future work could therefore explore directions such as adaptive threshold learning, the integration of global communication pathways, and a more formal analysis of communication-driven policy formation.

6. Conclusions

This paper studies the role of communication in cooperative multi-agent reinforcement learning (MARL) from the perspective of coordination under limited bandwidth. We propose a threshold-gated sparse communication framework built upon QMIX, where agents selectively activate communication based on structured changes in local observations. The communication mechanism is integrated into the utility network through neighbor-constrained aggregation, enabling communication to directly influence policy learning. Experiments on the SMAC benchmark show that the proposed approach significantly reduces communication frequency while maintaining or improving task performance. More importantly, the results reveal that, within the evaluated SMAC scenarios, communication plays a critical role in shaping policy behavior: Selective communication leads to more structured coordination patterns, including improved focus-fire consistency, more effective micro-control, and more stable training dynamics. These findings should therefore be interpreted within the specific experimental settings considered in this paper, rather than as a general claim about all cooperative MARL systems. Overall, the results suggest that, for the evaluated tasks, behavior-aware and selective communication can be more beneficial than frequent message exchange, highlighting sparse communication as a promising mechanism for regulating coordination under limited bandwidth.

Author Contributions

Conceptualization, J.L. and R.L.; methodology, J.L.; software, J.L. and R.L.; validation, J.L. and R.L.; formal analysis, J.L.; resources, J.L.; data curation, R.L.; writing—original draft preparation, J.L.; writing—review and editing, N.W.; visualization, J.L. and R.L.; supervision, N.W.; project administration, N.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data are sourced from the SMAC benchmark and are available from the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

MARL	Multi-Agent Reinforcement Learning
CTDE	Centralized Training with Decentralized Execution
SMAC	StarCraft Multi-Agent Challenge
QMIX	Monotonic Q-value mixing method for value decomposition
VDN	Value-Decomposition Networks
TD	Temporal-Difference
CNN	Convolutional Neural Network
GRU	Gated Recurrent Unit
STE	Straight-Through Estimator
DQN	Deep Q-Network
MAPPO	Multi-Agent Proximal Policy Optimization
MADDPG	Multi-Agent Deep Deterministic Policy Gradient
PyMARL	Python Multi-Agent Reinforcement Learning framework
`2s3z`	SMAC scenario with two Stalkers and three Zealots
`10m_vs_11m`	SMAC scenario with ten allied Marines versus eleven enemy Marines
`MMM`	SMAC Marine–Marauder–Medivac scenario
`1c3s5z`	SMAC scenario with one Colossus, three Stalkers, and five Zealots

Appendix A

We provide the network configuration in Table A1 and the training time in Table A2. For each task, we trained our model on an NVIDIA 4090 GPU. We used 1.2 million to 2 million training steps for the SMAC environment.

Table A1. Experimental settings and hyperparameter configurations.

Item	Configuration
StarCraft II version	4.10, Build B75689
Maps	2s3z, 10m_vs_11m, MMM, 1c3s5z
Random seeds	3 independent seeds
Training timesteps	2.05M
Difficulty	7
Step multiplier	8
Communication radius	All allied agents
Message dimension	64
RNN hidden dimension	64
Encoder hidden dimension	128
Trigger hidden dimension	64
Attention hidden dimension	64
Learning rate	$5 \times 10^{- 4}$
Discount factor $γ$	0.99
Batch size	64
Replay buffer size	20,000
Target update interval	200
Gradient clipping norm	10
Exploration schedule	$ϵ$ : 1.0 to 0.05 over 200,000 steps
QMIX mixing embedding dim	32
QMIX hypernetwork	2 layers, hidden dim 64
Communication weight $λ_{comm}$	0.001
Optimizer	RMSprop, $α = 0.99$ , $ϵ = 10^{- 5}$
Activation	ReLU
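
As a reading aid for the exploration schedule in Table A1, the following minimal Python sketch anneals $ϵ$ from 1.0 to 0.05 over 200,000 steps and then holds it constant. The table gives only the endpoints and the duration; the linear shape and the function name `epsilon_at` are assumptions made for illustration and are not confirmed by the paper.

```python
def epsilon_at(step: int,
               start: float = 1.0,
               finish: float = 0.05,
               anneal_steps: int = 200_000) -> float:
    """Exploration rate at a given environment step.

    Endpoints and duration follow Table A1; the linear interpolation
    between them is an assumption, not a detail stated in the paper.
    """
    # Clamp the step into [0, anneal_steps] so epsilon stays at `finish` afterwards.
    frac = min(max(step, 0), anneal_steps) / anneal_steps
    return start + frac * (finish - start)


if __name__ == "__main__":
    for step in (0, 100_000, 200_000, 2_050_000):
        print(step, round(epsilon_at(step), 4))
```

Under this assumed schedule, exploration has fully decayed after roughly the first tenth of the 2.05M training timesteps listed in Table A1.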

Table A2. Training time.

Task	Time (h)
MMM	$25.5 \pm 0.8$
10m_vs_11m	$23.3 \pm 1.6$
1c3s5z	$23.1 \pm 1.8$
2s3z	$15.4 \pm 1.1$

Figure 3. Training win rate under different communication thresholds on four SMAC scenarios: 2s3z, 10m_vs_11m, MMM, and 1c3s5z. Curves are evaluated every 10k environment steps on test episodes, and shaded areas denote the standard deviation across runs.

Figure 4. Average communication rate under different threshold settings across four SMAC scenarios.

Figure 5. Strategy metrics under different communication thresholds. The whole figure is organized into four scenario-level subfigures: (a) 2s3z, (b) 10m_vs_11m, (c) MMM, and (d) 1c3s5z. Within each scenario-level subfigure, the four second-level panels report attack-move ratio, focus fire, ally minimum separation, and bucket entropy from left to right. Curves are evaluated every 10k environment steps on test episodes, and shaded areas denote the standard deviation.

Figure 6. Ablation on MMM: attack–move ratio, focus-fire, episode length, and win rate under communication-enabled and no-communication settings. Curves are evaluated every 10k environment steps on test episodes, and shaded areas denote the standard deviation.

Figure 7. Win-rate comparison of different MARL backbones equipped with the proposed communication framework on 2s3z and MMM. Curves are evaluated every 10k environment steps on test episodes, and shaded areas denote the standard deviation.

Figure 8. Comparison with mainstream communication-based MARL methods on the 2s3z and MMM scenarios. The first column presents the battle win rate, while the second column reports the communication rate during training.

Table 1. Communication threshold statistics across SMAC maps. The stable-phase mean ($μ_{s}$) is computed from a converged reference setting without excess-activation regularization. The reported range corresponds to the minimum and maximum trigger rates observed in the stable phase.

Map	Initial Trigger Rate	Stable-Phase Mean ( $μ_{s}$ )	$δ_{low}$	$δ_{mid}$	$δ_{high}$
MMM	$0.510 \pm 0.068$	$0.308 \pm 0.016$	0.308	0.358	0.408
2s3z	$0.482 \pm 0.052$	$0.298 \pm 0.020$	0.298	0.348	0.398
10m_vs_11m	$0.523 \pm 0.079$	$0.299 \pm 0.014$	0.299	0.349	0.399
1c3s5z	$0.503 \pm 0.063$	$0.306 \pm 0.017$	0.306	0.356	0.406
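
The three thresholds in each row of Table 1 appear to be the stable-phase mean shifted by a fixed positive offset. The short Python sketch below reproduces the tabulated values under that reading. The offsets 0, 0.05, and 0.10 are inferred from the rows of the table rather than quoted from the text, and the names `STABLE_MEAN`, `OFFSETS`, and `thresholds` are hypothetical.

```python
# Stable-phase mean trigger rates (mu_s), copied from Table 1.
STABLE_MEAN = {
    "MMM": 0.308,
    "2s3z": 0.298,
    "10m_vs_11m": 0.299,
    "1c3s5z": 0.306,
}

# Offsets inferred from the table rows (an assumption, not stated explicitly).
OFFSETS = {"low": 0.0, "mid": 0.05, "high": 0.10}


def thresholds(mu_s: float) -> dict:
    """Return delta_low, delta_mid, delta_high for a given stable-phase mean."""
    return {name: round(mu_s + offset, 3) for name, offset in OFFSETS.items()}


if __name__ == "__main__":
    for scenario, mu_s in STABLE_MEAN.items():
        print(scenario, thresholds(mu_s))
```

For MMM, for example, this gives 0.308, 0.358, and 0.408, matching the corresponding row of the table.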

Table 2. Preliminary sensitivity analysis of the positive bias relative to $μ_{s}$ (all values in $\times 10^{-3}$).

Bias	0.025	0.05	0.075	0.10	0.125	0.15
ally_minsep	5.50	4.61	4.31	4.04	4.31	4.11
Diff. from $μ_{s}$ baseline	0.83	1.76	2.06	2.34	2.07	2.27

Table 3. Final win rate under different communication thresholds across SMAC scenarios, averaged over the last 80 evaluation records. Best results are highlighted in bold.

Map	QMIX	$δ_{low}$	$δ_{mid}$	$δ_{high}$
2s3z	$0.972 \pm 0.012$	$0.973 \pm 0.015$	$0.966 \pm 0.014$	$\mathbf{0.974 \pm 0.010}$
MMM	$0.921 \pm 0.028$	$\mathbf{0.966 \pm 0.018}$	$0.952 \pm 0.022$	$0.954 \pm 0.020$
10m_vs_11m	$0.856 \pm 0.030$	$0.873 \pm 0.022$	$\mathbf{0.884 \pm 0.018}$	$0.879 \pm 0.024$
1c3s5z	$0.941 \pm 0.060$	$0.952 \pm 0.045$	$0.948 \pm 0.040$	$\mathbf{0.955 \pm 0.038}$
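
To make the comparison in Table 3 easier to read, the following Python sketch selects the best-performing threshold setting per map and reports its gain in mean win rate over QMIX. It uses only the mean values from the table; the standard deviations are ignored, so the output is a descriptive summary rather than a significance test. The names `WIN_RATE` and `best_setting` are hypothetical.

```python
# Mean final win rates copied from Table 3 (standard deviations omitted).
WIN_RATE = {
    "2s3z":       {"QMIX": 0.972, "low": 0.973, "mid": 0.966, "high": 0.974},
    "MMM":        {"QMIX": 0.921, "low": 0.966, "mid": 0.952, "high": 0.954},
    "10m_vs_11m": {"QMIX": 0.856, "low": 0.873, "mid": 0.884, "high": 0.879},
    "1c3s5z":     {"QMIX": 0.941, "low": 0.952, "mid": 0.948, "high": 0.955},
}


def best_setting(row: dict) -> tuple:
    """Return (threshold name, gain over QMIX) for the best gated setting."""
    gated = {name: rate for name, rate in row.items() if name != "QMIX"}
    name = max(gated, key=gated.get)
    return name, round(gated[name] - row["QMIX"], 3)


if __name__ == "__main__":
    for scenario, row in WIN_RATE.items():
        name, gain = best_setting(row)
        print(f"{scenario}: best = delta_{name}, gain over QMIX = {gain:+.3f}")
```

The largest gain in mean win rate occurs on MMM (0.045, with $δ_{low}$), followed by 10m_vs_11m (0.028, with $δ_{mid}$). On 2s3z the best setting exceeds QMIX by only 0.002, which is well within the reported standard deviations.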

Table 4. Summary of behavior-level coordination metrics under different communication thresholds. Values are estimated from the converged segments of Figure 5 and reported as mean ± standard deviation. Higher attack_move ratio, focus-fire, and ally minimum separation indicate stronger coordination, while lower bucket entropy indicates more concentrated target selection.

Scenario	Attack_Move Ratio	Focus-Fire	Ally Minsep	Bucket Entropy
`2s3z`, $δ_{low}$	$0.20 \pm 0.03$	$2.35 \pm 0.12$	$0.0546 \pm 0.0049$	$0.43 \pm 0.03$
`2s3z`, $δ_{mid}$	$0.22 \pm 0.05$	$2.36 \pm 0.12$	$0.0512 \pm 0.0035$	$0.44 \pm 0.04$
`2s3z`, $δ_{high}$	$0.21 \pm 0.05$	$2.35 \pm 0.10$	$0.0525 \pm 0.0020$	$0.44 \pm 0.03$
`10m_vs_11m`, $δ_{low}$	$0.88 \pm 0.08$	$3.28 \pm 0.16$	$0.0072 \pm 0.0013$	$0.28 \pm 0.02$
`10m_vs_11m`, $δ_{mid}$	$0.75 \pm 0.06$	$3.13 \pm 0.13$	$0.0064 \pm 0.0007$	$0.26 \pm 0.02$
`10m_vs_11m`, $δ_{high}$	$0.86 \pm 0.10$	$3.36 \pm 0.17$	$0.0065 \pm 0.0008$	$0.27 \pm 0.02$
`MMM`, $δ_{low}$	$0.94 \pm 0.10$	$3.47 \pm 0.25$	$0.0117 \pm 0.0020$	$0.31 \pm 0.03$
`MMM`, $δ_{mid}$	$0.87 \pm 0.11$	$3.02 \pm 0.22$	$0.0132 \pm 0.0028$	$0.25 \pm 0.03$
`MMM`, $δ_{high}$	$0.67 \pm 0.10$	$2.89 \pm 0.22$	$0.0117 \pm 0.0017$	$0.24 \pm 0.03$
`1c3s5z`, $δ_{low}$	$0.57 \pm 0.08$	$2.98 \pm 0.30$	$0.0070 \pm 0.0024$	$0.34 \pm 0.04$
`1c3s5z`, $δ_{mid}$	$0.60 \pm 0.10$	$2.78 \pm 0.24$	$0.0080 \pm 0.0017$	$0.36 \pm 0.04$
`1c3s5z`, $δ_{high}$	$0.54 \pm 0.08$	$3.25 \pm 0.25$	$0.0082 \pm 0.0010$	$0.33 \pm 0.04$
