4.1.1. State Representation in GNN-Enhanced MARL
State representation in MARL faces the challenge of capturing both individual agent states and complex interactions within dynamic environments [26]. GNN integration introduces novel approaches leveraging inherent graph structures in multi-agent systems, moving beyond traditional vector representations to enable richer coordination modeling [39,55].
The state representation methods can be systematically classified across multiple dimensions based on their underlying network architectures, learning paradigms, and coordination mechanisms. These approaches span from fundamental graph convolutional networks (GCNs) [31] to advanced architectures including hierarchical attention networks [38], partially equivariant systems [39], and hypergraph neural networks [41]. The classification reveals three primary architectural categories: spectral-based methods that provide Laplacian-based neighbor aggregation [31]; attention-based approaches that enable selective information processing through learnable attention mechanisms [33,38,42]; and advanced specialized architectures including inductive learning methods [37], symmetry-aware networks [39], and hypergraph-based systems [41].
Table 2 provides a comprehensive comparison of these methods across network architectures, learning paradigms, coordination mechanisms, theoretical foundations, and scalability characteristics.
The evolution from traditional vector representations to graph-based encoding represents a paradigm shift in MARL. Early approaches struggled with exponential state space growth, while modern GNN-based methods naturally model spatial-temporal relationships through dynamic graph structures, enabling efficient neighbor aggregation without full enumeration of the global state space.
Graph-based encoding transforms environments into structured representations where agents are represented as nodes and their interactions as edges. The environment can be modeled as a graph $G = (V, E)$, where $V$ denotes the set of nodes (agents) and $E$ the set of edges (interactions), enabling localized and topology-aware feature learning.
Graph convolutional network (GCN). A standard graph convolutional network (GCN) [31] performs layer-wise propagation using a message aggregation scheme based on the graph Laplacian. Specifically, the node representations are updated according to:
$$H^{(l+1)} = \sigma\left(\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H^{(l)} W^{(l)}\right),$$
where $\tilde{A} = A + I$ is the adjacency matrix with added self-connections, $\tilde{D}$ is the degree matrix of $\tilde{A}$, $H^{(l)}$ is the node feature matrix at layer $l$, $W^{(l)}$ is the trainable weight matrix, and $\sigma(\cdot)$ is an activation function such as ReLU. This update rule is identical to the foundational GCN propagation mechanism previously introduced in Equation (12), which forms the basis for many advanced graph-based MARL models. A well-documented issue with GCNs is the tendency towards over-smoothing, where repeated neighborhood averaging can make node representations overly similar after several layers. For state representation, this may cause distinct agent states to become indistinguishable, potentially hindering fine-grained policy learning.
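For concreteness, this propagation rule can be sketched in a few lines of Python. The three-agent path graph, feature values, and identity weight matrix below are illustrative assumptions, not details from [31]:

```python
# Minimal single-layer GCN sketch: H' = ReLU(D^{-1/2} (A + I) D^{-1/2} H W).
import math

def gcn_layer(A, H, W):
    n = len(A)
    # Add self-connections: A_tilde = A + I
    At = [[A[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    # Degree of A_tilde, then symmetric normalization
    d = [sum(row) for row in At]
    norm = [[At[i][j] / math.sqrt(d[i] * d[j]) for j in range(n)] for i in range(n)]
    # Aggregate neighbor features: norm @ H
    agg = [[sum(norm[i][k] * H[k][f] for k in range(n)) for f in range(len(H[0]))]
           for i in range(n)]
    # Linear transform and ReLU activation: ReLU(agg @ W)
    return [[max(0.0, sum(agg[i][k] * W[k][f] for k in range(len(W))))
             for f in range(len(W[0]))] for i in range(n)]

# Three agents on a path graph 0-1-2, 2-d features, identity weights
A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0]]
H1 = gcn_layer(A, H, W)
```

Each agent's new embedding mixes its own features with those of its graph neighbors, which is exactly the localized aggregation exploited by graph-based state representations.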
Multi-agent graph embedding-based coordination (MAGEC). Multi-agent graph embedding methods enhance continuous-space representation by constructing spatial graphs that encode agent–obstacle–target relations [37]. In such spatial graphs, nodes represent agents and relevant environmental entities. To learn coordination policies over this representation, GNNs such as GraphSAGE [32] are applied:
$$h_i^{(l+1)} = \sigma\left(W^{(l)} \cdot \left[\, h_i^{(l)} \,\Vert\, \mathrm{AGG}\left(\{h_j^{(l)} : j \in \mathcal{N}(i)\}\right) \right]\right),$$
where $\mathrm{AGG}(\cdot)$ denotes a neighborhood aggregation function. This formulation directly applies the inductive GraphSAGE framework from Equation (13), where the aggregator function learns to generate embeddings from sampled local neighborhoods. This spatial-graph plus GraphSAGE pipeline improves coordination robustness by leveraging spatial locality and inductive generalization.
Multi-agent state aggregation enhancement. Traditional state aggregation in multi-agent systems suffers from the curse of dimensionality. GNN-based models address this by enabling structured and localized message passing. In particular, GCNs enhance state aggregation by embedding agent states within relational graphs, allowing each agent to integrate neighborhood information without full enumeration of the global state space. To extract compact global features, a permutation-invariant pooling function is applied over the final node embeddings:
$$s = \mathrm{POOL}\left(\{h_i^{(L)}\}_{i \in V}\right),$$
yielding a fixed-size global state descriptor. To further improve adaptivity, graph attention networks (GAT) employ learnable attention coefficients to prioritize important neighbors [33]:
$$h_i' = \sigma\left(\sum_{j \in \mathcal{N}(i)} \alpha_{ij} W h_j\right),$$
which allows agents to adaptively focus on the most relevant peers under dynamic conditions. The calculation of these attention weights is the core mechanism of GATs, as detailed earlier in Equations (14) and (15). While GAT’s attention offers expressive weighting, its performance can be sensitive to the graph topology. In sparse neighborhoods, attention weights may become concentrated or noisy, leading to state representations that are unstable or disproportionately influenced by a small subset of neighbors.
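The attention-weighted aggregation can be sketched as follows; for brevity, a plain dot-product score stands in for GAT's learnable scoring function $\mathrm{LeakyReLU}(a^\top [W h_i \,\Vert\, W h_j])$, and the toy features are illustrative assumptions:

```python
# Sketch of single-head attention aggregation over a neighborhood.
import math

def attention_weights(h_i, neighbors):
    """Softmax over raw scores e_ij = <h_i, h_j> for j in N(i)."""
    scores = [sum(a * b for a, b in zip(h_i, h_j)) for h_j in neighbors]
    m = max(scores)  # subtract max for numerical stability
    exp = [math.exp(s - m) for s in scores]
    z = sum(exp)
    return [e / z for e in exp]

h_i = [1.0, 0.0]
nbrs = [[1.0, 0.0], [0.0, 1.0]]  # one aligned neighbor, one orthogonal
alpha = attention_weights(h_i, nbrs)
# Aggregation: h_i' = sum_j alpha_ij * h_j (weight matrix omitted for brevity)
h_new = [sum(a * h[f] for a, h in zip(alpha, nbrs)) for f in range(2)]
```

The softmax guarantees the coefficients sum to one per neighborhood, so the aligned neighbor receives the larger share of attention, which is the selective weighting behavior described above.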
Hierarchical graph attention network (HGAT). To address scalability and relational complexity, Ryu et al. [38] propose a Hierarchical Graph Attention Network (HGAT) with a two-level attention architecture. At the lower level, node-level attention models fine-grained interactions. At the higher level, hyper-node attention aggregates representations of predefined agent groups. The local embedding of an agent $i$ is computed using attention-weighted message passing:
$$h_i = \sigma\left(\sum_{j \in \mathcal{N}(i)} \alpha_{ij} W h_j\right),$$
where $\alpha_{ij}$ is the attention coefficient. This local update rule directly mirrors the GAT aggregation mechanism previously shown in Equation (16). This structure enables HGAT to generate context-aware agent embeddings that are invariant to the number of agents. The effectiveness of HGAT’s hierarchical representation, however, is heavily dependent on the predefined agent grouping strategy: an improperly designed hierarchy can create information bottlenecks or misrepresent the true coordination structure, potentially leading to suboptimal state abstractions.
Equivariant and symmetry-aware state representations (PEnGUiN). Recent advances leverage structural symmetries to enhance sample efficiency. McClellan et al. [39] introduce partially equivariant graph neural networks (PEnGUiN), which selectively enforce equivariance to agent permutations. Each agent’s representation is decomposed into equivariant and invariant components, and the framework introduces a learnable symmetry score $\beta \in [0, 1]$ to blend fully equivariant and non-equivariant updates. While partial equivariance enhances sample efficiency in symmetric tasks, it remains a strong inductive bias. In environments with subtle but critical asymmetries, the equivariant component of the representation might overly constrain the policy space, making it difficult for agents to learn specialized, non-symmetric behaviors. This architecture generalizes both equivariant GNNs and standard GNNs, adapting dynamically to varying degrees of environmental symmetry.
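A schematic sketch of such a blend is given below; the sigmoid parameterization of the symmetry score and both branch outputs are illustrative assumptions rather than the exact formulation of [39]:

```python
# Blending an equivariant and a non-equivariant update with a learnable
# symmetry score beta = sigmoid(logit).
import math

def blended_update(h_equivariant, h_nonequivariant, beta_logit):
    beta = 1.0 / (1.0 + math.exp(-beta_logit))  # symmetry score in (0, 1)
    # Convex combination of the two branch outputs
    h = [beta * e + (1.0 - beta) * n
         for e, n in zip(h_equivariant, h_nonequivariant)]
    return h, beta

h_eq = [1.0, -1.0]  # output of an equivariant GNN branch
h_ne = [0.0, 3.0]   # output of an unconstrained GNN branch
h, beta = blended_update(h_eq, h_ne, beta_logit=0.0)  # beta = 0.5
```

As the logit is driven to large positive or negative values during training, the update recovers a fully equivariant GNN or a standard GNN as limiting cases, matching the generalization claim above.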
Predictive and hypergraph state representations. Graph-assisted predictive state representations (GAPSR) extend single-agent PSR to multi-agent partially observable systems by leveraging agent connectivity graphs [40]. The framework constructs primitive predictive states between agent pairs and obtains the final predictive representation through graph-based aggregation over the connectivity graph. This approach enables decentralized agents to capture interactions while avoiding the complexity of joint state estimation.
Hypergraph coordination networks (HYGMA). HYGMA addresses multi-agent coordination through dynamic spectral clustering and hypergraph neural networks to capture higher-order agent relationships [41]. The framework constructs hypergraphs by solving a normalized cut minimization problem over the graph Laplacian $L$. The resulting hypergraph enables attention-enhanced information processing through hypergraph convolution, in which learned attention coefficients weight the contribution of each hyperedge. Representing states with hypergraphs allows for capturing higher-order relationships, but this comes at the cost of increased computational complexity. The performance is critically dependent on the hyperedge construction step (e.g., spectral clustering), which is often a non-trivial and computationally intensive preprocessing task.
Relationship modeling enhancement via graph edges. Liu et al. [42] propose a two-stage graph attention network (G2ANet) for automatic game abstraction. G2ANet constructs an agent-coordination graph where agents are initially fully connected, then uses hard attention to identify and remove unrelated edges, followed by soft attention to learn importance weights. Hard attention employs a Bi-LSTM to produce a binary gate for each edge, and soft attention then applies a query–key mechanism to weight the surviving edges. The resulting sub-graph for each agent is processed by GNNs to obtain joint encodings.
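The two-stage pruning-then-weighting pipeline can be sketched as follows; a simple threshold gate stands in for the paper's Bi-LSTM hard attention, and dot-product scores stand in for its learned query–key projections, so all values here are illustrative assumptions:

```python
# Two-stage attention sketch in the spirit of G2ANet: a hard binary gate
# prunes edges, then soft attention weights the survivors.
import math

def two_stage_attention(h_i, neighbors, gate_threshold=0.5):
    # Stage 1: hard attention -> binary mask over edges
    raw = [sum(a * b for a, b in zip(h_i, h_j)) for h_j in neighbors]
    mask = [1 if s >= gate_threshold else 0 for s in raw]
    kept = [j for j, m in enumerate(mask) if m]
    if not kept:
        return mask, {}
    # Stage 2: soft attention (softmax) over the surviving edges only
    m = max(raw[j] for j in kept)
    exp = {j: math.exp(raw[j] - m) for j in kept}
    z = sum(exp.values())
    return mask, {j: e / z for j, e in exp.items()}

h_i = [1.0, 0.0]
nbrs = [[2.0, 0.0], [0.0, 5.0], [1.0, 1.0]]  # neighbor 1 is "unrelated"
mask, alpha = two_stage_attention(h_i, nbrs)
```

Because the unrelated neighbor is gated out entirely, the soft attention distribution is computed only over the abstracted sub-graph, which is the game-abstraction effect G2ANet targets.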
Relational state abstraction [43] transforms observations into spatial graphs where entities are connected through directional spatial relations. The method employs R-GCNs to aggregate information,
$$h_i' = \sigma\left(W_0 h_i + \sum_{r \in \mathcal{R}} \sum_{j \in \mathcal{N}_r(i)} \frac{1}{c_{i,r}} W_r h_j\right),$$
where $c_{i,r}$ is a normalization constant, achieving translation invariance. This update rule is a direct application of the R-GCN framework described in Equation (17), using relation-specific weight matrices $W_r$ to process different spatial relationships. A max-pooling operation generates fixed-size representations for the critic network, demonstrating significant sample efficiency improvements.
Multimodal graph fusion. Graph-based multimodal fusion demonstrates superior integration capabilities by explicitly modeling structural relationships between modalities [44]. D’Souza et al. propose a multiplexed GNN approach that creates targeted encodings for each modality through autoencoders, with multiplexed graph layers representing different relationships. This framework processes different data types and addresses challenges where not all modalities are present for all samples.
Generalized graph drawing for Laplacian state representations. Wang et al. [45] develop an improved Laplacian state representation learning method by modeling states and transitions as graph structures. The approach views the MDP as a graph $G = (S, E)$, with states as nodes and transitions as edges. The graph Laplacian is defined as $L = D - A$. Notably, this is the unnormalized graph Laplacian, whereas the propagation in GCNs, as seen in Equation (12), typically utilizes a symmetrically normalized version for stable feature propagation. The $d$-dimensional Laplacian representation of a state is constructed from the $d$ eigenvectors of $L$ associated with its smallest eigenvalues. To address the non-uniqueness of the standard spectral graph drawing objective
$$\min_{u_1, \ldots, u_d} \sum_{k=1}^{d} u_k^\top L u_k \quad \text{s.t.} \quad u_k^\top u_l = \delta_{kl},$$
the authors propose a generalized objective with decreasing coefficients:
$$\min_{u_1, \ldots, u_d} \sum_{k=1}^{d} c_k\, u_k^\top L u_k \quad \text{s.t.} \quad u_k^\top u_l = \delta_{kl},$$
where $c_1 > c_2 > \cdots > c_d > 0$. This generalized objective has the $d$ smallest eigenvectors as its unique global minimizer, ensuring a faithful approximation to the ground truth Laplacian representation.
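The representation this objective targets can be computed directly by eigendecomposition on a small example; the ring-shaped four-state MDP below is an illustrative assumption:

```python
# Laplacian state representation sketch: build L = D - A for a small
# state-transition graph and keep the d smallest eigenvectors.
import numpy as np

# Four states on a ring: s0-s1-s2-s3-s0
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A        # unnormalized graph Laplacian
eigvals, eigvecs = np.linalg.eigh(L)  # eigenvalues in ascending order
d = 2
phi = eigvecs[:, :d]                  # d-dimensional Laplacian embedding
# Row phi[s] is the representation of state s; the first eigenvalue is 0
# (constant eigenvector), reflecting the connectedness of the graph.
```

In deep RL these eigenvectors are not computed exactly but approximated by a network trained on sampled transitions; the generalized objective above ensures the approximation converges to this unique eigenvector basis.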
Summary
The evolution of state representation architectures reflects a consistent shift toward capturing richer structural information in multi-agent environments. While GCN-based models rely on spectral aggregation across local neighborhoods, recent approaches such as GAT and hypergraph neural networks have advanced toward modeling high-order relations and multi-scale dependencies. In particular, GraphSAGE introduces inductive, sampling-based aggregation schemes that emphasize stable abstraction over selective attention. Alongside architectural advancements, the theoretical foundations behind these methods, ranging from spectral graph theory to equivariant representation learning, remain heterogeneous and often incomplete. This inconsistency highlights the need for more unified and interpretable frameworks to support generalization across diverse multi-agent structures.
Table 3 provides a summary of the advantages and limitations of different categories of state representation methods.
In practice, the selection of state representation methods is highly task-dependent. GCN-based methods are appealing in medium-scale systems with relatively stable interaction topologies, but their uniform aggregation limits performance when neighbor importance is uneven. Attention-based approaches (e.g., GAT, G2ANet) provide stronger adaptability to dynamic or heterogeneous environments by selectively weighting neighbors, though at the cost of increased computation. Hypergraph-based models (e.g., HYGMA) naturally capture higher-order or group-level interactions, making them suitable for large-scale systems, yet they require careful hyperedge construction. Symmetry-aware networks (e.g., PEnGUiN) are particularly effective when environments exhibit agent interchangeability or partial symmetry, improving sample efficiency and generalization but offering less benefit in asymmetric domains. Hierarchical models (e.g., HGAT) strike a balance between scalability and structure by integrating both local and group-level interactions, though they depend on meaningful group partitioning for effectiveness.
4.1.3. GNN-Enhanced Reward Design in MARL
Graph neural networks (GNNs) have been increasingly adopted for reward modeling in reinforcement learning due to their ability to capture structural dependencies among agents, states, and actions. Existing methods span a variety of strategies, including potential-based reward shaping [51,56,57], representation-based shaping using Laplacian embeddings [45], value function decomposition via graph-based architectures [53], and decentralized intrinsic reward modeling based on local message passing [52,58]. Some works also explore relational reward sharing grounded in graph-structured inter-agent preferences [59]. These approaches collectively demonstrate how GNNs enable more structured and adaptive reward signals, particularly in sparse-reward or multi-agent environments.
We summarize representative methods and their characteristics in Table 5.
Reward shaping methods. Reward shaping methods tackle the challenge of sparse reward environments by intelligently designing auxiliary reward signals. Reward propagation [51] employs graph convolutional networks (GCNs) to propagate reward information from rewarding states through message-passing mechanisms, learning potential functions for potential-based reward shaping that preserve the optimal policy. This propagation is a direct application of the GCN update rule shown in Equation (12), where reward signals are treated as node features that are diffused across the state graph. By approximating the underlying state transition graph through sampled trajectories and using the graph Laplacian as a surrogate for the true transition matrix, this method draws connections to the proto-value functions framework. Extensive validation across tabular domains (FourRooms), vision-based navigation tasks (MiniWorld), Atari 2600 games, and continuous control problems (MuJoCo) reveals significant performance improvements over actor–critic baselines while maintaining computational efficiency.
Hierarchical graph topology (HGT). Building upon potential-based reward shaping principles, hierarchical graph topology (HGT) [56] constructs an underlying graph where states serve as nodes and edges represent transition probabilities within the Markov decision process (MDP). Rather than operating on flat graph structures, HGT decomposes complex probability graphs into interpretable subgraphs and aggregates messages from these components for enhanced reward shaping effectiveness. The hierarchical architecture proves especially valuable in environments with sparse and delayed rewards by enabling agents to capture long-range dependencies and structured transition patterns.
Reward shaping with Laplacian representations. Wang et al. [45] explore a geometric approach to reward design by leveraging learned Laplacian representations for pseudo-reward signal generation in goal-achieving tasks. Their strategy centers on Euclidean distance computation within the representation space, where the pseudo-reward corresponds to the negative L2 distance between the current and goal states in the Laplacian representation space:
$$r(s) = -\left\| \phi(s) - \phi(g) \right\|_2,$$
where $\phi(\cdot)$ denotes the learned Laplacian representation and $g$ the goal state. Given that Laplacian representations can effectively capture the geometric properties of environment dynamics, this representation-space distance metric provides meaningful guidance signals for agent learning. Further analysis reveals that different dimensions of the Laplacian representation contribute varying degrees of effectiveness to reward shaping, with lower dimensions (corresponding to smaller eigenvalues) demonstrating superior learning acceleration. This observation establishes both theoretical foundations and practical guidelines for dimension-selective reward shaping, where pseudo-rewards can be computed using individual dimensions as $r_i(s) = -\left| \phi_i(s) - \phi_i(g) \right|$ for the $i$-th dimension $\phi_i$ of the learned representation.
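Both the full and dimension-selective pseudo-rewards reduce to a few lines of code; the embedding values below are illustrative assumptions:

```python
# Negative-L2 pseudo-reward on Laplacian embeddings, plus the
# dimension-selective variant described above.
import math

def pseudo_reward(phi_s, phi_g):
    """r(s) = -||phi(s) - phi(g)||_2 in representation space."""
    return -math.sqrt(sum((a - b) ** 2 for a, b in zip(phi_s, phi_g)))

def per_dimension_reward(phi_s, phi_g, i):
    """Dimension-selective variant: r_i(s) = -|phi_i(s) - phi_i(g)|."""
    return -abs(phi_s[i] - phi_g[i])

phi_s = [0.3, 0.4]   # embedding of the current state
phi_g = [0.0, 0.0]   # embedding of the goal state
r = pseudo_reward(phi_s, phi_g)  # -0.5
```

The reward is maximal (zero) exactly when the agent's embedding coincides with the goal embedding, so it supplies a dense gradient toward the goal even when the environment reward is sparse.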
Graph convolutional recurrent networks for reward shaping. Addressing fundamental limitations in existing potential-based reward shaping methods, the graph convolutional recurrent network (GCRN) algorithm [57] introduces three complementary innovations for enhanced reward shaping optimization.
Spatio-Temporal Dependency Modeling: By integrating graph convolutional networks (GCN) with bi-directional gated recurrent units (Bi-GRUs), GCRN simultaneously captures both spatial and temporal dependencies within the state transition graph. The forward computation combines graph convolution over the Krylov basis $K$ with recurrent processing of the combined state and action information $X$, where ⊕ denotes concatenation.
Augmented Krylov Basis for Transition Matrix Approximation: While traditional approaches rely on the graph Laplacian $L$ as the GCN filter under value function smoothness assumptions, GCRN develops an augmented Krylov algorithm for more precise approximation of the transition matrix $P$. This reliance on the Laplacian connects to the core mechanism of standard GCNs, as its structure is fundamentally related to the symmetrically normalized adjacency matrix used in Equation (12). The resulting Krylov basis $K$ combines top eigenvectors from the sampled transition matrix with Neumann series vectors, yielding superior short-term and long-term behavior modeling compared to standard graph Laplacian approaches.
Look-Ahead Advice Mechanism: Departing from conventional state-only potential functions, GCRN incorporates a look-ahead advice mechanism that exploits both state and action information. The enhanced shaping function becomes:
$$F(s, a, s', a') = \gamma\, \Phi(s', a') - \Phi(s, a),$$
where $\Phi$ now operates on state–action pairs drawn from $S$ and $A$, facilitating more precise action-level guidance.
Training combines base and recursive loss components derived from hidden Markov model message-passing techniques, in which the optimality probability is calculated through forward and backward message passing.
By maintaining the policy invariance guarantee of potential-based reward shaping, GCRN achieves substantial improvements in both convergence speed and final performance, as validated through comprehensive experiments on Atari 2600 and MuJoCo environments.
Graph convolutional value decomposition. GraphMIX [53] proposes a GNN-based framework for value function factorization in MARL, enabling fine-grained credit assignment through graph attention mechanisms. It models agents as nodes in a fully connected directed graph, where edge weights, computed via attention, capture dynamic inter-agent influence. This approach leverages the expressive power of architectures like GCN (Equation (12)) or the more powerful GIN (Equation (18)) to process the graph structure and inform the value decomposition. The initial assumption of a fully connected graph, however, introduces a significant scalability bottleneck, as its computational complexity grows quadratically ($O(N^2)$) with the number of agents $N$, making it less suitable for systems with very large agent populations.
To optimize reward allocation, GraphMIX introduces a dual-objective loss structure. A global loss encourages accurate estimation of team-level returns by evaluating joint actions over the full state, while a local loss guides individual agents based on their assigned share of the global reward, computed from learned node-level embeddings. This design ensures that agents not only cooperate effectively but also learn to attribute outcomes to their actions.
The overall training objective combines the global and local losses into a single weighted sum. This integrated loss promotes coordinated learning by balancing team performance with personalized reward feedback, thereby improving both training stability and credit assignment precision.
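A dual-objective loss of this kind can be sketched as below; the squared-error forms and the weighting factor `lam` are illustrative assumptions, not GraphMIX's exact terms:

```python
# Dual-objective loss sketch: a global TD-style term on the team return plus
# a local term on per-agent reward shares, combined with a weighting factor.
def dual_objective_loss(q_total, team_target, q_locals, local_targets, lam=0.5):
    # Global loss: accuracy of the team-level return estimate
    loss_global = (q_total - team_target) ** 2
    # Local loss: each agent's estimate vs. its assigned share of the reward
    loss_local = sum((q - t) ** 2
                     for q, t in zip(q_locals, local_targets)) / len(q_locals)
    return loss_global + lam * loss_local

loss = dual_objective_loss(
    q_total=4.0, team_target=5.0,                   # team-level estimate/target
    q_locals=[1.0, 2.0], local_targets=[1.5, 2.5],  # per-agent shares
)
# loss_global = 1.0, loss_local = 0.25, total = 1.0 + 0.5 * 0.25 = 1.125
```

Setting `lam` trades off team-level accuracy against per-agent credit assignment: too small and agents receive little individual feedback, too large and the factorization can drift from the true joint value.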
Decentralized graph-based multi-agent reinforcement learning using reward machines (DGRM). DGRM [58] integrates GNNs with formal reward machine representations to support decentralized reward optimization in multi-agent settings. The method encodes complex, non-Markovian reward structures using reward machines, while leveraging truncated Q-functions that rely only on information from each agent’s local $\kappa$-hop neighborhood.
By exploiting the structure of agent interaction graphs, DGRM significantly reduces computational overhead, limiting the scope of each agent’s decision making to a small, relevant subgraph. This reliance on local $\kappa$-hop neighborhoods is a principle central to inductive GNNs like GraphSAGE (Equation (13)), enabling scalability by restricting message passing to nearby neighbors. The authors theoretically show that the influence of distant agents on local policy gradients diminishes exponentially with distance, ensuring that this approximation remains accurate. While this design promotes scalability, the truncation to a local neighborhood represents a fundamental trade-off; the model may fail to learn globally optimal strategies in tasks where critical long-range dependencies exist beyond the $\kappa$-hop communication radius. This decentralized design allows agents to coordinate effectively without requiring full access to global information, achieving scalable learning in large systems.
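The locality structure that truncated Q-functions rely on is simply a depth-limited traversal of the interaction graph; the five-agent chain below is an illustrative assumption:

```python
# Extract an agent's kappa-hop neighborhood from an interaction graph via
# breadth-first search truncated at depth k.
from collections import deque

def k_hop_neighborhood(adj, source, k):
    """Return the set of agents within k hops of `source`."""
    seen = {source}
    frontier = deque([(source, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:
            continue  # do not expand beyond the k-hop radius
        for nbr in adj[node]:
            if nbr not in seen:
                seen.add(nbr)
                frontier.append((nbr, depth + 1))
    return seen

# Chain of five agents: 0-1-2-3-4
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
local = k_hop_neighborhood(adj, source=2, k=1)  # {1, 2, 3}
```

A truncated Q-function for agent 2 would then condition only on the states and actions of this local set, which is what keeps per-agent computation independent of the total population size.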
Reward-sharing relational networks. While Haeri et al. [59] do not directly implement graph neural network architectures, their reward-sharing relational networks (RSRN) framework establishes crucial theoretical foundations and design insights for GNN applications in multi-agent reward optimization. RSRN conceptualizes multi-agent systems as directed graphs $G = (V, E, W)$, where $V$ denotes the agent set, $E$ represents inter-agent relational edges, and weight matrix elements $w_{ij}$ in $W$ quantify how much agent $i$ “cares about” agent $j$’s success. Relational rewards emerge through scalarization functions, with the weighted product model specified as
$$\tilde{r}_i = \prod_{j} r_j^{\,w_{ij}}.$$
Policy learning occurs through long-term maximization of the shared relational returns. The relational graph structure naturally accommodates GNN processing, enabling graph convolutional operations to learn and optimize agent reward propagation patterns. Specifically, architectures like R-GCNs (Equation (17)) are well suited to process such explicitly relational graphs, where each edge type (or weight) can be modeled with a distinct transformation. A primary challenge for this framework is its dependence on a pre-defined relational graph. The design of an effective weight matrix $W$ requires significant, often unavailable, domain knowledge about agent social dynamics, and a misspecified structure can lead to unintended behaviors. Different network topologies (survivalist, communitarian, authoritarian, tribal) generate distinct emergent behaviors, suggesting that GNN-based approaches could exploit similar relational inductive biases to enhance multi-agent coordination and reward optimization.
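The weighted product model reduces to a one-line computation per agent; the weight matrix below (agents 1 and 2 partially "caring about" agent 0) is an illustrative assumption rather than one of the paper's named topologies:

```python
# Weighted product model of relational rewards:
# r_tilde_i = prod_j r_j ** w_ij for each agent i.
import math

def relational_rewards(W, r):
    n = len(r)
    return [math.prod(r[j] ** W[i][j] for j in range(n)) for i in range(n)]

# Agent 0 cares only about itself; agents 1 and 2 also care about agent 0.
W = [[1.0, 0.0, 0.0],
     [0.5, 1.0, 0.0],
     [0.5, 0.0, 1.0]]
r = [4.0, 2.0, 1.0]   # individual environment rewards
shared = relational_rewards(W, r)  # [4.0, 4.0, 2.0]
```

Because the product couples each agent's return to the success of those it cares about, agents 1 and 2 are rewarded for outcomes that benefit agent 0, which is the reward-sharing effect RSRN studies.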
GNN-driven intrinsic rewards for heterogeneous MARL. The CoHet algorithm [52] develops a graph-neural-network-based intrinsic reward mechanism for cooperative challenges in decentralized heterogeneous multi-agent reinforcement learning. CoHet optimizes reward design through GNN message-passing mechanisms, specifically employing local neighborhood observation predictions for intrinsic reward calculation. Coordinated behavior emerges by minimizing deviations between agents’ actual observations and their neighbors’ predictions. The intrinsic reward for agent $i$ aggregates, over its neighbors $j$, a Euclidean-distance-based weighted error between agent $i$’s actual next observation and neighbor $j$’s prediction of that observation. Local information processing occurs through the GNN architecture, whose update function embodies the general message-passing paradigm central to GNNs, where a node’s representation is updated by aggregating features from its neighbors.
By enabling decentralized training based exclusively on local neighborhood information, this design promotes effective coordination among heterogeneous agents through intrinsic reward signals.
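A prediction-error intrinsic reward of this flavor can be sketched as follows; the inverse-distance weighting form and the toy observations are illustrative assumptions, not the paper's exact equation:

```python
# Intrinsic reward sketch: agent i's reward aggregates its neighbors' errors
# in predicting i's next observation, weighted by proximity.
import math

def intrinsic_reward(o_next, predictions, positions, pos_i):
    """Negative distance-weighted prediction error over the neighborhood."""
    total = 0.0
    for j, pred in predictions.items():
        dist = math.dist(pos_i, positions[j])
        w = 1.0 / (1.0 + dist)         # closer neighbors weigh more
        err = math.dist(o_next, pred)  # ||o_i' - o_hat_{j->i}||
        total += w * err
    return -total

o_next = [1.0, 1.0]                     # agent i's actual next observation
preds = {1: [1.0, 1.0], 2: [1.0, 4.0]}  # neighbors' predictions of it
positions = {1: (0.0, 0.0), 2: (0.0, 1.0)}
r_int = intrinsic_reward(o_next, preds, positions, pos_i=(0.0, 0.0))
```

The reward is zero only when every neighbor predicts agent $i$'s dynamics perfectly, so maximizing it drives agents toward mutually predictable, and hence coordinated, behavior using purely local information.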
Summary
GNN-enhanced reward design methods in MARL can be broadly grouped into several categories, each addressing different challenges. Potential-based reward shaping methods (e.g., reward propagation, HGT, GCRN) are effective in sparse-reward settings, as they propagate or decompose reward signals to provide denser guidance, though they may introduce training overhead or require accurate graph approximations. Representation-based shaping with Laplacian embeddings offers geometric insights into environment dynamics and goal distances, making it useful for navigation and goal-reaching tasks, but performance depends heavily on representation quality. Value decomposition methods such as GraphMIX leverage graph attention to balance global team rewards and individual credit assignment, excelling in cooperative domains but requiring careful loss balancing. Decentralized or relational reward designs (e.g., DGRM, RSRN) scale well to large multi-agent systems by exploiting local neighborhoods or inter-agent preference structures, though they may lose global optimality. Finally, intrinsic reward modeling (e.g., CoHet) is particularly suited for heterogeneous teams, as it encourages coordination from local prediction errors, but the design of intrinsic signals can be task-dependent. In practice, potential-based and representation-based shaping are preferable in structured single-task domains, while decentralized and intrinsic reward strategies are more practical in large-scale or heterogeneous environments. Hybrid designs that combine decomposition with relational or intrinsic signals are promising directions for real-world systems such as UAV coordination, mixed-robot teams, and resource allocation networks.
Table 6 provides a summary of the applicable scenarios and limitations of different categories of GNN-enhanced reward design methods in MARL.