In scenarios with sparse local interactions and dynamic heterogeneity, we introduce a graph attention mean-field model based on adversarial IRL that establishes a reward free of the symmetrical ambiguity problem and a strategy optimization model within a graph network framework. The model uses graph attention mechanisms to capture the local dependencies between agents, while a typed mean-field approximation propagates the group strategy distributions to achieve global strategy coordination.
4.1. Problem Description
In a dynamic heterogeneous multi-agent system, consider a set of agents in which each agent belongs to a certain type at time t; both the type and the strategy may change over time. The dynamic heterogeneous multi-agent IRL problem is therefore modeled on a dynamic heterogeneous graph network whose nodes represent the agents and whose edges represent the interaction relationships between them. Given expert demonstrations, the main goal is to learn each agent's strategy, maximize the long-term cumulative reward, and adapt to dynamic changes in type and topology. Real expert demonstration data can be used directly; alternatively, following [21,27], an entropy-regularized MFNE strategy can be used to generate the expert demonstration trajectories. A joint policy generated via the entropy-regularized MFNE is the optimal solution to an optimization problem that trades the cumulative reward off against a KL-divergence regularizer; consequently, the resulting trajectory distribution increases exponentially with the cumulative reward and does not suffer from the symmetrical ambiguity problem. Additionally, each agent is associated with a type, an individual strategy coupled to the corresponding type-specific mean field, and the population distribution, which collects the state-action distribution of each type.
4.2. Improved IRL Based on Graph Attention Mean Field
First, we obtain the expert demonstration trajectories sampled from the entropy-regularized MFNE. Assuming that these trajectories are optimal, we maximize the likelihood of the entropy-regularized MFNE stationary distribution over the expert trajectories, which is in turn induced by the parameterized rewards; that is, the expert policies are assumed to satisfy a mean-field Nash equilibrium under some unknown parameterized reward. Reward learning can then be regarded as adjusting the reward parameters. The probability of a trajectory generated by the entropy-regularized MFNE policies is determined by the stationary joint distributions, the distribution of the initial mean-field flow, and the state transition probabilities.
The classical maximum entropy IRL proposed by Ziebart et al. [40] learns the reward function by maximizing the likelihood of the observed expert demonstration trajectory distribution defined in Equation (1). Inspired by this approach, learning the reward function in dynamic heterogeneous multi-agent IRL reduces to a maximum likelihood problem whose partition function is defined as a summation over all trajectories.
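For reference, the classical single-agent maximum entropy IRL objective of Ziebart et al. [40] can be written in generic notation as follows; the multi-agent objective used here additionally conditions the reward on the mean field, so this block is only an illustrative baseline form rather than the paper's exact equation:
\[
\max_{\omega} \sum_{\tau \in \mathcal{D}} \log p_{\omega}(\tau),
\qquad
p_{\omega}(\tau) = \frac{\exp\!\big(R_{\omega}(\tau)\big)}{Z(\omega)},
\qquad
Z(\omega) = \sum_{\tau} \exp\!\big(R_{\omega}(\tau)\big),
\]
where \(\mathcal{D}\) denotes the set of expert demonstration trajectories and \(R_{\omega}(\tau)\) is the cumulative parameterized reward along trajectory \(\tau\).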
However, in dynamic heterogeneous multi-agent settings, reward learning and strategy optimization must cope with large differences between neighboring strategies, which is the main challenge addressed in this study. We therefore propose hierarchical graph attention mean-field multi-agent inverse reinforcement learning, which infers implicit reward functions from expert demonstration trajectories through AIRL and learns dynamic heterogeneous policies conditioned on the mean-field distribution m, thereby adapting to dynamic changes in type and topology. First, to capture the dynamic interactions between agents and reduce the computational complexity, we model latent variables for the dynamic heterogeneous agents. Given the expert demonstration trajectories, a variational autoencoder (VAE) encodes each agent's current state and previous action into a latent variable. The type of agent i is then determined by its latent variable, yielding a type probability distribution defined with respect to the prototype vector of each type; the prototypes are updated through a sliding-window expectation–maximization (EM) algorithm to ensure the timeliness of the type estimates and achieve online clustering.
The pseudocode of the online type inference and clustering via the sliding window EM is shown in Algorithm 1.
| Algorithm 1 Online type inference and clustering via sliding window EM |
- Require: Sliding window size W, number of types K, type encoder, current time t.
- Ensure: Agent types and updated prototypes.
- 1: Initialize: Maintain a buffer of the most recent latent variables of all agents.
- 2: for each agent do
- 3: Encode the latent variable from the agent's current state and previous action
- 4: Add it to the buffer; remove the oldest entries if the buffer size exceeds W
- 5: end for
- 6: E-step (Responsibility): For each latent–agent pair, compute the responsibility that type k has for latent variable z
- 7: M-step (Maximization): Update each type prototype as the responsibility-weighted average of the buffered latent variables
- 8: for each agent do
- 9: Re-assign the agent's type based on the updated prototypes
- 10: end for
|
The process of online type inference and clustering is detailed in Algorithm 1. The core of this process is a sliding window EM algorithm that operates on a fixed-size buffer containing the most recent latent variables of all agents. This buffer, with size W, ensures that the type prototypes adapt to recent agent behaviors, providing robustness against non-stationarity.
The algorithm proceeds as follows: First, for each agent, the type encoder generates a new latent variable based on its current state and previous action (Lines 3–4), and these latent variables are added to the sliding-window buffer. The E-step (Line 6) computes the responsibility, i.e., the probability that latent variable z belongs to type k, given the current prototypes. The M-step (Line 7) then updates each prototype as the weighted average of all latent variables in the buffer, with the responsibilities as weights. Finally, each agent is assigned to the type whose updated prototype is closest to its current latent variable (Line 9).
This loop runs at every time step t, allowing the types and their representations to evolve continuously as the agents’ strategies and the environment change. The use of a sliding window prevents the prototypes from being overly influenced by outdated behavior, striking a balance between adaptability and stability.
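To make the E- and M-steps concrete, the following minimal NumPy sketch performs one sliding-window EM update over a buffer of latent variables. The softmax-over-distances responsibility, the temperature, and all function and variable names are illustrative assumptions rather than the paper's exact formulas.

```python
import numpy as np

def sliding_window_em(buffer_z, prototypes, n_iters=1, temperature=1.0):
    """One sliding-window EM update over buffered latent variables.

    buffer_z   : (M, d) array of the most recent latent variables.
    prototypes : (K, d) array of type prototype vectors.
    Returns the updated prototypes and hard type assignments for buffer_z.
    """
    for _ in range(n_iters):
        # E-step: responsibility of each type k for each latent variable z,
        # here a softmax over negative squared distances to the prototypes.
        d2 = ((buffer_z[:, None, :] - prototypes[None, :, :]) ** 2).sum(-1)  # (M, K)
        logits = -d2 / temperature
        logits -= logits.max(axis=1, keepdims=True)            # numerical stability
        resp = np.exp(logits)
        resp /= resp.sum(axis=1, keepdims=True)                 # (M, K)

        # M-step: each prototype becomes the responsibility-weighted
        # average of the buffered latent variables.
        weights = resp.sum(axis=0) + 1e-8                       # (K,)
        prototypes = (resp.T @ buffer_z) / weights[:, None]     # (K, d)

    # Re-assign each latent variable to the closest updated prototype.
    types = resp.argmax(axis=1)
    return prototypes, types
```

In an online setting, this routine would be called once per time step on the current buffer, after appending the newly encoded latent variables and evicting entries older than the window size W.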
Then, based on the definitions in Section 3.1, each node of the heterogeneous dynamic graph used by the graph attention mean field carries a feature vector and a neighbor set determined by the interaction radius and the maximum number of neighbors. To achieve local feature updating, a graph attention network (GAT) updates the node features as given in Equation (12). In Equation (12), the attention coefficients are computed from type-specific weight matrices and an attention parameter for each type pair. For each type, the type mean field depends on the strategies of all agents of that type; because those strategies are influenced by the reward function through its parameters, the type mean field likewise depends on the reward parameters. This interdependence between the mean-field (MF) flow and the strategies in the entropy-regularized MFNE of Equation (10) means that the type mean field cannot be obtained directly. Furthermore, the state transition probabilities in Equation (10) also depend on the mean field and on the dynamic changes in the multi-agent environment, so directly solving the optimization problem in Equation (10) is difficult and increases the computational complexity of learning rewards and optimizing strategies in a dynamic multi-agent environment. We therefore use an MLP to map the average of the node features of the agents of each type at time t to the corresponding type mean field, as in Equation (13).
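The following PyTorch sketch illustrates the flavor of a type-aware attention update followed by a typed mean-field head; the additive attention form, the per-type weight matrices, the dense per-node loops, and all names and dimensions are assumptions for illustration and do not reproduce Equations (12) and (13) exactly.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TypedGATMeanField(nn.Module):
    """Sketch: type-aware graph attention layer plus a typed mean-field MLP."""

    def __init__(self, feat_dim, hidden_dim, mf_dim, num_types):
        super().__init__()
        # One type-specific weight matrix per agent type.
        self.W = nn.ModuleList(nn.Linear(feat_dim, hidden_dim, bias=False)
                               for _ in range(num_types))
        # Attention parameters for every (type_i, type_j) pair.
        self.attn = nn.Parameter(torch.randn(num_types, num_types, 2 * hidden_dim))
        # MLP mapping the average node feature of a type to its mean field.
        self.mf_mlp = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
                                    nn.Linear(hidden_dim, mf_dim))

    def forward(self, x, types, adj):
        # x: (N, feat_dim) node features; types: (N,) integer type indices;
        # adj: (N, N) boolean adjacency encoding the sparse local interactions.
        N = x.size(0)
        h = torch.stack([self.W[int(types[i])](x[i]) for i in range(N)])  # (N, H)

        updated = []
        for i in range(N):
            nbrs = adj[i].nonzero(as_tuple=True)[0]
            if nbrs.numel() == 0:
                updated.append(h[i])
                continue
            # Additive attention scored with the (type_i, type_j) parameters.
            a = self.attn[int(types[i]), types[nbrs]]                      # (|N_i|, 2H)
            pair = torch.cat([h[i].expand(len(nbrs), -1), h[nbrs]], dim=-1)
            alpha = torch.softmax(F.leaky_relu((pair * a).sum(-1)), dim=0)
            updated.append((alpha.unsqueeze(-1) * h[nbrs]).sum(0))
        h_new = torch.stack(updated)                                       # (N, H)

        # Typed mean field: MLP applied to the average feature of each type.
        mean_fields = {k: self.mf_mlp(h_new[types == k].mean(0))
                       for k in types.unique().tolist()}
        return h_new, mean_fields
```

In this sketch, sparsity enters only through the adjacency mask; an efficient implementation would use edge lists or a graph library so that the per-step cost stays proportional to the number of edges.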
Each agent's strategy depends on its type mean field; therefore, the mean fields of all types are aggregated into global information, and the dynamic heterogeneous strategy of every agent is computed from this aggregated information, as in Equation (14). Owing to the overall consistency condition of GAMF-DHIRL (proven in the later sections), the mapping of the average node feature of an agent at each time step matches that of the mean field. At the same time, the type mean field in Equation (14) makes the state transition function depend only on the mean field, thereby decoupling the transitions from the reward parameter. Consequently, the reward parameter can be dropped from the transition terms of the likelihood function in Equation (10), and we obtain the maximum likelihood estimation of GAMF-DHIRL in Equation (16), in which the rewards depend on the mean-field distribution m and the normalization is given by a partition function. This estimation follows the optimization objective of the maximum entropy IRL proposed by Ziebart et al. [40] and serves as a tractable counterpart of the maximum likelihood objective in Equation (10).
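As a small illustration of how a policy can be conditioned jointly on an agent's local node feature and the aggregated global mean-field information, consider the following sketch; the per-type heads, the concatenation scheme, and all dimensions are assumptions rather than the paper's architecture.

```python
import torch
import torch.nn as nn

class TypedMeanFieldPolicy(nn.Module):
    """Sketch: one policy head per type, conditioned on local + global info."""

    def __init__(self, feat_dim, mf_dim, num_types, num_actions, hidden=64):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Sequential(nn.Linear(feat_dim + mf_dim, hidden), nn.ReLU(),
                          nn.Linear(hidden, num_actions))
            for _ in range(num_types))

    def forward(self, node_feat, global_mf, agent_type):
        # node_feat: (feat_dim,) GAT-updated feature of one agent.
        # global_mf: (mf_dim,) aggregation (e.g., mean or concatenation)
        #            of all type mean fields.
        logits = self.heads[int(agent_type)](torch.cat([node_feat, global_mf]))
        return torch.distributions.Categorical(logits=logits)
```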
Theorem 1.
Consider a multi-agent system with dynamic agent types and a graph structure, and let the expert demonstration trajectories be independent and identically distributed samples from the MAT-MFNE induced by some unknown rewards. Assume that the reward function is differentiable with respect to the reward parameters for all states, actions, and mean fields. Then, under Assumptions A1–A3, as the number of expert trajectories tends to infinity, the gradient equation of the empirical objective has a root that tends toward the maximizer of the likelihood function in Equation (10).
Assumption 1.
The mean-field distribution is a consistent estimator of the true empirical distribution for all t, and the error introduced by the typed decomposition is bounded.
Assumption 2.
In the dynamic graph, the graph diameter is bounded at every time step t, ensuring information propagation in the system, and the agent type inference is statistically consistent.
Assumption 3.
The reward function is twice continuously differentiable in ω for all states, actions, and mean fields, and the Hessian of the log-pseudolikelihood is negative definite in a neighborhood of the true parameter.
Proof. Consider a standard game with N players and a parameterized reward function, and assume that the expert demonstrations are generated under the true value of the reward parameter. The pseudolikelihood objective is to maximize Equation (17). Based on Equation (16), the gradient of this objective with respect to the reward parameter is given by Equation (18). Note that the objective depends on the type-based mean-field distribution, which is a direct consequence of our MAT-MFNE framework. Under Assumption A1, the mean-field distribution can be treated as an exogenous input that converges to its true value, which allows us to differentiate the objective with respect to the reward parameter and derive the gradient. Let the empirical expert demonstration trajectory distribution be formed from the observed samples. Based on Equation (17), as the number of samples tends to infinity, the empirical distribution tends toward the true trajectory distribution by the law of large numbers and Assumption A1. Let the maximizer of the likelihood objective in Equation (10) be fixed; under Assumption A2, the dynamic graph and the type inference ensure that the model distribution is well defined. In the limit of infinitely many expert demonstration samples and at optimality, we obtain Equation (19). Substituting the result of Equation (19) into Equation (18) shows that the gradient in Equation (18) vanishes, which completes the proof. □
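For orientation, the gradient in the classical single-agent maximum entropy IRL setting takes the familiar difference-of-expectations form shown below (generic notation, not the paper's exact Equation (18)); the stationarity condition exploited above corresponds to this difference vanishing when the model's expected reward gradient matches the empirical one:
\[
\nabla_{\omega} \mathcal{L}(\omega)
  = \mathbb{E}_{\tau \sim \hat{p}_{E}}\!\big[\nabla_{\omega} R_{\omega}(\tau)\big]
  - \mathbb{E}_{\tau \sim p_{\omega}}\!\big[\nabla_{\omega} R_{\omega}(\tau)\big],
\qquad
p_{\omega}(\tau) = \frac{\exp\!\big(R_{\omega}(\tau)\big)}{Z(\omega)},
\]
where \(\hat{p}_{E}\) denotes the empirical expert trajectory distribution.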
4.3. Graph Attention Mean Field Adversarial IRL
Theorem 1 bridges the gap between the original intractable MLE objective in Equation (10) and the tractable empirical MLE objective in Equation (16). However, as discussed in Section 3.3, the exact computation of the partition function is typically difficult. Similar to AIRL [41], we employ importance sampling with an adaptive sampler to estimate the partition function. Because the policies in MFGs are time-varying, we use a set of T adaptive samplers, each of which serves as a parameterized policy. Our proposed GAMF-DHIRL infers implicit reward functions from the expert demonstration trajectories and learns strategies that adapt to dynamic heterogeneity, conditioned on the mean-field distribution. The framework optimizes rewards and strategies through a game between a generator and a discriminator, enabling agents to adapt dynamically to type changes and to the non-stationarity of the population distribution.
The discriminator is designed to distinguish expert strategies from generated strategies and outputs an estimate of the reward; its optimization objective is given in Equation (20). The generator maximizes the discriminator reward while satisfying the graph attention mean-field game equilibrium of Equation (21), and its objective is optimized accordingly, where this optimization is equivalent to estimating the reward function. The policy gradient includes a mean-field term, and the update of the policy parameters is interleaved with the update of the reward parameters. Intuitively, tuning the adaptive sampler can be viewed as a policy optimization process, that is, finding the GAMF-DHIRL policy induced by the current reward parameters so as to minimize the variance of the importance sampling estimate. Training the discriminator estimates the reward function by distinguishing the observed trajectories from those generated by the current adaptive sampler. The adaptive samplers can be trained by backward induction over the T time steps. In the optimal case, the discriminator approximates the latent reward function of GAMF-DHIRL, while the adaptive sampler approximates the observed policy. Each agent thus achieves efficient and adaptive collaboration with its neighbors through sparse graph attention, and the typed mean fields ensure the statistical consistency of the group strategies. Finally, we obtain the reward model, which includes a term that captures the impact of the population distribution.
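As a concrete reference point, the following PyTorch sketch shows an AIRL-style discriminator extended with a mean-field input, using the standard logit form f − log π and a reward-plus-shaping decomposition; all network shapes, names, and the exact way the mean field enters are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MeanFieldAIRLDiscriminator(nn.Module):
    """Sketch of an AIRL-style discriminator conditioned on the type mean field."""

    def __init__(self, state_dim, action_dim, mf_dim, hidden=64, gamma=0.99):
        super().__init__()
        self.gamma = gamma
        self.reward = nn.Sequential(nn.Linear(state_dim + action_dim + mf_dim, hidden),
                                    nn.ReLU(), nn.Linear(hidden, 1))
        self.shaping = nn.Sequential(nn.Linear(state_dim + mf_dim, hidden),
                                     nn.ReLU(), nn.Linear(hidden, 1))

    def f(self, s, a, m, s_next, m_next):
        # f = r(s, a, m) + gamma * h(s', m') - h(s, m)
        r = self.reward(torch.cat([s, a, m], dim=-1))
        h = self.shaping(torch.cat([s, m], dim=-1))
        h_next = self.shaping(torch.cat([s_next, m_next], dim=-1))
        return r + self.gamma * h_next - h                     # (batch, 1)

    def forward(self, s, a, m, s_next, m_next, log_pi):
        # D = exp(f) / (exp(f) + pi), expressed as a logit for stability.
        return self.f(s, a, m, s_next, m_next) - log_pi.unsqueeze(-1)

def discriminator_loss(disc, expert_batch, policy_batch):
    """Cross-entropy loss: expert transitions labeled 1, generated labeled 0.

    Each batch is a tuple (s, a, m, s_next, m_next, log_pi) of tensors.
    """
    logits_e = disc(*expert_batch)
    logits_p = disc(*policy_batch)
    return (F.binary_cross_entropy_with_logits(logits_e, torch.ones_like(logits_e))
            + F.binary_cross_entropy_with_logits(logits_p, torch.zeros_like(logits_p)))
```

The reward estimate for the generator can then be read off from the learned reward head (or from the logit itself), which is the sense in which training the discriminator doubles as reward learning.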
Theorem 2 (Consistency of Adversarial Rewards). Under the AIRL framework [4], and assuming that the expert policy and the learner's policy induce state-action mean-field distributions that are absolutely continuous with respect to each other, the reward function recovered by the optimally trained discriminator in the AIRL setting satisfies the following: - 1.
Policy Invariance: At the Nash equilibrium, the learned policy perfectly recovers the expert policy, and the induced mean-field distributions are indistinguishable.
- 2.
Reward Authenticity: The recovered reward is a monotonic transformation of the true underlying reward; specifically, there exist a strictly increasing function g and a function depending only on the state and the mean field that together relate the recovered reward to the true one.
Proof. The proof follows and extends the theoretical foundations of AIRL [4] to the graph-based mean-field multi-agent setting.
The discriminator, parameterized to distinguish expert trajectories from generated ones, is trained to minimize a cross-entropy loss. For a fixed generator policy, the optimal discriminator for this objective is known to be the ratio of the expert's state-action-mean-field distribution to the sum of the expert's and the generator's distributions (Equation (26)), where each distribution is the state-action-mean-field distribution induced by the corresponding policy. Following AIRL [4], we parameterize the discriminator in terms of a learned function. Substituting this parameterization into the optimal discriminator form in Equation (26), equating the two at equilibrium, and taking the logarithm of both sides yields Equation (29).
Under the maximum entropy policy framework [40], the expert's trajectory distribution is proportional to the exponential of the cumulative true reward. This implies that the expert's state-action distribution can be expressed in terms of the true reward and a log-partition function that depends only on the state and the mean field; a similar relation holds for the generator's policy, involving its advantage function A. Substituting these relations into Equation (29) and simplifying gives the desired expression.
Crucially, in the single-agent case, AIRL [4] proves that, if the learned function is decomposed into a reward term plus a discounted potential-shaping term and the policy is trained with a discount factor, the advantage term cancels the action-dependent log-policy term, leaving a reward that depends on the state only. In our multi-agent mean-field context, we make the corresponding assumption that the learned function recovers a reward function that is disentangled from the policy. Thus, at convergence, the recovered reward is equivalent to the true reward up to an arbitrary shift that depends only on the state and the mean field, fulfilling the condition for reward authenticity. □
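For reference, the standard forms from the GAN and AIRL literature [4] invoked in the derivation above can be written, in generic notation with a mean-field argument m added as a notational assumption on our part, as
\[
D^{*}(s,a,m) = \frac{\rho_{E}(s,a,m)}{\rho_{E}(s,a,m) + \rho_{\pi}(s,a,m)},
\qquad
D_{\omega}(s,a,m) = \frac{\exp\!\big(f_{\omega}(s,a,m)\big)}{\exp\!\big(f_{\omega}(s,a,m)\big) + \pi(a \mid s, m)},
\]
\[
f_{\omega}(s,a,s',m,m') = r_{\omega}(s,a,m) + \gamma\, h_{\varphi}(s',m') - h_{\varphi}(s,m),
\]
where \(\rho_{E}\) and \(\rho_{\pi}\) denote the state-action-mean-field distributions induced by the expert and the generator, and the last line is the reward-plus-shaping decomposition of the learned function.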
The pseudocode of graph attention mean field adversarial inverse reinforcement learning is shown in Algorithm 2, and its overall structural framework is shown in Figure 1. First, we obtain mixed expert demonstrations, which consist of the trajectory data (state-action sequences) of one or more experts (or of different types of agents) performing tasks in the environment. With mixed expert demonstrations, the system supports cooperation or competition among heterogeneous agents and performs dynamic type identification, accurately distinguishing the behavior categories of the different agents. On this basis, the system encodes the identified agent types into low-dimensional, computable representation vectors that capture the essential characteristics of agent behaviors. Online clustering then groups agent behaviors in real time, dynamically allocating and updating type labels to continuously improve the accuracy of the type representation. Next, the discriminator, which is based on maximum entropy inverse reinforcement learning, uses graph attention networks to calculate the interaction weights between agents, enabling each agent to adaptively aggregate information according to the type encodings of the other agents. Simultaneously, mean-field theory approximates the complex multi-agent interactions as interactions between an individual and the group's average behavior. The main goal of the discriminator is to distinguish as accurately as possible whether a trajectory comes from the expert demonstration data or from the generated data, and in doing so it gradually learns an implicit reward function free of the symmetrical ambiguity problem. Meanwhile, the policy generator continuously adjusts its behavior strategy under the recovered reward to generate trajectories that are closer to the expert demonstrations. Finally, the system merges the newly generated high-quality data with the original expert data to form a new generation of hybrid expert demonstration data, continuously expanding the scale and diversity of the expert data and forming a self-reinforcing closed-loop learning process.
| Algorithm 2 Graph Attention Mean Field Adversarial Inverse Reinforcement Learning |
- 1: Input: MFG with parameters and observed trajectories, type space, learning rate, discount factor, number of types K.
- 2: Initialization: Randomly initialize the policy network, discriminator, type encoder, type prototypes, and the mean field of every type.
- 3: Estimate the mean-field flow from the observed trajectories.
- 4: for each training iteration do
- 5: for each agent do
- 6: Obtain the encoded latent variable
- 7: Allocate the agent's type and update the corresponding type prototype
- 8: Calculate the neighbor attention weights and update the node features with Equation (12)
- 9: Calculate the type mean field with Equation (13)
- 10: if the discriminator/generator update condition is met then
- 11: Sample an expert data batch and generate a policy batch
- 12: Update the discriminator by minimizing the cross-entropy loss
- 13: Update the generator by maximizing the recovered rewards
- 14: else
- 15: Skip the discriminator and generator updates for efficiency
- 16: end if
- 17: if the convergence condition is met then
- 18: For each type, select the optimal policy using Equation (21)
- 19: Update the current policy
- 20: else
- 21: Continue with the current policy
- 22: end if
- 23: end for
- 24: end for
- 25: Output: Policy network parameters, reward parameters, type prototypes
|
4.4. Analysis of Time Complexity
We assume that the number of agents N and the number of types K change dynamically, with K much smaller than N. The state and action dimensions are fixed, T denotes the number of time steps, the latent variables have a fixed dimension, and the average number of neighbors, which reflects the graph sparsity, is much smaller than N.
Theorem 3.
Under the above assumptions, the total time complexity of GAMF-DHIRL is linear in the number of time steps T and the number of agents N, up to a factor given by the average number of neighbors.
Proof. The proof analyzes the computational cost of each module of GAMF-DHIRL. First, in dynamic type identification, the type encoder computes one latent variable per agent, so the cost per agent depends only on the constant latent and hidden dimensions, and the cost over all agents is linear in N. Online clustering via the sliding-window EM then requires, per update, time proportional to the number of buffered latent variables times the number of types K (with K much smaller than N); because K and the latent dimension are constants, this term is also linear in N.
Furthermore, in the graph attention mean-field module, the neighbor aggregation cost of the node feature update in the GAT layer is proportional to the number of edges, i.e., to N times the average number of neighbors, with the hidden-layer dimension treated as a constant. The typed mean-field aggregation therefore also runs in time linear in N.
Finally, during strategy optimization and adversarial training, the policy network has a fixed input dimension and a fixed output dimension, so its forward pass costs constant time per agent, and the discriminator, which implements the objective in Equation (20), has a fixed input dimension and an output dimension of 1, so it likewise costs constant time per agent.
In summary, the single-step time complexity is the sum of these per-module costs. Under the assumption of constant dimensions, the state, action, latent, and hidden dimensions and the number of types K are constants, so the single-step cost simplifies to a term linear in N and the average number of neighbors. Over T time steps, the total complexity therefore scales linearly with T, N, and the average number of neighbors. □