Electronics
  • Article
  • Open Access

14 November 2025

Scalable Wireless Sensor Network Control Using Multi-Agent Reinforcement Learning †

Electrical Engineering and Computer Science Department, University of Wyoming, Laramie, WY 82071, USA
This article is a revised and expanded version of a paper entitled “Decentralized Multi-Agent Reinforcement Learning for Large-Scale Mobile Wireless Sensor Network Control Using Mean Field Games”, which was presented at the 2024 33rd International Conference on Computer Communications and Networks (ICCCN), Kailua-Kona, HI, USA, 29–31 July 2024.
This article belongs to the Special Issue Advanced Control Strategies and Applications of Multi-Agent Systems

Abstract

In this paper, the real-time decentralized integrated sensing, navigation, and communication co-optimization problem is investigated for large-scale mobile wireless sensor networks (MWSNs) under limited energy. Compared with traditional sensor network optimization and control problems, large-scale resource-constrained MWSNs pose two new challenges: (1) increased computational and communication complexity due to the large number of mobile wireless sensors, and (2) an uncertain environment with limited system resources, e.g., unknown wireless channels and limited transmission power. To overcome these challenges, Mean Field Game theory is adopted and integrated with an emerging decentralized multi-agent reinforcement learning algorithm. Specifically, the problem is decomposed into two scenarios, i.e., cost-effective navigation and transmission power allocation optimization. Then, the Actor–Critic–Mass reinforcement learning algorithm is applied to learn the decentralized co-optimal design for both scenarios. To tune the reinforcement-learning-based neural networks, the coupled Hamilton–Jacobi–Bellman (HJB) and Fokker–Planck–Kolmogorov (FPK) equations derived from the Mean Field Game formulation are utilized. Finally, numerical simulations are conducted to demonstrate the effectiveness of the developed co-optimal design. Specifically, the optimal navigation algorithm tracked the given routes with an average tracking error of 2.32%.

1. Introduction

Coordinating a large number of mobile robots equipped with multi-modal sensors for information gathering in extreme environments (such as in underwater and underground workspaces where communication, computation, and power are limited) is a critically needed capability in many emerging applications []. For instance, state-of-the-art multi-agent simultaneous localization and mapping (SLAM) techniques [] heavily rely on the scalability and robustness of mobile wireless sensor networks (MWSNs), particularly when the agent population is large. Furthermore, modern monitoring systems [] have widely adopted robot-assisted large-scale wireless sensor networks. In such practical deployments, an MWSN typically consists of a single computationally capable remote station responsible for task planning, along with numerous low-cost mobile sensing robots characterized by limited energy and communication capabilities []. The agents’ navigation trajectories directly influence communication quality, since the channel attenuation depends on each agent’s position relative to the remote station.
This paper considers MWSNs comprising a fixed remote station and a large number of low-cost mobile sensors. The remote station acts as an intelligent coordinator, generating navigation plans and broadcasting them to the sensing agents. The agents, in turn, are tasked with following the assigned trajectories and transmitting collected data back to the remote station.
While utilizing low-cost mobile sensors significantly reduces deployment costs, it also introduces substantial challenges in both robot control and communication, especially as the number of agents increases []. Recent multi-agent navigation methods [], for instance, often require real-time position information from neighboring agents to compute control policies, causing the computational complexity to scale with the number of agents. This scalability issue is widely known as the “Curse of Dimensionality” [].
In addition, most low-cost robots are incapable of full-duplex point-to-point communication. As a result, data sharing among large populations of robots with low latency and minimal packet loss becomes infeasible. Simultaneously, the volume of sensed data is enormous (e.g., in multi-agent SLAM), and the number of agents requiring uplink communication with the remote station is high. Thus, maintaining the desired quality of service (QoS) becomes challenging, particularly due to the difficulty in coordinating transmission power among agents. Traditional wireless transmission power allocation schemes focus on optimizing the signal-to-interference-plus-noise ratio (SINR) in either centralized or distributed settings []. However, many practical MWSNs function as self-organizing (i.e., “ad hoc”) networks [], requiring decentralized solutions.
To address these challenges, this paper adopts the framework of Mean Field Game (MFG) theory, a decentralized decision-making paradigm for large-population multi-agent systems. In our previous studies [,], MFGs were successfully applied to multi-agent tracking tasks. The theoretical foundation of MFGs was introduced by Lasry and Lions [,] under the umbrella of stochastic non-cooperative game theory. The core concept is that an individual agent’s influence on the collective behavior can be effectively summarized via a local impact index derived from the population distribution. Specifically, it is shown in [,] that when the number of agents approaches infinity, each agent’s influence can be captured using a probability density function (PDF) of all agent states. This transforms the original large-scale multi-agent game into an equivalent two-player game: a local agent versus the population influence, significantly reducing computational complexity.
To compute the optimal decentralized control, a value function must be minimized, which accounts for the agent’s local state and the PDF of the population. As established in classical optimal control theory [], this value function satisfies the Hamilton–Jacobi–Bellman (HJB) equation, which is solved backward in time. Simultaneously, the evolution of the agents’ state distribution is governed by the Fokker–Planck–Kolmogorov (FPK) equation [], which is solved forward in time. Consequently, the optimal decentralized solution in an MFG framework is obtained by solving the coupled HJB–FPK system.
It was shown by [] that the solution to this coupled PDE system converges to an ϵ N -Nash equilibrium, which is regarded as the optimal solution for large-scale non-cooperative games. A complete and rigorous derivation of the ϵ N -Nash equilibrium was first provided in [] and later formalized through the Nash Certainty Equivalence Principle in []. Despite its theoretical elegance, solving the coupled HJB–FPK equations remains a formidable challenge due to their bidirectional structure and strong coupling [].
Recently, adaptive reinforcement learning approaches have emerged to approximate the solution of the HJB equation in a forward-in-time manner [,]. To extend this capability to the coupled HJB–FPK system, I applied a novel Actor–Critic–Mass (ACM) algorithm for the decentralized co-optimization of navigation and transmission power in large-scale MWSNs. Specifically, three neural networks are designed:
  • Mass Neural Network (Mass NN): Approximates the population-level PDFs of agents’ tracking errors and transmission power.
  • Critic Neural Network (Critic NN): Estimates the value function, which quantifies tracking accuracy and QoS performance.
  • Actor Neural Network (Actor NN): Learns the optimal control input for navigation and transmission power adjustment in real time.
The main contributions of this paper are as follows:
  • The decentralized co-optimization problem is formulated for MWSNs as two interconnected Mean Field Games: one for optimal navigation and one for transmission power control. The MFG framework effectively mitigates the “Curse of Dimensionality” associated with large-scale multi-agent systems.
  • The data-driven Actor–Critic–Mass (ACM) reinforcement learning algorithm is developed to learn the optimal solution of the MWSN control online, enabling real-time implementation in uncertain and dynamic environments.
  • The proposed novel MWSN algorithm is fully decentralized, requiring no inter-agent communication, making it highly scalable and communication-efficient for large populations of mobile agents.

3. Preliminaries

Before presenting the proposed Actor–Critic–Mass (ACM) algorithm, this section provides the necessary theoretical background on Mean Field Game (MFG) theory and the structure of the reinforcement learning framework adopted in this study. Specifically, we first review the fundamental principles of MFGs that enable scalable decision-making in large populations of interacting agents by coupling the Hamilton–Jacobi–Bellman (HJB) and Fokker–Planck–Kolmogorov (FPK) equations. Then, we introduce the ACM framework, which extends the classical actor–critic structure with an additional mass neural network to capture the evolution of the population density. This section thus establishes the mathematical foundations required for formulating the decentralized optimization problems and implementing the proposed algorithm in later sections.

3.1. Mean Field Game Theory

Mean Field Game (MFG) theory provides a mathematical framework for modeling decision-making in large populations of interacting agents. Instead of analyzing all pairwise interactions, which becomes intractable as the number of agents grows, MFG approximates the aggregate influence of the population using a probability density function (PDF) of states. This approximation reduces the original high-dimensional multi-agent problem into a tractable two-player game: a representative agent versus the mean field. The optimal strategy for each agent is then determined by solving a coupled system of partial differential equations: the Hamilton–Jacobi–Bellman (HJB) equation, which characterizes the optimal value function backward in time, and the Fokker–Planck–Kolmogorov (FPK) equation, which evolves the population state distribution forward in time. The equilibrium solution of this system corresponds to an ϵ N -Nash equilibrium, ensuring near-optimality when the agent population is large. This formulation forms the foundation for decentralized and scalable control in large-scale multi-agent systems.
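For reference, the coupled system described above can be written schematically as follows (a generic finite-horizon form under standard MFG assumptions; the notation here is illustrative and is instantiated for the specific navigation and power-allocation problems in Sections 4 and 5):

```latex
% Schematic coupled MFG system (illustrative notation only).
% V(x,t): value function of a representative agent, m(x,t): population density,
% L: running cost, f: dynamics, \Phi: mean field coupling cost.
\begin{aligned}
&\text{(HJB, backward)} &&
\partial_t V(x,t) + \tfrac{\sigma^2}{2}\,\Delta V(x,t)
+ \min_{u}\Big\{ L(x,u) + \nabla V(x,t)^{\top} f(x,u) \Big\}
+ \Phi\big(x, m(\cdot,t)\big) = 0,\\
&\text{(FPK, forward)} &&
\partial_t m(x,t) - \tfrac{\sigma^2}{2}\,\Delta m(x,t)
+ \nabla\!\cdot\!\big( m(x,t)\, f\big(x, u^{*}(x,t)\big) \big) = 0,\\
&\text{(coupling)} &&
u^{*}(x,t) = \arg\min_{u}\Big\{ L(x,u) + \nabla V(x,t)^{\top} f(x,u) \Big\},
\qquad m(x,0) = m_0(x), \quad V(x,T) = V_T(x).
\end{aligned}
```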

3.2. The ACM Structure

The proposed Actor–Critic–Mass (ACM) framework extends the classical actor–critic reinforcement learning structure by introducing an additional mass network to capture the evolution of the population distribution. Specifically, the critic neural network approximates the value function defined by the Hamilton–Jacobi–Bellman (HJB) equation, enabling the evaluation of long-term performance for a given state and policy. The actor neural network outputs the local control action, which is updated by minimizing the residual of the critic-estimated value gradient. In parallel, the mass neural network approximates the probability density function (PDF) of agent states by learning the solution to the Fokker–Planck–Kolmogorov (FPK) equation. The three networks are coupled: the actor depends on the critic’s value gradient, the critic requires the mass distribution to capture population-level coupling effects, and the mass network evolves based on the actor’s control inputs. This closed-loop interaction allows decentralized agents to approximate the coupled HJB–FPK system online, ensuring convergence toward an ϵ N -Nash equilibrium.

4. Problem Formulation

Consider N (for example, N = 1000) mobile sensing robots, named “agents”, operating in an n-dimensional workspace where a remote station guides this massive team of sensing robots to search a designated area. A reference navigation trajectory (i.e., $A_i(t): \mathbb{R} \to \mathbb{R}^n$) is first broadcast to all robots. For simplicity, the area and its reference trajectory are both denoted by $A_i(t)$. The robots are required to follow the received reference trajectory. Once a mobile sensing robot finishes tracking one trajectory, it transmits the sensed data back to the remote station and then receives a new navigation trajectory. During surveillance and data communication, mobile sensing agents experience channel fading, shadowing effects, and interference from each other. Overall, to optimize the mobile wireless sensor network for all robots, two objectives need to be co-optimized, i.e., (1) tracking the reference trajectories effectively, considering the interference from other agents, and (2) transmitting back the sensed data efficiently and achieving the desired signal-to-interference-plus-noise ratio (SINR), even in an uncertain environment. As shown in Figure 1, the task is defined as two different scenarios, i.e., (1) Scenario 1: tracking control; (2) Scenario 2: transmission power allocation. These two scenarios run asynchronously to fulfill the two aforementioned objectives as depicted in the flow chart in Figure 2.
Figure 1. An illustration of the low-cost MWSN. At time T k , the remote station wants to detect the area A k , and therefore a reference trajectory is broadcast to a massive group of sensing robots. Then the robot team tracks the reference trajectory through decentralized optimal tracking control. When the area is fully detected, all sensing robots transmit the collected information back to the remote station. Then, a new area (reference trajectory) is broadcast once the data transmission is completed.
Figure 2. Low-cost mobile robots (e.g., AUVs, UAVs, and AGVs) have limited communication and computational resources. Hence, they alternate between two asynchronous modes: (1) Scenario 1—navigation; and (2) Scenario 2—data transmission. The workflow is illustrated in this figure.

4.1. Scenario 1: Optimal Navigation Formulation

Consider the motion dynamics of robot i in a team comprising N homogeneous robots as a stochastic differential equation:
d x_{s,i} = \left[ f(x_{s,i}) + g(x_{s,i})\, u_{s,i} \right] dt + \sigma_s\, d w_{s,i},
where $x_{s,i}$ and $u_{s,i}$ denote the agent's state and motion control input, respectively, and $f(x_{s,i})$ and $g(x_{s,i})$ are smooth, Lipschitz nonlinear functions describing the robot's motion dynamics. The term $w_{s,i}(t)$ denotes a set of independent Wiener processes representing environmental noise, while $\sigma_s$ is the constant diffusion matrix for these processes.
Next, at any time $T_k$, the tracking error can be represented as $e_{s,i} = x_{s,i} - A_k(t)$, where $A_k(t)$ is the current reference trajectory. The tracking error dynamics is thus derived as
d e_{s,i} = d x_{s,i} - d A_k = \left[ f_s(e_{s,i}) + g_s(e_{s,i})\, u_{s,i} \right] dt + \sigma_s\, d w_{s,i},
where $f_s(e_{s,i}) = f(e_{s,i} + A_k(t)) - dA_k(t)/dt$ and $g_s(e_{s,i}) = g(e_{s,i} + A_k(t))$.
Remark 1.
The error dynamics in Equation (2) follow from the affine transformation $e_{s,i} = x_{s,i} - A_k(t)$. Since the Jacobian is the identity and the Hessian vanishes, Itô’s lemma reduces to $d e_{s,i} = d x_{s,i} - \dot{A}_k(t)\, dt$, with the extra term absorbed into $f_s(e_{s,i}, t)$.
To track the reference trajectory in an optimal manner, a value function is proposed for any given time duration [ T k , T ] as follows:
V_{s,i}(E, u_{s,i}) = \mathbb{E}\left[ \int_{T_k}^{T} L_s(e_{s,i}, u_{s,i}) + \Phi_s(e_{s,i}, E)\; dt \right],
where $E = [e_{s,1}, \ldots, e_{s,N}] \in \mathbb{R}^{n \times N}$ is the augmented tracking error matrix including all agents. The motion energy and state cost is defined as
L_s(e_{s,i}, u_{s,i}) = \| e_{s,i} \|_{Q_s}^{2} + \| u_{s,i} \|_{R_s}^{2},
and the coupling function $\Phi_s(e_{s,i}, E): \mathbb{R}^{n \times N} \times \mathbb{R}^{n} \to \mathbb{R}$ is an arbitrary Lipschitz function which represents the cost caused by other agents.
The value function (3) penalizes the tracking error, the motion control input, and the coupling cost between other agents and the local agent i. It is also worth noting that although the cost function accumulates the running cost over a finite time duration (i.e., $[T_{k-1}, T_k]$), the problem is still an infinite-time optimal control problem: instead of being predefined, the end time $T_k$ is determined by the time at which all agents reach the point of interest (POI), so no specific restriction is applied to $T_k$. Another important assumption is that the time required for transmitting information is significantly less than the time required for moving, so the transmission time can be ignored when considering the optimal tracking problem.

4.2. Scenario 2: Optimal Transmission Power Allocation Formulation

Next, the channel fading model is discussed. In wireless communication, the transmitted signal loss between the remote station and an individual robot is related to the distance of the signal path [,]. Moreover, Ref. [] shows that the power attenuation can be described as a stochastic differential equation (SDE) if the lognormal channel fading model is adopted. Let $x_R \in \mathbb{R}^n$ denote the position of the fixed remote station; then the distance between robot i and the remote station (i.e., the transmitted path distance) can be represented as $e_{p,i}(t) = x_{s,i} - x_R$. Therefore, the dynamic equation of $e_{p,i}$ is derived as
d e_{p,i}(t) = \left[ f_p(e_{p,i}) + g_p(e_{p,i})\, u_{s,i} \right] dt + \sigma_s\, d w_{s,i},
where $f_p(e_{p,i}) = f(e_{p,i} + x_R)$ and $g_p(e_{p,i}) = g(e_{p,i} + x_R)$.
Letting $\alpha_i$ denote the power attenuation (loss) of the link between robot i and the remote station, this can be calculated as a function of the path distance, $\alpha_i = \exp(-e_{p,i}^{\top} e_{p,i})$ []. Thus, the actual received power at the remote station is $\alpha_i p_i$. To guarantee the quality of service (QoS), which is measured by the signal-to-interference-plus-noise ratio (SINR), the transmission power of the individual robots needs to be coordinated. In other words, a consensus on transmission power needs to be achieved across all agents. According to related research [,,,,], the SINR in a large population of users can be defined as
\xi_i(t) = \frac{\alpha_i(t)\, p_i(t)}{I_i(t) + \eta_i} = \frac{\alpha_i p_i}{\sum_{j \ne i}^{N} \beta_N \alpha_j p_j + \eta_i},
where $I_i$ is the interference and $\eta_i \ge 0$ denotes the noise power (variance) at the receiver node.
In (6), we approximate the aggregate interference coupling coefficient as $\beta_N \approx 1/N$. This assumption is not meant to describe all heterogeneous fading environments but serves as a tractable large-system surrogate that captures the dominant scaling behavior of interference with respect to the number of active users. When transmitters experience independent small-scale fading and random phases, the aggregate interference at a receiver can be modeled as a sum of N weakly correlated random variables. Under normalized transmitted power, each term contributes $O(1/N)$ to the total interference, and as N increases, the law of large numbers implies that the total interference remains bounded while the effective contribution of each interferer decays proportionally to $1/N$. This observation is consistent with shot-noise models of large wireless networks, where the interference field converges to a finite mean despite an increasing number of interferers [,]. In symmetric multi-user systems with homogeneous power control or scheduling, the interference power is effectively shared among N users, leading to an average cross-coupling that also scales inversely with N. Furthermore, the same approximation is adopted in other works on large-scale communication systems such as []. A numerical illustration of this scaling is sketched below.
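The following minimal Python sketch (illustrative only; the uniform attenuation and power ranges are assumptions, not taken from the simulation setup) checks that with $\beta_N = 1/N$ the aggregate interference $\sum_{j \ne i} \beta_N \alpha_j p_j$ stays essentially flat as N grows, while each individual interferer's contribution shrinks like $1/N$.

```python
import numpy as np

rng = np.random.default_rng(0)

def aggregate_interference(n_agents, n_trials=200):
    """Average aggregate interference seen by agent 0 with beta_N = 1/N."""
    totals = []
    for _ in range(n_trials):
        # Assumed (illustrative) distributions for attenuation and power.
        alpha = rng.uniform(0.1, 1.0, size=n_agents)   # per-link attenuation
        p = rng.uniform(0.0, 1.0, size=n_agents)       # normalized transmit power
        beta_n = 1.0 / n_agents
        # Interference at agent 0: sum over all other agents j != 0.
        totals.append(beta_n * (alpha[1:] * p[1:]).sum())
    return np.mean(totals)

for n in [10, 100, 1000, 10000]:
    total = aggregate_interference(n)
    print(f"N={n:6d}  aggregate interference ~ {total:.4f}  per-interferer ~ {total/(n-1):.2e}")
```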
The objective of communication QoS requires that the SINR be higher than or equal to a desired threshold, i.e., $\xi_i \ge \mu_i$. Meanwhile, the power allocation objective is to minimize the total power consumption of all users, i.e., $\min \sum_{j=1}^{N} p_j$. Based on [], the solution of the total power minimization subject to the QoS constraint is defined as
\frac{\alpha_i p_i}{\sum_{j \ne i}^{N} \beta_N \alpha_j p_j + \eta_i} = \mu_i > 0,
which is equivalent to
\frac{\alpha_i p_i}{\sum_{j \ne i} \beta_N \alpha_j p_j + \eta_i} = \mu_i
\;\Longleftrightarrow\;
\alpha_i p_i = \mu_i \Big( \sum_{j \ne i} \beta_N \alpha_j p_j + \eta_i \Big)
\;\Longleftrightarrow\;
\alpha_i p_i = \mu_i \Big( \sum_{j=1}^{N} \beta_N \alpha_j p_j - \beta_N \alpha_i p_i + \eta_i \Big)
\;\Longleftrightarrow\;
\alpha_i p_i \left( 1 + \mu_i \beta_N \right) = \mu_i \Big( \sum_{j=1}^{N} \beta_N \alpha_j p_j + \eta_i \Big)
\;\Longleftrightarrow\;
\frac{\alpha_i p_i}{\sum_{j=1}^{N} \beta_N \alpha_j p_j + \eta_i} = \frac{\mu_i}{1 + \beta_N \mu_i}.
With the assumption that $N \to \infty$, (8) can be calculated as
\lim_{N \to \infty} \frac{\alpha_i p_i}{\sum_{j=1}^{N} \beta_N \alpha_j p_j + \eta_i} = \lim_{N \to \infty} \frac{\mu_i}{1 + \beta_N \mu_i} = \mu_i.
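To make the finite-N behavior concrete, the following sketch iterates a standard target-SINR fixed-point power update (a classical parallel power-control iteration used here purely as an illustration; the attenuation values are assumed, while the target SINR of 0.6 and noise level of 0.1 match the later simulation setup). Each agent's achieved SINR converges to $\mu_i$, and the ratio taken against the full-sum denominator equals $\mu_i/(1 + \beta_N \mu_i)$, which approaches $\mu_i$ as N grows.

```python
import numpy as np

rng = np.random.default_rng(1)

def fixed_point_powers(alpha, mu, eta, n_iter=200):
    """Iterate p_i <- mu_i * (sum_{j != i} beta_N * alpha_j * p_j + eta_i) / alpha_i."""
    n = len(alpha)
    beta_n = 1.0 / n
    p = np.ones(n)
    for _ in range(n_iter):
        cross = beta_n * (alpha * p).sum()                    # full interference sum
        p = mu * (cross - beta_n * alpha * p + eta) / alpha   # exclude the agent's own term
    return p

for n in [10, 100, 1000]:
    alpha = rng.uniform(0.2, 1.0, size=n)    # assumed attenuation values
    mu, eta = 0.6, 0.1                       # target SINR and noise (as in the simulation setup)
    p = fixed_point_powers(alpha, mu * np.ones(n), eta * np.ones(n))
    beta_n = 1.0 / n
    full = beta_n * (alpha * p).sum() + eta
    sinr = alpha * p / (full - beta_n * alpha * p)            # denominator excludes own term
    print(f"N={n:5d}  SINR ~ {sinr.mean():.3f}  ratio vs full sum ~ {(alpha*p/full).mean():.3f}  "
          f"mu/(1+beta*mu) = {mu/(1+beta_n*mu):.3f}")
```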
Without loss of generality, the transmission power of the ith robot can be modeled and described similarly as in [], i.e.,
d p_i(t) = u_{p,i}\, dt + \sigma_p\, d w_{p,i}, \quad 1 \le i \le N,
where p i represents the transmission power and u p , i ( t ) represents the power-control command generated by the base station or scheduler to regulate user i’s transmitted power in real time. In practice, this control corresponds to the standard closed-loop power-control signal used in modern systems (e.g., LTE or 5G NR), which adjusts the transmitted power at sub-millisecond intervals based on SINR feedback. The stochastic differential equation formulation serves as the continuous-time limit of this discrete process, where the drift term governed by u p , i models the deterministic adjustment of power and the diffusion term captures random channel fluctuations and measurement noise. This abstraction preserves the bounded, feedback-driven nature of practical power-control loops while enabling tractable stochastic analysis. Furthermore, σ p d w p , i provides the random noise from the transmitter.
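As a quick illustration of this model, the sketch below simulates the power SDE with an Euler–Maruyama discretization under an assumed proportional feedback command $u_{p,i}$ (a placeholder; the actual command in this paper is the learned mean-field control derived in Section 5).

```python
import numpy as np

rng = np.random.default_rng(2)

def simulate_power(p0, target, sigma_p=0.2, k_fb=2.0, dt=1e-3, t_end=5.0):
    """Euler-Maruyama discretization of dp = u dt + sigma_p dw with an
    assumed proportional feedback u = -k_fb * (p - target)."""
    steps = int(t_end / dt)
    p = np.empty(steps + 1)
    p[0] = p0
    for k in range(steps):
        u = -k_fb * (p[k] - target)            # placeholder power-control command
        dw = rng.normal(0.0, np.sqrt(dt))      # Wiener increment
        p[k + 1] = p[k] + u * dt + sigma_p * dw
    return p

trace = simulate_power(p0=1.0, target=0.4)
print(f"initial power {trace[0]:.2f}, final power {trace[-1]:.2f} (noisy, near the target)")
```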
Therefore, the communication value function can be defined for each robot at time T with transmitting time T x as
V_{p,i}(p, \alpha, u_{p,i}) = \mathbb{E}\left[ \int_{T}^{T + T_x} \Phi_p(p, \alpha) + Q_p\, p_i^2 + R_p\, u_{p,i}^2 \; dt \right]
with
\Phi_p(p, \alpha) = \left[ \frac{\alpha_i p_i}{\mu_i} - \Big( \beta_N \sum_{j=1}^{N} \alpha_j p_j + \eta_i \Big) \right]^2,
where p and α represent the sets of all users' transmission powers and power attenuations, respectively, $\Phi_p(\cdot)$ is the coupling term that forces the transmitter to achieve the desired SINR, $R_p u_{p,i}^2$ represents the penalty on abrupt power adjustment, and $Q_p p_i^2$ represents the additional penalty on high power. Although the transmission horizon $T_x$ in (11) is not fixed in advance, the transmission stops when the remote station receives all data.
The objective of an individual robot is to find the optimal control u s , i and u p , i such that the value functions (3) and (11) are minimized. It is easy to observe from (3) and (11) that the coupling terms Φ s ( · ) and Φ p ( · ) require the current information from other robots, which is unattainable. To overcome these difficulties, Mean Field Games are applied where coupling functions are replaced by a function that is related to the PDF of all agents’ state information.

5. Mean Field Type Control

Mean Field Game (MFG) theory [] is an emerging technique that can effectively solve the stochastic decision-making and control problem with a large population of agents in a decentralized manner. In Mean Field Games, the information is encoded into a probability density function (PDF) which can be computed by a partial differential equation (PDE) named the Fokker–Planck–Kolmogorov (FPK) equation. The computed PDF overcomes the difficulty of collecting information from all other agents, as well as reducing the dimensions of the optimal control problem. In this section, the tasks in Scenarios 1 and 2 are formulated into two non-cooperative Mean Field Games. Then, a series of neural networks are designed and activated asynchronously to approximate the optimal solution. For example, when the robot is in Scenario 1, three NNs for Scenario 1 are activated, whereas Scenario 2’s NNs remain unchanged. The block diagram of the Actor–Critic–Mass (ACM) structure is demonstrated in Figure 3.
Figure 3. Block diagram of the Actor–Critic–Mass (ACM) architecture. Two sets of actor, critic, and mass neural networks are constructed to estimate the optimal control strategies for Scenario 1 and Scenario 2, respectively. These two sets are activated alternately based on the current scenario. Each neural network continuously interacts with the environment and updates its synaptic weights using real-time feedback, enabling online learning and adaptation.

5.1. Mean Field Power Allocation Games and Solution Learning

I first formulate Scenario 2 as a mean field type of transmission power allocation problem and derive the Actor–Critic–Mass structure to learn the optimal solution. As mentioned in the previous section, the coupling term Φ p ( · ) requires the current transmission power of other robots, which is unattainable. Therefore, I propose using the PDF to replace all agents’ transmission power p and power loss α .
Given a wireless channel, the PDFs of the power loss and transmission power for all robots, i.e., $\alpha \sim m_\alpha(\alpha, t)$ and $p \sim m_p(p, t)$, are used to compute the joint PDF of the received power of the i-th robot, i.e., $m_g(\alpha, p, t)$, with $\alpha p \in \Theta = \{ \alpha p \mid 0 \le \alpha p \le p_{max} \}$. Assuming $N \to \infty$, the SINR constraint can be replaced with the expected value as follows:
\mu_i(t) = \frac{\alpha_i(t)\, p_i(t)}{\mathbb{E}\big( \alpha(t)\, p(t) \big) + \eta_i}.
While $m_p(p,t)$ changes, $m_\alpha(\alpha,t)$ is fixed since all agents remain still during transmission. Because mobility and power allocation are considered separately, $\alpha$ and $p$ are treated as independent random variables, so $\mathbb{E}[m_g(\alpha, p, t)] = \mathbb{E}[m_\alpha(\alpha, t)]\, \mathbb{E}[m_p(p, t)]$.
Then, the communication value function (11) can be changed accordingly as
V_{p,i}(p_i, \alpha_i, u_{p,i}, m_g, m_p) = \mathbb{E}\left[ \int_{T}^{T + T_x} \Phi_p(p_i, \alpha_i, m_g, m_p) + Q_p\, p_i^2 + R_p\, u_{p,i}^2 \; dt \right]
with
\Phi_p(\cdot) = \left[ \frac{\alpha_i p_i}{\mu_i} - \Big( \mathbb{E}[m_\alpha(\alpha, t)]\, \mathbb{E}[m_p(p, t)] + \eta_i \Big) \right]^2.
To find the optimal strategy that minimizes the value function, the HJB equation is derived. According to optimal control theory [], the HJB equation is given as
0 = \partial_t V_{p,i}(p_i, m_p, t) + \frac{\sigma_p^2}{2}\, \partial_{pp} V_{p,i}(p_i, m_p, t) + H_p\big( p_i, \partial_p V_{p,i}(p_i, m_p, t) \big) + \Phi_p(p_i, \alpha_i, m_g, m_p),
where
H_p\big( p_i, \partial_p V_{p,i}(p_i, m_p, t) \big) = Q_p\, p_i^2 + R_p\, u_{p,i}^2 + \partial_p V_{p,i}(p_i, m_p, t)\, u_{p,i}.
The optimal power adjustment strategy for an individual agent can be represented as
u_{p,i}^{*}(p_i, m_p) = -\frac{1}{2} R_p^{-1}\, \partial_p V_{p,i}^{*}(p_i, m_p, t).
To solve the HJB, i.e., Equation (16), the attenuation mass distribution m α ( α , t ) and the transmission power mass distribution m p ( p , t ) are needed. Recalling the Mean Field Game (MFG) [], the probability density function (PDF) m p ( p , t ) can be attained by solving the FPK equation:
0 = \partial_t m_p(p, t) - \frac{\sigma_p^2}{2}\, \partial_{pp} m_p(p, t) + \partial_p \big[ u_{p,i}\, m_p(p, t) \big], \qquad m_p(p, 0) = m_{p,0}(p).
The expected value $\mathbb{E}[m_\alpha(\alpha, t)]$ can be calculated by substituting the monotonic power attenuation function $\alpha_i = \exp(-e_{p,i}^{\top} e_{p,i})$. Furthermore, the PDF of the transmitted path loss $e_{p,i}$ in (5), $m_g$, can also be solved by the following FPK equation:
0 = \partial_t m_g(e_p, t) - \frac{\sigma_s^2}{2}\, \Delta m_g(e_p, t) + \operatorname{div}\!\Big( m_g \big[ f_p(e_{p,i}) + g_p(e_{p,i})\, u_{s,i} \big] \Big), \qquad m_g(e_p, 0) = m_{g0}(e_p),
where u s , i is the motion control of the robot that is known in Scenario 2 and m g 0 ( e p ) is the initial transmitted path loss distribution.
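To illustrate how an FPK residual such as (19) can be evaluated numerically, the sketch below discretizes a candidate density on a one-dimensional power grid and computes the residual with central finite differences (a didactic stand-in; in the proposed design this residual is produced by the mass neural network rather than a grid, and the Gaussian density and constant control field below are assumptions).

```python
import numpy as np

def fpk_residual(m_prev, m_curr, u, p_grid, dt, sigma_p=0.2):
    """Residual of d/dt m - (sigma^2/2) d^2m/dp^2 + d/dp (u * m) on a 1-D grid.
    m_prev, m_curr: densities at two consecutive time steps; u: control field on the grid."""
    dp = p_grid[1] - p_grid[0]
    dm_dt = (m_curr - m_prev) / dt
    d2m_dp2 = np.gradient(np.gradient(m_curr, dp), dp)   # second derivative in p
    flux_div = np.gradient(u * m_curr, dp)               # d/dp of the drift flux
    return dm_dt - 0.5 * sigma_p ** 2 * d2m_dp2 + flux_div

# Tiny usage example with an assumed Gaussian density drifting toward lower power.
p_grid = np.linspace(0.0, 2.0, 201)
gauss = lambda c: np.exp(-(p_grid - c) ** 2 / (2 * 0.1 ** 2)) / np.sqrt(2 * np.pi * 0.1 ** 2)
u_field = -0.5 * np.ones_like(p_grid)                    # assumed constant power command
res = fpk_residual(gauss(1.0), gauss(0.995), u_field, p_grid, dt=0.01)
print("mean |FPK residual| on the grid:", np.abs(res).mean())
```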
Finally, the optimal strategy based on a Mean Field Game for all agents is officially defined. Considering the aforementioned value function (14), the optimization of the transmission power allocation for N players, i.e., Scenario 2, can be formulated as a non-cooperative game.
In Scenario 2, each mobile sensing robot (player) tries to minimize the value function given in (14) by computing a dynamic transmission power adjustment rate $u_{p,i}$. Since the coupling effect is considered in the value function, the Nash equilibrium is the optimal strategy for individual agents as the number of agents tends to infinity. Let $\Omega_p(t) = \{ u_{p,1}(t), \ldots, u_{p,N}(t) \} \in \mathbb{R}^N$ denote the set of power adjustment actions of all mobile sensing robots at time t. Define a mapping $F_p: (p(t), \Omega_p(t)) \to u_{p,i}(t)$ to represent the power adjustment strategy for agent i, and denote by $U_p$ the set of admissible power adjustment strategies for an individual agent. Then, the optimal strategy equilibrium, i.e., the $\epsilon_N$ Nash Equilibrium, can be defined as follows.
Definition 1 ( ϵ N Nash Equilibrium (NE) of Scenario 2).
Given the power set $p(t)$ and action set $\Omega_p(t)$ at any time t, the $\epsilon_N$ Nash Equilibrium (NE) of the N-player non-cooperative power allocation game is a strategy set $\Omega_p^* = (u_{p,1}^*, \ldots, u_{p,N}^*)$ that is generated by $u_{p,i}^* = F_p(p(t), \Omega_p(t))$ and satisfies the following conditions:
V_{p,i}\big( p(t), u_{p,i}^{*} \big) \le V_{p,i}\big( p(t), u_{p,i} \big) + \epsilon_N, \quad \forall\, u_{p,i} \in U_p \setminus \{ u_{p,i}^{*} \}, \;\; i = 1, 2, 3, \ldots, N,
where ϵ N > 0 .
According to recent studies [,], the solution to the coupled Hamilton–Jacobi–Bellman (HJB) and Fokker–Planck–Kolmogorov (FPK) equations yields an $\epsilon_N$-Nash equilibrium, where $\lim_{N \to \infty} \epsilon_N = 0$.
Remark 2.
Unlike conventional distributed control methods, which require precise real-time information from neighboring agents, the Mean Field Game (MFG)-based decentralized control framework allows each agent to make decisions based on local information and the aggregate effect of the entire multi-agent system (MAS). This aggregate influence is captured through the probability density function (PDF) of agents’ transmission powers, denoted by m p ( p , t ) . Notably, this PDF can be computed using the FPK equation [], without requiring direct access to other agents’ instantaneous states or actions.
To determine the optimal transmission power, one must simultaneously solve the coupled Hamilton–Jacobi–Bellman (HJB) and Fokker–Planck–Kolmogorov (FPK) equations. However, this is inherently challenging: the HJB equation, i.e., Equation (16), is a backward-in-time partial differential equation (PDE), whereas the FPK equation, Equation (19), evolves forward in time. This opposing temporal structure makes the real-time solution of the mean field design highly complex.
To address this challenge, a novel online adaptive reinforcement learning framework, termed the Actor–Critic–Mass (ACM) architecture, is proposed. In this framework, three neural networks—the actor, the critic, and the mass network—are constructed to approximate the solutions to Equation (18), Equation (16), and Equation (19), respectively.
Assuming ideal neural network weights $W_{V,p,i}$, $W_{u,p,i}$, and $W_{m,p,i}$, respectively, the optimal value function, control policy, and mass distribution can be approximated as follows:
V_{p,i}^{*}(p_i, m_p, t) = W_{V,p,i}^{\top}\, \phi_{V,p,i}(p_i(t)) + \varepsilon_{V,p,i}, \qquad u_{p,i}^{*}(p_i, m_p) = W_{u,p,i}^{\top}\, \phi_{u,p,i}(p_i(t)) + \varepsilon_{u,p,i}, \qquad m_p(p_i, t) = W_{m,p,i}^{\top}\, \phi_{m,p,i}(p_i(t)) + \varepsilon_{m,p,i},
where $\phi_{V,p,i}(p_i(t))$, $\phi_{u,p,i}(p_i(t))$, and $\phi_{m,p,i}(p_i(t))$ are bounded, continuous activation functions, and $\varepsilon_{V,p,i}$, $\varepsilon_{u,p,i}$, and $\varepsilon_{m,p,i}$ denote the neural network approximation errors.
Correspondingly, the learned approximations for the value function, power control strategy, and mass distribution are expressed as
\hat{V}_{p,i}(p_i, \hat{m}_{p,i}, t) = \hat{W}_{V,p,i}^{\top}\, \hat{\phi}_{V,p,i}, \qquad \hat{u}_{p,i}(p_i, \hat{m}_{p,i}) = \hat{W}_{u,p,i}^{\top}\, \hat{\phi}_{u,p,i}, \qquad \hat{m}_{p,i}(p_i, t) = \hat{W}_{m,p,i}^{\top}\, \hat{\phi}_{m,p,i},
where $\hat{\phi}_{V,p,i}$, $\hat{\phi}_{u,p,i}$, and $\hat{\phi}_{m,p,i}$ represent the online-estimated basis functions corresponding to each approximated quantity.
Substituting the neural network approximations in Equation (23) into the original HJB equation (16), the optimal control law (18), and the FPK equation (19), the equations no longer hold exactly. Instead, residual errors are introduced, which are then used to update the critic, actor, and mass networks over time. These residuals are defined as follows:
e_{HJB,p,i} = \partial_t \hat{V}_{p,i}(p_i, \hat{m}_{p,i}, t) + \frac{\sigma_p^2}{2}\, \partial_{pp} \hat{V}_{p,i}(p_i, \hat{m}_{p,i}, t) + H_p\big( p_i, \partial_p \hat{V}_{p,i} \big) + \Phi_p(p_i, \alpha_i, \hat{m}_{g}, \hat{m}_{p,i}),
e_{FPK,p,i} = \partial_t \hat{m}_{p,i}(p, t) - \frac{\sigma_p^2}{2}\, \partial_{pp} \hat{m}_{p,i}(p, t) + \partial_p\big( \hat{u}_{p,i}\, \hat{m}_{p,i} \big),
e_{u,p,i} = \hat{u}_{p,i} + \frac{1}{2} R_p^{-1}\, \partial_p \hat{V}_{p,i}(p_i, \hat{m}_{p,i}, t).
The residual terms in Equations (24)–(26) are obtained by substituting the neural network approximations into the original HJB, FPK, and optimal control equations, respectively, and represent the resulting approximation errors when these equations are not exactly satisfied.
According to Equation (22), these approximation errors vanish when the ideal neural network weights are reached. Hence, a gradient descent-based update law is applied to minimize the residuals and iteratively learn the optimal weights in Equation (22):
Critic: \dot{\hat{W}}_{V,p,i} = -\alpha_{h,p,i}\, \nabla_{\hat{W}_{V,p,i}}\, e_{HJB,p,i}^{2}
Mass: \dot{\hat{W}}_{m,p,i} = -\alpha_{m,p,i}\, \nabla_{\hat{W}_{m,p,i}}\, e_{FPK,p,i}^{2}
Actor: \dot{\hat{W}}_{u,p,i} = -\alpha_{u,p,i}\, \nabla_{\hat{W}_{u,p,i}}\, e_{u,p,i}^{2}
where α h , p , i , α m , p , i , and α u , p , i denote the learning rates for the critic, mass, and actor networks, respectively.
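A compact Python sketch of these update laws for a single agent in Scenario 2 is given below. It uses simple polynomial basis functions, drops the explicit time derivatives (i.e., the stationary form also used in the residual analysis of Section 6), and discretizes Equations (27)–(29) with forward-Euler steps; the basis choices, learning rates, own attenuation, and population-average attenuation are illustrative assumptions rather than the exact implementation used in the simulations.

```python
import numpy as np

rng = np.random.default_rng(3)

# Linear-in-weights approximators with simple polynomial bases (an assumption).
phi   = lambda p: np.array([1.0, p, p ** 2])     # basis for V-hat and m-hat
dphi  = lambda p: np.array([0.0, 1.0, 2 * p])    # d(phi)/dp
ddphi = lambda p: np.array([0.0, 0.0, 2.0])      # d^2(phi)/dp^2
phi_u = lambda p: np.array([1.0, p])             # basis for u-hat

W_V = rng.normal(scale=0.1, size=3)              # critic weights
W_m = np.array([0.5, 0.0, 0.0])                  # mass weights (flat initial density)
W_u = rng.normal(scale=0.1, size=2)              # actor weights

Q_p, R_p, sigma_p = 1.0, 1.0, 0.2                # cost weights and diffusion (from the setup)
mu, eta, alpha_i  = 0.6, 0.1, 0.8                # target SINR, noise, assumed own attenuation
E_alpha           = 0.5                          # assumed population-average attenuation
a_h, a_m, a_u     = 1e-4, 1e-4, 1e-3             # learning rates (assumed)
p_grid = np.linspace(0.0, 2.0, 41)

for step in range(2000):
    p = rng.uniform(0.0, 2.0)                    # sampled power state for this update
    u_hat, dV_dp, d2V_dp2 = W_u @ phi_u(p), W_V @ dphi(p), W_V @ ddphi(p)

    # Expected power under the (clipped) estimated density m-hat, via a Riemann sum.
    m_vals = np.clip(np.array([W_m @ phi(pg) for pg in p_grid]), 1e-6, None)
    E_p = float((p_grid * m_vals).sum() / m_vals.sum())

    # Critic: stationary HJB residual (cf. Eq. (24)) and gradient step (cf. Eq. (27)).
    Phi_p = (alpha_i * p / mu - (E_alpha * E_p + eta)) ** 2
    H_p   = Q_p * p ** 2 + R_p * u_hat ** 2 + dV_dp * u_hat
    e_hjb = 0.5 * sigma_p ** 2 * d2V_dp2 + H_p + Phi_p
    W_V  -= a_h * 2 * e_hjb * (0.5 * sigma_p ** 2 * ddphi(p) + u_hat * dphi(p))

    # Mass: stationary FPK residual (cf. Eq. (25)) and gradient step (cf. Eq. (28)).
    m_hat, dm_dp, d2m_dp2 = W_m @ phi(p), W_m @ dphi(p), W_m @ ddphi(p)
    du_dp = W_u[1]                               # derivative of the affine actor output
    e_fpk = -0.5 * sigma_p ** 2 * d2m_dp2 + du_dp * m_hat + u_hat * dm_dp
    W_m  -= a_m * 2 * e_fpk * (-0.5 * sigma_p ** 2 * ddphi(p) + du_dp * phi(p) + u_hat * dphi(p))

    # Actor: policy residual (cf. Eq. (26)) and gradient step (cf. Eq. (29)).
    e_u  = u_hat + 0.5 / R_p * dV_dp
    W_u -= a_u * 2 * e_u * phi_u(p)

print("final residual magnitudes:", abs(e_hjb), abs(e_fpk), abs(e_u))
```

Each network is updated only through the gradient of its own residual, mirroring the semi-gradient structure of (27)–(29).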

5.2. Mean Field Optimal Navigation Games

As shown in [], the decentralized navigation problem for large-scale mobile sensor networks can be formulated as a mean-field-type optimal tracking control problem. Let m s ( e s , t ) denote the time-varying probability density function (PDF) of the agents’ tracking errors at time t. It is assumed that the coupling function Φ s ( e s , i , E ) depends on the distribution of tracking errors and can be expressed as Φ s ( e s , i , m s ( e s , i , t ) ) . With this, the cost function in Equation (3) becomes
V_{s,i}(e_{s,i}, m_s, u_{s,i}) = \mathbb{E}\left[ \int_{T_k}^{T} L_s(e_{s,i}, u_{s,i}) + \Phi_s(e_{s,i}, m_s)\; dt \right].
To minimize this cost, the corresponding Hamilton–Jacobi–Bellman (HJB) equation can be derived as in []:
0 = \partial_t V_{s,i}(e_{s,i}, m_s, t) + \frac{\sigma_s^2}{2}\, \Delta V_{s,i}(e_{s,i}, m_s, t) + H_s\big( e_{s,i}, \nabla_{e_s} V_{s,i}(e_{s,i}, m_s, t) \big) + \Phi_s(e_{s,i}, m_s),
where the Hamiltonian H s ( · ) is given by
H_s\big( e_{s,i}, \nabla_{e_s} V_{s,i} \big) = e_{s,i}^{\top} Q_s\, e_{s,i} + u_{s,i}^{\top} R_s\, u_{s,i} + \big( \nabla_{e_s} V_{s,i} \big)^{\top} \big[ f_s(e_{s,i}) + g_s(e_{s,i})\, u_{s,i} \big].
Solving the HJB equation yields the optimal cost-to-go function and control policy. The optimal control input is given by
u_{s,i}^{*}(e_{s,i}) = -\frac{1}{2} R_s^{-1}\, g_s^{\top}(e_{s,i})\, \nabla_{e_{s,i}} V_{s,i}^{*}(e_{s,i}, m_s, t).
Meanwhile, the tracking error distribution m s ( e s , t ) evolves according to the Fokker–Planck–Kolmogorov (FPK) equation:
0 = \partial_t m_s(e_{s,i}, t) - \frac{\sigma_s^2}{2}\, \Delta m_s(e_{s,i}, t) + \nabla \cdot \Big( m_s\, D_p H_s\big( e_{s,i}, \nabla_{e_{s,i}} V_{s,i}(e_{s,i}, m_s, t) \big) \Big), \qquad m_s(e_s, 0) = m_{s,0}(e_s),
where D p denotes differentiation with respect to the second argument. Solving the coupled HJB–FPK equations yields the ϵ N -Nash equilibrium for Scenario 1.
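As a small concrete illustration of Equation (33), the helper below evaluates the optimal navigation input from a critic-supplied value gradient, assuming single-integrator error dynamics (g_s(e) = I) and a purely illustrative quadratic critic.

```python
import numpy as np

def optimal_nav_control(grad_V, g_s, R_s):
    """u* = -1/2 * R_s^{-1} * g_s^T * grad_V  (cf. Eq. (33))."""
    return -0.5 * np.linalg.solve(R_s, g_s.T @ grad_V)

# Usage example with an assumed quadratic critic V(e) = e^T P e, so grad_V = 2 P e.
R_s = np.eye(2)                    # control weight as in the simulation setup
g_s = np.eye(2)                    # assumed single-integrator input matrix
P   = np.diag([3.0, 3.0])          # illustrative critic curvature
e   = np.array([0.4, -0.2])        # current tracking error
u_star = optimal_nav_control(2 * P @ e, g_s, R_s)
print("optimal control:", u_star)  # points opposite the tracking error
```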
To avoid solving the coupled PDEs directly, I employ a mean field reinforcement learning approach to approximate the optimal value function, control policy, and distribution. These are represented using neural networks as follows:
\hat{V}_{s,i}(e_{s,i}, m_s, t) = \hat{W}_{V,s,i}^{\top}\, \hat{\phi}_{V,s,i}, \qquad \hat{u}_{s,i}(e_{s,i}, m_s) = \hat{W}_{u,s,i}^{\top}\, \hat{\phi}_{u,s,i}, \qquad \hat{m}_{s,i}(e_s, t) = \hat{W}_{m,s,i}^{\top}\, \hat{\phi}_{m,s,i}.
The weights of the neural networks are updated via gradient descent using the residual errors from the HJB, FPK, and control equations:
Critic: \dot{\hat{W}}_{V,s,i} = -\alpha_{h,s,i}\, \nabla_{\hat{W}_{V,s,i}}\, e_{HJB,s,i}^{2}
Mass: \dot{\hat{W}}_{m,s,i} = -\alpha_{m,s,i}\, \nabla_{\hat{W}_{m,s,i}}\, e_{FPK,s,i}^{2}
Actor: \dot{\hat{W}}_{u,s,i} = -\alpha_{u,s,i}\, \nabla_{\hat{W}_{u,s,i}}\, e_{u,s,i}^{2}
Here, e H J B , s , i , e F P K , s , i , and e u , s , i denote the residual errors derived from the HJB Equation (31), the FPK Equation (34), and the control policy (33), respectively.

5.3. Convergence of Neural Network Weights

To ensure the reliability of the proposed learning framework, it is essential to analyze the convergence properties of the neural networks. Given the structural similarity between the neural networks used in Scenario 1 and Scenario 2, I focus our analysis on Scenario 2 without loss of generality.
Recall the update laws defined in Equations (27)–(29). The weight estimation errors of the actor, critic, and mass networks can be expressed as
Critic NN: \dot{\tilde{W}}_{V,p,i}(t) = -\dot{\hat{W}}_{V,p,i}(t)
Mass NN: \dot{\tilde{W}}_{m,p,i}(t) = -\dot{\hat{W}}_{m,p,i}(t)
Actor NN: \dot{\tilde{W}}_{u,p,i}(t) = -\dot{\hat{W}}_{u,p,i}(t)
where the weight estimation error is defined as W ˜ = W W ^ , with W ˜ V , p , i , W ˜ m , p , i , and W ˜ u , p , i representing the respective errors for the critic, mass, and actor networks.
Firstly, a lemma regarding the closed-loop dynamics is proposed.
Lemma 1.
Consider the continuous-time system described by Equation (10). There exists an optimal mean field-type control input u p , i * such that the resulting closed-loop system dynamics,
u_{p,i}^{*} + \sigma_p\, \frac{d w_{p,i}}{d t}, \quad 0 < i \le N,
satisfy the inequality
p_i \left( u_{p,i}^{*} + \sigma_p\, \frac{d w_{p,i}}{d t} \right) \le -\gamma\, p_i^{2},
where γ > 0 is a positive constant.
Based on this lemma, the performance of the Actor–Critic–Mass (ACM) learning algorithm in Scenario 2 can be analyzed through the following theorem.
Theorem 1 (Closed-loop Stability).
Let the initial control input be admissible, and suppose the actor, critic, and mass neural network weights are initialized within a compact set. Assume the neural network weight update laws are given by Equations (27)–(29). Then, there exist positive constants α h , p , i , α m , p , i , and α u , p , i such that the transmission power p i and the weight estimation errors W ˜ V , p , i , W ˜ m , p , i , and W ˜ u , p , i of the critic, mass, and actor networks, respectively, are all uniformly ultimately bounded (UUB). Moreover, the estimated value function V ^ p , i , mass function m ^ p , i , and control input u ^ p , i are also UUB. If the neural networks are sufficiently expressive (i.e., with enough neurons and a properly designed architecture), the reconstruction errors can be made arbitrarily small. Under such ideal conditions, the system state p i and the weight estimation errors W ˜ V , p , i ( t ) , W ˜ m , p , i ( t ) , and W ˜ u , p , i ( t ) will asymptotically converge to zero, ensuring asymptotic stability of the closed-loop system.
Proof. 
A brief proof sketch is provided: Consider the transmission power dynamics in Equation (10) and the neural network approximations in Equation (23). Let the ideal network weights be W V , p , i , W u , p , i , and W m , p , i , and their online estimates be W ^ V , p , i , W ^ u , p , i , and W ^ m , p , i . Define the corresponding weight estimation errors as W ˜ V , p , i = W V , p , i W ^ V , p , i , W ˜ u , p , i = W u , p , i W ^ u , p , i , and W ˜ m , p , i = W m , p , i W ^ m , p , i . The residual errors e H J B , p , i , e F P K , p , i , and e u , p , i are defined in Equations (24)–(26) as the deviations obtained when the approximated neural networks are substituted into the corresponding Hamilton–Jacobi–Bellman (HJB), Fokker–Planck–Kolmogorov (FPK), and optimal control equations, respectively. The update laws in Equations (27)–(29) minimize these residuals through gradient descent with learning rates α h , p , i , α m , p , i , and α u , p , i . To analyze convergence, consider the following Lyapunov candidate function:
L = c_1\, p_i^{2} + \frac{1}{2} \tilde{W}_{V,p,i}^{\top} \tilde{W}_{V,p,i} + \frac{1}{2} \tilde{W}_{u,p,i}^{\top} \tilde{W}_{u,p,i} + \frac{1}{2} \tilde{W}_{m,p,i}^{\top} \tilde{W}_{m,p,i},
where c 1 > 0 is a constant and p i is the transmission power of agent i. Differentiating L along the trajectories of the closed-loop system and substituting the learning laws (27)–(29) yields
\dot{L} \le -\underline{\lambda}\, p_i^{2} - \alpha_{h,p,i} \big\| \nabla_{\hat{W}_{V,p,i}}\, e_{HJB,p,i} \big\|^{2} - \alpha_{u,p,i} \big\| \nabla_{\hat{W}_{u,p,i}}\, e_{u,p,i} \big\|^{2} - \alpha_{m,p,i} \big\| \nabla_{\hat{W}_{m,p,i}}\, e_{FPK,p,i} \big\|^{2} + C\, \varepsilon,
where λ ̲ , C > 0 are constants and ε denotes the bounded neural network approximation errors introduced in Equation (22). The inequality shows that L ˙ is negative definite up to a small residual term proportional to ε , implying that both the transmission power p i and the weight estimation errors remain uniformly ultimately bounded (UUB). Under the standard universal approximation theorem, if the neural networks are sufficiently expressive such that the approximation errors ε can be made arbitrarily small, then L ˙ < 0 and the system trajectories asymptotically converge to the optimal solution ( V p , i , u p , i , m p , i ) . This establishes the stability and convergence results stated in Theorem 1.    □
The performance for the mean field navigation game can be similarly obtained. Furthermore, the algorithm can be summarized as a pseudocode in Algorithm 1.
Algorithm 1 ACM for Mean Field Navigation (Scenario 1) and Power Control (Scenario 2)
  • Require: Dynamics f s , g s , diffusivities σ s , σ p , costs ( Q s , R s ) , ( Q p , R p ) , schedule C ( t ) , stepsizes α h , α m , α u
  • 1: Initialize weights ( W ^ V , 1 , W ^ u , 1 , W ^ m , 1 ) and ( W ^ V , 2 , W ^ u , 2 , W ^ m , 2 )
  • 2: Define Σ_1 ← σ_s σ_sᵀ, Σ_2 ← σ_p²
  • 3: for  t = 0 , 1 , , T 1  do
  • 4:          c ← C(t)             ▹ c = 1 navigation, c = 2 communication
  • 5:         if  c = 1  then
  • 6:           ξ_t ← e_t = x_t − A_k(t),    G ← g_s(e_t),    Φ(ξ_t, m̂) ← Φ_s(e_t, m̂_s),    Σ ← Σ_1
  • 7:         else
  • 8:           ξ_t ← p_t,    G ← I,    Φ(ξ_t, m̂) ← Φ_p(p_t, α_t, m̂),    Σ ← Σ_2
  • 9:         end if
  • 10:       Mass step (FPK residual):
  • 11:        m̂_c(·, t) ← MassNN_c(ξ_t, t);     m̂_c ← Normalize(softplus(m̂_c))
  • 12:        e_FPK ← ∂_t m̂_c − ½ ∇·(Σ ∇m̂_c) + ∇·(û_c m̂_c)
  • 13:        Ŵ_{m,c} ← Ŵ_{m,c} − α_m ∇_{Ŵ_{m,c}} e_FPK²
  • 14:       Critic step (HJB residual):
  • 15:        V̂_c ← CriticNN_c(ξ_t, t, m̂_c);    g_V ← ∇_ξ V̂_c;    H ← ξ_tᵀ Q_c ξ_t + u_tᵀ R_c u_t + g_Vᵀ (f_c(ξ_t) + G u_t) if c = 1,  or  H ← Q_p p_t² + R_p u_t² + g_Vᵀ u_t if c = 2
  • 16:        e_HJB ← ∂_t V̂_c + ½ tr(Σ ∇²_ξ V̂_c) + H + Φ(ξ_t, m̂_c)
  • 17:        Ŵ_{V,c} ← Ŵ_{V,c} − α_h ∇_{Ŵ_{V,c}} e_HJB²
  • 18:       Actor step (policy residual):
  • 19:        u t ActorNN c ( ξ t , m ^ c )
  • 20:        r_u ← u_t + ½ R_c^{-1} Gᵀ g_V if c = 1,  or  r_u ← u_t + ½ R_p^{-1} g_V if c = 2
  • 21:        Ŵ_{u,c} ← Ŵ_{u,c} − α_u ∇_{Ŵ_{u,c}} r_u²
  • 22:       Apply u t to environment, step to ( t + 1 ) , log instantaneous cost
  • 23: end for
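To show how the two sets of networks in Algorithm 1 are activated asynchronously, the following sketch wraps per-scenario update steps (supplied as callables, e.g., the single-agent update loop sketched in Section 5.1) behind the schedule C(t); the 20 s navigation and 15 s transmission phase durations follow the simulation setup in Section 6, while the callables themselves are placeholders.

```python
from typing import Callable, Dict

def schedule(t: float, nav_period: float = 20.0, tx_period: float = 15.0) -> int:
    """C(t): return 1 (navigation) or 2 (communication) based on the phase timeline."""
    cycle = nav_period + tx_period
    return 1 if (t % cycle) < nav_period else 2

def run_acm(total_time: float, dt: float, steps: Dict[int, Callable[[float], None]]) -> None:
    """Alternate the two ACM network sets: only the active scenario's networks are updated."""
    t = 0.0
    while t < total_time:
        c = schedule(t)
        steps[c](t)          # update actor/critic/mass networks of scenario c only
        t += dt

# Usage with placeholder update functions (stand-ins for the real network updates).
run_acm(total_time=70.0, dt=5.0, steps={
    1: lambda t: print(f"t={t:5.1f}s  Scenario 1: navigation ACM update"),
    2: lambda t: print(f"t={t:5.1f}s  Scenario 2: power-control ACM update"),
})
```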

6. Simulations

To validate the effectiveness of the proposed Actor–Critic–Mass (ACM) algorithm and its scalability in large-scale mobile wireless sensor networks, a series of numerical simulations are conducted. The simulations are designed to evaluate both navigation and communication performance under decentralized control. Specifically, the tracking accuracy, convergence of the mean field learning process, and communication quality of service (QoS) compared with a baseline decentralized algorithm are analyzed. The following subsections describe the simulation setup, performance metrics, and comparative results in detail.

Simulation Setup

The performance of the proposed decentralized multi-agent reinforcement learning algorithm is evaluated using a network of 1000 mobile wireless sensors. The simulation is conducted within a 6 × 5 m² workspace. All agents are randomly initialized within this area, and a remote station is located at the origin.
The mobile sensors are tasked with tracking four distinct trajectories corresponding to four areas of interest. Each trajectory A k ( t ) , for k = 1 , 2 , 3 , 4 , is defined as
A_1(t) = \begin{bmatrix} 0.4 \sin(2t - 2T_1) + 1.3 \\ 0.1 t - 0.1 T_1 + 1.7 \end{bmatrix}, \qquad
A_2(t) = \begin{bmatrix} 0.5 \sin(2t - 2T_2) + 0.05 t - 0.05 T_2 + 2.5 \\ 0.1 t - 0.1 T_2 + 1.5 \end{bmatrix},
A_3(t) = \begin{bmatrix} 0.1 t - 0.1 T_3 + 2.5 \\ 0.4 \sin(2t - 2T_3) + 1 \end{bmatrix}, \qquad
A_4(t) = \begin{bmatrix} 0.1 t - 0.1 T_4 + 1.1 \\ 0.4 \sin(2t - 2T_4) - 0.03 t + 0.03 T_4 + 1.7 \end{bmatrix},
where T k denotes the activation time for trajectory k.
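The reference trajectories can be tabulated with a few lines of Python; the snippet below follows the expressions above (note that the 1.3 offset in A_1 and the activation times T_k, inferred from the 20 s navigation plus 15 s transmission schedule, are reconstructions/assumptions).

```python
import numpy as np

def reference_trajectory(k: int, t: np.ndarray, T_k: float) -> np.ndarray:
    """Return the 2-D reference A_k(t) as an (len(t), 2) array; T_k is the activation time."""
    s = t - T_k                                   # time since activation
    if k == 1:
        return np.column_stack([0.4 * np.sin(2 * s) + 1.3, 0.1 * s + 1.7])
    if k == 2:
        return np.column_stack([0.5 * np.sin(2 * s) + 0.05 * s + 2.5, 0.1 * s + 1.5])
    if k == 3:
        return np.column_stack([0.1 * s + 2.5, 0.4 * np.sin(2 * s) + 1.0])
    if k == 4:
        return np.column_stack([0.1 * s + 1.1, 0.4 * np.sin(2 * s) - 0.03 * s + 1.7])
    raise ValueError("k must be 1..4")

t = np.linspace(0.0, 20.0, 201)                   # 20 s navigation window per area
for k, T_k in enumerate([0.0, 35.0, 70.0, 105.0], start=1):
    A = reference_trajectory(k, t + T_k, T_k)
    print(f"A_{k}: starts at {A[0].round(2)}, ends at {A[-1].round(2)}")
```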
Each mobile sensor spends 20 s navigating along each reference trajectory. After completing the navigation for one area, the agent spends 15 s transmitting its collected data back to the remote station. The overall simulation timeline is illustrated in Figure 4, where A k denotes the tracking phase for area k and P k denotes the corresponding data transmission phase.
Figure 4. Timeline of scenarios. P 1 stands for transmission location 1, and so on.
The desired transmission SINR is set to 0.6 , and the noise variance for all agents is η = 0.1 . The diffusion coefficients are set as σ s = 0.1 I 2 and σ p = 0.2 I 2 . The cost function parameters are chosen as Q s = 15 I 2 , Q p = 1 , R s = I 2 , and R p = 1 . The mean field coupling function from Equation (30) is defined as
\Phi_s(m_s, e_{s,i}, t) = \big\| e_{s,i}(t) - \mathbb{E}[m_s(e, t)] \big\|^{2}.
The initial distribution of agents’ positions is modeled as a Gaussian:
m_{s,0} \sim \mathcal{N}\!\left( \begin{bmatrix} 0.25 \\ 0.25 \end{bmatrix}, \begin{bmatrix} 0.3^2 & 0 \\ 0 & 0.3^2 \end{bmatrix} \right).
The simulation parameters are summarized in Table 1. A flow chart of the simulation process is provided in Figure 5. The actor, critic, and mass neural networks for both scenarios are updated separately. As shown, each agent starts by operating individually and alternates between the navigation and communication phases. In the navigation phase, the agent observes its state x s , i , estimates the population distribution m s by solving the Fokker–Planck–Kolmogorov (FPK) equation, and computes the value function V s from the Hamilton–Jacobi–Bellman (HJB) equation. A new optimal control action u s is then generated and implemented to update the agent’s trajectory. Similarly, in the communication phase, the agent observes its relative distance e s , i to the remote station, updates the power distribution m g via the FPK equation, and evaluates the value function V p using the HJB equation to produce the optimal transmission power u p . The new actions from both phases are implemented iteratively, allowing each agent to adapt its motion and power in real time until convergence to the mean field equilibrium occurs.
Table 1. Simulation parameters.
Figure 5. The flow chart of the algorithm.
To approximate the solutions of the HJB equations [Equations (31) and (16)], the FPK equations [Equations (34) and (19)], and the optimal control policies u s , i and u p , i , I design and deploy three neural networks: Critic NN, Mass NN, and Actor NN.
Figure 6 illustrates the trajectories of all mobile sensors throughout the simulation. The blue curves represent the desired reference trajectories, while the data transmission points are indicated by red squares. The thin colored curves correspond to the individual trajectories of the 1000 mobile sensing agents. It can be clearly observed that the agents successfully track the desired paths with high accuracy. The detailed average tracking errors and percentages are included in Table 2. On average, the tracking error over the four areas is 2.32% at the end of each stage.
Figure 6. The overall trajectory for all sensing robots in Scenario 1. The red squares represent the transmitting points. The green curve represents the tracking reference.
Table 2. Trajectory tracking errors by time.
In addition, Figure 7b shows the time evolution of the normalized average tracking error across all agents. The tracking error remains bounded and converges close to zero, demonstrating effective trajectory tracking performance. However, due to the inherent stochasticity in the agents’ motion dynamics, the tracking error does not converge exactly to zero, which is consistent with the presence of system noise. The state density plot is shown in Figure 8 and displays similar behavior.
Figure 7. (a) Time evolution of scenarios. This reveals the process of the sensing networks’ workflow with respect to time. (b) Time evolution of the normalized tracking error on the x axis (red curve) and y axis (blue curve). (c) Time evolution of the transmission power for a single agent (red curve) and the population average (blue curve).
Figure 8. Navigation state density distribution evolution.
To further analyze the learning performance at the individual agent level, I examine the HJB equation residual of Robot 1 in Scenario 1 (i.e., navigation). Figure 9 plots the HJB residual over time in a stationary setting, where both the HJB and FPK equations are assumed to be time-invariant. In this scenario, the mass neural network is updated every 1 s to compute E [ m s ( e s ) ] , which is then used to update the critic neural network. As shown in Figure 9, the HJB residual converges to zero within approximately 1 s. Moreover, the HJB error distribution among all robots is plotted in Figure 10.
Figure 9. The plots of Robot 1’s HJB equation error, which is also the critic NN error. The small figure on the right part of each plot is the HJB error in the last 10 s of moving in each area.
Figure 10. The plots of all robots’ HJB error distributions.
Given the convergence of both the tracking error and the HJB residual, it can be concluded that the actor, critic, and mass networks successfully converge in Scenario 1. This implies that the learned control policy, value function, and tracking error distribution converge to the solution of the mean field equation system. Therefore, the learned control policy represents the unique solution to the ϵ N -Nash equilibrium.
Next, I analyze the performance of the Actor–Critic–Mass (ACM) algorithm in Scenario 2 (i.e., transmission). Similarly to Scenario 1, a set of three neural networks is employed to solve the coupled HJB–FPK equation system for decentralized optimal transmission power control. The evolution of the signal-to-interference-plus-noise ratio (SINR) during data transmission is shown in Figure 11. Furthermore, the SINR distribution is also plotted in Figure 12. As observed, the SINR converges to the desired target of 0.6 , confirming the effectiveness of the ACM algorithm in transmission tasks.
Figure 11. The plots of the average SINR and Robot 1’s SINR in all transmission locations. The red dots show Robot 1’s SINR and the blue dots represent the average SINR among all robots.
Figure 12. SINR distribution plot.
Interestingly, at transmission locations p 3 and p 4, the initial SINR values are higher but subsequently decrease and stabilize near the target value. This behavior arises from the value function design in Equation (14), which penalizes high power outputs due to the energy constraints of the low-cost mobile sensing robots. Figure 7c further illustrates that during data transmission at p 3 and p 4, the robot actively lowers its transmission power, and hence its SINR, toward the target to reduce power consumption at the transmitter.
This power-saving behavior will next be compared with the performance of an alternative decentralized game theoretic method, namely the Parallel Update Algorithm (PUA) introduced in [].

7. Results and Analysis

To evaluate the effectiveness of the proposed ACM algorithm, it is compared with the Parallel Update Algorithm (PUA) introduced in []. The parameters for the PUA algorithm are selected as L = u = 1 , σ 2 = 0.1 , and λ = 0.78 , ensuring that the target SINR μ = 0.6 is achieved.
The transmission process in area A 1 is simulated using both the proposed ACM algorithm and the PUA algorithm. As shown in Figure 13, both methods achieve the target SINR. Furthermore, the total transmission power consumption across all agents converges to the same level, confirming that both algorithms reach a Nash equilibrium, as established in []. This validates the capability of the ACM algorithm to learn the optimal decentralized transmission strategy.
Figure 13. The transmitters’ power summation for all mobile sensing robots in Scenario 2. The red curve represents the ACM algorithm. The blue curve represents the PUA algorithm.
However, it is important to highlight key differences. The ACM algorithm updates neural network weights continuously, while the PUA algorithm performs updates every 1 s. The ACM algorithm uses 8.59 % less power than the PUA algorithm. Despite its slower update frequency, PUA converges faster to the Nash equilibrium. This is because the PUA algorithm does not explicitly account for the population influence (i.e., the mass effect), leading to a more aggressive strategy.
Nevertheless, Figure 13 clearly shows that the PUA algorithm results in higher transmission power levels compared to the ACM algorithm, particularly during the transmission intervals at p 3 and p 4 . Such power profiles are not ideal for energy-constrained mobile sensors. In contrast, the ACM algorithm, by modeling and responding to the mean field, promotes more energy-efficient behavior.
To evaluate the overall performance, a composite cost function that penalizes both SINR deviation and power consumption is defined:
J(t) = \frac{1}{N} \int_{t}^{T} \left[ \sum_{i=1}^{N} \big| \hat{\mu}_i(\tau) - \mu \big| + \sum_{i=1}^{N} p_i(\tau) \right] d\tau.
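A direct numerical evaluation of this composite cost from logged traces might look as follows (the SINR and power arrays are placeholders for simulated traces, and the trapezoidal tail integration is an implementation choice).

```python
import numpy as np

def composite_cost(t_grid, sinr, power, mu_target):
    """J(t): running cost over [t, T] penalizing SINR deviation and total power.
    sinr, power: arrays of shape (n_steps, n_agents) logged during transmission."""
    n_agents = sinr.shape[1]
    integrand = np.abs(sinr - mu_target).sum(axis=1) + power.sum(axis=1)
    dt = np.diff(t_grid)
    # Trapezoidal tail integral from each time t to the final time T.
    seg = 0.5 * (integrand[1:] + integrand[:-1]) * dt
    tail = np.concatenate([np.cumsum(seg[::-1])[::-1], [0.0]])
    return tail / n_agents

# Placeholder traces: 100 agents whose SINR settles to 0.6 while power decays.
t_grid = np.linspace(0.0, 15.0, 151)
sinr   = 0.6 + 0.2 * np.exp(-t_grid)[:, None] * np.ones((1, 100))
power  = 0.5 * np.exp(-0.3 * t_grid)[:, None] * np.ones((1, 100))
J = composite_cost(t_grid, sinr, power, mu_target=0.6)
print("J at t=0:", round(float(J[0]), 3), " J near T:", round(float(J[-2]), 4))
```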
The values of J ( t ) for both algorithms are plotted in Figure 14. The results demonstrate that the ACM algorithm achieves superior performance by maintaining SINR requirements while minimizing power consumption.
Figure 14. The transmitters’ cost for all sensing robots in Scenario 2. The red curve represents the ACM algorithm. The blue curve represents the PUA algorithm.
Notably, at approximately 21 s in Figure 13, the PUA algorithm exhibits an overshoot in total power. This behavior stems from its lack of coordination and foresight—each agent increases its transmission power independently, without considering the collective dynamics. In contrast, the ACM algorithm anticipates these interactions by estimating the power distribution (via the mass NN), thereby avoiding destructive competition.
This comparison highlights a fundamental advantage of Mean Field Games: they serve as an implicit coordination mechanism in non-cooperative multi-agent systems. Through mass feedback, each agent adapts to the aggregate behavior of the population without requiring explicit communication, enabling convergence to a socially efficient ϵ -Nash equilibrium. Moreover, this feature significantly reduces communication overheads and channel usage, making ACM particularly well-suited for large-scale mobile sensor networks.

8. Discussion

The proposed Actor–Critic–Mass (ACM) algorithm demonstrates scalable, decentralized optimization for large-scale mobile wireless sensor networks (MWSNs). By jointly solving the Hamilton–Jacobi–Bellman (HJB) and Fokker–Planck–Kolmogorov (FPK) equations through reinforcement learning, the ACM enables each agent to learn optimal control policies without inter-agent communication. The simulation results confirm that ACM maintains the desired signal-to-interference-plus-noise ratio (SINR) while minimizing transmission power, offering energy-efficient performance under dynamic network conditions.
Compared with recent reinforcement learning approaches, ACM provides comparable or superior energy savings while avoiding centralized coordination. El Jamous et al. [] achieved energy-efficient WiFi power control using deep RL but relied on a global critic and full state feedback. Choi et al. [] reported up to 70% power savings in 5G H-CRANs through centralized interference coordination, whereas ACM attains similar gains using only local feedback and the learned mean field distribution. Likewise, Soltani et al. [] demonstrated efficiency improvements in MARL-based routing but required synchronized updates, in contrast to ACM’s asynchronous and fully distributed updates.
The observed convergence of the HJB residuals and SINR trajectories aligns with mean field control theory, which predicts $\epsilon_N$-Nash convergence as $N \to \infty$ []. Unlike earlier mean field control formulations that depend on offline PDE solutions [], ACM learns the evolving population density online, achieving faster adaptation under stochastic conditions. This emergent, energy-aware coordination behavior is comparable to that seen in other MFG-based cooperative systems such as autonomous driving [].

9. Conclusions

This paper presents a novel decentralized co-optimization framework for mobile sensing and communication in large-scale wireless sensor networks, based on Mean Field Game (MFG) theory. The proposed Actor–Critic–Mass (ACM) algorithm leverages three neural networks (Actor NN, Critic NN, and Mass NN) to approximate the solution of the coupled Hamilton–Jacobi–Bellman (HJB) and Fokker–Planck–Kolmogorov (FPK) equations online. Two value functions are formulated to capture both navigation and communication objectives. Minimizing these functions ensures that the desired SINR is maintained and that the sensing agents accurately follow designated trajectories. The ACM algorithm achieves an average trajectory tracking error of 2.32% and uses 8.59% less power than the PUA algorithm. The resulting optimal policies are shown to approximate an ϵ-Nash equilibrium. Compared with traditional centralized and distributed optimization algorithms, the ACM framework significantly reduces communication overhead and computational complexity, enabling scalable, real-time implementation. Numerical simulations validate the effectiveness and efficiency of the approach, demonstrating high tracking accuracy and energy-efficient transmission.
Limitations: Despite these promising results, several simplifying assumptions constrain the present study. First, all agents were assumed to be homogeneous with identical dynamics and sensing capabilities; extending the framework to heterogeneous agents with distinct models and cost functions remains an open challenge. Second, communication latency, packet loss, and synchronization issues were neglected, yet such factors can significantly influence decentralized learning and coordination performance in practice. Third, the convergence guarantees rely on the infinite-population (mean field) limit, while practical deployments involve finite populations whose deviations from the mean field warrant further theoretical characterization. Future work will relax these assumptions by incorporating asynchronous updates, heterogeneous agent dynamics, and realistic wireless channel models to more accurately capture large-scale multi-agent behaviors in real-world environments.
Future Perspectives: Building on these results, several promising directions remain open. First, extending the ACM framework to heterogeneous multi-agent systems, where agents differ in sensing and communication capabilities, would improve applicability to realistic networks. Second, incorporating safety constraints, robustness against adversarial disturbances, and energy awareness into the MFG formulation is essential for deployment in safety-critical domains. Third, coupling ACM with online system identification or adaptive learning strategies could enhance robustness to dynamic or partially observed environments. Finally, exploring hierarchical and multi-layered networked systems—such as UAV–AUV–ground cooperative networks—offers a path toward general-purpose autonomous sensing and communication in real-world applications.
In practical deployments, additional challenges such as sensor drifts, intermittent communication, and packet loss can significantly affect system performance. Small but accumulating sensor drifts may distort the agents’ state estimates and consequently the learned mean field distribution. Future work will therefore incorporate online calibration mechanisms and adaptive filtering to mitigate such errors. Similarly, communication failures or delays can cause asynchronous updates among agents, leading to instability in distributed learning. To address this, we plan to integrate resilient consensus and delay compensation strategies into the ACM framework, allowing agents to maintain stability even under partial or delayed information exchange. These efforts will be validated through experiments on the planned large-scale mobile sensing testbed to ensure robustness under realistic operating conditions.

Funding

This research was funded by NASA EPSCoR award #80NSSC23M0170 and NASA Wyoming Space Grant #80NSSC20M0113.

Data Availability Statement

Data are contained within the article.

Acknowledgments

This article is a revised and expanded version of a paper entitled “Decentralized multi-agent reinforcement learning for large-scale mobile wireless sensor network control using mean field games”, which was presented at the 2024 33rd International Conference on Computer Communications and Networks (ICCCN) []. The author has used AI tools to improve the grammar.

Conflicts of Interest

The author declares no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
ACM	Actor–Critic–Mass
FPK	Fokker–Planck–Kolmogorov
HJB	Hamilton–Jacobi–Bellman
MFG	Mean Field Game
MWSN	Mobile Wireless Sensor Network
NN	Neural Network
PDE	Partial Differential Equation
PDF	Probability Density Function
PUA	Parallel Update Algorithm
QoS	Quality of Service
SINR	Signal-to-Interference-plus-Noise Ratio
SLAM	Simultaneous Localization and Mapping
UUB	Uniformly Ultimately Bounded
AGV	Automated Guided Vehicle
AI	Artificial Intelligence
AUV	Autonomous Underwater Vehicle
MAS	Multi-Agent System
POI	Point of Interest
UAV	Unmanned Aerial Vehicle
UCRL	Upper-Confidence Reinforcement Learning
V2X	Vehicle-to-Everything

References

  1. Liu, L.; Zheng, Z.; Zhu, S.; Chan, S.; Wu, C. Virtual-Mobile-Agent-Assisted Boundary Tracking for Continuous Objects in Underwater Acoustic Sensor Networks. IEEE Internet Things J. 2024, 11, 9171–9183. [Google Scholar] [CrossRef]
  2. Huang, P.; Zeng, L.; Chen, X.; Luo, K.; Zhou, Z.; Yu, S. Edge Robotics: Edge-Computing-Accelerated Multirobot Simultaneous Localization and Mapping. IEEE Internet Things J. 2022, 9, 14087–14102. [Google Scholar] [CrossRef]
  3. Fernández-Jiménez, F.J.; Dios, J.R.M.d. A Robot–Sensor Network Security Architecture for Monitoring Applications. IEEE Internet Things J. 2022, 9, 6288–6304. [Google Scholar] [CrossRef]
  4. Lee, J.S.; Jiang, H.T. An Extended Hierarchical Clustering Approach to Energy-Harvesting Mobile Wireless Sensor Networks. IEEE Internet Things J. 2021, 8, 7105–7114. [Google Scholar] [CrossRef]
  5. Su, Y.; Guo, L.; Jin, Z.; Fu, X. A Mobile-Beacon-Based Iterative Localization Mechanism in Large-Scale Underwater Acoustic Sensor Networks. IEEE Internet Things J. 2021, 8, 3653–3664. [Google Scholar] [CrossRef]
  6. Wang, D.; Chen, H.; Lao, S.; Drew, S. Efficient Path Planning and Dynamic Obstacle Avoidance in Edge for Safe Navigation of USV. IEEE Internet Things J. 2024, 11, 10084–10094. [Google Scholar] [CrossRef]
  7. Ma, C.; Li, A.; Du, Y.; Dong, H.; Yang, Y. Efficient and scalable reinforcement learning for large-scale network control. Nat. Mach. Intell. 2024, 6, 1006–1020. [Google Scholar] [CrossRef]
  8. Huang, M.; Caines, P.E.; Charalambous, C.D. Stochastic power control for wireless systems: Classical and viscosity solutions. In Proceedings of the 40th IEEE Conference on Decision and Control (Cat. No. 01CH37228), Orlando, FL, USA, 4–7 December 2001; Volume 2, pp. 1037–1042. [Google Scholar]
  9. Kafetzis, D.; Vassilaras, S.; Vardoulias, G.; Koutsopoulos, I. Software-defined networking meets software-defined radio in mobile ad hoc networks: State of the art and future directions. IEEE Access 2022, 10, 9989–10014. [Google Scholar] [CrossRef]
  10. Zhou, Z.; Xu, H. Decentralized Adaptive Optimal Tracking Control for Massive Multi-agent Systems: An Actor-Critic-Mass Algorithm. In Proceedings of the 58th IEEE Conference on Decision and Control, Nice, France, 11–13 December 2019. [Google Scholar]
  11. Zhou, Z.; Xu, H. Decentralized Adaptive Optimal Control for Massive Multi-agent Systems Using Mean Field Game with Self-Organizing Neural Networks. In Proceedings of the 58th IEEE Conference on Decision and Control, Nice, France, 11–13 December 2019. [Google Scholar]
  12. Guéant, O.; Lasry, J.M.; Lions, P.L. Mean field games and applications. In Paris-Princeton Lectures on Mathematical Finance 2010; Springer: Berlin/Heidelberg, Germany, 2011; pp. 205–266. [Google Scholar]
  13. Lasry, J.M.; Lions, P.L. Mean field games. Jpn. J. Math. 2007, 2, 229–260. [Google Scholar] [CrossRef]
  14. Prag, K.; Woolway, M.; Celik, T. Toward data-driven optimal control: A systematic review of the landscape. IEEE Access 2022, 10, 32190–32212. [Google Scholar] [CrossRef]
  15. Huang, M.; Caines, P.E.; Malhamé, R.P. Large-population cost-coupled LQG problems with nonuniform agents: Individual-mass behavior and decentralized ε-Nash equilibria. IEEE Trans. Autom. Control 2007, 52, 1560–1571. [Google Scholar] [CrossRef]
  16. Huang, M.; Sheu, S.; Sun, L. Mean field social optimization: Feedback person-by-person optimality and the dynamic programming equation. In Proceedings of the 2020 59th IEEE Conference on Decision and Control (CDC), Jeju, Republic of Korea, 14–18 December 2020. [Google Scholar]
  17. Cardaliaguet, P.; Porretta, A. An Introduction to Mean Field Game Theory; Springer: Berlin/Heidelberg, Germany, 2021; pp. 1–158. [Google Scholar]
  18. Liu, M.; Zhao, L.; Lopez, V.; Wan, Y.; Lewis, F.; Tseng, H.E.; Filev, D. Game-Theoretic Decision-Making for Autonomous Driving; CRC Press: Boca Raton, FL, USA, 2025; pp. 236–272. [Google Scholar]
  19. Wei, X.; Zhao, J.; Zhou, L.; Qian, Y. Broad Reinforcement Learning for Supporting Fast Autonomous IoT. IEEE Internet Things J. 2020, 7, 7010–7020. [Google Scholar] [CrossRef]
  20. Liang, S.; Wang, X.; Huang, J. Actor–Critic Reinforcement Learning Algorithms for Mean Field Games in Continuous Time, State, and Action Spaces. arXiv 2024, arXiv:2401.00052. [Google Scholar] [CrossRef]
  21. Angiuli, A.; Subramanian, J.; Perolat, J.; Carpentier, A.; Geist, M.; Pietquin, O. Deep Reinforcement Learning for Mean Field Control and Games. arXiv 2023, arXiv:2309.10953. [Google Scholar]
  22. Bogunovic, I.; Pirotta, M.; Rosolia, U. Safe-M3-UCRL: Safe Mean-Field Multi-Agent Reinforcement Learning under Global Constraints. In Proceedings of the 23rd International Conference on Autonomous Agents and Multiagent Systems (AAMAS), Auckland, New Zealand, 6–10 May 2024; pp. 973–981. [Google Scholar]
  23. Zaman, A.; Ratliff, L.; Mesbahi, M. Robust Multi-Agent Reinforcement Learning via Mean-Field Games. In Proceedings of the 41st International Conference on Machine Learning (ICML), Vienna, Austria, 21–27 July 2024. [Google Scholar]
  24. Jiang, Y.; Xu, K.; Wu, Y.; Zhang, M. A Survey of Fully Decentralized Multi-Agent Reinforcement Learning. arXiv 2024, arXiv:2306.02766. [Google Scholar]
  25. Gabler, L.; Scheller, S.; Albrecht, S.V. Decentralized Actor–Critic Reinforcement Learning for Cooperative Tasks with Sparse Rewards. Front. Robot. AI 2024, 11, 1229026. [Google Scholar]
  26. Gu, Y. Centralized training with hybrid execution in multi-agent reinforcement learning via predictive observation imputation. Artif. Intell. 2025, 348, 104404. [Google Scholar] [CrossRef]
  27. Alam, S.; Khan, M.; Zhang, W. Actor–Critic Frameworks for UAV Swarm Networks: A Survey. Drones 2025, 9, 153. [Google Scholar] [CrossRef]
  28. Xu, C.; Li, P.; Sun, X. Mean-Field Multi-Agent Reinforcement Learning for UAV-Assisted V2X Communications. arXiv 2025, arXiv:2502.01234. [Google Scholar]
  29. Emami, N.; Joo, C.; Kim, S.C. Age of Information Minimization Using Multi-Agent UAVs Based on AI-Enhanced Mean Field Resource Allocation. IEEE Trans. Wirel. Commun. 2024, 73, 13368–13380. [Google Scholar] [CrossRef]
  30. Mostofi, Y.; Malmirchegini, M.; Ghaffarkhah, A. Estimation of communication signal strength in robotic networks. In Proceedings of the 2010 IEEE International Conference on Robotics and Automation, Anchorage, AK, USA, 3–7 May 2010; pp. 1946–1951. [Google Scholar]
  31. Malmirchegini, M.; Mostofi, Y. On the spatial predictability of communication channels. IEEE Trans. Wirel. Commun. 2012, 11, 964–978. [Google Scholar] [CrossRef]
  32. Charalambous, C.D.; Menemenlis, N. Stochastic models for long-term multipath fading channels and their statistical properties. In Proceedings of the 38th IEEE Conference on Decision and Control (Cat. No. 99CH36304), Phoenix, AZ, USA, 7–10 December 1999; Volume 5, pp. 4947–4952. [Google Scholar]
  33. Huang, M.; Malhamé, R.P.; Caines, P.E. Stochastic power control in wireless communication systems: Analysis, approximate control algorithms and state aggregation. In Proceedings of the 42nd IEEE International Conference on Decision and Control (IEEE Cat. No. 03CH37475), Maui, HI, USA, 9–12 December 2003; Volume 4, pp. 4231–4236. [Google Scholar]
  34. Huang, M.; Caines, P.E.; Malhamé, R.P. Individual and mass behaviour in large population stochastic wireless power control problems: Centralized and Nash equilibrium solutions. In Proceedings of the 42nd IEEE International Conference on Decision and Control (IEEE Cat. No. 03CH37475)), Maui, HI, USA, 9–12 December 2003; Volume 1, pp. 98–103. [Google Scholar]
  35. Aziz, M.; Caines, P.E. Computational investigations of decentralized cellular network optimization via mean field control. In Proceedings of the 53rd IEEE Conference on Decision and Control, Los Angeles, CA, USA, 15–17 December 2014; pp. 5560–5567. [Google Scholar]
  36. Aziz, M.; Caines, P.E. A mean field game computational methodology for decentralized cellular network optimization. IEEE Trans. Control Syst. Technol. 2016, 25, 563–576. [Google Scholar] [CrossRef]
  37. Haenggi, M.; Ganti, R.K. Interference in large wireless networks. Found. Trends Netw. 2009, 3, 127–248. [Google Scholar] [CrossRef]
  38. Baccelli, F.; Błaszczyszyn, B. Stochastic geometry and wireless networks: Volume II applications. Found. Trends Netw. 2010, 4, 1–312. [Google Scholar] [CrossRef]
  39. Jiang, Y.; Fan, J.; Chai, T.; Lewis, F.L.; Li, J. Tracking control for linear discrete-time networked control systems with unknown dynamics and dropout. IEEE Trans. Neural Netw. Learn. Syst. 2017, 29, 4607–4620. [Google Scholar] [CrossRef] [PubMed]
  40. Nourian, M.; Caines, P.E. ϵ-Nash mean field game theory for nonlinear stochastic dynamical systems with major and minor agents. SIAM J. Control Optim. 2013, 51, 3302–3331. [Google Scholar] [CrossRef]
  41. Alpcan, T.; Başar, T.; Srikant, R.; Altman, E. CDMA uplink power control as a noncooperative game. Wirel. Netw. 2002, 8, 659–670. [Google Scholar] [CrossRef]
  42. El Jamous, Z.; Davaslioglu, K.; Sagduyu, Y.E. Deep reinforcement learning for power control in next-generation wifi network systems. In Proceedings of the MILCOM 2022—2022 IEEE Military Communications Conference (MILCOM), Rockville, MD, USA, 28 November–2 December 2022; pp. 547–552. [Google Scholar]
  43. Choi, H.; Kim, T.; Lee, S.; Choi, H.S.; Yoo, N. Energy-Efficient Dynamic Enhanced Inter-Cell Interference Coordination Scheme Based on Deep Reinforcement Learning in H-CRAN. Sensors 2024, 24, 7980. [Google Scholar] [CrossRef]
  44. Soltani, P.; Eskandarpour, M.; Ahmadizad, A.; Soleimani, H. Energy-Efficient Routing Algorithm for Wireless Sensor Networks: A Multi-Agent Reinforcement Learning Approach. arXiv 2025, arXiv:2508.14679. [Google Scholar]
  45. Wu, Y.; Wu, J.; Huang, M.; Shi, L. Mean-field transmission power control in dense networks. IEEE Trans. Control Netw. Syst. 2020, 8, 99–110. [Google Scholar] [CrossRef]
  46. Zhang, H.; Lu, C.; Tang, H.; Wei, X.; Liang, L.; Cheng, L.; Ding, W.; Han, Z. Mean-field-aided multiagent reinforcement learning for resource allocation in vehicular networks. IEEE Internet Things J. 2022, 10, 2667–2679. [Google Scholar] [CrossRef]
  47. Zhou, Z.; Qian, L.; Xu, H. Decentralized multi-agent reinforcement learning for large-scale mobile wireless sensor network control using mean field games. In Proceedings of the 2024 33rd International Conference on Computer Communications and Networks (ICCCN), Kailua-Kona, HI, USA, 29–31 July 2024; pp. 1–6. [Google Scholar]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
