Review

Recent Advances in Multi-Agent Reinforcement Learning for Intelligent Automation and Control of Water Environment Systems

1 Graduate School of Computer Science and Engineering, University of Aizu, Aizuwakamatsu 965-8580, Fukushima, Japan
2 Computer Science Division, University of Aizu, Aizuwakamatsu 965-8580, Fukushima, Japan
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Machines 2025, 13(6), 503; https://doi.org/10.3390/machines13060503
Submission received: 27 April 2025 / Revised: 30 May 2025 / Accepted: 6 June 2025 / Published: 9 June 2025
(This article belongs to the Special Issue Recent Developments in Machine Design, Automation and Robotics)

Abstract:
Multi-agent reinforcement learning (MARL) has demonstrated significant application potential in addressing cooperative control, policy optimization, and task allocation problems in complex systems. This paper focuses on its applications and development in water environment systems, providing a systematic review of the theoretical foundations of multi-agent systems and reinforcement learning and summarizing three representative categories of mainstream MARL algorithms. Typical control scenarios in water systems are also examined. From the perspective of cooperative control, this paper investigates the modeling mechanisms and policy coordination strategies of MARL in key tasks such as water supply scheduling, hydro-energy co-regulation, and autonomous monitoring. It further analyzes the challenges and solutions for improving global cooperative efficiency under practical constraints such as limited resources, system heterogeneity, and unstable communication. Additionally, recent progress in cross-domain generalization, integrated communication–perception frameworks, and system-level robustness enhancement is summarized. This work aims to provide a theoretical foundation and key insights for advancing research and practical applications of MARL-based intelligent control in water infrastructure systems.

1. Introduction

A multi-agent system (MAS) is a complex system composed of multiple agents equipped with capabilities for computation, perception, communication, learning, and execution. Through localized sensing and interactive collaboration, these agents achieve coordinated control and optimal decision-making for global tasks. Compared with single-agent systems, a MAS offers greater flexibility and robustness in addressing large-scale, complex problems characterized by dynamics, uncertainty, and high-dimensional state spaces. MASs have thus found widespread application in various domains, including cooperative robotics, intelligent transportation, unmanned aerial vehicle (UAV) formation control, and power system scheduling. In recent years, with the continuous advancement of intelligent technologies, multi-agent cooperative control has become a core topic across several interdisciplinary fields, such as artificial intelligence, automation, game theory, and control theory, and is recognized as one of the key theoretical challenges in next-generation AI [1,2].
Traditional research on multi-agent cooperative control has primarily focused on consensus-based approaches, emphasizing the interplay between agent dynamics and information exchange mechanisms. These studies aim to address distributed consensus control under complex constraints, such as limited communication, dynamic topologies, and actuator saturation. However, as agent objectives become increasingly diverse and autonomous, the conventional single-objective coordination paradigm is no longer sufficient to capture complex interactions in real-world scenarios. To address this limitation, an increasing number of studies have adopted a game-theoretic framework, modeling agents as rational decision-makers engaged in strategic interactions driven by individual objectives. Through strategy optimization under multi-objective trade-offs, system-level coordination is achieved. This integration of game theory and control theory has shown promising results in robust control, optimal control, and cooperative estimation, providing a unified modeling foundation and solution framework to describe autonomy, interaction, and cooperation in MASs [3,4].
On the other hand, reinforcement learning (RL) has emerged as a powerful tool for cooperative MAS research due to its capability to optimize adaptive policies in model-free environments. Multi-agent reinforcement learning (MARL) further combines the representational power of deep learning with the policy optimization ability of RL, enabling agents to learn effective strategies under incomplete information and unstable environments. Building upon this foundation, deep MARL methods have been widely applied in domains such as autonomous formation control, intelligent manufacturing, and traffic management, demonstrating strong generalization and scalability [5,6].
With the rapid advancement of complex systems research driven by large-scale models, high-performance computing, and big data, water environment systems have emerged as a critical application domain for multi-agent deep reinforcement learning (MADRL) methods. These systems typically exhibit the following key characteristics: (1) an inherently multi-agent structure consisting of components such as pump stations, monitoring sites, control gates, and unmanned underwater or surface platforms; (2) strong spatiotemporal heterogeneity and nonlinear coupling in system dynamics; (3) the need to coordinate multiple sub-objectives, including ensuring water supply security, reducing energy consumption, and responding to extreme events; and (4) high levels of uncertainty, data sparsity, and stringent real-time requirements [7,8].
Against this backdrop, MADRL offers novel theoretical tools and practical approaches for modeling and optimizing water environment systems. Representative application scenarios include the following:
  • Intelligent pump station scheduling: Modeling multiple pump stations as cooperative agents, MADRL integrates predictive management with adaptive control to achieve optimal trade-offs between energy consumption and water supply efficiency.
  • Urban flood control: Under extreme rainfall conditions, drainage nodes must collaboratively optimize retention and discharge strategies. MADRL enables the learning of disaster response policies in simulated environments.
  • Underwater monitoring and detection systems (USV/UUV/AUV): Teams of autonomous surface or underwater vehicles execute long-term monitoring and pollutant source tracking through cooperative path planning and area coverage optimization [9,10].
  • Joint scheduling of multiple water sources and watershed management: In large-scale basins with coupled water sources, MADRL supports multi-objective water resource allocation and coordination.
  • Emergency pollution control: In response to sudden water quality events, multiple sensing and control nodes collaborate to optimize the timing and spatial extent of emergency interventions.
To date, a growing body of literature has systematically reviewed MADRL approaches from various perspectives. For example, Yang et al. explored multi-agent strategy evolution mechanisms within a game-theoretic (especially meta-game) framework [11]; Shi et al. investigated experience sharing and policy generalization among agents through transfer learning [12]; Han et al. provided a comprehensive review of hierarchical multi-agent reinforcement learning algorithms [13]; and Luo et al. examined the differences between online and offline learning paradigms in the context of game-theoretic modeling [14]. Furthermore, challenges such as scalability, heterogeneity, and stability in large-scale multi-agent systems have spurred research into emerging directions such as “massively multi-agent systems” and “scalable MARL,” driving the application of graph neural networks, attention mechanisms, and mean-field theory in MARL [15,16,17].
The integration of MADRL with water environment systems is advancing rapidly, yet several persistent challenges remain. These include the complexity of cooperative game modeling, limited policy stability, and the lack of interpretability and deployability in real-world scenarios. To address these issues, this paper introduces MADRL as a pivotal research paradigm for intelligent control in water environment systems, highlighting its unique strengths in enabling collaborative mechanisms and dynamic control optimization among multiple control units.
This work presents a comprehensive overview of the theoretical foundations, learning architectures, representative applications, and critical challenges of MADRL in water-related domains. Special emphasis is placed on addressing complex conditions—such as partial observability, non-stationarity, and large-scale spatial distribution—through policy coupling, task-level coordination, and inter-agent information sharing. MADRL demonstrates strong potential to enhance system-wide coordination and efficiency, particularly in clustered control tasks such as multi-pump station scheduling, upstream–downstream regulation, and multi-point compliance of water quality metrics. By building unified frameworks for learning, MADRL opens new pathways for intelligent scheduling in complex systems. Its growing maturity not only propels innovation in multi-agent control theory and algorithms but also provides a technological foundation for more resilient, efficient, and sustainable management of water resources.
From the perspective of multi-agent coordination and reinforcement learning integration, this paper systematically explores the typical applications, core techniques, and development trends of MADRL in the context of water environment systems. The remaining sections of this article are organized as follows. In Section 2, this paper introduces the coordination mechanisms and game-theoretic foundations for multi-agent systems, providing the theoretical underpinnings for distributed decision-making; Section 3 reviews the evolution of MADRL, focusing on training paradigms and modeling frameworks in dynamic and partially observable environments; Section 4 proposes a cross-domain collaborative control architecture tailored to water systems, addressing structural heterogeneity and operational complexity; Section 5 presents MADRL-based applications in four key water system scenarios, namely, urban drainage control, distributed water resource management, hydropower regulation, and autonomous monitoring with path planning; Section 6 investigates the generalization capabilities of MADRL in complex system settings and its integration with digital twin technologies; Section 7 outlines future research directions for MADRL-driven intelligent control in water systems; and Section 8 concludes this paper by summarizing the key findings and offering critical research insights.

2. Theoretical Foundations and Modeling Mechanisms of Multi-Agent DRL

Benefiting from the powerful nonlinear approximation and high-dimensional feature extraction capabilities of deep neural networks, DRL has emerged as a mainstream algorithm in intelligent decision-making research [18]. In DRL, neural networks are used to approximate value functions and policies, thereby enabling the handling of continuous and high-dimensional state–action spaces that are otherwise intractable. This allows for end-to-end learning in complex environments [19].
As illustrated in Figure 1, the general framework of DRL involves two tightly integrated components: feature representation and policy optimization. First, deep learning techniques are employed to extract meaningful representations from high-dimensional input data, providing an accurate depiction of the current environmental state. Then, reinforcement learning algorithms use the extracted features and cumulative rewards to optimize decision-making policies, mapping the current state to appropriate actions [20].

2.1. Problem Formulation and Modeling Framework

In water environment systems engineering, many control processes exhibit the Markov property, i.e., the system's next state depends solely on the current state and control action, independent of the historical trajectory. This memoryless nature makes water environment control problems naturally amenable to modeling as Markov Decision Processes (MDPs). As the foundational framework of reinforcement learning, MDPs offer a theoretical basis for structuring the interaction between decision-making agents and complex, dynamic environments [21,22].
With the increasing deployment of sensor networks and the Internet of Things (IoT), environmental systems now generate high-dimensional, spatiotemporally heterogeneous data related to water quality, quantity, and pollutant dispersion. Traditional control approaches often face significant performance limitations when dealing with such complex and dynamic systems. In contrast, deep reinforcement learning (DRL) has demonstrated remarkable capabilities in handling high-dimensional state spaces and continuous control problems, positioning it as a promising tool for intelligent water environment management.
Under this modeling paradigm, an individual control unit in the water environment system—such as a treatment plant, pump station, sluice gate, or sub-watershed—can be abstracted as an intelligent agent. Its interaction with the surrounding environment can be formalized as a Markov Decision Process (MDP), defined by the tuple $\mathrm{MDP} = (S, A, P_{sa}, \pi, \gamma)$, where
  • S denotes the system state space (e.g., water quality parameters, flow conditions, meteorological inputs);
  • A represents the set of available control actions (e.g., switching pumps on/off, adjusting chemical dosages, regulating flow rates);
  • $P_{sa}$ is the state transition probability function;
  • π is the agent’s decision-making policy;
  • γ is the discount factor used to balance immediate and future rewards.
When multiple control units operate concurrently, the water environment control problem evolves into a finite-horizon Multi-Agent Markov Decision Process (MAMDP), represented as $\mathrm{MAMDP} = (S, A, N, P_{sa}, \pi, \gamma)$, where $N$ denotes the number of interacting agents. In such settings, agents may engage in both information cooperation and resource competition, with the shared objective of optimizing joint control policies under system constraints to achieve goals such as pollution mitigation, water quality assurance, and coordinated resource allocation.
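For illustration, a minimal sketch of this formulation is given below; the class and field names, the placeholder transition, and the array shapes are assumptions rather than an established water-system API:

```python
# Illustrative only: a gym-style interface for the MAMDP tuple
# (S, A, N, P_sa, pi, gamma) described above.
from dataclasses import dataclass
import numpy as np

@dataclass
class MAMDPConfig:
    n_agents: int          # N: number of interacting agents (pumps, gates, ...)
    state_dim: int         # dimensionality of the shared system state S
    n_actions: int         # size of each agent's discrete action set A
    gamma: float = 0.99    # discount factor balancing immediate/future rewards

class WaterMAMDP:
    """Each step applies the joint action and samples s' ~ P_sa."""
    def __init__(self, cfg: MAMDPConfig):
        self.cfg = cfg
        self.state = np.zeros(cfg.state_dim)

    def step(self, joint_action: np.ndarray):
        # joint_action: one discrete action per agent, shape (N,)
        assert joint_action.shape == (self.cfg.n_agents,)
        # Placeholder transition: a real model would couple hydraulics,
        # water quality dynamics, and the control actions.
        noise = 0.01 * np.random.randn(self.cfg.state_dim)
        self.state = 0.95 * self.state + noise
        rewards = np.zeros(self.cfg.n_agents)  # filled by the terms of Section 2.4
        done = False
        return self.state.copy(), rewards, done
```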

2.2. Identification of Key State Variables in Water Environment System Regulation

In the context of intelligent control for water environment systems, the system typically comprises multiple observation nodes or control units, each of which can access partial state information from its local environment, forming the overall state observation space $s \in S$. Based on the observed state, an intelligent agent executes an optimal control action to achieve specific objectives such as water quality maintenance, hydrological balance, or emergency pollution response. This section reviews the commonly used categories of state observations in current research.
(1) Pollutant concentration.
Pollutant indicators represent the most fundamental type of state information in water environment control. Typical variables include dissolved oxygen (DO), ammonium nitrogen (NH$_4^+$), total phosphorus (TP), pH, and temperature. These measurements are often collected from multiple spatial nodes and organized into spatial-variable matrices or time-series vectors, which reflect either localized or system-wide water quality conditions. Such representations are critical for assessing pollution levels and determining regulatory requirements.
(2) Flow velocity, discharge, and water level.
Information regarding flow velocity, discharge, and water level—obtained via hydrodynamic simulation or real-time sensor measurements—serves as a key basis for control decisions, especially in scheduling operations such as pump activation or gate regulation. These features are typically represented using grid-based hydrodynamic models in the form of multi-dimensional tensors or numerical fields, allowing the spatiotemporal dynamics of water movement within the system to be effectively captured.
(3) Operational status of control facilities.
The real-time operational status of control facilities (e.g., pumps, sluice gates, aerators, chemical dosing systems) also constitutes an important component of the system’s state space. This category of information is often encoded as vectors that represent the on/off states, operational intensity, or duration of various devices. Such representations facilitate the reduction of regulatory conflicts and enhance operational safety and efficiency.
Beyond the typical observation types listed above, the selection of state features can be tailored to specific control objectives. Depending on the task, state representations may consist of a single modality or leverage multi-modal data fusion to construct joint state spaces, thereby enhancing the system’s perception and responsiveness in complex aquatic environments [23].
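As a minimal illustration of such multi-modal fusion, the sketch below concatenates the three observation categories into a flat joint state vector; the array shapes and normalization constants are assumptions:

```python
import numpy as np

def build_state(pollutants, hydraulics, facility_status):
    """Fuse multi-modal observations into a flat state vector s in S."""
    # pollutants: (n_sites, 5) matrix of [DO, NH4+, TP, pH, temperature]
    scales = np.array([10.0, 2.0, 1.0, 14.0, 40.0])  # illustrative value ranges
    return np.concatenate([
        (pollutants / scales).ravel(),        # spatial-variable matrix, flattened
        np.asarray(hydraulics).ravel(),       # velocity/discharge/level fields
        np.asarray(facility_status).ravel(),  # device on/off, intensity, duration
    ])
```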

2.3. Action Space Design and Classification in Intelligent Control

Agents select and execute optimal control actions $A_m$ based on the current observed state $s \in S$. These actions aim to optimize system objectives such as water quantity regulation, hydraulic stability, and pollutant control. The design of the action space and its corresponding policy structure directly affects the decision-making flexibility of the agent and the safety and stability of the control system. At present, action spaces in water environment systems are generally categorized into two types:
(1) Fixed action sequences.
In this control strategy, the action set is denoted in Equation (1):
$A_m = \{ A_{m1}, A_{m2} \}$
with clearly defined switching rules and a predetermined sequence:
  • When the agent executes action $A_{m1}$, the system maintains the current control state without adjustment (e.g., keeping the current pump speed, gate open, or aeration intensity);
  • When executing action $A_{m2}$, the system switches to the next predefined control mode or scheduling scheme (e.g., sequentially activating different water treatment units or rotating pump operations according to a fixed schedule).
To ensure operational safety and prevent system lag or environmental load accumulation, each control mode is typically assigned a maximum duration threshold $\Delta t_{\max}$. Once the threshold is reached, the system automatically transitions to the next state. This approach is widely applied in standardized water treatment processes and automated control systems, particularly in scenarios with well-defined task flows and stable operational logic.
(2) Variable action sets.
In this control strategy, the action set is defined in Equation (2):
$A_m = \{ A_{m1}, A_{m2}, \dots, A_{mn} \}$
where the set includes all feasible combinations of control actions within the system, such as different pump speeds, gate opening levels, and on/off states of multiple aeration or chemical dosing units, as follows:
  • At each decision step $t$, the agent selects an action $a_t \in A_m$ based on a learned policy $\pi(a_t \mid s_t)$, which defines a probability distribution over all candidate actions.
  • This approach provides greater flexibility in control and is more adaptable to non-stationary water quality disturbances or multi-objective regulatory tasks.
However, due to the high dimensionality and combinatorial complexity of the action space, this method may introduce risks such as discontinuous control behavior, abrupt system load changes, water body disturbances, or resource conflicts. Therefore, in practical implementations, it is often necessary to introduce the following:
  • Action constraints $C(A_m)$: to limit the feasible action set;
  • Penalty functions $R_{\mathrm{penalty}}(a_t)$: to discourage unsafe or costly actions;
  • Safety filtering mechanisms: to override or restrict high-risk action execution.
This classification of action space design serves as a theoretical and practical foundation for applying RL-based intelligent controllers in real-world water environment systems. By carefully structuring the action space, it is possible to balance decision-making precision and system robustness, enabling safe and adaptive control in dynamic and uncertain aquatic environments [24].
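The following minimal sketch illustrates how the three safeguards above can interact at decision time; the feasibility rule, penalty threshold, and field names are illustrative assumptions:

```python
import numpy as np

def feasible_actions(all_actions, state):
    """C(A_m): keep only actions allowed in the current state, e.g.,
    forbid switching a pump on when the wet well is nearly empty."""
    return [a for a in all_actions if not (a["pump_on"] and state["level"] < 0.1)]

def penalty(action):
    """R_penalty(a_t): surcharge unsafe or costly actions."""
    return -5.0 * action["dosing_rate"] if action["dosing_rate"] > 0.8 else 0.0

def safe_select(policy_probs, all_actions, state):
    """Safety filter: mask infeasible actions, renormalize, then sample."""
    feasible = feasible_actions(all_actions, state)
    mask = np.array([a in feasible for a in all_actions], dtype=float)
    assert mask.any(), "no feasible action in this state"
    probs = policy_probs * mask
    probs = probs / probs.sum() if probs.sum() > 0 else mask / mask.sum()
    idx = np.random.choice(len(all_actions), p=probs)
    return all_actions[idx], penalty(all_actions[idx])
```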

2.4. Reward Mechanism Design in Intelligent Regulation of Water Environment Systems

In reinforcement learning (RL)-based intelligent control of water environment systems, the agent receives a scalar feedback signal—known as the reward—from the environment after executing a specific control action. The reward function not only quantifies the positive or negative impact of an action on the system’s control objectives but also serves as the core driver for policy optimization in the learning process. A well-designed reward function can effectively guide the agent to learn control strategies that contribute to water quality improvement, energy efficiency, and operational effectiveness, as detailed in Table 1.
To accommodate the multi-objective nature of water environment regulation, reward function design typically incorporates the following categories of indicators:
(1) Duration or cumulative degree of pollutant concentration violations.
This indicator measures the time duration or cumulative extent to which key water quality parameters (e.g., dissolved oxygen [DO], ammonium nitrogen [NH$_4^+$], total phosphorus [TP]) exceed predefined safety thresholds. Persistent exceedance indicates serious pollution and ecological risk and should be penalized via a negative reward. A typical formulation is denoted in Equation (3):
$R_1 = - \sum_{i=1}^{n} \max ( C_i - C_{\mathrm{th}},\, 0 ) \cdot \Delta t$
where $C_i$ is the pollutant concentration at observation point $i$, $C_{\mathrm{th}}$ is the specified threshold value, and $\Delta t$ is the time step duration.
(2) Energy consumption or operational cost of control actions.
Operational facilities (e.g., pump stations, aerators, dosing systems) in water environments often incur significant energy consumption and maintenance costs. To enhance economic efficiency, control cost terms are frequently introduced as penalties in the reward function, denoted in Equation (4):
$R_2 = - \sum_{j=1}^{m} P_j \cdot t_j$
where $P_j$ is the power consumption of device $j$ and $t_j$ is the operational duration of device $j$.
(3) Water quality recovery time or target achievement time.
In pollution events or under fluctuating water quality, the speed of recovery to a safe state is a key indicator of control performance. A time-based reward function can be formulated as in Equation (5):
$R_3 = - T_{\mathrm{restore}}$
where $T_{\mathrm{restore}}$ is the time required for the system to return from a threshold-violating to a compliant state; a shorter $T_{\mathrm{restore}}$ yields a higher reward.
(4) Water level and volume balance indicators.
For systems involving water resource scheduling or flood control, hydraulic variables such as water level and flow must also be considered. One approach is to penalize deviations from a target water level range, denoted in Equation (6):
$R_4 = - \sum_{t} \left| h_t - h_{\mathrm{target}} \right|$
where $h_t$ is the observed water level at time $t$ and $h_{\mathrm{target}}$ is the desired target water level, reflecting constraints for flood mitigation or ecological flow maintenance.
(5) Multi-objective integration via weighted aggregation.
In practical applications, water environment control systems often involve multiple concurrent objectives. To balance factors such as water quality, energy consumption, and response efficiency, a composite reward function is often constructed as a weighted sum, denoted in Equation (7):
$R = \lambda_1 R_1 + \lambda_2 R_2 + \lambda_3 R_3 + \lambda_4 R_4$
where $\lambda_i$ denotes the weighting coefficients reflecting the relative importance of each control objective.
The choice of weights directly influences the agent’s behavioral strategy. These weights can be manually tuned based on domain knowledge or adaptively learned through meta-optimization techniques.
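As a concrete reference, the sketch below assembles Equations (3)-(7) into a single scalar reward; the sign conventions follow the penalty formulations above, while the weight values are illustrative assumptions:

```python
import numpy as np

def composite_reward(C, C_th, dt,          # Eq. (3): concentrations per site
                     P, t_run,             # Eq. (4): device power and runtime
                     T_restore,            # Eq. (5): recovery time
                     h, h_target,          # Eq. (6): water level trajectory
                     lam=(1.0, 0.1, 0.5, 0.2)):  # illustrative weights
    R1 = -np.sum(np.maximum(C - C_th, 0.0)) * dt        # violation penalty
    R2 = -np.sum(P * t_run)                             # energy/cost penalty
    R3 = -T_restore                                     # recovery-time penalty
    R4 = -np.sum(np.abs(h - h_target))                  # level deviation penalty
    return lam[0]*R1 + lam[1]*R2 + lam[2]*R3 + lam[3]*R4  # Eq. (7)
```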

3. Training Paradigms in Multi-Agent DRL

Multi-agent systems (MASs) consist of multiple autonomous agents and are well-suited for controlling complex, distributed, and dynamically coupled environments. In the context of water environment systems, functional components such as pump stations, sluice gates, aeration devices, and sensor nodes can be abstracted as agents. These agents perceive local environmental states, perform control actions, and—when necessary—interact or coordinate with other agents to collectively achieve system-level objectives such as water quality stabilization, energy minimization, or pollution mitigation. The training paradigms of multi-agent reinforcement learning are summarized in Table 2.

3.1. Independent Learning Paradigm

The independent learning paradigm (ILP) is one of the most fundamental strategies in MADRL. Under this paradigm, each agent independently optimizes its own policy, treating the behaviors of other agents as part of the environment dynamics—without explicitly modeling their strategies or states. This approach is particularly suitable for scenarios where inter-agent communication is limited or costly. In water environment systems, geographically distributed and structurally independent units such as pump stations, gates, and aerators naturally align with the assumptions of this paradigm.
As shown in Figure 2a, ILP allows each agent to update its policy independently, without relying on coordination or information sharing. This design offers good parallelism and scalability, making it especially applicable to small-scale, discrete state–action space scenarios such as sectional chemical dosing or localized water quality compliance strategies. However, due to the non-stationarity introduced by independently evolving policies, ILP may suffer from convergence instability and degraded learning performance.
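To make the paradigm concrete before turning to specific algorithms, the sketch below shows independent temporal-difference updates in which each agent owns its Q-network and receives no joint information; network sizes and the batch format are assumptions:

```python
import torch
import torch.nn as nn

class QNet(nn.Module):
    """Per-agent Q-network over local observations only."""
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                               nn.Linear(64, n_actions))
    def forward(self, obs):
        return self.f(obs)

def independent_td_step(agents, opts, batch, gamma=0.99):
    # batch: per-agent tuples (obs, act, rew, next_obs); other agents are
    # simply part of the environment dynamics. Terminal masking omitted.
    for i, (q, opt) in enumerate(zip(agents, opts)):
        obs, act, rew, nxt = batch[i]
        target = rew + gamma * q(nxt).max(dim=1).values.detach()
        pred = q(obs).gather(1, act.unsqueeze(1)).squeeze(1)
        loss = nn.functional.mse_loss(pred, target)
        opt.zero_grad(); loss.backward(); opt.step()
```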
DLCQN (Deep Loosely Coupled Q-Network) introduces a policy adaptivity mechanism that allows agents to autonomously switch between independent and coordinated behaviors. This switching is guided by negative reward signals and observation dependencies. In river network systems with upstream–downstream hydrodynamic coupling, DLCQN enables local agents to dynamically adjust their strategy coupling strength based on hydraulic linkages, improving system adaptability. Palmer et al. proposed the Lenient Deep Q-Network (LDQN) to address early-stage policy oscillations that occur in synchronous multi-agent learning. By assigning tolerance to early negative feedback, this method avoids premature penalization and helps mitigate early policy collapse [25]. It is well-suited for uncertain scenarios such as pollution source tracking and emergency response dispatching. Building on this, Zheng et al. proposed the Weighted Double Q-Network (WDQN), which combines lenient reward strategies with double Q-learning updates. WDQN significantly alleviates Q-value overestimation and improves the robustness of policy updates under stochastic disturbances, making it valuable in high-risk and dynamic water regulation tasks [26].
In addition to DQN-based methods, some researchers have extended the independent learning paradigm to partially observable environments using recurrent neural network (RNN) architectures. This is particularly relevant in water environment systems where sensors may fail intermittently, monitoring stations may be sparsely distributed, or measurements are collected with low temporal resolution. In the context of water environment systems, such approaches can be extended to complex tasks such as multi-source pollution tracing and multi-section collaborative regulation, where observations may be incomplete or delayed. These RNN-based independent learners improve the robustness and adaptability of control strategies under conditions of limited or unreliable environmental information.

3.2. Centralized Learning Paradigm

The centralized learning paradigm is a fundamental architecture in multi-agent reinforcement learning (MARL), where the policy optimization processes of all agents are managed and trained by a central learner. As shown in Figure 2b, the central controller has access to the local states, actions, and rewards of all agents, as well as the global state of the system, thereby enabling comprehensive environmental awareness and a system-level perspective. In complex water environment systems—such as cross-regional water quality regulation, upstream–downstream coordination in river networks, and joint operation of reservoir clusters—control units are highly coupled and interdependent. These tasks require precise coordination and joint decision-making among agents. The centralized learning paradigm provides theoretical support for learning such collaborative strategies by aggregating global information, making it especially suitable for multi-objective optimization scenarios requiring a high degree of cooperation. However, as the number of agents increases, the joint state–action space grows exponentially (for example, ten agents with five discrete actions each already induce $5^{10} \approx 9.8 \times 10^6$ joint actions), leading to the well-known curse of dimensionality. This significantly degrades learning efficiency, hinders convergence to global optima, and limits the paradigm's scalability and applicability to large-scale intelligent water control systems.

3.3. Centralized Training with Decentralized Execution in Water Environment Systems

The centralized training and decentralized execution (CTDE) paradigm effectively combines the strengths of global coordination and local autonomy. As shown in Figure 2c, this approach allows agents to exchange state, action, and reward information during training, facilitating the centralized optimization of system-wide objectives. Once deployed, however, each agent operates independently using only its own local observations and experience, without relying on others’ internal states or policies.
This learning paradigm is particularly well-suited to intelligent water environment systems, which are characterized by delayed information flow, limited observability, and spatially distributed control nodes. In such contexts, fully centralized control is often impractical due to communication constraints, while isolated learning may fail to achieve coordinated and globally optimal behaviors.
By leveraging global insights during training and enabling autonomous action at runtime, this hybrid framework offers a scalable, practical solution for real-world deployment. It has thus emerged as a key enabler for developing intelligent, adaptive, and resilient environmental control systems.

3.3.1. Value Decomposition Approaches for Distributed Control

Methods based on value function decomposition seek to reduce the complexity of multi-agent learning by factorizing the global value function into a set of local components. Within this framework, each agent updates its policy independently, relying solely on local observations and rewards without requiring full awareness of other agents’ strategies. This design alleviates the training instability associated with non-stationary environments typical in independent learning settings while enabling effective credit assignment for each agent’s contribution to overall system performance—an essential capability in collaborative water resource management. Representative algorithms and their applicability to water environment systems are summarized below.
Value decomposition networks (VDNs) decompose the global Q-value linearly into the sum of individual agents' local Q-values. The contribution of each agent to the global return is quantified via gradient backpropagation, as shown in Figure 3. Due to its simplicity and training efficiency, the VDN is particularly applicable to water systems in which each control unit has clearly defined local objectives and weak coupling with others. Representative scenarios include the following:
  • Multi-point water quality control involves maintaining indicators such as dissolved oxygen (DO), ammonium nitrogen (NH$_4^+$), or total phosphorus (TP) within target thresholds across multiple monitoring sections;
  • Zonal aeration strategy optimization uses independently operating aeration units to locally regulate oxygenation while contributing to system-wide water quality balance [27].
QMIX introduces a nonlinear mixing network that improves the model’s ability to approximate complex interactions while maintaining a monotonicity constraint between the global Q-value and individual agents’ local Q-values [28]. This structure is well-suited for coordinated control tasks in complex water systems, such as total pollutant load regulation or basin-wide discharge compliance. However, the performance of QMIX may degrade in tasks that involve non-monotonic coordination dynamics, such as pollutant dispersion control coupled with water level regulation.
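The contrast between the two decompositions can be captured in a short sketch: VDN simply sums local Q-values, while QMIX mixes them through a state-conditioned hypernetwork whose weights are made non-negative to enforce the monotonicity constraint. Layer sizes are assumptions:

```python
import torch
import torch.nn as nn

def vdn_mix(local_qs):                 # local_qs: (batch, n_agents)
    return local_qs.sum(dim=1)         # Q_tot = sum_i Q_i

class QMixMixer(nn.Module):
    """Monotonic mixing: hypernetwork weights pass through abs()."""
    def __init__(self, n_agents, state_dim, embed=32):
        super().__init__()
        self.w1 = nn.Linear(state_dim, n_agents * embed)   # hypernetwork layers
        self.b1 = nn.Linear(state_dim, embed)
        self.w2 = nn.Linear(state_dim, embed)
        self.b2 = nn.Linear(state_dim, 1)
        self.n_agents, self.embed = n_agents, embed

    def forward(self, local_qs, state):  # (batch, n_agents), (batch, state_dim)
        b = local_qs.size(0)
        w1 = torch.abs(self.w1(state)).view(b, self.n_agents, self.embed)
        h = torch.relu(torch.bmm(local_qs.unsqueeze(1), w1)
                       + self.b1(state).unsqueeze(1))      # (b, 1, embed)
        w2 = torch.abs(self.w2(state)).view(b, self.embed, 1)
        q_tot = torch.bmm(h, w2) + self.b2(state).unsqueeze(1)  # (b, 1, 1)
        return q_tot.view(b)
```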
Whereas QMIX constrains its mixing network to be a monotonic function of the individual agents' Q-values in order to guarantee consistency with decentralized execution, QPLEX relaxes this restriction by introducing a dueling network structure that models agent-specific advantage functions within an advantage-based factorization framework. Specifically, it decomposes the joint action-value function as in Equation (8):
$Q_{\mathrm{tot}}(s, a) = V(s) + \sum_{i=1}^{n} A_i(s, a_i)$
where $V(s)$ is a shared, state-dependent baseline and $A_i(s, a_i)$ is the advantage function for agent $i$. The mixing network then aggregates these agent-specific advantages using a monotonic function, allowing the model to capture more expressive joint action dependencies while still preserving the guarantees of decentralized execution.
This formulation effectively addresses the representational limitations of QMIX in non-monotonic environments, such as competitive or partially cooperative tasks.
Qatten integrates a multi-head attention mechanism into the value decomposition framework to more effectively capture the varying significance of individual agents under different global state conditions. In contrast to conventional additive or monotonic mixing architectures, Qatten employs an attention-based mixing network, where the attention weights are computed as in Equation (9):
$\alpha_i = \mathrm{softmax} \big( f(h_i, g) \big)$
Here, $h_i$ represents the latent feature of agent $i$, typically derived from its local observation, and $g$ denotes a global context vector extracted from the shared environment state. These attention scores $\alpha_i$ are then used to dynamically modulate each agent's contribution to the joint action-value function.
By leveraging this adaptive weighting mechanism, Qatten enables the model to focus more effectively on contextually relevant agents, thereby enhancing the expressiveness of the value function and improving credit assignment in cooperative tasks, especially in scenarios involving role specialization or situational importance.
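A minimal sketch of the attention weighting in Equation (9) follows; the bilinear score function is a simplifying assumption, whereas Qatten itself uses multi-head scaled dot-product attention:

```python
import torch
import torch.nn as nn

class AgentAttention(nn.Module):
    """Computes alpha_i = softmax(f(h_i, g)) over the agent dimension."""
    def __init__(self, feat_dim, ctx_dim):
        super().__init__()
        self.W = nn.Linear(ctx_dim, feat_dim, bias=False)  # maps g into h-space

    def forward(self, h, g):
        # h: (batch, n_agents, feat_dim) agent features from local observations
        # g: (batch, ctx_dim) global context from the shared environment state
        scores = torch.bmm(h, self.W(g).unsqueeze(2)).squeeze(2)  # f(h_i, g)
        return torch.softmax(scores, dim=1)   # one weight alpha_i per agent
```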
Value decomposition-based MARL methods, due to their localized information dependency, scalable architecture, and training stability, provide a solid theoretical foundation and technical pathway for distributed intelligent control in complex water environments. These methods are particularly effective under challenging conditions such as high-dimensional state spaces, non-stationary dynamics, and weakly coupled or delayed inter-agent communication. However, most existing value decomposition algorithms are designed for discrete action spaces, which limits their applicability in continuous control tasks—such as pump speed modulation, water level adjustment, or continuous chemical dosing.

3.3.2. Applications of Centralized Value Function Methods in Multi-Agent Regulation

Compared with value decomposition-based methods, centralized value function (CVF) methods leverage global information during training to construct a unified critic network, which guides the independent optimization of each agent’s actor policy network. These methods typically adopt the actor–critic architecture, enabling decoupled learning between policy and value functions. This paradigm offers a balance between policy flexibility and training stability and is particularly suitable for water environment systems where control units exhibit strong state coupling and complex inter-agent interactions [29,30].
Representative algorithms and their application scenarios in water environment regulation are as follows:
MADDPG (Multi-Agent Deep Deterministic Policy Gradient) was the first to extend deterministic policy gradient methods to multi-agent settings by combining a centralized critic with decentralized actors [31]. It enables agents to learn deterministic policies while utilizing shared global information. MADDPG is applicable to scenarios such as multi-pump flow distribution, reservoir group water level coordination, and upstream–downstream joint discharge optimization. However, its scalability is limited in large-scale systems due to the curse of dimensionality and difficulties in credit assignment.
MATD3 (Multi-Agent Twin Delayed DDPG) improves upon MADDPG by introducing twin centralized critics and delayed policy updates, effectively addressing the overestimation bias in value functions. It is particularly suited for fine-grained control tasks involving continuous variables, such as pump frequency modulation, water level gradients, or chemical dosing rates, and offers strong performance in terms of both control precision and training stability.
HATRPO and HAPPO were proposed to overcome the monotonic policy improvement limitations of traditional TRPO/PPO in cooperative multi-agent settings. By decomposing the advantage function and employing a sequential update mechanism, these algorithms enhance policy expressiveness and coordination capabilities without requiring parameter sharing or joint value function assumptions. They are particularly effective for heterogeneous control scenarios, such as functionally differentiated pump stations or asynchronous operation of distributed treatment plants [32].
Centralized value function-based methods, by leveraging global information integration, demonstrate strong adaptability in controlling water environment systems characterized by strongly coupled regulation nodes, complex collaborative tasks, and widespread continuous action spaces. In particular, algorithms based on the actor–critic architecture exhibit distinct advantages in terms of policy stability, sample efficiency, and multi-objective optimization.
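To make the actor–critic split concrete, the sketch below pairs decentralized actors with a MADDPG-style centralized critic that sees joint observations and actions during training only; dimensions and layer sizes are assumptions:

```python
import torch
import torch.nn as nn

class Actor(nn.Module):                    # decentralized: local obs only
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                               nn.Linear(64, act_dim), nn.Tanh())
    def forward(self, obs):
        return self.f(obs)                 # e.g., normalized pump speed command

class CentralCritic(nn.Module):            # centralized: joint obs + actions
    def __init__(self, n_agents, obs_dim, act_dim):
        super().__init__()
        joint = n_agents * (obs_dim + act_dim)
        self.q = nn.Sequential(nn.Linear(joint, 128), nn.ReLU(),
                               nn.Linear(128, 1))
    def forward(self, all_obs, all_acts):  # each: (batch, n_agents, dim)
        x = torch.cat([all_obs.flatten(1), all_acts.flatten(1)], dim=1)
        return self.q(x)                   # Q(s, a_1, ..., a_N)
```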

4. Cross-Domain Collaborative Control and Decision Architectures in Water Environment Systems

With the rapid advancement of sensor networks, intelligent control, and environmental modeling technologies, intelligent control of water environment systems is evolving from traditional point-based regulation toward large-scale, multi-agent, and cross-regional collaborative management paradigms. By integrating the global scheduling capability of shore-based control centers, the environmental sensing capability of in situ monitoring nodes, and the local execution capability of distributed agents (e.g., autonomous monitoring buoys, intelligent pumping stations), intelligent water systems can effectively support critical tasks such as pollution source tracking, dissolved oxygen regulation, water quality early warning, and coordinated emergency response. These capabilities position such systems as a core technological framework for achieving precision water quality management, basin-wide coordinated scheduling, and ecological restoration. Current research on intelligent water environment systems primarily focuses on the following key technical domains:
  • Task deployment and optimization;
  • Information perception and transmission mechanisms;
  • Path planning and agent scheduling;
  • Multi-agent cooperative control strategies.
Meanwhile, ongoing breakthroughs in reinforcement learning, graph neural networks (GNNs), adaptive control, and multi-agent systems provide viable pathways for advancing water environment systems toward large-scale, multi-source collaboration, heterogeneous device integration, and adaptation to complex physical environments. This transformation is gradually pushing water environment systems from automated monitoring toward self-organizing ecosystems and has established itself as a prominent research frontier in smart water management and watershed governance.
In this context, this section presents a comprehensive review of the current state of the art and emerging trends of these key technologies in water environment systems. In addition, the interconnections among various technical modules are mapped and analyzed. The technical architecture is illustrated in Figure 4, and the representative algorithms are described in Table 3.

4.1. Strategies of Task Deployment in Collaborative Control

Building on the preceding discussion of training paradigms and policy optimization mechanisms in multi-agent systems, this section focuses on the practical layer of task deployment in intelligent water environment control. It provides a systematic overview of centralized, distributed, and hybrid task scheduling strategies, as well as their architectural designs and representative applications.
In intelligent water systems, task deployment plays a critical role in the efficient organization and dynamic coordination of heterogeneous control resources—such as pumping stations, sluice gates, unmanned monitoring vessels, and sensor nodes—with respect to diverse operational objectives including emergency pollution response, real-time water quality regulation, and multi-source supply coordination. By constructing rational task allocation schemes and control frameworks, the overall system performance in terms of responsiveness and resource efficiency can be significantly improved. This is particularly crucial under complex scenarios involving sudden pollution events, water quality fluctuations, or basin-scale coordinated regulation.
According to the underlying collaboration architecture, current task deployment strategies in intelligent water systems can be broadly categorized into centralized and distributed approaches.
(1) Centralized task deployment strategies.
In centralized deployment architectures, a shore-based control center assumes responsibility for generating global task plans, allocating system resources, and issuing execution commands to individual control nodes. This paradigm is suitable for scenarios with tight time constraints, high task complexity, and a need for high-precision coordination, such as
  • Multi-section pollutant source tracking;
  • Eutrophication emergency remediation;
  • Unified water quality standard enforcement across river basins.
For instance, Liu et al. proposed a task-oriented intelligent control framework for centralized coordination across multiple water resource nodes [33]. Chen et al. developed a multi-source cooperative model incorporating heuristic optimization and coalition game theory to optimize joint monitoring task assignment between surface buoys and unmanned vessels [34]. Zhang et al. introduced a variable task value model that considers both static goals and dynamic feedback to achieve centralized allocation for maximum pollution reduction benefit [35].
While centralized strategies offer strong consistency and command controllability, they often suffer from limitations in adaptability and scalability, particularly under conditions of high task heterogeneity, environmental uncertainty, incomplete information, or unstable communication. These constraints can result in inflexible resource allocation, delayed node responses, and low execution robustness.
(2) Distributed task deployment strategies.
To address the limitations of centralized methods, distributed task deployment has garnered increasing attention. In this paradigm, individual control units or edge agents autonomously perceive, decide, and collaborate, executing tasks and managing resources at the regional level without relying on global communication. This makes it particularly suitable for large-scale, dynamically changing, and weakly coupled water environments with limited data transmission capability [36,37].
Han et al. proposed a self-organizing task mapping algorithm based on belief functions, enabling dynamic task chain configuration from sensing nodes to target control zones [38]. Davis et al. used a marginal cost-based method to optimize conflict-free routing of mobile monitoring robots in water quality surveillance [39].
Although distributed strategies enhance system resilience, local responsiveness, and edge intelligence utilization, they may lead to task conflicts, resource fragmentation, and insufficient coordination if global planning mechanisms are not properly integrated.

4.2. Communication Mechanisms for Multi-Agent Collaboration in Intelligent Water Environment Systems

In intelligent water environment systems, the collaborative operation and optimization of control strategies across multi-source heterogeneous nodes fundamentally rely on efficient and reliable communication mechanisms. This is particularly critical in representative applications such as watershed scheduling, pollution tracking, lake management, and smart drainage systems, where multi-level coordination among underwater sensors, surface unmanned vehicles, and land-based control platforms is required. These components collectively constitute a complex air–land–water integrated communication network operating across diverse physical media and dynamic environments. Consequently, designing communication architectures and cooperative mechanisms adapted to the hydrophysical characteristics of water environments has become a foundational task for enabling system-wide intelligent operation [40].
Currently, the dominant communication modalities in such systems include radio frequency (RF) communication, optical wireless communication, and underwater acoustic (UWA) communication. Each medium exhibits distinct trade-offs in terms of data rate, coverage, energy consumption, and environmental robustness and must be tailored to specific spatial and operational requirements.

4.2.1. Intelligent Communication Mechanisms for Collaborative Regulation in Water Environment Systems

In complex environments, communication mechanisms are regarded as a key enabler for achieving collaboration and efficient decision-making in multi-agent systems. In the context of intelligent regulation of water environment systems, agents—such as pumps, valves, sensor nodes, and monitoring drones—must share experiences, intentions, and local observations through communication technologies. This information exchange allows each agent to develop a more comprehensive understanding of both the system state and the behavior of other agents, ultimately facilitating the coordinated completion of complex tasks such as pollution control, water resource allocation, and flood warning, as illustrated in Figure 5.
The core of multi-agent communication mechanisms involves three essential questions: when to communicate, with whom, and what information to exchange. In practical intelligent water regulation tasks, these questions take concrete forms such as the following: When should upstream sensing nodes transmit pollutant concentration data to downstream control units? Should control modules adjust their communication according to dynamic factors in water quality models? Research on communication mechanisms has provided transferable technical frameworks and insights for water environment systems, and a toy illustration of such a gating rule is sketched below.
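In the sketch, transmission is gated on an assumed value-of-information estimate and messages are routed only to hydraulically downstream neighbors; all field names and thresholds are hypothetical:

```python
def should_communicate(local_obs, last_sent, threshold=0.2, cost=0.05):
    """'When' and 'what': send the pollutant reading only if it has changed
    enough that the downstream controller's decision may change."""
    info_gain = abs(local_obs["nh4"] - last_sent["nh4"])
    return info_gain - cost > threshold

def route_message(obs, neighbors):
    """'With whom': address hydraulically downstream neighbors only."""
    return [n for n in neighbors if n["is_downstream"]]
```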
DIAL (Differentiable Inter-Agent Learning) offers an end-to-end communication mechanism that, while simple, is limited to discrete communication settings—thus restricting its direct applicability in complex water resource management scenarios [41]. Abdallah proposed a centralized communication architecture in which the hidden states of all agents are averaged to enhance local observation capabilities [42]. In the context of water systems, this can be analogized to aggregating data from multiple sensor nodes to assist decision-making. However, the method suffers from semantic information loss, which limits its performance when dealing with multi-source heterogeneous environmental data.
In highly complex water system coordination tasks, communication optimization has also been modeled as a game-theoretic problem. For instance, Xue evaluated the marginal contribution of communication content to effectively prune irrelevant information [43]. This idea can be applied to assess the value of communication among multiple nodes in river network scheduling, providing more valuable transmission paths under resource-constrained conditions.

4.2.2. Graph Topologies for Air–Water–Land Multi-Agent Collaboration in Water Environment Systems

In the study of intelligent control for complex water environment systems, the problem of coordination and interaction among multiple control units has emerged as one of the core challenges in multi-agent modeling and optimal control. Coordination graphs, which explicitly represent inter-agent relationships, have gained significant attention in distributed control and intelligent optimization in recent years. By decomposing the global value function over a graph structure, this framework models system-wide utility through a higher-order abstraction.
In this setting, nodes correspond to control units, while edges or hyperedges represent local reward functions defined over the joint action–observation spaces of interacting agents. Once the coordination graph is constructed, distributed constrained optimization algorithms are typically employed to solve for optimal control actions. Control units perform multiple rounds of local information exchange and policy updates along graph edges [44].
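The sketch below illustrates this optimization pattern on an assumed toy graph: each edge carries a local payoff table over the incident agents' actions, and agents improve the joint action through rounds of local updates (simple coordinate ascent stands in here for the max-plus message passing often used in practice):

```python
import numpy as np

def joint_value(edges, actions):
    # edges: {(i, j): payoff} with payoff[a_i, a_j] the local reward on edge (i, j)
    return sum(p[actions[i], actions[j]] for (i, j), p in edges.items())

def coordinate_ascent(n_agents, n_actions, edges, rounds=10):
    actions = {i: 0 for i in range(n_agents)}
    for _ in range(rounds):                 # rounds of local exchange and updates
        for i in range(n_agents):
            actions[i] = max(range(n_actions),
                             key=lambda a: joint_value(edges, {**actions, i: a}))
    return actions

# Example: three gates in a line, coupled only between hydraulic neighbors.
rng = np.random.default_rng(0)
edges = {(0, 1): rng.random((2, 2)), (1, 2): rng.random((2, 2))}
print(coordinate_ascent(n_agents=3, n_actions=2, edges=edges))
```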
A critical challenge in water environment systems lies in learning dynamic, sparse, and task-dependent coordination graphs that can efficiently support distributed decision-making among control agents. Existing research still faces several limitations. For instance, sparse coordination Q-function methods attempt to automatically learn sparse graph structures for value function factorization, but the learned structures are often static and require substantial prior knowledge. The MACO environment and the CASEC algorithm provide testing platforms and context-aware coordination graph modeling methods, aiming to reduce value estimation error through graph representation learning, and have demonstrated promising results across multiple tasks [45]. Recent studies have also focused on structural advances in coordination graphs, such as nonlinear topology modeling and self-organizing polynomial structures, to further enhance adaptability in complex water control scenarios.
On the other hand, several approaches adopt implicit graph construction strategies by leveraging attention mechanisms to prune redundant connections and improve both the efficiency and interpretability of the learned graph structures. For example, DICG constructs an implicit coordination graph via a graph neural network that dynamically infers inter-agent coordination relationships, thereby improving cooperative performance in multi-agent systems [46]. Although such methods increase the flexibility of graph modeling, they still encounter scalability and stability issues when applied to large-scale, multi-scale, and highly coupled control scenarios in water environment systems.
Therefore, developing scalable, robust, and task-adaptive coordination graph learning and optimization mechanisms in high-dimensional and dynamic water environments remains an open and critical research direction [47].

4.2.3. Communication-Aware Collaboration and Adaptive Swarm Control Strategies

Most existing task scheduling designs for perception platforms remain centered on single-agent control, often neglecting the impact of dynamic communication channel conditions on overall system performance in water environments. To address the control challenges of multi-node systems under communication constraints, researchers have gradually begun to integrate channel state modeling with task coordination mechanisms, aiming to explore stable operational strategies for distributed intelligent systems operating in communication-limited environments.
Cao et al. proposed a communication quality estimator based on integral reinforcement learning (IRL). By considering both channel fading and uncertainties in node dynamics, a communication-aware cooperative control framework was constructed, which significantly improves the convergence efficiency of monitoring platform formation tasks, illustrated in Figure 6 [48]. The state update model is formulated in Equation (10):
$\dot{X}_i = A X_i + B \pi_i + B ( \hat{\pi}_i - \pi_i ) + G_i + N_i$
where $\pi_i$ denotes the current learned policy, $\hat{\pi}_i$ is a reference policy, and $A$, $B$, $G_i$, and $N_i$ represent the system transition dynamics and disturbance terms. This model enables the neural network to estimate system evolution under conditions of node mobility and channel fluctuations.
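A numerical sketch of this update, discretized with forward Euler, is given below; the system matrices, the two policies, and the noise scale are illustrative assumptions:

```python
import numpy as np

def step(X, A, B, pi, pi_ref, G, noise_scale, dt=0.1):
    """One Euler step of X' = A X + B pi + B (pi_ref - pi) + G + N."""
    N = noise_scale * np.random.randn(*X.shape)          # disturbance term N_i
    dX = A @ X + B @ pi(X) + B @ (pi_ref(X) - pi(X)) + G + N
    return X + dt * dX

# Example with a two-state node: pi is the learned policy, pi_ref a reference.
A = np.array([[-0.5, 0.1], [0.0, -0.3]]); B = np.eye(2); G = np.zeros(2)
pi = lambda x: -0.2 * x
pi_ref = lambda x: -0.25 * x
X = np.ones(2)
for _ in range(5):
    X = step(X, A, B, pi, pi_ref, G, noise_scale=0.01)
```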
Yan et al. developed a terminal sliding mode controller with a dual-power reaching law, effectively reducing control chattering and enhancing system stability under wind-induced disturbances. Furthermore, considering channel constraints such as path loss and multipath fading, a value iteration-based RL controller was proposed to predict the communication quality of the target node. This controller achieves an integrated design that combines communication estimation with task coordination control, offering a more robust solution for multi-platform water environment monitoring tasks [49,50].

4.2.4. Cross-Regional Integrated Sensing and Control System Architectures

In practical water environment governance, challenges such as large spatial coverage of water bodies, strong task heterogeneity, and diverse information sources are frequently encountered. To address these issues, recent studies have proposed the construction of cross-regional, multi-platform unified perception and regulation frameworks. By integrating ground stations, unmanned aerial vehicles (UAVs), autonomous monitoring buoys, and underwater sensor nodes, a multi-layer heterogeneous sensing network can be established to enable task data sharing and distributed control execution.
Zhu et al. proposed a cooperative protocol design for cross-regional unmanned platforms, aiming to address the communication heterogeneity between aerial and surface platforms and thereby improve overall task scheduling efficiency [51]. Chen et al. introduced blockchain and federated learning mechanisms to ensure data synchronization and security within the system [52]. In addition, a threshold signature scheme was employed to prevent tampering of monitoring data, thereby enhancing the verifiability and resilience of the system against adversarial attacks.
Meanwhile, related studies utilized network simulation platforms such as NS-3 to evaluate the performance of cross-platform collaborative systems, assessing key metrics including data throughput, command transmission delay, and system redundancy tolerance. These simulations provide theoretical support for real-world deployment. Through strategies such as proactive deployment, dynamic aquatic area partitioning, and task buffering at edge nodes, the system can adapt to diverse water environment task scenarios and has demonstrated strong collaborative performance in applications such as pollution response, emergency dispatching, and water–ecological joint regulation.

4.3. Path Planning and Adaptive Control for Water Infrastructure Systems in Complex Environments

In the simulation and intelligent control of water infrastructure systems, improving task reliability and operational safety requires integrated consideration of target locations, hydro-geographic terrain features, and obstacle distributions. Based on the sensed environmental state, cooperative scheduling between path planning and dynamic obstacle avoidance enables control agents to adapt to environmental uncertainties—such as sudden hydrological events, equipment malfunctions, or structural changes—by dynamically adjusting control strategies, thereby enhancing system robustness and response efficiency.

4.3.1. Multi-Platform Collaborative Mapping for Water Environment Infrastructure Systems

To ensure accurate navigation and stable task scheduling in large-scale water environment systems, robust and high-resolution map construction has become a critical prerequisite, especially in integrated control frameworks spanning aerial, surface, and underwater domains. Recent advances have explored diverse perception strategies incorporating visual, LiDAR, acoustic, and multi-sensor fusion techniques to address this fundamental challenge.
Du et al. developed a sparse, trackable mapping method based on environmental feature matching and key-frame extraction, which has been widely adopted for localization and navigation in autonomous platforms [53,54].
However, such vision-based methods often encounter limitations in complex 3D environments—such as fluctuating water surfaces, low-light scenes, and turbid underwater conditions—where sparse features and unstable registration severely degrade mapping quality. To enhance robustness under these conditions, Matsuki et al. constructed a dense mapping framework based on camera poses and selected key-frames [55], while Alireza et al. applied full-pixel sampling of image intensity gradients by minimizing photometric error, removing the reliance on conventional feature detectors [56].
In an effort to incorporate semantic understanding into the mapping process, Zhong proposed a sparse semantic mapping method that integrates semantic priors into deep odometry networks. While effective in structured environments, most of these models are still designed under static or ideal conditions and fail to address the challenges posed by dynamic natural disturbances—such as wind fields, water flows, and tidal shifts—which lead to image mismatches and position drift in multi-agent collaborative mapping [57].
To address these issues, recent work has turned attention toward robust mapping under environmental perturbations. For instance, Billings et al. incorporated visual feature extraction with robust optimization techniques to mitigate the impact of external noise and enhance the reliability of visual map matching [58].

4.3.2. Target Situation Awareness in Intelligent Water Environment Systems

Target Situation Awareness (TSA) technologies aim to support decision-making units in dynamic air–space–surface–underwater environments by fusing, analyzing, and modeling multi-source information to enable accurate judgment and adaptive strategy formulation. This technology is particularly well-suited for mission execution in unstructured water environments—such as oceans, reservoirs, and waterways—including applications such as pollutant tracking, illegal vessel detection, and emergency response. Existing research has proposed a variety of TSA approaches, aiming to construct an interconnected and intelligent perception framework to overcome challenges such as underwater noise, weak communication, and discontinuous sensing.
(1) Multi-platform collaborative perception based on edge computing.
To address issues such as limited underwater bandwidth, low acoustic communication rates, and frequent data loss, researchers have proposed heterogeneous edge computing architectures for distributed data processing. Brecko et al. introduced a federated learning framework on edge platforms that enhances convergence speed and perception accuracy while preserving data locality [59]. However, this approach suffers from high local training costs relative to aggregation, leading to suboptimal resource utilization.
To improve system-level resource scheduling, Chen et al. proposed a hybrid learning scheme that integrates centralized and federated paradigms and solves the joint optimization iteratively [60]. Lindsay et al. developed a cooperative system combining drones, surface vehicles, and underwater agents [61]. Through multi-domain sensing fusion, this system enhances real-time situational awareness and performs well in applications such as reservoir gate control, water quality monitoring, and target tracking.
(2) Multi-source perception enhanced by intelligent learning.
Traditional sensor-reliant systems face limitations in detecting non-cooperative targets and in image degradation scenarios. Braca et al. fused active and passive observations with Bayesian inference for underwater target recognition [62]. Chen et al. addressed underwater image degradation by combining blur estimation with physical image formation models to restore clarity [63].
To address the computational demands of deep models, Wang et al. proposed a frequency-interleaved progressive network to learn frequency-domain features for super-resolution enhancement [64]. Guo et al. built a 3D multi-static sonar network to reduce localization error [65]. For system-level validation, Wang et al. introduced a lightweight YOLOX-s model with transfer learning to detect vessels at multiple scales, achieving a balance between detection accuracy and runtime performance [66].
Research on target situation awareness (TSA) in water environment systems is evolving from traditional physical sensing toward intelligent and integrated perception. Recent advancements include cross-platform information fusion, edge inference under weak communication conditions, semantic-aware sensing, and model compression. Future efforts should prioritize the system-level integration and validation of TSA methods, as well as their tight coupling with control strategies and mission planning, in order to develop closed-loop intelligent water environment systems.

4.3.3. Dynamic Monitoring and Target Tracking Mechanisms

In intelligent regulation tasks of water environment systems, complex spatial structures, hydrodynamic processes, and pollutant transport pathways impose increasing demands on monitoring systems. Achieving efficient and dynamic perception and tracking of target regions—such as pollution sources, water quality anomaly zones, and ecologically sensitive areas—has become a critical challenge in water environment management. In recent years, researchers have introduced multi-source sensing, machine learning, and intelligent control into aquatic monitoring networks, driving advancements in target situation recognition, path optimization, and task coordination.
(1) Dynamic perception path planning and local tracking control.
During practical monitoring operations, limitations such as communication range, water disturbances, and system latency make traditional fixed-node deployment insufficient for fine-grained sensing in spatially heterogeneous areas. In response, dynamic and adaptive path planning frameworks have been proposed to enhance the match between mobile sensing units and target regions, enabling flexible and responsive coverage. Typically, the spatiotemporal delay and localization error between sensor nodes and control points are formulated as an optimization problem, which can be solved using supervised, unsupervised, or semi-supervised learning frameworks, illustrated in Figure 7.
For example, Yan et al. proposed a target localization method based on broad learning, integrating incremental learning for rapid model parameter adjustment [67]. While this improved computational efficiency, it did not fully compensate for errors introduced by node mobility and clock drift. In water environments, asynchronous clocks and depth-dependent acoustic propagation are modeled in Equation (11):
$T = \hat{t} + \theta, \qquad C(z) = \alpha z + b$
where $T$ denotes the local node clock reading, $\hat{t}$ is the actual time, and $\theta$ is the clock drift. $C(z)$ represents the sound speed at depth $z$, with $\alpha$ and $b$ denoting the vertical gradient and the surface-layer sound speed, respectively. Based on this, Han et al. developed an adaptive non-singular fast terminal sliding mode controller to track pollution plumes or critical water regions within a finite time horizon [68].
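As a minimal illustration of Equation (11), the following Python sketch corrects a drifting local timestamp and converts an acoustic travel time into a range estimate using the depth-dependent sound speed; all numerical values are illustrative assumptions rather than parameters from [67,68]:

```python
# Hypothetical parameters for Equation (11); values are assumptions, not from [67,68]
THETA = 0.012   # clock drift theta (s) between the local node and the reference clock
ALPHA = 0.017   # vertical sound-speed gradient alpha (m/s per metre of depth)
B = 1480.0      # surface-layer sound speed b (m/s)

def actual_time(T: float) -> float:
    """Invert T = t_hat + theta to recover the actual time t_hat."""
    return T - THETA

def sound_speed(z: float) -> float:
    """Depth-dependent sound speed C(z) = alpha * z + b."""
    return ALPHA * z + B

def one_way_range(t_tx: float, T_rx: float, z_tx: float, z_rx: float) -> float:
    """Range estimate from a drift-corrected travel time, using the mean
    sound speed between transmitter and receiver depths (straight-ray assumption)."""
    travel = actual_time(T_rx) - t_tx                 # corrected propagation delay (s)
    c_mean = 0.5 * (sound_speed(z_tx) + sound_speed(z_rx))
    return travel * c_mean                            # range in metres

print(one_way_range(t_tx=0.0, T_rx=0.215, z_tx=5.0, z_rx=60.0))  # ~300 m
```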
Furthermore, to handle uncertainties caused by environmental factors such as water density gradients, wind fields, and temperature layers, Naoki et al. proposed a robust control framework that integrates sliding mode control with disturbance observers [69]. Zhao et al. employed minimum rigidity graph theory and velocity observers to predict the target evolution path, balancing monitoring accuracy and operational cost [70].
(2) Multi-source platform collaborative sensing and task allocation.
Although individual sensor path planning can provide localized monitoring capabilities, large-scale and multi-regional water environments demand efficient multi-platform sensing coordination. Soriano et al. introduced a cross-platform relative localization scheme that fuses aerial visual feedback with surface/underwater sensing data. This approach maintains stability even under conditions of data loss or inconsistency [71].
To meet the needs of distributed monitoring in time-varying target regions, Zeng et al. integrated predictor structures with neural dynamic surface controllers, enabling collaborative multi-point sensing under unknown disturbances and model uncertainties. This method demonstrates strong adaptability in scenarios such as reservoir pollution tracing and algal bloom detection [72].
Despite these advances in improving system responsiveness and target recognition, real-world deployments still face challenges such as communication instability, sensor energy limitations, and weak model transferability. Future research should focus on heterogeneous data fusion mechanisms, dynamic modeling of unstructured targets, and low-power adaptive scheduling algorithms, aiming to build an intelligent water environment monitoring and management system with a closed loop of perception–decision–control.
(3) Reinforcement learning-driven intelligent regulation and policy optimization.
In complex water environment scenarios—such as reservoir scheduling, pollutant tracking, and dynamic area coverage for water quality monitoring—coordinated deployment of heterogeneous platforms (e.g., UAVs, surface patrol boats, and underwater sensor networks) is often required. To enhance system responsiveness and task execution efficiency under environmental uncertainties, researchers have increasingly introduced artificial intelligence technologies into studies of path optimization and multi-platform cooperative control.
Conventional methods typically rely on environment perception data collected via optical or LiDAR sensors, which may include visual imagery, laser scans, and infrared signals. These are commonly coupled with artificial potential fields or rule-based path planning methods for area coverage and obstacle avoidance [73,74]. However, such approaches often struggle to simultaneously meet the requirements of real-time performance, feasible path planning in dynamic environments, and constraints on energy consumption and navigation stability imposed by complex aquatic structures.
To address these challenges, deep reinforcement learning (DRL) has emerged as an adaptive strategy optimization tool, enabling systems to iteratively interact with the environment and learn optimal control policies through trial and error. In water environment tasks, DRL can be applied to solve multi-objective scheduling problems such as path generation, task area selection, and energy-aware control. When deployed on mobile sensing platforms, the learned outputs can be directly mapped to path decisions or operational commands.
Given the high cost and risk associated with real-world trial-and-error training in aquatic environments, Yan et al. proposed a DDPG (Deep Deterministic Policy Gradient)-based path optimization method. This approach utilizes disparity angles between the monitoring platform and surrounding obstacles as input and incorporates a cost function to generate energy-efficient and communication-aware obstacle avoidance paths, illustrated in Figure 8 [75].
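A hedged sketch of such a cost function is given below; the terms and weights are illustrative assumptions, not the exact formulation of [75]:

```python
import numpy as np

# Illustrative weights (assumptions, not the coefficients used in [75])
W_GOAL, W_OBST, W_ENERGY, W_COMM = 1.0, 0.6, 0.2, 0.4

def step_reward(dist_to_goal, prev_dist, disparity_angles, thrust, link_quality):
    """Composite reward for energy- and communication-aware obstacle avoidance.

    disparity_angles: angles (rad) between the platform heading and each obstacle;
                      values near zero mean the platform is heading into an obstacle.
    link_quality:     estimated communication quality in [0, 1].
    """
    progress = prev_dist - dist_to_goal                        # > 0 when approaching goal
    obstacle_risk = np.sum(np.exp(-np.abs(disparity_angles)))  # peaks as angles -> 0
    energy_cost = thrust ** 2                                  # quadratic actuation penalty
    return (W_GOAL * progress - W_OBST * obstacle_risk
            - W_ENERGY * energy_cost + W_COMM * link_quality)

print(step_reward(9.5, 10.0, np.array([0.1, 1.2]), thrust=0.3, link_quality=0.8))
```

In a DDPG setup, this scalar reward would drive the critic update, while the actor maps the disparity-angle observations directly to heading and thrust commands.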
Moreover, information coordination and task allocation among heterogeneous platforms are essential for enhancing overall system effectiveness. Hu et al. proposed a Distributed Guiding-Vector Field (DGVF) control architecture, establishing a perception network composed of aerial drones and surface vessels. The system employs a hierarchical control structure to decouple task allocation from command tracking [76].
Under more complex marine–atmospheric interference conditions, Sana et al. reviewed motion control techniques for hybrid-powered, high-mobility platforms operating in uncertain environments. Their work provides multi-dimensional dynamic modeling support for intelligent cruise monitoring in aquatic systems [77].

4.4. Towards Application-Oriented Deployment of Cross-Domain Control Architectures

Building upon the systematic review of key technologies such as task allocation mechanisms, cross-medium communication architectures, and environment-aware control strategies, it is evident that water environment systems are gradually evolving from localized autonomous control toward large-scale collaborative intelligence. Through hybrid centralized–distributed deployment, air–ground–water coordinated communication schemes, and graph-structured path planning algorithms empowered by reinforcement learning, multi-agent systems have demonstrated promising capabilities in cross-domain coordination and adaptive regulation.
However, most of these control architectures and mechanisms remain in the theoretical or experimental validation phase. Their effectiveness in complex real-world systems requires further evaluation through scenario-based systematic testing. For example, urban drainage systems demand rapid regulation in response to heavy rainfall and pollutant surges; water supply systems aim for energy efficiency and pressure stability during peak demand periods; hydropower systems must balance water availability, electricity generation, and ecological sustainability; and multi-platform water quality monitoring faces challenges such as sensor heterogeneity and unreliable communication.
Therefore, the subsequent chapters will focus on representative water system scenarios to explore the deployment pathways and application effectiveness of multi-agent reinforcement learning (MARL) in real-world water environment control. By engaging in task-specific modeling practices and comparative performance analysis, the next section aims to delineate the applicability boundaries, latent challenges, and evolutionary trends of existing methods—ultimately laying both theoretical and engineering foundations for a more robust and generalizable intelligent control framework in water environment systems.

5. Application Practices of Multi-Agent Reinforcement Learning in Water Environment Systems

5.1. Urban Drainage and Emergency Pollution Control

5.1.1. Intelligent Regulation Requirements and Reinforcement Learning

Under the dual pressures of rapid urbanization and increasing frequency of extreme rainfall events, urban drainage systems face escalating challenges, including combined sewer overflows (CSOs) and urban flooding, which pose serious threats to urban water safety and aquatic ecosystems [78]. As a key component of smart infrastructure, real-time control (RTC) has been widely adopted to dynamically adjust hydraulic loads based on sensor and actuator feedback, enabling proactive intervention in the drainage process.
In recent years, reinforcement learning (RL) has emerged as a powerful control method in urban drainage systems (UDSs), offering the ability to learn optimal control policies through environmental interaction without requiring precise physical models. Compared to traditional model predictive control (MPC) approaches, RL demonstrates lower computational cost and greater adaptability to environmental variability [79]. Studies have shown that RL algorithms can effectively reduce CSO occurrences, mitigate flooding risks, and limit pollutant discharges. However, RTC systems are often susceptible to real-world uncertainties such as actuator failures, sensor inaccuracies, and communication breakdowns, posing challenges to the robustness and safe deployment of RL-based controllers [80].
Heo proposed an explainable MARL-driven flexible sequencing batch reactor (SBR) system, which enables coordinated control of dissolved oxygen (DO) and EC regulators during the SBR process. The approach achieved an average energy consumption reduction of 4.93% while maintaining effluent quality standards under varying influent conditions [81]. Huang employed a multi-agent deep reinforcement learning (DRL) approach to evaluate the effectiveness of real-time control (RTC) strategies and developed an integrated assessment framework comprising multiple quantitative metrics, including control objectives, decision-making latency, robustness, and adaptability. A total of 31 historical rainfall events were analyzed. The results demonstrated that, compared with conventional RTC methods, the DRL-based approach reduced flood and overflow risks by an average of 15.1% to 43.5% [82].

5.1.2. Robustness Assessment and Fault Tolerance Under Communication Failures

Sensors and actuators in urban drainage networks are typically spatially distributed and rely on wireless communication to connect with centralized controllers or intelligent agents. Thus, communication failures or signal delays can significantly impair control performance. To evaluate system resilience under such conditions, researchers constructed two simulated fault scenarios: (1) sensor data loss leading to observation gaps; and (2) delayed control signals preventing timely actuation.
In centralized DQN systems, control signal failures can result in total loss of functionality. However, decentralized architectures such as IQL and VDN can incorporate default fallback rules or historical behavior references to retain partial autonomy in the absence of communication. To further enhance fault tolerance, an integrated control architecture was proposed, where a central agent governs remote control under normal conditions, while local agents autonomously take over during communication interruptions, improving emergency response stability.
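A minimal sketch of this take-over logic is shown below; the timeout threshold and the proportional fallback rule are illustrative assumptions:

```python
import time

COMM_TIMEOUT_S = 5.0  # assumed threshold for declaring a communication loss

class LocalDrainageAgent:
    """Follows central commands while the link is healthy and falls back
    to a local rule-based policy during communication interruptions."""

    def __init__(self):
        self.last_msg_time = time.monotonic()
        self.last_command = 0.5            # last known gate/pump setpoint in [0, 1]

    def on_central_command(self, setpoint: float) -> None:
        self.last_command = setpoint
        self.last_msg_time = time.monotonic()

    def act(self, water_level: float) -> float:
        if time.monotonic() - self.last_msg_time < COMM_TIMEOUT_S:
            return self.last_command        # normal remote control
        # Fallback: simple proportional rule on the locally sensed level
        return min(1.0, max(0.0, water_level / 2.0))
```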
Simulation experiments under varying communication failure rates revealed that decentralized MARL structures demonstrate superior robustness and performance retention, particularly in CSO volume reduction and response time, as illustrated in Figure 9. These findings underscore the practical advantages of MARL-based architectures in urban water management, supporting the development of resilient, communication-tolerant, and intelligent drainage systems.

5.2. Multi-Agent Intelligent Control for Urban Water Distribution Systems

In the context of smart urban water management, the water distribution network (WDN)—as the core infrastructure of urban supply systems—has emerged as a critical focus for enhancing resource utilization, operational efficiency, and water security. Unlike drainage systems that primarily manage gravity-based flow and stormwater routing, supply systems must address challenges such as pressure regulation, dynamic demand response, and coordinated pump–valve control [83,84].

5.2.1. Intelligent Control Requirements in Water Supply Systems

Urban water supply systems must maintain service reliability while minimizing energy consumption, ensuring pressure balance, and prolonging equipment lifespan. These systems are structurally complex, typically comprising multiple sets of pumps, tanks, valves, pipelines, and user demand nodes. Given the spatial heterogeneity and temporal variability of demand, scheduling decisions must strike a balance between real-time responsiveness, cost-efficiency, and system safety. Currently, the main control problems in WDNs are related to pump scheduling and valve operation, where pump activity directly influences energy usage and carbon emissions, while valves govern pressure distribution and flow equilibrium. These control processes are highly nonlinear and strongly interconnected, rendering conventional rule-based or optimization methods inadequate in the face of abrupt demand fluctuations, equipment malfunctions, or uncertain disturbances. Hence, there is a pressing need for intelligent control strategies capable of environmental perception, temporal memory, and multi-objective coordination.

5.2.2. Multi-Agent Modeling and Task Decomposition for Water Supply Networks

A water distribution system can be abstracted as a graph-based structure, where nodes represent reservoirs, tanks, consumers, or junctions, while edges represent physical components such as pipes, pumps, or valves. The control objective is typically to maintain acceptable tank levels while achieving network-wide pressure minimization under minimum pressure constraints. In a MAS framework, key control components—such as pump stations, main valves, and tanks—are modeled as autonomous agents. Each agent independently perceives its local state (e.g., pressure, tank level, demand flow) and selects actions (e.g., adjusting pump speed or valve position) accordingly. System-wide optimization is achieved through indirect coordination via environmental dynamics. Each agent is defined by four core functions: perception, memory, decision-making, and execution. Agents operate in a distributed manner without requiring explicit inter-agent communication, enhancing system scalability, robustness, and fault tolerance.
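As a schematic illustration, the four functions of such an agent can be organized as follows; the observation fields, memory length, and placeholder decision rule are assumptions for exposition only:

```python
from collections import deque

class PumpStationAgent:
    """Skeleton of a WDN agent with perception, memory, decision-making, and execution."""

    def __init__(self, memory_len: int = 24):
        self.history = deque(maxlen=memory_len)     # memory: recent local states

    def perceive(self, pressure: float, tank_level: float, demand: float) -> tuple:
        obs = (pressure, tank_level, demand)        # perception: local state only
        self.history.append(obs)
        return obs

    def decide(self, obs: tuple) -> float:
        _, tank_level, _ = obs
        # Placeholder rule; in practice a learned policy would act here
        return 0.2 if tank_level > 0.9 else 0.8     # normalized pump speed

    def execute(self, speed: float) -> None:
        print(f"set pump speed to {speed:.2f}")     # execution: actuator command
```

Because each agent conditions only on its own observations, coordination emerges indirectly through the shared hydraulic environment, which is what removes the need for explicit inter-agent communication.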

5.2.3. Applications of Deep Reinforcement Learning in Water Supply Control

To enhance autonomous decision-making, deep reinforcement learning (DRL) techniques have been widely applied to pump scheduling and valve operation, as illustrated in Figure 10. Representative algorithms such as Proximal Policy Optimization (PPO) and Deep Q-Networks (DQN) are trained through state–action–reward interactions in simulated environments to derive long-term optimal policies.
PPO, a stable policy gradient method, constrains each update by clipping its surrogate objective to prevent excessively large policy changes. The underlying policy-gradient loss is defined in Equation (12):
$L(\theta) = \mathbb{E}_t \left[ \log \pi_\theta(a_t \mid s_t) \cdot \hat{A}_t \right]$
where $\hat{A}_t$ is the advantage function, measuring the deviation of the action from the average policy performance. PPO often uses a two-layer neural network to output the parameters $\alpha$ and $\beta$ of a Beta distribution for the action policy, defined in Equation (13):
$h(x; \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\,\Gamma(\beta)} \cdot x^{\alpha - 1} (1 - x)^{\beta - 1}$
To ensure a unimodal action density (i.e., $\alpha, \beta > 1$), the network outputs are transformed via Equation (14):
$\alpha = \mathrm{LeakyReLU}(x) + 1, \qquad \beta = \mathrm{Softplus}(x) + 1 = \log(1 + \exp(x)) + 1$
In practical applications, PPO agents are trained with an hourly time step over a 24 h operational cycle. The state inputs include pressure, tank level, and time, while the actions represent valve positions or pump speed settings. EPANET is used for hydraulic simulation and reward computation.
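A minimal PyTorch sketch of such a Beta policy head, following Equations (13) and (14), is given below; the layer sizes and the three-dimensional state are assumptions:

```python
import torch
import torch.nn as nn
from torch.distributions import Beta

class BetaPolicyHead(nn.Module):
    """Two-layer network outputting Beta parameters via Equation (14), so the
    action density on [0, 1] stays unimodal for moderate pre-activations."""

    def __init__(self, obs_dim: int, hidden: int = 64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.alpha_head = nn.Linear(hidden, 1)
        self.beta_head = nn.Linear(hidden, 1)

    def forward(self, obs: torch.Tensor) -> Beta:
        h = self.body(obs)
        alpha = nn.functional.leaky_relu(self.alpha_head(h)) + 1.0
        beta = nn.functional.softplus(self.beta_head(h)) + 1.0
        return Beta(alpha, beta)

# State: e.g. (pressure, tank level, hour of day); action: valve position in [0, 1]
policy = BetaPolicyHead(obs_dim=3)
dist = policy(torch.tensor([[0.7, 0.4, 0.5]]))
action = dist.sample()            # valve-opening fraction for the next hour
log_prob = dist.log_prob(action)  # enters the PPO surrogate loss
```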
To reflect real-world demand variability, Zaman et al. applied a DQN strategy to optimize pump operating periods, reducing energy consumption while maintaining supply reliability [85]. Candelieri et al. used Q-learning for pump on/off control, demonstrating scalability and tolerance to demand uncertainty in multi-pump systems [86]. Xu et al. integrated knowledge assistance using maximum state values from historical data to accelerate PPO convergence under unstable conditions [87]. Vercouter further enhanced DQN with offline learning and supervised pretraining (e.g., k-NN) for improved policy performance. Hung used Q-learning to allocate water quantities in non-stationary supply environments [88].
Overall, deep reinforcement learning (DRL) algorithms—particularly PPO, DQN, MADDPG, and Q-learning—have demonstrated promising applicability in urban water supply systems, especially under multi-scenario, multi-objective, and high-disturbance settings [89]. Future research may further enhance the generalization capabilities and cross-regional deployment efficiency of multi-agent policies by integrating meta-learning, transfer learning, and federated learning techniques.
Wang applied a data-driven personalized federated multi-agent attention-based deep reinforcement learning (PFL-MAADRL) algorithm to address the intake scheduling problem of three water abstraction pump stations in an urban water treatment system. The model simultaneously optimized energy consumption, reservoir water levels, and trunk line pressure. Under uncertainty, the proposed method demonstrated robust performance and achieved up to a 10.6% reduction in maximum energy consumption compared to benchmark approaches [90].
Jiang proposed a coupled framework that integrates a comprehensive water quality model with a deep reinforcement learning (DRL) algorithm and applied it to the eutrophic Lake Dianchi—the largest freshwater lake in China—for validation. The results showed that compared to previous operational strategies, the proposed approach significantly reduced total nitrogen and total phosphorus concentrations in the lake by 7% and 6%, respectively [7].

5.3. Multi-Agent Regulation in Hydropower Energy Systems

Hydropower systems, functioning as integrated water infrastructure that combines water supply, flood control, power generation, and ecological regulation, involve highly complex and nonlinear processes. Particularly in scenarios such as multi-reservoir joint scheduling, real-time flood discharge, and energy conversion efficiency optimization, the system is characterized by significant uncertainty and conflicting multi-objective demands, which pose challenges for traditional optimization approaches [91]. In recent years, reinforcement learning (RL) and deep reinforcement learning (DRL) have been increasingly applied to the intelligent operation of hydropower systems, demonstrating superior adaptability and generalization capabilities. In contexts where accurate environmental modeling is infeasible or traditional methods fail to yield timely solutions, multi-agent systems (MASs) offer a scalable paradigm for addressing high-dimensional and dynamic control problems collaboratively, as illustrated in Figure 11.

5.3.1. Hydropower Scheduling Problems and DRL-Based Hydropower System Control

Hydropower operation problems can be formulated as Markov Decision Processes (MDPs), where the state space typically includes reservoir levels, inflow rates, and inflow forecasts, the action space includes turbine flow or spill discharge, and the reward function integrates electricity revenue, water utilization efficiency, and risk mitigation. Hu et al. were among the first to apply the Deep Q-Network (DQN) to real-time multi-reservoir scheduling, using two separate neural networks to estimate action values and policy functions. This approach effectively overcomes the “curse of dimensionality” faced by conventional RL in high-dimensional state spaces and outperforms both dynamic programming and decision tree-based strategies, particularly in balancing hydropower benefits and interpretability [92].
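The MDP formulation sketched above can be written as a gym-style environment; the capacity, inflow distribution, and reward weights below are illustrative assumptions rather than the setup of [92]:

```python
import numpy as np

class ReservoirEnv:
    """Single-reservoir scheduling MDP: state = (level, inflow),
    action = turbine release fraction, reward = revenue minus risk penalties."""

    CAPACITY = 1000.0    # storage capacity (assumed units)
    MAX_RELEASE = 80.0   # maximum turbine flow per time step

    def __init__(self, seed: int = 0):
        self.rng = np.random.default_rng(seed)
        self.level = 0.5 * self.CAPACITY

    def step(self, action: float):
        release = np.clip(action, 0.0, 1.0) * self.MAX_RELEASE
        release = min(release, self.level)              # cannot release more than stored
        inflow = self.rng.gamma(shape=2.0, scale=15.0)  # stochastic inflow
        raw_level = self.level - release + inflow
        spill = max(raw_level - self.CAPACITY, 0.0)     # forced spill discharge
        self.level = raw_level - spill
        reward = 1.0 * release - 0.5 * spill            # generation revenue minus spill
        if self.level > 0.95 * self.CAPACITY:
            reward -= 10.0                              # flood-risk penalty
        obs = np.array([self.level / self.CAPACITY, inflow / 50.0])
        return obs, reward, False, {}
```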
Luo et al. further enhanced DQN performance by integrating dueling networks, Double DQN (DDQN), and prioritized experience replay, which improved learning stability and sample efficiency [93]. Jiang et al. extended DQN to a multi-energy system by incorporating solar and wind power forecasts, achieving significant improvements in overall system profit through cross-energy coordination [94]. Wu et al. developed a multi-objective RL framework to optimize water supply, ecological preservation, and power generation simultaneously. By introducing weighted combinations into the reward function and evaluating multiple policy strategies, the study demonstrated superior performance in multi-objective trade-offs compared to traditional baselines [95].
Mitjana et al. addressed constraint scheduling under reservoir level uncertainty using the REINFORCE algorithm. By incorporating chance constraints and fallback mechanisms, they enhanced system safety and operational robustness [96]. Riemer-Sorensen employed the Soft Actor–Critic (SAC) algorithm to optimize weekly hydropower dispatch, leveraging historical electricity price and inflow data to reduce overflow loss and increase generation profit [97].

5.3.2. Cutting-Edge Applications of Multi-Agent DRL in Hydropower Systems

Multi-agent architectures are particularly well-suited for coordinated scheduling tasks in multi-reservoir and multi-hydropower systems. Hu et al. developed a collaborative model based on Multi-Agent Deep Deterministic Policy Gradient (MADDPG) for hydropower dispatching, achieving significant improvements in both energy consumption control and electricity generation efficiency [98].
MADDPG adopts a centralized training and decentralized execution (CTDE) paradigm, wherein each agent (e.g., reservoir unit) makes decisions independently using local observations, while policy updates are performed using global information. This enhances the generalization ability and robustness of the overall system.
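The CTDE pattern reduces to a simple asymmetry between training and execution: each actor maps its local observation to an action, while the training-time critic conditions on the observations and actions of all agents. Below is a minimal sketch with assumed dimensions (and, for brevity, a single shared actor instead of MADDPG's per-agent actors):

```python
import torch
import torch.nn as nn

N_AGENTS, OBS_DIM, ACT_DIM = 3, 8, 2   # e.g., three reservoir units (assumed sizes)

actor = nn.Sequential(                 # decentralized execution: local obs -> action
    nn.Linear(OBS_DIM, 64), nn.ReLU(), nn.Linear(64, ACT_DIM), nn.Tanh())

critic = nn.Sequential(                # centralized training: all obs + actions -> Q
    nn.Linear(N_AGENTS * (OBS_DIM + ACT_DIM), 128), nn.ReLU(), nn.Linear(128, 1))

obs = torch.randn(N_AGENTS, OBS_DIM)
actions = torch.stack([actor(o) for o in obs])    # each agent uses only its own obs
q_value = critic(torch.cat([obs.flatten(), actions.flatten()]).unsqueeze(0))
```

After training, only the actors are deployed, so no global information is required at execution time.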
Xi et al. systematically evaluated the sensitivity of scheduling performance with respect to key hyperparameters, including scheduling time step, state space representation, and policy update frequency. To alleviate the training burden caused by high-dimensional inputs, state space compression and feature construction techniques were applied to reduce input dimensionality [99].
To improve modeling under high-dimensional state inputs, researchers have incorporated factors such as hydraulic interdependencies among reservoirs, seasonal forecasts, and stochastic inflow disturbances. These enhancements enable DRL models to better adapt to abrupt hydrological changes across river basins. Moreover, the design of DRL decision networks is evolving toward deeper and more modular architectures. By introducing residual connections, attention mechanisms, and contrastive learning components, the models exhibit enhanced capabilities in extracting and leveraging temporal features for robust and adaptive scheduling.

5.4. Intelligent Task and Path Control in Multi-Platform Autonomous Monitoring Systems

In response to the growing demand for dynamic sensing, real-time response, and efficient spatial coverage in water environmental systems, multi-platform autonomous monitoring systems have demonstrated substantial value in tasks such as pollution tracking, water quality assessment, and ecological early warning. Approaches based on multi-agent systems (MASs)—covering task allocation, path planning, and water body monitoring—have emerged as key technologies driving intelligent regulation in aquatic environments.

5.4.1. Multi-Agent Task Allocation and Coordination Mechanisms

A key challenge lies in optimizing task allocation and execution coordination among multiple AUV/SV platforms. To address issues such as task priority, platform heterogeneity, and communication constraints, researchers have proposed various distributed strategies—such as auction-based mechanisms, market-driven games, and graph-theoretic modeling—for multi-objective task planning. Wu et al. proposed an improved consensus algorithm based on bundle assignment for dynamic task allocation among heterogeneous AUV platforms [100].
Furthermore, Nguyen developed a distributed coordination architecture for underwater multi-robot systems with limited communication bandwidth, which improved task coverage while maintaining manageable communication loads [101]. Lin et al. introduced a software-defined networking framework for centralized management of AUV networks. By integrating graph-based optimization strategies with reinforcement learning, they proposed a graph-structured SAC algorithm for optimal path scheduling and system input selection [102]. Their system further incorporates a three-tier network architecture—SDNA-ITS—comprising Detection AUVs at the network layer, Local Control AUVs at the intermediate layer, and a USV-based Master Controller (USV-MC) at the global coordination layer, enabling robust collaborative scheduling for complex pollution detection tasks.

5.4.2. Autonomous Path Planning and Dynamic Obstacle Avoidance Strategies

In intricate and obstacle-dense underwater environments, effective inspection of submerged structures relies on advanced path planning systems. These systems must not only address energy efficiency and obstacle avoidance but also dynamically adapt to ocean currents, physical obstructions, and sensor feedback. A hierarchical MER-SAC framework for AUV path planning is illustrated in Figure 12.
Studies have shown that traditional methods perform well in static environments, but their effectiveness is limited in complex underwater scenarios [103]. In the realm of intelligent path optimization, bio-inspired algorithms—including Ant Colony Optimization (ACO), Particle Swarm Optimization (PSO), and Cuckoo Search—have been widely applied to path generation for underwater missions [104]. However, their robustness is limited when facing abrupt environmental changes or unstructured scenarios. In recent years, deep reinforcement learning (DRL) has demonstrated significant advantages in underwater path planning, offering improved adaptability to uncertain environments and enhancing the overall intelligence of navigation systems [105].
In addition, underwater defect detection tasks based on optical imaging impose higher requirements on path planning. Underwater image quality is often degraded by lighting variability, color attenuation, and turbidity. Therefore, path planning systems must simultaneously optimize imaging position, illumination angle, and obstacle avoidance strategy. To address the complexity and adaptivity required for underwater structure inspection, Tao proposed multi-objective DRL-based path planning frameworks [106]. These models, incorporating algorithms such as SAC or DDPG, enable dynamic responsiveness and real-time policy adaptation during the exploration of targeted regions.

5.4.3. Multi-Source Data Fusion and Real-Time Control in Water Body Monitoring Tasks

In water body monitoring tasks, multi-platform systems must ensure high spatiotemporal resolution in data acquisition over target regions and provide rapid responses to abnormal events such as sudden pollution incidents or algal bloom outbreaks. To achieve this, constructing a perception and control architecture based on multi-source heterogeneous data fusion has become a critical enabler, as illustrated in Figure 13.
Punit proposed a water quality monitoring framework that integrates data from multiple sensors, enabling real-time perception of key environmental parameters and improving accuracy through error correction techniques [107]. Joana et al. further utilized a collaborative autonomous underwater vehicle (AUV) sensing mechanism to significantly enhance response times during algal bloom events [108].
Mancy proposed the FL-MAPPO model, which promotes decentralized and privacy-preserving decision-making to minimize latency and eliminate single points of failure. The framework adopts a multi-layer architecture encompassing data acquisition, distributed learning, and real-time execution. Experimental results showed that the proposed method outperformed existing large-scale real-time flood forecasting systems, achieving a mean squared error (MSE) of 0.112, R-squared ( R 2 ) of 0.953, and mean absolute error (MAE) of 0.207 [109].
Building upon edge intelligence principles, Ren et al. developed a water environment monitoring system that integrates computation, analytics, and feedback mechanisms into a unified edge architecture. The real-time regulation system leverages wireless communication and edge computing to upload local sensing results to the control center, enabling rapid command dissemination and adaptive task updating [110].
Future advancements in multi-platform collaborative monitoring are expected to focus on intelligent formation control, energy-efficient coordination, and fine-grained target recognition. These developments will further enhance the system’s comprehensive capabilities in emergency response, pollutant diffusion tracking, and ecological risk assessment.

6. Cross-Domain Generalization and Digital Twin Integration in Multi-Agent Systems

6.1. Policy Transfer and Robustness in Heterogeneous Scenarios

The deployment of multi-agent systems (MASs) in complex water environment scenarios faces substantial challenges, including diverse task types, heterogeneous geographic conditions, sensor inconsistency, and communication instability. To enhance the generalization and adaptability of learned policies across such non-stationary conditions, researchers have explored transfer learning and meta-reinforcement learning (meta-RL) techniques [111].
Yang et al. investigated the issue of suboptimal renewable energy utilization in integrated power systems (IPS) combining hydropower, photovoltaics, and pumped storage. They proposed a parallel multi-agent dynamic scheduling framework that integrates spatiotemporal information, enabling more efficient coordination and adaptation in dynamic, multi-modal energy environments [112].
Other studies have employed adversarial perturbation training and parameter regularization to improve the robustness of RL agents to sensor drift and input disturbances [113]. Under conditions of inconsistent or missing multi-source data, Bayesian policy optimization and distributed collaborative training were introduced to stabilize cross-regional deployments [114]. In heterogeneous systems such as urban water networks and mountainous reservoirs, knowledge distillation and agent weight-sharing strategies have been applied to transfer experiential knowledge across domains, significantly reducing the cost of retraining.

6.2. Integration Potential of Digital Twins in Intelligent Water System Control

Digital twins (DTs), as a key technology linking physical systems with virtual models, are increasingly being integrated into water system control. By fusing sensor data, simulation models, and control logic, DTs enable real-time mapping, prediction, and optimization of MAS operations. Heo et al. implemented a DT system in a wastewater treatment plant for dynamic aeration control using reinforcement learning, enabling feedback from online policy updates to physical processes and thus forming a closed-loop optimization cycle [115].
In urban water supply systems, Salvatore et al. developed a multi-scale DT framework that integrates LSTM-based prediction modules with MAS-based scheduling to achieve coordinated optimization of tank levels and pump energy consumption [116]. In multi-AUV systems, DTs are used to replicate individual agent states, supporting dynamic simulations of task planning, path optimization, and energy consumption while enabling real-time policy adjustment [117]. Looking ahead, DT-enabled intelligent water systems are expected to integrate complex system modeling, real-time simulation, and policy iteration, serving as foundational platforms for building high-reliability intelligent control systems.
Despite the strong theoretical synergy between digital twin (DT) technologies and reinforcement learning (RL), their integration in practical engineering systems still faces several critical challenges.
First, the simulation-to-reality gap (sim-to-real gap) remains a major bottleneck limiting the generalizability of trained models. In simulated environments, it is difficult to fully replicate real-world sensor precision, control dynamics, and external disturbances. As a result, policies trained within DT environments may experience performance degradation and lack robustness when deployed in real systems.
Second, model fidelity is a fundamental prerequisite for constructing effective digital twins. Any deviation or simplification in the physical modeling used for policy learning can impair the multi-agent system’s perception capabilities, thereby undermining the stability and interpretability of the training process.
Third, system latency is a key factor influencing real-time control performance. In scenarios involving multi-source sensing, data transmission, model inference, and control execution, excessive system response time relative to the dynamics of the target process may result in ineffective decisions or system instability. This issue is particularly critical in high-frequency control tasks such as pump regulation and aeration control.
Therefore, the future integration of DT and multi-agent reinforcement learning (MARL) requires in-depth research into cross-modal data alignment, enhanced environmental modeling fidelity, and low-latency architecture design in order to build intelligent water system control frameworks that are stable, trustworthy, and practically deployable.

6.3. Edge Computing and Control Latency Optimization

In multi-agent systems deployed in real-world water infrastructure, limited wireless communication capacity and the spatial distribution of agents impose significant constraints on centralized coordination. As a result, both control commands and observation data must be processed at edge nodes near the source to mitigate communication latency.
Zhang proposed a distributed control strategy based on multi-agent reinforcement learning (MARL) to enhance system-level coordination and robustness under unstable communication conditions [118]. This approach enables agents to make decentralized decisions locally, while still contributing to global system objectives.
Applied to real-time control (RTC) in urban drainage systems, the method not only improves regulation efficiency but also addresses potential risks arising from hardware malfunctions in sensors and actuators or communication failures. The framework supports resilient operations in cyber–physical water infrastructure by leveraging edge intelligence for timely and adaptive control.
In large-scale AUV clusters, Zhang et al. deployed lightweight inference models at each edge node and used information compression mechanisms to upload only key control parameters to the cloud, maintaining global system consistency while minimizing network load [119]. For high-frequency control devices such as pumps, gates, and monitoring units, Qian et al. introduced edge-side control caches and sliding window prediction mechanisms, improving fault tolerance and latency robustness to achieve microsecond-level response times [120]. By integrating mobile edge computing (MEC), 5G network slicing, and asynchronous event-driven control architectures, future edge-intelligent water systems are expected to feature modular design, high-frequency responsiveness, and adaptive fault tolerance, driving the evolution toward low-latency and highly robust intelligent control.

7. Research Challenges and Future Directions

7.1. Challenges and Technical Limitations

Despite the promising applications of multi-agent reinforcement learning (MARL), digital twins, and edge intelligence in water environment systems, several challenges remain in their real-world deployment and adaptation to complex scenarios. These challenges can be broadly categorized into four major aspects:
(1) Limited policy generalization and poor system transferability.
Current MARL frameworks face significant difficulties in cross-regional deployment and heterogeneous scenario adaptation. Most policy transfer and meta-learning approaches rely heavily on structural similarities between source and target tasks, which are often lacking in complex water systems due to variations in geographic features, task objectives, and sensor configurations. While adversarial training and parameter regularization have been proposed to improve robustness, these methods still struggle to generalize under unpredictable disturbances and multi-source uncertainties encountered in real-world settings.
(2) Incomplete system-level co-modeling and control architectures.
Water environment systems are inherently composed of multiple interconnected subsystems characterized by multi-scale, multi-objective, and multi-constraint dynamics. Most existing methods remain focused on localized optimization, lacking a unified architecture for system-wide coordinated control. In large-scale, heterogeneous MARL settings, issues such as role assignment, communication protocols, and policy synchronization among agents are yet to be fully addressed. These limitations hinder task integration, reduce policy efficiency, and make it difficult to achieve real-time optimization under complex and dynamic conditions.
(3) Complexity in twin and edge system construction with limited real-time responsiveness.
Digital twin systems require high-fidelity physical modeling and high-quality real-time data, leading to high implementation costs and poor scalability. They are currently applicable mainly to isolated or small-scale systems. While edge computing can reduce control latency, challenges remain in terms of information sharing, global consistency maintenance, and resilience to communication failures. Moreover, limitations in computing power and sensing range at edge nodes hinder their scalability and adaptability in large-scale deployments.
(4) Limited adaptability and safety robustness under extreme conditions.
Most existing intelligent control models are primarily trained under stable operating conditions, lacking sufficient capacity to model system behavior in the face of nonlinear environmental responses, data incompleteness, or equipment malfunctions encountered during extreme events. This results in poor dynamic adaptability and inadequate coordination across multiple data and control sources during emergencies. In scenarios such as sudden flooding, contamination accidents, or system failures, agents typically lack risk-awareness mechanisms and fault-tolerant strategies, making it difficult to achieve resilient, safe, and continuous control—key objectives in the management of critical water infrastructure systems.

7.2. Future Research Directions

7.2.1. Building a Data Ecosystem and AI Integration for Intelligent Water Systems

Given the complexity of pollutant compositions, limitations in detection techniques, and severe fragmentation of data across water systems, constructing a comprehensive data ecosystem integrated with AI and machine learning has become a crucial direction for enhancing intelligent water system regulation. In recent years, researchers have proposed using advanced spectroscopic detection, two-dimensional chromatography, and multi-channel online sensors to collect high-dimensional and structured environmental data. These data are then processed using AI models to identify key pollutants, enabling precise water quality risk warnings and real-time optimization.
Ding et al. developed an AI-assisted framework that integrates detection, analysis, and control optimization for identifying disinfection by-product (DBP) precursors. This framework separates dissolved organic matter (DOM) using two-dimensional chromatography, converts it into a digital matrix via UV/fluorescence detectors, and applies AI/ML methods to model the relationship between DOM composition and DBP formation potential [121]. The resulting predictive model informs the adjustment of disinfection parameters, thereby improving both the timeliness and accuracy of pollution detection and supporting dynamic optimization of water treatment processes.
Furthermore, integrating such approaches with digital twin systems enables real-time virtual–physical synchronization, dynamic water quality mapping, and strategy co-optimization. Future research should focus on multi-source data fusion modeling, the development of standardized and shareable data platforms, and the evolution of AI algorithms from black-box models to explainable frameworks, all aiming to enhance the reliability and autonomy of intelligent water environment management.

7.2.2. Perception-Driven Communication Mechanisms

In intelligent aquatic environments, clusters of distributed agents—such as water quality sensors, storage control units, and mobile sampling devices—must collaboratively perform complex tasks under conditions of partial observability. Given the constraints imposed by limited communication bandwidth, restricted sensing range, and dynamic task environments, the design of efficient communication mechanisms has emerged as a critical enabler for intelligent system coordination.
Regarding communication timing, non-continuous (event-triggered) strategies have received considerable attention. In this paradigm, agents initiate communication only when significant state changes occur or when interaction is deemed essential. This approach effectively reduces communication overhead and is particularly suitable for scenarios that demand both responsiveness and resource efficiency, such as dynamic hydrological response, emergency pollution mitigation, or stormwater regulation.
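A minimal event-trigger rule of this kind can be written as follows; the threshold is an assumed design parameter:

```python
import numpy as np

DELTA = 0.15  # assumed trigger threshold on the state-change norm

class EventTriggeredSender:
    """Transmits the local state only when it has drifted more than DELTA
    from the last transmitted value, saving bandwidth between events."""

    def __init__(self, x0: np.ndarray):
        self.last_sent = x0.copy()

    def maybe_send(self, x: np.ndarray):
        if np.linalg.norm(x - self.last_sent) > DELTA:
            self.last_sent = x.copy()
            return x          # broadcast to neighboring agents
        return None           # stay silent; peers reuse the last received state
```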
In terms of interaction targets, group-based communication mechanisms offer a hybrid approach by integrating global and local communication strategies. Agents dynamically select communication partners based on current task demands, enabling broader coordination coverage while minimizing redundant exchanges. This approach is highly effective for cross-regional water resource allocation and multi-site water quality monitoring, facilitating task decomposition and goal partitioning among agents.
The encoding and compression of communication content are also critical for improving system scalability. Current research emphasizes techniques such as localized state sharing and neighbor-based compressed representations to reduce computational and memory burdens, making them well-suited for large-scale distributed deployments. These methods are applicable to scenarios including large-scale water network simulation and regional decision integration in urban drainage systems.
From the perspective of information processing, the adoption of mutual information optimization and attention-based mechanisms enhances the ability to extract salient signals from noisy observations. Attention-driven communication protocols dynamically assess the relevance of peer-provided information, allowing agents to focus on task-critical exchanges. Such mechanisms exhibit strong adaptability in applications such as pollution plume tracking and strategic protection of critical water sources [122].
In summary, a perception-driven communication framework structured around four core dimensions—communication timing, interaction targeting, message encoding, and information processing—lays a theoretical and technical foundation for effective coordination in distributed intelligent systems. Future research directions include the following:
  • Developing task-sensitive communication triggers tailored to non-stationary environments;
  • Designing multi-scale communication structures and hierarchical architectures to accommodate system heterogeneity;
  • Incorporating graph-based attention and multi-modal fusion for efficient heterogeneous information integration;
  • Exploring co-optimization frameworks that jointly address communication and control strategies.
Together, these strategies can drive the evolution of intelligent water systems from “multi-point coordination” toward “ubiquitous intelligence,” enabling advanced management and decision optimization across domains such as smart water engineering, real-time operations, and ecological protection.

7.2.3. Self-Organizing Multi-Agent Collaboration

Self-organizing multi-agent collaboration aims to develop autonomous agents capable of general collaboration, even in the absence of predefined protocols or prior knowledge. This approach is highly relevant to heterogeneous and decentralized water systems, especially in scenarios such as cross-basin water allocation, multi-agency pollution prevention, and emergency flood response, where agents must coordinate with unfamiliar or dynamically changing partners.
Early studies assumed static and observable agent behaviors, analogous to predefined operations in hydraulic facilities. However, real-world actors often exhibit behavioral uncertainty. Subsequent work has relaxed these assumptions by enabling agents to dynamically model their teammates through interaction learning. For example, agents can observe the behaviors of others to infer strategies and adapt accordingly—critical in systems where autonomous water quality sampling robots adjust their paths based on peer activity to enhance monitoring efficiency.
Other studies enhanced inter-agent communication through robust, low-bandwidth protocols to support temporary cooperation. These approaches are particularly applicable to distributed hydrological sensor networks and real-time coordination in urban drainage systems under low-latency constraints.
Despite progress, many approaches remain confined to closed-world settings with fixed agent types. In contrast, open-world water systems—comprising mobile sensors, heterogeneous infrastructure, and human interventions—require agents to adapt as their population evolves. Emerging methods use graph neural networks (GNNs), such as the GPL framework, to model evolving interaction structures, enabling effective coordination in scenarios like cross-watershed dispatching.
Further research has also considered adversarial robustness—such as handling malicious pollution sources or unreliable nodes—and few-shot collaborative learning for rapid adaptation in new or emergency tasks. In flood events, for example, agents must quickly learn effective strategies from minimal experience, while strategy coverage methods can help build more generalized agent libraries to support scalable simulation and training.

7.2.4. Collaborative Control Architecture for Complex Water Systems

Water environment systems exhibit a high degree of complexity and multi-scale coupling, encompassing a wide range of interconnected subsystems such as urban drainage, water resource scheduling, water quality management, and ecological conservation. These subsystems are characterized by strong dynamic dependencies and spatiotemporal interactions, necessitating coordinated control at the system level to ensure operational safety, efficient resource allocation, and long-term ecological sustainability.
However, most reinforcement learning (RL)-based approaches to date are constrained to isolated subsystems or static scenarios, making them insufficient for capturing the inter-module dependencies and asynchronous dynamics inherent in such complex systems. Particularly under conditions involving multi-objective trade-offs and complex operational constraints, conventional RL methods often suffer from limited learning efficiency, poor scalability, and lack of coordination. These limitations highlight the urgent need for a novel control architecture that supports modularity, hierarchy, and collaboration.
Hierarchical control architectures (HCAs) provide a systematic framework for decomposing complex control problems into multiple layers, each responsible for perception, planning, and execution at different levels of granularity. This structure enhances both the manageability and learning efficiency of the system. Representative modeling methods include the following:
  • Task decomposition and role assignment: The global control objective is decomposed into multiple sub-goals, each corresponding to a control layer or agent role. For instance, in water resource management, the upper layer may handle high-level scheduling tasks such as water allocation and inter-regional distribution, while the lower layer focuses on local actuation tasks like pump control and feedback-based flow adjustments.
  • Hierarchical policy learning: The high-level controller (meta-controller) learns to assign sub-tasks or intermediate goals to lower-level controllers, while the low-level controllers optimize their actions based on local observations to fulfill the high-level directives. This structure can be implemented using the options framework or hierarchical reinforcement learning (HRL), which improves policy interpretability and enhances transferability across tasks (a minimal sketch follows this list).
  • State abstraction and coordination mechanisms: Each layer can define its own state–action space abstraction to reduce learning complexity. Communication between layers or across subsystems can be facilitated through intermediate variables, shared representations, or attention-based mechanisms, enabling inter-layer coordination and policy alignment.
  • Multi-agent cooperation within hierarchical structures: Each sub-task can be further handled by one or more agents operating under a hierarchical framework. Mechanisms such as centralized training with decentralized execution (CTDE), heterogeneous reward design, and graph neural networks (GNNs) are employed to promote agent collaboration and information sharing. For example, in wastewater treatment, individual agents can control flow regulation, chemical dosing, and energy optimization, while in watershed-scale management, agents can coordinate water storage operations, pollutant dispersion control, and ecological flow management.
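As referenced in the hierarchical policy learning item above, the following sketch illustrates the meta-controller/low-level split; the sub-goal set and both placeholder policies are assumptions for exposition:

```python
SUB_GOALS = ("fill_tank", "reduce_pressure", "hold")  # assumed sub-goal set

class MetaController:
    """High-level policy: maps an abstract system state to a sub-goal."""
    def select_goal(self, abstract_state: str) -> str:
        # Placeholder for a learned policy over sub-goals
        return "fill_tank" if abstract_state == "low_storage" else "hold"

class LowLevelController:
    """Low-level policy: pursues the assigned sub-goal from raw observations."""
    def act(self, goal: str, obs: dict) -> float:
        if goal == "fill_tank":
            return 1.0                                # run pump at full speed
        if goal == "reduce_pressure":
            return max(0.0, obs["pressure"] - 0.5)    # partially close a valve
        return 0.5                                    # neutral holding action

meta, worker = MetaController(), LowLevelController()
goal = meta.select_goal("low_storage")
action = worker.act(goal, {"pressure": 0.8})
```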
Such a hierarchical multi-agent architecture inherently supports scalability, mitigates the curse of dimensionality in large state spaces, and addresses asynchronous dynamics and local conflicts in distributed environments. Moreover, state abstraction and inter-layer communication are key enablers for architectural efficiency. In practice, high-level policies operate on low-dimensional abstract states representing system trends or progress, while low-level agents respond to raw sensor inputs in real time. To ensure coherence between local actions and global objectives, communication protocols based on GNNs, attention mechanisms, or policy embeddings can be employed to construct shared state representations.
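As one hedged illustration of such communication protocols, the sketch below implements scaled dot-product attention over agent messages, in the spirit of attentional communication schemes [122]. The message dimension and the assumption that every agent can read every peer's message are simplifications; in practice the attention pattern would be masked by the communication topology.

```python
import torch
import torch.nn as nn


class AttentionComm(nn.Module):
    """Each agent attends over peer messages to build a shared state
    representation for inter-layer or inter-agent coordination."""
    def __init__(self, msg_dim, key_dim=32):
        super().__init__()
        self.q = nn.Linear(msg_dim, key_dim)
        self.k = nn.Linear(msg_dim, key_dim)
        self.v = nn.Linear(msg_dim, key_dim)

    def forward(self, messages):               # messages: (n_agents, msg_dim)
        q, k, v = self.q(messages), self.k(messages), self.v(messages)
        scores = q @ k.t() / k.shape[-1] ** 0.5
        weights = torch.softmax(scores, dim=-1)  # who listens to whom
        return weights @ v                     # aggregated context per agent


context = AttentionComm(msg_dim=16)(torch.randn(5, 16))  # e.g., five agents
```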
Future research should further explore the integration of physics-informed modeling, multi-source heterogeneous data fusion, and cross-scale process coupling into hierarchical multi-agent frameworks. Embedding such architectures within digital twin platforms will enable closed-loop optimization from sensing and perception to cognitive decision-making. This integrated approach is expected to significantly enhance the intelligence, adaptability, and robustness of water environment systems under uncertainty, paving the way for a new generation of resilient and sustainable water governance systems.

7.2.5. Modeling Large-Scale Heterogeneous Multi-Agent Systems

Most reinforcement learning (RL) applications in intelligent water management focus primarily on small- to medium-scale homogeneous environments, where agents typically share similar architectures, objectives, and operational tasks. However, real-world water systems exhibit significantly greater complexity, encompassing functionally heterogeneous subsystems—such as urban drainage management, watershed dispatching, water treatment operations, and ecological preservation—along with multi-objective missions and diverse, multi-modal data sources. This complexity calls for control models that can effectively coordinate large-scale, heterogeneous agent populations and their multi-modal perception processes.
A comprehensive water management task may involve diverse agent types: sluice gate controllers responsible for hydraulic regulation in river networks, industrial supervisory agents focused on pollution mitigation, process-level agents optimizing wastewater treatment operations, and ecological agents tasked with maintaining environmental flows. These agents differ in their structural designs, functional objectives, underlying control mechanisms, and operational spatiotemporal scales. When operating within shared and dynamic environments, they must achieve strategic coordination across multiple layers to ensure system stability and the fulfillment of diverse task objectives.
Currently, there is a notable absence of scalable RL frameworks specifically designed for heterogeneous multi-agent systems in complex water environments, which limits the deployment of such methods in real-world settings where variability, conflict, and cooperation coexist. Future research efforts should prioritize the following directions:
  • Modeling heterogeneous tasks and role-based structures: Develop agent modeling frameworks that accommodate varying structures, roles, capabilities, and objectives, including clear task decomposition and role assignments (a role-descriptor sketch follows this list);
  • Multi-objective game theory and cooperative strategy learning: Employ game-theoretic models and multi-objective optimization techniques to support robust strategy formation under both cooperative and adversarial conditions;
  • Cross-scale coordination and communication protocol design: Design efficient and scalable communication protocols to resolve asynchrony and coordination issues arising from distributed, multi-scale decision-making processes;
  • Generalization and scalability: Construct RL training architectures that offer high generalizability and scalability, thereby improving performance, adaptability, and robustness in complex, uncertain real-world environments.
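As a minimal sketch of the first direction, role descriptors can make heterogeneity explicit while retaining role-level parameter sharing, which keeps training tractable as the number of agents grows. The roles, dimensions, and objectives below are illustrative assumptions, not a fixed taxonomy.

```python
from dataclasses import dataclass, field

import torch.nn as nn


@dataclass
class AgentRole:
    """Role descriptor for a heterogeneous water-system agent;
    all fields here are illustrative assumptions."""
    name: str
    obs_dim: int                 # size of the role-specific observation
    act_dim: int                 # size of the role-specific action space
    objectives: list = field(default_factory=list)


def make_policy(obs_dim, act_dim):
    """One shared policy network per role (role-level parameter sharing)."""
    return nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                         nn.Linear(64, act_dim))


ROLES = [
    AgentRole("sluice_gate", obs_dim=12, act_dim=3, objectives=["level_tracking"]),
    AgentRole("treatment_unit", obs_dim=24, act_dim=5, objectives=["effluent_quality", "energy"]),
    AgentRole("eco_flow", obs_dim=8, act_dim=2, objectives=["environmental_flow"]),
]

# Separate policies across roles, shared parameters within a role.
policies = {role.name: make_policy(role.obs_dim, role.act_dim) for role in ROLES}
```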
Addressing these challenges will be essential for transitioning from isolated RL deployments to fully integrated, large-scale intelligent water management systems capable of supporting long-term sustainability, operational resilience, and adaptive governance.

7.2.6. Adaptability and Robustness of Intelligent Models Under Extreme Conditions

The ability to optimize control strategies under extreme conditions is essential for developing resilient and secure control architectures in water environment systems. In practice, such systems are often subjected to abrupt disturbances, including floods, droughts, pollution incidents, equipment malfunctions, and extreme weather events. These anomalies may induce partial system failures, abrupt changes in resource availability, or severe sensor outages, all of which pose considerable challenges to control models trained under nominal operational conditions.
Most existing reinforcement learning (RL) approaches were developed with a focus on optimal control in stable and routine scenarios, frequently overlooking the inherent uncertainties and nonlinear dynamics associated with extreme events. However, in such critical scenarios, both the system’s safety boundaries and its operational objectives may undergo significant shifts.
Safety-constrained reinforcement learning algorithms have introduced distributed risk-aware critic mechanisms to enhance the adaptability of agents in high-risk scenarios. By incorporating distributed risk evaluation modules and context-aware policy optimization strategies, these frameworks enable the construction of intelligent control architectures capable of dynamically adjusting objective weights and strategy priorities under varying extreme conditions. Future research should prioritize the following directions:
  • Extreme scenario modeling and synthetic environment construction: Develop high-fidelity simulation environments that emulate events such as flooding, pollution surges, and water shortages to support adversarial training and generalization of agent behavior;
  • Safety-constrained reinforcement learning: Introduce risk-sensitivity metrics, dynamic safety threshold functions, and multi-objective constraint modeling to improve system boundary awareness and ensure secure policy execution (a minimal dual-variable sketch follows this list);
  • Uncertainty modeling and robust optimization: Apply Bayesian reinforcement learning and distributed decision-making frameworks to explicitly represent environmental and prediction uncertainties, thereby improving model robustness under disruptive conditions;
  • Integration of fault tolerance and emergency response: Incorporate fault diagnosis, contingency planning, and multi-source coordination mechanisms to combine RL-based control with traditional emergency management systems for enhanced responsiveness during critical events.
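One common realization of safety-constrained RL, sketched below, is a Lagrangian relaxation in which a dual variable re-weights a learned cost signal against the reward objective. This is a generic illustration of a dynamic safety threshold, not the specific distributed risk-aware critic mechanism referenced above; the cost budget, advantage terms, and update rule are assumptions.

```python
import torch


class LagrangianMultiplier:
    """Dual-variable update for a safety constraint E[cost] <= budget.
    The multiplier rises when the constraint is violated and decays
    otherwise, dynamically re-weighting safety in the policy loss."""
    def __init__(self, budget, dual_lr=1e-3):
        self.budget = budget
        self.dual_lr = dual_lr
        self.lam = 0.0

    def update(self, mean_episode_cost):
        # Gradient ascent on the dual variable, clipped at zero.
        self.lam = max(0.0, self.lam + self.dual_lr * (mean_episode_cost - self.budget))
        return self.lam


def constrained_policy_loss(log_prob, reward_adv, cost_adv, lam):
    """Penalized policy-gradient surrogate: pursue reward advantage while
    discounting actions whose expected safety cost is high."""
    return -(log_prob * (reward_adv - lam * cost_adv)).mean()


# Example update cycle (illustrative numbers):
dual = LagrangianMultiplier(budget=0.1)
lam = dual.update(mean_episode_cost=0.25)      # violation -> lam grows
logp, r_adv, c_adv = torch.randn(64), torch.randn(64), torch.rand(64)
loss = constrained_policy_loss(logp, r_adv, c_adv, lam)
```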
By embedding these capabilities, future intelligent water systems will be more capable of handling rare yet high-impact disruptions, thereby contributing to the realization of robust, adaptive, and secure environmental management infrastructures.

8. Conclusions

Multi-agent reinforcement learning (MARL) is emerging as a foundational methodology for intelligent control in water environment systems, enabling a paradigm shift from centralized, rule-based operations toward distributed, cooperative, and adaptive decision-making frameworks. This review has systematically traced the evolution of MARL applications in the domain, highlighting its capabilities in improving water resource allocation, enhancing pollution response strategies, reducing operational energy consumption, and reinforcing system resilience under complex and dynamic conditions.
Despite recent advances, deploying MARL in real-world water systems remains a substantial challenge. Key limitations stem from agent heterogeneity, cross-domain task coupling, communication constraints, and the inherent non-stationarity of environmental dynamics. These challenges hinder the scalability, reliability, and interpretability of existing MARL-based control frameworks in large-scale operational settings.
To address these gaps, future research must focus on several critical directions: (i) developing hierarchical and cross-layer modeling frameworks that support system-level abstraction and inter-agent coordination; (ii) enhancing generalization and transfer capabilities across spatiotemporally variable tasks with limited training samples; (iii) designing interpretable and adaptive control architectures that remain robust under uncertain, multi-domain operating conditions; and (iv) establishing unified benchmarks and semi-realistic testbeds that integrate hydrodynamic variability, sensing faults, communication latency, and external disturbances for the systematic validation of MARL algorithms.
In summary, MARL holds strong potential to become a core enabling technology for the next generation of intelligent, self-organizing, and sustainable water environment systems. Advancing this vision will require closer integration across control theory, machine learning, and environmental systems engineering, as well as a shift toward system-level co-design, multi-scale optimization, and field-level deployment to fully bridge the gap between algorithmic innovation and practical impact.

Author Contributions

Supervision, Y.P. and L.J.; investigation, L.J.; resources, L.J. and Y.P.; visualization, L.J.; writing—original draft, L.J.; writing—review & editing, Y.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Chen, F.; Sewlia, M.; Dimarogonas, D.V. Cooperative control of heterogeneous multi-agent systems under spatiotemporal constraints. Annu. Rev. Control 2024, 57, 100946.
  2. Li, W.; Shi, F.; Li, W.; Yang, S.; Wang, Z.; Li, J. Review on Cooperative Control of Multi-Agent Systems. Int. J. Appl. Math. Control Eng. 2024, 7, 10–17.
  3. Jain, G.; Kumar, A.; Bhat, S.A. Recent developments of game theory and reinforcement learning approaches: A systematic review. IEEE Access 2024, 12, 9999–10011.
  4. Zhu, R.; Liu, L.; Li, P.; Chen, N.; Feng, L.; Yang, Q. DC-MAC: A delay-aware and collision-free MAC protocol based on game theory for underwater wireless sensor networks. IEEE Sens. J. 2024, 24, 6930–6941.
  5. He, X.; Hu, Z.; Yang, H.; Lv, C. Personalized robotic control via constrained multi-objective reinforcement learning. Neurocomputing 2024, 565, 126986.
  6. Cheng, J.; Cheng, M.; Liu, Y.; Wu, J.; Li, W.; Frangopol, D.M. Knowledge transfer for adaptive maintenance policy optimization in engineering fleets based on meta-reinforcement learning. Reliab. Eng. Syst. Saf. 2024, 247, 110127.
  7. Jiang, Q.; Li, J.; Sun, Y.; Huang, J.; Zou, R.; Ma, W.; Guo, H.; Wang, Z.; Liu, Y. Deep-reinforcement-learning-based water diversion strategy. Environ. Sci. Ecotechnol. 2024, 17, 100298.
  8. Zuccotto, M.; Castellini, A.; Torre, D.L.; Mola, L.; Farinelli, A. Reinforcement learning applications in environmental sustainability: A review. Artif. Intell. Rev. 2024, 57, 88.
  9. Jiao, P.; Ye, X.; Zhang, C.; Li, W.; Wang, H. Vision-based real-time marine and offshore structural health monitoring system using underwater robots. Comput.-Aided Civ. Infrastruct. Eng. 2024, 39, 281–299.
  10. Kartal, S.K.; Cantekin, R.F. Autonomous Underwater Pipe Damage Detection Positioning and Pipe Line Tracking Experiment with Unmanned Underwater Vehicle. J. Mar. Sci. Eng. 2024, 12, 2002.
  11. Ravier, R.; Garagić, D.; Galoppo, T.; Rhodes, B.J.; Zulch, P. Multiagent Reinforcement Learning and Game-Theoretic Optimization for Autonomous Sensor Control. In Proceedings of the 2024 IEEE Aerospace Conference, Big Sky, MT, USA, 2–9 March 2024; pp. 1–12.
  12. Shi, H.; Li, J.; Mao, J.; Hwang, K.S. Lateral Transfer Learning for Multiagent Reinforcement Learning. IEEE Trans. Cybern. 2023, 53, 1699–1711.
  13. Han, S.; Dastani, M.; Wang, S. Sparse communication in multi-agent deep reinforcement learning. Neurocomputing 2025, 625, 129344.
  14. Luo, J.; Zhang, W.; Jiongming, S.; Yuan, W.; Chen, J. Research Progress of Multi-Agent Game Theoretic Learning. J. Syst. Eng. Electron. 2022.
  15. Rutherford, A.; Ellis, B.; Gallici, M.; Cook, J.; Lupu, A.; Ingvarsson Juto, G.; Willi, T.; Hammond, R.; Khan, A.; Schroeder de Witt, C.; et al. JaxMARL: Multi-agent RL environments and algorithms in JAX. Adv. Neural Inf. Process. Syst. 2024, 37, 50925–50951.
  16. Nagoev, Z.; Bzhikhatlov, K.; Pshenokova, I.; Unagasov, A. Algorithms and Software for Simulation of Intelligent Systems of Autonomous Robots Based on Multi-Agent Neurocognitive Architectures; Springer: Berlin/Heidelberg, Germany, 2024; pp. 381–391.
  17. Wu, P.; Guan, Y. Multi-agent deep reinforcement learning for computation offloading in cooperative edge network. J. Intell. Inf. Syst. 2025, 63, 567–591.
  18. Milani, S.; Topin, N.; Veloso, M.; Fang, F. Explainable reinforcement learning: A survey and comparative review. ACM Comput. Surv. 2024, 56, 1–36.
  19. Ernst, D.; Louette, A. Introduction to Reinforcement Learning; Feuerriegel, S., Hartmann, J., Janiesch, C., Zschech, P., Eds.; Springer: Singapore, 2024; pp. 111–126.
  20. Tam, P.; Ros, S.; Song, I.; Kang, S.; Kim, S. A survey of intelligent end-to-end networking solutions: Integrating graph neural networks and deep reinforcement learning approaches. Electronics 2024, 13, 994.
  21. Boyajian, W.L.; Clausen, J.; Trenkwalder, L.M.; Dunjko, V.; Briegel, H.J. On the convergence of projective-simulation–based reinforcement learning in Markov decision processes. Quantum Mach. Intell. 2020, 2, 13.
  22. Albrecht, S.V.; Christianos, F.; Schäfer, L. Multi-Agent Reinforcement Learning: Foundations and Modern Approaches; MIT Press: Cambridge, MA, USA, 2024.
  23. Li, Y.; Tsang, Y.P.; Wu, C.H.; Lee, C.K.M. A multi-agent digital twin–enabled decision support system for sustainable and resilient supplier management. Comput. Ind. Eng. 2024, 187, 109838.
  24. Tong, Y.; Fei, S. Research on Multi-Object Occlusion Tracking and Trajectory Prediction Method Based on Deep Reinforcement Learning. In Proceedings of the 2024 6th International Conference on Robotics, Intelligent Control and Artificial Intelligence (RICAI), Nanjing, China, 6–8 December 2024; pp. 656–659.
  25. Palmer, G.; Tuyls, K.; Bloembergen, D.; Savani, R. Lenient multi-agent deep reinforcement learning. arXiv 2017, arXiv:1707.04402.
  26. Zheng, Y.; Meng, Z.; Hao, J.; Zhang, Z. Weighted double deep multiagent reinforcement learning in stochastic cooperative environments. In Pacific Rim International Conference on Artificial Intelligence; Springer: Cham, Switzerland, 2018; pp. 421–429.
  27. Micieli, M.; Botter, G.; Mendicino, G.; Senatore, A. UAV thermal images for water presence detection in a Mediterranean headwater catchment. Remote Sens. 2021, 14, 108.
  28. Zhang, C.; Wu, Z.; Li, Z.; Xu, H.; Xue, Z.; Qian, R. Multi-agent Reinforcement Learning-Based UAV Swarm Confrontation: Integrating QMIX Algorithm with Artificial Potential Field Method. In Proceedings of the 2024 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Kuching, Malaysia, 6–10 October 2024; pp. 161–166.
  29. Romero, A.; Song, Y.; Scaramuzza, D. Actor-critic model predictive control. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024; pp. 14777–14784.
  30. Zhang, J.; Han, S.; Xiong, X.; Zhu, S.; Lü, S. Explorer-actor-critic: Better actors for deep reinforcement learning. Inf. Sci. 2024, 662, 120255.
  31. Zhao, E.; Zhou, N.; Liu, C.; Su, H.; Liu, Y.; Cong, J. Time-aware MADDPG with LSTM for multi-agent obstacle avoidance: A comparative study. Complex Intell. Syst. 2024, 10, 4141–4155.
  32. Kuba, J.G.; Chen, R.; Wen, M.; Wen, Y.; Sun, F.; Wang, J.; Yang, Y. Trust region policy optimisation in multi-agent reinforcement learning. arXiv 2021, arXiv:2109.11251.
  33. Harris, A.; Liu, S. MAIDRL: Semi-centralized multi-agent reinforcement learning using agent influence. In Proceedings of the 2021 IEEE Conference on Games (CoG), Copenhagen, Denmark, 17–20 August 2021; pp. 1–8.
  34. Ke, C.; Chen, H. Cooperative path planning for air–sea heterogeneous unmanned vehicles using search-and-tracking mission. Ocean Eng. 2022, 262, 112020.
  35. Zhang, Y.; Chen, X.; Mao, Y.; Shuai, C.; Jiao, L.; Wu, Y. Analysis of resource allocation and PM2.5 pollution control efficiency: Evidence from 112 Chinese cities. Ecol. Indic. 2021, 127, 107705.
  36. Zhang, H.; Zhao, H.; Liu, R.; Kaushik, A.; Gao, X.; Xu, S. Collaborative task offloading optimization for satellite mobile edge computing using multi-agent deep reinforcement learning. IEEE Trans. Veh. Technol. 2024, 73, 15483–15498.
  37. Jouini, O.; Sethom, K.; Namoun, A.; Aljohani, N.; Alanazi, M.H.; Alanazi, M.N. A survey of machine learning in edge computing: Techniques, frameworks, applications, issues, and research directions. Technologies 2024, 12, 81.
  38. Han, S.; Zhang, T.; Li, X.; Yu, J.; Zhang, T.; Liu, Z. The Unified Task Assignment for Underwater Data Collection with Multi-AUV System: A Reinforced Self-Organizing Mapping Approach. IEEE Trans. Neural Netw. Learn. Syst. 2024, 35, 1833–1846.
  39. Davis, A.; Wills, P.S.; Garvey, J.E.; Fairman, W.; Karim, M.A.; Ouyang, B. Developing and field testing path planning for robotic aquaculture water quality monitoring. Appl. Sci. 2023, 13, 2805.
  40. Ullah, R.; Abbas, A.W.; Ullah, M.; Khan, R.U.; Khan, I.U.; Aslam, N.; Aljameel, S.S. EEWMP: An IoT-Based Energy-Efficient Water Management Platform for Smart Irrigation. Sci. Program. 2021, 2021, 5536884.
  41. Lenczner, G.; Chan-Hon-Tong, A.; Le Saux, B.; Luminari, N.; Le Besnerais, G. DIAL: Deep interactive and active learning for semantic segmentation in remote sensing. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 3376–3389.
  42. Moubayed, A.; Sharif, M.; Luccini, M.; Primak, S.; Shami, A. Water leak detection survey: Challenges & research opportunities using data fusion & federated learning. IEEE Access 2021, 9, 40595–40611.
  43. Xue, D.; Yuan, L.; Zhang, Z.; Yu, Y. Efficient Multi-Agent Communication via Shapley Message Value. In IJCAI; Cambridge University Press: Cambridge, UK, 2022; pp. 578–584.
  44. Abadal, S.; Jain, A.; Guirado, R.; López-Alonso, J.; Alarcón, E. Computing graph neural networks: A survey from algorithms to accelerators. ACM Comput. Surv. 2021, 54, 1–38.
  45. Balasubramanian, E.; Elangovan, E.; Tamilarasan, P.; Kanagachidambaresan, G.; Chutia, D. Optimal energy efficient path planning of UAV using hybrid MACO-MEA* algorithm: Theoretical and experimental approach. J. Ambient Intell. Humaniz. Comput. 2023, 14, 13847–13867.
  46. Li, S.; Gupta, J.K.; Morales, P.; Allen, R.; Kochenderfer, M.J. Deep implicit coordination graphs for multi-agent reinforcement learning. arXiv 2020, arXiv:2006.11438.
  47. Ou, M.; Xu, S.; Luo, B.; Zhou, H.; Zhang, M.; Xu, P.; Zhu, H. 3D Ocean Temperature Prediction via Graph Neural Network with Optimized Attention Mechanisms. IEEE Geosci. Remote Sens. Lett. 2024, 21, 1–5.
  48. Cao, W.; Yan, J.; Yang, X.; Luo, X.; Guan, X. Communication-Aware Formation Control of AUVs with Model Uncertainty and Fading Channel via Integral Reinforcement Learning. IEEE/CAA J. Autom. Sin. 2023, 10, 159–176.
  49. Yan, M.; Wang, Z. Water Quality Prediction Method Based on Reinforcement Learning Graph Neural Network. IEEE Access 2024, 12, 184421–184430.
  50. Fan, X.; Zhang, X.; Yu, X. A graph convolution network-deep reinforcement learning model for resilient water distribution network repair decisions. Comput.-Aided Civ. Infrastruct. Eng. 2022, 37, 1547–1565.
  51. Zhu, M.; Wen, Y.Q. Design and Analysis of Collaborative Unmanned Surface-Aerial Vehicle Cruise Systems. J. Adv. Transp. 2019, 2019, 1323105.
  52. Chen, J.; Wang, Z.; Srivastava, G.; Alghamdi, T.A.; Khan, F.; Kumari, S.; Xiong, H. Industrial blockchain threshold signatures in federated learning for unified space-air-ground-sea model training. J. Ind. Inf. Integr. 2024, 39, 100593.
  53. Du, Y.; Fu, H.; Wang, S.; Sun, Z. A Real-Time Collaborative Mapping Framework using UGVs and UAVs. In Proceedings of the 2024 IEEE International Conference on Unmanned Systems (ICUS), Nanjing, China, 18–20 October 2024; pp. 584–590.
  54. Samadzadegan, F.; Toosi, A.; Dadrass Javan, F. A critical review on multi-sensor and multi-platform remote sensing data fusion approaches: Current status and prospects. Int. J. Remote Sens. 2025, 46, 1327–1402.
  55. Matsuki, H.; Scona, R.; Czarnowski, J.; Davison, A.J. CodeMapping: Real-Time Dense Mapping for Sparse SLAM using Compact Scene Representations. arXiv 2021, arXiv:2107.08994.
  56. Ghasemieh, A.; Kashef, R. Towards explainable artificial intelligence in deep vision-based odometry. Comput. Electr. Eng. 2024, 115, 109127.
  57. Zhong, X.; Pan, Y.; Behley, J.; Stachniss, C. SHINE-Mapping: Large-scale 3D mapping using sparse hierarchical implicit neural representations. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; pp. 8371–8377.
  58. Billings, G.; Camilli, R.; Johnson-Roberson, M. Hybrid visual SLAM for underwater vehicle manipulator systems. IEEE Robot. Autom. Lett. 2022, 7, 6798–6805.
  59. Brecko, A.; Kajati, E.; Koziorek, J.; Zolotova, I. Federated Learning for Edge Computing: A Survey. Appl. Sci. 2022, 12, 9124.
  60. Chen, D.; Deng, T.; Jia, J.; Feng, S.; Yuan, D. Mobility-aware decentralized federated learning with joint optimization of local iteration and leader selection for vehicular networks. Comput. Netw. 2025, 263, 111232.
  61. Lindsay, J.; Ross, J.; Seto, M.L.; Gregson, E.; Moore, A.; Patel, J.; Bauer, R. Collaboration of Heterogeneous Marine Robots Toward Multidomain Sensing and Situational Awareness on Partially Submerged Targets. IEEE J. Ocean. Eng. 2022, 47, 880–894.
  62. Braca, P.; Willett, P.; LePage, K.; Marano, S.; Matta, V. Bayesian Tracking in Underwater Wireless Sensor Networks with Port-Starboard Ambiguity. IEEE Trans. Signal Process. 2014, 62, 1864–1878.
  63. Chen, J.; Wu, H.T.; Lu, L.; Luo, X.; Hu, J. Single underwater image haze removal with a learning-based approach to blurriness estimation. J. Vis. Commun. Image Represent. 2022, 89, 103656.
  64. Wang, L.; Xu, L.; Tian, W.; Zhang, Y.; Feng, H.; Chen, Z. Underwater image super-resolution and enhancement via progressive frequency-interleaved network. J. Vis. Commun. Image Represent. 2022, 86, 103545.
  65. Guo, Y.; Li, Y.; Tan, H.; Zhang, Z.; Ye, J.; Ren, C. Research on Target Tracking Simulation System Framework for Multi-Static Sonar Buoys. J. Phys. Conf. Ser. 2023, 2486, 012097.
  66. Xuan, W.; Jian-She, G.; Bo-Jie, H.; Zong-Shan, W.; Hong-Wei, D.; Jie, W. A lightweight modified YOLOX network using coordinate attention mechanism for PCB surface defect detection. IEEE Sens. J. 2022, 22, 20910–20920.
  67. Yan, J.; Yi, M.; Yang, X.; Luo, X.; Guan, X. Broad-Learning-Based Localization for Underwater Sensor Networks with Stratification Compensation. IEEE Internet Things J. 2023, 10, 13123–13137.
  68. Han, L.; Tang, G.; Cheng, M.; Huang, H.; Xie, D. Adaptive Nonsingular Fast Terminal Sliding Mode Tracking Control for an Underwater Vehicle-Manipulator System with Extended State Observer. J. Mar. Sci. Eng. 2021, 9, 501.
  69. Motoi, N.; Hirayama, D.; Yoshimura, F.; Sabra, A.; Fung, W.K. Sliding Mode Control with Disturbance Estimation for Underwater Robot. In Proceedings of the 2022 IEEE 17th International Conference on Advanced Motion Control (AMC), Padova, Italy, 18–20 February 2022; pp. 317–322.
  70. Chen, B.; Hu, J.; Zhao, Y.; Ghosh, B.K. Finite-time observer based tracking control of uncertain heterogeneous underwater vehicles using adaptive sliding mode approach. Neurocomputing 2022, 481, 322–332.
  71. Soriano, T.; Pham, H.A.; Gies, V. Experimental investigation of relative localization estimation in a coordinated formation control of low-cost underwater drones. Sensors 2023, 23, 3028.
  72. Zeng, J.; Wan, L.; Li, Y.; Zhang, Z.; Xu, Y.; Li, G. Robust composite neural dynamic surface control for the path following of unmanned marine surface vessels with unknown disturbances. Int. J. Adv. Robot. Syst. 2018, 15, 1729881418786646.
  73. Luo, G.; Shen, Y. A study on path optimization of construction machinery by fusing ant colony optimization and artificial potential field. Adv. Control Appl. Eng. Ind. Syst. 2024, 6, e125.
  74. Tang, J.; Pan, Q.; Chen, Z.; Liu, G.; Yang, G.; Zhu, F.; Lao, S. An improved artificial electric field algorithm for robot path planning. IEEE Trans. Aerosp. Electron. Syst. 2024, 60, 2292–2304.
  75. Yan, J.; Zhang, L.; Yang, X.; Chen, C.; Guan, X. Communication-Aware Motion Planning of AUV in Obstacle-Dense Environment: A Binocular Vision-Based Deep Learning Method. IEEE Trans. Intell. Transp. Syst. 2023, 24, 14927–14943.
  76. Hu, B.B.; Zhang, H.T.; Liu, B.; Ding, J.; Xu, Y.; Luo, C.; Cao, H. Coordinated Navigation Control of Cross-Domain Unmanned Systems via Guiding Vector Fields. IEEE Trans. Control Syst. Technol. 2024, 32, 550–563.
  77. Sharif, S.; Zeadally, S.; Ejaz, W. Space-aerial-ground-sea integrated networks: Resource optimization and challenges in 6G. J. Netw. Comput. Appl. 2023, 215, 103647.
  78. Garofalo, G.; Giordano, A.; Piro, P.; Spezzano, G.; Vinci, A. A distributed real-time approach for mitigating CSO and flooding in urban drainage systems. J. Netw. Comput. Appl. 2017, 78, 30–42.
  79. Tian, W.; Fu, G.; Xin, K.; Zhang, Z.; Liao, Z. Improving the interpretability of deep reinforcement learning in urban drainage system operation. Water Res. 2024, 249, 120912.
  80. Zhang, M.; Xu, Z.; Wang, Y.; Zeng, S.; Dong, X. Evaluation of uncertain signals’ impact on deep reinforcement learning-based real-time control strategy of urban drainage systems. J. Environ. Manag. 2022, 324, 116448.
  81. Heo, S.; Nam, K.; Kim, S.; Yoo, C. XRL-FlexSBR: Multi-agent reinforcement learning-driven flexible SBR control with explainable performance guarantee under diverse influent conditions. J. Water Process Eng. 2024, 66, 105991.
  82. Huang, Z.; Wang, Y.; Dong, X. Dimensions of superiority: How deep reinforcement learning excels in urban drainage system real-time control. Water Res. X 2025, 28, 100313.
  83. Mendoza, E.; Andramuño, J.; Núñez, J.; Córdova, L. Intelligent multi-agent architecture for a supervisor of a water treatment plant. J. Phys. Conf. Ser. 2021, 2090, 012124.
  84. Jiménez, A.F.; Cárdenas, P.F.; Jiménez, F. Intelligent IoT-multiagent precision irrigation approach for improving water use efficiency in irrigation systems at farm and district scales. Comput. Electron. Agric. 2022, 192, 106635.
  85. Zaman, M.; Tantawy, A.; Abdelwahed, S. Optimizing Smart City Water Distribution Systems Using Deep Reinforcement Learning. In Proceedings of the 2023 IEEE 20th International Conference on Smart Communities: Improving Quality of Life using AI, Robotics and IoT (HONET), Boca Raton, FL, USA, 4–6 December 2023; pp. 228–233.
  86. Candelieri, A.; Perego, R.; Archetti, F. Bayesian optimization of pump operations in water distribution systems. J. Glob. Optim. 2018, 71, 213–235.
  87. Xu, J.; Wang, H.; Rao, J.; Wang, J. Zone scheduling optimization of pumps in water distribution networks with deep reinforcement learning and knowledge-assisted learning. Soft Comput. 2021, 25, 14757–14767.
  88. Donâncio, H.; Vercouter, L.; Roclawski, H. The Pump Scheduling Problem: A Real-World Scenario for Reinforcement Learning. arXiv 2022, arXiv:2210.11111.
  89. Hung, F.; Yang, Y.E. Assessing adaptive irrigation impacts on water scarcity in nonstationary environments—A multi-agent reinforcement learning approach. Water Resour. Res. 2021, 57, e2020WR029262.
  90. Wang, D.; Li, A.; Yuan, Y.; Zhang, T.; Yu, L.; Tan, C. Energy-saving scheduling for multiple water intake pumping stations in water treatment plants based on personalized federated deep reinforcement learning. Environ. Sci. Water Res. Technol. 2025, 11, 1260–1270.
  91. Quan, Y.; Xi, L. Smart generation system: A decentralized multi-agent control architecture based on improved consensus algorithm for generation command dispatch of sustainable energy systems. Appl. Energy 2024, 365, 123209.
  92. Hu, S.; Gao, J.; Zhong, D.; Wu, R.; Liu, L. Real-Time Scheduling of Pumps in Water Distribution Systems Based on Exploration-Enhanced Deep Reinforcement Learning. Systems 2023, 11, 56.
  93. Luo, W.; Wang, C.; Zhang, Y.; Zhao, J.; Huang, Z.; Wang, J.; Zhang, C. A deep reinforcement learning approach for joint scheduling of cascade reservoir system. J. Hydrol. 2025, 651, 132515.
  94. Jiang, W.; Liu, Y.; Fang, G.; Ding, Z. Research on short-term optimal scheduling of hydro-wind-solar multi-energy power system based on deep reinforcement learning. J. Clean. Prod. 2022, 385, 135704.
  95. Wu, R.; Wang, R.; Hao, J.; Wu, Q.; Wang, P. Multiobjective multihydropower reservoir operation optimization with transformer-based deep reinforcement learning. J. Hydrol. 2024, 632, 130904.
  96. Mitjana, F.; Denault, M.; Demeester, K. Managing chance-constrained hydropower with reinforcement learning and backoffs. Adv. Water Resour. 2022, 169, 104308.
  97. Riemer-Sørensen, S.; Rosenlund, G.H. Deep Reinforcement Learning for Long Term Hydropower Production Scheduling. In Proceedings of the 2020 International Conference on Smart Energy Systems and Technologies (SEST), Istanbul, Turkey, 7–9 September 2020; pp. 1–6.
  98. Hu, S.; Gao, J.; Zhong, D. Multi-agent reinforcement learning framework for real-time scheduling of pump and valve in water distribution networks. Water Supply 2023, 23, 2833–2846.
  99. Zhang, X.; Wang, Q.; Yu, J.; Sun, Q.; Hu, H.; Liu, X. A multi-agent deep-reinforcement-learning-based strategy for safe distributed energy resource scheduling in energy hubs. Electronics 2023, 12, 4763.
  100. Wu, X.; Gao, Z.; Yuan, S.; Hu, Q.; Dang, Z. A dynamic task allocation algorithm for heterogeneous UUV swarms. Sensors 2022, 22, 2122.
  101. Nguyen, T.; La, H.M.; Le, T.D.; Jafari, M. Formation Control and Obstacle Avoidance of Multiple Rectangular Agents with Limited Communication Ranges. IEEE Trans. Control Netw. Syst. 2017, 4, 680–691.
  102. Lin, C.; Han, G.; Zhang, T.; Shah, S.B.H.; Peng, Y. Smart Underwater Pollution Detection Based on Graph-Based Multi-Agent Reinforcement Learning Towards AUV-Based Network ITS. IEEE Trans. Intell. Transp. Syst. 2023, 24, 7494–7505.
  103. Zhang, C.; Cheng, P.; Du, B.; Dong, B.; Zhang, W. AUV path tracking with real-time obstacle avoidance via reinforcement learning under adaptive constraints. Ocean Eng. 2022, 256, 111453.
  104. Guo, Y.; Liu, H.; Fan, X.; Lyu, W. Research progress of path planning methods for autonomous underwater vehicle. Math. Probl. Eng. 2021, 2021, 8847863.
  105. Chu, Z.; Wang, F.; Lei, T.; Luo, C. Path planning based on deep reinforcement learning for autonomous underwater vehicles under ocean current disturbance. IEEE Trans. Intell. Veh. 2022, 8, 108–120.
  106. Tao, M.; Li, Q.; Yu, J. Multi-Objective Dynamic Path Planning with Multi-Agent Deep Reinforcement Learning. J. Mar. Sci. Eng. 2024, 13, 20.
  107. Khatri, P.; Gupta, K.K.; Gupta, R.K. Raspberry Pi-based smart sensing platform for drinking-water quality monitoring system: A Python framework approach. Drink. Water Eng. Sci. 2019, 12, 31–37.
  108. Fonseca, J.; Bhat, S.; Lock, M.; Stenius, I.; Johansson, K.H. Adaptive sampling of algal blooms using autonomous underwater vehicle and satellite imagery: Experimental validation in the Baltic Sea. arXiv 2023, arXiv:2305.00774.
  109. Mancy, H.; Ghannam, N.E.; Abozeid, A.; Taloba, A.I. Decentralized multi-agent federated and reinforcement learning for smart water management and disaster response. Alex. Eng. J. 2025, 126, 8–29.
  110. Ren, J.; Zhu, Q.; Wang, C. Edge Computing for Water Quality Monitoring Systems. Mob. Inf. Syst. 2022, 2022, 5056606.
  111. Finn, C.; Abbeel, P.; Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. Int. Conf. Mach. Learn. 2017, 70, 1126–1135.
  112. Yang, J.; Liu, J.; Qiu, G.; Liu, J.; Jawad, S.; Zhang, S. A spatio-temporality-enabled parallel multi-agent-based real-time dynamic dispatch for hydro-PV-PHS integrated power system. Energy 2023, 278, 127915.
  113. Li, Y.; Wu, B.; Feng, Y.; Fan, Y.; Jiang, Y.; Li, Z.; Xia, S.T. Semi-supervised robust training with generalized perturbed neighborhood. Pattern Recognit. 2022, 124, 108472.
  114. Yang, Z.; Jin, H.; Tang, Y.; Fan, G. Risk-Aware Constrained Reinforcement Learning with Non-Stationary Policies. In Proceedings of the 23rd International Conference on Autonomous Agents and Multiagent Systems, Auckland, New Zealand, 6–10 May 2024; pp. 2029–2037.
  115. Heo, S.; Oh, T.; Woo, T.; Kim, S.; Choi, Y.; Park, M.; Kim, J.; Yoo, C. Real-scale demonstration of digital twins-based aeration control policy optimization in partial nitritation/Anammox process: Policy iterative dynamic programming approach. Desalination 2025, 593, 118235.
  116. Cavalieri, S.; Gambadoro, S. Digital twin of a water supply system using the asset administration shell. Sensors 2024, 24, 1360.
  117. Yang, J.; Xi, M.; Wen, J.; Li, Y.; Song, H.H. A digital twins enabled underwater intelligent internet vehicle path planning system via reinforcement learning and edge computing. Digit. Commun. Netw. 2024, 10, 282–291.
  118. Zhang, Z.; Tian, W.; Liao, Z. Towards coordinated and robust real-time control: A decentralized approach for combined sewer overflow and urban flooding reduction based on multi-agent reinforcement learning. Water Res. 2023, 229, 119498.
  119. Li, M.; Zhang, X.; Guo, J.; Li, F. Cloud–edge collaborative inference with network pruning. Electronics 2023, 12, 3598.
  120. Qian, P.; Liu, G. Application of High-Frequency Intelligent Sensing Network in Monitoring and Early Warning of Water Quality Dynamic Change. Int. J. Comput. Intell. Syst. 2024, 17, 195.
  121. Ding, Y.; Sun, Q.; Lin, Y.; Ping, Q.; Peng, N.; Wang, L.; Li, Y. Application of artificial intelligence in (waste)water disinfection: Emphasizing the regulation of disinfection by-products formation and residues prediction. Water Res. 2024, 253, 121267.
  122. Jiang, J.; Lu, Z. Learning attentional communication for multi-agent cooperation. In Advances in Neural Information Processing Systems 31 (NeurIPS 2018); MIT Press: Cambridge, MA, USA, 2018; pp. 7265–7275.
Figure 1. Structure of multi-agent reinforcement learning.
Figure 2. Schematic illustration of three representative learning paradigms in multi-agent deep reinforcement learning. (a) Distributed training and distributed execution; (b) centralized training and centralized execution; (c) centralized training and distributed execution.
Figure 3. Schematic illustration of the training process in the value decomposition network (VDN) algorithm.
Figure 4. Logical structure and technical pathways of cross-domain collaborative control architecture.
Figure 5. Schematic diagram of a multi-agent communication system.
Figure 6. Formation control under joint communication–perception feedback in water environment systems.
Figure 7. Scenario illustration of target localization in complex water systems. (a) Supervised learning; (b) unsupervised learning; (c) semi-supervised learning.
Figure 8. Illustration of a DDPG-based obstacle avoidance strategy using parallax and channel state estimation.
Figure 9. Integrated control architecture combining centralized, decentralized, and default fallback strategies.
Figure 10. Real-time pump scheduling framework in a water distribution network.
Figure 11. Deep reinforcement learning framework for hydropower system control.
Figure 12. Schematic of the hierarchical MER-SAC architecture for AUV path planning.
Figure 13. Architecture of a deep reinforcement learning-based framework for multi-sensor data fusion.
Table 1. Multi-agent reinforcement learning-based modeling for water environment regulation.

| State Variable S | Action Variable A | Reward Function R | Control Objective |
| --- | --- | --- | --- |
| Water quality status | Chemical dosing rate | Penalty for threshold exceedance of pollutants | Control nutrient levels to prevent eutrophication |
| Time correlation parameters | Associated control operations | Penalty based on recovery time | Evaluate the system’s recovery efficiency after pollution events |
| Water level/flow indicators | Gate or valve scheduling control | Penalty for deviation from target water level | Maintain water levels within the target range |
| Operational status of devices | Regulatory actions of the agent | Penalty for energy consumption and operational cost | Optimize resource utilization to enhance economic efficiency |
Table 2. Comparative analysis of multi-agent learning paradigms.

| Learning Paradigm | Core Concept | Evaluation | Representative Algorithms |
| --- | --- | --- | --- |
| Independent learning | Ignores interactions among agents and learns individual policies independently. | Offers high scalability but is prone to non-stationarity due to the lack of inter-agent coordination. | DRUQN, DLCQN, DDRQN, WDDQN |
| Centralized learning | Treats all agents as a unified entity, enabling joint policy optimization. | Immune to environmental non-stationarity but prone to the curse of dimensionality. | |
| Value function decomposition | Decomposes the global value function to optimize the overall objective. | Solves credit assignment but struggles in non-stationary or complex environments. | QMIX, WQMIX, QPLEX, Qatten |
| Centralized value function | Trains a value network with global information to guide independent policy learning. | Reduces the high-dimensionality and non-stationarity problems. | MADDPG, MATRPO, MAPPO, MATD3 |
Table 3. Representative algorithms for cross-domain intelligent cooperative control in water environment systems.

| Technical Category | Algorithm Type | Technical Characteristics | Model Limitations |
| --- | --- | --- | --- |
| Task deployment | Heuristic algorithm and game-theoretic strategy | Construct goal allocation | Ignore intrinsic correlation between complex tasks |
| Task deployment | Self-organizing mapping algorithm | Adaptive capability of edge intelligence | Lack of coordination in task execution |
| Information exchange | Local topology control by autonomous learning | Flexible topology configuration | Difficult to ensure network connectivity |
| Information exchange | Graph-theoretic or Q-learning-based | Capability-aware routing | Difficult to balance transmission rate |
| Path planning | Robust optimization or state estimation | Multi-sensor fusion | Data heterogeneity across cross-domain platforms |
| Path planning | Federated learning or centralized learning | Edge computing algorithms | High computational cost and large parameter space |
| Cooperative control | Centralized or distributed architecture | Architecture-based coordination | Communication overhead and latency |
| Cooperative control | Federated learning algorithm | Centralized/unified cross-domain training | Data heterogeneity and high control cost |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
