Model-Based Reinforcement Learning for Chemical Dosing Optimization in a Municipal Wastewater Treatment Plant: A Comparative Study of Three Actor–Critic Algorithms

Zhang, Yuchen; Meng, Deyu; Ma, Weichao

doi:10.3390/pr14111800


by Yuchen Zhang ¹, Deyu Meng ² and Weichao Ma ³,⁴,*

¹ School of Chemistry, Chemical Engineering and Materials, Heilongjiang University, Harbin 150080, China

² School of Information and Communication Engineering, Beijing Information Science and Technology University, Changping, Beijing 102206, China

³ College of Civil Engineering, Heilongjiang University, Harbin 150080, China

⁴ Engineering Research Center of Rural Water Safety of Heilongjiang, Heilongjiang University, Harbin 150080, China

* Author to whom correspondence should be addressed.

Processes 2026, 14(11), 1800; https://doi.org/10.3390/pr14111800

Submission received: 12 April 2026 / Revised: 28 May 2026 / Accepted: 28 May 2026 / Published: 31 May 2026

(This article belongs to the Special Issue Advanced Control, Removal, and Resource Recovery Processes in Wastewater Systems)


Abstract

Chemical dosing control in wastewater treatment plants (WWTPs) must balance effluent compliance against operating cost, which makes it a typical multi-objective sequential decision problem. Because online trial-and-error in full-scale plants carries significant operational and compliance risks, this study introduces a model-based reinforcement learning (MBRL) framework for dosing optimization. Using 227 historical operation records, we construct a multilayer perceptron (MLP) virtual WWTP that maps 15 process states and 2 dosing actions to 6 effluent indicators, providing a safe training environment for policy development. A composite reward function combines penalties for effluent quality, constraints for abnormal conditions, and chemical-cost terms. Three actor–critic algorithms, Soft Actor–Critic (SAC), Twin Delayed Deep Deterministic Policy Gradient (TD3), and Proximal Policy Optimization (PPO), are trained for 50,000 steps and evaluated over 50 test episodes against both a random baseline and a practically deployed proportional feedforward control baseline. All three reinforcement learning methods yield significantly higher rewards: SAC achieves a mean score of 70.23 (95% CI: [67.94, 72.52]), TD3 scores 70.33 (95% CI: [68.02, 72.64]), and PPO scores 68.78 (95% CI: [66.47, 71.09]), compared with 61.01 (95% CI: [58.23, 63.80]) for proportional control and 45.87 (95% CI: [41.29, 50.45]) for random dosing. The two off-policy agents (SAC and TD3) show no statistically significant difference in control performance (paired Wilcoxon, p = 0.0631), and both converge towards economical, low-dosage strategies. This study supports the feasibility of data-driven virtual commissioning for WWTP dosing optimization and the development of hybrid intelligent control systems that incorporate mechanistic constraints.
In addition to algorithm comparisons, this framework can be further refined to serve as a practical, data-driven decision support tool for wastewater treatment plant (WWTP) operators. It enables them to formulate daily chemical dosing plans while adhering to compliance and cost constraints.

Keywords: wastewater treatment; chemical dosing; model-based reinforcement learning; soft actor–critic; process optimization; digital twin


1. Introduction

The operation of wastewater treatment plants (WWTPs) exhibits strong nonlinearity, tight coupling, and time-varying disturbances. Fluctuations in influent load, changes in sludge activity, and dosing actions interact with one another, making coagulant/flocculant control a typical multi-objective decision problem. In engineering practice, dosing still heavily relies on operator experience, which can lead to two common deviations: conservative over-dosing, which increases chemical consumption and sludge production, and under-dosing, which elevates effluent risk exposure. Therefore, intelligent control strategies are essential to jointly ensure compliance stability and economic operation.

Recent studies indicate that reinforcement learning (RL) in wastewater control is progressing from simulation validation toward engineering deployment. Energy-saving and control potential have been demonstrated on BSM1/ASM benchmarks [1]; model-free deep RL and transfer learning have been applied to activated-sludge process control [2]; and data-driven A²O simulations have enabled near real-time control [3]. Related studies in the Journal of Water Process Engineering (JWPE) further show that multi-agent RL has been utilized for set-point optimization in full-scale plants, automatic model calibration, and offline adaptive controller tuning [4,5,6]. BO-enhanced RL has also been reported for self-adaptive and multi-objective control in wastewater treatment. Meanwhile, research on advanced process control and digital twins continues to emphasize the importance of a closed data-model-control loop [7,8,9,10,11,12,13]. In addition, recent studies published in the Journal of Environmental Sciences (JES) have reported full-scale cost-performance optimization, global microbial risk profiling of WWTP effluents, temporal microbial community assembly under different carbon-source conditions, inhibition pathways of filamentous sludge bulking, and wastewater spectral-recognition methods for source identification [14,15,16,17,18].

However, three key bottlenecks remain for continuous dosing control in real plants. First, many studies still focus on benchmark simulations or simplified scenarios, with insufficient work on real historical disturbances and directly deployable action spaces (continuous chemical dosing) [19]. Second, although the reward function is central to policy performance, a unified and transferable calibration framework for balancing compliance risk, chemical cost, and process safety is still lacking [2,19]. Third, despite advancements in safe reinforcement learning, the integration of chemical-dosing mechanism constraints, process operating boundaries, and policy learning into a deployable workflow remains a significant challenge [20,21,22].

While RL is well-suited for sequential decision-making under uncertainty, direct online exploration in full-scale WWTPs is unacceptable due to the potential for triggering effluent violations and process instability. To mitigate the risks associated with trial-and-error, this study employs a model-based reinforcement learning (MBRL) approach. Initially, historical data are utilized to train a virtual WWTP that simulates state-action-effluent responses, followed by the training and comparison of policies within this virtual environment [23,24,25,26].

In this context, the paper emphasizes continuous dosing control in actual WWTPs, a scenario that is more aligned with practical deployment. A virtual WWTP is constructed using real operational data, and various RL paradigms are compared under a unified framework. An evidence chain is established across three dimensions—performance, convergence behavior, and engineering interpretability—to address three critical questions: Is it usable? Why does it work? How can it be deployed? Consequently, the primary objective is to optimize dosing in engineering-oriented WWTPs, with algorithm comparison serving as a means to select robust and deployable control policies rather than as an end in itself.

From an innovation perspective, most prior RL studies for wastewater control focus on benchmark simulations or online trial-and-error, with limited work on safety-oriented continuous dosing optimization using full-scale operational data [2,3,4,19,27]. In contrast, this study establishes a real-data MBRL route in which an MLP virtual WWTP learns the state-action-effluent mapping (15 states + 2 dosing actions → 6 effluent indicators), enabling unified offline training and comparison of SAC, TD3, and PPO under realistic disturbances [23,24,26]. The framework further combines compliance-aware reward design with interpretable diagnostics (critic loss, explained variance, entropy trend, and FPS), and links training outcomes to a deployable low-dosage behavior. This integrated workflow (real-plant surrogate modeling + offline policy optimization + engineering interpretability) differs from purely simulation-based or direct online RL paradigms and provides a practical pre-deployment pathway for WWTP control [7,28,29]. An overview of the work is shown in Figure 1.

This study establishes a complete chemical dosing optimization workflow with four main contributions. First, as a modeling contribution for practical scenarios, we use historical operational data from a municipal wastewater treatment plant (WWTP) in a northern city of China to develop a multilayer perceptron (MLP) virtual WWTP that maps states and actions to effluent water quality, providing a virtual commissioning environment for selecting dosing strategies. Second, as a control contribution under engineering constraints, we design a reward function that integrates compliance objectives, penalties for abnormal conditions, and penalties for chemical costs. A coefficient calibration logic is added to align policy objectives with process objectives, giving a reward mechanism that is grounded in engineering practice. Third, as an evaluation contribution, we systematically compare SAC, TD3, PPO, proportional control, and a random baseline in the same environment. Training diagnostics and dosing logs are used to verify policy convergence and to demonstrate interpretable low-dosage behavior under compliance constraints, which broadens the options for selecting dosing strategies in wastewater treatment plants. Fourth, from an engineering problem-solving perspective, this work provides a methodological blueprint that, after further refinement such as integration with uncertainty quantification and pilot-scale validation, can be operationalized as a decision-support system for formulating dosing plans.
Beyond the general advantage of model-based reinforcement learning (MBRL), namely a reduced risk of online trial-and-error, this fourth contribution directly addresses the practical need of plant operators to generate compliant and cost-effective dosing schedules under real-world disturbances. To better articulate the novelty of this work, Table 1 systematically compares the proposed framework with representative reinforcement learning (RL)-based studies in wastewater control across four key dimensions: problem formulation, safety handling, evaluation paradigm, and statistical rigor.

This comparison highlights that prior studies frequently lack statistical credibility and safety-conscious reward design, whereas our work effectively bridges the gap between reinforcement learning research and engineering deployability.

2. Materials and Methods

2.1. Overall System Architecture

Conventional online reinforcement learning (RL) necessitates continuous trial-and-error processes on actual wastewater systems, which introduces significant compliance and operational risks. To mitigate these challenges, we adopt a closed-loop model-based reinforcement learning (MBRL) framework characterized by ‘data-driven surrogate modeling and RL policy optimization’. Initially, historical data are utilized to train a virtual wastewater treatment plant (WWTP). Subsequently, high-frequency exploration and policy iteration are conducted within this virtual environment. Finally, candidate control policies are evaluated and screened through offline assessments on independent test scenarios [23,24,25,26].

Formally, the control problem is structured as a Markov decision process (MDP), denoted as M = ⟨S, A, P, r, γ⟩. In this framework, the state space S comprises 15 process variables, while the action space A represents a two-dimensional continuous dosing decision involving polyacrylamide (PAM) and polyferric sulfate. The transition function P is approximated using the multi-layer perceptron (MLP) virtual WWTP, and the reward r is defined jointly by effluent-quality constraints, operating-boundary constraints, and dosing-intensity constraints. A key advantage of this architecture is that it allows for the completion of tens of thousands of policy-learning steps without exposing the actual plant to effluent risks, thereby enhancing the efficiency and deployability of controller development.

2.2. Data Source and Preprocessing

This study focuses on a wastewater treatment plant (WWTP) located in northern China. Data were collected over a period from 1 January 2025 to 14 August 2025. The raw dataset comprises 58 operational indicators, which encompass influent and effluent quality, flow rates, chemical dosing, sludge activity, and various process variables. From this comprehensive dataset, 27 representative indicators were selected for statistical analysis. Following data consolidation, the modeling dataset was established, containing 227 historical operational records.

The state vector consists of 15 variables: influent flow, effluent flow, influent chemical oxygen demand (COD), influent ammonia nitrogen (NH3-N), influent total nitrogen (TN), influent total phosphorus (TP), influent pH, influent suspended solids (SS), mixed liquor suspended solids (MLSS), 30-min settled sludge volume (SV30), sludge production, specific energy consumption (kWh/m³), influent COD/TN ratio, sludge volume index (SVI), and dissolved oxygen (DO). The action vector includes two controllable dosing variables: polyacrylamide (PAM) and polyferric sulfate. The output vector comprises six effluent indicators: COD, biochemical oxygen demand (BOD), NH3-N, TN, TP, and SS.

Due to significant differences in scale and distribution among the variables (e.g., flow and TP vary by multiple orders of magnitude), separate standardizers were developed for states, actions, and outputs during the preprocessing phase. For any variable x, Z-score normalization was applied.

z = \frac{x - \mu_{x}}{\sigma_{x}}

(1)

Variables are mapped to zero mean and unit variance before being fed into the model. This process reduces feature bias during gradient updates and enhances training stability and convergence efficiency.
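The normalization in Equation (1) can be illustrated with a minimal NumPy sketch. The `Standardizer` class, the synthetic arrays, and the guard for constant columns are illustrative assumptions, not the authors' implementation; only the use of separate scalers for states, actions, and outputs follows the text.

```python
import numpy as np

class Standardizer:
    """Column-wise Z-score scaler: z = (x - mean) / std (hypothetical helper)."""

    def fit(self, x):
        x = np.asarray(x, dtype=float)
        self.mean_ = x.mean(axis=0)
        std = x.std(axis=0)
        # Assumption: constant columns are left unscaled to avoid division by zero.
        self.std_ = np.where(std > 0, std, 1.0)
        return self

    def transform(self, x):
        return (np.asarray(x, dtype=float) - self.mean_) / self.std_

    def inverse_transform(self, z):
        return np.asarray(z, dtype=float) * self.std_ + self.mean_

# Synthetic stand-ins with the shapes described in the text (227 records).
rng = np.random.default_rng(0)
states = rng.normal(loc=50_000.0, scale=10_000.0, size=(227, 15))
actions = rng.uniform(0.0, 7.5, size=(227, 2))
outputs = rng.normal(loc=0.3, scale=0.1, size=(227, 6))

# One scaler per group, so that large-scale flows do not dominate small-scale TP.
state_scaler = Standardizer().fit(states)
action_scaler = Standardizer().fit(actions)
output_scaler = Standardizer().fit(outputs)

z_states = state_scaler.transform(states)
restored = state_scaler.inverse_transform(z_states)
```

Keeping an inverse transform for the output scaler lets predicted effluent values be converted back to physical units before any threshold in the reward is evaluated.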

The reduction from 58 raw indicators to 27 representative indicators adheres to three principles: regulatory priority, mechanism-chain completeness, and controllability of decision variables.

First, core compliance indicators in municipal wastewater treatment plant (WWTP) regulation, such as effluent COD, BOD, NH3-N, TN, TP, and SS, directly represent compliance risk and are therefore retained as key outputs [30].

Second, the influent quantity/quality load, sludge activity states (MLSS, SV30, SVI, DO), and operating variables (PAM, polyferric sulfate) collectively form a disturbance–state–control–response loop, a commonly adopted variable organization pattern in activated sludge modeling and control research [2,3,19,31].

Feature screening also balances statistical robustness and engineering interpretability. Recent data-driven membrane treatment studies indicate that feedwater descriptors (e.g., COD, TN, TP, and pH) are crucial for predicting separation performance and optimization outcomes [32]. Additionally, ANN-based optimization of electrocoagulation systems demonstrates that controllable operating conditions can significantly influence treatment efficiency, thereby supporting the explicit inclusion of actionable variables in model inputs [33].

Related wastewater process studies similarly show a strong sensitivity of responses to combinations of operating conditions, reinforcing the necessity to retain mechanism-related state variables. For phosphorus-oriented dosing decisions in WWTPs, iron-salt coagulation remains closely associated with phosphorus removal and solid–liquid separation [33,34].

Given the sample size of 227 records, directly utilizing all 58 input dimensions would increase redundancy and the risk of overfitting. Therefore, mechanism-constrained representative dimensionality reduction (from 58 to 27) is adopted to enhance model generalization and interpretability [35].

From a process-mechanism perspective, the 15-dimensional state variables encompass the critical chain of load disturbance, biochemical state, settling performance, and dosing prerequisites. Specifically, MLSS, SV30, and SVI represent sludge concentration and settling characteristics; DO directly influences nitrification activity and ammonia conversion; influent COD/TN reflects the sufficiency of carbon sources for denitrification; and pH and SS affect the efficiency of chemical coagulation and flocculation. This organization of variables facilitates the explicit incorporation of environmental engineering mechanism constraints into the data-driven control model. Furthermore, comprehensive data integrity and temporal granularity analyses have been conducted on operational records. After meticulously aligning 35 pages of records across seven field groups, 227 valid observations were identified. The statistical period spans from 1 January 2025 to 14 August 2025 (226 days), with predominantly daily sampling (approximately once per day) and a record coverage rate of 98.2%.

This indicates strong overall data continuity, with only a limited number of missing or filtered-out samples. Regarding operating scale, the complete records indicate that influent flow ranges from 0 to 110,282 m³/d (mean: 77,706.68 m³/d), while effluent flow ranges from 46,443 to 112,359 m³/d (mean: 76,224.02 m³/d). Based on the average treatment volume, the target facility can be classified as a medium-scale municipal wastewater treatment plant (WWTP) with a capacity of approximately 70,000–80,000 m³/d. To characterize sample distributions, Table 2 presents descriptive statistics for the 27 indicators. We selected the most representative inflow and outflow indicators, as well as sludge indicators, which are presented in Figure 2.

Separate standardizers are applied to states, actions, and outputs during preprocessing; the virtual WWTP uses an 80/20 train–test split.

Feature Selection Validation for Activated Sludge Process

To ensure the generalizability and physical interpretability of the virtual WWTP, the input features were validated through dual criteria: process mechanism priors and data-driven importance quantification.

From the perspective of process mechanisms, the selected 15 state variables strictly adhere to the kinetic constraints of the activated sludge system. MLSS represents the total microbial biomass, directly determining the degradation rate of organic substrates. SV30 and SVI collectively characterize sludge settleability and flocculation status, serving as key indicators of solid–liquid separation efficiency. DO concentration governs the nitrification rate, directly affecting ammonia nitrogen conversion. Influent COD, NH₃-N, TN, and TP represent pollution loads, determining the required biochemical reaction capacity and chemical dosing benchmarks. The influent COD/TN ratio reflects carbon source adequacy for denitrification, constituting a stoichiometric constraint for total nitrogen removal. pH and SS significantly influence chemical coagulation efficiency and microbial activity. Collectively, these features form a closed-loop structure of “disturbance–state–control–response,” consistent with conventional variable organization practices in activated sludge modeling.

For data-driven validation, permutation importance was employed to quantify each feature’s contribution to model output, calculated as the increase in prediction mean squared error after randomly shuffling the feature’s values. Testing on all 17 input features (15 state variables plus 2 action variables) revealed that all features yielded strictly positive importance scores, ruling out the presence of redundant variables. The top five ranked features were SVI (2.17 × 10⁻¹), MLSS (2.04 × 10⁻¹), influent SS (1.81 × 10⁻¹), influent TP (1.59 × 10⁻¹), and polyferric sulfate dosage (1.50 × 10⁻¹), encompassing sludge settleability, biomass, influent loading, and core control actions, which aligns closely with the operational logic of the activated sludge process. This quantitative result corroborates the process prior knowledge, confirming the statistical robustness and engineering interpretability of the feature selection.
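The permutation-importance calculation described above can be sketched as follows. This is a generic NumPy version under stated assumptions: the function name, the number of repeats, and the toy linear predictor are hypothetical, and the paper does not state which implementation was used.

```python
import numpy as np

def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean increase in MSE after shuffling each feature column in turn."""
    rng = np.random.default_rng(seed)
    base_mse = np.mean((predict(X) - y) ** 2)
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        increases = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Shuffling breaks the link between feature j and the target.
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            increases.append(np.mean((predict(X_perm) - y) ** 2) - base_mse)
        importances[j] = np.mean(increases)
    return importances

# Toy check: the target depends strongly on feature 0, weakly on feature 1,
# and not at all on feature 2.
rng = np.random.default_rng(1)
X = rng.normal(size=(227, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1]

def toy_predict(X):
    # Hypothetical fitted model that has recovered the true relationship.
    return 3.0 * X[:, 0] + 0.5 * X[:, 1]

imp = permutation_importance(toy_predict, X, y)
ranking = np.argsort(imp)[::-1]  # most important feature first
```

In the toy case the unused feature receives an importance of exactly zero, which is the situation the strictly positive scores reported above rule out for the 17 inputs.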

2.3. Wastewater Process Background (Activated Sludge)

The plant adopts an integrated process route comprising activated-sludge biochemical treatment, secondary sedimentation with sludge-water separation and return, and chemically enhanced phosphorus removal through coagulation aids. This route is centered on the microbial floc degradation of organic pollutants and relies on nitrification and denitrification for nitrogen removal. Additionally, it enhances phosphorus removal and solid–liquid separation stability through the dosing of polyferric sulfate and PAM [19,31,34]. From a control perspective, the influent quality (e.g., COD, TN, TP, pH), activated sludge states (e.g., MLSS, SV30, SVI, DO), and chemical dosing (PAM, polyferric sulfate) collectively determine effluent fluctuations.

This aligns with the RL state/action/reward design in this study: state variables characterize influent and biochemical conditions, action variables correspond to dosing decisions, and the reward function constrains policy learning through both effluent quality and dosing cost [2,4,19]. The activated-sludge process configuration and unit parameters can be summarized as a continuous chain of pretreatment, biochemical reaction, secondary clarifier, sludge return, and chemically enhanced phosphorus removal through coagulation aids. The biochemical section encompasses aerobic nitrification and anoxic denitrification functions, as evidenced by the observed total nitrogen removal behavior. Key parameters supported by operational data include an influent flow of 51,811–110,282 m³/d (mean 77,706.68 m³/d), MLSS of 1908–4359 mg/L (mean 2947.47 mg/L), SV30 of 28–49%, SVI of 92–265 mL/g, DO of 0.58–21.79 mg/L (mean 5.32 mg/L), PAM dosage of 0–0.3 kg/d, and polyferric sulfate dosage of 0–7.5 t/d. Effluent NH3-N remains at 0.13–4.52 mg/L (mean 0.83 mg/L), indicating that nitrification is generally maintained.

From the perspective of the chemical-dosing mechanism, the efficiency of iron/aluminum salt-assisted phosphorus removal is significantly influenced by pH, the metal/phosphorus molar ratio, the degree of sludge aging, and the return-flow pathway. Elevated Fe/P ratios or advanced sludge aging can hinder phosphorus release and recoverability. Moreover, prolonged simultaneous dosing of iron salts may suppress the activity of phosphorus-accumulating organisms (PAOs), thereby diminishing the biological contribution to phosphorus removal [20,22,34,36]. Consequently, engineering controls should regard the ‘chemical dosage upper bound–pH window–iron load in return sludge’ as stringent constraints to prevent imbalances between short-term compliance and long-term biochemical performance.

2.4. Virtual WWTP Model

The virtual wastewater treatment plant (WWTP) employs a multilayer perceptron (MLP) architecture comprising two hidden layers, each containing 256 neurons with ReLU activation functions, followed by a linear output layer. The input consists of 17 features (15 states and 2 actions), and the output comprises 6 values. Training uses the Adam optimizer with a learning rate of 0.001, a batch size of 64, and 50 epochs. The wastewater process flow and dosing control points are illustrated in Figure 3, and the virtual WWTP architecture is shown in Figure 4.

The mechanistic interpretation of the MLP illustrates that it approximates complex processes through hierarchical combinations of linear mappings and nonlinear activations. The lth layer can be represented mathematically as

h^{(l)} = \phi (W^{(l)} h^{(l - 1)} + b^{(l)})

(2)

where h(0) = x is the input vector (process states and dosing actions), and φ(·) is the ReLU activation. After multi-layer feature transformation, the output layer jointly predicts six effluent indicators:

\hat{y} = W^{(o)} h^{(L)} + b^{(o)}

(3)

In wastewater dosing scenarios, the first hidden layer is capable of learning local couplings among influent load, biochemical states, and dosing actions. The second hidden layer further integrates higher-order interaction features to capture nonlinear mechanisms, such as the joint influence of the COD/TN ratio and DO on denitrification efficiency, as well as the synergistic effect of coagulants and PAM on TP/SS removal. Model parameters are updated end-to-end by minimizing the mean squared error:

L_{\mathrm{MSE}} = \frac{1}{N} \sum_{i = 1}^{N} \| y_{i} - \hat{y}_{i} \|_{2}^{2}

(4)
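Equations (2) to (4) can be traced in a short NumPy sketch of the 17 → 256 → 256 → 6 forward pass. The initialization scheme, the helper names, and the random inputs are assumptions made for illustration; training with Adam is omitted, and the paper does not state which deep learning library was used.

```python
import numpy as np

def init_mlp(sizes, seed=0):
    """Create (W, b) pairs for consecutive layer sizes (He-style init, an assumption)."""
    rng = np.random.default_rng(seed)
    params = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        W = rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))
        b = np.zeros(n_out)
        params.append((W, b))
    return params

def forward(params, x):
    """Row-vector form of Eq. (2) for hidden layers and Eq. (3) for the output."""
    h = x
    for W, b in params[:-1]:
        h = np.maximum(0.0, h @ W + b)   # ReLU activation
    W_out, b_out = params[-1]
    return h @ W_out + b_out             # linear output layer

def mse_loss(y_true, y_pred):
    """Eq. (4): mean over samples of the squared L2 norm of the error."""
    return np.mean(np.sum((y_true - y_pred) ** 2, axis=1))

params = init_mlp([17, 256, 256, 6])
x = np.random.default_rng(1).normal(size=(64, 17))   # one batch of 64 inputs
y_hat = forward(params, x)                            # predicted effluent, shape (64, 6)
n_params = sum(W.size + b.size for W, b in params)    # trainable weights and biases
```

With these layer sizes the network has 71,942 trainable parameters, which is useful context when judging capacity against the 227 available records.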

In this study, the MLP virtual WWTP functions as a differentiable approximator for the relationship between state-action and effluent response. This allows the agent to quickly evaluate the potential effects of various continuous dosing strategies within a virtual environment, thereby minimizing the risk associated with online trial-and-error in the actual plant. During the reinforcement learning (RL) interaction, at time t, the agent outputs action u_t based on the current state s_t, and the virtual WWTP generates the predicted effluent.

\hat{y}_{t} = g_{\theta} (s_{t}, u_{t})

(5)

The immediate reward is then calculated as R_t = R(s_t, u_t, ŷ_t). This “predict–score–update” loop enables end-to-end policy optimization around engineering objectives.

The MLP architecture was selected over alternative models (e.g., XGBoost, Random Forest, LSTM) for three primary reasons. First, the MLP offers differentiable function approximation, which enables end-to-end policy optimization through backpropagation within the virtual WWTP, an essential requirement for model-based reinforcement learning that tree-based models cannot fulfill [23,25,26]. Second, the MLP provides smoother predictions in continuous action spaces, thereby mitigating the extrapolation discontinuities that tree-based models exhibit in previously unseen action regions, which could otherwise result in brittle policy behavior. Third, given the limited dataset of only 227 samples, the relatively low parameter count of the proposed MLP (comprising two hidden layers of 256 neurons each) effectively balances representational capacity with the risk of overfitting; more complex architectures (e.g., LSTM) would necessitate significantly larger datasets to generalize reliably. Recent studies in water treatment optimization have similarly utilized MLP-based surrogate models due to their advantageous properties in continuous control tasks [7,9]. A systematic comparison with alternative surrogate models is planned for future work as data availability improves.
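The predict and score stages of this loop can be sketched as a single function. Everything named here is a hypothetical stand-in: `toy_surrogate` replaces the trained MLP g_θ, `toy_reward` replaces the composite reward, and the update stage is left to the chosen RL algorithm (SAC, TD3, or PPO). How the next state is generated is not specified in this section, so it is not modeled.

```python
import numpy as np

def predict_and_score(surrogate, reward_fn, state, action):
    """One interaction: predicted effluent from Eq. (5), then reward R(s, u, y_hat)."""
    x = np.concatenate([state, action])      # 15 states + 2 actions = 17 inputs
    y_hat = surrogate(x)                     # six predicted effluent indicators
    return y_hat, reward_fn(state, action, y_hat)

# Hypothetical stand-ins, used only to make the sketch executable.
rng = np.random.default_rng(0)
W = rng.normal(size=(17, 6))

def toy_surrogate(x):
    return x @ W

def toy_reward(state, action, y_hat):
    # Placeholder score: penalize effluent magnitude and total dosing.
    return 100.0 - float(np.sum(np.abs(y_hat))) - float(np.sum(action))

state = rng.normal(size=15)
candidates = [np.array([0.0, 0.0]), np.array([0.1, 2.0]), np.array([0.3, 7.5])]

# Score several candidate dosing actions without touching the real plant.
results = [predict_and_score(toy_surrogate, toy_reward, state, u) for u in candidates]
scores = [reward for _, reward in results]
best_action = candidates[int(np.argmax(scores))]
```

An RL agent replaces the fixed candidate list with actions proposed by its policy and uses the returned rewards to update that policy.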

The performance of the virtual wastewater treatment plant (WWTP) is quantified on the test set using metrics such as R², mean absolute error (MAE), and root mean square error (RMSE). During offline policy screening, conservative value-function and pessimistic Markov Decision Process (MDP) concepts can be further integrated to mitigate the overestimation risk associated with out-of-distribution actions [28,29].

To empirically justify the selection of the MLP as the surrogate environment, a 5-fold cross-validation comparison was conducted on the same dataset (N = 227) against three representative regression models: Random Forest (RF), Gradient Boosting Decision Tree (GBDT), and Linear Regression (LR). The MLP achieved an RMSE of 1.2534 and an MAE of 0.7089, compared to 1.1725 and 0.6615 for RF, 1.3143 and 0.7238 for GBDT, and 1.3109 and 0.7447 for LR. Although RF exhibited marginally lower prediction errors, the MLP is ultimately preferred for three reasons specific to model-based reinforcement learning. First, differentiability: the MLP supports gradient backpropagation through the dynamics model for direct policy optimization, whereas tree-based models are non-differentiable step functions. Second, extrapolation capacity: tree-based models cannot predict beyond the bounds of training targets, making them unsuitable for extreme disturbances outside historical operating ranges. Third, multi-output joint representation: the MLP’s hidden layers automatically learn coupled representations among COD, TN, TP, and SS, preserving physicochemical interdependencies. Furthermore, in the proposed MBRL framework, the virtual WWTP is not merely a prediction module but a differentiable simulator that enables policy gradients to propagate through the environment dynamics, a capability that tree-based models fundamentally lack regardless of their predictive accuracy. Accordingly, the MLP is adopted as the virtual WWTP.

Prediction Model Stability and Multi-Algorithm Benchmark Assessment

To validate the predictive robustness of the virtual WWTP, this study employs 5-fold cross-validation (5-fold CV) complemented by bootstrapping for uncertainty quantification. The MLP predictor achieves a mean R² of 0.3422 ± 0.2534 and an RMSE of 1.2534. Using 100 bootstrap resampling iterations, the 95% confidence interval (CI) for R² is [0.2217, 0.5390]. The lower bound of this CI is strictly greater than zero, statistically rejecting the null hypothesis that the model is equivalent to random guessing (R² = 0).
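A bootstrap confidence interval for R² can be computed as in the sketch below. It assumes a percentile bootstrap over paired observed and predicted values for a single output; the paper does not specify the resampling scheme or how the six outputs are aggregated, and the synthetic data are placeholders.

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def bootstrap_r2_ci(y_true, y_pred, n_boot=100, alpha=0.05, seed=0):
    """Percentile bootstrap CI for R² (resampling (y, y_hat) pairs with replacement)."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)          # sample indices with replacement
        stats.append(r2_score(y_true[idx], y_pred[idx]))
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(lo), float(hi)

# Synthetic example with 227 records and noisy predictions.
rng = np.random.default_rng(0)
y_true = rng.normal(size=227)
y_pred = y_true + rng.normal(scale=0.8, size=227)

point_estimate = r2_score(y_true, y_pred)
ci_low, ci_high = bootstrap_r2_ci(y_true, y_pred, n_boot=100)
```

With only 100 resamples the interval endpoints are themselves noisy, so a larger `n_boot` may be preferable when the lower bound is used for a significance statement.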

A noteworthy observation is the apparent divergence between R² and RMSE in specific folds. For Fold 3, R² registers a negative value (−0.0437), yet the absolute prediction error (RMSE = 1.2304) remains highly consistent with the other folds. According to the definition R² = 1 − SS_res/SS_tot, this phenomenon arises mechanistically because the test samples in this fold exhibit extremely stable distributions, leading to a very small total variance SS_tot (the denominator). Even though the absolute prediction error SS_res is correspondingly low, the computed R² becomes suppressed. This does not indicate model failure; rather, it confirms that the model maintains stable absolute prediction accuracy under smooth operating conditions.
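The effect can be reproduced with synthetic numbers. In the sketch below the same prediction errors are applied to a high-variance fold and to a low-variance fold; the fold size, means, and standard deviations are arbitrary assumptions chosen only to make the contrast visible.

```python
import numpy as np

def r2_and_rmse(y_true, y_pred):
    """Return (R², RMSE) using R² = 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return 1.0 - ss_res / ss_tot, rmse

rng = np.random.default_rng(0)
n = 45                                         # roughly one fold of 227 records
errors = rng.normal(0.0, 1.2, size=n)          # identical errors in both folds

y_variable = rng.normal(10.0, 3.0, size=n)     # fold with large target variance
y_stable = rng.normal(10.0, 0.3, size=n)       # fold with very stable targets

r2_variable, rmse_variable = r2_and_rmse(y_variable, y_variable + errors)
r2_stable, rmse_stable = r2_and_rmse(y_stable, y_stable + errors)
```

Both folds have the same RMSE by construction, yet R² is positive for the variable fold and negative for the stable one, because only the denominator SS_tot differs.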

Building on the above benchmark assessment, we further justify the choice of MLP as the world model from a control-theoretic perspective. First, the full differentiability of MLP supports gradient-based backpropagation, enabling direct policy network optimization, whereas tree-based models (e.g., Random Forest, GBDT) are non-differentiable step functions incompatible with gradient-based reinforcement learning algorithms. Second, MLP exhibits favorable extrapolation and continuous interpolation capabilities, as its predictions are not constrained by the extreme values of the training set, making it more suitable for handling out-of-distribution extreme conditions beyond historical boundaries. Third, the multi-output shared representation mechanism of MLP automatically captures the inherent coupling relationships among effluent indicators (COD, TN, TP, SS), whereas tree-based models require independent modeling for each target, thereby decoupling the physical interdependencies among variables. In summary, by balancing predictive accuracy with control-theoretic requirements, this study adopts MLP as the surrogate model for the virtual environment.

2.5. Reward Function Design

The reward function jointly considers effluent quality, process stability, and chemical dosing cost, and is formulated as:

R = R_{\mathrm{base}} + B - P

(6)

R_{\mathrm{base}} = 100 - 1.5\,\mathrm{COD}_{\mathrm{out}} - 2.0\,\mathrm{TN}_{\mathrm{out}} - 3.0\,\mathrm{TP}_{\mathrm{out}} - 1.0\,\mathrm{SS}_{\mathrm{out}} - 10\,u_{\mathrm{PAM}} - 5\,u_{\mathrm{PFS}}

(7)

In this context, u_PAM and u_PFS represent dosing actions, while B denotes the conditional bonus term and P signifies the conditional penalty term applicable under abnormal operating conditions. According to the guidelines established in the project documentation, boundary penalties and effectiveness bonuses are clearly articulated through the use of indicator functions:

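P = P_{\mathrm{SV30}} + P_{\mathrm{DO}} + P_{\mathrm{C/TN}}

P_{\mathrm{SV30}} = 10\,I(\mathrm{SV30} > 45) + 15\,I(\mathrm{SV30} > 45 \text{ and } (\mathrm{SS}_{\mathrm{out}} > 10 \text{ or } \mathrm{COD}_{\mathrm{out}} > 30))

P_{\mathrm{DO}} = 20\,I(\mathrm{DO} < 0.2) + 25\,I(\mathrm{DO} < 0.2 \text{ and } (\mathrm{COD}_{\mathrm{out}} > 30 \text{ or } \mathrm{NH_3\text{-}N}_{\mathrm{out}} > 1.0))

P_{\mathrm{C/TN}} = 10\,I(\mathrm{COD/TN} < 4) + 15\,I(\mathrm{COD/TN} < 4 \text{ and } \mathrm{TN}_{\mathrm{out}} > 12)

(8)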

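B = B_{\mathrm{SV30}} + B_{\mathrm{PFS}} + B_{\mathrm{PAM}}

B_{\mathrm{SV30}} = 5\,I(25 \leq \mathrm{SV30} \leq 45)

B_{\mathrm{PFS}} = 10\,I(u_{\mathrm{PFS}} > 0 \text{ and } \mathrm{TP}_{\mathrm{out}} < 0.25)

B_{\mathrm{PAM}} = 10\,I(u_{\mathrm{PAM}} > 0 \text{ and } \mathrm{SS}_{\mathrm{out}} < 5)

(9)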

The indicator function I(·) equals 1 when the specified condition is satisfied and equals 0 otherwise.
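Equations (6)–(9) translate almost line by line into code. The sketch below is an illustrative transcription, not the authors' implementation: the function and argument names are hypothetical, the inputs are assumed to be in whatever units and scaling the equations use (the text does not specify any normalization), and `cod_tn_ratio` is assumed to be the influent COD/TN ratio.

```python
def indicator(condition):
    """I(.) from the text: 1 if the condition holds, 0 otherwise."""
    return 1.0 if condition else 0.0

def dosing_reward(cod_out, tn_out, tp_out, ss_out, nh3n_out,
                  u_pam, u_pfs, sv30, do, cod_tn_ratio):
    # Eq. (7): linear effluent-quality and dosing-cost terms
    r_base = (100.0 - 1.5 * cod_out - 2.0 * tn_out - 3.0 * tp_out
              - 1.0 * ss_out - 10.0 * u_pam - 5.0 * u_pfs)

    # Eq. (8): stepwise penalties for abnormal operating conditions
    p_sv30 = (10.0 * indicator(sv30 > 45)
              + 15.0 * indicator(sv30 > 45 and (ss_out > 10 or cod_out > 30)))
    p_do = (20.0 * indicator(do < 0.2)
            + 25.0 * indicator(do < 0.2 and (cod_out > 30 or nh3n_out > 1.0)))
    p_ctn = (10.0 * indicator(cod_tn_ratio < 4)
             + 15.0 * indicator(cod_tn_ratio < 4 and tn_out > 12))
    penalty = p_sv30 + p_do + p_ctn

    # Eq. (9): bonuses granted only for effective dosing and healthy settling
    bonus = (5.0 * indicator(25 <= sv30 <= 45)
             + 10.0 * indicator(u_pfs > 0 and tp_out < 0.25)
             + 10.0 * indicator(u_pam > 0 and ss_out < 5))

    # Eq. (6)
    return r_base + bonus - penalty

# Hypothetical operating point (illustrative values only)
normal = dosing_reward(cod_out=20, tn_out=10, tp_out=0.2, ss_out=4, nh3n_out=0.5,
                       u_pam=0.1, u_pfs=0.1, sv30=30, do=2.0, cod_tn_ratio=6)
low_do = dosing_reward(cod_out=20, tn_out=10, tp_out=0.2, ss_out=4, nh3n_out=0.5,
                       u_pam=0.1, u_pfs=0.1, sv30=30, do=0.1, cod_tn_ratio=6)
print(normal, low_do)
```

For this hypothetical point, dropping DO below 0.2 while effluent COD and NH₃-N stay in range costs exactly the 20-point first-tier penalty; the second tier only applies once effluent quality also deteriorates.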

These rules encapsulate a control philosophy that prioritizes stability, suppresses boundary risks, and advocates for minimum dosing. Stepwise penalties are imposed when the system approaches unsafe operating boundaries, thereby preventing the policy from pursuing short-term rewards at the expense of process safety. Positive feedback is granted only when dosing results in verifiable removal of phosphorus or suspended solids (SS), which helps mitigate the risk of a degenerate ‘near-zero dosing’ policy. This design aligns with findings in simultaneous chemical phosphorus removal, wherein the optimization of iron-salt dosing must balance compliance benefits against the potential inhibition of biological phosphorus removal [20,22,36].

Reward coefficients are determined through a three-step procedure: regulation priority, environmental impact weighting, and economic cost calibration. First, according to the Grade-1A daily limits specified in the Chinese municipal wastewater discharge standard (GB 18918-2002) [30], the limits for COD, TN, TP, and SS are 50, 15, 0.5, and 10 mg/L, respectively; indicators with stricter limits are assigned higher penalty weights. Second, based on the BSM1 pollution-equivalent weights for effluent assessment (B_COD = 1, B_SS = 2, B_NO = 10, B_NKj = 30), nitrogen-related pollutants are weighted more heavily than organic matter and suspended solids [31]. Considering the phosphorus removal objective of this study, the TP term is assigned the highest water-quality penalty coefficient. Third, dosing terms are penalized linearly in relation to unit chemical costs and the risks associated with overdosing; virtual environment calibration is then employed to ensure that water-quality and chemical-cost penalties are on the same order of magnitude. The final coefficients are k_COD = 1.5, k_TN = 2.0, k_TP = 3.0, k_SS = 1.0, k_PAM = 10, and k_PFS = 5. Additionally, the coefficient ranking is quantitatively supported by a statistical compliance margin. The relative margin of the i-th effluent indicator is defined as follows:

M_i = \frac{C_i^{\mathrm{lim}} - \mu_i}{\sigma_i}

(10)

where C_i^lim represents the Grade-1A limit, and µ_i and σ_i denote the sample mean and standard deviation of the i-th indicator, respectively. Based on the data in Table 2, the estimated M_i values for COD, TN, TP, and SS are approximately 8.37, 4.21, 4.17, and 7.97. The narrower margins for TN and TP suggest that these parameters are more likely to approach regulatory thresholds under routine disturbances; consequently, a risk-priority ordering is established: k_TP > k_TN > k_COD > k_SS.
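The margin in Equation (10) counts how many standard deviations the sample mean sits below the regulatory limit. A minimal sketch follows; the function name is hypothetical, and the mean and standard deviation in the example are illustrative placeholders, not the Table 2 statistics.

```python
def compliance_margin(limit, mean, std):
    """Eq. (10): (limit - sample mean) / sample standard deviation."""
    return (limit - mean) / std

# Hypothetical statistics for an indicator with a 50 mg/L limit
example_margin = compliance_margin(limit=50.0, mean=20.0, std=4.0)
print(example_margin)  # 7.5 standard deviations of headroom
```

A smaller margin means that routine variability is more likely to push the indicator toward its limit.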

From an optimization standpoint, the linear penalties and the conditional terms (B, P) form a coordinated ‘soft-constraint + hard-constraint’ structure: linear terms offer smooth gradients for policy updates, while conditional terms introduce nonlinear amplified penalties in regions close to violations, thereby discouraging opportunistic strategies that trade minor non-compliance for reduced chemical usage. The engineering rationale is that very low dissolved oxygen (DO) levels (e.g., <0.2 mg/L) inhibit nitrification activity and can impair ammonia removal. Similarly, a low influent COD/TN ratio indicates carbon limitation for denitrification, increasing the risk of TN non-compliance. Therefore, the boundary penalties can be interpreted as explicit biochemical constraints injected into the reinforcement learning (RL) objective. The local sensitivity of the linear part of the reward with respect to action u_j can be expressed as

\frac{\partial R}{\partial u_j} = -\sum_i k_i \frac{\partial \hat{y}_i}{\partial u_j} - k_{u_j}, \quad \hat{y}_i = f_i(s, u)

(11)

Coefficient selection adheres to a marginal-balance criterion: the reward gained from improvements in water quality should be comparable to the penalty incurred from increased chemical usage under typical operating conditions. This approach prevents the policy from devolving into either extreme of under-dosing or over-dosing.
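As a short worked reading of this criterion, assume an interior operating point where the conditional terms B and P are locally constant, so that only the linear terms of Equation (7) contribute. Setting the sensitivity in Equation (11) to zero then gives the balance condition:

```latex
\frac{\partial R}{\partial u_j} = 0
\quad\Longrightarrow\quad
-\sum_i k_i \frac{\partial \hat{y}_i}{\partial u_j} = k_{u_j}
```

The left-hand side is the weighted marginal reduction in predicted effluent concentrations per unit of additional dose, and the right-hand side is the unit dosing penalty. Under these assumptions, increasing the dose is worthwhile only while the left-hand side exceeds k_{u_j}.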

This reward function is designed based on the fundamental principles of water pollution control engineering, with all penalty and reward items corresponding to the mechanistic constraints established in the mature theories of activated sludge systems.

The process stability penalty term (P) incorporates biochemical mechanistic constraints. Its three penalty items address the rate-limiting steps in biological nitrogen removal and in solid–liquid separation. The dissolved oxygen (DO) penalty term is based on the nitrification process, which follows Monod kinetics, where the ammonia oxidation rate depends on the concentration of dissolved oxygen. The oxygen half-saturation constant for ammonia-oxidizing bacteria (AOB) is K_O = 0.15–0.3 mg/L. When the dissolved oxygen concentration falls below 0.2 mg/L, oxygen supply becomes the rate-limiting factor, which obstructs nitrification and results in excessive ammonia nitrogen in the effluent. Therefore, the penalty threshold (DO < 0.2 mg/L) is set mechanistically as the critical oxygen concentration, below which the activity of ammonia-oxidizing bacteria declines sharply. This principle has been incorporated into the Activated Sludge Model (ASM) and is widely applied in aeration control engineering practice [37,38]. In the penalty term for the 30-min settled sludge volume (SV30), an SV30 above 45% indicates the onset of filamentous sludge bulking. The excessive proliferation of filamentous microorganisms, such as Microthrix and type 0041 filamentous bacteria, can disrupt the structure of the sludge flocs and hinder gravitational sedimentation. From the perspective of solid–liquid separation mechanics, the hindered settling velocity decays approximately exponentially with increasing SV30. Additionally, sludge bulking conditions (SVI > 150 mL/g, SV30 > 45%) can directly cause the sludge blanket to rise, posing a risk of sludge loss. This penalty term thus incorporates the sedimentation constraint necessary to maintain the stability of the sludge–water interface [39].

The penalty term based on the carbon-to-nitrogen ratio (COD/TN) is grounded in the fact that denitrification is a heterotrophic biochemical process driven by carbon sources, requiring biodegradable COD as an electron donor. The theoretical stoichiometric requirement is 2.86 g of COD to reduce 1 g of nitrate nitrogen. When the influent COD/TN < 4, the carbon source becomes the limiting factor for the complete reduction of nitrate nitrogen, and this limitation is especially pronounced under high influent nitrogen loads. The threshold of COD/TN < 4 is recognized as the critical value for stable denitrification operation in wastewater treatment engineering. This penalty term reflects the material balance constraints of electron donors in the anoxic zone, preventing reinforcement learning control strategies from neglecting the stoichiometric relationship of carbon sources, which could lead to excessive total nitrogen in the effluent [38,40].
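The 2.86 g COD per g nitrate nitrogen figure follows from an electron balance, sketched here for reference. Reducing nitrate to N₂ takes 5 mol of electrons per mol of N, whereas 1 mol of O₂ (32 g, the COD reference oxidant) accepts 4 mol of electrons:

```latex
\frac{5\ \text{mol e}^-/\text{mol N}}{4\ \text{mol e}^-/\text{mol O}_2}
\times \frac{32\ \text{g O}_2/\text{mol O}_2}{14\ \text{g N}/\text{mol N}}
= \frac{40}{14} \approx 2.86\ \text{g COD per g NO}_3^-\text{-N}
```

This value neglects the carbon diverted to biomass synthesis, so the practical requirement is higher, which is consistent with using a threshold of 4 rather than 2.86.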

The integration of these three major constraints—oxygen affinity characteristics (microbial kinetics), gravitational sedimentation characteristics (separation mechanics), and electron donor stoichiometric balance (biochemical material balance)—constitutes the core mechanistic boundaries of the activated sludge system. By incorporating this in the form of a conditional penalty term in Equation (8), it ensures that any reinforcement learning strategy remains within the biochemically feasible operational range, rather than relying solely on effluent quality rewards for blind optimization.
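To make the oxygen-affinity constraint concrete, the sketch below evaluates a Monod-type oxygen switching term, DO / (K_O + DO), at the 0.2 mg/L penalty threshold for the two ends of the K_O range quoted above. The function name is hypothetical, and the switching term is the standard saturation form, assumed here rather than taken from the source.

```python
def monod_oxygen_factor(do, k_o):
    """Fraction of the maximum nitrification rate allowed by oxygen: DO / (K_O + DO)."""
    return do / (k_o + do)

threshold_do = 0.2  # mg/L, the penalty threshold used in Eq. (8)
factor_low_ko = monod_oxygen_factor(threshold_do, 0.15)
factor_high_ko = monod_oxygen_factor(threshold_do, 0.30)
print(f"K_O = 0.15 mg/L: {factor_low_ko:.2f} of the maximum rate")
print(f"K_O = 0.30 mg/L: {factor_high_ko:.2f} of the maximum rate")
```

Under this assumption, the oxygen term alone limits nitrification to roughly 40–57% of its maximum rate at the threshold, and the limitation tightens quickly as DO falls further.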

The coagulation performance bonus term (B) is based on the mechanisms of coagulation and flocculation, from which the conditional bonus items for polyferric sulfate (PFS) and polyacrylamide (PAM) are derived.

The PFS bonus item is based on the phosphorus removal mechanism of iron salts, which act by forming ferric (trivalent iron) phosphate precipitates. The precipitation efficiency is governed by the solubility product constant (approximately 10⁻²⁶ at pH 7.5–8.0). Under the typical pH conditions of municipal wastewater treatment plants (7.0–8.0), the theoretically achievable minimum soluble total phosphorus with iron salts is about 0.1–0.2 mg/L. Therefore, the bonus condition of effluent total phosphorus <0.25 mg/L represents a controllable engineering target that approaches the thermodynamic precipitation limit. It is also essential to avoid overdosing iron salts: trivalent iron ions can bind with extracellular polymeric substances (EPS) and inhibit the polyphosphate kinase activity of polyphosphate-accumulating organisms (PAOs), which gives rise to the well-recognized trade-off between chemical and biological phosphorus removal. This bonus mechanism provides positive incentives only when the effluent total phosphorus approaches the precipitation limit, consistent with the stoichiometric constraint of the optimal iron-to-phosphorus molar ratio range (1.5–2.5) while avoiding the inhibition of PAO activity [41].
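For orientation, the Fe:P molar ratio range of 1.5–2.5 can be converted into a mass ratio using standard atomic masses. The sketch below is illustrative only; the function name is hypothetical, and the result refers to elemental iron per unit of phosphorus, not to the mass of a commercial PFS product, whose iron content is not given in the text.

```python
FE_MOLAR_MASS = 55.845  # g/mol, iron
P_MOLAR_MASS = 30.974   # g/mol, phosphorus

def fe_mass_per_p_mass(molar_ratio):
    """Grams of Fe per gram of P at a given Fe:P molar ratio."""
    return molar_ratio * FE_MOLAR_MASS / P_MOLAR_MASS

low_dose = fe_mass_per_p_mass(1.5)
high_dose = fe_mass_per_p_mass(2.5)
print(f"Fe:P = 1.5 mol/mol -> {low_dose:.2f} g Fe per g P")
print(f"Fe:P = 2.5 mol/mol -> {high_dose:.2f} g Fe per g P")
```

The molar range thus corresponds to roughly 2.7–4.5 g of Fe per g of P.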

Polyacrylamide (PAM) is a high-molecular-weight flocculant that aggregates suspended particles through adsorption bridging and charge neutralization. The bonus threshold for effluent suspended solids (SS) is set at <5 mg/L, a stringent internal control target that is stricter than the Grade-1A discharge limit of 10 mg/L. Bonuses are granted only when PAM addition effectively enhances solid–liquid separation performance, which discourages indiscriminate dosing; blind dosing only increases sludge production and chemical consumption without yielding actual water-quality benefits. This design aligns with colloidal stability theory: the optimal flocculant dosage brings the zeta potential of the system into the near-zero range (−5 to +5 mV) [42]. Figure 5 further illustrates the factors considered in designing the reward function.

To quantify the impact of reward coefficient variations on policy performance, a sensitivity analysis was conducted on the four core effluent penalty coefficients (COD, TN, TP, SS). The SAC agent was evaluated under ±20% perturbations of each coefficient across 20 independent operating scenarios, with the Impact Score defined as the mean reward difference between the +20% and −20% conditions, as shown in Table 3.
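The Impact Score can be sketched as below. This is a simplified illustration under stated assumptions: the policy is held fixed (the text does not say whether the agent is retrained for each perturbation), the toy evaluator uses only the linear effluent terms of Equation (7), and the function names and scenario values are hypothetical.

```python
def impact_score(evaluate, coeffs, name, scenarios, delta=0.2):
    """Mean reward difference between the +delta and -delta settings of one coefficient."""
    up = dict(coeffs)
    up[name] = coeffs[name] * (1 + delta)
    down = dict(coeffs)
    down[name] = coeffs[name] * (1 - delta)
    diffs = [evaluate(up, s) - evaluate(down, s) for s in scenarios]
    return sum(diffs) / len(diffs)

BASE_COEFFS = {"COD": 1.5, "TN": 2.0, "TP": 3.0, "SS": 1.0}

def linear_reward(coeffs, effluent):
    # Toy evaluator: effluent-penalty part of Eq. (7) for a fixed policy
    return 100.0 - sum(coeffs[k] * effluent[k] for k in coeffs)

# Hypothetical effluent outcomes for two scenarios (illustrative values only)
scenarios = [
    {"COD": 20.0, "TN": 10.0, "TP": 0.2, "SS": 4.0},
    {"COD": 30.0, "TN": 12.0, "TP": 0.3, "SS": 6.0},
]
cod_impact = impact_score(linear_reward, BASE_COEFFS, "COD", scenarios)
print(cod_impact)
```

With a fixed policy and a purely linear reward, the score reduces to −2·delta·k times the mean effluent concentration, so any departure from that value in a full evaluation would reflect the conditional terms or changes in the learned behavior.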

2.6. Reinforcement Learning Algorithms and Training Setup

This study compares three continuous-control algorithms: Proximal Policy Optimization (PPO), Twin Delayed Deep Deterministic Policy Gradient (TD3), and Soft Actor–Critic (SAC), which incorporates maximum-entropy regularization. All algorithms are trained for 50,000 steps in the same virtual environment, using a random-action policy as the baseline.

The rationale for selecting these three algorithms is as follows. First, this study’s action space is two-dimensional and continuous, encompassing PAM and polyferric sulfate dosage. This necessitates the use of deep Actor–Critic methods capable of directly outputting continuous actions while effectively handling nonlinear value estimation. Proximal Policy Optimization (PPO), Twin Delayed Deep Deterministic Policy Gradient (TD3), and Soft Actor–Critic (SAC) are widely validated in continuous-control tasks and represent mainstream on-policy and off-policy technical routes [43,44,45,46].

Second, the three methods exhibit complementary mechanisms, making them suitable for an interpretable comparison set. PPO enhances policy-update stability through a clipped objective [40]; TD3 mitigates overestimation bias and improves sample efficiency via twin Q-networks and delayed policy updates; and SAC incorporates maximum-entropy regularization within an off-policy framework to balance return optimization and exploration robustness [45]. Additionally, in wastewater-control research, continuous-control policy-gradient and Actor–Critic families are commonly assessed to compare trade-offs among energy use, compliance, and risk [1,2,19]. Therefore, the selection of PPO, TD3, and SAC facilitates a simultaneous comparison of three core properties within a unified environment—training stability, sample efficiency, and exploration capability—thereby enhancing reproducibility and engineering interpretability.

The evaluation stage employs fifty independent test episodes to reduce evaluation variance and provide sufficient samples for statistical testing, with the average episode reward under matched initial operating conditions serving as the primary metric.

To ensure a fair comparison, all three algorithms utilize the same state/action normalization pipeline, identical training steps (50,000), and the same set of initial test conditions. During evaluation, exploration noise is disabled, and deterministic actions are employed; average cumulative reward and its dispersion are reported as the main comparison criteria.
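A minimal sketch of how the mean episode reward and a 95% confidence interval could be summarized from evaluation episodes. The normal approximation is an assumption; the text does not state whether the reported intervals are normal-based, t-based, or bootstrapped. The function name and the sample rewards are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev

def mean_with_ci(rewards, confidence=0.95):
    """Return (mean, lower, upper) using a normal approximation to the sampling distribution."""
    m = mean(rewards)
    sem = stdev(rewards) / math.sqrt(len(rewards))   # standard error of the mean
    z = NormalDist().inv_cdf(0.5 + confidence / 2)   # about 1.96 for 95%
    return m, m - z * sem, m + z * sem

# Hypothetical episode rewards (illustrative values only)
avg, ci_lower, ci_upper = mean_with_ci([68.0, 70.0, 72.0])
print(f"mean = {avg:.2f}, 95% CI = [{ci_lower:.2f}, {ci_upper:.2f}]")
```

For small numbers of episodes, a Student-t quantile would give a slightly wider interval than the normal quantile used here.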

3. Results and Discussion

3.1. Benchmark Comparison of Policy Performance

To objectively evaluate control performance across policies, a deterministic evaluation was conducted for the trained SAC, TD3, and PPO agents, proportional control, and the random baseline on a test set comprising 50 independent random initial scenarios. The metric employed is the average cumulative reward. The benchmark results are summarized in Table 4.

As summarized in Table 4, all three RL algorithms significantly outperform both the random baseline and proportional control. SAC achieves a mean reward of 70.23, TD3 70.33, and PPO 68.78, corresponding to improvements of 53.1%, 53.3%, and 49.9% over the random baseline, respectively. Because all policies are evaluated with deterministic actions on the same 50 independent random initial scenarios, these gains are unlikely to be attributable to chance exploration; rather, they stem from effective learning of the state–action–effluent relationship. The findings demonstrate that in a highly nonlinear virtual wastewater treatment plant (WWTP), RL can yield reproducible high-quality control policies under multi-objective constraints.
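The percentage improvements quoted above follow directly from the reported mean rewards, as this short check shows (the function name is illustrative; the numbers are the means stated in the text).

```python
def relative_improvement(policy_mean, baseline_mean):
    """Percentage improvement of a policy's mean reward over a baseline mean."""
    return 100.0 * (policy_mean - baseline_mean) / baseline_mean

random_baseline = 45.87
mean_rewards = {"SAC": 70.23, "TD3": 70.33, "PPO": 68.78}
improvements = {
    name: round(relative_improvement(value, random_baseline), 1)
    for name, value in mean_rewards.items()
}
print(improvements)  # {'SAC': 53.1, 'TD3': 53.3, 'PPO': 49.9}
```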

SAC and TD3 do not differ significantly in performance: a paired Wilcoxon signed-rank test yields p = 0.0631, indicating no significant difference at the α = 0.05 level. However, both off-policy algorithms (SAC and TD3) achieve higher mean rewards than the on-policy PPO, reflecting the sample efficiency advantage of off-policy learning within the 50,000-step training budget. SAC further benefits from maximum-entropy regularization, which adaptively balances exploration and exploitation, as evidenced by the entropy loss trend in Figure 6c. TD3, by contrast, employs a deterministic policy with clipped double-Q learning to mitigate overestimation bias. Despite these architectural differences, both off-policy agents converge to statistically comparable control performance under the current virtual environment and training budget.

Robustness and Safety Validation Under Influent Concentration Shocks

To validate the robustness and inherent safety of the learned reinforcement learning policy under realistic industrial disturbances, this subsection presents shock-load experiments in which influent pollutant concentrations (COD, TN, TP, SS, NH₃-N) were simultaneously increased by +30% and +50% to simulate sudden industrial discharge events or stormwater-induced hydraulic overloads. Evaluation metrics include the full effluent compliance rate, chemical dosing cost, and average episodic reward. The SAC policy is compared against the proportion-based feedforward control widely deployed in engineering practice. The test results are shown in Table 5.

Under normal conditions, SAC achieves 100% full compliance with a dosing cost of only 1.66, representing an 80.8% reduction compared to proportional control (8.68). Under the +30% shock, proportional control blindly increases dosing to 11.27 based on its linear gain assumption, yet achieves only 80.0% compliance. In contrast, SAC adjusts its cost marginally from 1.66 to 1.60 while retaining 88.0% compliance, achieving an 85.8% cost saving. Under the severe +50% shock, proportional control escalates dosing to 12.92 (a 48.8% increase from normal) while its compliance rate drops to 78.0%. SAC, by contrast, retains 84.0% compliance at a remarkably low cost of 1.38, outperforming proportional control by 6 percentage points in compliance while achieving an 89.3% cost saving.

The failure of proportional feedforward control stems from its linear extrapolation assumption—that dosing should scale proportionally with influent load—which ignores the nonlinear mechanisms of floc overload, sedimentation inhibition, and chemical–biological antagonism (e.g., ferric salt interference with polyphosphate-accumulating organisms) at high concentrations. Excessive dosing not only fails to linearly improve removal efficiency but instead exacerbates chemical sludge production, coagulant hydrolysis competition, and PAO activity suppression. Through extensive trial-and-error within the virtual environment, the SAC agent learns an optimal control strategy that avoids this antagonistic regime, achieving a “low-dose, high-compliance” dual objective under shock loads. These results demonstrate the superior inherent safety margin and economic rationality of the RL-based policy under extreme disturbances, providing critical safety evidence for its engineering deployment.

3.2. Comparison with Existing Technical Routes

To address the incremental value of this method in comparison to existing routes, we conduct a horizontal comparison from a technical attributes perspective. Published studies exhibit significant variations in data sources, process configurations, evaluation metrics, and reward definitions (e.g., BSM1 simulation, data-driven A²O simulation, full-scale plant set-point optimization), rendering absolute reward values non-comparable across different papers [2,3,4,5,6,19]. Consequently, we employ a four-dimensional framework for comparison, which includes nonlinear adaptability, online trial-and-error risk, multi-objective coordination capability, and pre-deployment validation capability, as shown in Table 6.

Table 6 shows that heuristic methods are straightforward to implement; however, they often struggle to maintain stable optimality when faced with significant disturbances and multiple constraints. Classical APC is effective under conditions of local linearity or known models but exhibits limited adaptability to complex nonlinear couplings and operational drifts. Direct online reinforcement learning (RL) offers robust policy-search capabilities, yet its practical application is hindered by compliance risks and the costs associated with trial-and-error approaches. The proposed methodology (prior learning of a virtual wastewater treatment plant (WWTP), policy training in the virtual environment, and offline evaluation and screening) maintains strong optimization capabilities while significantly mitigating online exploration risks, thus aligning more closely with the engineering requirements of WWTPs, which prioritize safety and progressive deployment. The “High/Medium/Low” levels in the table are derived from the reported capabilities, deployment forms, and risk characteristics of the methods discussed in the representative references.

3.3. Statistical Significance of Policy Comparisons

To evaluate whether the performance differences among policies are statistically significant, we conducted a Kruskal–Wallis H test (a non-parametric one-way ANOVA) on the evaluation rewards from SAC, TD3, PPO, proportional control, and a random baseline across 50 independent test episodes. The test yielded p = 8.10 × 10⁻¹⁵ (p < 0.001), leading to the rejection of the null hypothesis that all policies perform equally.

Post hoc pairwise comparisons were performed using Mann–Whitney U tests with Bonferroni correction (m = 10) to control the family-wise error rate. All three reinforcement learning (RL) algorithms significantly outperform both the random baseline (adjusted p < 6.5 × 10⁻¹⁰) and proportional control (adjusted p < 4.8 × 10⁻⁶). The difference between SAC and TD3 is not statistically significant (adjusted p = 1.00), indicating that their performance is comparable given the current number of evaluation episodes. However, both off-policy algorithms (SAC and TD3) consistently achieve higher mean rewards than the on-policy PPO (see Table 4), reflecting the sample efficiency advantage of off-policy learning within the 50,000-step training budget. These statistical results confirm that the performance gains reported in Table 4 are not attributable to random chance, but rather reflect genuine learning of the state–action–effluent relationship. Detailed results of the pairwise comparisons, including raw and adjusted p-values, are provided in Table 7.
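The correction factor m = 10 is the number of unordered pairs among the five policies, and the Bonferroni adjustment multiplies each raw p-value by m, capped at 1. A minimal sketch (the policy labels and function name are illustrative):

```python
from itertools import combinations

policies = ["SAC", "TD3", "PPO", "Proportional", "Random"]
pairs = list(combinations(policies, 2))  # all unordered policy pairs
m = len(pairs)                           # 5 choose 2 = 10

def bonferroni(p_raw, n_tests):
    """Bonferroni-adjusted p-value: raw p times the number of tests, capped at 1."""
    return min(1.0, p_raw * n_tests)

print(m)
print(bonferroni(0.25, m))   # capped at 1.0
print(bonferroni(1e-7, m))   # about 1e-06
```

Because of the cap, any raw p-value of 0.1 or more maps to an adjusted value of 1.00.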

3.4. TensorBoard Training Dynamics Analysis

During the experimental process, we obtained training diagnostics of PPO, TD3, and SAC in the virtual wastewater treatment plant environment, with Figure 6a–d most intuitively demonstrating the learning process of the intelligent agents.

According to Figure 6, the training process can be analyzed as follows:

Figure 6a (critic stability): the dark-blue SAC critic-loss curve declines rapidly and then stabilizes around 1.2, whereas the pink TD3 curve declines more slowly, with pronounced fluctuations, and ultimately stabilizes around 6.8. Although SAC and TD3 converge to final policies without a significant performance difference, SAC provides a more stable estimate of future rewards under complex water-quality disturbances.

Figure 6b (explained variance): the explained variance of PPO rises steadily from negative or near-zero values to above 0.8, reaching a final value of 0.8432. This indicates strong statistical consistency between the virtual WWTP and the reward function, with the learned value estimates accounting for most of the variance in the observed returns.

Figure 6c (exploration to convergence): the entropy loss decreases gradually, reflecting a smooth transition from high exploration to stable exploitation. In addition, the action standard deviation drops to approximately 0.26, indicating that the agent has converged to a relatively stable dosing decision pattern.

Figure 6d (computational efficiency versus performance): training throughput in frames per second (FPS) ranks PPO (93) > TD3 (34) > SAC (28). Although SAC achieves a final reward comparable to that of TD3 (no statistically significant difference; see Table 4), its update mechanism is more complex and computationally intensive, which highlights the engineering trade-off between computational efficiency and policy performance. SAC's more stable critic-loss convergence (Figure 6a) therefore comes at the cost of lower FPS, giving it a slight advantage in convergence stability and exploratory behavior.
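For readers unfamiliar with the explained-variance diagnostic, the sketch below uses the definition commonly reported by PPO implementations, 1 − Var(returns − value predictions) / Var(returns). This definition is an assumption, since the text does not state the formula, and the function name and sample values are illustrative.

```python
from statistics import pvariance

def explained_variance(returns, value_preds):
    """1 - Var(returns - predictions) / Var(returns); 1 is perfect, 0 matches a constant guess."""
    residuals = [r - v for r, v in zip(returns, value_preds)]
    return 1.0 - pvariance(residuals) / pvariance(returns)

returns = [1.0, 2.0, 3.0, 4.0]
perfect = explained_variance(returns, returns)            # predictions equal the returns
constant = explained_variance(returns, [2.5] * 4)         # constant prediction at the mean
print(perfect, constant)
```

Under this definition, negative values early in training mean the value predictions are worse than a constant guess, and a value near 0.84 means most of the return variance is captured.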

Furthermore, the efficiency differences illustrated in Figure 6d have direct implications for deployment: under the current MLP virtual WWTP, the inference overhead is low, making the performance-optimal SAC more valuable. However, if more complex mechanism simulators are introduced in the future (e.g., CFD or ASM-coupled models), PPO’s higher FPS may become advantageous in scenarios with strict real-time requirements.

3.5. Interpretable Dosing-Behavior Analysis

Both SAC and TD3 exhibit a “minimum dose” strategy, with SAC showing a slight advantage in stability and consistency; the overall behavior of the two algorithms is otherwise comparable. The average polyacrylamide (PAM) dosage stabilizes at 0.0964 kg/d, while the polyferric sulfate (PFS) dosage centers at 0.1400 t/d. This finding aligns with the reward design, in which increased dosage results in direct score deductions. Consequently, the agent tends to identify the lowest-cost action that still satisfies the water-quality constraints.

From an engineering perspective, this behavior contributes to reduced chemical consumption and supports the principles of “precision dosing and low-carbon operation.” In contrast to the prevalent heuristic strategies employed in traditional plants, which often rely on over-dosing to maintain safety redundancy, this policy prioritizes a more nuanced approach that emphasizes the balance between marginal benefits and marginal costs while ensuring compliance with regulatory constraints. Nonetheless, it is crucial to acknowledge the potential model risk associated with this strategy; if historical data do not encompass extreme scenarios where low dosing may induce systemic instability, the policy could be overly optimistic regarding the safety of low-dosage applications.

3.6. Engineering Implications and Limitations

The engineering implications of this study are primarily reflected in three aspects: it provides a low-risk development paradigm of “historical operation data → virtual WWTP → control-policy evaluation,” which can significantly reduce online trial-and-error costs in real plants; it achieves dosing optimization under compliance constraints and offers quantifiable decision support for chemical reduction and low-carbon operation; and it establishes a verifiable evidence chain for policy trustworthiness through training diagnostics (critic loss, explained variance, entropy indicator, and FPS), thereby improving interpretability for operations-support applications. Importantly, the implications outlined above should be distinguished from a more concrete engineering deliverable. The current framework, after further refinements such as uncertainty quantification, cross-plant validation, and human-in-the-loop deployment mechanisms, has the potential to evolve into a practical decision-support system. This system would assist wastewater treatment plant operators in formulating chemical dosing plans. Unlike the general advantages of model-based reinforcement learning, which include reducing online trial-and-error risks, this specific methodological contribution directly addresses the real-world challenge of generating safe, compliant, and economically optimal dosing schedules under variable influent conditions. The present study lays the groundwork for the necessary modeling, reward design, and policy benchmarking components required for such a system. Future work will concentrate on bridging the gap between offline policy screening and online operational assistance. 
However, consistent with common issues in existing studies [13,19], this work has several limitations: the sample size is limited to 227 and originates from a single plant and period; cross-season and cross-process generalization remain to be validated; the purely data-driven virtual WWTP may deviate from mechanistic laws under extrapolation conditions, and uncertainty quantification is still insufficient; although reward coefficients are engineering-calibrated, multi-scenario sensitivity and risk-boundary analyses are not yet systematic; the current reward emphasizes stepwise feedback and has limited representation of long-horizon dynamics in sludge settling and biochemical processes. Future work could integrate mechanistic priors (e.g., ASM-related constraints), uncertainty-aware virtual WWTPs, and staged “human-in-the-loop” deployment mechanisms, alongside safe reinforcement learning and offline policy-evaluation frameworks, to further enhance engineering deployment reliability [21,28,29,47,48,49,50].

3.7. Expanded Discussion on Dataset Limitations and Generalizability

While the 227 historical records constitute a genuine operational dataset from a full-scale wastewater treatment plant (WWTP), two inherent limitations warrant explicit discussion regarding their generalization and potential bias.

First, the sample size is small. Using only 227 samples, the virtual WWTP achieves a mean R² of 0.3422 ± 0.2534 from 5-fold cross-validation, with a bootstrapped 95% CI of [0.2217, 0.5390] (100 resampling iterations). The lower bound of this CI is strictly greater than zero, statistically rejecting the null hypothesis of random prediction (R² = 0). Although this result is statistically significant, it also indicates that a substantial share of the variance remains unexplained. Additionally, the low R² value observed in the third fold of the 5-fold cross-validation suggests that predictive performance may decline in certain data partitions. To mitigate overfitting, we employed 5-fold cross-validation, a conservative reward design with boundary penalties, and non-parametric statistical tests. Nevertheless, the risk of overfitting remains and should be kept in mind when interpreting the results.

Second, the data come from a single site and period. All data in this study were collected from a wastewater treatment plant in a town in northern China from January to August 2025, a total of 8 months. Therefore, the control strategies learned by the model may implicitly reflect site-specific characteristics (such as sludge age, microbial community, and equipment configuration) as well as seasonal variations (such as temperature-influenced nitrification dynamics). Given that wastewater treatment plants differ in process flow, activated sludge microbial composition, and environmental conditions, the generalizability of this model-based reinforcement learning (MBRL) framework to other plants, climate zones, or seasonal conditions (such as low-temperature operation in winter) has yet to be validated. Cross-plant and cross-season data collection and validation studies are therefore a priority for verifying the applicability of this method to other wastewater treatment plants.

Based on the aforementioned limitations, this paper identifies several key directions for future research. First, mechanistic prior knowledge, such as constraints derived from the Activated Sludge Model (ASM), could be integrated into the virtual WWTP to improve the reliability of out-of-distribution extrapolation. Second, data collection across multiple plants and seasons would strengthen the generalization capability of the models. Third, pilot-scale validation should be conducted before full-scale engineering operation, employing a phased human-in-the-loop deployment mechanism.

4. Conclusions

Focusing on the core objective of compliance-cost co-optimization in wastewater treatment plants (WWTPs), this study develops and validates a model-based reinforcement learning (MBRL) framework for dosing optimization. Experimental results demonstrate that Soft Actor–Critic (SAC), Twin Delayed Deep Deterministic Policy Gradient (TD3), and Proximal Policy Optimization (PPO) consistently outperform both the random baseline and the proportional control baseline. SAC and TD3 achieve statistically indistinguishable average rewards, with no significant difference in final performance (paired Wilcoxon p = 0.0631). However, SAC shows slight advantages in training stability, including faster convergence of the critic loss and smoother entropy decay.

More importantly, this study goes beyond reporting comparative scores by establishing a verifiable evidence chain for policy credibility. At the training level, SAC exhibits faster and less volatile convergence of critic loss, while PPO maintains an explained variance of approximately 0.84, with entropy loss and action dispersion increasing smoothly. At the behavioral level, SAC spontaneously develops a minimally sufficient dosing pattern characterized by low polyacrylamide (PAM) dosage (average 0.0964 kg/d) and low polyferric sulfate dosage (average 0.1400 t/d), reflecting a consistent control logic aimed at minimizing consumption under compliance constraints. Beyond these algorithmic comparisons, the primary scientific contribution of this work lies in providing a replicable methodology for constructing a low-risk virtual commissioning environment and benchmarking candidate control policies. From an engineering perspective, this methodology, after further refinement (including uncertainty quantification, cross-season validation, and staged deployment protocols), can be transformed into a practical decision-support tool to assist in the formulation of chemical dosing plans for municipal WWTPs. This concrete engineering deliverable is distinct from the general advantages of model-based reinforcement learning, such as reduced online exploration risk, and directly addresses the pressing need for intelligent, safe, and interpretable dosing strategies in real-world plants.

The engineering significance of these findings lies in the framework’s ability to provide quantitative support for dosing optimization, low-carbon operation, and operations and maintenance (O&M) decision-making without increasing the risk of trial-and-error in real plants. Furthermore, this framework serves as a proof-of-concept for virtual commissioning; however, actual deployment necessitates plant-specific fine-tuning, the incorporation of uncertainty-aware safeguards, and pilot-scale validation before field implementation.

However, this study has several limitations that affect its generalizability. The dataset consists of only 227 samples collected from a single plant over an 8-month period, which restricts statistical power and may introduce seasonal and plant-specific biases. Although 5-fold cross-validation (mean R² = 0.3422 ± 0.2534, 95% CI [0.2217, 0.5390]) and non-parametric statistical tests (Bonferroni-corrected p < 0.001 for RL vs. baseline) address some of these concerns, the virtual WWTP still leaves a significant amount of variance unexplained, particularly regarding total nitrogen (TN) dynamics (Durbin–Watson = 1.33). Future research should incorporate data from multiple plants across different seasons, utilize recurrent architectures to account for temporal dependencies, and implement uncertainty-aware offline reinforcement learning frameworks (e.g., conservative Q-learning) to improve generalization and ensure deployment safety [28,29].
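The Durbin–Watson statistic quoted above measures first-order serial correlation in time-ordered residuals: values near 2 indicate none, and values below 2 (such as 1.33) point to positive autocorrelation, which is why recurrent architectures are suggested. A minimal sketch of the standard formula follows; the function name and the residual series are hypothetical and are not taken from the study's TN model.

```python
def durbin_watson(residuals):
    """Durbin–Watson statistic for a time-ordered residual series.

    DW = sum_t (e_t - e_{t-1})^2 / sum_t e_t^2, which lies in [0, 4].
    """
    if len(residuals) < 2:
        raise ValueError("need at least two residuals")
    # Squared successive differences capture how much neighbours disagree.
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    if den == 0:
        raise ValueError("residuals are all zero")
    return num / den


# Hypothetical residual series for illustration only.
alternating = [1.0, -1.0, 1.0, -1.0]   # strong negative autocorrelation
persistent = [1.0, 1.0, 1.0, 1.0]      # strong positive autocorrelation

dw_alternating = durbin_watson(alternating)
dw_persistent = durbin_watson(persistent)
print(dw_alternating, dw_persistent)
```

A persistent residual pattern drives the statistic toward 0 and an alternating one toward 4, so a value of 1.33 sits on the positively correlated side of 2.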

Author Contributions

Y.Z.: Methodology, Validation, Visualization, Data curation, Writing—original draft, Writing—review & editing. D.M.: Software, Methodology, Formal analysis. W.M.: Project administration, Investigation, Funding acquisition, Formal analysis, Data curation. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Heilongjiang Province Ecological Environment Protection Research Project (No. HST2023S014); the Basic Research Project of Scientific Research Funds for Universities in Heilongjiang Province (No. 2025-KYYWF-ZR0404); the Heilongjiang Province Double First-Class Disciplines Collaborative Innovation Achievement Project (No. LJGXCG2025-F07); and the Heilongjiang Province Scientific and Technological Innovation Base Award Program (No. JD25B011).

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Acknowledgments

During the preparation of this work, the authors utilized OpenAI’s ChatGPT GPT-5 mini to enhance language clarity and readability, as well as to assist with ensuring citation format consistency. Following the use of this tool, the authors carefully reviewed and edited the content as necessary, taking full responsibility for the publication’s content.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

WWTP	wastewater treatment plant
SRF	specific resistance to filtration
RL	reinforcement learning
MDP	Markov decision process
MLP	multilayer perceptron
A²O	anaerobic–anoxic–oxic
APC	advanced process control
MPC	model predictive control
PID	proportional–integral–derivative
BSM1	Benchmark Simulation Model No. 1
ASM	activated sludge model
MSE	mean squared error
MAE	mean absolute error

References

1. Croll, H.C.; Ikuma, K.; Ong, S.K.; Sarkar, S. Systematic performance evaluation of reinforcement learning algorithms applied to wastewater treatment control optimization. Environ. Sci. Technol. 2023, 57, 18382–18390.
2. Aponte-Rengifo, O.; Francisco, M.; Vilanova, R.; Vega, P.; Revollar, S. Intelligent control of wastewater treatment plants based on model-free deep reinforcement learning. Processes 2023, 11, 2269.
3. Hu, F.; Zhang, X.; Lu, B.; Lin, Y. Real-time control of A2O process in wastewater treatment through fast deep reinforcement learning based on data-driven simulation model. Water 2024, 16, 3710.
4. Nam, K.J.; Heo, S.K.; Kim, S.Y.; Yoo, C.K. A multi-agent AI reinforcement-based digital multi-solution for optimal operation of a full-scale wastewater treatment plant under various influent conditions. J. Water Process Eng. 2023, 52, 103533.
5. Nam, K.J.; Heo, S.K.; Tariq, S.; Woo, T.Y.; Yoo, C.K. Multi-agent reinforcement learning-enhanced autonomous calibration method for wastewater treatment modeling: Long-term validation of a full-scale plant. J. Water Process Eng. 2024, 59, 104908.
6. Nam, K.J.; Heo, S.K.; Yoo, C.K. Multi-agent reinforcement learning-driven adaptive controller tuning system for autonomous control of wastewater treatment plants: An offline learning approach. J. Water Process Eng. 2025, 70, 107059.
7. Zhu, Z.; Dong, S.; Zhang, H.; Parker, W.; Yin, R.; Bai, X.; Yu, Z.; Wang, J.; Gao, Y.; Ren, H. Bayesian optimization-enhanced reinforcement learning for self-adaptive and multi-objective control of wastewater treatment. Bioresour. Technol. 2025, 421, 132210.
8. Cairone, S.; Hasan, S.W.; Choo, K.H.; Lekkas, D.F.; Fortunato, L.; Zorpas, A.A.; Korshin, G.; Zarra, T.; Belgiorno, V.; Naddeo, V. Revolutionizing wastewater treatment toward circular economy and carbon neutrality goals: Pioneering sustainable and efficient solutions for automation and advanced process control with smart and cutting-edge technologies. J. Water Process Eng. 2024, 63, 105486.
9. Haimi, H.; Awaitey, A.; Kiran, A.; Larsson, T.; Blomberg, K.; Elvander, F.; Petäjä, E.; Mulas, M.; Sahlstedt, K.; Mikola, A. Integrating data-driven and process expertise in soft-sensor design for a wastewater treatment digital twin application. Water Sci. Technol. 2025, 92, 1308–1327.
10. Liu, W.; He, S.; Mou, J.; Xue, T.; Chen, H.; Xiong, W. Digital twins-based process monitoring for wastewater treatment processes. Reliab. Eng. Syst. Saf. 2023, 238, 109416.
11. Ma, Z.; Zhu, Y.; Chen, C.; Li, T.; Li, Y.; Li, X.; Wang, Y.; Waite, T.D.; Guan, J. Towards the digitalization of water treatment facilities: A case study on machine learning-enabled digital twins. J. Water Process Eng. 2025, 77, 108316.
12. Rodríguez-Alonso, C.; Peña-Regueiro, I.; García, Ó. Digital twin platform for the real-time monitoring and prediction of water and wastewater treatment plant systems. Sensors 2024, 24, 1568.
13. Wang, A.J.; Li, H.; He, Z.; Tao, Y.; Wang, H.; Yang, M.; Savic, D.; Daigger, G.T.; Ren, N. Digital twins for wastewater treatment: A technical review. Engineering 2024, 36, 21–35.
14. Chen, K.; Liang, J.; Wang, Y.; Tao, Y.; Lu, Y.; Wang, A. A global perspective on microbial risk factors in effluents of wastewater treatment plants. J. Environ. Sci. 2024, 138, 227–235.
15. Gao, C.; Yang, F.; Tian, Z.; Sun, D.; Liu, W.; Peng, Y. Pathways of inhibition of filamentous sludge bulking by slowly biodegradable organic compounds. J. Environ. Sci. 2025, 150, 104–115.
16. Kuang, L.; Liu, R.; Jin, M.; Lan, Y.; Su, Y.; Zhao, Y.; Chen, L. Characterization and recognition of three-dimensional excitation-emission matrix spectra of wastewater from six typical categories. J. Environ. Sci. 2025, 157, 206–219.
17. Li, Z.; Qi, R.; Wang, B.; Zou, Z.; Wei, G.; Yang, M. Cost-performance analysis of nutrient removal in a full-scale oxidation ditch process based on kinetic modeling. J. Environ. Sci. 2013, 25, 26–32.
18. Li, W.; Xia, Y.; Li, N.; Chang, J.; Liu, J.; Wang, P.; He, X. Temporal assembly patterns of microbial communities in three parallel bioreactors treating low-concentration coking wastewater with differing carbon source concentrations. J. Environ. Sci. 2024, 137, 455–468.
19. Croll, H.C.; Ikuma, K.; Ong, S.K.; Sarkar, S. Reinforcement learning applied to wastewater treatment process control optimization: Approaches, challenges, and path forward. Crit. Rev. Environ. Sci. Technol. 2023, 53, 1775–1794.
20. Alnimer, A.A.; Smith, D.S.; Parker, W.J. Insight into direct phosphorus release from simulated wastewater ferric sludge: Influence of physiochemical factors. J. Environ. Chem. Eng. 2023, 11, 110259.
21. Huang, J.; Zhang, L. Safe reinforcement learning for wastewater treatment with an input convex safety critic. Desalin. Water Treat. 2025, 324, 101451.
22. Xiao, A.; Yu, J.; Lin, Z.; Cao, M.; Jian, S.; Lin, S.; Zhou, J. Inhibition of ferric salts on phosphorus-accumulating organisms in simultaneous chemical precipitation for phosphorus removal. Front. Microbiol. 2025, 16, 1681450.
23. Chua, K.; Calandra, R.; McAllister, R.; Levine, S. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Proceedings of the Advances in Neural Information Processing Systems, Montréal, QC, Canada, 3–8 December 2018.
24. Deisenroth, M.P.; Rasmussen, C.E. PILCO: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on Machine Learning, Bellevue, WA, USA, 28 June–2 July 2011; pp. 465–472.
25. Deisenroth, M.P.; Fox, D.; Rasmussen, C.E. Gaussian processes for data-efficient learning in robotics and control. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 408–423.
26. Janner, M.; Fu, J.; Zhang, M.; Levine, S. When to trust your model: Model-based policy optimization. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019.
27. Khurshid, A.; Pani, A.K. A review on machine learning in wastewater treatment applications: Focus on model evaluation and analysis of BSM1 benchmark simulation dataset. Environ. Monit. Assess. 2023, 195, 916.
28. Kidambi, R.; Rajeswaran, A.; Netrapalli, P.; Joachims, T. MOReL: Model-based offline reinforcement learning. In Proceedings of the Advances in Neural Information Processing Systems, Online, 6–12 December 2020.
29. Kumar, A.; Zhou, A.; Tucker, G.; Levine, S. Conservative Q-learning for offline reinforcement learning. In Proceedings of the Advances in Neural Information Processing Systems, Online, 6–12 December 2020.
30. GB 18918-2002; Discharge Standard of Pollutants for Municipal Wastewater Treatment Plant (GB 18918-2002, with 2006 and 2025 Amendment Sheets). Ministry of Ecology and Environment of the People’s Republic of China; State Administration for Market Regulation; Standards Press of China: Beijing, China, 2002.
31. Alex, J.; Benedetti, L.; Copp, J.; Gernaey, K.V.; Steyer, J.P. Benchmark Simulation Model No. 1 (BSM1); Lund University: Lund, Sweden, 2008.
32. Wu, Z.; Zheng, K.; Zhang, G.; Huang, L.; Zhou, S. Preparation of polysulfone-based nanofiber Janus membrane for membrane distillation containing organic pollutants. npj Clean Water 2024, 7, 51.
33. Dimoglo, A.; Sevim-Elibol, P.; Dinç, Ö.; Gökmen, K.; Erdoğan, H. Electrocoagulation/electroflotation as a combined process for the laundry wastewater purification and reuse. J. Water Process Eng. 2019, 31, 100877.
34. Xu, H.; Wei, S.; Li, G.; Guo, B. Advanced removal of phosphorus from urban sewage using chemical precipitation by Fe-Al composite coagulants. Sci. Rep. 2024, 14, 4918.
35. Guyon, I.; Elisseeff, A. An introduction to variable and feature selection. J. Mach. Learn. Res. 2003, 3, 1157–1182.
36. Abdoli, S.; Asgari Lajayer, B.; Dehghanian, Z.; Bagheri, N.; Vafaei, A.H.; Chamani, M.; Rani, S.; Lin, Z.; Shu, W.; Price, G.W. A review of the efficiency of phosphorus removal and recovery from wastewater by physicochemical and biological processes: Challenges and opportunities. Water 2024, 16, 2507.
37. Picioreanu, C.; Pérez, J.; van Loosdrecht, M.C.M. Impact of cell cluster size on apparent half-saturation coefficients for oxygen in nitrifying sludge and biofilms. Water Res. 2016, 106, 371–382.
38. Zaman, M.; Kim, M.; Nakhla, G. Simultaneous nitrification-denitrifying phosphorus removal (SNDPR) at low DO for treating carbon-limited municipal wastewater. Sci. Total Environ. 2021, 760, 143387.
39. Séka, M.A.; Van de Wiele, T.; Verstraete, W. A test for predicting propensity of activated sludge to acute filamentous bulking. Water Environ. Res. 2001, 73, 237–242.
40. Badia, A.; Kim, M.; Nakhla, G.; Ray, M.B. Effect of COD/N ratio on denitrification from nitrite. Water Environ. Res. 2019, 91, 119–131.
41. Pi, K.W.; Chen, W.W.; Shi, Y.F.; Liu, D.F. Solidification and dewatering of phosphorus-rich river sediment using calcium-based polyferric sulfate. Gongye Anquan Yu Huanbao 2017, 43, 42–45.
42. Zeng, Y.; Shen, Y.; Lin, H.; Tan, Q.; Sun, J.; Shen, L.; Li, R.; Xu, Y.; Teng, J. A synergistic approach integrating potassium ferrate oxidation with polyacrylamide flocculation to enhance sludge dewatering and its mechanisms. J. Environ. Manag. 2025, 382, 125323.
43. Duan, Y.; Chen, X.; Houthooft, R.; Schulman, J.; Abbeel, P. Benchmarking deep reinforcement learning for continuous control. In Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016; pp. 1329–1338.
44. Fujimoto, S.; van Hoof, H.; Meger, D. Addressing function approximation error in actor-critic methods. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 1587–1596.
45. Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 1861–1870.
46. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv 2017, arXiv:1707.06347.
47. Achiam, J.; Held, D.; Tamar, A.; Abbeel, P. Constrained policy optimization. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 22–31.
48. Berkenkamp, F.; Turchetta, M.; Schoellig, A.P.; Krause, A. Safe model-based reinforcement learning with stability guarantees. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017.
49. Brunke, L.; Greeff, M.; Hall, A.W.; Yuan, Z.; Zhou, S.; Panerati, J.; Schoellig, A.P. Safe learning in robotics: From learning-based control to safe reinforcement learning. Annu. Rev. Control Robot. Auton. Syst. 2022, 5, 411–444.
50. Garcia, J.; Fernandez, F. A comprehensive survey on safe reinforcement learning. J. Mach. Learn. Res. 2015, 16, 1437–1480.

Figure 1. Overall Research Overview.

Figure 2. Data overview (key variable distributions).

Figure 3. Wastewater process flow and control-point schematic.

Figure 4. Virtual WWTP architecture.

Figure 5. Reward function components and penalty/bonus trigger logic.

Figure 6. Training diagnostics of PPO, TD3, and SAC in the virtual WWTP environment. (a) train/critic_loss; (b) train/explained_variance; (c) train/entropy_loss; (d) time/fps.

Table 1. Comparison of this study with existing RL-based wastewater control studies.

Aspect	Prior RL Studies (e.g., [2,4,19,27])	This Study
Control problem	Mostly BSM1 benchmark or A²O simulation	Real-plant historical data + MLP virtual WWTP
Action space	Discrete or simplified continuous	Two continuous dosing actions (PAM + polyferric sulfate)
Safety handling	Implicit or ignored	Explicit reward penalties for DO, SV30, COD/TN boundaries
Evaluation paradigm	Single-run reward comparison	5-fold CV + bootstrapping (95% CI) + pairwise statistical tests
Statistical rigor	Lacking or minimal	Bonferroni-corrected Mann–Whitney U test (p < 0.001)
Interpretability	Low (black-box policy)	Training diagnostics (critic loss, entropy, FPS) + dosing behavior analysis

Table 2. Descriptive statistics of operational data (maximum, minimum, mean, and standard deviation).

Category	Indicator	Unit	Max	Min	Mean	Std
Flow-related	Influent flow	m³/d	110,282	51,811	76,696.17	10,507.94
Flow-related	Effluent flow	m³/d	112,359	46,443	73,984.85	11,539.91
Influent quality	Influent COD	mg/L	480	90	296.98	89.13
Influent quality	Influent NH₃-N	mg/L	74	4.8	32.99	10.04
Influent quality	Influent TN	mg/L	75.2	11.8	35.55	8.69
Influent quality	Influent TP	mg/L	9.6	1.02	3.90	1.24
Influent quality	Influent pH		8.5	7	7.52	0.25
Influent quality	Influent SS	mg/L	95	23	58.64	15.20
Influent quality	Influent COD/TN	–	25.42	2.25	9.39	4.00
Sludge indicators	MLSS	mg/L	4359	1908	2947.47	499.07
Sludge indicators	SV30	%	49	28	40.81	3.98
Sludge indicators	SVI	ml/g	265	92	147.60	22.58
Process variables	Sludge production	t/d	162.22	0	74.86	37.44
Process variables	Specific energy consumption	kWh/m³	0.528	0.289	0.40	0.04
Process variables	DO	mg/L	21.79	0.58	5.32	4.75
Chemical dosing	PAM dosage	kg/d	0.3	0	0.17	0.06
Chemical dosing	Polyferric sulfate dosage	t/d	7.5	0	1.22	1.52
Effluent quality	Effluent COD	mg/L	35	14	19.30	3.67
Effluent quality	Effluent BOD	mg/L	3.3	1.4	2.30	0.44
Effluent quality	Effluent NH₃-N	mg/L	4.52	0.13	0.83	0.78
Effluent quality	Effluent TN	mg/L	12.6	3.21	8.39	1.57
Effluent quality	Effluent TP	mg/L	0.38	0.11	0.25	0.06
Effluent quality	Effluent SS	mg/L	5	2	3.78	0.78
Removal performance	COD removed	mg/L	460	63	296.11	90.07
Removal performance	TN removed	mg/L	69.1	3.2	27.81	10.74
Removal performance	TP removed	mg/L	9.22	0.65	3.72	1.33
Removal performance	SS removed	mg/L	44	23	36.19	3.87

Table 3. Sensitivity analysis of reward function coefficients.

Coefficient	Reward Range (−20% to +20% Change)	Impact Score
COD Coefficient	77.42 → 65.84	11.59
TN Coefficient	74.71 → 68.56	6.15
SS Coefficient	72.41 → 70.86	1.55
TP Coefficient	71.76 → 71.50	0.26

Note: The model is most sensitive to the COD coefficient, which aligns with the objective of reducing organic pollutant load.

Table 4. Policy performance comparison over 50 test episodes (Mean ± SD, 95% CI).

Method	Average Reward	Standard Deviation	95% CI	Improvement vs. Random
SAC	70.23	7.97	[67.94, 72.52]	+53.1%
TD3	70.33	8.04	[68.02, 72.64]	+53.3%
PPO	68.78	8.04	[66.47, 71.09]	+49.9%
Proportional	61.01	9.70	[58.23, 63.80]	+33.0%
Random	45.87	15.97	[41.29, 50.45]	+0.0%

Note: Standard deviations are computed across 50 independent test episodes under matched initial conditions. Proportional control refers to a dosage policy where PAM and polyferric sulfate are set proportionally to influent SS and TP, respectively, mimicking common industrial practice [33,34].
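The "Improvement vs. Random" column can be reproduced from the mean rewards as a relative gain over the random baseline. The short check below is a sketch: the random-baseline mean of 45.87 is taken from the abstract, and the helper name `improvement_vs_random` is hypothetical.

```python
# Mean rewards as reported in the abstract and Table 4.
mean_reward = {
    "SAC": 70.23,
    "TD3": 70.33,
    "PPO": 68.78,
    "Proportional": 61.01,
    "Random": 45.87,
}


def improvement_vs_random(method):
    """Relative gain (in percent) of a method's mean reward over random dosing."""
    base = mean_reward["Random"]
    return 100.0 * (mean_reward[method] - base) / base


improvement = {m: round(improvement_vs_random(m), 1) for m in mean_reward}
for method, pct in improvement.items():
    print(f"{method:12s} {pct:+.1f}%")
```

The rounded values match the table column, which suggests the percentages are computed relative to the random baseline rather than to the proportional controller.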

Table 5. Robustness validation under influent concentration shocks: Compliance rate, dosing cost, and average reward of SAC versus proportional control under normal, +30% shock, and +50% shock conditions.

Scenario	Control Strategy	Compliance Rate	Dosing Cost	Average Reward	Cost Saving vs. Proportional
Normal (1.0×)	SAC	100.0%	1.66	70.23	80.8%
	Proportional	84.0%	8.68	61.01	-
Spike +30%	SAC	88.0%	1.60	64.93	85.8%
	Proportional	80.0%	11.27	52.84	-
Spike +50%	SAC	84.0%	1.38	62.43	89.3%
	Proportional	78.0%	12.92	47.40	-
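The cost-saving column appears to follow from the two dosing-cost columns as 1 − (SAC cost / proportional cost). The sketch below assumes that formula, which is not stated explicitly in the table, and recomputes the savings from the rounded table entries; the helper name `cost_saving_pct` is hypothetical.

```python
# (SAC dosing cost, proportional dosing cost, reported saving in %) from Table 5.
scenarios = {
    "Normal (1.0x)": (1.66, 8.68, 80.8),
    "Spike +30%": (1.60, 11.27, 85.8),
    "Spike +50%": (1.38, 12.92, 89.3),
}


def cost_saving_pct(sac_cost, proportional_cost):
    """Assumed definition: relative reduction in dosing cost versus proportional control."""
    return 100.0 * (1.0 - sac_cost / proportional_cost)


recomputed = {
    name: cost_saving_pct(sac, prop) for name, (sac, prop, _) in scenarios.items()
}
for name, (_, _, reported) in scenarios.items():
    print(f"{name:14s} recomputed {recomputed[name]:.2f}%  reported {reported:.1f}%")
```

Because the inputs are rounded to two decimals, the recomputed values agree with the reported ones only to within about 0.1 percentage point.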

Table 6. Attribute-level comparison between the proposed method and existing technical routes.

Technical Route	Nonlinear Adaptation	Online Risk	Multi-Objective Coordination	Pre-Deployment Validation	Representative References
Heuristic method (fixed dosing)	Low	Low	Low	Low	[33,34]
APC (PID/MPC)	Medium	Medium	Medium	Medium	[8,13]
Online RL (model-free)	High	High	High	Low	[2,19]
This MBRL route (offline screening)	High	Low	High	High	[23,24,26]

Table 7. Post hoc Mann–Whitney U tests with Bonferroni correction.

Comparison Pair	Raw p-Value	Adjusted p-Value	Significance
SAC vs. Proportional	4.75 × 10⁻⁷	2.85 × 10⁻⁶	Highly significant
TD3 vs. Proportional	5.68 × 10⁻⁷	3.41 × 10⁻⁶	Highly significant
SAC vs. TD3	0.9204	1.0000	Not significant
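The adjusted p-values are consistent with a standard Bonferroni correction, in which each raw p-value is multiplied by the number of comparisons and capped at 1. In the sketch below, the number of comparisons (m = 6) is an assumption inferred from the ratio of adjusted to raw values in the table; it is not stated in the table itself.

```python
def bonferroni(p_values, m):
    """Bonferroni adjustment: multiply each p-value by m and cap at 1."""
    return [min(1.0, p * m) for p in p_values]


# Raw p-values from Table 7.
pairs = ["SAC vs. Proportional", "TD3 vs. Proportional", "SAC vs. TD3"]
raw_p = [4.75e-7, 5.68e-7, 0.9204]

# ASSUMPTION: six pairwise comparisons in total (inferred, not stated).
m = 6

adjusted = bonferroni(raw_p, m)
for pair, p_raw, p_adj in zip(pairs, raw_p, adjusted):
    print(f"{pair:22s} raw={p_raw:.3g}  adjusted={p_adj:.3g}")
```

Under this assumption the first two comparisons remain far below 0.001 after correction, while the SAC vs. TD3 comparison is capped at 1.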

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Zhang, Y.; Meng, D.; Ma, W. Model-Based Reinforcement Learning for Chemical Dosing Optimization in a Municipal Wastewater Treatment Plant: A Comparative Study of Three Actor–Critic Algorithms. Processes 2026, 14, 1800. https://doi.org/10.3390/pr14111800
