Model-Based Reinforcement Learning for Chemical Dosing Optimization in a Municipal Wastewater Treatment Plant: A Comparative Study of Three Actor–Critic Algorithms
Abstract
1. Introduction
2. Materials and Methods
2.1. Overall System Architecture
2.2. Data Source and Preprocessing
Feature Selection Validation for Activated Sludge Process
2.3. Wastewater Process Background (Activated Sludge)
2.4. Virtual WWTP Model
Prediction Model Stability and Multi-Algorithm Benchmark Assessment
2.5. Reward Function Design
2.6. Reinforcement Learning Algorithms and Training Setup
3. Results and Discussion
3.1. Benchmark Comparison of Policy Performance
Robustness and Safety Validation Under Influent Concentration Shocks
3.2. Comparison with Existing Technical Routes
3.3. Statistical Significance of Policy Comparisons
3.4. TensorBoard Training Dynamics Analysis
3.5. Interpretable Dosing-Behavior Analysis
3.6. Engineering Implications and Limitations
3.7. Expanded Discussion on Dataset Limitations and Generalizability
4. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations

| Abbreviation | Definition |
|---|---|
| WWTP | wastewater treatment plant |
| SRF | specific resistance to filtration |
| RL | reinforcement learning |
| MDP | Markov decision process |
| MLP | multi-layer perceptron |
| A2O | anaerobic–anoxic–oxic |
| APC | advanced process control |
| MPC | model predictive control |
| PID | proportional–integral–derivative |
| BSM1 | Benchmark Simulation Model No. 1 |
| ASM | activated sludge model |
| MSE | mean squared error |
| MAE | mean absolute error |

References
1. Croll, H.C.; Ikuma, K.; Ong, S.K.; Sarkar, S. Systematic performance evaluation of reinforcement learning algorithms applied to wastewater treatment control optimization. Environ. Sci. Technol. 2023, 57, 18382–18390.
2. Aponte-Rengifo, O.; Francisco, M.; Vilanova, R.; Vega, P.; Revollar, S. Intelligent control of wastewater treatment plants based on model-free deep reinforcement learning. Processes 2023, 11, 2269.
3. Hu, F.; Zhang, X.; Lu, B.; Lin, Y. Real-time control of A2O process in wastewater treatment through fast deep reinforcement learning based on data-driven simulation model. Water 2024, 16, 3710.
4. Nam, K.J.; Heo, S.K.; Kim, S.Y.; Yoo, C.K. A multi-agent AI reinforcement-based digital multi-solution for optimal operation of a full-scale wastewater treatment plant under various influent conditions. J. Water Process Eng. 2023, 52, 103533.
5. Nam, K.J.; Heo, S.K.; Tariq, S.; Woo, T.Y.; Yoo, C.K. Multi-agent reinforcement learning-enhanced autonomous calibration method for wastewater treatment modeling: Long-term validation of a full-scale plant. J. Water Process Eng. 2024, 59, 104908.
6. Nam, K.J.; Heo, S.K.; Yoo, C.K. Multi-agent reinforcement learning-driven adaptive controller tuning system for autonomous control of wastewater treatment plants: An offline learning approach. J. Water Process Eng. 2025, 70, 107059.
7. Zhu, Z.; Dong, S.; Zhang, H.; Parker, W.; Yin, R.; Bai, X.; Yu, Z.; Wang, J.; Gao, Y.; Ren, H. Bayesian optimization-enhanced reinforcement learning for self-adaptive and multi-objective control of wastewater treatment. Bioresour. Technol. 2025, 421, 132210.
8. Cairone, S.; Hasan, S.W.; Choo, K.H.; Lekkas, D.F.; Fortunato, L.; Zorpas, A.A.; Korshin, G.; Zarra, T.; Belgiorno, V.; Naddeo, V. Revolutionizing wastewater treatment toward circular economy and carbon neutrality goals: Pioneering sustainable and efficient solutions for automation and advanced process control with smart and cutting-edge technologies. J. Water Process Eng. 2024, 63, 105486.
9. Haimi, H.; Awaitey, A.; Kiran, A.; Larsson, T.; Blomberg, K.; Elvander, F.; Petäjä, E.; Mulas, M.; Sahlstedt, K.; Mikola, A. Integrating data-driven and process expertise in soft-sensor design for a wastewater treatment digital twin application. Water Sci. Technol. 2025, 92, 1308–1327.
10. Liu, W.; He, S.; Mou, J.; Xue, T.; Chen, H.; Xiong, W. Digital twins-based process monitoring for wastewater treatment processes. Reliab. Eng. Syst. Saf. 2023, 238, 109416.
11. Ma, Z.; Zhu, Y.; Chen, C.; Li, T.; Li, Y.; Li, X.; Wang, Y.; Waite, T.D.; Guan, J. Towards the digitalization of water treatment facilities: A case study on machine learning-enabled digital twins. J. Water Process Eng. 2025, 77, 108316.
12. Rodríguez-Alonso, C.; Peña-Regueiro, I.; García, Ó. Digital twin platform for the real-time monitoring and prediction of water and wastewater treatment plant systems. Sensors 2024, 24, 1568.
13. Wang, A.J.; Li, H.; He, Z.; Tao, Y.; Wang, H.; Yang, M.; Savic, D.; Daigger, G.T.; Ren, N. Digital twins for wastewater treatment: A technical review. Engineering 2024, 36, 21–35.
14. Chen, K.; Liang, J.; Wang, Y.; Tao, Y.; Lu, Y.; Wang, A. A global perspective on microbial risk factors in effluents of wastewater treatment plants. J. Environ. Sci. 2024, 138, 227–235.
15. Gao, C.; Yang, F.; Tian, Z.; Sun, D.; Liu, W.; Peng, Y. Pathways of inhibition of filamentous sludge bulking by slowly biodegradable organic compounds. J. Environ. Sci. 2025, 150, 104–115.
16. Kuang, L.; Liu, R.; Jin, M.; Lan, Y.; Su, Y.; Zhao, Y.; Chen, L. Characterization and recognition of three-dimensional excitation-emission matrix spectra of wastewater from six typical categories. J. Environ. Sci. 2025, 157, 206–219.
17. Li, Z.; Qi, R.; Wang, B.; Zou, Z.; Wei, G.; Yang, M. Cost-performance analysis of nutrient removal in a full-scale oxidation ditch process based on kinetic modeling. J. Environ. Sci. 2013, 25, 26–32.
18. Li, W.; Xia, Y.; Li, N.; Chang, J.; Liu, J.; Wang, P.; He, X. Temporal assembly patterns of microbial communities in three parallel bioreactors treating low-concentration coking wastewater with differing carbon source concentrations. J. Environ. Sci. 2024, 137, 455–468.
19. Croll, H.C.; Ikuma, K.; Ong, S.K.; Sarkar, S. Reinforcement learning applied to wastewater treatment process control optimization: Approaches, challenges, and path forward. Crit. Rev. Environ. Sci. Technol. 2023, 53, 1775–1794.
20. Alnimer, A.A.; Smith, D.S.; Parker, W.J. Insight into direct phosphorus release from simulated wastewater ferric sludge: Influence of physiochemical factors. J. Environ. Chem. Eng. 2023, 11, 110259.
21. Huang, J.; Zhang, L. Safe reinforcement learning for wastewater treatment with an input convex safety critic. Desalin. Water Treat. 2025, 324, 101451.
22. Xiao, A.; Yu, J.; Lin, Z.; Cao, M.; Jian, S.; Lin, S.; Zhou, J. Inhibition of ferric salts on phosphorus-accumulating organisms in simultaneous chemical precipitation for phosphorus removal. Front. Microbiol. 2025, 16, 1681450.
23. Chua, K.; Calandra, R.; McAllister, R.; Levine, S. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Proceedings of the Advances in Neural Information Processing Systems, Montréal, QC, Canada, 3–8 December 2018.
24. Deisenroth, M.P.; Rasmussen, C.E. PILCO: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on Machine Learning, Bellevue, WA, USA, 28 June–2 July 2011; pp. 465–472.
25. Deisenroth, M.P.; Fox, D.; Rasmussen, C.E. Gaussian processes for data-efficient learning in robotics and control. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 408–423.
26. Janner, M.; Fu, J.; Zhang, M.; Levine, S. When to trust your model: Model-based policy optimization. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019.
27. Khurshid, A.; Pani, A.K. A review on machine learning in wastewater treatment applications: Focus on model evaluation and analysis of BSM1 benchmark simulation dataset. Environ. Monit. Assess. 2023, 195, 916.
28. Kidambi, R.; Rajeswaran, A.; Netrapalli, P.; Joachims, T. MOReL: Model-based offline reinforcement learning. In Proceedings of the Advances in Neural Information Processing Systems, Online, 6–12 December 2020.
29. Kumar, A.; Zhou, A.; Tucker, G.; Levine, S. Conservative Q-learning for offline reinforcement learning. In Proceedings of the Advances in Neural Information Processing Systems, Online, 6–12 December 2020.
30. GB 18918-2002; Discharge Standard of Pollutants for Municipal Wastewater Treatment Plant (GB 18918-2002, with 2006 and 2025 Amendment Sheets). Ministry of Ecology and Environment of the People’s Republic of China; State Administration for Market Regulation; Standards Press of China: Beijing, China, 2002.
31. Alex, J.; Benedetti, L.; Copp, J.; Gernaey, K.V.; Steyer, J.P. Benchmark Simulation Model No. 1 (BSM1); Lund University: Lund, Sweden, 2008.
32. Wu, Z.; Zheng, K.; Zhang, G.; Huang, L.; Zhou, S. Preparation of polysulfone-based nanofiber Janus membrane for membrane distillation containing organic pollutants. npj Clean Water 2024, 7, 51.
33. Dimoglo, A.; Sevim-Elibol, P.; Dinç, Ö.; Gökmen, K.; Erdoğan, H. Electrocoagulation/electroflotation as a combined process for the laundry wastewater purification and reuse. J. Water Process Eng. 2019, 31, 100877.
34. Xu, H.; Wei, S.; Li, G.; Guo, B. Advanced removal of phosphorus from urban sewage using chemical precipitation by Fe-Al composite coagulants. Sci. Rep. 2024, 14, 4918.
35. Guyon, I.; Elisseeff, A. An introduction to variable and feature selection. J. Mach. Learn. Res. 2003, 3, 1157–1182.
36. Abdoli, S.; Asgari Lajayer, B.; Dehghanian, Z.; Bagheri, N.; Vafaei, A.H.; Chamani, M.; Rani, S.; Lin, Z.; Shu, W.; Price, G.W. A review of the efficiency of phosphorus removal and recovery from wastewater by physicochemical and biological processes: Challenges and opportunities. Water 2024, 16, 2507.
37. Picioreanu, C.; Pérez, J.; van Loosdrecht, M.C.M. Impact of cell cluster size on apparent half-saturation coefficients for oxygen in nitrifying sludge and biofilms. Water Res. 2016, 106, 371–382.
38. Zaman, M.; Kim, M.; Nakhla, G. Simultaneous nitrification-denitrifying phosphorus removal (SNDPR) at low DO for treating carbon-limited municipal wastewater. Sci. Total Environ. 2021, 760, 143387.
39. Séka, M.A.; Van de Wiele, T.; Verstraete, W. A test for predicting propensity of activated sludge to acute filamentous bulking. Water Environ. Res. 2001, 73, 237–242.
40. Badia, A.; Kim, M.; Nakhla, G.; Ray, M.B. Effect of COD/N ratio on denitrification from nitrite. Water Environ. Res. 2019, 91, 119–131.
41. Pi, K.W.; Chen, W.W.; Shi, Y.F.; Liu, D.F. Solidification and dewatering of phosphorus-rich river sediment using calcium-based polyferric sulfate. Gongye Anquan Yu Huanbao 2017, 43, 42–45.
42. Zeng, Y.; Shen, Y.; Lin, H.; Tan, Q.; Sun, J.; Shen, L.; Li, R.; Xu, Y.; Teng, J. A synergistic approach integrating potassium ferrate oxidation with polyacrylamide flocculation to enhance sludge dewatering and its mechanisms. J. Environ. Manag. 2025, 382, 125323.
43. Duan, Y.; Chen, X.; Houthooft, R.; Schulman, J.; Abbeel, P. Benchmarking deep reinforcement learning for continuous control. In Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016; pp. 1329–1338.
44. Fujimoto, S.; van Hoof, H.; Meger, D. Addressing function approximation error in actor-critic methods. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 1587–1596.
45. Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 1861–1870.
46. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv 2017, arXiv:1707.06347.
47. Achiam, J.; Held, D.; Tamar, A.; Abbeel, P. Constrained policy optimization. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 22–31.
48. Berkenkamp, F.; Turchetta, M.; Schoellig, A.P.; Krause, A. Safe model-based reinforcement learning with stability guarantees. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017.
49. Brunke, L.; Greeff, M.; Hall, A.W.; Yuan, Z.; Zhou, S.; Panerati, J.; Schoellig, A.P. Safe learning in robotics: From learning-based control to safe reinforcement learning. Annu. Rev. Control Robot. Auton. Syst. 2022, 5, 411–444.
50. Garcia, J.; Fernandez, F. A comprehensive survey on safe reinforcement learning. J. Mach. Learn. Res. 2015, 16, 1437–1480.






| Aspect | Prior RL Studies (e.g., [2,4,19,27]) | This Study |
|---|---|---|
| Control problem | Mostly BSM1 benchmark or A2O simulation | Real-plant historical data + MLP virtual WWTP |
| Action space | Discrete or simplified continuous | Two continuous dosing actions (PAM + polyferric sulfate) |
| Safety handling | Implicit or ignored | Explicit reward penalties for DO, SV30, COD/TN boundaries |
| Evaluation paradigm | Single-run reward comparison | 5-fold CV + bootstrapping (95% CI) + pairwise statistical tests |
| Statistical rigor | Lacking or minimal | Bonferroni-corrected Mann–Whitney U test (p < 0.001) |
| Interpretability | Low (black-box policy) | Training diagnostics (critic loss, entropy, FPS) + dosing behavior analysis |
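
The evaluation paradigm listed for this study includes bootstrapped 95% confidence intervals on policy reward. As a minimal sketch of what a percentile bootstrap on per-episode rewards could look like, the snippet below uses only the Python standard library. The episode rewards are synthetic placeholders, and the function name `bootstrap_ci`, the number of resamples, and the sample size of 50 are assumptions for illustration, not settings taken from the study.

```python
import random
import statistics


def bootstrap_ci(samples, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `samples`."""
    rng = random.Random(seed)
    n = len(samples)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(samples, k=n)) for _ in range(n_boot)
    )
    lower = means[int((alpha / 2) * n_boot)]
    upper = means[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical per-episode rewards (synthetic, for illustration only).
data_rng = random.Random(42)
episode_rewards = [data_rng.gauss(70.0, 8.0) for _ in range(50)]

lo, hi = bootstrap_ci(episode_rewards)
print(f"mean = {statistics.fmean(episode_rewards):.2f}, 95% CI = [{lo:.2f}, {hi:.2f}]")
```

The percentile method is only one of several bootstrap variants; it is used here because it needs no distributional assumption about the rewards.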

| Category | Indicator | Unit | Max | Min | Mean | Std |
|---|---|---|---|---|---|---|
| Flow-related | Influent flow | m³/d | 110,282 | 51,811 | 76,696.17 | 10,507.94 |
| Flow-related | Effluent flow | m³/d | 112,359 | 46,443 | 73,984.85 | 11,539.91 |
| Influent quality | Influent COD | mg/L | 480 | 90 | 296.98 | 89.13 |
| Influent quality | Influent NH3-N | mg/L | 74 | 4.8 | 32.99 | 10.04 |
| Influent quality | Influent TN | mg/L | 75.2 | 11.8 | 35.55 | 8.69 |
| Influent quality | Influent TP | mg/L | 9.6 | 1.02 | 3.90 | 1.24 |
| Influent quality | Influent pH | – | 8.5 | 7 | 7.52 | 0.25 |
| Influent quality | Influent SS | mg/L | 95 | 23 | 58.64 | 15.20 |
| Influent quality | Influent COD/TN | – | 25.42 | 2.25 | 9.39 | 4.00 |
| Sludge indicators | MLSS | mg/L | 4359 | 1908 | 2947.47 | 499.07 |
| Sludge indicators | SV30 | % | 49 | 28 | 40.81 | 3.98 |
| Sludge indicators | SVI | mL/g | 265 | 92 | 147.60 | 22.58 |
| Process variables | Sludge production | t/d | 162.22 | 0 | 74.86 | 37.44 |
| Process variables | Specific energy consumption | kWh/m³ | 0.528 | 0.289 | 0.40 | 0.04 |
| Process variables | DO | mg/L | 21.79 | 0.58 | 5.32 | 4.75 |
| Chemical dosing | PAM dosage | kg/d | 0.3 | 0 | 0.17 | 0.06 |
| Chemical dosing | Polyferric sulfate dosage | t/d | 7.5 | 0 | 1.22 | 1.52 |
| Effluent quality | Effluent COD | mg/L | 35 | 14 | 19.30 | 3.67 |
| Effluent quality | Effluent BOD | mg/L | 3.3 | 1.4 | 2.30 | 0.44 |
| Effluent quality | Effluent NH3-N | mg/L | 4.52 | 0.13 | 0.83 | 0.78 |
| Effluent quality | Effluent TN | mg/L | 12.6 | 3.21 | 8.39 | 1.57 |
| Effluent quality | Effluent TP | mg/L | 0.38 | 0.11 | 0.25 | 0.06 |
| Effluent quality | Effluent SS | mg/L | 5 | 2 | 3.78 | 0.78 |
| Removal performance | COD removed | mg/L | 460 | 63 | 296.11 | 90.07 |
| Removal performance | TN removed | mg/L | 69.1 | 3.2 | 27.81 | 10.74 |
| Removal performance | TP removed | mg/L | 9.22 | 0.65 | 3.72 | 1.33 |
| Removal performance | SS removed | mg/L | 44 | 23 | 36.19 | 3.87 |

| Coefficient | Reward at −20% Change | Reward at +20% Change | Impact Score |
|---|---|---|---|
| COD Coefficient | 77.42 | 65.84 | 11.59 |
| TN Coefficient | 74.71 | 68.56 | 6.15 |
| SS Coefficient | 72.41 | 70.86 | 1.55 |
| TP Coefficient | 71.76 | 71.50 | 0.26 |

Analysis: The reward is most sensitive to the COD coefficient, which aligns with the objective of reducing organic pollutant load.
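
The impact scores in the sensitivity table match the width of the reward range between the −20% and +20% coefficient settings. The sketch below reproduces that arithmetic for the COD, TN, and SS rows; treating the impact score as the absolute difference of the two range endpoints is an assumption inferred from the tabulated values, and the names `reward_range` and `impact_score` are illustrative.

```python
# Reward at the -20% and +20% coefficient settings, copied from the table.
reward_range = {
    "COD": (77.42, 65.84),
    "TN": (74.71, 68.56),
    "SS": (72.41, 70.86),
}


def impact_score(reward_at_minus20, reward_at_plus20):
    """Assumed definition: width of the reward range over the perturbation."""
    return abs(reward_at_minus20 - reward_at_plus20)


scores = {name: round(impact_score(*pair), 2) for name, pair in reward_range.items()}

# Rank coefficients from most to least influential on the reward.
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores)
print(ranking)
```

The COD difference computed from the rounded endpoints is 11.58, against 11.59 in the table, which is consistent with the table having been computed from unrounded rewards.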

| Method | Average Reward | Standard Deviation | 95% CI | Improvement vs. Random |
|---|---|---|---|---|
| SAC | 70.23 | 7.97 | [67.94, 72.52] | +53.1% |
| TD3 | 70.33 | 8.04 | [68.02, 72.64] | +53.3% |
| PPO | 68.78 | 8.04 | [66.47, 71.09] | +49.9% |
| Proportional | 61.01 | 9.70 | [58.23, 63.80] | +33.0% |
| Random | c | 15.97 | [47.29, 50.45] | +0.0% |
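
The "Improvement vs. Random" column can be read as a relative gain over the random policy's average reward. Under that assumption, (reward − random) / random, each listed row can be inverted to recover the baseline it implies. The sketch below does this for the four non-random rows; the formula and the helper name `implied_baseline` are assumptions, not definitions given in the table.

```python
# Average rewards and listed improvements, copied from the benchmark table.
average_reward = {"SAC": 70.23, "TD3": 70.33, "PPO": 68.78, "Proportional": 61.01}
improvement_pct = {"SAC": 53.1, "TD3": 53.3, "PPO": 49.9, "Proportional": 33.0}


def implied_baseline(reward, pct):
    """Invert pct = 100 * (reward - baseline) / baseline for the baseline."""
    return reward / (1.0 + pct / 100.0)


implied = {
    name: implied_baseline(average_reward[name], improvement_pct[name])
    for name in average_reward
}
for name, value in implied.items():
    print(f"{name}: implied random-policy reward = {value:.2f}")
```

All four rows imply a random-policy average reward of roughly 45.9, so the percentages are mutually consistent under this reading. That value lies below the 95% CI listed for the Random row, which is worth checking against the original data.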

| Scenario | Control Strategy | Compliance Rate | Dosing Cost | Average Reward | Cost Saving vs. Proportional |
|---|---|---|---|---|---|
| Normal (1.0×) | SAC | 100.0% | 1.66 | 70.23 | 80.8% |
| Normal (1.0×) | Proportional | 84.0% | 8.68 | 61.01 | – |
| Spike +30% | SAC | 88.0% | 1.60 | 64.93 | 85.8% |
| Spike +30% | Proportional | 80.0% | 11.27 | 52.84 | – |
| Spike +50% | SAC | 84.0% | 1.38 | 62.43 | 89.3% |
| Spike +50% | Proportional | 78.0% | 12.92 | 47.40 | – |
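
The cost-saving column in the shock-scenario table follows from the two dosing costs in each scenario. A short sketch of that arithmetic, assuming the saving is defined as 1 − (SAC cost / proportional cost), is shown below; the dictionary layout and the function name `cost_saving_pct` are illustrative.

```python
# (SAC dosing cost, proportional dosing cost) per scenario, from the table.
dosing_cost = {
    "Normal (1.0x)": (1.66, 8.68),
    "Spike +30%": (1.60, 11.27),
    "Spike +50%": (1.38, 12.92),
}


def cost_saving_pct(policy_cost, baseline_cost):
    """Relative dosing-cost reduction of the policy versus the baseline, in %."""
    return 100.0 * (1.0 - policy_cost / baseline_cost)


savings = {name: cost_saving_pct(*costs) for name, costs in dosing_cost.items()}
for name, value in savings.items():
    print(f"{name}: {value:.1f}% saving vs. proportional")
```

The computed savings are about 80.9%, 85.8%, and 89.3%, which agrees with the tabulated values to within rounding. The saving grows with the influent spike because the proportional controller's cost rises with load while the SAC dosing cost does not.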

| Technical Route | Nonlinear Adaptation | Online Risk | Multi-Objective Coordination | Pre-Deployment Validation | Representative References |
|---|---|---|---|---|---|
| Heuristic method (fixed dosing) | Low | Low | Low | Low | [33,34] |
| APC (PID/MPC) | Medium | Medium | Medium | Medium | [8,13] |
| Online RL (model-free) | High | High | High | Low | [2,19] |
| This MBRL route (offline screening) | High | Low | High | High | [23,24,26] |

| Comparison Pair | Raw p-Value | Adjusted p-Value | Significance |
|---|---|---|---|
| SAC vs. Proportional | 4.75 × 10⁻⁷ | 2.85 × 10⁻⁶ | Highly significant |
| TD3 vs. Proportional | 5.68 × 10⁻⁷ | 3.41 × 10⁻⁶ | Highly significant |
| SAC vs. TD3 | 0.9204 | 1.0000 | Not significant |
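
The adjusted p-values above are consistent with a Bonferroni correction, in which each raw p-value is multiplied by the number of comparisons and capped at 1. The sketch below reproduces the three listed rows. The family size `m = 6` is an assumption inferred from the ratio of adjusted to raw values in the table; it is not stated alongside the table.

```python
# Raw p-values copied from the table of pairwise comparisons.
raw_p = {
    "SAC vs. Proportional": 4.75e-7,
    "TD3 vs. Proportional": 5.68e-7,
    "SAC vs. TD3": 0.9204,
}

# Assumed number of comparisons in the family (inferred, not stated).
m = 6


def bonferroni(p, n_comparisons):
    """Bonferroni-adjusted p-value, capped at 1.0."""
    return min(1.0, p * n_comparisons)


adjusted = {pair: bonferroni(p, m) for pair, p in raw_p.items()}
for pair, p_adj in adjusted.items():
    print(f"{pair}: adjusted p = {p_adj:.3g}")
```

Under this assumption the two comparisons against the proportional controller remain far below 0.001 after correction, while SAC and TD3 are statistically indistinguishable.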

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Zhang, Y.; Meng, D.; Ma, W. Model-Based Reinforcement Learning for Chemical Dosing Optimization in a Municipal Wastewater Treatment Plant: A Comparative Study of Three Actor–Critic Algorithms. Processes 2026, 14, 1800. https://doi.org/10.3390/pr14111800

