Unveiling Surface Water Quality and Key Influencing Factors in China Using a Machine Learning Approach
Abstract
1. Introduction
2. Materials and Methods
2.1. Data Collection and Preprocessing
2.2. Machine Learning Modeling
2.3. Model Evaluation
2.4. Feature Importance Analysis
2.4.1. Model-Specific Feature Importance
2.4.2. SHAP Analysis for Model Interpretability
3. Results and Discussion
3.1. Data Statistical Results and Visualization
3.2. Comparison of Model Performance
3.3. Feature Importance
3.4. Shapley Additive Explanations (SHAP)
4. Environmental Implications
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Parameter | Unit | Class I | Class II | Class III | Class IV | Class V |
---|---|---|---|---|---|---|
Temperature | (°C) | Weekly average temperature change ≤ ±2 °C (all classes) | | | | |
pH | (-) | 6–9 (all classes) | | | | |
DO | (mg/L) | ≥7.5 | ≥6 | ≥5 | ≥3 | ≥2 |
CODMn | (mg/L) | ≤2 | ≤4 | ≤6 | ≤10 | ≤15 |
NH3-N | (mg/L) | ≤0.15 | ≤0.5 | ≤1.0 | ≤1.5 | ≤2.0 |
TP | (mg/L) | ≤0.02 | ≤0.1 | ≤0.2 | ≤0.3 | ≤0.4 |
TN | (mg/L) | ≤0.2 | ≤0.5 | ≤1.0 | ≤1.5 | ≤2.0 |
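
The class limits in the table above can be applied mechanically to a single water sample. The sketch below is a minimal illustration, not the authors' labelling procedure: it assumes a "worst parameter decides" rule (the overall class is the poorest class reached by any measured parameter), and the names `LIMITS`, `parameter_class`, and `overall_class` are hypothetical. Which parameters enter the assessment is also an assumption, so the function only grades the parameters present in the sample.

```python
# Class limits (mg/L) copied from the table above, ordered Class I..V.
# DO is a lower bound (higher is better); the others are upper bounds.
CLASSES = ["I", "II", "III", "IV", "V"]
LIMITS = {
    "DO": [7.5, 6, 5, 3, 2],
    "CODMn": [2, 4, 6, 10, 15],
    "NH3-N": [0.15, 0.5, 1.0, 1.5, 2.0],
    "TP": [0.02, 0.1, 0.2, 0.3, 0.4],
    "TN": [0.2, 0.5, 1.0, 1.5, 2.0],
}
LOWER_BOUND = {"DO"}


def parameter_class(name, value):
    """Return 0..4 for Class I..V, or 5 if the value fails even Class V."""
    for idx, limit in enumerate(LIMITS[name]):
        meets = value >= limit if name in LOWER_BOUND else value <= limit
        if meets:
            return idx
    return 5


def overall_class(sample):
    """Assumed rule: the worst single-parameter class sets the overall class."""
    worst = max(parameter_class(n, v) for n, v in sample.items() if n in LIMITS)
    return CLASSES[worst] if worst < 5 else "worse than V"


sample = {"DO": 8.0, "CODMn": 3.0, "NH3-N": 0.1, "TP": 0.05}
print(overall_class(sample))  # CODMn and TP both fall in Class II
```

Under this assumed rule a single poor parameter dominates the result, so the choice of graded parameters strongly affects the class distribution of a dataset.
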
Model | Hyperparameter | Search Range | Optimal Value |
---|---|---|---|
Logistic Regression | C | [0.01, 100] | 1.0 |
Logistic Regression | max_iter | [100, 2000] | 1000 |
Logistic Regression | solver | [liblinear, lbfgs] | lbfgs |
Random Forest | n_estimators | [10, 200] | 40 |
Random Forest | max_depth | [3, 20] | 10 |
Random Forest | min_samples_split | [2, 20] | 2 |
Random Forest | min_samples_leaf | [1, 10] | 1 |
CatBoost | n_estimators | [50, 300] | 50 |
CatBoost | learning_rate | [0.01, 0.3] | 0.1 |
CatBoost | depth | [3, 10] | 6 |
CatBoost | l2_leaf_reg | [1, 10] | 3 |
XGBoost | n_estimators | [50, 300] | 100 |
XGBoost | learning_rate | [0.01, 0.3] | 0.1 |
XGBoost | max_depth | [3, 10] | 6 |
XGBoost | subsample | [0.6, 1.0] | 0.8 |
MLP | hidden_layer_sizes | [(50), (100, 50)] | (100, 50) |
MLP | learning_rate_init | [0.001, 0.1] | 0.001 |
MLP | alpha | [0.0001, 0.01] | 0.0001 |
MLP | max_iter | [200, 1000] | 500 |
GBDT | n_estimators | [50, 200] | 50 |
GBDT | learning_rate | [0.01, 0.3] | 0.1 |
GBDT | max_depth | [3, 10] | 3 |
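
A quick sanity check on a hyperparameter table like the one above is to confirm that every selected value lies inside its search range and to flag values that sit exactly on a range boundary. The sketch below is illustrative only: the `SEARCH` dictionary and the `on_boundary` helper are hypothetical names, the numbers are copied from the table, and the two categorical entries (`solver`, `hidden_layer_sizes`) are left out because they have no numeric range.

```python
# (model, hyperparameter): (low, high, optimal), copied from the table above.
# Categorical entries (solver, hidden_layer_sizes) are omitted from this check.
SEARCH = {
    ("Logistic Regression", "C"): (0.01, 100, 1.0),
    ("Logistic Regression", "max_iter"): (100, 2000, 1000),
    ("Random Forest", "n_estimators"): (10, 200, 40),
    ("Random Forest", "max_depth"): (3, 20, 10),
    ("Random Forest", "min_samples_split"): (2, 20, 2),
    ("Random Forest", "min_samples_leaf"): (1, 10, 1),
    ("CatBoost", "n_estimators"): (50, 300, 50),
    ("CatBoost", "learning_rate"): (0.01, 0.3, 0.1),
    ("CatBoost", "depth"): (3, 10, 6),
    ("CatBoost", "l2_leaf_reg"): (1, 10, 3),
    ("XGBoost", "n_estimators"): (50, 300, 100),
    ("XGBoost", "learning_rate"): (0.01, 0.3, 0.1),
    ("XGBoost", "max_depth"): (3, 10, 6),
    ("XGBoost", "subsample"): (0.6, 1.0, 0.8),
    ("MLP", "learning_rate_init"): (0.001, 0.1, 0.001),
    ("MLP", "alpha"): (0.0001, 0.01, 0.0001),
    ("MLP", "max_iter"): (200, 1000, 500),
    ("GBDT", "n_estimators"): (50, 200, 50),
    ("GBDT", "learning_rate"): (0.01, 0.3, 0.1),
    ("GBDT", "max_depth"): (3, 10, 3),
}


def on_boundary(search):
    """List the entries whose optimal value equals the low or high end of the range."""
    return [key for key, (low, high, best) in search.items() if best in (low, high)]


# Every tabulated optimum should fall inside its range.
assert all(low <= best <= high for low, high, best in SEARCH.values())

for model, name in on_boundary(SEARCH):
    print(f"{model}: {name} sits on the edge of its search range")
```

A boundary hit does not by itself show that a range was too narrow; it only marks where widening the range might be worth a second look.
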
Statistic | Temperature (°C) | pH (-) | DO (mg/L) | Conductivity (μS/cm) | Turbidity (NTU) | CODMn (mg/L) | NH3-N (mg/L) | TP (mg/L) | TN (mg/L) |
---|---|---|---|---|---|---|---|---|---|
Central Tendency | 9.40 (9.10) | 7.93 (8.00) | 10.96 (10.80) | 677.41 (465.10) | 16.47 (8.80) | 2.67 (2.20) | 0.138 (0.040) | 0.050 (0.039) | 3.30 (2.21) |
Variability | 5.25 [5.30–12.60] | 0.55 [8.00–8.00] | 2.46 [9.40–12.30] | 655.00 [292.00–806.35] | 20.80 [4.40–18.80] | 1.84 [1.30–3.70] | 0.207 [0.020–0.150] | 0.042 [0.020–0.066] | 3.06 [1.37–4.04] |
Range | 0.01–32.70 | 4.00–9.00 | 0.10–22.20 | 0.0002–3146.04 | 0.01–95.45 | 0.20–11.30 | 0.020–0.970 | 0.005–0.244 | 0.05–17.21 |
Model | Metric | Class I | Class II | Class III | Class IV | Class V | Macro Avg | Weighted Avg |
---|---|---|---|---|---|---|---|---|
XGBoost | Precision | 0.9955 | 0.9953 | 0.9937 | 0.9431 | 0.9075 | 0.9670 | 0.9903 |
XGBoost | Recall | 1.0000 | 0.9970 | 0.9866 | 0.9565 | 0.8561 | 0.9593 | 0.9904 |
XGBoost | F1-Score | 0.9977 | 0.9962 | 0.9902 | 0.9498 | 0.8811 | 0.9630 | 0.9903 |
XGBoost | Accuracy | 0.9904 (overall) | | | | | | |
Random Forest | Precision | 0.9799 | 0.9680 | 0.9844 | 0.9218 | 0.9478 | 0.9604 | 0.9707 |
Random Forest | Recall | 1.0000 | 0.9927 | 0.9336 | 0.8824 | 0.7712 | 0.9160 | 0.9707 |
Random Forest | F1-Score | 0.9899 | 0.9802 | 0.9583 | 0.9017 | 0.8505 | 0.9361 | 0.9703 |
Random Forest | Accuracy | 0.9707 (overall) | | | | | | |
CatBoost | Precision | 0.9752 | 0.9667 | 0.9771 | 0.8767 | 0.9007 | 0.9393 | 0.9640 |
CatBoost | Recall | 1.0000 | 0.9909 | 0.9299 | 0.8517 | 0.6203 | 0.8786 | 0.9645 |
CatBoost | F1-Score | 0.9874 | 0.9787 | 0.9529 | 0.8641 | 0.7346 | 0.9035 | 0.9635 |
CatBoost | Accuracy | 0.9645 (overall) | | | | | | |
MLP | Precision | 0.9661 | 0.9888 | 0.9640 | 0.9027 | 0.8377 | 0.9319 | 0.9714 |
MLP | Recall | 0.9977 | 0.9799 | 0.9676 | 0.8724 | 0.8278 | 0.9291 | 0.9714 |
MLP | F1-Score | 0.9817 | 0.9843 | 0.9658 | 0.8873 | 0.8327 | 0.9304 | 0.9713 |
MLP | Accuracy | 0.9714 (overall) | | | | | | |
GBDT | Precision | 0.9767 | 0.9699 | 0.9801 | 0.8995 | 0.8952 | 0.9443 | 0.9679 |
GBDT | Recall | 1.0000 | 0.9913 | 0.9350 | 0.8674 | 0.7052 | 0.8998 | 0.9682 |
GBDT | F1-Score | 0.9882 | 0.9805 | 0.9570 | 0.8832 | 0.7889 | 0.9196 | 0.9676 |
GBDT | Accuracy | 0.9682 (overall) | | | | | | |
Logistic Regression | Precision | 0.8619 | 0.8564 | 0.7201 | 0.6490 | 0.6753 | 0.7525 | 0.8123 |
Logistic Regression | Recall | 0.8801 | 0.8870 | 0.7293 | 0.4968 | 0.2453 | 0.6477 | 0.8169 |
Logistic Regression | F1-Score | 0.8709 | 0.8714 | 0.7247 | 0.5628 | 0.3599 | 0.6779 | 0.8120 |
Logistic Regression | Accuracy | 0.8169 (overall) | | | | | | |
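
The averaged columns in the table above follow from the per-class columns. As a hedged illustration (the helper names `f1` and `macro_average` are hypothetical, and the inputs are the rounded XGBoost values from the table), the sketch below recomputes the per-class F1-score as the harmonic mean of precision and recall and the macro average as the unweighted mean over the five classes. The weighted average cannot be recomputed here because it needs the number of samples per class, which the table does not report.

```python
def f1(precision, recall):
    """F1-score: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def macro_average(values):
    """Macro average: unweighted mean over classes, so each class counts equally."""
    return sum(values) / len(values)


# XGBoost per-class values for Class I..V, copied from the table above.
xgb_precision = [0.9955, 0.9953, 0.9937, 0.9431, 0.9075]
xgb_recall = [1.0000, 0.9970, 0.9866, 0.9565, 0.8561]

xgb_f1 = [f1(p, r) for p, r in zip(xgb_precision, xgb_recall)]

print(f"macro precision: {macro_average(xgb_precision):.4f}")
print(f"macro recall:    {macro_average(xgb_recall):.4f}")
print(f"macro F1:        {macro_average(xgb_f1):.4f}")
```

Because the macro average weights every class equally, the weaker Class V scores pull it below the weighted average, which is dominated by the better-predicted classes.
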
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Li, Y.; Liu, L.; Cheng, L.; Shan, Y. Unveiling Surface Water Quality and Key Influencing Factors in China Using a Machine Learning Approach. Sustainability 2025, 17, 9205. https://doi.org/10.3390/su17209205