Analysis of Key Influencing Factors of Water Quality in Tai Lake Basin Based on XGBoost-SHAP
Abstract
1. Introduction
2. Materials and Methods
2.1. Method
2.1.1. XGBoost
2.1.2. SHAP
2.1.3. XGBoost-SHAP Model
2.2. Data Source and Preprocessing
2.2.1. Study Area
2.2.2. Data Description and Preparation
3. Results and Discussion
3.1. Model Parameter Selection and Accuracy Evaluation
3.2. Analysis of Influencing Factors on Water Quality
3.3. Analysis of Influencing Factors on Seasonal Water Quality
3.4. Dependency Graph of Water Quality Influencing Factors
3.5. Seasonal Dependency Graphs of Water Quality Influencing Factors
4. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
Index | Variable | Unit | Mean | Standard Deviation | Minimum Value | Median | Maximum Value |
---|---|---|---|---|---|---|---|
X1 | Water temperature | °C | 19.73 | 8.17 | 1.40 | 21.15 | 36.30 |
X2 | pH | Dimensionless | 7.66 | 0.36 | 6.49 | 7.63 | 8.75 |
X3 | DO | mg/L | 7.24 | 2.48 | 0.01 | 7.39 | 13.25 |
X4 | CODMn | mg/L | 3.24 | 1.02 | 0.25 | 3.16 | 7.33 |
X5 | NH3-N | mg/L | 0.15 | 0.14 | 0.03 | 0.10 | 0.84 |
X6 | TP | mg/L | 0.09 | 0.05 | 0.01 | 0.09 | 0.30 |
X7 | TN | mg/L | 2.16 | 1.00 | 0.05 | 2.08 | 5.82 |
X8 | Conductivity | μS/cm | 469.73 | 127.41 | 118.33 | 462.93 | 956.21 |
X9 | Turbidity | NTU | 47.96 | 28.80 | 0.01 | 44.51 | 145.16 |
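The summary statistics in the table above can be reproduced for any monitored variable with a few lines of standard-library Python. The following is only a minimal sketch: the readings are made-up values for illustration, the helper name `summarize` is hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption, since the source does not state which form was used.

```python
import statistics


def summarize(values):
    """Return the per-variable statistics listed in the table columns."""
    return {
        "mean": statistics.mean(values),
        # Sample standard deviation (n - 1); assumed, not stated in the source.
        "std": statistics.stdev(values),
        "min": min(values),
        "median": statistics.median(values),
        "max": max(values),
    }


# Made-up water temperature readings in °C, for illustration only.
water_temperature = [1.4, 12.0, 21.0, 21.3, 36.3]

summary = summarize(water_temperature)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```
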
Parameter | Meaning | Range | Selected Value |
---|---|---|---|
n_estimators | Number of boosting iterations (base learners) | Positive integer | 100 |
learning_rate | Learning rate | [0, 1] | 0.01 |
max_depth | Maximum depth of decision trees | [0, Inf) | 6 |
min_child_weight | Minimum sum of sample weights in a leaf node | [0, Inf) | 1 |
colsample_bytree | Fraction of columns (features) sampled per tree | (0, 1] | 1 |
alpha | L1 regularization weight | Typically [0, 5] | 0 |
lambda | L2 regularization weight | Typically [0, 5] | 1 |
gamma | Minimum loss reduction required to split a node | [0, Inf) | 0 |
subsample | Fraction of training samples drawn per tree | (0, 1] | 1 |
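As an illustration of how the tuned values in the last column of the table above might be carried into code, the sketch below stores them in a dictionary and checks each one against the range given in the table. The key names `reg_alpha` and `reg_lambda` follow the scikit-learn style wrapper of the xgboost package, which is an assumption about the tooling; the source does not say which interface was used. The helper `check_ranges` is hypothetical.

```python
# Tuned values taken from the last column of the table above.
# reg_alpha / reg_lambda are the wrapper names for the L1 / L2 weights.
params = {
    "n_estimators": 100,
    "learning_rate": 0.01,
    "max_depth": 6,
    "min_child_weight": 1,
    "colsample_bytree": 1.0,
    "reg_alpha": 0,
    "reg_lambda": 1,
    "gamma": 0,
    "subsample": 1.0,
}


def check_ranges(p):
    """Validate each hyperparameter against the range listed in the table."""
    assert isinstance(p["n_estimators"], int) and p["n_estimators"] > 0
    assert 0 <= p["learning_rate"] <= 1
    assert p["max_depth"] >= 0
    assert p["min_child_weight"] >= 0
    assert 0 < p["colsample_bytree"] <= 1
    assert p["reg_alpha"] >= 0
    assert p["reg_lambda"] >= 0
    assert p["gamma"] >= 0
    assert 0 < p["subsample"] <= 1
    return True


ranges_ok = check_ranges(params)
print("all values within the tabulated ranges:", ranges_ok)
```

With the xgboost package installed, a dictionary of this form could be unpacked into a model constructor, for example `xgboost.XGBClassifier(**params)`, assuming water quality is treated as a classification target.
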
Season | Precision | Accuracy | Recall | |
---|---|---|---|---|
Overall | 0.927 | 0.964 | 0.883 | 0.903 |
Spring | 0.853 | 0.973 | 0.768 | 0.798 |
Summer | 0.945 | 0.969 | 0.819 | 0.853 |
Autumn | 0.930 | 0.981 | 0.884 | 0.905 |
Winter | 0.822 | 0.971 | 0.814 | 0.848 |
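For readers who want to see how scores such as precision, accuracy, and recall are obtained from predicted and observed classes, a minimal standard-library sketch follows. It assumes macro averaging over classes, which the source does not confirm, and the labels are made-up class numbers used purely for illustration. The function name `classification_metrics` is hypothetical.

```python
from statistics import mean


def classification_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision and recall (averaging is assumed)."""
    pairs = list(zip(y_true, y_pred))
    labels = sorted(set(y_true) | set(y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    precisions, recalls = [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # Define the score as 0.0 when a class is never predicted or never observed.
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)

    return {
        "precision": mean(precisions),
        "accuracy": accuracy,
        "recall": mean(recalls),
    }


# Made-up observed and predicted classes, for illustration only.
y_true = [1, 1, 2, 2, 3, 3]
y_pred = [1, 2, 2, 2, 3, 1]

metrics = classification_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Macro averaging weights every class equally regardless of how many samples it holds; a sample-weighted average would give different values when classes are imbalanced.
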
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Li, W.; Deng, M.; Liu, C.; Cao, Q. Analysis of Key Influencing Factors of Water Quality in Tai Lake Basin Based on XGBoost-SHAP. Water 2025, 17, 1619. https://doi.org/10.3390/w17111619