Water
  • Article
  • Open Access

16 March 2026

Comparative Performance Analysis of Machine Learning Models for Predicting the Weighted Arithmetic Water Quality Index

1 Department of Environmental Engineering, Faculty of Engineering, Harran University, Şanlıurfa 63000, Türkiye
2 Department of Animal Science, Faculty of Agriculture, Harran University, Şanlıurfa 63000, Türkiye
3 Department of Electrical and Electronics Engineering, Faculty of Engineering, Osmaniye Korkut Ata University, Osmaniye 80010, Türkiye
4 Department of Electronics and Communication Engineering, Faculty of Technology, Karadeniz Technical University, Trabzon 61000, Türkiye
This article belongs to the Section Urban Water Management

Abstract

Precise water quality forecasting is vital for sustainable resource management and public health, especially in semi-arid environments. This study compares ten Machine Learning (ML) algorithms on a dataset of 308 drinking water samples collected from various districts in Şanlıurfa Province, Türkiye. The models include Support Vector Regressor (SVR) and Extreme Gradient Boosting (XGBoost), both integrated with dimensionality reduction and hyperparameter optimization. Nineteen physicochemical and microbiological parameters were used as input features: temperature, chlorine (Cl), pH, electrical conductivity (EC), total dissolved solids (TDS), nitrite (NO₂), nitrate (NO₃), ammonium (NH₄⁺), sulfate (SO₄²⁻), free chlorine (Cl₂), calcium (Ca²⁺), magnesium (Mg²⁺), sodium (Na⁺), potassium (K⁺), fluoride (F), trihalomethanes (THMs), Escherichia coli, enterococci, and total coliform. The dataset was split into training (75%) and testing (25%) subsets, and model performance was assessed through 10-fold cross-validation and hold-out testing. To improve model generalization and mitigate the effects of class imbalance, we applied the Adaptive Synthetic Sampling (ADASYN) technique. The models were evaluated using standard regression metrics: Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and the Coefficient of Determination (R²). The Long Short-Term Memory (LSTM) model tuned with Randomized Search outperformed the SVR and XGBoost models, achieving the highest R² (0.999, after ADASYN balancing) and the lowest RMSE (1.206). These results underscore the effectiveness of the LSTM framework in modeling the complex variance of the Weighted Arithmetic Water Quality Index (WAWQI).
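The evaluation protocol described above (a 75/25 hold-out split scored with MAE, MSE, RMSE, and R²) can be illustrated with a minimal pure-Python sketch. The function names, the shuffling seed, and the toy values are assumptions made for illustration only; the abstract does not state how the split was randomized or which library computed the metrics.

```python
import math
import random


def holdout_indices(n_samples, test_fraction=0.25, seed=0):
    """Shuffle sample indices and split them into (train, test) lists.

    The seed is a hypothetical choice; the source does not report one.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE, and R^2 for two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total variance
    ss_res = sum(e * e for e in errors)                 # residual variance
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}


if __name__ == "__main__":
    train_idx, test_idx = holdout_indices(308)
    print(len(train_idx), len(test_idx))
    # Toy index values, not data from the study.
    print(regression_metrics([10, 20, 30, 40], [12, 18, 33, 39]))
```

For the 308 samples reported in the abstract, a 25% test fraction corresponds to 77 test and 231 training samples under this sketch. Because RMSE is expressed in the units of the index itself, it is easier to interpret alongside R² than MSE alone.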
The results are expected to support future water quality monitoring strategies, inform policy development, and contribute to sustainable water resource management in arid and semi-arid regions.
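For context on the prediction target, the weighted arithmetic index is commonly computed as WQI = Σ(W_i · Q_i) / Σ W_i, with quality rating Q_i = 100 · (V_i − V_0) / (S_i − V_0) and unit weight W_i = K / S_i, where K = 1 / Σ(1 / S_i), V_i is the measured value, S_i the permissible limit, and V_0 the ideal value. The sketch below follows that common formulation. The abstract does not specify the permissible limits, ideal values, or parameter subset used in the study, so the parameter names and numbers here are placeholders, not the authors' values.

```python
def wawqi(measured, standard, ideal=None):
    """Weighted arithmetic WQI under the common formulation.

    measured: dict of parameter name -> measured value V_i
    standard: dict of parameter name -> permissible limit S_i
    ideal:    optional dict of parameter name -> ideal value V_0
              (defaults to 0 for any parameter not listed)
    """
    ideal = ideal or {}
    if not measured:
        raise ValueError("at least one parameter is required")
    # Proportionality constant so that the unit weights sum to 1.
    k = 1.0 / sum(1.0 / standard[p] for p in measured)
    weighted_sum = 0.0
    weight_total = 0.0
    for name, value in measured.items():
        s = standard[name]
        v0 = ideal.get(name, 0.0)
        if s == v0:
            raise ValueError(f"limit equals ideal value for {name!r}")
        q = 100.0 * (value - v0) / (s - v0)  # quality rating Q_i
        w = k / s                            # unit weight W_i
        weighted_sum += w * q
        weight_total += w
    return weighted_sum / weight_total


if __name__ == "__main__":
    # Placeholder parameters and limits, chosen only to exercise the formula.
    sample = {"param_a": 5.0, "param_b": 50.0}
    limits = {"param_a": 10.0, "param_b": 100.0}
    print(wawqi(sample, limits))
```

Because each unit weight is inversely proportional to its permissible limit, parameters with strict limits dominate the index. That sensitivity is one reason a learned regressor over the raw measurements can be a useful check on the computed value.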
