Abstract
Precise water quality forecasting is vital for sustainable resource management and public health, especially in semi-arid environments. This study evaluates the predictive capabilities of ten Machine Learning (ML) models on a dataset of 308 drinking water samples collected from various districts in Şanlıurfa Province, Türkiye. The models include Support Vector Regressor (SVR) and Extreme Gradient Boosting (XGBoost), both integrated with dimensionality reduction and hyperparameter optimization. Nineteen physicochemical and microbiological parameters were used as input features: temperature, chloride (Cl⁻), pH, Electrical Conductivity (EC), Total Dissolved Solids (TDS), nitrite (NO₂⁻), nitrate (NO₃⁻), ammonium (NH₄⁺), sulfate (SO₄²⁻), free chlorine (Cl₂), calcium (Ca²⁺), magnesium (Mg²⁺), sodium (Na⁺), potassium (K⁺), fluoride (F⁻), trihalomethanes (THMs), Escherichia coli, Enterococci, and total coliform. The dataset was split into training (75%) and testing (25%) subsets, and model performance was assessed through 10-fold cross-validation and hold-out testing. To improve model generalization and mitigate the effects of class imbalance, we applied the Adaptive Synthetic Sampling (ADASYN) technique. The models were evaluated using standard regression metrics: Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and the Coefficient of Determination (R²). The Long Short-Term Memory (LSTM) model optimized using Randomized Search outperformed the SVR and XGBoost models, achieving the highest accuracy and generalization capability, with an R² of 0.999 following ADASYN balancing and the lowest RMSE (1.206). These findings underscore the effectiveness of the LSTM framework in modeling the complex variance of the Weighted Arithmetic Water Quality Index (WAWQI).
The findings of this study are expected to support future water quality monitoring strategies, inform policy development, and contribute to sustainable water resource management in arid and semi-arid regions.
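For readers who want a concrete reference for the four regression metrics named in the abstract (MAE, MSE, RMSE, R²), the following is a minimal sketch using their standard definitions. It is not the study's evaluation code: the function name `regression_metrics` and the sample values are hypothetical and serve only to illustrate how the metrics are computed from observed and predicted index values.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and R^2 for two equal-length sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n   # mean absolute error
    mse = sum(e * e for e in errors) / n    # mean squared error
    rmse = math.sqrt(mse)                   # root mean squared error

    # R^2 = 1 - SS_res / SS_tot, where SS_tot is measured around the mean
    # of the observed values. It is undefined when the observations are constant.
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    ss_res = sum(e * e for e in errors)
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")

    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}


# Hypothetical index values for illustration only (not data from the study).
observed = [42.0, 55.0, 61.0, 78.0]
predicted = [40.0, 57.0, 60.0, 80.0]
metrics = regression_metrics(observed, predicted)
```

Because RMSE is expressed in the same units as the predicted index while R² is unitless, the two are usually read together: a high R² with a low RMSE indicates both a good fit to the variance and small absolute errors.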