Abstract
Accurate assessment of water quality is crucial for protecting public health and promoting environmental sustainability. Conventional laboratory-based methods for evaluating microbial contaminants are often time-consuming, resource-intensive, and reactive, which limits their usefulness for real-time water quality monitoring and management. This study examines the application of data-driven machine learning (ML) models to predict E. coli concentrations in Midmar Dam from readily available physicochemical parameters. A comparative analysis was conducted using five classical standalone ML algorithms: Random Forest (RF), Support Vector Machine (SVM), k-Nearest Neighbors (kNN), Artificial Neural Network (ANN), and Extreme Gradient Boosting (XGBoost). The models were assessed on their predictive performance using three standard error metrics: Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Squared Error (RMSE). Among the models evaluated, kNN performed best, achieving the lowest MSE and RMSE values, which suggests that it effectively captures the complex relationships between physicochemical indicators and microbial contamination levels. The findings demonstrate the potential of ML-based approaches to serve as efficient, scalable, and proactive tools for sustainable water quality monitoring and management in dams.
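For readers unfamiliar with the three error metrics named above, the following is a minimal sketch of how MAE, MSE, and RMSE are computed from paired observed and predicted values. It is not the study's code: the function name `error_metrics` and the sample values are hypothetical, chosen only for illustration, and are not data from Midmar Dam.

```python
import math


def error_metrics(observed, predicted):
    """Return MAE, MSE and RMSE for paired observed/predicted values."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    # Residual = observed value minus model prediction, one per sample.
    residuals = [o - p for o, p in zip(observed, predicted)]
    n = len(residuals)
    mae = sum(abs(r) for r in residuals) / n   # mean absolute error
    mse = sum(r * r for r in residuals) / n    # mean squared error
    rmse = math.sqrt(mse)                      # root of MSE, in the target's units
    return {"MAE": mae, "MSE": mse, "RMSE": rmse}


# Hypothetical values for illustration only (not measurements from the study).
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]
metrics = error_metrics(observed, predicted)
print(metrics)
```

Because RMSE is the square root of MSE, the two metrics always rank models in the same order; MAE can rank them differently because it penalizes large errors less heavily.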