The analysis of the dataset provided critical insights into the factors that influence water quality and potability. The machine learning models were evaluated on their ability to predict water potability from the provided physicochemical parameters, and they demonstrated varying levels of accuracy in classifying water as safe or unsafe for human consumption. The primary models selected for this analysis were Decision Tree, Extra Trees, AdaBoost, XGBoost, and Random Forest.
4.1. Comparison of Resampling Methods
Table 3 summarizes the predictive performance of various classifiers under different data balancing strategies, expressed as percentages. In the baseline scenario without resampling, linear models such as logistic regression achieved moderate accuracy (59.4%) but low precision (49.0%) and F1-score (44.7%), reflecting limited capability in detecting minority-class instances. Distance-based models, exemplified by K-Nearest Neighbors, slightly improved both accuracy (61.8%) and F1-score (60.0%), while ensemble approaches, including Random Forest and Extra Trees, reached accuracies above 66% and F1-scores above 64%. Support Vector Classification displayed the most competitive baseline performance, with an accuracy of 67.7% and an F1-score of 63.5%, indicating better discrimination in the unbalanced dataset.
Traditional resampling methods yielded mixed outcomes. Random oversampling enhanced the performance of ensemble classifiers: Random Forest and Decision Tree achieved accuracies of 74.4% and 69.2%, respectively, with F1-scores exceeding 74% and PR-AUC values above 84%. Conversely, random undersampling generally reduced model effectiveness, particularly for linear and distance-based learners, likely because discarding majority-class samples removes useful information. Synthetic oversampling methods, including SMOTE, ADASYN, and Borderline-SMOTE, provided moderate gains for ensemble learners, improving both accuracy and F1-scores, with Random Forest consistently performing at the top (up to 69.9% with ADASYN and 72.2% with SMOTE).
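The effect of random oversampling described above can be sketched as follows. This is an illustrative example on synthetic data, not the study's dataset or exact pipeline, and the hand-rolled `random_oversample` helper stands in for a library resampler.

```python
# Sketch: baseline fit vs. random oversampling on a synthetic imbalanced
# dataset (a stand-in for the potability data; settings are illustrative).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=9, weights=[0.7, 0.3],
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

def random_oversample(X, y, rng):
    # Duplicate rows of each class until all classes match the majority count.
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    idx = np.concatenate([rng.choice(np.flatnonzero(y == c), size=n_max,
                                     replace=True) for c in classes])
    return X[idx], y[idx]

rng = np.random.default_rng(0)
Xb, yb = random_oversample(X_tr, y_tr, rng)

base = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
over = RandomForestClassifier(random_state=0).fit(Xb, yb)
f1_base = f1_score(y_te, base.predict(X_te))
f1_over = f1_score(y_te, over.predict(X_te))
print(f"baseline F1={f1_base:.3f}  oversampled F1={f1_over:.3f}")
```

In practice a library such as imbalanced-learn provides equivalent resamplers; the manual helper is shown only to make the mechanism explicit.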
Table 4 presents the performance of various classifiers under multiple advanced resampling techniques, including Borderline-SMOTE, Tomek Links, SMOTE-ENN, SMOTE-Tomek, and ADASYN-Tomek. Each method was evaluated using seven classifiers: logistic regression, K-Nearest Neighbors, Decision Tree, Support Vector Classifier (SVC), Random Forest, Gradient Boosting, and Extra Trees. Performance metrics include accuracy, precision, recall, F1-score, and ROC-AUC.
Under Borderline-SMOTE, logistic regression achieved moderate performance with 51.0% accuracy and an F1-score of 51.0%, while K-Nearest Neighbors demonstrated improved results with 66.5% accuracy and a 66.4% F1-score. Ensemble-based models, particularly Random Forest and Extra Trees, outperformed the other classifiers, attaining accuracies of 73.3% and 77.0%, respectively, with corresponding F1-scores of 73.3% and 77.0% and ROC-AUC values exceeding 80%.
The application of Tomek Links yielded modest gains for certain classifiers. Logistic Regression reached 58.0% accuracy but exhibited limited F1 performance (43.3%), whereas Random Forest and Extra Trees maintained consistent effectiveness, both reaching 65.7% accuracy, with F1-scores of 64.4% and 63.9%, respectively. SVC performance remained competitive, with 65.9% accuracy and a 61.9% F1-score.
Hybrid resampling approaches demonstrated substantial improvements. SMOTE-ENN notably enhanced performance for K-Nearest Neighbors (81.1% accuracy, 80.3% F1-score) and Extra Trees (87.3% accuracy, 87.0% F1-score), achieving the highest ROC-AUC values across all classifiers (up to 94.5%). SMOTE-Tomek also yielded favorable results, with Random Forest reaching 74.3% for both accuracy and F1-score, while Extra Trees attained 76.1% for both metrics. Similarly, ADASYN-Tomek improved ensemble classifier performance, with Extra Trees achieving 75.8% for both accuracy and F1-score, and Random Forest maintaining 72.5% accuracy.
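SMOTE-ENN works by first interpolating synthetic minority samples between nearest neighbours (SMOTE) and then editing away samples whose neighbours contradict their label (ENN). A minimal sketch of that idea, assuming only NumPy and scikit-learn and a synthetic two-feature dataset; the simplified ENN vote here includes the sample itself, unlike the strict definition:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors, KNeighborsClassifier

def smote(X_min, n_new, k=5, rng=None):
    """Interpolate n_new synthetic points between minority samples."""
    if rng is None:
        rng = np.random.default_rng(0)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)           # idx[:, 0] is the point itself
    new = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        j = idx[i, rng.integers(1, k + 1)]  # pick one of the k true neighbours
        new.append(X_min[i] + rng.random() * (X_min[j] - X_min[i]))
    return np.array(new)

def enn(X, y, k=3):
    """Edited Nearest Neighbours (simplified): drop contradicted samples."""
    knn = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    keep = knn.predict(X) == y
    return X[keep], y[keep]

rng = np.random.default_rng(42)
X_maj = rng.normal(0.0, 1.0, size=(300, 2))   # majority class stand-in
X_min = rng.normal(2.5, 1.0, size=(100, 2))   # minority class stand-in
X = np.vstack([X_maj, X_min])
y = np.array([0] * 300 + [1] * 100)

X_syn = smote(X_min, n_new=200, rng=rng)              # oversample to parity
X_bal = np.vstack([X, X_syn])
y_bal = np.concatenate([y, np.ones(200, dtype=int)])
X_clean, y_clean = enn(X_bal, y_bal)                  # then clean the boundary
print(len(X_bal), "->", len(X_clean), "samples after ENN")
```

The study presumably used a library implementation (e.g. imbalanced-learn's `SMOTEENN`); this sketch only illustrates why the hybrid both balances the classes and removes noisy boundary points.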
Table 5 presents the comparative performance of various classifiers under different feature selection strategies on the dataset. The baseline feature search, run for ten and twenty generations, converged on five features with a best fitness of 62.48%, indicating that this subset achieved a moderate balance between dimensionality reduction and classification potential.
When employing the Genetic Algorithm, five out of nine features were selected. Classifier performance under this configuration varied notably: Logistic Regression achieved an accuracy of 61.04% with an F1-score of 46.27%, while Extra Trees demonstrated comparatively superior performance with an accuracy of 63.89% and an F1-score of 61.06%. Support Vector Classifier and Gradient Boosting achieved intermediate results, highlighting the differential sensitivity of classifiers to the selected features.
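A toy sketch of genetic-algorithm feature selection over bitmask individuals, with cross-validated accuracy as the fitness function; the population size, operators, base classifier, and synthetic data are illustrative assumptions, not the study's exact configuration:

```python
# Toy GA feature selection: each individual is a 9-bit mask over features,
# fitness is 3-fold cross-validated accuracy of a logistic regression.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=400, n_features=9, n_informative=4,
                           random_state=0)
rng = np.random.default_rng(0)

def fitness(mask):
    if mask.sum() == 0:
        return 0.0
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, X[:, mask.astype(bool)], y, cv=3).mean()

pop = rng.integers(0, 2, size=(12, 9))           # 12 random bitmask individuals
for generation in range(10):
    scores = np.array([fitness(m) for m in pop])
    parents = pop[np.argsort(scores)[-6:]]       # truncation selection: best half
    children = parents.copy()
    for c in children:                           # crossover + bit-flip mutation
        mate = parents[rng.integers(len(parents))]
        point = rng.integers(1, 9)
        c[point:] = mate[point:]                 # single-point crossover
        c[rng.integers(9)] ^= 1                  # flip one random bit
    pop = np.vstack([parents, children])

best = pop[np.argmax([fitness(m) for m in pop])]
print("selected features:", np.flatnonzero(best))
```

Real GA libraries (e.g. DEAP) add tournament selection and elitism; the point here is only that the search evaluates feature subsets by downstream classifier performance rather than by a per-feature statistic.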
The Particle Swarm Optimization (PSO) method selected four features, representing the most aggressive reduction. Under PSO, classification accuracy ranged from 54.93% (Decision Tree) to 62.36% (Logistic Regression), while F1-scores were generally lower compared to the Genetic Algorithm, suggesting that overly aggressive feature reduction can impair predictive balance.
Feature selection based on Mutual Information retained all nine features, yielding performance comparable to the baseline. Logistic Regression maintained 61.04% accuracy, and Extra Trees reached 66.33% accuracy with an F1-score of 62.05%. Similarly, Chi-Square selection preserved the full feature set, producing results consistent with Mutual Information, indicating that these statistical approaches did not eliminate predictive information.
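The two statistical filters can be sketched with scikit-learn as follows; the synthetic data stands in for the nine potability features, and keeping `k="all"` mirrors the reported outcome that neither filter dropped any predictors:

```python
# Scoring features with mutual information and chi-square (Table 5 filters).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif
from sklearn.preprocessing import MinMaxScaler

X, y = make_classification(n_samples=500, n_features=9, n_informative=5,
                           random_state=0)

mi = mutual_info_classif(X, y, random_state=0)

# chi2 requires non-negative inputs, so scale features to [0, 1] first.
X_pos = MinMaxScaler().fit_transform(X)
chi_scores, _ = chi2(X_pos, y)

# k="all" retains every feature, matching the reported full-feature result.
selector = SelectKBest(mutual_info_classif, k="all").fit(X, y)
print("MI scores:", np.round(mi, 3))
```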
4.2. Performance Evaluation of Extra Trees with SMOTE-ENN
Table 6 shows the evaluation metrics obtained by the Extra Trees classifier when trained using the SMOTE-ENN resampling strategy. The results indicate strong and balanced predictive performance across both potable and non-potable water classes. For the non-potable class, the model achieved a precision of 0.91 and a recall of 0.75, resulting in an F1-score of 0.82 over 181 samples. This outcome suggests a low rate of false positive predictions for non-potable instances, while maintaining a reasonable level of sensitivity. In contrast, the potable class exhibited a precision of 0.85 and a recall of 0.95, corresponding to an F1-score of 0.90 across 274 samples. The higher recall reflects the model’s strong ability to correctly identify potable water samples, which is particularly important for public health applications.
At the aggregate level, the classifier attained an overall accuracy of 0.87 on the test set. The macro-averaged precision, recall, and F1-score were 0.88, 0.85, and 0.86, respectively, indicating consistent performance across classes without dominance from class imbalance. Similarly, the weighted averages closely aligned with the overall accuracy, confirming that the predictive capability remains stable when accounting for class support. These findings demonstrate that the integration of SMOTE-ENN with the Extra Trees model yields reliable and well-balanced classification outcomes for water quality prediction tasks.
Figure 3 illustrates the confusion matrix obtained with the Extra Trees classifier combined with the SMOTE-ENN resampling strategy, from which standard classification metrics were derived. Out of 455 evaluated samples, the model correctly identified 261 potable instances (true positives, TP) and 136 non-potable instances (true negatives, TN), while producing 45 false positives (FP) and 13 false negatives (FN). This resulted in an overall accuracy of 87.25%, indicating that the majority of samples were classified correctly.
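These aggregate figures can be recomputed directly from the confusion-matrix counts, with the potable class treated as positive; note that Table 7 reports slightly different values, presumably because they were computed at the tuned decision threshold rather than the default one:

```python
# Metrics derived from the Figure 3 confusion-matrix counts.
TP, TN, FP, FN = 261, 136, 45, 13

accuracy  = (TP + TN) / (TP + TN + FP + FN)   # 397 / 455
precision = TP / (TP + FP)                    # 261 / 306
recall    = TP / (TP + FN)                    # 261 / 274
f1        = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```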
Table 7 summarizes the classification performance of the Extra Trees classifier across multiple evaluation metrics. The model achieves an overall accuracy of 0.882, indicating strong general predictive capability. Notably, the recall is exceptionally high (0.968), demonstrating the model’s effectiveness in correctly identifying positive instances, which is particularly important in risk-sensitive or safety-critical applications such as water quality monitoring. The precision of 0.854 and the corresponding F1-score of 0.907 reflect a well-balanced trade-off between false positives and false negatives. Furthermore, the high PR-AUC value of 0.972 suggests robust performance under class imbalance. The balanced accuracy of 0.860 confirms consistent performance across classes, while the Matthews Correlation Coefficient (0.7566) indicates a strong overall correlation between predicted and true labels. The optimal decision threshold of 0.35 shows that performance is maximized at a non-default cutoff, underscoring the importance of threshold tuning in practical deployment scenarios.
Figure 4 illustrates a comprehensive evaluation of the Extra Trees classifier through complementary threshold-independent and threshold-dependent analyses. The Precision–Recall curve demonstrates consistently high precision across a wide range of recall values, yielding a PR-AUC of 0.972, which indicates excellent performance under class imbalance and strong reliability in positive-class identification. The ROC curve further confirms robust discriminative capability, with a ROC-AUC of 0.957 and a clear separation from the random baseline, reflecting high true positive rates at low false positive rates. The metrics-versus-threshold analysis reveals the inherent trade-off between precision and recall, with the F1-score peaking around a mid-range threshold, highlighting an optimal balance between the two. This behavior is reinforced by the optimal threshold selection plot, which identifies a decision threshold of 0.35 as optimal, satisfying a minimum precision constraint of 0.7 while maintaining high recall.
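The threshold-selection step can be sketched as a sweep over the candidate thresholds returned by scikit-learn's precision-recall curve, keeping the F1-maximizing threshold among those meeting the 0.7 minimum-precision constraint; the model and data below are synthetic stand-ins for the study's pipeline:

```python
# Constrained threshold tuning: maximize F1 subject to precision >= 0.7.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=9, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

clf = ExtraTreesClassifier(random_state=1).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]

prec, rec, thr = precision_recall_curve(y_te, proba)
# precision_recall_curve returns len(thr) + 1 precision/recall points;
# drop the final (precision=1, recall=0) endpoint to align with thr.
f1 = 2 * prec[:-1] * rec[:-1] / np.clip(prec[:-1] + rec[:-1], 1e-12, None)
ok = prec[:-1] >= 0.7                     # minimum-precision constraint
best = thr[ok][np.argmax(f1[ok])]
print(f"chosen threshold: {best:.2f}")
```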
4.3. XAI Results
To gain a deeper understanding of the relative importance of the input features, a feature importance analysis was conducted using the SHAP method, which quantifies the contribution of each parameter to the model’s predictions. This analysis identified parameters such as turbidity, solids, sulfate, hardness, and pH as some of the most influential factors in determining water potability. The robust performance of the optimized Random Forest model, combined with the interpretability provided by this analysis, underscores the value of integrating machine learning techniques with domain knowledge for water quality assessment. These findings offer valuable insights that can inform targeted interventions and policy decisions to address water pollution and ensure safe water access for all.
Figure 5 illustrates the mean absolute SHAP values for various water quality features, providing insights into their average impact on model output magnitude across the two classes. The x-axis represents the mean absolute SHAP values, which quantify the average contribution of each feature to predictions, while the y-axis lists the features, including chloramines, conductivity, organic carbon, trihalomethanes, turbidity, solids, sulfate, hardness, and pH. The comparison between Class 0 and Class 1 highlights the varying importance of these features in influencing model outcomes. Through the analysis of mean absolute values, it becomes clear which parameters have the most significant effect on the model’s predictions, thereby enhancing understanding of the relationships between water quality indicators and the targeted outcomes within the two classes.
Figure 6 presents a summary of SHAP values, illustrating the impact of various water quality features on model output. Each feature, including trihalomethanes, turbidity, conductivity, chloramines, hardness, organic carbon, solids, pH, and sulfate, is plotted along the y-axis, while the x-axis indicates the SHAP value, which reflects the contribution of each feature to the model’s predictions. The distribution of points along the x-axis illustrates the spread of each feature’s contributions, differentiating between low and high feature values.
Feature importance analysis shows that key parameters such as sulfate, pH, hardness, and the concentration of total dissolved solids were significant predictors of water quality. Specifically, pH levels were found to have a strong negative correlation with non-potable classifications, indicating that lower pH values significantly increased the likelihood of water being unsafe. Hardness also emerged as a critical factor, with higher hardness levels correlating with a greater probability of water being classified as non-potable. The concentration of total dissolved solids was similarly influential, with elevated levels associated with compromised water quality.