Article

Water Quality Prediction Using Explainable Machine Learning Models

Department of Computer Science, College of Computer Science and Engineering, Taibah University, Yanbu 46421, Saudi Arabia
Sustainability 2026, 18(4), 1721; https://doi.org/10.3390/su18041721
Submission received: 7 January 2026 / Revised: 1 February 2026 / Accepted: 4 February 2026 / Published: 7 February 2026

Abstract

Water pollution is a pervasive global challenge, with significant impacts on ecosystem preservation, human health, and sustainable development. Addressing this issue requires a comprehensive understanding of the complex relationships between environmental factors and water quality outcomes. This study investigates the application of machine learning models to enhance water quality prediction and environmental management. The study utilizes robust machine learning models, including AdaBoost, Random Forest, and Decision Tree classifiers, to uncover patterns within multidimensional water quality datasets. The Extra Trees classifier combined with the SMOTE-ENN resampling strategy achieved an accuracy of 87.25%, a recall of 95.26%, and a ROC-AUC of 94.52%. An Explainable AI (XAI) method is used to determine the contribution of each parameter to the model's predictions. This study identified parameters such as turbidity, solids, sulfate, hardness, and pH as some of the most influential factors in determining water potability. The identified factors provide valuable insights that can inform policy decisions and targeted interventions to reduce water pollution. The use of machine learning techniques provides a strong foundation for enhancing water quality evaluation and prediction, thereby facilitating sustainable water resource management.

1. Introduction

Water is essential for life, yet pollution is steadily degrading its quality, with serious consequences for human health and ecological systems [1]. Water pollution, defined as the deterioration of water quality, makes water unsuitable for various uses, including drinking, sanitation, and recreational activities [2,3]. This challenge is a global concern, exacerbated by the release of untreated wastewater into natural rivers and lakes, which negatively affects aquatic ecosystems.
The pollution of water can be caused by a variety of sources, including chemical runoff, waste disposal, and biological contaminants [4]. Agricultural practices are a primary contributor, as they utilize over 70% of the planet’s freshwater resources, significantly impacting water quality. These practices are recognized as major drivers of water pollution in rivers and streams. Approximately 80% of the world’s wastewater flows without treatment, leading to the contamination of rivers, lakes, and oceans. Pollution poses a threat to both human health and the aquatic organisms that depend on clean water for survival [5,6]. Contaminated water and poor sanitation are major contributors to the spread of numerous diseases, including cholera, hepatitis A, dysentery, diarrhea, typhoid, and polio. The lack, insufficiency, or mismanagement of sanitation and water services leaves people vulnerable to preventable health risks. This issue is especially alarming in healthcare facilities, where inadequate water, sanitation, and hygiene services increase the risk of infection and disease for both patients and staff. Worldwide, around 15% of patients acquire infections during hospital stays, with this rate being notably higher in low-income countries.
Access to safe and readily available water is vital for public health, fulfilling essential needs for drinking, food production, domestic use, and recreational activities [7,8]. Improving access to water and sanitation, coupled with effective management of water resources, can substantially drive economic growth and contribute to poverty reduction across nations. The global challenge of access to safe water highlights that approximately 2.1 billion individuals lacked safe water at home in 2015 [9]. Specifically, it notes that 263 million people spent over 30 min on average per round trip to collect water, emphasizing the significant time burden associated with water procurement. Additionally, 159 million individuals relied on surface sources, such as streams or lakes, for their drinking water, which poses health risks due to potential contamination. Moreover, 844 million individuals lacked access to enhanced drinking water services [10].
The consequences of unsafe water are profound, resulting in more fatalities annually than conflict and violence combined [11,12]. Furthermore, less than one percent of Earth’s freshwater is accessible for human consumption. Without proactive measures, these challenges are projected to intensify, with global freshwater demand expected to increase by one-third by 2050.
Causal inference in machine learning is a methodological approach aimed at understanding and quantifying cause-and-effect relationships within complex datasets, going beyond traditional correlational analyses. It seeks to disentangle the mechanisms that generate observed patterns by employing advanced statistical techniques, probabilistic models, and interventional frameworks such as potential outcomes and graphical models. Researchers applying causal inference methods attempt to estimate the impact of specific interventions or treatments by controlling for confounding variables, selection bias, and other distortions that might obscure genuine causal mechanisms. By integrating domain expertise, experimental design principles, and algorithmic techniques, the field enables data scientists to construct more nuanced and interpretable models that capture not only statistical associations but also the causal pathways driving observed phenomena across domains such as healthcare, economics, the social sciences, and policy research.
In support of sustainable water resource management and evidence-based environmental decision-making, this study is guided by the following research questions:
  • RQ1: Can a reproducible machine learning pipeline reliably predict water potability under class imbalance conditions while ensuring stable and robust performance?
  • RQ2: To what extent do ensemble-based classifiers, including Random Forest, Extra Trees, AdaBoost, and Decision Tree models, enhance predictive performance for water quality assessment compared to single-model approaches?
  • RQ3: Which physicochemical and environmental parameters contribute most significantly to water potability predictions, as revealed through explainable artificial intelligence methods, and how can these insights support sustainable water quality monitoring and management strategies?
To tackle the complexities of water quality data, this study explores the application of machine learning algorithms, such as Random Forest and Decision Tree [13,14,15,16,17,18], to uncover the relationships between environmental factors and water quality outcomes. By analyzing extensive datasets that incorporate diverse water quality metrics and climatic conditions, this research seeks to offer valuable insights into the factors that influence water quality.
This study investigates the application of machine learning techniques to enhance water quality prediction and environmental management. The study utilizes robust machine learning models, including Random Forest, Extra Trees, AdaBoost, and Decision Tree classifiers, to uncover patterns within multidimensional water quality datasets. The results demonstrate the efficacy of the proposed framework, with the optimized Extra Trees model achieving an accuracy of 87.25% after hyperparameter tuning. The Explainable AI (XAI) method was used to quantify the impact of each parameter on the model’s predictions. This study identified parameters such as turbidity, solids, sulfate, hardness, and pH as some of the most influential factors in determining water potability. The identified factors provide valuable insights that can inform policy decisions and targeted interventions to reduce water pollution.
The key contributions of this study are summarized as follows:
  • This research introduces an interpretable and reproducible machine learning framework tailored for sustainable water quality assessment, integrating data preprocessing, imbalance correction via SMOTE-ENN, model optimization, and explainable artificial intelligence techniques.
  • A comprehensive comparative analysis of multiple tree-based and ensemble learning models is conducted, demonstrating that the optimized Extra Trees classifier achieves superior predictive performance, with an accuracy of 87.25% and a ROC-AUC of 0.9452, thereby offering a reliable tool for long-term water quality monitoring.
  • By employing SHAP-based explainability, the study identifies critical water quality parameters, such as turbidity, total dissolved solids, sulfate concentration, hardness, and pH, that significantly influence potability predictions, providing actionable insights to support sustainable policy formulation, targeted pollution mitigation, and resilient water resource management.

2. Related Work

Numerous studies have investigated various aspects of water quality, including its parameters, impacts, and management strategies. This literature review gathers key findings from recent research on water quality and its implications. The quality of drinking water is a crucial factor influencing public health, environmental sustainability, and economic development.
The physicochemical characteristics of water significantly influence its suitability for human consumption and ecosystem health. Parameters such as pH, hardness, turbidity, and the presence of contaminants (e.g., chloramines, trihalomethanes) have been extensively studied. According to WHO [19], pH levels outside the range of 6.5 to 8.5 can adversely affect both human health and aquatic life. Similarly, hardness has been linked to various health outcomes, with some studies suggesting a correlation between hard water and reduced cardiovascular disease incidence [20].
The relationship between water quality and public health is well-documented. Contaminated water sources are associated with a range of diseases, including cholera, dysentery, and typhoid fever [21,22]. A systematic review by Prüss-Ustün et al. [23,24] quantified the health impacts of unsafe water and inadequate sanitation, estimating that 2 million deaths annually can be attributed to waterborne diseases. Furthermore, the presence of trihalomethanes and chloramines has raised concerns regarding their potential carcinogenic effects [25,26].
Access to safe water is also pivotal for economic growth. The World Bank [27] emphasizes that improved water supply and sanitation can enhance productivity and contribute significantly to poverty alleviation. Investments in water infrastructure yield substantial economic returns, as they reduce healthcare costs associated with waterborne diseases and improve workforce productivity [28,29].
Effective management strategies are critical for safeguarding water quality. Integrated Water Resources Management (IWRM) is recognized as a comprehensive approach that fosters the coordinated development and management of water, land, and related resources [30,31]. Furthermore, technological innovations, including remote sensing and machine learning, are increasingly being utilized to monitor water quality and predict pollution events [32,33].
The synthesis of the existing literature highlights the multifaceted challenges posed by water quality issues. Continued research is essential to develop innovative solutions and effective management practices that ensure safe water access for all, thereby safeguarding public health and promoting sustainable development [34,35].

3. Methodology

This section outlines the proposed framework employed to investigate the factors influencing water quality using machine learning algorithms. The approach involves data collection, the preprocessing of that data, training of the machine learning models, the evaluation of those models, and interpretation of results as shown in Figure 1. The figure illustrates the comprehensive workflow employed in the analysis of water quality data, focusing on the classification of water as potable (1) or non-potable (0). The process begins with the collection and pre-processing of the dataset, which involves cleaning and preparing the data to ensure its suitability for subsequent analysis. Following this, the dataset is partitioned into training and testing sets, with 30% of the data reserved for testing purposes.

3.1. Water Quality Dataset

The dataset used for this study is gathered from multiple reliable databases that provide comprehensive information on water quality parameters [36]. The dataset includes the following parameters for each water sample: pH, chloramines, sulfate, hardness, trihalomethanes, conductivity, organic carbon, total dissolved solids, turbidity, potability (binary: 0 for non-potable, 1 for potable). The dataset is accessible online from the Kaggle website (https://www.kaggle.com/code/alberthronaldy/water-quality-prediction, accessed on 6 January 2026) and Github (https://github.com/TusharPaul01/Water-Quality-IoT-ML, accessed on 6 January 2026).
The dataset utilized in this study comprises various physicochemical parameters that are critical for assessing water quality. Table 1 presents the structure of the dataset and the description of the water quality features. Each feature is defined as follows: (1) pH: This variable represents the acidity or alkalinity of water, measured on a scale from 0 to 14. A pH value of 7 indicates neutrality, while values below and above this threshold signify acidic and basic conditions, respectively. (2) Hardness: This parameter quantifies the capacity of water to precipitate soap, expressed in milligrams per liter (mg/L). Hardness is primarily attributed to the presence of calcium and magnesium ions and is an important indicator of water quality [37]. (3) Solids: This feature denotes the total dissolved solids in parts per million (ppm), which encompass a variety of inorganic and organic substances dissolved in water. (4) Chloramines: This variable measures the concentration of chloramines, a group of chemical compounds formed when chlorine is used for disinfection, reported in parts per million (ppm). Elevated levels can indicate water treatment practices. (5) Sulfate: This parameter reflects the concentration of sulfates dissolved in water, measured in milligrams per liter (mg/L). Sulfates can originate from natural sources and anthropogenic activities and affect water taste and quality.
Furthermore, (6) Conductivity: This variable indicates the electrical conductivity of water, expressed in microsiemens per centimeter (μS/cm). Conductivity is a useful measure for assessing the ion concentration and overall water quality. (7) Organic Carbon: This feature represents the concentration of organic carbon in parts per million (ppm). Organic carbon content is a key indicator of water quality, as it can influence microbial activity and the presence of pollutants. (8) Trihalomethanes: This parameter quantifies the amount of trihalomethanes, which are byproducts of water chlorination, measured in micrograms per liter (μg/L). Their presence is pertinent to evaluating potential health risks associated with drinking water. (9) Turbidity: This variable measures the clarity of water, expressed in Nephelometric Turbidity Units (NTU). Turbidity indicates the presence of suspended particles, impacting both aesthetic quality and potential health risks. Finally, (10) Potability: This binary variable represents water safety for human consumption, with a value of 1 indicating potable water and a value of 0 signifying non-potable water.
Table 2 presents descriptive statistics for various water quality parameters, encompassing a total of 2785 to 3276 observations across different metrics. Key statistics include the mean pH value of 7.08, indicating a slightly acidic to neutral water quality, with a standard deviation of 1.59, suggesting variability in pH levels. Hardness averages at 196.37 mg/L, while total dissolved solids have a mean of 22,014.09 ppm, reflecting the overall mineral content of the samples. Chloramines, sulfate, and conductivity are also reported, with means of 7.12 mg/L, 333.78 mg/L, and 426.21 μS/cm, respectively. The data highlights significant variability, as evidenced by the standard deviations and the range from minimum to maximum values across all parameters. Notably, the potability status shows a mean of 0.39, indicating that approximately 39% of the samples were deemed potable, thereby emphasizing the necessity for ongoing monitoring and assessment of water quality to ensure public health and safety.
Figure 2 illustrates the distribution of water samples based on their potability status, represented as a bar chart. Two categories are depicted: non-potable (0) and potable (1) samples. The non-potable category comprises approximately 60.99% of the total count, reflecting a predominant proportion of the samples that do not meet safety standards for consumption. In contrast, the potable category accounts for 39.01%, indicating that a smaller subset of the samples is considered safe for drinking. This visual representation effectively highlights the disparity between potable and non-potable water sources, underscoring the critical need for continued monitoring and intervention to enhance water quality for public health purposes.

3.2. Data Preprocessing

Preprocessing is crucial to ensure the quality and usability of the dataset: (1) Data Cleaning: Identify and address missing values, outliers, and inconsistencies. Missing values can be handled using techniques such as mean/mode imputation or removal of records with excessive missing data. (2) Normalization: Normalize the data to bring all features within a similar scale. This can be done using Min-Max scaling or Z-score standardization, which is essential for algorithms sensitive to the range of data. (3) Feature Engineering: Create additional features if necessary, such as interaction terms or polynomial features, to capture complex relationships between parameters. (4) Data Splitting: Divide the dataset into training (70%) and test (30%) subsets to ensure robust model evaluation and avoid overfitting.
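The normalization and splitting steps can be sketched with scikit-learn. The snippet below is a minimal illustration, not the study’s actual pipeline: the synthetic data stands in for the nine water quality features, and the 70/30 stratified split mirrors the partitioning described above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the 9 physicochemical features and binary potability label.
X, y = make_classification(n_samples=300, n_features=9, random_state=3)

# 70/30 split as in the text; stratify preserves the class ratio in both halves.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=3)

# Fit the scaler on the training data only, then apply it to both splits,
# so no information from the test set leaks into preprocessing.
scaler = MinMaxScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

Fitting the scaler exclusively on the training split is what keeps the 30% holdout an honest estimate of generalization.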
Table 1 summarizes the counts of missing data for various water quality parameters, highlighting the extent of incomplete observations across key metrics. Notably, the pH parameter shows a significant gap with 491 missing entries, while sulfate exhibits the highest level of missing data at 781. In contrast, other parameters, including hardness, solids, chloramines, conductivity, organic carbon, turbidity, and potability, report no missing values. Additionally, trihalomethanes have 162 absent entries. To address the missing data in the pH, sulfate, and trihalomethanes columns, mean values are utilized for imputation. This approach allows for the mitigation of missing data issues while preserving the integrity of the relationships among the various parameters.
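The mean imputation applied to the pH, sulfate, and trihalomethanes columns can be expressed in a few lines of pandas. The miniature frame below is illustrative (the column names follow the dataset; the values are invented for the example):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame mimicking the three columns with missing entries.
df = pd.DataFrame({
    "ph": [7.0, np.nan, 6.5, 8.1],
    "Sulfate": [330.0, 350.0, np.nan, 310.0],
    "Trihalomethanes": [66.0, np.nan, 70.0, 64.0],
})

# Mean imputation: each gap is replaced by the column average of the
# observed values, as described in the text.
for col in ["ph", "Sulfate", "Trihalomethanes"]:
    df[col] = df[col].fillna(df[col].mean())
```

In a deployment setting the column means would be computed on the training split only and reused for the test split, for the same leakage reasons as with scaling.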

3.3. Model Selection

The methodology provides a comprehensive framework for analyzing water quality using machine learning. By leveraging advanced analytical techniques, the study evaluated several algorithms to identify the most effective model for predicting water quality. Random forest is an ensemble learning model that constructs multiple decision trees and aggregates their outputs to enhance performance and reduce overfitting.
The decision tree is a simple and interpretable model that divides data into subsets according to feature values. In contrast, gradient boosting machines such as XGBoost utilize an ensemble approach, constructing decision trees sequentially. Each subsequent tree aims to correct the errors made by its predecessor, thereby enhancing the overall predictive performance of the model. This iterative process allows for improved accuracy and adaptability to complex data patterns.

3.4. Model Training

The framework incorporates the application of various machine learning models, which are optimized through hyperparameter tuning to enhance their predictive accuracy. The grid search technique is used to optimize hyperparameters for each algorithm. The performance of these models is rigorously evaluated using established metrics to determine their effectiveness in classifying water quality. Additionally, the framework used the SHapley Additive exPlanations (SHAP), a method established in Explainable AI (XAI), to provide interpretable insights into the model’s decision-making process. This approach ensures transparency and aids in understanding the factors influencing the classification outcomes. Overall, the framework demonstrates an integrated and accurate approach to water quality assessment, using state-of-the-art machine learning algorithms and interpretability tools.
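The grid search step can be sketched with scikit-learn’s GridSearchCV. The parameter grid below is illustrative only (the paper does not list its exact search space), and synthetic data stands in for the water quality features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in: 9 features, ~61/39 class split echoing the dataset.
X, y = make_classification(n_samples=600, n_features=9,
                           weights=[0.61, 0.39], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)

# Illustrative grid; real tuning would cover more hyperparameters and values.
param_grid = {"n_estimators": [100, 300], "max_depth": [None, 10]}
search = GridSearchCV(ExtraTreesClassifier(random_state=42),
                      param_grid, scoring="f1", cv=3)
search.fit(X_train, y_train)

best_model = search.best_estimator_
test_accuracy = best_model.score(X_test, y_test)
```

Scoring on F1 rather than accuracy during the search is a reasonable choice under class imbalance, since it balances precision and recall for the minority (potable) class.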

3.5. Model Evaluation

The evaluation of the trained models was conducted using various metrics, including: Accuracy: The ratio of correct predictions to total predictions, particularly important for binary classification tasks.
Accuracy = (TP + TN) / (TP + FP + FN + TN)
Precision and Recall: To evaluate the model’s ability to accurately distinguish between potable and non-potable water.
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 Score: The harmonic mean of precision and recall, offering a single comprehensive measure of model performance.
F1 = (2 · Precision · Recall) / (Precision + Recall)
To ensure a robust and fair evaluation under class imbalance, multiple complementary performance metrics are employed. In addition to accuracy, precision, recall, F1-score, and ROC-AUC, this study incorporates the Area Under the Precision–Recall Curve (PR-AUC), Matthews Correlation Coefficient (MCC), and Balanced Accuracy.
PR–AUC is particularly suitable for imbalanced classification problems, as it focuses on the trade-off between precision and recall for the positive (potable water) class and is insensitive to the abundance of true negatives. Balanced Accuracy is defined as the average of sensitivity (true positive rate) and specificity (true negative rate), ensuring equal importance is assigned to both classes. The Matthews Correlation Coefficient provides a single scalar measure that accounts for all elements of the confusion matrix and remains reliable even in highly skewed class distributions [38].
MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))
The MCC ranges from −1 to +1, where a value of +1 indicates perfect classification with complete agreement between predicted and true labels. A value of 0 corresponds to performance equivalent to random guessing, while a value of −1 reflects total disagreement, indicating that the predicted labels are entirely opposite to the ground truth.
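The metric definitions above can be checked numerically. The sketch below implements them directly from confusion-matrix counts (the function name is ours; the counts used in the example are the TP/TN/FP/FN values reported in Section 4.2):

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Compute the evaluation metrics defined above from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    specificity = tn / (tn + fp)
    balanced_accuracy = 0.5 * (recall + specificity)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "f1": f1, "balanced_accuracy": balanced_accuracy, "mcc": mcc}

# Counts from the Extra Trees + SMOTE-ENN confusion matrix (Section 4.2):
# TP = 261, TN = 136, FP = 45, FN = 13, over 455 test samples.
m = classification_metrics(tp=261, tn=136, fp=45, fn=13)
```

With these counts, accuracy evaluates to 397/455 ≈ 0.8725 and recall to 261/274 ≈ 0.9526, consistent with the figures reported for this model.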

3.6. Interpretation of Results

After evaluating the models, the best-performing model is selected for further analysis. The interpretation of results includes: (1) Feature Importance Analysis: Employing techniques like SHAP values to assess the contribution of each feature to the model’s predictions. (2) Visualizations: Creating visual representations of key findings, such as feature importance plots, confusion matrices, and ROC curves, to communicate the results effectively.
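The study uses SHAP values for feature attribution; as a lightweight, dependency-free analogue of step (1), the sketch below ranks features with scikit-learn’s permutation importance instead (the feature names follow the dataset; the synthetic data is illustrative). Permutation importance measures the drop in test accuracy when each feature is shuffled, which, like SHAP, quantifies each parameter’s contribution to the predictions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

feature_names = ["ph", "Hardness", "Solids", "Chloramines", "Sulfate",
                 "Conductivity", "Organic_carbon", "Trihalomethanes", "Turbidity"]

# Synthetic stand-in for the water quality features.
X, y = make_classification(n_samples=500, n_features=9, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0)

model = ExtraTreesClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Permutation importance: average accuracy drop when each feature is shuffled.
result = permutation_importance(model, X_test, y_test, n_repeats=5, random_state=0)
ranking = sorted(zip(feature_names, result.importances_mean),
                 key=lambda item: item[1], reverse=True)
```

On the real dataset the same ranking step would surface the influential parameters the paper identifies (turbidity, solids, sulfate, hardness, pH); with synthetic data the ordering is of course arbitrary.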

4. Experiments

The analysis of the dataset provided critical insights into the factors that influence water quality and potability. The machine learning models employed demonstrated various levels of accuracy in predicting whether water was safe for human consumption. The machine learning models were evaluated based on their ability to accurately predict water potability using the provided physicochemical parameters. The primary models selected for this analysis include decision tree, Extra Trees, AdaBoost, XGBoost, and random forest.

4.1. Comparison of Resampling Methods

Table 3 summarizes the predictive performance of various classifiers under different data balancing strategies, expressed as percentages. In the baseline scenario without resampling, linear models such as logistic regression achieved moderate accuracy (59.4%) but low precision (49.0%) and F1-score (44.7%), reflecting limited capability in detecting minority-class instances. Distance-based models, exemplified by K-Nearest Neighbors, slightly improved both accuracy (61.8%) and F1-score (60.0%), while ensemble approaches, including Random Forest and Extra Trees, reached accuracies above 66% and F1-scores above 64%. Support Vector Classification displayed the most competitive baseline performance, with an accuracy of 67.7% and an F1-score of 63.5%, indicating better discrimination in the unbalanced dataset.
Traditional resampling methods yielded mixed outcomes. Random oversampling enhanced the performance of ensemble classifiers: Random Forest and Decision Tree achieved accuracies of 74.4% and 69.2%, respectively, with corresponding F1-scores exceeding 74.4% and PR–AUC values above 84%. Conversely, random undersampling generally reduced model effectiveness, particularly for linear and distance-based learners, likely due to information loss from majority-class reduction. Synthetic oversampling methods, including SMOTE, ADASYN, and Borderline-SMOTE, provided moderate gains for ensemble learners, improving both accuracy and F1-scores, with Random Forest consistently performing at the top (up to 69.9% with ADASYN and 72.2% with SMOTE).
Table 4 presents the performance of various classifiers under multiple advanced resampling techniques, including Borderline-SMOTE, Tomek Links, SMOTE-ENN, SMOTE-Tomek, and ADASYN-Tomek. Each method was evaluated using seven classifiers: logistic regression, K-Nearest Neighbors, Decision Tree, Support Vector Classifier (SVC), Random Forest, Gradient Boosting, and Extra Trees. Performance metrics include accuracy, precision, recall, F1-score, and ROC-AUC.
Under Borderline-SMOTE, logistic regression achieved moderate performance with 51.0% accuracy and an F1-score of 51.0%, while K-Nearest Neighbors demonstrated improved results with 66.5% accuracy and 66.4% F1-score. Ensemble-based models, particularly Random Forest and Extra Trees, outperformed other classifiers, attaining 73.3% and 77.0% accuracy respectively, with corresponding F1-scores of 73.3% and 77.0%, and ROC-AUC values exceeding 80%.
The application of Tomek Links yielded modest gains for certain classifiers. Logistic Regression reached 58.0% accuracy but exhibited limited F1 performance (43.3%), whereas Random Forest and Extra Trees maintained consistent effectiveness with accuracies of 65.7% and F1-scores of 64.4% and 63.9%, respectively. SVC performance remained competitive, with 65.9% accuracy and 61.9% F1-score.
Hybrid resampling approaches demonstrated substantial improvements. SMOTE-ENN notably enhanced performance for K-Nearest Neighbors (81.1% accuracy, 80.3% F1-score) and Extra Trees (87.3% accuracy, 87.0% F1-score), achieving the highest ROC-AUC values across all classifiers (up to 94.5%). SMOTE-Tomek also yielded favorable results, with Random Forest reaching 74.3% accuracy and 74.3% F1-score, while Extra Trees attained 76.1% for both metrics. Similarly, ADASYN-Tomek improved ensemble classifier performance, with Extra Trees achieving 75.8% accuracy and F1-score above 75.8%, and Random Forest maintaining 72.5% accuracy.
Table 5 presents the comparative performance of various classifiers following different feature selection strategies on the dataset. The baseline feature generation, performed over ten and twenty generations, converged on five features with a best fitness of 62.48%, indicating that this subset achieved a moderate balance between dimensionality reduction and classification potential.
When employing the Genetic Algorithm, five out of nine features were selected. Classifier performance under this configuration varied notably: Logistic Regression achieved an accuracy of 61.04% with an F1-score of 46.27%, while Extra Trees demonstrated comparatively superior performance with an accuracy of 63.89% and an F1-score of 61.06%. Support Vector Classifier and Gradient Boosting achieved intermediate results, highlighting the differential sensitivity of classifiers to the selected features.
The Particle Swarm Optimization (PSO) method selected four features, representing the most aggressive reduction. Under PSO, classification accuracy ranged from 54.93% (Decision Tree) to 62.36% (Logistic Regression), while F1-scores were generally lower compared to the Genetic Algorithm, suggesting that overly aggressive feature reduction can impair predictive balance.
Feature selection based on Mutual Information retained all nine features, yielding performance comparable to the baseline. Logistic Regression maintained 61.04% accuracy, and Extra Trees reached 66.33% accuracy with an F1-score of 62.05%. Similarly, Chi-Square selection preserved the full feature set, producing results consistent with Mutual Information, indicating that these statistical approaches did not eliminate predictive information.

4.2. Performance Evaluation of Extra Trees with SMOTE-ENN

Table 6 shows the evaluation metrics obtained by the Extra Trees classifier when trained using the SMOTE-ENN resampling strategy. The results indicate strong and balanced predictive performance across both potable and non-potable water classes. For the non-potable class, the model achieved a precision of 0.91 and a recall of 0.75, resulting in an F1-score of 0.82 over 181 samples. This outcome suggests a low rate of false positive predictions for non-potable instances, while maintaining a reasonable level of sensitivity. In contrast, the potable class exhibited a precision of 0.85 and a recall of 0.95, corresponding to an F1-score of 0.90 across 274 samples. The higher recall reflects the model’s strong ability to correctly identify potable water samples, which is particularly important for public health applications.
At the aggregate level, the classifier attained an overall accuracy of 0.87 on the test set. The macro-averaged precision, recall, and F1-score were 0.88, 0.85, and 0.86, respectively, indicating consistent performance across classes without dominance from class imbalance. Similarly, the weighted averages closely aligned with the overall accuracy, confirming that the predictive capability remains stable when accounting for class support. These findings demonstrate that the integration of SMOTE-ENN with the Extra Trees model yields reliable and well-balanced classification outcomes for water quality prediction tasks.
Figure 3 illustrates the confusion matrix obtained using the Extra Trees classifier combined with the SMOTE-ENN resampling strategy, from which standard classification metrics were derived. Out of 455 evaluated samples, the model correctly identified 261 potable instances (True Positives, TP) and 136 non-potable instances (True Negatives, TN), while producing 45 False Positives (FP) and 13 False Negatives (FN). This resulted in an overall accuracy of 87.25%, indicating that the majority of samples were classified correctly.
Table 7 summarizes the classification performance of the Extra Trees classifier using multiple evaluation metrics. The model achieves an overall accuracy of 0.882, indicating strong general predictive capability. Notably, the recall score is exceptionally high (0.968), demonstrating the model’s effectiveness in correctly identifying positive instances, which is particularly important in risk-sensitive or safety-critical applications such as water quality monitoring. The precision score of 0.854 and the corresponding F1-score of 0.907 reflect a well-balanced trade-off between false positives and false negatives. Furthermore, the high PR-AUC value of 0.972 suggests robust performance under class imbalance. The balanced accuracy of 0.860 confirms consistent performance across classes, while the Matthews Correlation Coefficient (0.7566) indicates strong overall agreement between predicted and true labels. The optimal decision threshold of 0.35 shows that performance is maximized at a non-default cutoff, underscoring the importance of threshold tuning in practical deployment scenarios.
Figure 4 presents a comprehensive evaluation of the Extra Trees classifier through complementary threshold-independent and threshold-dependent analyses. The Precision–Recall curve demonstrates consistently high precision across a wide range of recall values, yielding a PR-AUC of 0.972, which indicates excellent performance under class imbalance and strong reliability in positive-class identification. The ROC curve further confirms robust discriminative capability, with an ROC-AUC of 0.957 and a clear separation from the random baseline, reflecting high true positive rates at low false positive rates. The metrics-versus-threshold analysis reveals the inherent trade-off between precision and recall, with the F1-score peaking around a mid-range threshold. This behavior is reinforced by the optimal threshold selection plot, which identifies a decision threshold of 0.35 as optimal, satisfying a minimum precision constraint of 0.7 while maintaining high recall.
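The threshold-selection logic can be sketched in a few lines: sweep candidate thresholds, keep those meeting the minimum-precision constraint, and pick the one maximizing the F1-score. The scores and labels below are a hypothetical toy example, not the study's predictions:

```python
def pick_threshold(y_true, scores, min_precision=0.7):
    """Return the threshold maximizing F1 subject to a precision floor."""
    best_t, best_f1 = None, -1.0
    for t in sorted(set(scores)):
        pred = [1 if s >= t else 0 for s in scores]
        tp = sum(p and y for p, y in zip(pred, y_true))
        fp = sum(p and not y for p, y in zip(pred, y_true))
        fn = sum((not p) and y for p, y in zip(pred, y_true))
        if tp == 0:
            continue
        prec, rec = tp / (tp + fp), tp / (tp + fn)
        f1 = 2 * prec * rec / (prec + rec)
        if prec >= min_precision and f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1

# Hypothetical probabilities and ground-truth labels, for illustration only
y = [1, 1, 1, 1, 0, 0, 0, 1, 0, 1]
s = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.35, 0.3, 0.2]
t, f = pick_threshold(y, s)
print(t, round(f, 2))  # 0.6 0.8
```

On real predicted probabilities the same sweep would yield the 0.35 cutoff reported in Table 7; here the toy data happens to select 0.6.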

4.3. XAI Results

To gain a deeper understanding of the relative importance of the input features, a feature importance analysis was conducted using the SHAP method, which quantifies the contribution of each parameter to the model’s predictions. This analysis identified parameters such as turbidity, solids, sulfate, hardness, and pH as among the most influential factors in determining water potability. The robust performance of the optimized Random Forest model, combined with the interpretability provided by this feature importance analysis, underscores the value of integrating machine learning techniques with domain knowledge for water quality assessment. These findings offer valuable insights that can inform targeted interventions and policy decisions to address water pollution and ensure safe water access for all.
Figure 5 illustrates the mean absolute SHAP values for various water quality features, providing insights into their average impact on model output magnitude across two distinct classes. The x-axis represents the mean absolute SHAP values, which quantify the average contribution of each feature to predictions, while the y-axis lists the features, including chloramines, conductivity, organic carbon, trihalomethanes, turbidity, solids, sulfate, hardness, and pH. The comparison between Class 0 and Class 1 highlights the varying importance of these features in influencing model outcomes. Through the analysis of mean absolute values, it becomes clear which parameters have the most significant effect on the model’s predictions, thereby enhancing understanding of the relationships between water quality indicators and the targeted outcomes within the two classes.
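The quantity plotted in Figure 5 is simply the mean absolute SHAP value per feature. A small sketch with hypothetical SHAP values (not the study's actual explainer output) shows the aggregation and ranking step:

```python
# Hypothetical per-sample SHAP values (rows = samples, columns = features);
# in practice these come from a tree explainer run on the fitted model.
features = ["Turbidity", "Solids", "Sulfate", "Hardness", "pH"]
shap_values = [
    [ 0.20, -0.10,  0.05,  0.02, -0.01],
    [-0.30,  0.15, -0.05,  0.01,  0.02],
    [ 0.25, -0.05,  0.10, -0.03,  0.01],
]

# Mean absolute SHAP value per feature: average impact magnitude on the output
n = len(shap_values)
importance = {
    f: sum(abs(row[j]) for row in shap_values) / n
    for j, f in enumerate(features)
}

# Rank features from most to least influential, as in the Figure 5 bar plot
ranking = sorted(importance, key=importance.get, reverse=True)
print(ranking[0], round(importance[ranking[0]], 2))  # Turbidity 0.25
```

Note that taking absolute values before averaging prevents positive and negative contributions from cancelling, which is why the bar plot reflects impact magnitude rather than direction.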
Figure 6 presents a summary of SHAP values, illustrating the impact of various water quality features on model output. Each feature, including trihalomethanes, turbidity, conductivity, chloramines, hardness, organic carbon, solids, pH, and sulfate, is plotted along the y-axis, while the x-axis indicates the SHAP value, which reflects the contribution of each feature to the model’s predictions. The distribution of points over the x-axis illustrates the range of feature values, differentiating between low and high values across the different features.
Feature importance analysis shows that key parameters such as sulfate, pH, hardness, and the concentration of total dissolved solids were significant predictors of water quality. Specifically, pH levels were found to have a strong negative correlation with non-potable classifications, indicating that lower pH values significantly increased the likelihood of water being unsafe. Hardness also emerged as a critical factor, with higher hardness levels correlating with a greater probability of water being classified as non-potable. The concentration of total dissolved solids was similarly influential, with elevated levels associated with compromised water quality.

5. Discussion

This study investigated the effectiveness of machine learning models for water potability classification using a set of physicochemical water quality indicators. The results demonstrate that incorporating imbalance-aware evaluation metrics and strict data handling protocols substantially improves the reliability and interpretability of model performance. In particular, the use of PR–AUC, balanced accuracy, and Matthews Correlation Coefficient (MCC) provides a more faithful assessment than accuracy alone, which can be misleading under class imbalance.
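Both imbalance-aware metrics can be computed directly from confusion-matrix counts. The sketch below uses the counts reported for Figure 3 (default 0.5 cutoff); the slightly higher values in Table 7 are reported alongside the tuned 0.35 threshold, so the derived numbers here differ marginally from those:

```python
import math

# Counts from the Figure 3 confusion matrix (default decision threshold)
TP, TN, FP, FN = 261, 136, 45, 13

# Balanced accuracy: mean of the per-class recalls, robust to class imbalance
sensitivity = TP / (TP + FN)   # recall on the potable class
specificity = TN / (TN + FP)   # recall on the non-potable class
balanced_acc = (sensitivity + specificity) / 2

# Matthews Correlation Coefficient: correlation between predictions and labels
mcc = (TP * TN - FP * FN) / math.sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN)
)

print(round(balanced_acc, 3), round(mcc, 3))  # 0.852 0.734
```

Unlike plain accuracy, both quantities penalize a model that favors the majority class, which is why they give a more faithful picture under imbalance.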
The experimental findings indicate that models optimized using resampling strategies applied exclusively to the training data achieve improved discrimination of the potable class without compromising the integrity of the test set. This confirms that careful separation between training and evaluation data is critical to avoid optimistic bias and information leakage. The observed improvements in MCC and PR–AUC suggest that the proposed framework achieves a better balance between sensitivity and specificity, which is essential in water quality assessment, where false negatives may pose public health risks.
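The leakage-safe protocol described here, split first and resample only the training partition, can be sketched without external libraries. The oversampler below is a simple random duplicator standing in for SMOTE-ENN, purely for illustration:

```python
import random

def train_test_split(X, y, test_frac=0.3, seed=42):
    """Shuffle indices, then hold out the first test_frac as the test set."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    n_test = int(len(X) * test_frac)
    test, train = idx[:n_test], idx[n_test:]
    return ([X[i] for i in train], [y[i] for i in train],
            [X[i] for i in test], [y[i] for i in test])

def oversample_minority(X, y, seed=0):
    """Duplicate minority-class rows until the classes are balanced."""
    rng = random.Random(seed)
    by_class = {c: [i for i, lab in enumerate(y) if lab == c] for c in set(y)}
    n_max = max(len(v) for v in by_class.values())
    X_res, y_res = list(X), list(y)
    for c, members in by_class.items():
        for _ in range(n_max - len(members)):
            i = rng.choice(members)
            X_res.append(X[i])
            y_res.append(c)
    return X_res, y_res

# Toy imbalanced dataset: 12 non-potable (0) and 6 potable (1) samples
X = [[float(i)] for i in range(18)]
y = [0] * 12 + [1] * 6

# Split FIRST, then resample only the training partition;
# the held-out test set keeps its original class distribution.
X_tr, y_tr, X_te, y_te = train_test_split(X, y)
X_bal, y_bal = oversample_minority(X_tr, y_tr)
```

Resampling after the split guarantees that no synthetic or duplicated record derived from a test sample can leak into training, which is the optimistic-bias failure mode discussed above.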
Despite the promising results, this study has several limitations that should be acknowledged. First, the dataset used in this work is derived from publicly available sources and may not fully capture the diversity of water quality conditions encountered across different geographic regions or climatic contexts. Consequently, model generalizability to unseen environments may be limited without further validation on independent datasets. Second, the study relies exclusively on physicochemical parameters and does not incorporate microbiological, biological, or temporal variables, which are often critical for comprehensive water quality assessment. The absence of time-series analysis also limits the ability to capture seasonal variations or long-term trends in water quality.
Most importantly, the proposed models are intended strictly for research and decision-support purposes. They have not been verified or validated as standalone diagnostic or regulatory tools and should not be used for clinical screening, regulatory compliance, or the analysis of blood or other clinical samples unless formally evaluated and approved by appropriate regulatory authorities, such as the U.S. Food and Drug Administration (FDA). Future work should focus on external validation using geographically diverse datasets, integration of additional data modalities, and collaboration with regulatory and domain experts to assess the feasibility of transitioning from decision-support systems to certified operational tools.

6. Conclusions

This study demonstrates the efficacy of machine learning models in predicting water quality and assessing potability based on various physicochemical parameters. The results indicate that the Extra Trees model outperformed other algorithms, achieving a notable accuracy of 0.8725 following hyperparameter optimization. Key parameters such as pH, hardness, and organic carbon content emerged as significant indicators of water potability. The integration of advanced analytical techniques not only enhances predictive capabilities but also provides insights into the underlying factors affecting water quality. Furthermore, even though machine learning has great potential in water quality prediction, there are several limitations related to data quality, model interpretability, environmental complexity, and computational resources. These challenges need to be addressed in future work to improve the practical application and robustness of water quality prediction models.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. The framework employed in the analysis of water quality.
Figure 2. The distribution of water samples.
Figure 3. The confusion matrix of the Extra Trees classifier combined with the SMOTE-ENN.
Figure 4. The PR-AUC, ROC-AUC, and optimal threshold selection for the Extra Trees classifier using SMOTE-ENN.
Figure 5. The SHAP summary of the Random Forest model.
Figure 6. The SHAP summary of the XGB model.
Table 1. Database structure and description of water quality features.

Feature | Unit | Plausible Range | Missing Values
pH | – | 0–14 | 491
Hardness | mg/L | 0–300 | 0
Total Dissolved Solids | ppm | 0–50,000 | 0
Chloramines | ppm | 0–15 | 0
Sulfate | mg/L | 0–500 | 781
Conductivity | μS/cm | 0–2000 | 0
Organic Carbon | ppm | 0–30 | 0
Trihalomethanes | μg/L | 0–120 | 162
Turbidity | NTU | 0–10 | 0
Potability | binary | {0, 1} | 0
Table 2. Water Quality Parameters Descriptive Statistics.

Statistic | pH | Hardness | Solids | Chloramines | Sulfate | Conductivity | Organic Carbon | Trihalomethanes | Turbidity | Potability
Count | 2785 | 3276 | 3276 | 3276 | 2495 | 3276 | 3276 | 3114 | 3276 | 3276
Mean | 7.08 | 196.37 | 22,014.09 | 7.12 | 333.78 | 426.21 | 14.28 | 66.40 | 3.97 | 0.39
Std Dev | 1.59 | 32.88 | 8768.57 | 1.58 | 41.42 | 80.82 | 3.31 | 16.18 | 0.78 | 0.49
Min | 0 | 47.43 | 320.94 | 0.35 | 129 | 181.48 | 2.20 | 0.74 | 1.45 | 0
25% | 6.09 | 176.85 | 15,666.69 | 6.13 | 307.70 | 365.73 | 12.07 | 55.84 | 3.44 | 0
50% | 7.04 | 196.97 | 20,927.83 | 7.13 | 333.07 | 421.88 | 14.22 | 66.62 | 3.96 | 0
75% | 8.06 | 216.67 | 27,332.76 | 8.11 | 359.95 | 481.79 | 16.56 | 77.34 | 4.50 | 1
Max | 14 | 323.12 | 61,227.20 | 13.13 | 481.03 | 753.34 | 28.30 | 124 | 6.74 | 1
Table 3. Comprehensive comparative performance of classifiers under different resampling and hybrid strategies.

Method | Classifier | Accuracy | Precision | Recall | F1 | ROC AUC | PR AUC | Balanced Acc | MCC
Baseline | LogisticRegression | 0.594 | 0.490 | 0.594 | 0.447 | 0.496 | 0.416 | 0.499 | −0.010
Baseline | KNeighborsClassifier | 0.618 | 0.604 | 0.618 | 0.600 | 0.630 | 0.534 | 0.577 | 0.170
Baseline | DecisionTreeClassifier | 0.626 | 0.626 | 0.626 | 0.626 | 0.611 | 0.630 | 0.611 | 0.223
Baseline | SVC | 0.677 | 0.699 | 0.677 | 0.635 | 0.705 | 0.630 | 0.616 | 0.311
Baseline | RandomForestClassifier | 0.662 | 0.657 | 0.662 | 0.641 | 0.709 | 0.619 | 0.618 | 0.268
Baseline | GradientBoostingClassifier | 0.634 | 0.623 | 0.634 | 0.603 | 0.677 | 0.599 | 0.582 | 0.197
Random Oversampling | LogisticRegression | 0.506 | 0.506 | 0.506 | 0.506 | 0.514 | 0.533 | 0.506 | 0.011
Random Oversampling | KNeighborsClassifier | 0.607 | 0.607 | 0.607 | 0.607 | 0.660 | 0.651 | 0.607 | 0.214
Random Oversampling | DecisionTreeClassifier | 0.692 | 0.693 | 0.692 | 0.691 | 0.692 | 0.772 | 0.692 | 0.385
Random Oversampling | SVC | 0.660 | 0.667 | 0.660 | 0.656 | 0.727 | 0.721 | 0.660 | 0.327
Random Oversampling | RandomForestClassifier | 0.744 | 0.746 | 0.744 | 0.744 | 0.804 | 0.840 | 0.744 | 0.490
Random Oversampling | GradientBoostingClassifier | 0.656 | 0.656 | 0.656 | 0.655 | 0.723 | 0.725 | 0.656 | 0.312
Random Undersampling | LogisticRegression | 0.497 | 0.497 | 0.497 | 0.497 | 0.522 | 0.551 | 0.497 | −0.006
Random Undersampling | KNeighborsClassifier | 0.595 | 0.596 | 0.595 | 0.595 | 0.611 | 0.615 | 0.595 | 0.191
Random Undersampling | DecisionTreeClassifier | 0.608 | 0.608 | 0.608 | 0.607 | 0.608 | 0.710 | 0.608 | 0.216
Random Undersampling | SVC | 0.647 | 0.648 | 0.647 | 0.646 | 0.683 | 0.717 | 0.647 | 0.295
Random Undersampling | RandomForestClassifier | 0.649 | 0.649 | 0.649 | 0.649 | 0.699 | 0.719 | 0.649 | 0.298
Random Undersampling | GradientBoostingClassifier | 0.637 | 0.637 | 0.637 | 0.636 | 0.698 | 0.698 | 0.637 | 0.274
SMOTE | LogisticRegression | 0.511 | 0.511 | 0.511 | 0.511 | 0.502 | 0.511 | 0.511 | 0.022
SMOTE | KNeighborsClassifier | 0.629 | 0.631 | 0.629 | 0.628 | 0.681 | 0.673 | 0.629 | 0.260
SMOTE | DecisionTreeClassifier | 0.643 | 0.643 | 0.643 | 0.643 | 0.643 | 0.730 | 0.643 | 0.286
SMOTE | SVC | 0.647 | 0.648 | 0.647 | 0.647 | 0.722 | 0.712 | 0.647 | 0.295
SMOTE | RandomForestClassifier | 0.722 | 0.722 | 0.722 | 0.722 | 0.788 | 0.800 | 0.722 | 0.444
SMOTE | GradientBoostingClassifier | 0.671 | 0.672 | 0.671 | 0.670 | 0.729 | 0.714 | 0.671 | 0.342
ADASYN | LogisticRegression | 0.473 | 0.459 | 0.473 | 0.450 | 0.455 | 0.498 | 0.465 | −0.077
ADASYN | KNeighborsClassifier | 0.635 | 0.639 | 0.635 | 0.627 | 0.668 | 0.656 | 0.629 | 0.269
ADASYN | DecisionTreeClassifier | 0.615 | 0.615 | 0.615 | 0.615 | 0.614 | 0.725 | 0.614 | 0.228
ADASYN | SVC | 0.639 | 0.643 | 0.639 | 0.632 | 0.690 | 0.675 | 0.633 | 0.277
ADASYN | RandomForestClassifier | 0.699 | 0.701 | 0.699 | 0.696 | 0.782 | 0.798 | 0.696 | 0.397
ADASYN | GradientBoostingClassifier | 0.621 | 0.622 | 0.621 | 0.617 | 0.671 | 0.665 | 0.617 | 0.240
Table 4. Comprehensive comparative performance of classifiers under different resampling and hybrid imbalance-handling methods.

Method | Classifier | Accuracy | Precision | Recall | F1 | ROC | Samples Train | Samples Test | Class Distribution
Borderline SMOTE | LogisticRegression | 0.5104 | 0.5104 | 0.5104 | 0.5104 | 0.5074 | 2797 | 1199 | 0: 1998, 1: 1998
Borderline SMOTE | KNeighborsClassifier | 0.6647 | 0.6672 | 0.6647 | 0.6635 | 0.7140 | 2797 | 1199 | 0: 1998, 1: 1998
Borderline SMOTE | DecisionTreeClassifier | 0.6230 | 0.6230 | 0.6230 | 0.6230 | 0.6230 | 2797 | 1199 | 0: 1998, 1: 1998
Borderline SMOTE | SVC | 0.6455 | 0.6481 | 0.6455 | 0.6440 | 0.7130 | 2797 | 1199 | 0: 1998, 1: 1998
Borderline SMOTE | RandomForestClassifier | 0.7331 | 0.7331 | 0.7331 | 0.7331 | 0.8068 | 2797 | 1199 | 0: 1998, 1: 1998
Borderline SMOTE | GradientBoostingClassifier | 0.6397 | 0.6402 | 0.6397 | 0.6394 | 0.6935 | 2797 | 1199 | 0: 1998, 1: 1998
Borderline SMOTE | ExtraTreesClassifier | 0.7698 | 0.7700 | 0.7698 | 0.7698 | 0.8396 | 2797 | 1199 | 0: 1998, 1: 1998
Tomek Links | LogisticRegression | 0.5800 | 0.6967 | 0.5800 | 0.4335 | 0.5130 | 2100 | 900 | 0: 1722, 1: 1278
Tomek Links | KNeighborsClassifier | 0.6189 | 0.6132 | 0.6189 | 0.6135 | 0.6395 | 2100 | 900 | 0: 1722, 1: 1278
Tomek Links | DecisionTreeClassifier | 0.5856 | 0.5883 | 0.5856 | 0.5867 | 0.5794 | 2100 | 900 | 0: 1722, 1: 1278
Tomek Links | SVC | 0.6589 | 0.6816 | 0.6589 | 0.6198 | 0.7058 | 2100 | 900 | 0: 1722, 1: 1278
Tomek Links | RandomForestClassifier | 0.6567 | 0.6531 | 0.6567 | 0.6439 | 0.6878 | 2100 | 900 | 0: 1722, 1: 1278
Tomek Links | GradientBoostingClassifier | 0.6433 | 0.6434 | 0.6433 | 0.6184 | 0.6631 | 2100 | 900 | 0: 1722, 1: 1278
Tomek Links | ExtraTreesClassifier | 0.6567 | 0.6556 | 0.6567 | 0.6390 | 0.6887 | 2100 | 900 | 0: 1722, 1: 1278
SMOTE ENN | LogisticRegression | 0.5956 | 0.4945 | 0.5956 | 0.4609 | 0.5247 | 1060 | 455 | 0: 603, 1: 912
SMOTE ENN | KNeighborsClassifier | 0.8110 | 0.8201 | 0.8110 | 0.8033 | 0.9047 | 1060 | 455 | 0: 603, 1: 912
SMOTE ENN | DecisionTreeClassifier | 0.7385 | 0.7392 | 0.7385 | 0.7388 | 0.7285 | 1060 | 455 | 0: 603, 1: 912
SMOTE ENN | SVC | 0.7846 | 0.7896 | 0.7846 | 0.7764 | 0.8446 | 1060 | 455 | 0: 603, 1: 912
SMOTE ENN | RandomForestClassifier | 0.8308 | 0.8314 | 0.8308 | 0.8279 | 0.9068 | 1060 | 455 | 0: 603, 1: 912
SMOTE ENN | GradientBoostingClassifier | 0.7670 | 0.7736 | 0.7670 | 0.7560 | 0.8241 | 1060 | 455 | 0: 603, 1: 912
SMOTE ENN | ExtraTreesClassifier | 0.8725 | 0.8767 | 0.8725 | 0.8699 | 0.9452 | 1060 | 455 | 0: 603, 1: 912
SMOTE Tomek | LogisticRegression | 0.5035 | 0.5035 | 0.5035 | 0.5035 | 0.5216 | 2636 | 1130 | 0: 1883, 1: 1883
SMOTE Tomek | KNeighborsClassifier | 0.6602 | 0.6651 | 0.6602 | 0.6576 | 0.7223 | 2636 | 1130 | 0: 1883, 1: 1883
SMOTE Tomek | DecisionTreeClassifier | 0.6097 | 0.6101 | 0.6097 | 0.6094 | 0.6097 | 2636 | 1130 | 0: 1883, 1: 1883
SMOTE Tomek | SVC | 0.6442 | 0.6447 | 0.6442 | 0.6440 | 0.7193 | 2636 | 1130 | 0: 1883, 1: 1883
SMOTE Tomek | RandomForestClassifier | 0.7425 | 0.7425 | 0.7425 | 0.7425 | 0.8190 | 2636 | 1130 | 0: 1883, 1: 1883
SMOTE Tomek | GradientBoostingClassifier | 0.6381 | 0.6381 | 0.6381 | 0.6380 | 0.7140 | 2636 | 1130 | 0: 1883, 1: 1883
SMOTE Tomek | ExtraTreesClassifier | 0.7611 | 0.7612 | 0.7611 | 0.7610 | 0.8552 | 2636 | 1130 | 0: 1883, 1: 1883
ADASYN Tomek | LogisticRegression | 0.4879 | 0.4879 | 0.4879 | 0.4878 | 0.4822 | 2806 | 1203 | 0: 1998, 1: 2011
ADASYN Tomek | KNeighborsClassifier | 0.6409 | 0.6416 | 0.6409 | 0.6404 | 0.6725 | 2806 | 1203 | 0: 1998, 1: 2011
ADASYN Tomek | DecisionTreeClassifier | 0.6193 | 0.6195 | 0.6193 | 0.6191 | 0.6192 | 2806 | 1203 | 0: 1998, 1: 2011
ADASYN Tomek | SVC | 0.6309 | 0.6353 | 0.6309 | 0.6277 | 0.6875 | 2806 | 1203 | 0: 1998, 1: 2011
ADASYN Tomek | RandomForestClassifier | 0.7249 | 0.7249 | 0.7249 | 0.7248 | 0.8064 | 2806 | 1203 | 0: 1998, 1: 2011
ADASYN Tomek | GradientBoostingClassifier | 0.6492 | 0.6496 | 0.6492 | 0.6489 | 0.6943 | 2806 | 1203 | 0: 1998, 1: 2011
ADASYN Tomek | ExtraTreesClassifier | 0.7581 | 0.7585 | 0.7581 | 0.7580 | 0.8382 | 2806 | 1203 | 0: 1998, 1: 2011
Table 5. Comparative performance of classifiers under different feature selection strategies.

Category | Classifier | Accuracy | Precision | Recall | F1-Score | ROC | Features Selected | Reduction Rate
Genetic Algorithm | Logistic Regression | 0.6104 | 0.3726 | 0.6104 | 0.4627 | 0.5276 | 5 | 0.44
Genetic Algorithm | K-Nearest Neighbors | 0.5951 | 0.5745 | 0.5951 | 0.5759 | 0.5601 | 5 | 0.44
Genetic Algorithm | Decision Tree | 0.5778 | 0.5780 | 0.5778 | 0.5779 | 0.5564 | 5 | 0.44
Genetic Algorithm | SVC | 0.6531 | 0.6776 | 0.6531 | 0.5807 | 0.6181 | 5 | 0.44
Genetic Algorithm | Random Forest | 0.6236 | 0.6051 | 0.6236 | 0.6007 | 0.6015 | 5 | 0.44
Genetic Algorithm | Gradient Boosting | 0.6409 | 0.6316 | 0.6409 | 0.5869 | 0.6046 | 5 | 0.44
Genetic Algorithm | Extra Trees | 0.6389 | 0.6227 | 0.6389 | 0.6106 | 0.6099 | 5 | 0.44
Particle Swarm Optimization | Logistic Regression | 0.6104 | 0.3726 | 0.6104 | 0.4627 | 0.5302 | 4 | 0.56
Particle Swarm Optimization | K-Nearest Neighbors | 0.5626 | 0.5409 | 0.5626 | 0.5456 | 0.5258 | 4 | 0.56
Particle Swarm Optimization | Decision Tree | 0.5493 | 0.5496 | 0.5493 | 0.5494 | 0.5265 | 4 | 0.56
Particle Swarm Optimization | SVC | 0.6236 | 0.6393 | 0.6236 | 0.5121 | 0.5687 | 4 | 0.56
Particle Swarm Optimization | Random Forest | 0.5982 | 0.5717 | 0.5982 | 0.5685 | 0.5725 | 4 | 0.56
Particle Swarm Optimization | Gradient Boosting | 0.6063 | 0.5691 | 0.6063 | 0.5397 | 0.5760 | 4 | 0.56
Particle Swarm Optimization | Extra Trees | 0.5921 | 0.5637 | 0.5921 | 0.5613 | 0.5661 | 4 | 0.56
Mutual Information | Logistic Regression | 0.6104 | 0.3726 | 0.6104 | 0.4627 | 0.5301 | 9 | 0.00
Mutual Information | K-Nearest Neighbors | 0.6165 | 0.5969 | 0.6165 | 0.5939 | 0.6120 | 9 | 0.00
Mutual Information | Decision Tree | 0.5809 | 0.5793 | 0.5809 | 0.5801 | 0.5575 | 9 | 0.00
Mutual Information | SVC | 0.6684 | 0.6944 | 0.6684 | 0.6086 | 0.6536 | 9 | 0.00
Mutual Information | Random Forest | 0.6623 | 0.6584 | 0.6623 | 0.6236 | 0.6518 | 9 | 0.00
Mutual Information | Gradient Boosting | 0.6684 | 0.6809 | 0.6684 | 0.6174 | 0.6202 | 9 | 0.00
Mutual Information | Extra Trees | 0.6633 | 0.6631 | 0.6633 | 0.6205 | 0.6652 | 9 | 0.00
Chi-Square | Logistic Regression | 0.6104 | 0.3726 | 0.6104 | 0.4627 | 0.5301 | 9 | 0.00
Chi-Square | K-Nearest Neighbors | 0.6165 | 0.5969 | 0.6165 | 0.5939 | 0.6120 | 9 | 0.00
Chi-Square | Decision Tree | 0.5809 | 0.5793 | 0.5809 | 0.5801 | 0.5575 | 9 | 0.00
Chi-Square | SVC | 0.6684 | 0.6944 | 0.6684 | 0.6086 | 0.6536 | 9 | 0.00
Chi-Square | Random Forest | 0.6623 | 0.6584 | 0.6623 | 0.6236 | 0.6518 | 9 | 0.00
Chi-Square | Gradient Boosting | 0.6684 | 0.6809 | 0.6684 | 0.6174 | 0.6202 | 9 | 0.00
Chi-Square | Extra Trees | 0.6633 | 0.6631 | 0.6633 | 0.6205 | 0.6652 | 9 | 0.00
Table 6. Classification report for the Extra Trees classifier using the SMOTE-ENN method.

Class | Precision | Recall | F1-Score | Support
Not potable | 0.91 | 0.75 | 0.82 | 181
Potable | 0.85 | 0.95 | 0.90 | 274
Accuracy | | | 0.87 | 455
Macro avg | 0.88 | 0.85 | 0.86 | 455
Weighted avg | 0.88 | 0.87 | 0.87 | 455
Table 7. The performance metrics of the Extra Trees classifier combined with the SMOTE-ENN.

Metric | Value
Accuracy | 0.882
Precision | 0.854
Recall | 0.968
F1-score | 0.907
PR-AUC | 0.972
Balanced Accuracy | 0.860
Matthews Correlation Coefficient (MCC) | 0.7566
Optimal Decision Threshold | 0.35
Alwateer, M. Water Quality Prediction Using Explainable Machine Learning Models. Sustainability 2026, 18, 1721. https://doi.org/10.3390/su18041721
