A Machine Learning-Based Spatial Risk Mapping for Sustainable Groundwater Management Under Fluoride Contamination: A Case Study of Mastung, Balochistan

Butt, Nabeel Afzal; Muhammad, Khan; Yaseen, Waqass; Bashir, Shahid; Khan, Muhammad Younis; Khan, Asif; Sadique, Umar; Uddin, Saeed; Abdul Manan, Razzaq; Younas, Muhammad; Economou, Nikos

doi:10.3390/su18073328


by Nabeel Afzal Butt ¹, Khan Muhammad ¹,²,³, Waqass Yaseen ⁴, Shahid Bashir ⁵, Muhammad Younis Khan ⁶,*, Asif Khan ⁷, Umar Sadique ¹, Saeed Uddin ⁴, Razzaq Abdul Manan ⁴, Muhammad Younas ¹,² and Nikos Economou ⁸

¹ National Centre of Artificial Intelligence, University of Engineering and Technology, Peshawar 25120, Pakistan

² Department of Mining Engineering, University of Engineering and Technology, Peshawar 25000, Pakistan

³ Department of Earth and Environmental Science, Camborne School of Mines, University of Exeter, Penryn TR10 9FE, UK

⁴ Centre of Excellence in Mineralogy, University of Balochistan, Quetta 87300, Pakistan

⁵ Department of Electrical Engineering, University of Engineering and Technology, Peshawar 25000, Pakistan

⁶ Department of Earth Science, Sultan Qaboos University, Muscat 123, Oman

⁷ Department of Mineral Resource Engineering, Pak-Austria Fachhochschule: Institute of Applied Sciences and Technology, Haripur 22620, Pakistan

⁸ School of Mineral Resources Engineering, Technical University of Crete, Polytechnioupolis, Kounoupidiana, 731 00 Chania, Crete, Greece

* Author to whom correspondence should be addressed.

Sustainability 2026, 18(7), 3328; https://doi.org/10.3390/su18073328

Submission received: 24 December 2025 / Revised: 10 March 2026 / Accepted: 16 March 2026 / Published: 30 March 2026


Abstract

Sustainable groundwater management is essential for water security and human health protection. Fluoride contamination is a serious concern for the sustainable drinking water supply in many parts of Pakistan, including Balochistan, where arid climate conditions and geological formations favour fluoride enrichment. The toxicity of fluoride has resulted in negative health impacts on the local population. Conventional geostatistical techniques are usually ineffective at delineating the nonlinear relationships that affect the distribution of fluoride. This study aims to develop a machine learning-driven spatial modelling framework for classifying the spatial distribution of fluoride contamination in groundwater across the study area. The model helps to understand the spatial variability of fluoride contamination and its controlling factors, which is essential for effective mitigation and early warning systems. Physicochemical parameters were used as predictive features, selected through a unified feature importance framework combining hydrogeochemical analysis, spatial distribution assessment, and ensemble SHAP-based interpretation to identify consistent predictors. Model performance was evaluated using a nested cross-validation framework, followed by validation on an independent geology-informed spatial holdout test set to ensure realistic generalization. Six machine learning models were evaluated: Logistic Regression (LR), Support Vector Classifier (SVC), XGBoost (XGB), Decision Tree (DT), Gaussian Naïve Bayes (GNB), and K-Nearest Neighbours (KNN). SVC demonstrated a high predictive performance. On the independent spatial holdout dataset, SVC achieved an overall accuracy of 0.75 and an area under the receiver operating characteristic curve (AUC) of 0.821. In addition to classification, a human health risk assessment was conducted using chronic daily intake (CDI) and hazard quotient (HQ) calculations for children and adults, identifying several high-risk water supply schemes. The prediction maps successfully delineated high-risk fluoride points across specific areas, offering a tool for sustainable groundwater management. By integrating spatial machine learning mapping and health risk assessment, this study contributes to Sustainable Development Goal 6 (Clean Water and Sanitation) and promotes long-term sustainable planning in water-stressed areas.

Keywords:

sustainable groundwater management; predictive modelling; hydrogeochemistry; SHAP; spatial cross-validation; Support Vector Classifier; chronic daily intake (CDI); hazard quotient (HQ); SDG 6


1. Introduction

Groundwater is the most vital natural resource for ecosystems, agriculture, and human livelihoods. Over 2.5 billion people worldwide rely on groundwater as a primary source of safe drinking water. However, this resource faces significant challenges, particularly contamination. Recent studies show that more than 200 million individuals are exposed to groundwater fluoride contamination [1,2]. Excessive fluoride in groundwater can lead to several severe health disorders, including dental and skeletal fluorosis, crippling fluorosis, and osteosclerosis [3].

A high fluoride concentration in groundwater is mainly due to the natural geological settings. It is generally known that high fluoride in groundwater resources is attributed to the dissolution of fluoride-rich rocks [4]. Clay minerals and micas can also contribute to fluoride levels in groundwater [5] along with alkaline volcanic rocks [6]. Hydrothermal activity is another important natural source, where fluoride-rich fluids are released from host rocks into the surrounding water [7]. However, natural sources are not the only concern. Human activities significantly contribute to fluoride contamination as well. Industrial processes, particularly aluminum smelting [8] and coal processing [9], release substantial amounts of fluoride into groundwater systems. Agricultural practices also play a role, as the widespread use of fertilizers can introduce fluoride into groundwater supplies [10].

Machine learning (ML) has become a transformative analytical tool for understanding and predicting fluoride contamination. Conventional geostatistical methodologies often rely on the assumption of linear dependencies between variables, whereas the hydrogeochemical processes governing fluoride dynamics are complex and nonlinear. Thus, researchers all over the world are increasingly relying on machine learning algorithms to capture complex environmental patterns. In a global assessment of fluoride contamination conducted by [1], a Random Forest model was used to produce a global hazard map of fluoride contamination. In China, a nationwide map of geogenic fluoride contamination was developed using artificial neural network models [11]. Similarly, in Turkey, a study conducted by [12] used machine learning and deep learning models to predict groundwater fluoride using seasonal observations.

In Pakistan, fluoride contamination is a major concern affecting public health across multiple provinces. More than 25 million people in Pakistan are at risk from high fluoride concentrations in their groundwater [13]. In Punjab province, fluoride contamination in groundwater and the associated health risk were reported by [14]. Similarly, ref. [15] investigated the geochemical processes driving fluoride enrichment in unconfined aquifers. Fluoride occurrence has also been reported around an active fluorite mining operation in Pakistan [9]. In a recent study, [16] created a national-level fluoride contamination map using ML. This map is very useful for identifying broad vulnerable zones, but its major drawback is that national-scale prediction relies on broad environmental covariates, which do not fully capture the local hydrogeochemical controls and community-level hotspot variation needed for local planning.

Groundwater fluoride pollution has been reported in different parts of Balochistan. Ref. [17] assessed fluoride in drinking water sources and correlated the variation in fluoride with physicochemical parameters. Other studies focused on Quetta and rural areas, using descriptive statistics and spatial mapping to determine the high-fluoride zones [18]. More recent work has extended this by using health risk indices, including the hazard quotient (HQ), and pollution indices to compare risk across districts [19]. In general, the available literature shows that fluoride contamination in Balochistan is widespread and diffused throughout the region. Nevertheless, the majority of past studies have been based primarily on traditional hydrochemical interpretation, descriptive statistics, and simple spatial mapping. These methods are useful for reporting fluoride occurrence, but they are weak at incorporating multiple interacting predictors and the nonlinear relationships that are vital to predicting hotspots reliably in arid and semi-arid aquifers. Machine learning is therefore relevant to this field of study, as it enhances fluoride susceptibility mapping and supports local planning in communities by defining high-risk areas more precisely.

This study develops and tests machine learning models to demarcate groundwater zones where the concentration of fluoride is high. Physicochemical parameters are used as predictor variables, and statistical preprocessing is applied to enhance model robustness and prediction performance. Machine learning algorithms are used to produce high-resolution susceptibility maps of fluoride contamination. The resulting maps provide an evidence base for identifying high-risk areas and for informing targeted mitigation and groundwater management approaches. In addition to predictive modelling, a human health risk assessment was conducted using chronic daily intake (CDI) and hazard quotient (HQ) indices. The integration of machine learning-based spatial prediction with health risk metrics provides a sustainability-oriented framework that facilitates evidence-based intervention strategies aligned with the objectives of Sustainable Development Goal 6 (Clean Water and Sanitation).

2. Study Area and Geology

The study site (Mastung District) lies in the North-East of Balochistan province, Pakistan, at latitude 29.7904° and longitude 66.8334°. Geologically, Mastung is part of the Pishin Basin [20] and is situated within the Kirthar Fold and Thrust Belt. Tectonically, the region lies at the convergence of multiple geological faults such as the Chaman Transform Fault, which forms the western boundary between the Indian and Eurasian plates. This fault is responsible for repeated seismic activity in the local region [21].

The stratigraphic sequence in the study region comprises the Shirnab Formation (Early to Middle Jurassic), Chiltan Limestone (Middle to Late Jurassic), Sembar Formation (Early Cretaceous), Goru Formation (Middle Cretaceous), Parh Limestone (Middle to Late Cretaceous), Fort Munro Formation (Late Cretaceous), Kirthar Formation (Eocene), and Bostan Formation (Pleistocene).

Hydrogeologically, the Mastung aquifer system includes two main parts: an upper part, which predominantly comprises unconsolidated alluvial deposits consisting of layers of silt and clay with patches of gravel and sand, and a lower part made of consolidated limestone bedrock [22].

3. Materials and Methods

3.1. Groundwater Sampling Analysis

Groundwater samples (n = 87) were collected from community tube wells of the Mastung District (Figure 1). Sampling was conducted during a single field campaign; therefore, temporal replicates across contrasting seasons are not available in the current dataset. Sampling density is highest in the centre of the study area, reflecting dense human settlements, increased agricultural activities, and groundwater abstraction in the Mastung District. Conversely, the peripheral areas are sparsely populated and have limited access to tube wells, which constrained sampling density. Most of the samples were obtained from Jurassic metamorphic and sedimentary rocks, and fewer from Paleogene sedimentary rocks and Jurassic–Triassic units, owing to the low availability of wells in these units. The groundwater samples were collected in pre-cleaned, acid-washed polyethene bottles (1000 mL). The physical parameters (pH, electrical conductivity (EC), total dissolved solids (TDS), and turbidity) were measured on-site using calibrated portable equipment: a Lovibond SinsoDirect (Lovibond, Dortmund, Germany) and a JENWAY 6035 (Jenway, London, UK). Two sets of sample bottles were prepared at each location: one set was unacidified and used for physical and anion analysis, and the other set was acidified for cation analysis. Acidification was performed by adding 1.5 mL of concentrated HNO₃ per 1000 mL sample to achieve pH < 2. A Global Positioning System (GPS) receiver (H.C. Garmin, Reston, VA, USA) was used to determine the coordinates of each sampling point. All samples were transferred to the laboratory in coolers maintained at 4 °C and stored in refrigerators at 4 °C for further analysis. The analysis was performed within the holding time limits suggested by EPA and APHA (see Table S2 in Supplementary Materials).

Nitrate (NO₃⁻), sulfate (SO₄²⁻), and fluoride (F⁻) concentrations in groundwater were determined using a HACH DR 2800 Spectrophotometer (HACH, Loveland, CO, USA). The concentrations of chloride (Cl⁻) and bicarbonate (HCO₃⁻) were determined using the Mohr method and titration, respectively. Other chemical elements such as magnesium (Mg²⁺), calcium (Ca²⁺), iron (Fe), and sodium (Na⁺) were analyzed using Inductively Coupled Plasma Mass Spectrometry (ICP-MS) from PerkinElmer Technologies, Shelton, CT, USA, featuring a detection limit of 0.01 µg/L.

A comprehensive quality assurance and quality control (QA/QC) protocol was applied during sample collection and analysis. Field blanks (n = 9, 10.3% of samples) prepared using deionized water showed that all the measured parameters were below the limit of detection, confirming the absence of contamination in the collection and handling procedures. Laboratory duplicates (n = 9, 10.3%) showed a mean analytical precision of 96.7%, with a range of 94.9% to 99.35% (Table S3). The matrix spikes showed a recovery of 95% to 104% (Table S4), suggesting no significant matrix interference.

The limit of detection (LOD) and limit of quantification (LOQ) were determined from repeated blank measurements as LOD = 3σ and LOQ = 10σ, where σ is the standard deviation of the replicated blank. As a result, the LOD and LOQ for fluoride were determined as 0.02 mg/L and 0.05 mg/L, respectively. The LOD and LOQ values of other parameters are provided in the Supplementary Materials (Table S5).
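The 3σ and 10σ rules can be illustrated with a minimal Python sketch. The blank readings below are hypothetical values chosen for illustration and are not data from this study; the use of the sample standard deviation is also an assumption, since the text does not state which estimator was applied.

```python
import statistics


def detection_limits(blank_readings):
    """Return (LOD, LOQ) as 3*sigma and 10*sigma of replicated blank readings."""
    # Assumption: sigma is the sample standard deviation (n - 1 denominator).
    sigma = statistics.stdev(blank_readings)
    return 3 * sigma, 10 * sigma


# Hypothetical blank readings in mg/L, for illustration only.
blanks = [0.010, 0.018, 0.004, 0.015, 0.007, 0.020, 0.012]
lod, loq = detection_limits(blanks)
print(f"LOD = {lod:.3f} mg/L, LOQ = {loq:.3f} mg/L")
```

By construction the two limits always differ by a factor of 10/3, so any rounding of reported values should be kept in mind when comparing them.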

3.2. Hydrogeochemical Analysis

3.2.1. Saturation Index Computation

PHREEQC (version 3, US Geological Survey, Reston, VA, USA) [23] was used to compute saturation indices (SIs) of fluorite, calcite, and dolomite to analyze mineral equilibrium conditions. Each SI was categorized as undersaturated (SI < 0), supersaturated (SI > 0), or near equilibrium (|SI| ≤ 0.1).
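As a rough illustration of how a saturation index is interpreted, the sketch below uses the standard definition SI = log₁₀(IAP/Ksp), where IAP is the ion activity product and Ksp the solubility product. It does not reproduce PHREEQC's speciation calculations; the function names and the symmetric ±0.1 tolerance band around equilibrium are assumptions made for this example.

```python
import math


def saturation_index(iap, ksp):
    """SI = log10(ion activity product / solubility product)."""
    return math.log10(iap / ksp)


def classify_si(si, tol=0.1):
    """Classify a saturation index; tol is an assumed near-equilibrium band."""
    if abs(si) <= tol:
        return "near equilibrium"
    return "supersaturated" if si > 0 else "undersaturated"


# Illustrative values only: an IAP two orders of magnitude below Ksp.
si = saturation_index(iap=1e-12, ksp=1e-10)
print(round(si, 2), classify_si(si))
```

A negative SI means the water can still dissolve the mineral, which is the condition of interest for fluorite as a fluoride source.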

3.2.2. Hydrochemical Facies Classification

A Piper trilinear diagram [24] was used to classify hydrochemical facies. Cations (Ca²⁺, Mg²⁺, and Na⁺ + K⁺) and anions (HCO₃⁻, SO₄²⁻, and Cl⁻) were converted to milliequivalent percentages and plotted in the cation–anion triangles and central diamond. The resulting Piper diagram was used to identify dominant water types and hydrochemical evolution.
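The conversion to milliequivalent percentages can be sketched as follows. Concentrations in mg/L are divided by each ion's equivalent weight (molar mass divided by charge) to obtain meq/L and are then normalized within the cation and anion groups separately. The sample values are hypothetical, and the rounded equivalent weights are standard chemistry constants rather than values taken from this study.

```python
# Equivalent weights in mg/meq (molar mass / |charge|), rounded.
EQ_WEIGHT = {"Ca": 20.04, "Mg": 12.15, "Na": 22.99, "K": 39.10,
             "HCO3": 61.02, "SO4": 48.03, "Cl": 35.45}

# Grouping used by the Piper triangles (Na and K share one apex).
CATION_GROUPS = {"Ca": ["Ca"], "Mg": ["Mg"], "Na+K": ["Na", "K"]}
ANION_GROUPS = {"HCO3": ["HCO3"], "SO4": ["SO4"], "Cl": ["Cl"]}


def meq_percent(sample_mg_l, groups):
    """Convert mg/L to meq/L and return each group's share in percent."""
    meq = {name: sum(sample_mg_l[ion] / EQ_WEIGHT[ion] for ion in ions)
           for name, ions in groups.items()}
    total = sum(meq.values())
    return {name: 100 * value / total for name, value in meq.items()}


# Hypothetical sample (mg/L), for illustration only.
sample = {"Ca": 60.0, "Mg": 30.0, "Na": 80.0, "K": 5.0,
          "HCO3": 300.0, "SO4": 90.0, "Cl": 70.0}
cation_pct = meq_percent(sample, CATION_GROUPS)
anion_pct = meq_percent(sample, ANION_GROUPS)
print(cation_pct)
print(anion_pct)
```

In this hypothetical sample no cation exceeds 50% while bicarbonate does, which is the kind of pattern described as mixed-cation, bicarbonate-type water.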

3.2.3. Ionic Ratio Analysis

Ion ratio plots were generated to determine the most important geochemical processes. The Na⁺/Cl⁻ relationship helped to assess the contribution of halite/evaporite dissolution against other sources of Na⁺ such as silicate weathering or cation exchange. The (Ca²⁺ + Mg²⁺)/HCO₃⁻ plot was used to assess carbonate weathering, buffering effects, and ion-exchange processes. Moreover, the Ca²⁺–SO₄²⁻ relationship was used to test the dissolution of gypsum/anhydrite against the 1:1 reference line, while the Cl⁻ versus NO₃–N relationship was used as a pollution indicator to evaluate whether chloride enrichment co-occurs with elevated nitrate.

3.3. Model Development

3.3.1. Data Processing and Feature Engineering

We split the data into two classes according to the World Health Organization (WHO) guideline limit of 1.5 mg/L fluoride. Samples with fluoride concentrations at or below 1.5 mg/L were labelled as 0, indicating that they were within the safe range. Samples with a fluoride concentration higher than the WHO threshold were labelled as 1.
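A minimal sketch of this labelling step is given below. The function and constant names are illustrative, and the concentrations are example values; a sample exactly at 1.5 mg/L is assigned to class 0, consistent with the F ≤ 1.5 mg/L definition of the low-fluoride class used in the power analysis.

```python
WHO_FLUORIDE_LIMIT = 1.5  # mg/L, WHO guideline value for fluoride


def label_fluoride(concentrations_mg_l, limit=WHO_FLUORIDE_LIMIT):
    """Return 1 for samples above the limit, otherwise 0."""
    return [1 if c > limit else 0 for c in concentrations_mg_l]


# Example concentrations (mg/L), for illustration only.
labels = label_fluoride([0.04, 1.03, 1.5, 1.51, 2.30])
print(labels)
```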

3.3.2. Data Splitting

To account for spatial autocorrelation and geological continuity between training and testing data, a spatial holdout validation strategy was adopted using fishnet-based spatial partitioning built in GIS from geology data (Figure 1). Training and testing were performed using sixty-three and twenty-four samples, respectively. Spatial blocks were defined based on surface lithological units, which are proxies for the regional hydrogeological domains that determine groundwater chemistry through recharge pathways and rock–water interactions. The bottom spatial block, which is dominated by Jurassic and Triassic rocks and contains Cretaceous sedimentary rocks, was used as the independent test dataset, while the training dataset was mainly composed of samples from Jurassic metamorphic and sedimentary rocks. This lithologically informed spatial separation ensures that model evaluation reflects generalization across distinct hydrogeochemical settings rather than reliance on spatial proximity or formation-specific signatures.
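The idea of a block-based holdout can be sketched in a few lines. This is a simplified illustration only: it assigns points to square fishnet cells and holds out whole rows of cells, whereas the partition in this study was drawn in GIS with reference to lithological units. The function names, cell size, and coordinates are all hypothetical.

```python
import math


def fishnet_cell(x, y, x_min, y_min, cell_size):
    """Return the (row, col) index of the fishnet cell containing (x, y)."""
    col = math.floor((x - x_min) / cell_size)
    row = math.floor((y - y_min) / cell_size)
    return row, col


def spatial_holdout(points, x_min, y_min, cell_size, test_rows):
    """Split point indices so that whole fishnet rows form the test set."""
    train, test = [], []
    for idx, (x, y) in enumerate(points):
        row, _ = fishnet_cell(x, y, x_min, y_min, cell_size)
        (test if row in test_rows else train).append(idx)
    return train, test


# Hypothetical projected coordinates in km, for illustration only.
points = [(2.0, 1.0), (7.5, 3.0), (4.0, 12.0), (9.0, 18.0), (1.0, 25.0)]
train_idx, test_idx = spatial_holdout(points, x_min=0.0, y_min=0.0,
                                      cell_size=10.0, test_rows={0})
print(train_idx, test_idx)
```

Holding out entire blocks, rather than random samples, keeps neighbouring wells from appearing on both sides of the split.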

3.3.3. Data Scaling: Z-Score Normalization

Data scaling [25] was applied so that all features contribute on a comparable numerical scale. Python's scikit-learn library (version 1.8.0) was used to apply Z-score normalization based on the mean and standard deviation (Equation (1)). Because the features vary in magnitude and range, this step prevents the model from being biased towards features with larger magnitudes.

Z-score normalization’s mathematical form is as follows:

Z = \frac{X - μ}{σ}

(1)

where Z = scaled value, X = original value, μ = mean, and σ = standard deviation.
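Equation (1) can be illustrated with a short standard-library sketch. The mean and standard deviation are estimated on the training values only and then reused for the test values, which is one common way to avoid leaking test information into the scaler. The population standard deviation is used here, the convention followed by scikit-learn's StandardScaler. The EC values are hypothetical.

```python
import statistics


def fit_zscore(train_values):
    """Estimate mean and population standard deviation from training data."""
    mu = statistics.fmean(train_values)
    sigma = statistics.pstdev(train_values)
    return mu, sigma


def apply_zscore(values, mu, sigma):
    """Apply Z = (X - mu) / sigma with previously fitted parameters."""
    return [(x - mu) / sigma for x in values]


# Hypothetical EC values in µS/cm, for illustration only.
train_ec = [450.0, 620.0, 810.0, 1200.0]
test_ec = [700.0, 950.0]

mu, sigma = fit_zscore(train_ec)
train_z = apply_zscore(train_ec, mu, sigma)
test_z = apply_zscore(test_ec, mu, sigma)  # reuses the training mu and sigma
print(train_z, test_z)
```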

3.3.4. Handling Class Imbalance and Spatial Validation

Synthetic oversampling methods, including SMOTE and ADASYN (imbalanced-learn library, version 0.14.1), were considered during the initial analysis. They were not included in the final modelling framework because they might create spatially implausible hydrochemical samples and introduce information leakage among spatially correlated observations. Instead, class imbalance was addressed through model selection and class weighting.

All resampling and model tuning procedures were restricted to the training data, and final model performance was evaluated on a spatially independent holdout set derived from geology-informed fishnet partitioning. This strategy ensures that the performance reported is based on real spatial generalization and not artefacts caused by the generation of synthetic data.
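One common form of class weighting is the "balanced" heuristic, in which each class weight is n / (k × n_c) for n samples, k classes, and n_c samples in class c. The sketch below assumes this heuristic for illustration; the text does not specify the exact weighting scheme used. The class counts correspond to the full dataset (65 low-fluoride and 22 high-fluoride samples) rather than to the training split.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count)
            for cls, count in counts.items()}


# Class counts of the full dataset, used here only as an example.
labels = [0] * 65 + [1] * 22
weights = balanced_class_weights(labels)
print(weights)
```

The minority (high-fluoride) class receives the larger weight, so misclassifying a contaminated sample costs more during training without any synthetic samples being generated.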

3.3.5. Feature Selection

An integrated feature importance framework was adopted to ensure both predictive robustness and hydrogeochemical interpretability. The framework combined hydrogeochemical analysis, spatial distribution assessment, and SHAP (SHapley Additive exPlanations) [26] values estimated from multiple machine-learning classifiers using cross-validated training data. This integrated approach ensures that the selected features are not only predictive but also physically interpretable within the hydrogeochemical context of fluoride mobilization. Spatial distribution maps of all features were generated using ordinary kriging interpolation in ArcGIS (version 10.7.1, Esri, Redlands, CA, USA) (Geostatistical Analyst) to further refine the feature selection criteria. Features showing strong spatial correlation with the target variable were considered the most important, as they could provide useful information for the prediction process.

Sampling density was computed with the Point Density tool in ArcGIS, using a circular neighbourhood (radius of 10 km) to estimate the number of sampling points per square kilometre. Local variability, a measure of the spatial heterogeneity of fluoride concentration, was computed by applying Focal Statistics in ArcGIS to the interpolated fluoride surface.

3.4. Model Selection and Training

3.4.1. Machine Learning Models

This study uses the Support Vector Machine (SVM) as its primary classifier. K-Nearest Neighbours (KNN), Logistic Regression, Extreme Gradient Boosting, Decision Tree, and Gaussian Naïve Bayes classifiers were also evaluated for comparison purposes. Detailed algorithmic descriptions and mathematical formulations are provided in the Supplementary Materials.

3.4.2. Support Vector Machine (SVM)

The SVM learning algorithm was developed by [27] for binary classification. The fundamental concept of SVM is to identify an optimal decision boundary (hyperplane) that separates the d-dimensional data into two classes [28]. SVM chooses the hyperplane that maximizes the margin, which is the distance between the hyperplane and the nearest data points from each class (the support vectors) [29]. Mathematically, the hyperplane is represented as:

w \cdot x + b = 0

(2)

where w is the weight vector perpendicular to the hyperplane, x is the input feature vector, and b is the bias term that shifts the hyperplane.

SVM was used in this study, since it is capable of handling high-dimensional data, can efficiently discriminate classes that are hard to divide by a straight line, and provides an accurate classification approach that reduces the risk of overfitting relative to other models.

3.4.3. Nested Cross-Validation and Hyperparameter Optimization

Hyperparameter optimization was performed within a nested cross-validation framework to reduce overfitting and optimistic bias. In particular, a five-fold stratified outer cross-validation loop was used to estimate the generalization performance of the model. Within each outer training fold, a three-fold stratified inner cross-validation was used exclusively for hyperparameter tuning through grid search.

In the inner loop, models were selected by maximizing the Average Precision (AP) score, which is appropriate for imbalanced classification. The best hyperparameter configuration found in the inner loop was then tested on the corresponding outer validation fold.

After the nested cross-validation, the final model was retrained using the most suitable hyperparameters obtained. The tuned model was then tested on an independent spatial holdout test set, which was not used in model training or hyperparameter optimization.
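A 5 × 3 nested cross-validation of this kind can be sketched with scikit-learn as follows. This is an illustrative outline under stated assumptions, not the study's code: the data are synthetic stand-ins generated with make_classification, and the hyperparameter grid, random seeds, and the use of class_weight="balanced" are choices made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in for the 63 training samples (not study data).
X, y = make_classification(n_samples=63, n_features=10, n_informative=5,
                           weights=[0.75, 0.25], random_state=0)

# Scaling sits inside the pipeline so it is refitted on every training fold.
pipe = make_pipeline(StandardScaler(), SVC(class_weight="balanced"))
param_grid = {"svc__C": [0.1, 1, 10], "svc__kernel": ["linear", "rbf"]}  # assumed grid

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Inner loop: grid search that maximizes Average Precision.
search = GridSearchCV(pipe, param_grid, scoring="average_precision", cv=inner_cv)

# Outer loop: each fold reruns the grid search on its own training portion.
outer_scores = cross_val_score(search, X, y, scoring="average_precision", cv=outer_cv)
print(outer_scores.mean(), outer_scores.std())

# Final model: refit the search on all training data before the holdout test.
search.fit(X, y)
print(search.best_params_)
```

Because the grid search is itself the estimator passed to the outer loop, the outer validation folds never influence hyperparameter selection.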

3.4.4. Performance Evaluation

The confusion matrix, the receiver operating characteristic (ROC) curve, and the area under the curve (AUC) were used to assess the performance of the trained machine learning model on the test data.

The confusion matrix is the main tool for the evaluation of errors in classification problems [30]. Its four fundamental parts are True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN) [31]. TP is the number of samples correctly identified as contaminated (high fluoride concentration in groundwater), while TN is the number of samples correctly identified as uncontaminated (fluoride concentration below the permissible limit).

Similarly, FP is the number of samples that were incorrectly classified as contaminated even though they were not, and FN is the number of samples incorrectly classified as uncontaminated even though they were contaminated.

The performance metrics of an algorithm are accuracy, precision, recall, and F1 score [32], determined based on the previously described TP, TN, FP, and FN values.

The accuracy of an algorithm is defined as the ratio of correctly classified samples (TP + TN) to the total number of samples (TP + TN + FP + FN).

\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}

(3)

The precision of an algorithm is defined as the ratio of correctly identified contaminated samples (TP) to the total samples predicted as contaminated (TP + FP).

\mathrm{Precision} = \frac{TP}{TP + FP}

(4)

Recall is defined as the ratio of correctly identified contaminated samples (TP) to the total number of actual contaminated samples (TP + FN).

\mathrm{Recall} = \frac{TP}{TP + FN}

(5)

The F1 score represents the balance between precision and recall by calculating their harmonic mean.

\mathrm{F1\ Score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}

(6)
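Equations (3) to (6) translate directly into code. In the sketch below, the confusion-matrix counts are hypothetical and do not represent the results of this study; the zero-division guards are an added convenience, not part of the definitions.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts for a 24-sample test set, for illustration only.
metrics = classification_metrics(tp=6, tn=14, fp=3, fn=1)
print(metrics)
```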

The ROC curve, which describes the discrimination accuracy of a prediction model [33], was used to further evaluate model performance. The ROC plot shows the trade-off between the True Positive Rate (TPR, recall) and the False Positive Rate (FPR, 1 − specificity) at multiple classification thresholds [34]. The TPR represents the proportion of correctly identified contaminated groundwater samples, whereas the FPR represents the proportion of uncontaminated samples that were incorrectly classified as contaminated. The TPR and FPR are calculated as follows:

\mathrm{TPR} = \frac{TP}{TP + FN}

(7)

\mathrm{FPR} = \frac{FP}{FP + TN}

(8)

A classifier with better discriminatory power produces an ROC curve closer to the upper left corner of the plot, indicating a lower FPR and a higher sensitivity. The area under the curve (AUC) serves as a single-value measure of the overall detection capability of a classifier [35]. According to [36], an AUC of 0.5 indicates random forecasts, whereas an AUC of 1 indicates a perfect forecast.
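The construction of an ROC curve from Equations (7) and (8) can be sketched by sweeping the decision threshold over the predicted scores and integrating with the trapezoidal rule. The labels and scores below are made-up values for illustration, and the function names are not taken from any library.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs obtained by lowering the decision threshold."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc_trapezoid(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Made-up labels (1 = contaminated) and classifier scores.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
auc = auc_trapezoid(roc_points(y_true, scores))
print(round(auc, 4))
```

In this toy example eight of the nine contaminated/uncontaminated pairs are ranked correctly, so the AUC equals 8/9, which shows the usual reading of AUC as the probability that a contaminated sample receives a higher score than an uncontaminated one.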

3.4.5. Power Analysis (ROC–AUC)

A formal ROC–AUC power analysis was conducted to assess whether the available sample size can detect a target discrimination level of AUC ≥ 0.90 at α = 0.05 (two-sided). Samples were classified as high fluoride (F > 1.5 mg/L) and low fluoride (F ≤ 1.5 mg/L), yielding n = 87 observations (n₁ = 22 cases; n₀ = 65 controls). Power was computed for H₀: AUC = 0.50 versus H₁: AUC = 0.90 using a large-sample normal approximation, with the AUC standard error estimated with the Hanley–McNeil method [37]. The achieved power exceeded 0.99 (≈1.00), indicating that the dataset is adequately powered to detect AUC ≥ 0.90 at the specified significance level [38].
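The calculation can be approximated with the sketch below, which uses the Hanley–McNeil variance formula and a normal approximation. The exact power formula used in the study is not stated, so the form shown here (standard error under H₀ for the critical value, standard error under H₁ for the power, and the negligible lower rejection tail ignored) is an assumption.

```python
import math
from statistics import NormalDist


def hanley_mcneil_se(auc, n_pos, n_neg):
    """Standard error of the AUC following Hanley and McNeil."""
    q1 = auc / (2 - auc)
    q2 = 2 * auc ** 2 / (1 + auc)
    variance = (auc * (1 - auc)
                + (n_pos - 1) * (q1 - auc ** 2)
                + (n_neg - 1) * (q2 - auc ** 2)) / (n_pos * n_neg)
    return math.sqrt(variance)


def auc_power(auc0, auc1, n_pos, n_neg, alpha=0.05):
    """Approximate power of a two-sided test of H0: AUC = auc0 vs H1: AUC = auc1."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    se0 = hanley_mcneil_se(auc0, n_pos, n_neg)
    se1 = hanley_mcneil_se(auc1, n_pos, n_neg)
    return NormalDist().cdf((abs(auc1 - auc0) - z_crit * se0) / se1)


power = auc_power(0.50, 0.90, n_pos=22, n_neg=65)
print(f"power = {power:.4f}")
```

With 22 cases and 65 controls, the gap between 0.50 and 0.90 is several standard errors wide, which is why the power is effectively 1.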

3.5. Human Health Risk Assessment

Chronic daily intake (CDI) and hazard quotient (HQ) are useful tools for assessing the health risks of fluoride contamination in adults and children. The CDI and HQ can be calculated by Equations (9) and (10), respectively [39]. The values of the parameters used in the CDI and HQ calculations are shown in Table 1.

CDI = \frac{C \times IR \times EF \times ED}{BW \times AT}

(9)

HQ = \frac{CDI}{RfD}

(10)
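Equations (9) and (10) can be illustrated with a worked example. In the usual formulation, C is the fluoride concentration (mg/L), IR the ingestion rate (L/day), EF the exposure frequency (days/year), ED the exposure duration (years), BW the body weight (kg), AT the averaging time (days), and RfD the oral reference dose (mg/kg/day). The parameter values below are placeholders for illustration and are not the Table 1 values; the RfD of 0.06 mg/kg/day is a commonly cited value for fluoride and is assumed here.

```python
def chronic_daily_intake(c, ir, ef, ed, bw, at):
    """CDI (mg/kg/day) = (C * IR * EF * ED) / (BW * AT)."""
    return (c * ir * ef * ed) / (bw * at)


def hazard_quotient(cdi, rfd):
    """HQ = CDI / RfD; values above 1 suggest potential non-carcinogenic risk."""
    return cdi / rfd


# Placeholder adult exposure parameters (assumed, not from Table 1).
adult = dict(ir=2.0, ef=365, ed=30, bw=70.0, at=365 * 30)

cdi = chronic_daily_intake(c=1.5, **adult)  # water at 1.5 mg/L fluoride
hq = hazard_quotient(cdi, rfd=0.06)         # assumed RfD in mg/kg/day
print(f"CDI = {cdi:.4f} mg/kg/day, HQ = {hq:.2f}")
```

Because children have a lower body weight relative to their water intake, the same concentration typically yields a higher CDI and HQ for children than for adults.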

The overall workflow of the machine learning methodology is illustrated in Figure 2.

4. Results

The fluoride concentration in the sampled groundwater ranges from 0.04 to 2.30 mg/L, with a mean value of 1.03 mg/L. A statistical summary of the physicochemical parameters examined in the groundwater samples is presented in Table 2. The elevated concentrations of other parameters, such as hardness (up to 780 mg/L) and bicarbonate (up to 450 mg/L), suggest that carbonate mineral dissolution and rock–water interactions are the major processes controlling groundwater quality in the study area. The distribution of fluoride concentrations shows a clear imbalance in the dataset. As illustrated in Figure 3, the histogram on the left has a mean value of 1.03 mg/L and a skewness of 0.52, and most samples fall below the World Health Organization guideline value, that is, within the uncontaminated class.

The saturation indices revealed that fluorite is undersaturated in all groundwater samples (Figure 4A,B), with a mean value of −1.38 and a range of −4.03 to −0.28, meaning that the dissolution of fluorite is thermodynamically favoured throughout the study area. In contrast, calcite and dolomite are mainly supersaturated (calcite: 86.4%; dolomite: 84.1%), and only a small proportion of the samples were near equilibrium (calcite: 10.2%; dolomite: 3.4%).

The Piper diagram (Figure 5) summarizes the major-ion composition and hydrochemical facies of groundwater in the study area. In the cation triangle, most samples lie within the no-dominant cation category, which means that none of Ca²⁺, Mg²⁺, or Na⁺ + K⁺ individually exceeds ~50% of total cations (meq%). This mixed-cation chemistry reflects the combined effect of various geochemical processes, such as carbonate weathering, alkali contributions (e.g., silicate weathering), or cation exchange, rather than the dominance of a single cation source. In the anion triangle, most samples plot between the bicarbonate and no-dominant types, showing that alkalinity is a significant contributor to the groundwater chemistry in the region. In the central diamond, a majority of samples fall within the Ca–Mg–HCO₃ mixed facies, which is typical of groundwater influenced by rock–water interactions under recharge conditions and carbonate buffering.

Pearson correlation analysis (Figure 6) reveals a strong positive relationship between EC, TDS, Cl⁻, and SO₄²⁻ (r > 0.90), indicating their collective contribution to groundwater mineralization.

The Na⁺–Cl⁻ plot (Figure 7A) indicates that, although a few samples lie near the 1:1 line, most samples do not follow this relationship, with some showing an enrichment of Na⁺ relative to Cl⁻.

The (Ca²⁺ + Mg²⁺)–HCO₃⁻ plot (Figure 7B) shows that the majority of samples lie above the 1:1 line, indicating that (Ca²⁺ + Mg²⁺) exceeds bicarbonate in most samples. These plots reveal that groundwater chemistry is influenced by multiple water–rock interaction processes rather than simple mineral dissolution.

Although there is a strong correlation (r > 0.9) between electrical conductivity (EC), total dissolved solids (TDS), chloride (Cl⁻), and sulfate (SO₄²⁻) (Figure 6), diagnostic plots differentiate their sources. The relationship between calcium (Ca²⁺) and sulfate (SO₄²⁻) does not follow the 1:1 dissolution line of gypsum (Figure 8A), indicating that sulfate concentrations are not mainly controlled by the dissolution of evaporites and that sulfate plays a relatively small role in the overall mineralization of groundwater. On the other hand, the chloride–nitrate (Cl⁻ versus NO₃–N) relationship shows that most samples have low nitrate values across a range of chloride values, whereas only a small fraction shows high nitrate values at higher chloride values (Figure 8B). This pattern is more consistent with localized anthropogenic inputs than with diffuse, widespread contamination of groundwater. Collectively, the results imply that natural mineralization processes control the salinity of groundwater.

The spatial distribution patterns of fluoride and other hydrogeochemical parameters, such as Na, SO₄, Cl, hardness, and TDS, are shown in Figure 9. Their consistent spatial correspondence indicates that fluoride enrichment is controlled at the regional level and is related to groundwater mineralization rather than to single point-source contamination.

A sampling density map and a local spatial variability map of fluoride were developed to measure the uncertainty in spatial distribution. The sampling density map (Figure 10A) indicates a high concentration of groundwater samples in the central region of the study area and relatively low coverage in the peripheral areas. The variance map (Figure 10B) of fluoride reveals that the region of high variability overlaps with the high sampling density zone, indicating that the observed heterogeneity reflects genuine hydrogeochemical complexity rather than artefacts of sparse sampling. However, isolated high-variability zones occurring in low-density areas should be interpreted with greater uncertainty.

On the basis of the integrated evaluation of ensemble SHAP ranking, hydrogeochemical relevance, and patterns of spatial distribution, a final set of features was determined for fluoride classification (Table 3). Turbidity, SO₄²⁻, Mg²⁺, EC, TDS, Na⁺, Ca²⁺, pH, HCO₃⁻, and Cl⁻ were retained because they are consistently important in SHAP and physically meaningful in terms of their relationship with groundwater mineralization and fluoride mobilization. In contrast, Fe, NO₃–N, K⁺, PO₄³⁻, and hardness were excluded because their SHAP values were unstable, they had limited hydrogeochemical occurrence, or they were redundant with other variables.

The classification performance of six machine learning models, Support Vector Classifier (SVC), Logistic Regression (LR), XGBoost, Decision Tree (DT), Gaussian Naïve Bayes (NB), and K-Nearest Neighbour (KNN), was evaluated by using 5 × 3 nested stratified cross-validation to prevent hyperparameter tuning bias. Table 4 summarizes the outer fold mean performance.
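
The nested scheme can be illustrated with scikit-learn. The sketch below is not the authors' code: it assumes that "5 × 3" means five outer and three inner stratified folds, uses a synthetic imbalanced dataset as a stand-in for the groundwater features, and picks an arbitrary small hyperparameter grid.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data: 10 features, roughly 25% positive (high-fluoride) class.
X, y = make_classification(
    n_samples=120, n_features=10, weights=[0.75, 0.25], random_state=0
)

# Inner folds tune hyperparameters; outer folds estimate generalization.
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Scaling sits inside the pipeline so it is refit on each training fold only.
pipeline = make_pipeline(StandardScaler(), SVC())
param_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.1]}  # assumed grid

search = GridSearchCV(pipeline, param_grid, scoring="roc_auc", cv=inner_cv)
outer_auc = cross_val_score(search, X, y, scoring="roc_auc", cv=outer_cv)

print("Outer-fold AUC:", outer_auc.round(3), "mean:", outer_auc.mean().round(3))
```

Because the grid search is wrapped inside `cross_val_score`, each outer test fold is never seen during tuning, which is what keeps the outer-fold mean free of hyperparameter selection bias.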

The Support Vector Classifier (SVC) demonstrated the best generalization performance. It showed the highest AUC (0.664), Average Precision (0.552), and F1_high (0.447), and thus has better discriminative strength and minority-class ranking performance than the other models. Logistic Regression exhibited the lowest generalization ability (AUC = 0.43).

After the nested cross-validation, the final tuned models were retrained with the entire data set and then tested again on the independent spatial holdout test set to assess the external generalization performance (Table 5).

The Support Vector Classifier (SVC) demonstrated the best predictive power on the independent test set (Table 5). The model achieved an overall accuracy of 0.75 with an AUC of 0.821, which implies a good capability to distinguish between low- and high-fluoride samples. The Average Precision (AP) score was 0.483, reflecting a moderate precision–recall performance under class imbalance. The F1 score of the high-fluoride category (F1_high) was 0.571.

The confusion matrix indicated that SVC correctly classified 14 out of 19 low-fluoride samples and 4 out of 5 high-fluoride samples (Figure 11A). The sensitivity (recall of the high-fluoride category) and specificity were therefore 0.80 and 0.74, respectively. The model missed only one contaminated sample (false negative), while five low-fluoride samples were incorrectly classified as high-fluoride (false positives). The precision for the high-fluoride class was 0.44, indicating that nearly half of the samples predicted as contaminated were truly contaminated. In general, the SVC provided acceptable sensitivity and specificity and strong discrimination capability, with limited false negative occurrences.
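
The reported holdout metrics follow directly from the four confusion-matrix counts given in the text (4 true positives, 1 false negative, 14 true negatives, 5 false positives). The short sketch below reproduces that arithmetic; `binary_metrics` is an illustrative helper name, not part of the study's code.

```python
def binary_metrics(tp, fn, tn, fp):
    """Standard binary classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)            # recall of the positive class
    specificity = tn / (tn + fp)            # recall of the negative class
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "f1": f1,
    }


# Counts taken from the holdout confusion matrix described in the text.
metrics = binary_metrics(tp=4, fn=1, tn=14, fp=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With only five high-fluoride samples in the holdout set, a single additional miss would lower sensitivity from 0.80 to 0.60, so these figures carry wide uncertainty.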

The discriminative performance of the Support Vector Classifier (SVC) model is also depicted by the ROC curve (Figure 11B). The curve lies above the diagonal reference line, confirming a strong separation between low- and high-fluoride samples. The AUC value of 0.821 indicates a good overall classification ability across varying decision thresholds, demonstrating that the model consistently ranks contaminated samples higher than non-contaminated ones.
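
The ranking interpretation of the AUC can be made concrete: it equals the fraction of (high-fluoride, low-fluoride) sample pairs in which the high-fluoride sample receives the higher score, with ties counted as one half. The sketch below uses hypothetical scores, not the model's actual outputs.

```python
def auc_by_ranking(scores_pos, scores_neg):
    """AUC as the probability that a positive outranks a negative (ties = 0.5)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical classifier scores for illustration only.
high_fluoride_scores = [0.9, 0.7, 0.4]
low_fluoride_scores = [0.8, 0.3, 0.2, 0.1]

auc = auc_by_ranking(high_fluoride_scores, low_fluoride_scores)
print(f"AUC = {auc:.3f}")
```

Read this way, an AUC of 0.821 means that in roughly 82% of such pairs the contaminated sample is ranked above the uncontaminated one, independent of any particular decision threshold.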

A SHAP summary plot (Figure 12) revealed the relative influence of individual features used by SVC to classify samples with a high fluoride concentration. Turbidity and SO₄²⁻ were the most significant features, followed by pH and HCO₃⁻. Higher values of turbidity and sulfate predominantly contributed positive SHAP values, indicating an increased probability of high fluoride concentrations. Similarly, elevated pH and bicarbonate levels were associated with positive contributions toward high-fluoride classification, consistent with alkaline conditions promoting fluoride mobilization. In contrast, calcium exhibited mixed contributions, with higher Ca²⁺ values often associated with reduced fluoride probability, potentially reflecting fluorite precipitation effects. Electrical conductivity (EC) and total dissolved solids (TDS) showed comparatively lower marginal contributions, suggesting that specific ion chemistry rather than bulk salinity played a stronger role in model prediction. Overall, the SHAP analysis shows that the SVC model represents hydrogeochemically relevant controls on the occurrence of fluoride.

The spatial distribution of observed and predicted fluoride classes is shown in Figure 13A,B. Figure 13A shows that high-fluoride samples are predominantly present in the central and eastern parts of the district, while the low-fluoride samples are uniformly distributed throughout the area.

The spatial validation of the Support Vector Classifier (SVC) on the independent geology-based holdout test set is shown in Figure 13B. Most of the high-fluoride samples were correctly identified (true positives: red triangles, four samples). Similarly, most of the low-fluoride samples were accurately identified (true negatives: green triangles, 14 samples). Consistent with the confusion matrix (Figure 11A), the misclassifications comprised false positives (yellow triangles: five samples) and false negatives (black triangle: one sample), and they did not exhibit systematic clustering within any single lithological domain. The agreement between the observed and predicted classes across the study area indicates that the SVC model consistently captures regional hydrogeochemical controls on fluoride distribution.

Health risk assessment (HQ) and population exposure

Health risk assessment revealed that children have significantly greater HQ values compared to adults at the same level of fluoride exposure. The HQ range among children is 0.04 to 2.56, with a mean of 1.14 and median of 1.04. By comparison, the HQ range of adults is 0.02 to 1.10, with a mean of 0.49 and median of 0.45 (Table 6). Overall, 45/88 samples (51.1%) had HQ child > 1, whereas only 5/88 samples (5.68%) had HQ adult > 1 (Figure 14A).
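
The difference between child and adult HQ values at the same concentration follows from the exposure equation. The sketch below uses the common simplified form HQ = (C × IR / BW) / RfD. All parameter values are illustrative assumptions (an oral reference dose of 0.06 mg/kg/day, 1 L/day and 15 kg for children, 2 L/day and 70 kg for adults) and may differ from those actually used in the study.

```python
# Assumed oral reference dose for fluoride, mg/kg/day.
RFD_FLUORIDE = 0.06

# Assumed exposure parameters: ingestion rate (L/day) and body weight (kg).
EXPOSURE = {
    "child": {"ingestion_rate": 1.0, "body_weight": 15.0},
    "adult": {"ingestion_rate": 2.0, "body_weight": 70.0},
}


def hazard_quotient(concentration_mg_l, group):
    """HQ = chronic daily intake / reference dose for a receptor group."""
    params = EXPOSURE[group]
    daily_intake = (
        concentration_mg_l * params["ingestion_rate"] / params["body_weight"]
    )
    return daily_intake / RFD_FLUORIDE


# Example at the WHO guideline value of 1.5 mg/L.
hq_child = hazard_quotient(1.5, "child")
hq_adult = hazard_quotient(1.5, "adult")
print(f"HQ child = {hq_child:.2f}, HQ adult = {hq_adult:.2f}")
```

Under these assumptions, a child's intake per kilogram of body weight is a little more than twice an adult's, so the same water can exceed HQ = 1 for children while remaining below it for adults.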

Population exposure is measured based on the total population served by each scheme rather than age-specific counts. Figure 14B shows five schemes where HQ child and HQ adult are both greater than 1, indicating the highest-priority sources for mitigation. These schemes are Baloch Colony (3400), Ghulam Parenz 2 (3000), Killi Mohammad Hessni (2300), Degree College Mastung (500), and Raiki (500), with a total population of 9700 people.

Discussion

Hydrogeochemical results reveal that rock–water interactions, rather than anthropogenic sources, mainly control the enrichment of fluoride in Mastung groundwater, and that fluorite dissolution is thermodynamically favourable in the aquifer system, as evidenced by the universal undersaturation of fluorite. This observation is consistent with the literature, in which semi-arid hard-rock environments, long residence times, and alkalinity favour the release of fluoride from fluoride-bearing minerals [46].

The dominance of Ca–Mg–HCO₃ to mixed facies in the Piper plot is further evidence of carbonate weathering and buffering reactions. The supersaturation of calcite and dolomite is indicative of active carbonate precipitation, which in turn may indirectly increase fluoride mobility by lowering aqueous Ca²⁺ levels, thereby shifting the equilibrium toward further dissolution of fluorite. The relationship between carbonate precipitation and fluoride enrichment has been reported in India, China, and East Africa [47,48,49].

The strong correlations among EC, TDS, Cl⁻, and SO₄²⁻ (r > 0.9) reflect generalized mineralization processes. The ion-ratio plots show that sulfate is not predominantly governed by the dissolution of gypsum, and nitrate is low in most of the samples, meaning that there is no extensive anthropogenic contamination. Similarly, Na enrichment over Cl indicates silicate weathering and cation exchange processes, which have been shown to be primary agents influencing the accumulation of fluoride in arid aquifers [50]. Taken together, these findings indicate that the source of fluoride is geogenic and controlled by the lithology of the region, the evolution of groundwater, and alkaline hydrochemical conditions rather than point-source contamination.

A key contribution of this study is the integration of ensemble SHAP-based interpretations with hydrogeochemical reasoning and spatial consistency. The retained predictors, turbidity, SO₄²⁻, Mg²⁺, EC, TDS, Na⁺, Ca²⁺, pH, HCO₃⁻, and Cl⁻, are all directly or indirectly linked to groundwater evolution and mineralization processes. pH and HCO₃⁻ were included despite their low mean SHAP rank because alkaline conditions enhance the solubility of fluoride through mineral dissolution and desorption. Similarly, the inclusion of Ca²⁺ as a retained feature is hydrogeochemically consistent, as calcium concentrations regulate fluorite saturation through precipitation–dissolution dynamics. Meanwhile, features such as NO₃–N and PO₄³⁻ were excluded because of limited hydrogeochemical relevance and a weak spatial distribution pattern. The convergence between SHAP rankings, spatial distribution patterns, and classical hydrogeochemical interpretation enhances confidence that the model captures physically meaningful processes rather than statistical artefacts.

The Support Vector Classifier showed the strongest predictive performance, with an AUC of 0.664 in nested cross-validation and a higher value of 0.821 on the independent spatial holdout test set, with an overall accuracy of 0.75. The sensitivity and specificity of the SVC model were 0.80 and 0.74, respectively, which indicates a balanced discrimination of the high- and low-fluoride groups and a relatively low rate of false negatives.

The high performance of the SVC is most likely explained by its ability to capture nonlinear interactions in high-dimensional feature spaces [51,52]. Hydrogeochemical systems can be considered inherently nonlinear because of the combined mineral–water interactions, which makes kernel-based classifiers suitable. In contrast, Logistic Regression exhibited lower discriminatory power, which suggests that linear decision boundaries are insufficient to represent fluoride controls in this system. Gaussian Naïve Bayes did not perform well, likely because its assumption of conditional independence between features does not hold when hydrochemical variables are strongly correlated, as observed in recent research [53,54].

The spatial holdout AUC of 0.821 obtained in this study is comparable to several recent regional investigations. Ref. [55] reported AUC = 0.82 using SVC in the Datong Basin, China, while [56] achieved AUC = 0.73 using CART in western Balochistan. At broader scales, ref. [1] reported an overall accuracy of approximately 0.82 using Random Forest in a global assessment of fluoride risk and [16] achieved higher AUC values (~0.92) at the national scale in Pakistan (Table 7).

The spatial comparison between observed and predicted fluoride classes demonstrates that the model's performance is geographically consistent across the study area. The majority of predictions correspond well with the measured fluoride categories, indicating that the classifier maintained stability when applied to spatially independent test data. Importantly, the misclassified samples are scattered rather than concentrated within a particular sector or geological unit. This pattern suggests that prediction errors are not structurally biased and are unlikely to result from spatial overfitting or data leakage. The geology-informed spatial holdout strategy therefore provides a realistic evaluation of model generalization across different parts of the district.

Despite these strengths, some limitations remain. The relatively small sample size is a common limitation of groundwater research in arid and sparsely populated areas like Balochistan. A small dataset can pose challenges for machine learning applications; this study addressed them by using nested cross-validation, spatial holdout validation, and a feature selection framework that integrates SHAP analysis with hydrogeochemical and spatial distribution patterns.

Because sampling was performed in a single field campaign, we cannot quantify seasonal variability in fluoride (e.g., seasonal mean differences, within-year standard deviation, or percentage change at repeated locations). Therefore, the susceptibility maps represent a snapshot of groundwater conditions during the sampling period rather than an annual average surface. In addition, this paper measures exposure-based risk in terms of the hazard quotient and the population served by each high-risk scheme, but clinical prevalence data for dental fluorosis and case data on skeletal fluorosis were not available in this hydrogeochemical and modelling study. Therefore, we report the population served by high-risk schemes as a screening indicator rather than a confirmed disease burden. Future work should integrate community dental examinations (e.g., Dean's Index surveys in school children) and health records to quantify the true prevalence of dental/skeletal fluorosis and validate exposure–outcome relationships.

5. Conclusions

This study demonstrated that machine learning-based spatial modelling is an appropriate approach to classify fluoride contamination in groundwater. The statistical analysis revealed that 25% of samples are above the WHO permissible limit, indicating a significant public health concern in parts of the study area.

The fluorite saturation indices, ion ratios, and spatial distribution patterns indicate that the enrichment of fluoride is geogenic and is mainly caused by water–rock interaction processes. Parameters such as TDS, Cl⁻, Na⁺, SO₄²⁻, and related mineralization indicators exhibited spatial patterns consistent with fluoride distribution, suggesting shared mobilization mechanisms.

Among the evaluated machine learning models, the Support Vector Classifier (SVC) demonstrated the most reliable performance. The nested cross-validation framework yielded stable internal generalization (outer AUC = 0.664), while the independent geology-informed spatial holdout test produced an accuracy of 0.75 and AUC of 0.821, indicating a strong discriminatory capability between low- and high-fluoride groundwater samples. The integration of SHAP-based interpretation further confirmed that the model captured hydrogeochemically meaningful relationships, particularly in the roles of pH, HCO₃⁻, Na⁺, and Ca²⁺ in controlling fluoride mobility.

The health risk assessment reveals that children are a high-risk group, with over 50% of the samples exceeding a hazard quotient of 1 for children. The findings highlight the urgency of specific intervention programmes in the risk areas.

In terms of sustainability, the combination of spatial machine learning modelling and health assessments would be an effective decision support model to manage the long-term risk of groundwater contamination. Risk-informed mapping supports the fair allocation of mitigation resources and also helps in the sustainable governance of drinking water in water-stressed areas.

Future studies incorporating expanded spatial coverage, multi-season sampling, and the integration of epidemiological data would further enhance predictive accuracy and strengthen the linkage between hydrogeochemical modelling and public health protection.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/su18073328/s1, Table S1: Shapiro–Wilk normality test results for physicochemical parameters. Table S2: Analytical timelines and EPA/APHA-recommended holding times for groundwater quality parameters. Table S3: Quality assurance results showing RPD ranges, mean RPD, and analytical precision for laboratory-duplicated groundwater samples. Table S4: Matrix spike recovery results demonstrating analytical accuracy and minimal matrix interference for groundwater quality parameters. Table S5: Limits of detection (LOD) and quantification (LOQ) of measured parameters. Table S6: Optimized hyperparameters selected via nested cross-validation. References [59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76] are cited in the Supplementary Materials.

Author Contributions

N.A.B.: Conceptualization, Methodology, Data Collection, Preprocessing, Formal Analysis, Model Development, Spatial Mapping, Visualization, and Writing—Original Draft Preparation. K.M.: Supervision, Methodological Guidance, and Review and Editing. W.Y., S.B., U.S., A.K., S.U., R.A.M., M.Y., N.E. and M.Y.K.: Contributed to data collection and curation, validation, and manuscript editing. All authors have read and agreed to the published version of the manuscript.

Funding

The work reported in this article was supported by the Telecommunications Regulatory Authority (TRA)—Oman, through its funding of the UNESCO Chair on Artificial Intelligence at The Communication and Information Research Center (CIRC), Sultan Qaboos University, Oman.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The anonymized groundwater dataset supporting the findings of this study is publicly available in the Zenodo repository (DOI: 10.5281/zenodo.18649110).

Acknowledgments

The authors would like to acknowledge the constructive suggestions from colleagues and reviewers, which helped improve the quality of this manuscript. The authors also thank the National Centre of Artificial Intelligence, Higher Education Commission of Pakistan (HEC).

Conflicts of Interest

The authors declare no competing interests.

References

Podgorski, J.; Berg, M. Global analysis and prediction of fluoride in groundwater. Nat. Commun. 2022, 13, 4232.
Kimambo, V.; Bhattacharya, P.; Mtalo, F.; Mtamba, J.; Ahmad, A. Fluoride occurrence in groundwater systems at global scale and status of defluoridation–state of the art. Groundw. Sustain. Dev. 2019, 9, 100223.
Prasad, B.; Kaur, P.S.; Gupta, S. Fluoride Contamination in Drinking Water and Associated Health Risk. In Fluorides in Drinking Water: Source, Issue, and Mitigation Strategies; Springer: Berlin/Heidelberg, Germany, 2025; pp. 37–62.
Mukherjee, I.; Singh, U.K. Groundwater fluoride contamination, probable release, and containment mechanisms: A review on Indian context. Environ. Geochem. Health 2018, 40, 2259–2301.
Mukherjee, I.; Singh, U.K. Fluoride abundance and their release mechanisms in groundwater along with associated human health risks in a geologically heterogeneous semi-arid region of east India. Microchem. J. 2020, 152, 104304.
Chowdhury, A.; Adak, M.K.; Mukherjee, A.; Dhak, P.; Khatun, J.; Dhak, D. A critical review on geochemical and geological aspects of fluoride belts, fluorosis and natural materials and other sources for alternatives to fluoride exposure. J. Hydrol. 2019, 574, 333–359.
Addison, M.J.; Rivett, M.O.; Robinson, H.; Fraser, A.; Miller, A.M.; Phiri, P.; Mleta, P.; Kalin, R.M. Fluoride occurrence in the lower East African rift system, southern Malawi. Sci. Total Environ. 2020, 712, 136260.
Susheela, A.; Mondal, N.; Singh, A. Exposure to fluoride in smelter workers in a primary aluminum industry in India. Int. J. Occup. Environ. Med. 2013, 4, 61–72.
Rashid, A.; Guan, D.-X.; Farooqi, A.; Khan, S.; Zahir, S.; Jehan, S.; Khattak, S.A.; Khan, M.S.; Khan, R. Fluoride prevalence in groundwater around a fluorite mining area in the flood plain of the River Swat, Pakistan. Sci. Total Environ. 2018, 635, 203–215.
Su, H.; Li, H.; Chen, H.; Li, Z.; Zhang, S. Source identification and potential health risks of fluoride and nitrate in groundwater of a typical alluvial plain. Sci. Total Environ. 2023, 904, 166920.
Cao, H.; Xie, X.; Wang, Y.; Liu, H. Predicting geogenic groundwater fluoride contamination throughout China. J. Environ. Sci. 2022, 115, 140–148.
Demir Yetiş, A.; İlhan, N.; Kara, H. Integrating deep learning and regression models for accurate prediction of groundwater fluoride contamination in old city in Bitlis province, Eastern Anatolia Region, Türkiye. Environ. Sci. Pollut. Res. 2024, 31, 47201–47219.
Rasool, A.; Farooqi, A.; Xiao, T.; Ali, W.; Noor, S.; Abiola, O.; Ali, S.; Nasim, W. A review of global outlook on fluoride contamination in groundwater with prominence on the Pakistan current situation. Environ. Geochem. Health 2018, 40, 1265–1281.
Sadiq, M.; Eqani, S.A.M.A.S.; Nawaz, I.; Bangash, N.; Ilyas, S.; Podgorski, J.; Berg, M. Fluoride contamination of groundwater in different geological settings of Punjab Province, Pakistan: Levels, possible mechanisms and health risks. Sci. Total Environ. 2025, 1001, 180450.
Ali, W.; Aslam, M.W.; Junaid, M.; Ali, K.; Guo, Y.; Rasool, A.; Zhang, H. Elucidating various geochemical mechanisms drive fluoride contamination in unconfined aquifers along the major rivers in Sindh and Punjab, Pakistan. Environ. Pollut. 2019, 249, 535–549.
Ling, Y.; Podgorski, J.; Sadiq, M.; Rasheed, H.; Eqani, S.A.M.A.S.; Berg, M. Monitoring and prediction of high fluoride concentrations in groundwater in Pakistan. Sci. Total Environ. 2022, 839, 156058.
Chandio, T.A.; Khan, M.N.; Sarwar, A. Fluoride estimation and its correlation with other physicochemical parameters in drinking water of some areas of Balochistan, Pakistan. Environ. Monit. Assess. 2015, 187, 531.
Mohammad, A.D.; Faisal, R.; Akhtar, M.M. Appraisal of Fluoride Contamination in Groundwater Using Statistical Approach in Rural Areas of Quetta, Balochistan. Pak. J. Anal. Environ. Chem. 2020, 21, 314–321.
Arain, G.M.; Sattar, N.; Khatoon, S.; Naseem, S.; Badshah, S. Groundwater Fluoride Contamination in Balochistan, Pakistan: Health Risk and Regional Variability Analysis using HQ and NPI Indices. Asian J. Chem. 2026, 38, 123–132.
Sagintayev, Z.; Sultan, M.; Khan, S.; Khan, S.; Mahmood, K.; Yan, E.; Milewski, A.; Marsala, P. A remote sensing contribution to hydrologic modelling in arid and inaccessible watersheds, Pishin Lora basin, Pakistan. Hydrol. Process. 2012, 26, 85–99.
Usman, M.; Furuya, M. Complex faulting in the Quetta Syntaxis: Fault source modeling of the October 28, 2008 earthquake sequence in Baluchistan, Pakistan, based on ALOS/PALSAR InSAR data. Earth Planets Space 2015, 67, 142.
Kakar, N.; Zhao, C.; Li, G.; Zhao, H. GNSS and Sentinel-1 InSAR Integrated Long-Term Subsidence Monitoring in Quetta and Mastung Districts, Balochistan, Pakistan. Remote Sens. 2024, 16, 1521.
Parkhurst, D.L.; Appelo, C. User’s Guide to PHREEQC (Version 2): A Computer Program for Speciation, Batch-Reaction, One-Dimensional Transport, and Inverse Geochemical Calculations; US Geological Survey: Reston, VA, USA, 1999.
Piper, A.M. A graphic procedure in the geochemical interpretation of water-analyses. Eos Trans. Am. Geophys. Union 1944, 25, 914–928.
Singh, D.; Singh, B. Investigating the impact of data normalization on classification performance. Appl. Soft Comput. 2020, 97, 105524.
Lundberg, S.M.; Lee, S.-I. A unified approach to interpreting model predictions. Adv. Neural Inf. Process. Syst. 2017, 30.
Cortes, C.; Vapnik, V. Support-vector networks. Mach. Learn. 1995, 20, 273–297.
Boswell, D. Introduction to Support Vector Machines; Department of Computer Science and Engineering, University of California: San Diego, CA, USA, 2002; Volume 11, pp. 16–17.
Awad, M.; Khanna, R. Support vector machines for classification. In Efficient Learning Machines: Theories, Concepts, and Applications for Engineers and System Designers; Springer: Berlin/Heidelberg, Germany, 2015; pp. 39–66.
Beauxis-Aussalet, E.; Hardman, L. Visualization of confusion matrix for non-expert users. In Proceedings of the IEEE Conference on Visual Analytics Science and Technology (VAST)-Poster Proceedings, Paris, France, 25–31 October 2014; pp. 1–2.
Jiao, Y.; Du, P. Performance measures in evaluating machine learning based bioinformatics predictors for classifications. Quant. Biol. 2016, 4, 320–330.
Reddy, B.H.; Karthikeyan, P. Classification of fire and smoke images using decision tree algorithm in comparison with logistic regression to measure accuracy, precision, recall, F-score. In Proceedings of the 2022 14th International Conference on Mathematics, Actuarial Science, Computer Science and Statistics (MACS), Karachi, Pakistan, 12–13 November 2022; pp. 1–5.
Obuchowski, N.A.; Bullen, J.A. Receiver operating characteristic (ROC) curves: Review of methods with applications in diagnostic medicine. Phys. Med. Biol. 2018, 63, 07TR01.
Berrar, D. Performance measures for binary classification. In Encyclopedia of Bioinformatics and Computational Biology; Ranganathan, S., Gribskov, M., Nakai, K., Schönbach, C., Eds.; Academic Press: Oxford, UK, 2019; pp. 546–560.
Atapattu, S.; Tellambura, C.; Jiang, H. MGF based analysis of area under the ROC curve in energy detection. IEEE Commun. Lett. 2011, 15, 1301–1303.
Marzban, C. The ROC curve and the area under it as performance measures. Weather Forecast. 2004, 19, 1106–1114.
Hanley, J.A.; McNeil, B.J. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 1982, 143, 29–36.
Obuchowski, N.A.; McClish, D.K. Sample size determination for diagnostic accuracy studies involving binormal ROC curve indices. Stat. Med. 1997, 16, 1529–1542.
Mamun, A.; Alazmi, A.S.; Alruwaili, M.; Bhandari, S.; Sharif, H.O. Groundwater Nitrate Contamination and Age-Specific Health Risks in Semi-Urban Northeastern Areas of Saudi Arabia. Urban Sci. 2025, 9, 538.
Ahada, C.P.; Suthar, S. Assessment of human health risk associated with high groundwater fluoride intake in southern districts of Punjab, India. Expo. Health 2019, 11, 267–275.
Rehman, F.; Siddique, J.; Shahab, A.; Azeem, T.; Bangash, A.A.; Naseem, A.A.; Riaz, O.; ur Rehman, Q. Hydrochemical appraisal of fluoride contamination in groundwater and human health risk assessment at Isa Khel, Punjab, Pakistan. Environ. Technol. Innov. 2022, 27, 102445.
Vesković, J.; Onjia, A. Exposure and toxicity factors in health risk assessment of heavy metal(loid)s in water. Water 2025, 17, 2901.
Salami, I.R.S.; Thufailah, N.A.; Fahimah, N.; Roosmini, D. Health risk assessment of physicochemical and heavy metals exposures of the usage of shallow groundwater located at the proximity to Citarum River, Indonesia. Case Stud. Chem. Environ. Eng. 2025, 11, 101153.
Aslam, H.; Hashmi, A.; Khan, I.; Ahmad, S.; Umar, R. Deciphering effects of coal fly ash on hydrochemistry and heavy metal(loid)s occurrence in surface and groundwater: Implications for environmental impacts and management. Water Air Soil Pollut. 2024, 235, 640.
Alam, K.; Nafees, M.; Ali, W.; Muhammad, S.; Raziq, A. Geogenic contamination of groundwater in a highland watershed: Hydrogeochemical assessment, source apportionment, and health risk evaluation of fluoride and nitrate. Hydrology 2025, 12, 70.
Mondal, D.; Gupta, S. Fluoride hydrogeochemistry in alluvial aquifer: An implication to chemical weathering and ion-exchange phenomena. Environ. Earth Sci. 2015, 73, 3537–3554.
Li, D.; Gao, X.; Wang, Y.; Luo, W. Diverse mechanisms drive fluoride enrichment in groundwater in two neighboring sites in northern China. Environ. Pollut. 2018, 237, 430–441.
Singh, C.K.; Mukherjee, S. Aqueous geochemistry of fluoride enriched groundwater in arid part of Western India. Environ. Sci. Pollut. Res. 2015, 22, 2668–2678.
Rango, T.; Bianchini, G.; Beccaluva, L.; Ayenew, T.; Colombani, N. Hydrogeochemical study in the Main Ethiopian Rift: New insights to the source and enrichment mechanism of fluoride. Environ. Geol. 2009, 58, 109–118.
Younas, A.; Mushtaq, N.; Khattak, J.A.; Javed, T.; Rehman, H.U.; Farooqi, A. High levels of fluoride contamination in groundwater of the semi-arid alluvial aquifers, Pakistan: Evaluating the recharge sources and geochemical identification via stable isotopes and other major elemental data. Environ. Sci. Pollut. Res. 2019, 26, 35728–35741.
Zhang, X.; Wu, Y.; Wang, L.; Li, R. Variable selection for support vector machines in moderately high dimensions. J. R. Stat. Soc. Ser. B Stat. Methodol. 2016, 78, 53–76.
Razaque, A.; Ben Haj Frej, M.; Almi’ani, M.; Alotaibi, M.; Alotaibi, B. Improved support vector machine enabled radial basis function and linear variants for remote sensing image classification. Sensors 2021, 21, 4431.
Jahromi, A.H.; Taheri, M. A non-parametric mixture of Gaussian naive Bayes classifiers based on local independent features. In Proceedings of the 2017 Artificial Intelligence and Signal Processing Conference (AISP), Shiraz, Iran, 25–27 October 2017; pp. 209–212.
Zaidi, N.A.; Cerquides, J.; Carman, M.J.; Webb, G.I. Alleviating naive Bayes attribute independence assumption by attribute weighting. J. Mach. Learn. Res. 2013, 14, 1947–1988.
Wei, Y.; Zhong, R.; Yang, Y. Groundwater Fluoride Prediction for Sustainable Water Management: A Comparative Evaluation of Machine Learning Approaches Enhanced by Satellite Embeddings. Sustainability 2025, 17, 8505.
Durrani, T.S.; Akhtar, M.M.; Kakar, K.U.; Khan, M.N.; Muhammad, F.; Khan, M.; Habibullah, H.; Khan, C. Geochemical evolution, geostatistical mapping and machine learning predictive modeling of groundwater fluoride: A case study of western Balochistan, Quetta. Environ. Geochem. Health 2025, 47, 32.
Kerketta, A.; Kapoor, H.S.; Sahoo, P.K. Groundwater fluoride prediction modeling using physicochemical parameters in Punjab, India: A machine-learning approach. Front. Soil Sci. 2024, 4, 1407502.
Singh, G.; Mehta, S. Prediction of geogenic source of groundwater fluoride contamination in Indian states: A comparative study of different supervised machine learning algorithms. J. Water Health 2024, 22, 1387–1408.
Bisong, E. Building Machine Learning and Deep Learning Models on Google Cloud Platform; Springer: Berkeley, CA, USA, 2019.
Hosmer, D.W., Jr.; Lemeshow, S.; Sturdivant, R.X. Applied Logistic Regression; John Wiley & Sons: Hoboken, NJ, USA, 2013.
Chen, T.; Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
Dhieb, N.; Ghazzai, H.; Besbes, H.; Massoud, Y. Extreme gradient boosting machine learning algorithm for safe auto insurance operations. In Proceedings of the 2019 IEEE International Conference on Vehicular Electronics and Safety (ICVES), Cairo, Egypt, 4–6 September 2019; pp. 1–5.
Zhang, P.; Jia, Y.; Shang, Y. Research and application of XGBoost in imbalanced data. Int. J. Distrib. Sens. Netw. 2022, 18, 15501329221106935.
Chan, J.Y.L.; Leow, S.M.H.; Bea, K.T.; Cheng, W.K.; Phoong, S.W.; Hong, Z.W.; Chen, Y.L. Mitigating the multicollinearity problem and its machine learning approach: A review. Mathematics 2022, 10, 1283.
Osisanwo, F.; Akinsola, J.E.; Awodele, O.; Hinmikaiye, J.; Olakanmi, O.; Akinjobi, J. Supervised machine learning algorithms: Classification and comparison. Int. J. Comput. Trends Technol. 2017, 48, 128–138.
Mustafa, O.M.; Ahmed, O.M.; Saeed, V.A. Comparative analysis of decision tree algorithms using gini and entropy criteria on the forest covertypes dataset. In Proceedings of the International Conference on Innovations in Computing Research; Springer: Cham, Switzerland, 2024; pp. 185–193.
Simovici, D.A.; Cristofor, D.; Cristofor, L. Impurity measures in databases. Acta Inform. 2002, 38, 307–324. [Google Scholar] [CrossRef]
Mariello, A.; Battiti, R. Feature selection based on the neighborhood entropy. IEEE Trans. Neural Netw. Learn. Syst. 2018, 29, 6313–6322. [Google Scholar] [CrossRef]
Osei-Bryson, K.-M.; Giles, K. Splitting methods for decision tree induction: An exploration of the relative performance of two entropy-based families. Inf. Syst. Front. 2006, 8, 195–209. [Google Scholar] [CrossRef]
Bafjaish, S.S. Comparative analysis of naive bayesian techniques in health-related for classification task. J. Soft Comput. Data Min. 2020, 1, 1–10. [Google Scholar]
Cunningham, P.; Delany, S.J. K-nearest neighbour classifiers—A tutorial. ACM Comput. Surv. 2021, 54, 1–25. [Google Scholar] [CrossRef]
Mucherino, A.; Papajorgji, P.; Pardalos, P.M. K-nearest neighbor classification. In Data Mining in Agriculture; Springer: Boston, MA, USA, 2009; pp. 83–106. [Google Scholar]
Abu Alfeilat, H.A.; Hassanat, A.B.A.; Lasassmeh, O.; Tarawneh, A.S.; Alhasanat, M.B.; Eyal Salman, H.S.; Prasath, V.B.S. Effects of distance measure choice on k-nearest neighbor classifier performance: A review. Big Data 2019, 7, 221–248. [Google Scholar] [CrossRef]
Aggarwal, C.C.; Hinneburg, A.; Keim, D.A. On the surprising behavior of distance metrics in high dimensional space. In Proceedings of the International Conference on Database Theory; Springer: Berlin/Heidelberg, Germany, 2001; pp. 420–434. [Google Scholar]
Gao, X.; Li, G. A KNN model based on manhattan distance to identify the SNARE proteins. IEEE Access 2020, 8, 112922–112931. [Google Scholar] [CrossRef]
Gou, J.; Du, L.; Zhang, Y.; Xiong, T. A new distance-weighted k-nearest neighbor classifier. J. Inf. Comput. Sci. 2012, 9, 1429–1436. [Google Scholar]

Figure 1. Location map of the study area. (A) Provinces of Pakistan, with the boundary of Balochistan Province highlighted in sky blue. (B) District map of Balochistan Province, with the boundary of Mastung District highlighted in light blue. (C) Geological map of the study area in Mastung District; pink triangles represent groundwater sampling locations.

Figure 2. Workflow of machine learning methodology for spatial prediction.

Figure 3. Distribution of fluoride concentration and groundwater classification based on WHO standards. The red dashed line represents the mean fluoride concentration (1.03 mg/L), while the blue curve indicates the distribution trend. The color-coded bars represent classification, where blue indicates safe samples (Class 0) and red indicates contaminated samples (Class 1).

Figure 4. (A) Sample-wise variation in saturation indices (SI) for fluorite, calcite, and dolomite. (B) Percentage distribution of saturation states (undersaturated, near equilibrium, and supersaturated) for fluorite, calcite, and dolomite.

Figure 5. Piper trilinear diagram showing hydrochemical facies of groundwater samples in the study area.

Figure 6. Pearson correlation matrix of physicochemical parameters in groundwater samples.

Figure 7. Ion relationship plots: (A) Na⁺ versus Cl⁻ and (B) (Ca²⁺ + Mg²⁺) versus HCO₃⁻, each with a 1:1 reference line (orange).

Figure 8. Plots of (A) Ca²⁺ vs. SO₄²⁻ and (B) Cl⁻ vs. NO₃–N with 1:1 reference line (orange).

Figure 9. Spatial distribution of fluoride and associated hydrochemical parameters in groundwater.

Figure 10. (A) Sampling density of groundwater samples and (B) local spatial standard deviation (SD) of interpolated fluoride concentration (mg/L).

Figure 11. (A) Confusion matrix showing true and false classifications of low- and high-fluoride samples; (B) receiver operating characteristic (ROC) curve illustrating the model’s discriminative ability with the corresponding AUC value. The yellow dashed line represents the performance of a random classifier.

Figure 12. SHAP summary plot showing the relative contribution of selected hydrogeochemical variables to the SVC model predictions for high-fluoride groundwater.

Figure 13. (A) Regional distribution of fluoride. (B) Spatial prediction map of fluoride in Balochistan.

Figure 14. (A) Distribution of HQ values for children and adults showing the proportion of samples exceeding the safe limit (HQ > 1); (B) identification of high-priority water supply schemes where both child and adult HQ exceed 1.

Table 1. Parameter values used in the CDI and HQ calculations.

Parameter	Symbol	Children	Adults	Units	Reference
Fluoride concentration	C	measured	measured	mg/L
Ingestion rate	IR	1.0	2.0	L/day	[40]
Body weight	BW	15	70	kg	[41]
Exposure frequency	EF	365	365	days/year	[42]
Exposure duration	ED	12	30	years	[43,44]
Averaging time	AT (AT = ED × 365)	4380	10,950	days	[42]
Reference dose (fluoride)	RfD	0.06	0.06	mg/kg-day	[45]
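
The parameters in Table 1 can be turned into a short worked calculation. The sketch below assumes the conventional chronic daily intake form, CDI = (C × IR × EF × ED) / (BW × AT), with HQ = CDI / RfD; this is the standard formulation and is assumed, not confirmed, to be the one used in Section 3.5. The function names and the "Below threshold" label are hypothetical and introduced only for illustration.

```python
# Exposure parameters taken from Table 1 (children and adults).
CHILD = {"ir": 1.0, "bw": 15.0, "ef": 365, "ed": 12}
ADULT = {"ir": 2.0, "bw": 70.0, "ef": 365, "ed": 30}
RFD = 0.06  # reference dose for fluoride, mg/kg-day (Table 1)


def hazard_quotient(c, ir, bw, ef, ed, rfd=RFD):
    """Return HQ for a fluoride concentration c (mg/L).

    Assumes CDI = (C * IR * EF * ED) / (BW * AT) with AT = ED * 365,
    and HQ = CDI / RfD.
    """
    at = ed * 365  # averaging time in days, as defined in Table 1
    cdi = (c * ir * ef * ed) / (bw * at)  # mg/kg-day
    return cdi / rfd


def risk_category(c):
    """Classify a sample using the HQ > 1 threshold for both groups."""
    hq_child = hazard_quotient(c, **CHILD)
    hq_adult = hazard_quotient(c, **ADULT)
    if hq_child > 1 and hq_adult > 1:
        return "Both child & adult"
    if hq_child > 1:
        return "Child only"
    return "Below threshold"  # hypothetical label, not used in the paper


if __name__ == "__main__":
    for c in (2.30, 2.04):
        print(
            f"F = {c:.2f} mg/L: "
            f"HQ child = {hazard_quotient(c, **CHILD):.2f}, "
            f"HQ adult = {hazard_quotient(c, **ADULT):.2f}, "
            f"category = {risk_category(c)}"
        )
```

Under these assumptions, a fluoride concentration of 2.30 mg/L gives HQ values of about 2.56 for children and 1.10 for adults, and 2.04 mg/L gives about 2.27 and 0.97, which agrees with the corresponding rows of Table 6. Because EF = 365 and AT = ED × 365, the exposure terms cancel and HQ reduces to C × IR / (BW × RfD).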

Table 2. Descriptive statistics of water quality parameters.

Parameter	Min	Max	Mean	SD	WHO Limit
pH	7.00	8.40	7.85	0.28	6.5–8.5
EC (µS/cm)	445	2061	761	376	1500
Turbidity (NTU)	0.02	33.00	4.36	5.45	5
TDS (mg/L)	269	1288	448	212.6	1000
HCO₃ (mg/L)	110	450	195.1	50.2	-
Alkalinity (mmol/L)	2.20	9.00	3.91	0.99	-
K (mg/L)	0.05	4.00	1.17	0.72	-
Na (mg/L)	9.00	336	68.3	50.7	200
Ca (mg/L)	10.00	280	56.6	38.7	75
Mg (mg/L)	2.00	88.00	24.5	16.7	50
Hardness (mg/L)	80	780	242.4	107.6	500
Cl (mg/L)	8.00	497	77.1	85.9	250
SO₄ (mg/L)	20.00	350	80.1	57.6	250
Fe (mg/L)	0.02	1.11	0.26	0.22	0.3
PO₄ (mg/L)	0.01	0.85	0.06	0.10	-
F (mg/L)	0.04	2.30	1.03	0.60	1.5
NO₃ (N) (mg/L)	1.00	18.00	4.4	3.4	50

Table 3. Feature selection on the basis of mean SHAP rank, domain knowledge, and spatial distribution pattern.

Feature	Mean SHAP Rank	Hydrogeochemical Relevance to Fluoride	Spatial Consistency	Final Decision
Turbidity	2.17	Proxy for colloids and Fe-associated F mobilization	Moderate	Retained
SO₄²⁻	5.33	Indicator of mineralization and residence time	Strong	Retained
Mg²⁺	6.00	Carbonate weathering and water–rock interaction	Low	Retained
EC	7.83	Proxy for salinity and groundwater evolution	Moderate	Retained
TDS	8.50	Cumulative dissolved load	Strong	Retained
Na⁺	10.33	Silicate weathering and cation exchange favouring F	Strong	Retained
Ca²⁺	9.50	Controls fluorite saturation and F precipitation	Moderate	Retained
pH	9.67	Controls F solubility and desorption	Moderate	Retained
HCO₃⁻	12.50	Carbonate buffering, alkaline conditions	Low	Retained
Cl⁻	13.42	Conservative tracer for source discrimination	Strong	Retained
Fe	5.00	Redox proxy but unstable across models	Low	Excluded
NO₃–N	8.17	Indicator of anthropogenic influence	Low	Excluded
K⁺	3.83	Weak mechanistic link to F	Low	Excluded
PO₄³⁻	9.50	Secondary anthropogenic signal	Low	Excluded
Hardness	10.33	Redundant with Ca²⁺ + Mg²⁺	Moderate	Excluded

Table 4. Comparative generalization performance of six machine learning models evaluated using 5 × 3 nested stratified cross-validation.

Model	AUC	AP	F1_High
SVC	0.664	0.552	0.447
DT	0.636	0.423	0.423
KNN	0.569	0.452	0.441
GNB	0.507	0.356	0.390
LR	0.431	0.319	0.421
XGB	0.626	0.47	0.431
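
For readers unfamiliar with the evaluation scheme named in the Table 4 caption, the following is a minimal sketch of nested stratified cross-validation using scikit-learn. It reads "5 × 3" as five outer folds and three inner folds, which is an assumption; the synthetic data, hyperparameter grid, class weighting, and random seeds are also illustrative and are not the authors' configuration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in for the groundwater dataset (assumption).
X, y = make_classification(
    n_samples=120, n_features=10, weights=[0.7, 0.3], random_state=0
)

# Inner loop tunes hyperparameters; outer loop estimates generalization.
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Scaling sits inside the pipeline so it is refit on each training fold,
# which avoids leaking test-fold statistics into the model.
pipeline = make_pipeline(StandardScaler(), SVC(class_weight="balanced"))

# Illustrative grid only; the tuned ranges in the study are not known here.
param_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.1]}

search = GridSearchCV(pipeline, param_grid, cv=inner_cv, scoring="roc_auc")

# Each outer fold runs a full inner grid search on its training portion.
outer_scores = cross_val_score(search, X, y, cv=outer_cv, scoring="roc_auc")

print("Outer-fold AUC scores:", outer_scores.round(3))
print("Mean AUC:", outer_scores.mean().round(3))
```

The point of the nesting is that the folds used to pick hyperparameters never overlap with the fold used to score the tuned model, so the reported AUC is not inflated by the tuning step.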

Table 5. Independent spatial validation results of final tuned machine learning models.

Model	Accuracy	AUC	AP	F1_High
SVC	0.75	0.821	0.483	0.571
KNN	0.50	0.589	0.269	0.455
DT	0.208	0.642	0.283	0.345
GNB	0.208	0.432	0.205	0.345
LR	0.25	0.568	0.274	0.357
XGB	0.25	0.337	0.192	0.308

Table 6. Fluoride concentration, hazard quotient (HQ) values for children and adults, and corresponding risk classification for selected high-priority water supply schemes where HQ indicates potential health risk.

Scheme Name	Population Served (Total)	Fluoride (mg/L)	HQ (Children)	HQ (Adults)	Risk Category
Baloch Colony	3400	2.30	2.56	1.10	Both child & adult
Ghulam Parenz-II	3000	2.23	2.48	1.06	Both child & adult
Killi Mohammad Hassni	2300	2.30	2.56	1.10	Both child & adult
Degree College Mastung	500	2.30	2.56	1.10	Both child & adult
Raiki	500	2.30	2.56	1.10	Both child & adult
Khadkocha-II	39,000	2.04	2.27	0.97	Child only
Choto Mill	25,400	0.94	1.04	0.45	Child only
Pringabad	9000	1.61	1.79	0.77	Child only
Ghulam Parenz Booster No 2	5000	1.63	1.81	0.78	Child only
Babri—II	4500	1.05	1.17	0.50	Child only
Nouroz Stadium	4500	1.21	1.34	0.58	Child only
Mastung Road	4000	2.06	2.29	0.98	Child only
Ghulam Parenz-I	3900	1.93	2.14	0.92	Child only
Shahi Bagh	3200	1.10	1.22	0.52	Child only
Kanak—I	3000	1.80	2.00	0.86	Child only
Pilot High School	3000	1.35	1.50	0.64	Child only
Azizabad	2800	1.71	1.90	0.81	Child only
Shah Dini	2800	1.90	2.11	0.90	Child only
Sheik Wasil	2800	1.69	1.88	0.80	Child only
Killi Noor Mohammad	2500	2.06	2.29	0.98	Child only
Pir Kanu	2500	1.33	1.48	0.63	Child only
Shamsabad	2500	1.05	1.17	0.50	Child only
Choto Shehar	2000	1.19	1.32	0.57	Child only
Katori	2000	1.71	1.90	0.81	Child only
Isplanji	1800	1.75	1.94	0.83	Child only
Killi Ghulam Haidar	1800	0.94	1.04	0.45	Child only
Amach—II	1500	1.05	1.17	0.50	Child only
Killi Hassni Dasht	1500	1.64	1.82	0.78	Child only
Yali Mastung	1500	1.22	1.36	0.58	Child only
Jalab Gundain	1400	1.16	1.29	0.55	Child only
Mohammad Shahi	1000	1.11	1.23	0.53	Child only

Table 7. Comparison of the current study with previous studies.

Study	Location	Best Model	Performance
[55]	Datong Basin, China	Support Vector Classifier	AUC = 0.82
[57]	Punjab, India	Extreme Learning Machine	R² = 0.95
[58]	5 Indian states	Random Forest	ACC = 75.8%
[16]	Pakistan	Random Forest	AUC = 0.92
[56]	Quetta (Pakistan)	Classification and Regression Tree	AUC = 0.732
[1]	Global	Random Forest	ACC = 0.82
Present Study	Mastung, Pakistan	Support Vector Classifier	ACC = 0.75, AUC = 0.821

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
