1. Introduction
Groundwater is a vital natural resource for ecosystems, agriculture, and human livelihoods. Over 2.5 billion people worldwide rely on groundwater as a primary source of safe drinking water. However, this resource faces significant challenges, particularly contamination. Recent studies show that more than 200 million individuals are exposed to groundwater fluoride contamination [1,2]. Excessive fluoride in groundwater can lead to several severe health disorders, including dental and skeletal fluorosis, crippling fluorosis, and osteosclerosis [3].
A high fluoride concentration in groundwater is mainly due to the natural geological settings. It is generally known that high fluoride in groundwater resources is attributed to the dissolution of fluoride-rich rocks [4]. Clay minerals and micas can also contribute to fluoride levels in groundwater [5], along with alkaline volcanic rocks [6]. Hydrothermal activity is another important natural source, where fluoride-rich fluids are released from host rocks into the surrounding water [7]. However, natural sources are not the only concern. Human activities significantly contribute to fluoride contamination as well. Industrial processes, particularly aluminum smelting [8] and coal processing [9], release substantial amounts of fluoride into groundwater systems. Agricultural practices also play a role, as the widespread use of fertilizers can introduce fluoride into groundwater supplies [10].
Machine learning (ML) has become a transformative analytical tool to understand and predict fluoride contamination. Conventional geostatistical methodologies often rely on the assumption of linear dependencies between variables, whereas the hydrogeochemical processes governing fluoride dynamics are complex and nonlinear. Thus, researchers worldwide are increasingly relying on machine learning algorithms to capture complex environmental patterns. In a global assessment of fluoride contamination conducted by [1], a Random Forest model was used to produce a global hazard map of fluoride contamination. In China, a nationwide map of geogenic fluoride contamination was developed using artificial neural network models [11]. Similarly, in Turkey, a study conducted by [12] used machine learning and deep learning models to predict groundwater fluoride using seasonal observations.
In Pakistan, fluoride contamination is a major concern affecting public health across multiple provinces. More than 25 million people in Pakistan are at risk from high fluoride concentrations in their groundwater [13]. In the Punjab province of Pakistan, fluoride contamination in groundwater and the associated health risk were reported by [14]. Similarly, ref. [15] investigated the geochemical processes driving fluoride enrichment in unconfined aquifers. Fluoride occurrence has also been reported around an active fluorite mining operation in Pakistan [9]. In a recent study conducted by [16], a fluoride contamination map was created at the national level using ML. This national-level map is very useful for identifying broad vulnerable zones. Its major drawback is that national-scale prediction relies on broad environmental covariates, which do not necessarily capture the local hydrogeochemical controls and community-level hotspot variation needed for local planning.
Groundwater fluoride pollution has been reported in different parts of Balochistan. Ref. [17] assessed fluoride in drinking water sources and correlated the variation in fluoride with physicochemical parameters. In other studies, Quetta and rural areas were the focus of descriptive statistics and spatial mapping to determine the high-fluoride zones [18]. More recent work has extended this with health risk indices, including the hazard quotient (HQ), and pollution indices to compare risk across districts [19]. In general, the available literature shows that fluoride contamination in Balochistan is prevalent and widespread across the region. Nevertheless, the majority of past studies have relied primarily on traditional hydrochemical interpretation, descriptive statistics, and simple spatial mapping. These methods are useful for reporting fluoride occurrence, but they are weak at incorporating multiple interacting predictors and the nonlinear relationships that are vital to predicting hotspots reliably in arid and semi-arid aquifers. Machine learning is therefore relevant in this field, as it can improve fluoride susceptibility mapping and support local planning by delineating high-risk areas more precisely.
This study develops and tests machine learning models to delineate groundwater zones where the fluoride concentration is high. Physicochemical parameters are used as predictor variables, and statistical preprocessing is applied to enhance model robustness and prediction performance. Machine learning algorithms are used to produce high-resolution susceptibility maps of fluoride contamination. The resulting maps provide an evidence base for identifying high-risk areas and for informing targeted mitigation and groundwater management approaches. In addition to predictive modelling, a human health risk assessment was conducted using chronic daily intake (CDI) and hazard quotient (HQ) indices. The integration of machine learning-based spatial prediction with health risk metrics provides a sustainability-oriented framework that facilitates evidence-based intervention strategies aligned with the objectives of Sustainable Development Goal 6 (Clean Water and Sanitation).
4. Results
The fluoride concentration in groundwater across Balochistan ranges from 0.04 to 2.30 mg/L, with a mean value of 1.03 mg/L. A statistical summary of the physicochemical parameters examined in the groundwater samples is presented in Table 2. The elevated concentrations of other parameters, such as hardness (up to 780 mg/L) and bicarbonate (up to 450 mg/L), suggest that carbonate mineral dissolution and rock–water interactions are the major processes controlling groundwater quality in the study area. The distribution of fluoride concentrations shows a clear imbalance in the dataset. As illustrated in Figure 3, the histogram on the left has a mean value of 1.03 mg/L and a skewness of 0.52, indicating that most samples fall within the uncontaminated range. The fluoride classes are therefore unevenly represented, with most samples falling below the World Health Organization guideline value.
The saturation indices of fluorite revealed that fluorite is undersaturated in all groundwater samples (Figure 4A,B), with a mean value of −1.38 and a range of −4.03 to −0.28, meaning that the dissolution of fluorite is thermodynamically favoured throughout the study area. In contrast, calcite and dolomite are mainly supersaturated (calcite: 86.4%; dolomite: 84.1%), and only a small proportion of the samples were near equilibrium (calcite: 10.2%; dolomite: 3.4%).
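The saturation index used above is conventionally defined as SI = log10(IAP) − log10(Ksp), where IAP is the ion activity product. The following minimal sketch shows that calculation for fluorite (CaF2). The log Ksp value of −10.6, the ±0.1 tolerance for "near equilibrium", and the sample activities are all assumptions made for illustration; the thermodynamic database and thresholds actually used in the study are not stated in this section.

```python
import math

# ASSUMPTION: log10(Ksp) of fluorite (CaF2) at 25 degrees C. A commonly
# tabulated value; the study's own thermodynamic database may differ.
LOG_KSP_FLUORITE = -10.6


def fluorite_log_iap(a_ca: float, a_f: float) -> float:
    """log10 of the ion activity product a(Ca2+) * a(F-)^2 (activities in mol/L)."""
    return math.log10(a_ca) + 2.0 * math.log10(a_f)


def saturation_index(log_iap: float, log_ksp: float) -> float:
    """SI = log10(IAP) - log10(Ksp)."""
    return log_iap - log_ksp


def classify(si: float, tol: float = 0.1) -> str:
    """Label a saturation state; the tolerance band is a hypothetical choice."""
    if si < -tol:
        return "undersaturated"
    if si > tol:
        return "supersaturated"
    return "near equilibrium"


# Hypothetical activities, not taken from the dataset.
a_ca = 1.0e-3  # Ca2+ activity, mol/L
a_f = 5.0e-5   # F- activity, mol/L

si = saturation_index(fluorite_log_iap(a_ca, a_f), LOG_KSP_FLUORITE)
print(f"SI(fluorite) = {si:.2f} -> {classify(si)}")
```

A negative SI means the water can still dissolve the mineral, which is the situation reported for fluorite in every sample. Note that a full speciation model would compute activities from concentrations with activity coefficients; this sketch takes activities as given.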
The Piper diagram (Figure 5) summarizes the major-ion composition and hydrochemical facies of groundwater in the study area. In the cation triangle, most samples lie within the no-dominant cation category, which means that none of Ca²⁺, Mg²⁺, or Na⁺ + K⁺ individually exceeds ~50% of total cations (meq%). This mixed-cation chemistry reflects the combined effect of various geochemical processes, such as carbonate weathering, alkali contributions (e.g., silicate weathering), or cation exchange, rather than the dominance of a single cation source. In the anion triangle, most samples plot between the bicarbonate and no-dominant types, showing that alkalinity is a significant contributor to the groundwater chemistry in the region. In the central diamond, a majority of samples fall within the Ca–Mg–HCO₃ mixed facies, which is typical of groundwater influenced by rock–water interactions under recharge conditions and carbonate buffering.
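The "no-dominant cation" classification rests on converting concentrations to milliequivalents and comparing each group's share against the ~50% threshold. A minimal sketch of that arithmetic is given below; the sample concentrations are hypothetical, and the function names are illustrative rather than taken from the study.

```python
# Equivalent weights in mg per meq (molar mass divided by ionic charge).
EQ_WEIGHT = {"Ca": 20.04, "Mg": 12.15, "Na": 22.99, "K": 39.10}


def to_meq(mg_per_l: dict) -> dict:
    """Convert cation concentrations from mg/L to meq/L."""
    return {ion: mg_per_l[ion] / EQ_WEIGHT[ion] for ion in EQ_WEIGHT}


def cation_type(mg_per_l: dict, threshold: float = 50.0):
    """Return the Piper cation category and the meq% share of each group."""
    meq = to_meq(mg_per_l)
    total = sum(meq.values())
    pct = {
        "Ca": 100.0 * meq["Ca"] / total,
        "Mg": 100.0 * meq["Mg"] / total,
        "Na+K": 100.0 * (meq["Na"] + meq["K"]) / total,
    }
    for group, share in pct.items():
        if share > threshold:
            return f"{group} type", pct
    return "no dominant type", pct


# Hypothetical samples (mg/L), not measured values from the study.
mixed = {"Ca": 60.0, "Mg": 30.0, "Na": 55.0, "K": 4.0}
sodic = {"Ca": 20.0, "Mg": 10.0, "Na": 200.0, "K": 5.0}

for name, sample in (("mixed", mixed), ("sodic", sodic)):
    label, shares = cation_type(sample)
    print(name, label, {k: round(v, 1) for k, v in shares.items()})
```

The same approach applies to the anion triangle with HCO₃⁻, SO₄²⁻, and Cl⁻ and their own equivalent weights.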
Pearson correlation analysis (Figure 6) reveals a strong positive relationship among EC, TDS, Cl⁻, and SO₄²⁻ (r > 0.90), indicating their collective contribution to groundwater mineralization.
The Na⁺–Cl⁻ plot (Figure 7A) indicates that, although a few samples lie near the 1:1 line, most samples do not follow this relationship, with some showing an enrichment of Na⁺ relative to Cl⁻.
The (Ca²⁺ + Mg²⁺)–HCO₃⁻ plot (Figure 7B) shows that the majority of samples lie above the 1:1 line, indicating that (Ca²⁺ + Mg²⁺) exceeds bicarbonate in most samples. These plots reveal that groundwater chemistry is influenced by multiple water–rock interaction processes rather than simple mineral dissolution.
Although electrical conductivity (EC), total dissolved solids (TDS), chloride (Cl⁻), and sulfate (SO₄²⁻) are strongly correlated (r > 0.9) (Figure 6), diagnostic plots differentiate their sources. The relationship between calcium (Ca²⁺) and sulfate (SO₄²⁻) does not follow the 1:1 dissolution line of gypsum (Figure 8A), which indicates that the sulfate concentrations are not mainly controlled by the dissolution of evaporites and that sulfate has a relatively small role in the overall mineralization of groundwater. On the other hand, the chloride–nitrate (Cl⁻ versus NO₃–N) relationship shows that most samples have low nitrate values across a range of chloride values, whereas only a small fraction shows high nitrate values at higher chloride values (Figure 8B). This distribution points to localized anthropogenic inputs rather than diffuse, widespread contamination. Collectively, the results imply that natural mineralization processes control the salinity of groundwater.
The spatial distribution patterns of fluoride and other hydrogeochemical parameters, such as Na⁺, SO₄²⁻, Cl⁻, hardness, and TDS, are shown in Figure 9. The consistent spatial correlation among these maps indicates that fluoride enrichment is controlled at the regional level and is related to the mineralization of groundwater rather than to single point-source contamination.
A sampling density map and a local spatial variability map of fluoride were developed to assess the uncertainty in the spatial distribution. The sampling density map (Figure 10A) indicates a high concentration of groundwater samples in the central region of the study area and relatively low coverage in the peripheral areas. The variance map (Figure 10B) of fluoride reveals that the region of high variability overlaps with the high sampling density zone, indicating that the observed heterogeneity reflects genuine hydrogeochemical complexity rather than artefacts of sparse sampling. However, isolated high-variability zones occurring in low-density areas should be interpreted with greater uncertainty.
On the basis of the integrated evaluation of ensemble SHAP ranking, hydrogeochemical relevance, and patterns of spatial distribution, a final set of features was determined for fluoride classification (Table 3). Turbidity, SO₄²⁻, Mg²⁺, EC, TDS, Na⁺, Ca²⁺, pH, HCO₃⁻, and Cl⁻ were retained because they are consistently important in SHAP and physically meaningful in terms of their relationship with groundwater mineralization and fluoride mobilization. On the other hand, Fe, NO₃–N, K⁺, PO₄³⁻, and hardness were excluded because their SHAP values were unstable, they had limited hydrogeochemical relevance, or they were redundant with other variables.
The classification performance of six machine learning models, Support Vector Classifier (SVC), Logistic Regression (LR), XGBoost, Decision Tree (DT), Gaussian Naïve Bayes (NB), and K-Nearest Neighbour (KNN), was evaluated by using 5 × 3 nested stratified cross-validation to prevent hyperparameter tuning bias.
Table 4 summarizes the outer fold mean performance.
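The 5 × 3 nested stratified scheme can be pictured as five outer folds for performance estimation, each with three inner folds used only for hyperparameter tuning. The sketch below builds those index splits with the standard library alone; it is an illustration of the splitting logic under assumed data (64 hypothetical labels with a 3:1 class ratio), not the study's implementation, which is not described in this section.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, seed=0):
    """Split positions 0..n-1 into k folds with similar class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(k)]
    for members in by_class.values():
        rng.shuffle(members)
        for pos, idx in enumerate(members):
            folds[pos % k].append(idx)  # deal each class round-robin
    return folds


def nested_cv_splits(labels, outer_k=5, inner_k=3, seed=0):
    """Yield (outer_train, outer_test, inner_splits) for nested CV.

    inner_splits is a list of (fit, validate) index pairs drawn only from
    outer_train, so the outer test fold never influences tuning.
    """
    outer = stratified_folds(labels, outer_k, seed)
    for i, test_idx in enumerate(outer):
        train_idx = [idx for j, fold in enumerate(outer) if j != i for idx in fold]
        train_labels = [labels[idx] for idx in train_idx]
        inner = stratified_folds(train_labels, inner_k, seed + i + 1)
        inner_splits = []
        for m, val_pos in enumerate(inner):
            val = [train_idx[p] for p in val_pos]
            fit = [train_idx[p] for n, fold in enumerate(inner) if n != m for p in fold]
            inner_splits.append((fit, val))
        yield train_idx, test_idx, inner_splits


# Hypothetical labels: 0 = low fluoride, 1 = high fluoride.
labels = [0] * 48 + [1] * 16

for train_idx, test_idx, inner_splits in nested_cv_splits(labels):
    n_high = sum(labels[i] for i in test_idx)
    print(f"outer test n={len(test_idx)} (high={n_high}), inner folds={len(inner_splits)}")
```

In practice each candidate hyperparameter setting would be scored on the inner validation folds, the best setting refitted on the outer training set, and the outer test fold scored once; the reported values are the means over the five outer folds.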
The Support Vector Classifier demonstrated the best generalization performance. It showed the highest AUC (0.664), Average Precision (0.552), and F1_high (0.447), and thus has better discriminative strength and minority-class ranking performance than the other models. Logistic Regression exhibited the lowest generalization ability (AUC = 0.43).
After the nested cross-validation, the final tuned models were retrained with the entire data set and then tested again on the independent spatial holdout test set to assess the external generalization performance (Table 5).
The Support Vector Classifier (SVC) demonstrated the best predictive power on the independent test set (Table 5). The model achieved an overall accuracy of 0.75 with an AUC of 0.821, which implies a good ability to distinguish low- and high-fluoride samples. The Average Precision (AP) score was 0.483, reflecting a moderate precision–recall performance under class imbalance. The F1 score of the high-fluoride category (F1_high) was 0.571.
The confusion matrix indicated that the SVC correctly classified 14 out of 19 low-fluoride samples and 4 out of 5 high-fluoride samples (Figure 11A). The sensitivity (recall of the high-fluoride category) and specificity were therefore 0.80 and 0.74, respectively. The model missed only one contaminated sample (false negative), while five low-fluoride samples were incorrectly classified as high-fluoride (false positives). The precision for the high-fluoride class was 0.44, indicating that fewer than half of the samples predicted as contaminated were truly high-fluoride. In general, the SVC provided acceptable sensitivity and specificity and strong discrimination capability, with limited false negative occurrences.
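The reported holdout metrics follow directly from the four confusion-matrix counts given in the text (4 true positives, 1 false negative, 14 true negatives, 5 false positives). The short sketch below recomputes them; the function name is illustrative.

```python
def binary_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Standard metrics for the positive (high-fluoride) class."""
    sensitivity = tp / (tp + fn)            # recall of the high-fluoride class
    specificity = tn / (tn + fp)            # recall of the low-fluoride class
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1_high": f1,
    }


# Counts as stated in the text for the SVC on the spatial holdout set.
metrics = binary_metrics(tp=4, fn=1, tn=14, fp=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With only five high-fluoride samples in the holdout set, each of these ratios moves by a large step when a single sample changes class, so they are best read as indicative rather than precise.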
The discriminative performance of the SVC model is also depicted by the ROC curve (Figure 11B). The curve lies above the diagonal reference line, which confirms a clear separation between low- and high-fluoride samples. The AUC value of 0.821 indicates a good overall classification ability across varying decision thresholds, meaning that the model ranks a contaminated sample above a non-contaminated one in most pairwise comparisons.
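The ranking interpretation of AUC can be made concrete: AUC equals the fraction of (high-fluoride, low-fluoride) sample pairs in which the high-fluoride sample receives the higher score, with ties counted as half. The sketch below computes it that way on a small set of hypothetical scores, which are not model outputs from the study.

```python
def roc_auc(y_true, scores) -> float:
    """AUC as the share of positive/negative pairs ranked correctly (ties = 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = high fluoride) and classifier scores.
y_true = [1, 1, 0, 1, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.8, 0.4, 0.3, 0.5, 0.2, 0.1]

auc = roc_auc(y_true, scores)
print(f"AUC = {auc:.3f}")
```

Here 12 of the 15 pairs are ordered correctly. An AUC of 0.5 corresponds to the diagonal reference line, and 1.0 to perfect ranking.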
A SHAP summary plot (Figure 12) revealed the relative influence of the individual features used by the SVC to classify samples with a high fluoride concentration. Turbidity and SO₄²⁻ were the most influential features, followed by pH and HCO₃⁻. Higher values of turbidity and sulfate predominantly contributed positive SHAP values, indicating an increased probability of high fluoride concentrations. Similarly, elevated pH and bicarbonate levels were associated with positive contributions toward high-fluoride classification, consistent with alkaline conditions promoting fluoride mobilization. In contrast, calcium exhibited mixed contributions, with higher Ca²⁺ values often associated with reduced fluoride probability, potentially reflecting fluorite precipitation effects. Electrical conductivity (EC) and total dissolved solids (TDS) showed comparatively lower marginal contributions, suggesting that specific ion chemistry rather than bulk salinity played a stronger role in model prediction. Overall, the SHAP analysis shows that the SVC model represents hydrogeochemically relevant controls on the occurrence of fluoride.
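SHAP attributions are based on Shapley values, which average a feature's marginal contribution over all orders in which features can be revealed to the model. The toy sketch below computes exact Shapley values for a three-feature scoring function by brute-force enumeration. The scoring function, its coefficients, the sample, and the baseline are all hypothetical and stand in for the fitted SVC, which would in practice be explained with an approximate method.

```python
from itertools import permutations


def shapley_values(model, x: dict, baseline: dict) -> dict:
    """Exact Shapley values for one sample.

    Features not yet revealed keep their baseline value; each feature's
    attribution is its mean marginal contribution over all orderings.
    """
    names = list(x)
    phi = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)
        previous = model(current)
        for name in order:
            current[name] = x[name]
            updated = model(current)
            phi[name] += updated - previous
            previous = updated
    return {name: total / len(orderings) for name, total in phi.items()}


def toy_score(f: dict) -> float:
    """HYPOTHETICAL score: rises with pH and HCO3, falls with a Ca-pH interaction."""
    return 0.3 * f["pH"] + 0.02 * f["HCO3"] - 0.01 * f["Ca"] * f["pH"]


baseline = {"pH": 7.0, "HCO3": 200.0, "Ca": 60.0}  # assumed reference sample
x = {"pH": 8.2, "HCO3": 320.0, "Ca": 25.0}         # assumed sample to explain

phi = shapley_values(toy_score, x, baseline)
for name, value in phi.items():
    print(f"{name}: {value:+.3f}")
print("sum of attributions:", round(sum(phi.values()), 3))
print("score difference:   ", round(toy_score(x) - toy_score(baseline), 3))
```

The attributions always sum to the difference between the sample's score and the baseline score, which is why SHAP values can be read as an additive decomposition of a single prediction.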
The spatial distribution of observed and predicted fluoride classes is shown in Figure 13A,B. Figure 13A shows that high-fluoride samples are predominantly present in the central and eastern parts of the district, while the low-fluoride samples are uniformly distributed throughout the area.
The spatial validation of the Support Vector Classifier (SVC) on the independent geology-based holdout test set is shown in Figure 13B. Most of the high-fluoride samples were correctly identified (true positives: red triangles, four samples). Similarly, most of the low-fluoride samples were accurately identified (true negatives: green triangles, 14 samples). The misclassifications, comprising false positives (yellow triangles, five samples) and false negatives (black triangle, one sample), did not exhibit systematic clustering within any single lithological domain. The agreement between the observed and predicted classes across the study area indicates that the SVC model consistently captures the regional hydrogeochemical controls on fluoride distribution.
Health risk assessment (HQ) and population exposure
Health risk assessment revealed that children have significantly greater HQ values compared to adults at the same level of fluoride exposure. The HQ range among children is 0.04 to 2.56, with a mean of 1.14 and median of 1.04. By comparison, the HQ range of adults is 0.02 to 1.10, with a mean of 0.49 and median of 0.45 (Table 6). Overall, 45/88 samples (51.1%) had HQ child > 1, whereas only 5/88 samples (5.68%) had HQ adult > 1 (Figure 14A).
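The hazard quotient is commonly computed as HQ = CDI / RfD, with the chronic daily intake CDI = C × IR / BW (concentration times daily water intake divided by body weight). The sketch below applies these formulas under assumed parameters: an oral reference dose of 0.06 mg/kg/day, and intake and body weight of 1 L/day and 15 kg for children and 2 L/day and 70 kg for adults. These values are assumptions, chosen because they reproduce the HQ ranges reported above from the stated fluoride range; the study's Methods section remains the authoritative source.

```python
# ASSUMPTION: oral reference dose for fluoride, mg per kg body weight per day.
RFD_FLUORIDE = 0.06

# ASSUMPTION: exposure parameters per receptor group (not stated in this section).
EXPOSURE = {
    "child": {"intake_l_per_day": 1.0, "body_weight_kg": 15.0},
    "adult": {"intake_l_per_day": 2.0, "body_weight_kg": 70.0},
}


def chronic_daily_intake(conc_mg_l: float, intake_l_per_day: float,
                         body_weight_kg: float) -> float:
    """CDI in mg/kg/day from drinking water ingestion."""
    return conc_mg_l * intake_l_per_day / body_weight_kg


def hazard_quotient(conc_mg_l: float, group: str) -> float:
    """HQ = CDI / RfD; HQ > 1 flags a potential non-carcinogenic risk."""
    p = EXPOSURE[group]
    cdi = chronic_daily_intake(conc_mg_l, p["intake_l_per_day"], p["body_weight_kg"])
    return cdi / RFD_FLUORIDE


# Minimum, mean, and maximum fluoride concentrations reported in the text (mg/L).
for conc in (0.04, 1.03, 2.30):
    hq_child = hazard_quotient(conc, "child")
    hq_adult = hazard_quotient(conc, "adult")
    print(f"F = {conc:.2f} mg/L: HQ child = {hq_child:.2f}, HQ adult = {hq_adult:.2f}")
```

Because HQ is linear in concentration, the child-to-adult ratio is fixed by the exposure parameters alone, which explains why children exceed HQ = 1 at much lower fluoride concentrations than adults.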
Population exposure is measured based on the total population being served under every scheme rather than age-specific counts.
Figure 14B shows five schemes where HQ child and HQ adult are both greater than 1, indicating the highest-priority sources for mitigation. These schemes are Baloch Colony (3400), Ghulam Parenz 2 (3000), Killi Mohammad Hessni (2300), Degree College Mastung (500), and Raiki (500), with a total population of 9700 people.
Discussion
Hydrogeochemical results reveal that rock–water interactions, rather than anthropogenic sources, mainly control the enrichment of fluoride in Mastung groundwater, and that fluorite dissolution is thermodynamically favoured in the aquifer system, as evidenced by the universal undersaturation of fluorite. This observation is consistent with the literature, in which semi-arid hard-rock environments, long residence times, and alkalinity favour the release of fluoride from fluoride-bearing minerals [46].
The dominance of Ca–Mg–HCO₃ to mixed facies in the Piper plot is further evidence of carbonate weathering and buffering reactions. The supersaturation of calcite and dolomite is indicative of a tendency toward carbonate precipitation, which in turn may indirectly increase fluoride mobility by lowering aqueous Ca²⁺ levels, thereby shifting the equilibrium toward the further dissolution of fluorite. The relationship between carbonate precipitation and fluoride enrichment has been reported in India, China, and East Africa [47,48,49].
The strong correlations among EC, TDS, Cl⁻, and SO₄²⁻ (r > 0.9) reflect generalized mineralization processes. The ion-ratio plots show that sulfate is not predominantly governed by the dissolution of gypsum, and the nitrate concentration in most samples is low, meaning that there is no extensive anthropogenic contamination. Similarly, Na⁺ enrichment over Cl⁻ indicates silicate weathering and cation exchange processes, which have been shown to be primary agents influencing the accumulation of fluoride in arid aquifers [50]. Together, these findings indicate that the source of fluoride is geogenic and controlled by the lithology of the region, the evolution of groundwater, and alkaline hydrochemical conditions rather than point-source contamination.
A key contribution of this study is the integration of ensemble SHAP-based interpretations with hydrogeochemical reasoning and spatial consistency. The retained predictors, turbidity, SO₄²⁻, Mg²⁺, EC, TDS, Na⁺, Ca²⁺, pH, HCO₃⁻, and Cl⁻, are all directly or indirectly linked to groundwater evolution and mineralization processes. The inclusion of pH and HCO₃⁻ despite their low mean SHAP rank is justified because alkaline conditions enhance the solubility of fluoride through mineral dissolution and desorption. Similarly, the inclusion of Ca²⁺ as a retained feature is hydrogeochemically consistent, as calcium concentrations regulate fluorite saturation through precipitation–dissolution dynamics. Meanwhile, features such as NO₃–N and PO₄³⁻ were excluded because of limited hydrogeochemical relevance and a weak spatial distribution pattern. The convergence between SHAP rankings, spatial distribution patterns, and classical hydrogeochemical interpretation enhances confidence that the model captures physically meaningful processes rather than statistical artefacts.
The Support Vector Classifier showed the strongest predictive performance, with an AUC of 0.664 in nested cross-validation and a higher value of 0.821 on the independent spatial holdout test set, where the overall accuracy was 0.75. The sensitivity and specificity of the SVC model were 0.80 and 0.74, respectively, which indicates a balanced discrimination of the high- and low-fluoride groups and a relatively low rate of false negatives.
The high performance of the SVC is most likely explained by the ability to capture nonlinear interactions in high-dimensional feature spaces [51,52]. Hydrogeochemical systems can be considered inherently nonlinear because of the combined mineral–water interactions, hence making kernel-based classifiers suitable. On the other hand, Logistic Regression exhibited lower discriminatory power, which suggests that the linear decision boundaries are insufficient to represent fluoride controls in this system. Gaussian Naive Bayes did not perform well, which is likely due to its assumption of conditional independence between features, which does not hold when hydrochemical variables are strongly correlated, as observed in recent research [53,54].
The spatial holdout AUC of 0.821 obtained in this study is comparable to several recent regional investigations. Ref. [55] reported AUC = 0.82 using SVC in the Datong Basin, China, while [56] achieved AUC = 0.73 using CART in western Balochistan. At broader scales, ref. [1] reported an overall accuracy of approximately 0.82 using Random Forest in a global assessment of fluoride risk and [16] achieved higher AUC values (~0.92) at the national scale in Pakistan (Table 7).
The spatial comparison between observed and predicted fluoride classes demonstrates that the model’s performance is geographically consistent across the study area. The majority of predictions correspond well with the measured fluoride categories, indicating that the classifier maintained stability when applied to spatially independent test data. Importantly, the misclassified samples are scattered rather than concentrated within a particular sector or geological unit. This pattern suggests that prediction errors are not structurally biased and are unlikely to result from spatial overfitting or data leakage. The geology-informed spatial holdout strategy therefore provides a realistic evaluation of model generalization across different parts of the district.
Despite these strengths, some limitations remain. The relatively small sample size is a common limitation of groundwater research in arid and sparsely populated areas like Balochistan. A small dataset poses challenges for machine learning applications; this study mitigated them through nested cross-validation, spatial holdout validation, and a feature selection framework that integrates SHAP analysis with hydrogeochemical reasoning and spatial distribution patterns.
Because the sampling was performed in a single field campaign, we cannot quantify seasonal variability in fluoride (e.g., seasonal mean differences, within-year standard deviation, or percentage change at repeated locations). Therefore, the susceptibility maps represent a snapshot of groundwater conditions during the sampling period rather than an annual average surface. In addition, this paper measures exposure-based risk in terms of the hazard quotient and the population served by each high-risk scheme; clinical prevalence data for dental fluorosis and case data on skeletal fluorosis were not available in this hydrogeochemical and modelling study. Therefore, we report the population served by high-risk schemes as a screening indicator rather than as confirmed disease burden. Future work should integrate community dental examinations (e.g., Dean’s Index surveys in school children) and health records to quantify the true prevalence of dental/skeletal fluorosis and validate exposure–outcome relationships.
5. Conclusions
This study demonstrated that machine learning-based spatial modelling is an appropriate approach to classify fluoride contamination in groundwater. The statistical analysis revealed that 25% of samples are above the WHO permissible limit, indicating a significant public health concern in parts of the study area.
The saturation index of fluorite, ion ratios, and spatial distribution patterns support the conclusion that the enrichment of fluoride is geogenic and is mainly caused by water–rock interaction processes. Parameters such as TDS, Cl⁻, Na⁺, SO₄²⁻, and related mineralization indicators exhibited spatial patterns consistent with fluoride distribution, suggesting shared mobilization mechanisms.
Among the evaluated machine learning models, the Support Vector Classifier (SVC) demonstrated the most reliable performance. The nested cross-validation framework yielded stable internal generalization (outer AUC = 0.664), while the independent geology-informed spatial holdout test produced an accuracy of 0.75 and AUC of 0.821, indicating a strong discriminatory capability between low- and high-fluoride groundwater samples. The integration of SHAP-based interpretation further confirmed that the model captured hydrogeochemically meaningful relationships, particularly in the roles of pH, HCO₃⁻, Na⁺, and Ca²⁺ in controlling fluoride mobility.
The health risk assessment reveals that children are a high-risk group, with over 50% of the samples exceeding a hazard quotient of 1 for children. The findings highlight the urgency of targeted intervention programmes in the high-risk areas.
In terms of sustainability, the combination of spatial machine learning modelling and health risk assessment provides an effective decision-support framework for managing the long-term risk of groundwater contamination. Risk-informed mapping supports the fair allocation of mitigation resources and also helps in the sustainable governance of drinking water in water-stressed areas.
Future studies incorporating expanded spatial coverage, multi-season sampling, and the integration of epidemiological data would further enhance predictive accuracy and strengthen the linkage between hydrogeochemical modelling and public health protection.