Integrated Machine Learning and Health Risk Assessment for Groundwater Nitrate Contamination in Handan City, China

Zhao, Yuanchao; Liu, Jing; Zhang, Xiaokai; Li, Qun; Wu, Jin

doi:10.3390/w18101174

by Yuanchao Zhao ¹, Jing Liu ², Xiaokai Zhang ³, Qun Li ¹,* and Jin Wu ⁴,*

¹ Key Laboratory of Soil Environmental Management and Pollution Control, Nanjing Institute of Environmental Sciences, Ministry of Ecology and Environment of the People’s Republic of China, Nanjing 210042, China

² China National Environmental Monitoring Centre, Beijing 100012, China

³ Hebei Geological Environment Monitoring Institute, Shijiazhuang 050022, China

⁴ Advanced Interdisciplinary Institute of Satellite Applications, Faculty of Geographical Science, Beijing Normal University, Beijing 100875, China

* Authors to whom correspondence should be addressed.

Water 2026, 18(10), 1174; https://doi.org/10.3390/w18101174

Submission received: 3 April 2026 / Revised: 27 April 2026 / Accepted: 1 May 2026 / Published: 13 May 2026

(This article belongs to the Section Hydrogeology)

Abstract

Groundwater nitrate (NO₃⁻) pollution is a critical environmental challenge with direct implications for human health. In this work, we propose a comprehensive analytical framework that integrates multi-model intercomparison, interpretable machine learning techniques, and quantitative health risk evaluation to tackle the pressing groundwater nitrate governance dilemmas in Handan City, a representative urban area in North China. Based on 157 groundwater samples and 17 hydrochemical parameters, comparative analysis of three state-of-the-art machine learning algorithms showed that the Light Gradient Boosting Machine (LightGBM) algorithm outperformed all counterparts, delivering the optimal predictive performance (R² = 0.753, RMSE = 3.67). SHapley Additive exPlanations (SHAP) analysis identified F⁻, Ca²⁺, Cl⁻, K⁺, total hardness, and Mg²⁺ as dominant factors influencing groundwater NO₃⁻ concentrations, reflecting the combined effects of carbonate dissolution, nitrification, and anthropogenic inputs. Subsequently, we performed a health risk assessment based on the standard methodological framework issued by the United States Environmental Protection Agency (USEPA), and the results indicated that children were the most vulnerable group, with hazard quotient (HQ, a non-carcinogenic risk indicator) values reaching 1.07 in the western mountainous region, exceeding the safety threshold (HQ > 1). These findings clarify the pollution mechanisms and spatial heterogeneity, and provide targeted policy guidance for groundwater protection as well as the safeguarding of public health.

Keywords:

groundwater nitrate pollution; machine learning; SHAP interpretability; health risk assessment

1. Introduction

Groundwater is a vital resource for drinking, domestic use, industrial development, and agricultural irrigation, especially in arid and semi-arid regions [1]. However, it is reported that tens of thousands of people worldwide die each day due to the lack of access to clean water [2]. Among all contaminants in groundwater, nitrate (NO₃⁻) has emerged as one of the most widely reported worldwide [3]. Furthermore, high NO₃⁻ concentrations in potable water pose significant risks to human health and the environment, including methemoglobinemia and thyroid hypertrophy, especially in infants and children [4]. Notably, the health effects of nitrate exposure are mostly long-term, potentially emerging decades later as cognitive or chronic impairments. The sources of NO₃⁻ pollution are diverse and characterized by randomness, complexity, and uncertainty, which complicates effective risk management. Therefore, it is crucial to identify the potential factors influencing groundwater NO₃⁻ pollution and to develop a health risk prediction model [5].

NO₃⁻ contamination is a global environmental concern, originating from diverse sources spanning agricultural fertilizers, domestic sewage, industrial wastewater, and manure [6,7,8,9]. Among these, nitrogen fertilizers, manure, and sewage are the primary contributors to NO₃⁻ enrichment in groundwater [7,10]. The increasing global demand for food production has further exacerbated this issue, leading to unsustainable fertilizer use and subsequent NO₃⁻ accumulation in groundwater systems [11]. Additionally, the transformations among NO₃⁻, NO₂⁻, and NH₄⁺ are governed by complex biogeochemical processes and multiple environmental factors such as organic matter content and pH [12]. Therefore, given the multiplicity of pollution sources and the dynamic nature of nitrogen cycling, accurately predicting NO₃⁻ concentrations in groundwater necessitates a research framework incorporating comprehensive hydrochemical parameters and advanced predictive modeling techniques.

In recent years, a growing number of studies have used machine learning to address challenges in water research because of its effectiveness in modeling complex relationships among hydrochemical parameters, particularly for NO₃⁻ pollution [13,14]. Among various machine learning algorithms, random forest (RF) is an ensemble method that offers high accuracy in predicting hydrochemical parameters by building multiple decision trees [15,16]. Extreme gradient boosting (XGBoost) has been shown to outperform other tree-based ensemble methods because it builds trees sequentially while parallelizing the split search within each tree [17]. In contrast to XGBoost, which employs a pre-sorted decision tree algorithm, Light Gradient Boosting Machine (LightGBM) employs a histogram-based decision tree algorithm, enabling significantly faster training on large-scale datasets [1]. Despite the widespread application of machine learning in water research, complex models still suffer from limited interpretability [18]. Researchers often obtain model prediction results without insight into the underlying decision-making processes. This lack of transparency poses significant risks to the broader application of machine learning in groundwater studies [19]. SHapley Additive exPlanations (SHAP) is an interpretability algorithm that quantifies feature importance at both global and local levels using game-theoretic principles [17]. The SHAP model has been widely employed in water environmental studies, greatly enhancing the interpretability of machine learning models [20,21,22].

NO₃⁻ poses significant health risks when reduced to NO₂⁻ under specific environmental conditions, potentially causing methemoglobinemia, hypertension, and thyroid disorders [23]. Chronic exposure to NO₃⁻-contaminated groundwater has been shown to increase health risks in exposed populations substantially [24]. Quantifying the non-carcinogenic risks of NO₃⁻ exposure is essential for developing effective public health interventions [25]. Traditionally, NO₃⁻ risk assessment has depended on extensive groundwater sampling and laboratory-based concentration analyses [26]. With advances in machine learning, predictive modeling has emerged as an effective tool for estimating NO₃⁻-related health risks and identifying high-risk areas [27]. Therefore, machine learning models are conducive to promoting the implementation of targeted mitigation measures and optimizing public health protection strategies.

Despite the increasing application of machine learning algorithms in predicting groundwater nitrate (NO₃⁻) concentrations, several critical gaps remain in the existing literature. Most studies rely on a single algorithm, which undermines the robustness and stability of predictive outcomes. Furthermore, the lack of interpretability in model decision-making prevents domain experts from fully understanding the hydrogeochemical mechanisms underlying predictions, thereby limiting their trust in model results. More importantly, current research has primarily focused on concentration prediction, without effectively linking model outputs to health risk assessment frameworks, which significantly reduces the practical relevance of such studies for groundwater risk management. To overcome these limitations, we propose a comprehensive multi-dimensional analytical framework that integrates three representative machine learning models, SHAP-based interpretability analysis, and the USEPA health risk assessment model. Using Handan City as a typical case study area, this study aims to: (1) predict NO₃⁻ concentrations using RF, XGBoost, and LightGBM algorithms, and comparatively evaluate their prediction accuracy; (2) apply the explainable machine learning technique SHAP to analyze model performance and decision-making processes in NO₃⁻ concentration prediction; (3) predict the non-carcinogenic health risks associated with NO₃⁻ exposure using health risk assessment models. This study provides novel insights for accurate NO₃⁻ pollution risk prediction and sustainable groundwater resource management.

2. Materials and Methods

We developed an integrated framework to assess groundwater nitrate risks. Nitrate concentrations were predicted using RF, XGBoost, and LightGBM, with model performance benchmarked by multiple accuracy metrics. SHAP analysis was applied to identify the dominant hydrochemical drivers and interpret model behavior. Predicted concentrations were subsequently incorporated into the USEPA-recommended health risk assessment framework to quantify non-carcinogenic health hazards, thereby enabling spatial risk delineation and laying a scientific groundwork for sustainable groundwater management practices (Figure 1).

2.1. Study Area

Handan City, located in the southern hinterland of Hebei Province in North China, has a total administrative area of approximately 12,000 km². Geographically, the city extends from 113°27′ E to 115°38′ E in longitude and from 36°04′ N to 37°02′ N in latitude (Figure 2). Its neighboring areas include Liaocheng City of Shandong Province to the east, Anyang City of Henan Province to the south, the Taihang Mountain range to the west, and Xingtai City of Hebei Province to the north. The regional topography shows a distinct west-to-east gradient, descending gradually from the uplifted western mountains to the flat eastern alluvial plains. The area is characterized by a warm temperate semi-humid continental monsoon climate, with pronounced seasonal differences. Based on data published by the Hebei Provincial Meteorological Bureau (2021), the mean annual temperature is around 13–14 °C, and the 30-year average precipitation (1981–2010) is 500–600 mm, with more than 70% of the annual rainfall occurring during the summer months (June to August). Intensive agriculture and urban expansion contribute to high nitrogen inputs and groundwater vulnerability, making the city an ideal setting for evaluating NO₃⁻ contamination and health risks. Figure 2 also shows the land cover, population distribution, and elevation patterns of the study area.

2.2. Groundwater Sampling and Analysis

For this research, we collected a total of 157 groundwater samples in July 2023 from shallow phreatic aquifers distributed throughout the study region, with sampling depths varying between 5 m and 30 m. All samples were obtained from wells used for domestic drinking water supply and agricultural irrigation purposes. All hydrochemical laboratory analyses were performed at the Hebei Institute of Geological Environment Monitoring. pH values were determined via the potentiometric electrode method, whereas total hardness and carbonate species (HCO₃⁻, CO₃²⁻, and free CO₂) were determined by titration. Total dissolved solids (TDS) and permanganate index (CODMn) were measured using gravimetric and titration methods, respectively. Major cations (Na⁺, K⁺, Ca²⁺, Mg²⁺) were quantified by inductively coupled plasma optical emission spectrometry (ICP-OES), and major anions (SO₄²⁻, Cl⁻, NO₃⁻, NO₂⁻, F⁻) were measured by ion chromatography. NH₄⁺ and S²⁻ were analyzed using flow injection spectrophotometry. In accordance with APHA standard methods, the ionic balance error for all samples was controlled within ±5%, which confirmed the accuracy and reliability of the obtained analytical data.

2.3. Data Preparation

Prior to model development, the dataset was subjected to a structured preprocessing procedure. Outliers in the predictor variables were identified using the interquartile range (IQR) method and removed to minimize bias introduced by extreme values. Continuous predictor variables were then standardized using Z-score normalization (mean = 0, standard deviation = 1) to mitigate disparities in scale and measurement units across predictors [28]. Importantly, the target variable (groundwater NO₃⁻ concentration) was retained in its original unit (mg/L) to preserve the interpretability and practical relevance of the predicted outcomes. This preprocessing step ensured comparability among input features while maintaining the real-world meaning of the prediction target.
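The two preprocessing steps described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the 1.5 × IQR fence multiplier, the use of the population standard deviation, and the chloride values are all assumptions made for the example.

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) outlier fences; k=1.5 is an assumed, conventional multiplier."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def zscore(values):
    """Standardize values to mean 0 and (population) standard deviation 1."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

# Hypothetical Cl- concentrations (mg/L) with one extreme value.
chloride = [35.0, 42.0, 38.0, 51.0, 47.0, 40.0, 44.0, 39.0, 400.0]

low, high = iqr_bounds(chloride)
kept = [v for v in chloride if low <= v <= high]  # drop values outside the fences
scaled = zscore(kept)                             # predictor is scaled; the target stays in mg/L
```

Only predictors are scaled in this sketch, mirroring the statement that the NO₃⁻ target is kept in mg/L.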

The entire preprocessed hydrochemical dataset was randomly partitioned into a training set (80% of the total samples) and a completely independent test set (20% of the total samples) using a fixed random seed to eliminate any randomness-induced variability in results. This 8:2 train–test split ratio has been extensively adopted in the field of environmental machine learning, as it achieves a favorable trade-off between providing enough samples for effective model parameter estimation and feature learning, and retaining a sufficiently large independent test set for unbiased evaluation of the model’s out-of-sample generalization capability. The training dataset was exclusively used for training the machine learning algorithms and performing systematic hyperparameter tuning, while the independent test dataset, which was never seen by the model during training, was used to objectively evaluate the final model prediction accuracy and generalization performance.
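A seeded 80/20 partition of sample indices might look like the following sketch. The function name and the seed value are illustrative assumptions (the seed used for the split is not stated here), and the sample count of 157 ignores any rows removed as outliers.

```python
import random

def train_test_split_indices(n_samples, test_fraction=0.2, seed=42):
    """Shuffle indices with a fixed seed and split them into (train, test) lists."""
    rng = random.Random(seed)              # local generator, so the global state is untouched
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]

# 157 samples as reported; the seed is an assumption for illustration.
train_idx, test_idx = train_test_split_indices(157)
```

Because the generator is seeded, calling the function again returns the same partition, which is what makes the reported metrics reproducible.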

2.4. Machine Learning Models

2.4.1. Random Forest (RF)

Random Forest (RF), the seminal ensemble learning algorithm developed by Breiman [29], is a robust supervised learning approach that mitigates the overfitting problem of single decision trees by constructing a large number of independent decision trees. It employs two key randomization strategies: bootstrap sampling of the training dataset to generate diverse tree subsets, and random feature selection at each node split to determine the optimal partitioning variable [30]. When applied to regression tasks, the final prediction result of the RF model is produced by aggregating the outputs of all constituent decision trees through simple averaging. The complete mathematical expression describing the overall model prediction can be written as:

$$\hat{y} = \frac{1}{T} \sum_{t=1}^{T} h_t(x) \tag{1}$$

where $T$ represents the total number of trees and $h_t(x)$ denotes the prediction of the t-th tree. A major advantage of RF is its ability to capture nonlinear relationships and interactions without prior assumptions about the data distribution. Furthermore, RF is robust to multicollinearity among predictors and effectively reduces model variance through ensemble averaging [31]. For the Random Forest (RF) model, we performed its implementation using the RF package available in the R statistical programming environment. A constant random seed value of 42 was specified throughout the modeling process to ensure complete reproducibility of all analytical results. Critical hyperparameters, specifically the total number of decision trees in the ensemble and the number of input variables randomly selected for each split node, were systematically fine-tuned via a rigorous cross-validation framework with the primary goal of maximizing the coefficient of determination (R²) achieved on the held-out test set.

2.4.2. Extreme Gradient Boosting (XGB)

Extreme Gradient Boosting, commonly abbreviated as XGBoost, represents a state-of-the-art and computationally efficient implementation of the gradient boosting decision tree (GBDT) ensemble learning framework, which was initially developed by Friedman [32] and later significantly enhanced by Chen and Guestrin [33]. It constructs a forward additive ensemble model by iteratively training new regression trees to correct the residual errors remaining after the previous round of predictions. For the t-th iterative step, the cumulative prediction of the additive model is updated as:

$$\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i), \quad f_t \in F \tag{2}$$

where $f_t(x_i)$ is the newly added regression tree and $F$ represents the space of all possible trees. The objective function combines a convex loss function with a regularization term:

$$Obj^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) \tag{3}$$

where $l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right)$ measures the loss and $\Omega(f_t)$ penalizes model complexity. Using a second-order Taylor expansion, the objective can be approximated as:

$$Obj^{(t)} \approx \sum_{i=1}^{n} \left[g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)\right] + \Omega(f_t) \tag{4}$$

where $g_i$ and $h_i$ denote the first- and second-order derivatives of the loss function. This formulation allows efficient split selection and tree construction. Compared with traditional GBDT, XGBoost introduces explicit $L_1$ and $L_2$ regularization on leaf weights, column subsampling and shrinkage to mitigate overfitting, and parallelized split finding to enhance scalability.
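To make Equation (4) concrete, the sketch below evaluates the closed-form leaf weight and split gain that follow from it. It assumes the regularizer from the original XGBoost paper, Ω(f) = γT + ½λ Σ w² (L2 only, with no L1 term), and a squared-error loss ½(y − ŷ)², for which g = ŷ − y and h = 1. The data and parameter values are hypothetical.

```python
def leaf_weight(grads, hess, lam=1.0):
    """Optimal leaf weight w* = -G / (H + lambda) for the samples in one leaf."""
    return -sum(grads) / (sum(hess) + lam)

def split_gain(g_left, h_left, g_right, h_right, lam=1.0, gamma=0.0):
    """Reduction in the approximated objective obtained by splitting one leaf into two."""
    def score(g, h):
        return sum(g) ** 2 / (sum(h) + lam)
    parent = score(g_left + g_right, h_left + h_right)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right) - parent) - gamma

# Hypothetical NO3- observations (mg/L) and a constant previous prediction.
y = [2.0, 3.0, 10.0, 12.0]
y_hat = [5.0, 5.0, 5.0, 5.0]
g = [p - t for p, t in zip(y_hat, y)]  # first-order gradients of 0.5 * (y - y_hat)^2
h = [1.0] * len(y)                     # second-order gradients are constant for this loss

w_left = leaf_weight(g[:2], h[:2])     # correction for the two low-concentration samples
w_right = leaf_weight(g[2:], h[2:])    # correction for the two high-concentration samples
gain = split_gain(g[:2], h[:2], g[2:], h[2:])
```

A candidate split is kept only when its gain is positive, so larger values of λ and γ make the tree more conservative.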

In this study, the XGBoost model was implemented using the xgboost package (version 1.7.5.1) in R (version 4.3.1), with a fixed random seed (57). Hyperparameters including learning_rate, max_depth, n_estimators, subsample, colsample_bytree, min_child_weight, and gamma were optimized through grid search combined with 10-fold cross-validation, with the test set R² as the primary optimization objective and RMSE as the secondary reference. This approach ensures that the selected hyperparameters not only maximize the model’s explanatory power but also minimize prediction errors.

2.4.3. Light Gradient Boosting Machine (LightGBM)

Light Gradient Boosting Machine (LightGBM) is an efficient gradient boosting framework developed by Microsoft [34]. It is built on the same gradient boosting decision tree (GBDT) foundation originally introduced by Friedman [32], and therefore shares the same mathematical formulation as XGBoost described earlier. In this section, only the implementation differences are outlined.

LightGBM uses a histogram-based algorithm that discretizes continuous features into bins to accelerate split finding. It applies a leaf-wise tree growth strategy rather than the level-wise strategy used in XGBoost, with depth constraints included to mitigate overfitting. In addition, it incorporates two key optimization techniques: Gradient-based One-Side Sampling (GOSS), which retains samples with large gradients and randomly subsamples those with small gradients to reduce computational cost with little loss of accuracy; and Exclusive Feature Bundling (EFB), which bundles mutually exclusive features to reduce the number of features, further accelerating model training [35].
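The GOSS idea can be illustrated with a short sketch. It follows the sampling scheme of the LightGBM paper (keep the top fraction a by absolute gradient, sample a fraction b of the rest, and reweight the sampled part by (1 − a)/b). The function name, rates, and gradients are hypothetical, and nothing here implies that GOSS was enabled in this study, which reports the "gbdt" booster.

```python
import random

def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return {sample index: weight} for the instances kept by one-side sampling."""
    rng = random.Random(seed)
    n = len(gradients)
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    n_top = int(n * top_rate)
    n_other = int(n * other_rate)
    top = order[:n_top]                           # large-gradient samples are always kept
    rest = rng.sample(order[n_top:], n_other)     # small-gradient samples are subsampled
    amplify = (1.0 - top_rate) / other_rate       # compensates for the discarded samples
    weights = {i: 1.0 for i in top}
    weights.update({i: amplify for i in rest})
    return weights

# Hypothetical gradients for 20 samples with distinct magnitudes 0.1 .. 2.0.
gradients = [(-1) ** i * (i + 1) * 0.1 for i in range(20)]
weights = goss_sample(gradients)
```

With only 157 samples the computational saving would be negligible; the technique matters mainly for large datasets.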

In this study, the LightGBM model was implemented using the lightgbm package in R, with hyperparameters including max_depth. Hyperparameter optimization was performed through grid search combined with 10-fold cross-validation to achieve stable and reliable results.

2.5. Shapley Additive Explanation (SHAP) Method

To overcome the inherent interpretability deficiency of the machine learning models developed in this study, we employed the Shapley Additive exPlanations (SHAP) analytical framework [36,37]. As a universally applicable model-agnostic interpretability approach founded on the fundamental principles of cooperative game theory, SHAP provides a consistent and theoretically sound way to measure the marginal contribution of each individual feature to the model’s final prediction by considering all possible feature coalitions. For any single prediction instance x generated by an arbitrary black-box model f, the corresponding SHAP explanation can be formally formulated as an additive feature attribution model:

$$f(x) = \phi_0 + \sum_{i=1}^{M} \phi_i \tag{5}$$

$$\phi_i = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(M - |S| - 1)!}{M!} \left[f(S \cup \{i\}) - f(S)\right] \tag{6}$$

where $f(x)$ is the model output for instance $x$; $\phi_0$ is the baseline value (expected prediction across all samples); $\phi_i$ is the Shapley value of feature $i$; $M$ is the number of features; $N$ is the feature set; $S$ is any subset excluding $i$; and $f(S)$ is the expected model output when only the features in $S$ are known. This formulation ensures consistent and theoretically grounded feature attribution. SHAP was then applied to the XGBoost, Random Forest, and LightGBM models to identify key drivers, quantify feature contributions, and visualize prediction mechanisms under different pollution scenarios. To verify reliability, feature importance rankings were compared across cross-validation folds: the top 4 features remained consistent, confirming high stability.
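Equations (5) and (6) can be checked by brute force on a small model. The sketch below enumerates every coalition, which is feasible only for a handful of features; tree models are normally explained with the much faster TreeSHAP algorithm. The toy model, the single-reference baseline used to represent "missing" features, and all numbers are assumptions for illustration.

```python
from itertools import combinations
from math import factorial

def shapley_values(model, x, baseline):
    """Exact Shapley values; features outside a coalition take their baseline value."""
    m = len(x)

    def value(subset):
        z = [x[j] if j in subset else baseline[j] for j in range(m)]
        return model(z)

    phi = []
    for i in range(m):
        others = [j for j in range(m) if j != i]
        total = 0.0
        for size in range(m):
            weight = factorial(size) * factorial(m - size - 1) / factorial(m)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))  # marginal contribution of i
        phi.append(total)
    return phi

def toy_model(z):
    """Hypothetical NO3- predictor using F-, Ca2+ and Cl- (not the fitted LightGBM model)."""
    f, ca, cl = z
    return 2.0 + 0.05 * ca - 1.5 * f + 0.0002 * ca * cl

x = [0.4, 120.0, 150.0]        # instance to explain (mg/L)
baseline = [0.8, 80.0, 60.0]   # reference sample standing in for the expected input
phi = shapley_values(toy_model, x, baseline)
phi0 = toy_model(baseline)     # baseline value in Equation (5)
```

The attributions sum to the difference between the prediction and the baseline, which is the additivity property stated in Equation (5).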

2.6. Human Health Risk Assessment

The health risk assessment (HRA) model recommended by the United States Environmental Protection Agency (USEPA) is widely applied in environmental health studies due to its robust reliability and well-established toxicological basis [38]. It provides a standardized framework to estimate human exposure to pollutants and evaluate potential non-carcinogenic risks by comparing calculated exposure doses with safety thresholds. The average daily dose (ADD, mg·kg⁻¹·day⁻¹) of nitrate intake through drinking water was estimated as:

$$ADD = \frac{C \times IR \times EF \times ED}{BW \times AT} \tag{7}$$

where $C$ is the nitrate concentration in drinking water (mg/L), $IR$ is the ingestion rate of water (L·day⁻¹), $EF$ is the exposure frequency (days·year⁻¹), $ED$ is the exposure duration (years), $BW$ is the average body weight (kg), and $AT$ is the averaging time (days, equal to $ED$ × 365 for non-carcinogenic risks).

The non-carcinogenic health risk was then characterized using the hazard quotient (HQ):

$$HQ = \frac{ADD}{RfD} \tag{8}$$

where $RfD$ is the reference dose for nitrate (mg·kg⁻¹·day⁻¹). Following USEPA guidelines, RfD was set to 1.6 mg·kg⁻¹·day⁻¹. An $HQ < 1$ indicates negligible risk, while $HQ > 1$ suggests potential health concerns.

The parameter values used for different population groups are summarized in Table 1. Children were considered separately due to their lower body weight and higher water intake per unit body weight, which makes them more vulnerable to nitrate exposure. The relevant parameters were derived from China’s Technical Guidelines for Health Risk Assessment of Groundwater Pollution.
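As a worked example of Equations (7) and (8), the sketch below computes ADD and HQ for two population groups. The exposure parameters and the concentration are placeholders chosen for illustration; they are not the Table 1 values, which are not reproduced in this section. Only the RfD of 1.6 mg·kg⁻¹·day⁻¹ comes from the text.

```python
def average_daily_dose(c, ir, ef, ed, bw):
    """ADD in mg/kg/day (Equation 7), with AT = ED * 365 days for non-carcinogenic risk."""
    at = ed * 365.0
    return (c * ir * ef * ed) / (bw * at)

def hazard_quotient(add, rfd=1.6):
    """HQ (Equation 8); the default RfD is the value stated in the text."""
    return add / rfd

# Placeholder exposure parameters (assumptions, not the study's Table 1).
groups = {
    "adult": dict(ir=2.0, ef=365.0, ed=30.0, bw=60.0),
    "child": dict(ir=1.0, ef=365.0, ed=6.0, bw=15.0),
}

c = 20.0  # hypothetical nitrate concentration in mg/L
hq = {name: hazard_quotient(average_daily_dose(c, **p)) for name, p in groups.items()}
```

With these placeholder values the child HQ is twice the adult HQ at the same concentration, because the intake per unit body weight is twice as large.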

2.7. Model Performance Evaluation

To comprehensively evaluate the predictive performance of the models (RF, XGBoost, and LightGBM), three widely adopted statistical indicators were employed: the root mean square error (RMSE), the coefficient of determination (R²), and the mean absolute error (MAE) [39,40]. These metrics are extensively applied in hydrology and environmental modeling to assess predictive accuracy and explanatory capacity [41]. RMSE is the square root of the mean squared deviation, R² measures the proportion of variance explained by the model, and MAE captures the mean absolute deviation:

$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \tag{9}$$

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \tag{10}$$

$$MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i| \tag{11}$$

where yᵢ is the observed value of the target variable, ŷᵢ is the value predicted by the machine learning model, ȳ is the mean of the observed values, and n is the number of samples used for evaluation. RMSE penalizes large errors more heavily than small ones because deviations are squared before averaging; lower RMSE values correspond to higher predictive precision. MAE is the average of the absolute differences between predictions and observations and reflects the typical magnitude of the prediction error. Finally, R² measures the proportion of variance in the observed values that is explained by the model; values closer to 1 indicate a better fit.
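Equations (9) to (11) translate directly into code. The following sketch uses only the standard library; the observed and predicted values are hypothetical and serve only to exercise the functions.

```python
from math import sqrt

def rmse(y_true, y_pred):
    """Root mean square error (Equation 9)."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r2(y_true, y_pred):
    """Coefficient of determination (Equation 10)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def mae(y_true, y_pred):
    """Mean absolute error (Equation 11)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical observed and predicted NO3- concentrations (mg/L).
observed = [2.0, 4.0, 6.0, 8.0]
predicted = [2.5, 3.5, 6.5, 7.0]
scores = (rmse(observed, predicted), r2(observed, predicted), mae(observed, predicted))
```

RMSE is never smaller than MAE on the same data, so a wide gap between the two signals a few large errors rather than uniformly moderate ones.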

3. Results and Discussion

3.1. Model Construction

In this study, 17 critical variables were selected as the input features for the NO₃⁻ concentration prediction model, including pH, TH, TDS, SO₄²⁻, Cl⁻, CODMn, NH₄⁺, S²⁻, Na⁺, NO₂⁻, F⁻, CO₃²⁻, CO₂, K⁺, Ca²⁺, Mg²⁺, and HCO₃⁻. The dataset was split into a training set (80%) and a test set (20%). The training set was used to train machine learning algorithms and tune parameters, while the test set was used to evaluate model accuracy and generalization performance. For the RF model, the number of decision trees was set to 100, using “squared_error” as the splitting criterion, with a minimum of 2 samples to split an internal node and at least 1 sample per leaf node. The XGBoost and LightGBM models used “gbtree” and “gbdt” as booster types, respectively, both with 100 decision trees and a learning rate of 0.05. XGBoost parameters included a maximum tree depth of 5, subsample rate of 0.7, column sampling rate of 0.6, and a minimum child weight of 5. LightGBM was configured with a maximum tree depth of 5, a maximum of 20 leaves, a minimum of 15 samples per leaf node, a minimum child weight of 0.01, and a subsample rate of 0.8. L1 (λ = 1) and L2 (λ = 5) regularization were applied to prevent overfitting.

3.2. Comparison of Machine Learning Models’ Performance

The predictive performance of RF, XGBoost, and LightGBM was evaluated using R², RMSE, and MAE (Figure 3 and Figure 4). RF and XGBoost achieved high training accuracy (R² = 0.952 and 0.862) but showed clear overfitting on the test set (R² = 0.651 and 0.692). In contrast, LightGBM yielded the best generalization (R² = 0.753, RMSE = 3.673, MAE = 2.515). This advantage reflects both the data characteristics—moderate sample size, correlated and imbalanced predictors—and algorithmic innovations of LightGBM, including histogram-based feature binning, leaf-wise growth, and regularization. Together, these features enable LightGBM to capture complex nonlinear interactions while suppressing overfitting, making it more robust than RF and XGBoost under heterogeneous hydrogeochemical conditions.

The spatial distribution of observed and predicted NO₃⁻ concentrations is shown in Figure 5. Both maps exhibited elevated concentrations in the north-central part of the study area, confirming that the models reproduced the main spatial patterns. Among them, LightGBM produced an average prediction of 5.49 mg/L, closer to the observed mean of 5.84 mg/L than RF (5.45 mg/L) or XGBoost (4.52 mg/L), indicating its higher overall accuracy. Nevertheless, all three models underestimated high-concentration zones (>20 mg/L) in the north-central region. The maximum predicted concentrations were consistently lower than observed values, reflecting the inherent limitation of tree-based ensemble models in extrapolating beyond the training data range. Each leaf of a regression tree returns a constant value learned from the training samples that fall into it, so predictions are effectively confined to the range of target values seen during training; samples with feature values outside the sampled range are routed to the outermost leaves and receive the same value as the nearest training samples, leading to conservative predictions. This limitation is particularly pronounced in groundwater pollution studies, where high-concentration pollution events are often rare and underrepresented in sampling datasets [42].
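The extrapolation limit discussed above can be reproduced with a minimal one-feature regression tree. This is an illustrative sketch, not one of the fitted models: the tree depth and the synthetic linear data are assumptions.

```python
import statistics

def fit_tree(xs, ys, depth=3):
    """Fit a small regression tree on one feature by minimizing squared error at each split."""
    if depth == 0 or len(set(xs)) < 2:
        return ("leaf", statistics.fmean(ys))
    best_sse, best_t = None, None
    for t in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        lm, rm = statistics.fmean(left), statistics.fmean(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best_sse is None or sse < best_sse:
            best_sse, best_t = sse, t
    lo = [(x, y) for x, y in zip(xs, ys) if x < best_t]
    hi = [(x, y) for x, y in zip(xs, ys) if x >= best_t]
    return ("node", best_t,
            fit_tree([x for x, _ in lo], [y for _, y in lo], depth - 1),
            fit_tree([x for x, _ in hi], [y for _, y in hi], depth - 1))

def predict(tree, x):
    """Walk from the root to a leaf and return that leaf's constant value."""
    while tree[0] == "node":
        _, t, left, right = tree
        tree = left if x < t else right
    return tree[1]

# Synthetic data: the target grows linearly with the feature, up to 20 at x = 10.
xs = [float(i) for i in range(1, 11)]
ys = [2.0 * x for x in xs]
tree = fit_tree(xs, ys)

pred_inside = predict(tree, 10.0)   # largest feature value seen in training
pred_outside = predict(tree, 30.0)  # far outside the training range
```

Both queries land in the same outermost leaf, so the prediction at x = 30 equals the one at x = 10 and cannot exceed the largest training target, even though the underlying relation keeps increasing.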

3.3. Influencing Factors of Groundwater NO₃⁻ Concentrations

Numerous studies have shown that natural NO₃⁻ concentrations in groundwater are typically below 3 mg/L. Elevated NO₃⁻ levels above 10 mg/L are primarily attributed to anthropogenic inputs, such as extensive fertilizer application, animal waste, soil nitrogen, manure, and sewage [7]. In this study, NO₃⁻ concentrations ranged from 0.002 mg/L to 61.1 mg/L, with an average of 5.51 mg/L. Moreover, the study area is a typical agricultural region with numerous factories and livestock farms located within it (Figure 2c). Consequently, nitrogen fertilizers and wastewater discharge were the principal contributors to NO₃⁻ pollution in this study area.

Identifying the factors influencing NO₃⁻ concentration in groundwater is essential for implementing effective control and management strategies. Among various machine learning models, LightGBM demonstrated superior predictive performance. To enhance model interpretability and assess the contribution of different hydrochemical parameters to predicted NO₃⁻ concentrations, this study employed the SHAP method. The global SHAP summary plot organized features by their mean absolute SHAP values [43] and identified hydrochemical parameters such as F⁻, Ca²⁺, Cl⁻, and K⁺ as the top four predictors in the LightGBM model. Moreover, this study provided SHAP dependency plots for these leading predictors to offer valuable insights into feature interactions and underlying patterns (Figure 6b).

3.3.1. Effect of F⁻ and Ca²⁺ on NO₃⁻ Concentration

According to the global SHAP summary plot, among all hydrochemical parameters in groundwater, F⁻ and Ca²⁺ emerged as the most influential predictors (Figure 6a). Previous research has indicated that F⁻ in groundwater primarily originated from fluorine-containing minerals [44,45]. According to the findings of Zhang [46], both F⁻ and excessive Ca²⁺ in this study were attributed to the dissolution of fluorite (Equation (12)). The negative correlation between F⁻ and Ca²⁺ in the dependency plot was affected by the precipitation of calcite (CaCO₃) and dolomite (CaMg(CO₃)₂) as described in Equations (13) and (14) (Figure 6b) [45]. As a result, high Ca²⁺ concentrations inhibited the further dissolution of fluorite, whereas low Ca²⁺ concentrations facilitated the enrichment of F⁻ [47].

Ca²⁺ was identified as one of the dominant factors influencing groundwater NO₃⁻ concentrations. The global SHAP summary plot and dependency plot revealed a positive correlation between Ca²⁺ and NO₃⁻ (Figure 6). This correlation arises from a coupled geochemical process: nitrification of anthropogenic NH₄⁺ releases H⁺, which accelerates the dissolution of carbonate minerals (limestone/calcite) abundant in the study area, leading to synchronous increases in both Ca²⁺ and NO₃⁻ concentrations. In the study area, the lithology was characterized by limestone, calcite, and sandstone, indicating that Ca²⁺ in groundwater primarily resulted from the dissolution of carbonates. NO₃⁻ in groundwater originated from nitrification, where nitrobacteria convert NH₄⁺ in the soil to NO₃⁻ (Equations (15) and (16)). Therefore, the H⁺ produced during the transformation of NH₄⁺ to NO₃⁻ enhanced the leaching of carbonates, leading to increased Ca²⁺ concentrations in groundwater [48]. In summary, the acidic environment generated by nitrification is a key geochemical mechanism that drives the dissolution of carbonates, resulting in a synergistic increase in both Ca²⁺ and NO₃⁻ concentrations. This phenomenon is consistent with previous studies in northern China, such as research in the Yinchuan Region [17] and the North China Plain [24].

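$$\mathrm{CaF_{2}} \to \mathrm{Ca^{2+}} + 2\,\mathrm{F^{-}} \tag{12}$$

$$\mathrm{CaF_{2}} + 2\,\mathrm{HCO_{3}^{-}} \to \mathrm{CaCO_{3}}\downarrow + 2\,\mathrm{F^{-}} + \mathrm{H_{2}O} + \mathrm{CO_{2}}\uparrow \tag{13}$$

$$\mathrm{Ca^{2+}} + \mathrm{Mg^{2+}} + 2\,\mathrm{OH^{-}} + 2\,\mathrm{HCO_{3}^{-}} \to \mathrm{CaMg(CO_{3})_{2}}\downarrow + 2\,\mathrm{H_{2}O} \tag{14}$$

$$2\,\mathrm{NH_{4}^{+}} + 3\,\mathrm{O_{2}} \to 2\,\mathrm{NO_{2}^{-}} + 4\,\mathrm{H^{+}} + 2\,\mathrm{H_{2}O} \tag{15}$$

$$2\,\mathrm{NO_{2}^{-}} + \mathrm{O_{2}} \to 2\,\mathrm{NO_{3}^{-}} \tag{16}$$

$$5\,\mathrm{CH_{2}O} + 4\,\mathrm{NO_{3}^{-}} \to 2\,\mathrm{N_{2}}\uparrow + 5\,\mathrm{HCO_{3}^{-}} + \mathrm{H^{+}} + 2\,\mathrm{H_{2}O} \tag{17}$$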
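The coupling between nitrification and carbonate dissolution can be made explicit with a short worked derivation. Adding Equations (15) and (16) and dividing by two gives the overall nitrification reaction. Combining it with proton-driven calcite dissolution, a standard reaction that is assumed here and is not among the numbered equations, yields an upper-bound stoichiometry in which every released proton is consumed by calcite:

$$
\begin{aligned}
\mathrm{NH_{4}^{+}} + 2\,\mathrm{O_{2}} &\to \mathrm{NO_{3}^{-}} + 2\,\mathrm{H^{+}} + \mathrm{H_{2}O} \\
\mathrm{CaCO_{3}} + \mathrm{H^{+}} &\to \mathrm{Ca^{2+}} + \mathrm{HCO_{3}^{-}} \\
\mathrm{NH_{4}^{+}} + 2\,\mathrm{O_{2}} + 2\,\mathrm{CaCO_{3}} &\to \mathrm{NO_{3}^{-}} + 2\,\mathrm{Ca^{2+}} + 2\,\mathrm{HCO_{3}^{-}} + \mathrm{H_{2}O}
\end{aligned}
$$

Under this idealized assumption, each mole of NO₃⁻ formed could mobilize up to two moles of Ca²⁺, which agrees in direction with the positive relationship between Ca²⁺ and NO₃⁻ in Figure 6. The actual ratio in the aquifer will be lower wherever protons are buffered by other reactions.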

3.3.2. Effect of Cl⁻ and K⁺ on NO₃⁻ Concentration

In the SHAP analysis, Cl⁻ and K⁺ were identified as the third and fourth most significant predictors of groundwater NO₃⁻ concentrations. The dependence plot indicated that the SHAP value peaked when the Cl⁻ concentration reached approximately 150.0 mg/L, suggesting heightened sensitivity of NO₃⁻ concentration to changes in Cl⁻ near this threshold. Beyond this point, NO₃⁻ concentrations entered a stable plateau, indicating a complex relationship between Cl⁻ and NO₃⁻. Cl⁻ is a highly conservative ion in groundwater, meaning that it typically does not participate in chemical reactions or ion exchange processes in freshwater environments. Sources of Cl⁻ in groundwater include natural halite dissolution as well as anthropogenic activities such as urban sewage, industrial wastewater, agricultural drainage, chemical fertilizers, and landfill leachate. These anthropogenic sources overlap with the sources of NO₃⁻ [27,49,50]. At lower concentrations, Cl⁻ and NO₃⁻ may therefore vary together because of their common anthropogenic origin. Under anoxic conditions, Cl⁻ remains stable owing to its conservative nature, whereas NO₃⁻ concentrations decrease over time through denitrification [51]. At higher concentrations, natural sources of Cl⁻ and the effect of denitrification consequently decouple the two ions in both their sources and their concentration changes.

The dependence plot revealed that groundwater samples with high SHAP values were frequently associated with elevated K⁺ concentrations. This observation suggests a strong link between K⁺ and human and agricultural activities, specifically through manure, nitrogen- and potassium-containing fertilizers, pesticides, and other anthropogenic inputs [47,52,53]. Findings from various global studies similarly suggest that the infiltration of anthropogenic pollutants into groundwater is often accompanied by the enrichment of major ions, including K⁺, in the groundwater system [54]. Anthropogenic activities tend to introduce K⁺ and NO₃⁻ simultaneously, leading to a coordinated increase in their concentrations. However, although anthropogenic sources play the predominant role in contributing K⁺, natural sources such as evaporite rocks also contribute to groundwater potassium levels. This dual origin complicates the relationship between K⁺ and NO₃⁻ concentrations.

3.3.3. Effect of TH and Mg²⁺ on NO₃⁻ Concentration

TH was identified as the fifth most significant predictor of groundwater NO₃⁻ concentrations. The dependence plot indicated that the trend for TH closely resembled that for Cl⁻. When TH was below 500 mg/L, an increase in TH led to a decrease in NO₃⁻ concentrations. However, when TH ranged between 500 mg/L and 1000 mg/L, an increase in TH caused a rise in NO₃⁻ concentrations. Beyond 1000 mg/L, the effect of TH on NO₃⁻ became relatively weak. According to the global SHAP summary plot, Ca²⁺ was positively correlated with NO₃⁻ concentrations, whereas Mg²⁺ showed a negative correlation. This indicates that the influence of TH on NO₃⁻ is jointly determined by the opposing effects of Ca²⁺ and Mg²⁺. The negative correlation between Mg²⁺ and NO₃⁻ may be attributed to carbonic acid generated during denitrification [55].

3.4. Analysis of Human Health Risk Assessment Results

Prolonged ingestion of nitrate-contaminated groundwater can cause methemoglobinemia and other non-carcinogenic health problems, especially among infants and pregnant women. Based on LightGBM-predicted concentrations and the USEPA health risk assessment framework, hazard quotients (HQ) were calculated for children, men, and women (Figure 7). The measured concentrations were used to validate the model predictions, while the predicted concentrations were used for the spatial health risk assessment so that the entire study area was covered. The predicted HQ values ranged from 0 to 1.07 for children, 0 to 0.92 for men, and 0 to 0.77 for women (Figure 7a). These results indicate that only children exceeded the safety threshold of HQ = 1, while adults remained below it.
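To make the HQ calculation concrete, the sketch below applies the standard USEPA ingestion formulation, ADD = C × IR × EF × ED / (BW × AT) and HQ = ADD / RfD, to the exposure parameters in Table 1. The reference dose of 1.6 mg·kg⁻¹·day⁻¹ is an assumed value commonly used for nitrate and is not restated in this section; the class, function, and variable names are likewise illustrative. Because RfD and concentration cancel when two groups are compared at the same location, the group-to-group scaling can be checked without relying on that assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Exposure:
    """Exposure parameters for one population group (values from Table 1)."""
    ir: float  # ingestion rate, L/day
    bw: float  # body weight, kg
    ed: float  # exposure duration, years
    ef: float  # exposure frequency, days/year
    at: float  # averaging time, days


GROUPS = {
    "children": Exposure(ir=1.5, bw=25.9, ed=6, ef=365, at=2190),
    "men": Exposure(ir=3.62, bw=73, ed=30, ef=365, at=10950),
    "women": Exposure(ir=2.66, bw=64, ed=30, ef=365, at=10950),
}

# ASSUMPTION: oral reference dose for nitrate in mg/(kg*day); not given in this section.
RFD_ASSUMED = 1.6


def average_daily_dose(conc_mg_l, e):
    """ADD in mg/(kg*day) for a nitrate concentration in mg/L."""
    return conc_mg_l * e.ir * e.ef * e.ed / (e.bw * e.at)


def hazard_quotient(conc_mg_l, e, rfd=RFD_ASSUMED):
    """HQ = ADD / RfD; values above 1 indicate potential non-carcinogenic risk."""
    return average_daily_dose(conc_mg_l, e) / rfd


def rescale_hq(hq_source, source, target):
    """Convert an HQ for one group into the HQ of another group at the same
    concentration; both the concentration and the RfD cancel in the ratio."""
    return hq_source * average_daily_dose(1.0, target) / average_daily_dose(1.0, source)


if __name__ == "__main__":
    for name in ("men", "women"):
        hq = rescale_hq(1.07, GROUPS["children"], GROUPS[name])
        print(f"{name}: HQ = {hq:.2f}")
```

Rescaling the children's maximum of 1.07 by the intake-to-body-weight ratios gives approximately 0.92 for men and 0.77 for women, which matches the upper ends of the reported ranges.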

Spatially, the distribution of HQ values (Figure 7b–d) shows that risk hotspots were concentrated in the western mountainous part of the study area, broadly overlapping with the high nitrate concentration zones identified in Section 3.2 (Figure 5). This consistency confirms that elevated groundwater NO₃⁻ concentrations are the primary driver of health risks. However, the risk patterns are not perfectly identical to the concentration patterns. For example, child-specific exceedances (HQ > 1) also occurred in the southern sub-region, even where nitrate levels were moderate. This discrepancy arises because HQ is jointly determined by both concentration and exposure parameters. Children have a substantially higher intake-to-body-weight ratio compared to adults, lowering the concentration threshold at which risks become unacceptable. Consequently, areas with nitrate concentrations around 24–60 mg/L were sufficient to trigger HQ > 1 for children but remained below the risk threshold for men and women.
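The lower concentration threshold for children follows directly from the exposure parameters in Table 1. As a worked sketch, assume the standard USEPA ingestion formulation, HQ = ADD/RfD with ADD = C·IR·EF·ED/(BW·AT), and note that EF·ED = AT for all three groups. The concentration C* at which HQ reaches 1 is then proportional to BW/IR, so the ratios between groups do not depend on the RfD value:

$$
\begin{aligned}
C^{*} &= \frac{\mathrm{RfD}\cdot\mathrm{BW}}{\mathrm{IR}}, \\
\frac{C^{*}_{\text{men}}}{C^{*}_{\text{children}}} &= \frac{73/3.62}{25.9/1.5} \approx 1.17, \qquad
\frac{C^{*}_{\text{women}}}{C^{*}_{\text{children}}} = \frac{64/2.66}{25.9/1.5} \approx 1.39 .
\end{aligned}
$$

In other words, under these parameters children reach HQ = 1 at a concentration roughly 14% lower than men and 28% lower than women.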

Overall, the results highlight that risk maps largely follow concentration patterns, but exposure scaling amplifies the vulnerability of children, resulting in risk exceedances beyond the absolute concentration peaks. These findings underscore the urgent need to prioritize mitigation efforts in the western mountainous hotspot while also paying attention to localized child-specific risk areas in the south.

3.5. Recommendations for NO₃⁻ Management Plan Considering Local Heterogeneity

The findings indicated that anthropogenic activities were the primary contributors to NO₃⁻ pollution, including the excessive application of agricultural nitrogen fertilizers, the unregulated discharge of urban and industrial wastewater, and improper handling of agricultural manure. Additionally, the model revealed that groundwater systems in the mountainous western region of the study area were significantly affected by NO₃⁻ contamination. These findings highlight the urgent need for tailored NO₃⁻ control and groundwater remediation strategies in the western mountainous region. In the short term, the following targeted management measures should be prioritized: (1) Source control: promote scientific fertilization, reduce the application of fertilizers containing nitrate, urea, and ammonium bicarbonate, and replace traditional nitrogen fertilizers with slow-release and organic fertilizers; standardize livestock manure treatment and resource utilization. (2) Pollution interception: accelerate the construction of rural sewage collection and treatment systems, and strengthen the supervision of high-nitrogen industrial wastewater discharge. (3) Monitoring and early warning: expand the groundwater quality monitoring network in the western mountainous region, focusing on nitrate concentration changes in drinking-water wells that serve children. In parallel, the monitoring network should be enhanced so that pollution hotspots and migration pathways can be identified accurately.

4. Conclusions

Focusing on Handan City in Hebei Province, this study proposed and applied a comprehensive analytical framework that combines machine learning prediction, interpretability analysis, and health risk assessment, providing new insights into the spatial patterns and risks of groundwater nitrate (NO₃⁻) pollution. Model comparison results indicated that, relative to RF and XGBoost, LightGBM exhibited superior generalization performance, with predictions closely aligned with observed concentrations. SHAP analysis further identified F⁻, Ca²⁺, Cl⁻, K⁺, total hardness, and Mg²⁺ as key factors, highlighting the coupled effects of carbonate dissolution, nitrification, and anthropogenic inputs such as fertilizer application and wastewater discharge. Spatial analysis revealed that the north-central and western mountainous regions are the principal hotspots of nitrate pollution. Health risk assessment showed that, although the overall risk level was generally low, the hazard quotient (HQ) for children exceeded the safety threshold in certain parts of the western mountainous region, suggesting that this group faces elevated exposure risks. Collectively, these results not only deepen the understanding of the geochemical mechanisms and spatial heterogeneity of groundwater nitrate pollution but also provide a scientific basis for identifying priority risk areas and sensitive populations.

Methodologically, the study contributes by integrating predictive modeling, interpretability, and health risk assessment into a unified framework, thereby enhancing predictive robustness, clarifying key geochemical and anthropogenic drivers, and linking concentration prediction with health risk assessment to support more practical groundwater risk management. Nevertheless, several limitations should be acknowledged: the analysis relied on a single sampling campaign, which constrains the ability to capture seasonal and interannual variability, and the health risk assessment focused exclusively on drinking-water ingestion, without accounting for alternative exposure pathways such as dietary intake and soil contact. Future research should integrate multi-temporal sampling data (including dry and wet seasons) and long-term monitoring data to construct spatiotemporal prediction models, and combine multi-source data such as remote sensing land use data and climate data to further improve the accuracy and comprehensiveness of risk assessment.

Looking forward, the framework could be extended to integrate socio-economic drivers, land-use planning scenarios, and climate projections, thereby supporting more comprehensive risk governance strategies and informing region-specific policies for sustainable groundwater management in nitrate-impacted regions.

Author Contributions

Conceptualization, J.W. and Y.Z.; methodology, Q.L.; software, J.L.; validation, J.W. and Q.L.; formal analysis, X.Z.; investigation, X.Z.; resources, X.Z.; data curation, J.L.; writing—original draft preparation, J.W.; writing—review and editing, Y.Z.; visualization, J.L.; supervision, J.W. and Q.L.; project administration, J.W.; funding acquisition, J.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Fundamental Research Funds for the Central Universities (2243100021).

Data Availability Statement

Owing to privacy restrictions, the data are available from the authors upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

HQ	Hazard Quotient
RF	Random Forest
XGBoost	Extreme Gradient Boosting
DEM	Digital Elevation Model
TDS	Total Dissolved Solids
CODMn	Chemical Oxygen Demand (permanganate method)
ICP-OES	Inductively Coupled Plasma Optical Emission Spectrometry
ntree	Number of Trees
ADD	Average Daily Dose
RFD	Reference Dose for nitrate

References

1. Xiong, H.; Wang, J.; Yang, C.; Li, S.; Li, X.; Xiong, R.; Wang, Y.; Ma, C. Vegetation and human activity indicators for groundwater quality prediction in Jianghan Plain. Chemosphere 2025, 376, 144278.
2. Chaudhary, I.J.; Chauhan, R.; Kale, S.S.; Gosavi, S.; Rathore, D.; Dwivedi, V.; Singh, S.; Yadav, V.K. Groundwater nitrate contamination and its effect on human health: A review. Water Conserv. Sci. Eng. 2025, 10, 33.
3. Shukla, S.; Saxena, A. Global status of nitrate contamination in groundwater. In Handbook of Environmental Materials Management; Springer: Cham, Switzerland, 2019; pp. 869–888.
4. Zhang, D.; Wang, P.; Cui, R.; Yang, H.; Li, G.; Chen, A.; Wang, H. Electrical conductivity and dissolved oxygen as predictors of nitrate concentrations in shallow groundwater in Erhai Lake region. Sci. Total Environ. 2022, 802, 149879.
5. Deng, Y.; Ye, X.; Du, X. Predictive modeling and analysis of groundwater nitrate pollution based on machine learning. J. Hydrol. 2023, 624, 129934.
6. Rotiroti, M.; Sacchi, E.; Caschetto, M.; Zanotti, C.; Fumagalli, L.; Biasibetti, M.; Bonomi, T.; Leoni, B. Groundwater and surface water nitrate pollution in an irrigated system. J. Hydrol. 2023, 623, 129868.
7. Saka, D.; Adu-Gyamfi, J.; Skrzypek, G.; Antwi, E.O.; Heng, L.; Torres-Martínez, J.A. Disentangling nitrate pollution sources in tropical agriculture using multi-isotopes. Environ. Pollut. 2023, 328, 121589.
8. Wang, D.; Li, P.; Yang, N.; Yang, C.; Zhou, Y.; Li, J. Nitrate in intensive agricultural region, NW China. Environ. Res. 2023, 237, 116911.
9. Zhu, M.; Chen, J.; He, C.; Ren, S.; Liu, G. Groundwater nitrate and sulfate contamination by karst mines in Southwest China. Sci. Total Environ. 2024, 946, 174375.
10. Mao, H.; Wang, G.; Liao, F.; Shi, Z.; Rao, Z.; Zhang, H.; Qiao, Z.; Bai, Y.; Chen, X.; Yan, X.; et al. Spatiotemporal variation of groundwater nitrate controlled by groundwater flow. Water Resour. Res. 2024, 60, e2023WR035299.
11. Reddy, S.; Sunitha, V.; Kumar, P. Land use & land cover dynamics and its relationship with nitrate pollution in groundwater around inactive mine areas using geospatial techniques, SW part of Cuddapah basin, Southern India. Chemosphere 2024, 365, 143322.
12. Yang, Y.; Yuan, Y.; Xiong, G.; Yin, Z.; Guo, Y.; Song, J.; Zhu, X.; Wu, J.; Wang, J.; Wu, J. Patterns of nitrate load variability under surface water-groundwater interactions in intensive valleys. Water Res. 2024, 267, 122474.
13. Covatti, G.; Li, K.-Y.; Podgorski, J.; Winkel, L.H.E.; Berg, M. Nitrate contamination in groundwater across Switzerland: Spatial prediction and drivers. Sci. Total Environ. 2025, 973, 179121.
14. Iqbal, J.; Su, C.; Abbas, H.; Jiang, J.; Han, Z.; Baloch, M.Y.J.; Xie, X. Prediction of nitrate concentration and land use impacts in Nansi Lake Basin. J. Hazard. Mater. 2025, 487, 137185.
15. Aju, C.D.; Achu, A.L.; Mohammed, M.P.; Raicy, M.C.; Gopinath, G.; Reghunath, R. Groundwater quality prediction and risk assessment in Kerala, India: A machine-learning approach. J. Environ. Manag. 2024, 370, 122616.
16. Usman, U.S.; Salh, Y.H.M.; Yan, B.; Namahoro, J.P.; Zeng, Q.; Sallah, I. Fluoride contamination in African groundwater: Predictive modeling using stacking ensemble. Sci. Total Environ. 2024, 957, 177693.
17. Alam, S.M.K.; Li, P.; Rahman, M.; Fida, M.; Elumalai, V. Key factors affecting groundwater nitrate levels in the Yinchuan Region, Northwest China: Research using the XGBoost model with SHAP method. Environ. Pollut. 2025, 364, 125336.
18. Zhang, T.; Wu, J.; Chu, H.; Liu, J.; Wang, G. Interpretable Machine Learning Based Quantification of the Impact of Water Quality Indicators on Groundwater Under Multiple Pollution Sources. Water 2025, 17, 905.
19. Bai, T.; Wang, X.-S.; Han, P.-F. Controls of groundwater-dependent vegetation coverage in the Yellow River Basin, China: Insights from interpretable machine learning. J. Hydrol. 2024, 631, 130747.
20. Fan, R.; Deng, Y.; Du, Y.; Xie, X. Predicting geogenic groundwater arsenic contamination risk using interpretable machine-learning model. Environ. Pollut. 2024, 340, 122787.
21. LaBianca, A.; Koch, J.; Jensen, K.H.; Sonnenborg, T.O.; Kidmose, J. Machine learning for predicting shallow groundwater levels in urban areas. J. Hydrol. 2024, 632, 130902.
22. Mo, Y.; Xu, J.; Zhu, S.; Xu, B.; Wu, J.; Jin, G.; Wang, Y.-G.; Li, L. Spatial heterogeneity of groundwater depths in coastal cities by interpretable ML. Geosci. Front. 2025, 16, 102033.
23. Li, Z.; Yang, Q.; Xie, C.; Lu, X. Source identification and health risks of nitrate contamination in shallow groundwater. Environ. Sci. Pollut. Res. 2022, 30, 13660–13670.
24. Wang, S.; Chen, J.; Liu, F.; Chen, D.; Zhang, S.; Bai, Y.; Zhang, X.; Kang, S. Groundwater nitrate sources and human health risks in North China. Environ. Geochem. Health 2024, 46, 495.
25. Unigwe, C.O.; Egbueri, J.C.; Omeka, M.E. Geospatial and statistical approaches to nitrate health risk in SE Nigeria. J. Indian Chem. Soc. 2022, 99, 100479.
26. Adimalla, N.; Qian, H. Evaluation of non-carcinogenic causing health risks (NCHR) associated with exposure to fluoride and nitrate contaminated groundwater from a semi-arid region of south India. Environ. Sci. Pollut. Res. 2022, 29, 81370–81385.
27. Kaur, L.; Rishi, M.S.; Siddiqui, A.U. Health risk assessment due to fluoride and nitrate in groundwater of Panipat, India. Environ. Pollut. 2020, 259, 113711.
28. Kuhn, M.; Johnson, K. Applied Predictive Modeling; Springer: New York, NY, USA, 2013.
29. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
30. Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction; Springer: New York, NY, USA, 2009.
31. Breiman, L. Statistical modeling: The two cultures. Stat. Sci. 2001, 16, 199–231.
32. Friedman, J.H. Greedy function approximation: A gradient boosting machine. Ann. Stat. 2001, 29, 1189–1232.
33. Chen, T.; Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; ACM: New York, NY, USA, 2016; pp. 785–794.
34. Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.-Y. LightGBM: A highly efficient gradient boosting decision tree. Adv. Neural Inf. Process. Syst. 2017, 30, 3146–3154.
35. Bentéjac, C.; Csörgő, A.; Martínez-Muñoz, G. A comparative analysis of gradient boosting algorithms. Artif. Intell. Rev. 2021, 54, 1937–1967.
36. Lundberg, S.M.; Erion, G.; Chen, H.; DeGrave, A.; Prutkin, J.M.; Nair, B.; Katz, R.; Himmelfarb, J.; Bansal, N. From local explanations to global understanding with explainable AI for trees. Nat. Mach. Intell. 2020, 2, 56–67.
37. Lundberg, S.M.; Lee, S.-I. A unified approach to interpreting model predictions. In Proceedings of the 31st International Conference on Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2017; pp. 4768–4777.
38. USEPA. Risk Assessment Guidance for Superfund (RAGS), Vol. I: Human Health Evaluation Manual (Part A); U.S. Environmental Protection Agency: Washington, DC, USA, 1989.
39. Hamal, K.; Sharma, S.; Khadka, N.; Baniya, B.; Ali, M.; Shrestha, M.S.; Xu, T.; Shrestha, D.; Dawadi, B. Evaluation of MERRA-2 precipitation products using gauge observation in Nepal. Hydrology 2020, 7, 40.
40. Pourghasemi, H.R.; Rossi, M. Natural Hazards GIS-Based Spatial Modeling Using Data Mining Techniques; Springer: Cham, Switzerland, 2019; Volume 48.
41. Chai, T.; Draxler, R.R. Root mean square error (RMSE) or mean absolute error (MAE)? Geosci. Model Dev. 2014, 7, 1247–1250.
42. He, S.; Wu, J.; Wang, D.; He, X. Predictive modeling of groundwater nitrate pollution using random forest. Chemosphere 2022, 290, 133388.
43. Appukuttan, A.; Aju, C.D.; Reghunath, R.; Srinivas, R.; Krishnan, K.A.; Arya, S. Exploring hydrochemical drivers of drinking water quality in a tropical river basin using self-organizing maps and explainable AI. Water Res. 2025, 284, 123884.
44. Dhakate, R.; More, S.; Duvva, L.K.; Enjamuri, S. Groundwater chemistry and health hazard risk valuation of fluoride and nitrate enhanced groundwater. Environ. Sci. Pollut. Res. 2023, 30, 43554–43572.
45. Lu, M.-Y.; Liu, Y.; Liu, G.-J.; Li, Y.-L.; Xu, J.-Z.; Wang, G.-Y. Prediction of fluorine concentration in groundwater. Sci. Total Environ. 2023, 857, 159415.
46. Zhang, M.; Wang, L.; Zhao, Q.; Wang, J.; Sun, Y. Hydrogeochemical and anthropogenic controls on groundwater quality and risks in a resource-based area. J. Clean. Prod. 2024, 440, 140911.
47. Mukherjee, I.; Singh, U.K. Environmental fate and health exposures of groundwater contaminants in Lower Ganga Basin. Geosci. Front. 2022, 13, 101365.
48. Yu, L.; Zheng, T.; Yuan, R.; Zheng, X. APCS-MLR model for nitrate pollution sources in groundwater. J. Environ. Manag. 2022, 314, 115101.
49. Batsaikhan, B.; Yun, S.-T.; Kim, K.-H.; Yu, S.; Lee, K.-J.; Lee, Y.-J.; Namjil, J. Groundwater contamination assessment in Ulaanbaatar City, Mongolia with hydrochemical, isotopic, and statistical approaches. Sci. Total Environ. 2020, 765, 142790.
50. Zhang, X.; Li, X.; Gao, X. Coal mining induced karst water quality degradation in the Niangziguan system, China. Environ. Sci. Pollut. Res. 2016, 23, 6286–6299.
51. Lee, C.-M.; Choi, H.; Kim, Y.; Kim, M.; Kim, H.; Hamm, S.-Y. Land use effect on shallow groundwater contamination. Sci. Total Environ. 2021, 800, 149632.
52. Jin, L.; Ye, H.; Shi, Y.; Li, L.; Liu, R.; Cai, Y.; Li, J.; Li, F.; Jin, Z. Quantifying nitrate sources in groundwater of Zhuji, East China. Appl. Geochem. 2022, 143, 105354.
53. Liu, R.; Xie, X.; Hou, Q.; Han, D.; Song, J.; Huang, G. Nitrate in groundwater of coastal Hainan, China. J. Hydrol. 2024, 634, 131088.
54. Mao, H.; Wang, G.; Rao, Z.; Liao, F.; Shi, Z.; Huang, X.; Chen, X.; Yang, Y. Spatial pattern of groundwater chemistry and nitrogen pollution in Poyang Lake Basin. J. Clean. Prod. 2021, 329, 129697.
55. Wang, Z.; Gao, Z.; Wang, S.; Liu, J.; Li, W.; Deng, Q.; Lv, L.; Liu, Y.; Su, Q. Hydrochemistry under anthropogenic impacts in Yiyuan, N. China. Environ. Earth Sci. 2021, 80, 60.

Figure 1. Technical roadmap of the study framework.

Figure 2. Overview of the study area. (a) Population distribution map. (b) Elevation map (DEM). (c) Land cover types and groundwater sample locations.

Figure 3. The performance of RF, XGBoost, and LightGBM models in the training step and testing step.

Figure 4. Comparison of the predicted concentrations and actual concentrations of NO₃⁻ by the RF, XGBoost, and LightGBM models.

Figure 5. The spatial distribution maps of actual and predicted NO₃⁻. (a) Actual NO₃⁻, (b) predicted NO₃⁻ by RF model, (c) predicted NO₃⁻ by XGBoost model and (d) predicted NO₃⁻ by LightGBM model.

Figure 6. (a) SHAP values, and (b) SHAP dependence plots of the top six essential variables for the LightGBM model. Bar length represents the mean absolute SHAP value (global importance); the beeswarm/dot plot shows the distribution of SHAP values, where positive values indicate a positive impact on the model output, negative values indicate a negative impact, and color shading represents the magnitude of the feature value.

Figure 7. (a) The predicted HQ of NO₃⁻ for children, men, and women, (b) the spatial distribution map of predicted HQ of NO₃⁻ for children, (c) the spatial distribution map of predicted HQ of NO₃⁻ for men, and (d) the spatial distribution map of predicted HQ of NO₃⁻ for women.

Table 1. Exposure Parameters for Health Risk Assessment (HRA).

Group	IR (L·Day⁻¹)	BW (kg)	ED (Years)	EF (Days·Year⁻¹)	AT (Days)
Children	1.5	25.9	6	365	2190
Men	3.62	73	30	365	10,950
Women	2.66	64	30	365	10,950

