1. Introduction
Water quality deterioration has emerged as a critical environmental challenge in rapidly developing regions, exemplified by China’s Pearl River Delta [1,2]. The Xijiang River Basin—contributing over three-quarters of the Pearl River system’s watershed area and nearly two-thirds of its runoff—faces intensifying pressure from industrial effluent, municipal sewage, and agricultural runoff [3,4]. Recent studies document an 18.7% increase in nutrient pollution since 2020 [5], aligning with national priorities under China’s Ecological Environment Monitoring Plan, which explicitly prioritizes predictive capabilities as the cornerstone of modern water management systems [6,7]. This imperative creates urgent demand for advanced forecasting tools capable of addressing complex riverine dynamics [8].
Deep learning methodologies have fundamentally transformed water quality prediction, yet significant limitations persist across contemporary approaches [9,10]. Transformer architectures effectively capture long-range dependencies but frequently disregard localized feature interactions [11], while graph neural networks require predefined adjacency matrices that constrain flexibility in dynamic watersheds [12]. Attention-enhanced recurrent models offer interpretability gains but suffer from computational inefficiency [13], and metaheuristic-optimized hybrids demonstrate promising accuracy improvements yet remain confined to small-scale basins with homogeneous monitoring data [14,15]. Critically, current models consistently overlook demonstrated feature correlations, such as the pH-DO relationship identified in our analysis, while hyperparameter optimization strategies scale poorly for multi-station networks [16].
This study bridges these gaps through a novel integration of convolutional feature extraction, gated recurrent temporal modeling, and Northern Goshawk Optimization within a unified framework. Our approach establishes the first basin-scale benchmark validated across eleven heterogeneous monitoring stations spanning the Xijiang River, simultaneously quantifying anthropogenic pollution gradients that directly link urban expansion patterns to water quality degradation. By addressing fundamental limitations in feature interaction modeling and optimization efficiency, the proposed architecture provides a robust foundation for operational early warning systems and strategic resource management.
2. Data and Methods
The dataset used in this study consisted of both meteorological and water quality data. Correlation analysis of the two datasets confirmed a relationship between air temperature and dissolved oxygen, so meteorological influences were incorporated into the water quality prediction process. The overall research workflow is shown in Figure 1.
2.1. Water Quality Data
The Xijiang River is the largest tributary of the Pearl River Basin, with a total length of 2214 km and a basin area of approximately 353,100 km². This study focused on 11 monitoring stations within Guangdong Province (spatial distribution shown in Figure 2). Water quality data, including pH, DO, CODMn, NH3-N, TP, and TN, were obtained from the real-time monitoring system of the Pearl River Basin, published by the Ministry of Ecology and Environment, covering the period from January 2022 to October 2023.
2.2. Meteorological Data
Corresponding meteorological data, including precipitation, air temperature, and humidity, were acquired from the China Meteorological Administration (CMA) for the same period. These data were sampled at one-hour intervals.
2.3. Data Processing
A rigorous data preprocessing pipeline was implemented to ensure data quality and suitability for model training. This pipeline included four main steps:
Time Series Alignment: The water quality data (4 h interval) and meteorological data (1 h interval) were merged. Missing timestamps in both series were first filled to create complete, continuous time series. Subsequently, meteorological data points corresponding to the water quality sampling times were extracted to form a unified dataset.
Missing Value Imputation: To handle data gaps arising from equipment malfunction or transmission errors, missing values were imputed using the cubic spline interpolation method, which preserves the smoothness and continuity of the time series.
Outlier Detection and Treatment: The 3σ criterion was employed to identify statistical outliers. Given the small number of outliers detected, they were removed, and the resulting gaps were filled using cubic spline interpolation.
Data Normalization: To prevent features with larger magnitudes from dominating the model training process and to improve training efficiency, all data were scaled to a range of [0, 1] using Min-Max Normalization.
3. Methodology
3.1. Data Preprocessing Pipeline
To ensure the quality and suitability of the data for model training, a rigorous preprocessing pipeline was implemented. This process began with time series alignment to synchronize the heterogeneous sampling intervals of the water quality data (4 h interval) and meteorological data (1 h interval).
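As a brief illustration of this alignment step, the following Python sketch (pandas-based; the frame and variable names are hypothetical, and both series are assumed to share on-the-hour timestamps) rebuilds a complete 4 h grid and attaches the hourly meteorological readings at those timestamps:

```python
import pandas as pd

def align_series(wq: pd.DataFrame, met: pd.DataFrame) -> pd.DataFrame:
    """Build a complete 4 h grid and attach meteorological readings at those times.

    wq:  water quality indicators sampled every 4 h, indexed by timestamp
    met: meteorological indicators sampled every 1 h, indexed by timestamp
    """
    grid = pd.date_range(wq.index.min(), wq.index.max(), freq="4h")  # complete 4 h timestamps
    wq_full = wq.reindex(grid)       # missing water quality rows become NaN (filled later)
    met_full = met.reindex(grid)     # keep only the met values at the 4 h sampling times
    return wq_full.join(met_full)

# Hypothetical usage for one monitoring station
# merged = align_series(water_quality_df, meteorology_df)
```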
Following alignment, missing values were reconstructed using cubic spline interpolation [17]. This method constructs a piecewise cubic polynomial $S_k(t)$ defined for each interval $[t_k, t_{k+1}]$, as shown in Equation (1):
$$S_k(t) = a_k + b_k (t - t_k) + c_k (t - t_k)^2 + d_k (t - t_k)^3, \quad t \in [t_k, t_{k+1}] \tag{1}$$
The coefficients are solved by enforcing continuity conditions for the first and second derivatives at each interior point, as specified in Equation (2):
$$S_k(t_{k+1}) = S_{k+1}(t_{k+1}), \quad S_k'(t_{k+1}) = S_{k+1}'(t_{k+1}), \quad S_k''(t_{k+1}) = S_{k+1}''(t_{k+1}) \tag{2}$$
For this study, natural boundary conditions were applied at the endpoints, as shown in Equation (3):
$$S''(t_0) = 0, \qquad S''(t_n) = 0 \tag{3}$$
This yields the final formula for calculating the interpolated value $S(t)$ for any missing timestamp $t$, presented in Equation (4):
$$S(t) = \frac{M_k (t_{k+1} - t)^3 + M_{k+1} (t - t_k)^3}{6 h_k} + \left(\frac{y_k}{h_k} - \frac{M_k h_k}{6}\right)(t_{k+1} - t) + \left(\frac{y_{k+1}}{h_k} - \frac{M_{k+1} h_k}{6}\right)(t - t_k) \tag{4}$$
where $h_k$ is the time interval $t_{k+1} - t_k$, $M_k$ is the second derivative $S''(t_k)$, and $y_k$ is the known data value at time $t_k$.
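A minimal sketch of this imputation step is shown below, assuming each indicator is held as a pandas Series on a regular 4 h grid; SciPy’s CubicSpline with natural boundary conditions corresponds to Equations (1)–(4), and the column name is illustrative:

```python
import numpy as np
import pandas as pd
from scipy.interpolate import CubicSpline

def impute_cubic_spline(series: pd.Series) -> pd.Series:
    """Fill NaN gaps in a regularly sampled series with a natural cubic spline."""
    t = np.arange(len(series), dtype=float)         # numeric time axis
    known = series.notna().to_numpy()               # observed points
    spline = CubicSpline(t[known], series.to_numpy()[known],
                         bc_type="natural")         # natural boundary: S'' = 0 at endpoints
    filled = series.copy()
    filled[~known] = spline(t[~known])              # evaluate the spline at missing timestamps
    return filled

# Hypothetical usage on one indicator from one station
# do_filled = impute_cubic_spline(merged["DO"])
```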
Subsequently, statistical outliers were identified using the 3σ criterion [18], which rejects any data point $x_i$ that satisfies the condition $|x_i - \mu| > 3\sigma$, where μ and σ are the mean and standard deviation of the specific indicator. These outliers were then replaced using the same cubic spline interpolation method.
Finally, to improve training efficiency [19], all features were scaled to a range of [0, 1] using Min-Max Normalization, defined in Equation (5):
$$x' = \frac{x - \min(X)}{\max(X) - \min(X)} \tag{5}$$
where $x$ is the original feature value, $x'$ is the normalized value, and $\min(X)$ and $\max(X)$ are the minimum and maximum values of the feature across the dataset.
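The outlier screening and normalization steps can likewise be expressed compactly; the sketch below is illustrative only and reuses the hypothetical impute_cubic_spline helper from the previous snippet:

```python
import pandas as pd

def remove_outliers_3sigma(series: pd.Series) -> pd.Series:
    """Mark values outside mu +/- 3*sigma as missing so they can be re-interpolated."""
    mu, sigma = series.mean(), series.std()
    return series.mask((series - mu).abs() > 3 * sigma)   # outliers become NaN

def min_max_scale(series: pd.Series) -> pd.Series:
    """Scale a feature to the [0, 1] range (Equation (5))."""
    lo, hi = series.min(), series.max()
    return (series - lo) / (hi - lo)

# Hypothetical usage: screen, re-interpolate, then scale one indicator
# tn_clean  = impute_cubic_spline(remove_outliers_3sigma(merged["TN"]))
# tn_scaled = min_max_scale(tn_clean)
```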
3.2. Correlation Analysis
3.2.1. Quantitative Correlation Assessment
The two-tailed Pearson Correlation Coefficient (PCC) was employed to quantify linear relationships between water quality and meteorological indicators [20]. Analyses were conducted using SPSS 26.0 with the computational formula given in Equation (6):
$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}} \tag{6}$$
where $x_i$ and $y_i$ are the paired measurements at time $i$, and $\bar{x}$ and $\bar{y}$ are the mean values of variables X and Y, respectively.
3.2.2. Statistical Interpretation
The statistical interpretation of the Pearson Correlation Coefficient (r) involves assessing its magnitude and direction. The value of r ranges from −1 to 1, where a positive value (r > 0) indicates a positive correlation, and a negative value (r < 0) signifies a negative correlation. The strength of the linear relationship is classified into four distinct tiers based on the magnitude of |r|. A Very Strong Correlation is indicated when |r| is 0.8 or greater; a Strong Correlation is reflected when |r| is between 0.6 and 0.8; a Moderate Correlation falls within the range of 0.4 to 0.6; and a Weak Correlation is defined by |r| values less than 0.4.
Complementing the correlation coefficient, the p-value assesses statistical significance. It is defined as the probability of obtaining the observed correlation magnitude, or one even greater, under the assumption that no true correlation exists (the null hypothesis, H0). For this study, a p-value less than 0.01 was used as the threshold to determine statistical significance, providing strong evidence to reject the null hypothesis. The p-value is computed from a t-statistic, which is calculated as shown in Equation (7):
$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} \tag{7}$$
where $t$ follows Student’s t-distribution, $r$ is the Pearson Correlation Coefficient, $n$ is the number of samples, and $n-2$ represents the degrees of freedom.
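For readers reproducing the analysis outside SPSS, an equivalent computation of r, the t-statistic, and the two-tailed p-value (Equations (6) and (7)) can be sketched with SciPy; the sample arrays below are purely illustrative:

```python
import numpy as np
from scipy import stats

# Hypothetical paired, time-aligned measurements for two indicators
ph = np.array([7.2, 7.4, 7.1, 7.6, 7.5, 7.3, 7.8, 7.7])
do = np.array([8.1, 8.4, 7.9, 8.8, 8.6, 8.2, 9.1, 9.0])

# Two-tailed Pearson correlation (Equation (6)) and its p-value
r, p = stats.pearsonr(ph, do)

# Equivalent t-statistic from Equation (7), with n - 2 degrees of freedom
n = len(ph)
t = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)
p_manual = 2 * stats.t.sf(abs(t), df=n - 2)   # two-tailed p from Student's t

print(f"r = {r:.3f}, p = {p:.4f} (manual p = {p_manual:.4f})")
```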
3.2.3. Correlation-Driven Model Design
The insights from the correlation analysis directly informed the model’s architectural design. The integrated correlation matrix revealed critical physicochemical relationships that guided the feature processing strategy. For instance, the very strong positive correlation, or synergy, observed between pH and DO (r = 0.880) reflects carbonate buffering dynamics; this justified their joint feature processing within dedicated CNN channels to capture their co-variation. Similarly, the strong negative correlation, or anticorrelation, between temperature and DO (r = −0.644) validated known thermal solubility effects, prompting a priority weighting for this relationship within the GRU’s temporal attention mechanism.
This led to a tiered feature selection strategy based on correlation strength. Tier 1 features, such as the pH-DO pair with a correlation coefficient |r| ≥ 0.8, were processed through dedicated 3 × 3 CNN kernels. Tier 2 features, like the temperature–DO relationship where 0.6 ≤ |r| < 0.8, were assigned double attention weights in the GRU gates to emphasize their importance. Conversely, features with weak or spurious associations (|r| < 0.4), such as the precipitation–CODMn linkage (r = −0.022), were excluded from the model to reduce noise and improve focus on meaningful predictors.
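The tier-assignment rule itself is simple enough to state in code; the sketch below is an illustrative reading of the thresholds above (function and feature names are hypothetical, not the authors’ implementation):

```python
import pandas as pd

def assign_tiers(corr_with_target: pd.Series) -> pd.Series:
    """Map each candidate feature to a processing tier from its |r| with the target.

    Tier 1 (|r| >= 0.8): joint processing in dedicated CNN channels.
    Tier 2 (0.6 <= |r| < 0.8): doubled attention weight in the GRU gates.
    Dropped (|r| < 0.4): excluded as noise; intermediate values keep default handling.
    """
    def tier(r: float) -> str:
        r = abs(r)
        if r >= 0.8:
            return "tier1_cnn"
        if r >= 0.6:
            return "tier2_gru_weighted"
        if r >= 0.4:
            return "default"
        return "dropped"
    return corr_with_target.apply(tier)

# Hypothetical usage with correlations against DO reported in Table 2
# tiers = assign_tiers(pd.Series({"pH": 0.880, "air_temp": -0.644, "precip": -0.022}))
```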
3.3. Input Data Structuring via the Sliding Window Technique
To transform the raw time series data into a format suitable for supervised learning, the time-sliding window technique was implemented. This fundamental method systematically restructures sequential data into a set of input–feature (X) and target–output (y) pairs, which is a prerequisite for training predictive models like the NGO-CNN-GRU. The technique operates by moving a window of a predefined size, w, across the time series. For each position, the data within the window constitutes a single input sample X, and the data point immediately following the window is designated as the corresponding target label y.
Let $\mathbf{x}_t \in \mathbb{R}^{m}$ represent the vector of all m features (e.g., pH, DO, temperature) at a specific time step t. The transformation for any given time t can be formally expressed as creating an input matrix $X_t = [\mathbf{x}_{t-w+1}, \mathbf{x}_{t-w+2}, \ldots, \mathbf{x}_t]$ and a corresponding target output vector $y_t = \mathbf{x}_{t+1}$.
For this study, we configured a window size (w) of 12 time steps, a time step interval of 4 h, and a slide step of 1 time step. This configuration means that, for each prediction, the model utilizes a look-back period of 48 h (12 steps × 4 h/step) of historical data from all available indicators. The objective is to predict the state of all water quality indicators at the subsequent 4 h interval. By sliding this window across the entire preprocessed dataset, we generated thousands of (X, y) sample pairs, which were then used to train and evaluate the model. This comprehensive data structuring ensures that the model learns from the rich temporal patterns embedded within the recent history of the aquatic system.
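A compact sketch of this windowing step is given below, assuming the preprocessed station data are held as a NumPy array with one row per 4 h time step and one column per feature; the function name is illustrative:

```python
import numpy as np

def make_sliding_windows(data: np.ndarray, window: int = 12):
    """Turn a (T, m) multivariate series into supervised (X, y) pairs.

    X has shape (N, window, m): the last `window` steps of all indicators.
    y has shape (N, m): the indicator values at the step immediately after each window.
    """
    X, y = [], []
    for start in range(len(data) - window):
        X.append(data[start:start + window])   # 12 steps x 4 h = 48 h look-back
        y.append(data[start + window])         # target: the next 4 h interval
    return np.array(X), np.array(y)

# Hypothetical usage on the normalized station matrix
# X_all, y_all = make_sliding_windows(normalized_values, window=12)
```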
3.4. CNN Layer for Feature Extraction
To extract salient local features from the sequential input data, we first employed a Convolutional Neural Network (CNN). Originally developed by LeCun et al., the CNN architecture is particularly effective for this task [21], typically comprising convolutional layers, pooling layers, and activation functions, as depicted in the schematic in Figure 3.
3.4.1. The Convolutional Layer
The core operation of the convolutional layer is to apply a set of learnable filters (or kernels) to the input data. This process is defined by the following equation:
$$y_{i,j} = \sum_{m=1}^{M}\sum_{n=1}^{N} x_{i+m-1,\, j+n-1}\, w_{m,n} + b$$
where $y_{i,j}$ is the output feature map, $x$ is the input data, $w$ is the convolutional kernel, $M$ and $N$ are the kernel’s height and width, and $b$ is the bias term.
Given that our input is one-dimensional time series data, we opted for a smaller kernel size of 2 × 1. To determine the optimal number of kernels (filters), we conducted a two-phase experiment using the total nitrogen data from the Shimodong station, with RMSE as the evaluation metric. The results are presented in Table 1.
As shown in Table 1, the experiment was conducted in two phases. In the first phase, we tested symmetric kernel combinations and found that the (32, 32) configuration yielded the lowest initial RMSE. Building on this, the second phase explored asymmetric combinations around the value of 32. The (16, 32) kernel combination produced the overall lowest RMSE (0.109) and was therefore selected as the optimal configuration for our model.
3.4.2. Pooling Layer
Following the convolutional layers, a pooling layer is used to reduce the dimensionality of the feature maps, which helps to decrease computational load and control overfitting. We employed max pooling, which retains the maximum value from each local region of the input feature map. The operation is defined as follows:
$$p_j = \max_{i \in R_j} x_i$$
An alternative, average pooling, calculates the average value of the local region:
$$p_j = \frac{1}{|R_j|} \sum_{i \in R_j} x_i$$
where $R_j$ denotes the $j$-th pooling region and $x_i$ is an element of the input feature map. In our model, we used a max-pooling layer with a 2 × 1 pooling kernel and a stride of 1 to effectively compress the extracted features.
3.4.3. Fully Connected Layer and Activation Function
The features processed by the convolutional and pooling layers were then flattened and passed to a fully connected layer, which maps the learned features to the final output. To introduce non-linearity into the model, we utilized the Rectified Linear Unit (ReLU) as the activation function [22]. ReLU is computationally efficient and helps mitigate the vanishing gradient problem. It is defined as follows:
$$\mathrm{ReLU}(x) = \max(0, x)$$
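A minimal PyTorch sketch of the feature-extraction block described above is given below. It follows the reported configuration (16 and 32 filters with 2 × 1 kernels, a 2 × 1 max pool with stride 1, ReLU, and a flattening fully connected layer), but the module structure, output dimension, and tensor layout are assumptions for illustration rather than the authors’ exact implementation:

```python
import torch
import torch.nn as nn

class CNNFeatureExtractor(nn.Module):
    """Local feature extraction from windowed multivariate series (batch, features, window)."""
    def __init__(self, n_features: int, hidden_dim: int = 64):
        super().__init__()
        self.conv1 = nn.Conv1d(n_features, 16, kernel_size=2)  # first layer: 16 filters, 2x1 kernel
        self.conv2 = nn.Conv1d(16, 32, kernel_size=2)          # second layer: 32 filters
        self.pool = nn.MaxPool1d(kernel_size=2, stride=1)      # 2x1 max pooling, stride 1
        self.act = nn.ReLU()
        self.fc = nn.LazyLinear(hidden_dim)                    # flatten -> fully connected

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.act(self.conv1(x))
        x = self.act(self.conv2(x))
        x = self.pool(x)
        return self.fc(x.flatten(start_dim=1))

# Hypothetical usage: a batch of 8 windows, 9 indicators, 12 time steps
# feats = CNNFeatureExtractor(n_features=9)(torch.randn(8, 9, 12))
```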
3.5. GRU Layer for Temporal Modeling
The feature sequences extracted by the CNN layers were then passed to a Gated Recurrent Unit (GRU) layer, which is specifically designed to model temporal dependencies. The GRU, introduced by Cho et al. in 2014 [23], was selected for its proven ability to capture long-range dependencies in time series data. Compared to its more complex predecessor, the Long Short-Term Memory (LSTM) network [24], the GRU offers a more streamlined architecture with fewer parameters. This results in greater computational efficiency without a significant trade-off in performance, making it an ideal choice for this study.
The core of the GRU’s structure lies in its gating mechanisms, which regulate the flow of information through the network. The key computations are as follows.
First, the update gate ($z_t$) and reset gate ($r_t$) were calculated:
$$z_t = \sigma\!\left(W_z [h_{t-1}, x_t] + b_z\right)$$
$$r_t = \sigma\!\left(W_r [h_{t-1}, x_t] + b_r\right)$$
where $z_t$ and $r_t$ are the update and reset gates, $h_t$ is the new hidden state, $\tilde{h}_t$ is the candidate hidden state, $h_{t-1}$ is the hidden state from the previous time step, $x_t$ is the input at the current time step, W and b are the weight matrices and bias vectors, σ is the sigmoid function, and ⊙ represents the Hadamard product.
Next, a candidate hidden state ($\tilde{h}_t$) was computed, using the reset gate to control how much of the past information is incorporated:
$$\tilde{h}_t = \tanh\!\left(W_h [r_t \odot h_{t-1}, x_t] + b_h\right)$$
Finally, the new hidden state ($h_t$) was determined by the update gate, which balances the previous hidden state and the candidate state:
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$
In configuring the GRU layer, given the input sequence length of 12 time steps, a single GRU layer was determined to be sufficient for capturing the relevant temporal patterns. The number of neurons in this layer was treated as a key hyperparameter, as it directly influences the model’s capacity—a higher number allows for learning more complex patterns but also increases the risk of overfitting.
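For illustration, the gate equations above can be executed directly; the NumPy sketch below passes one 12-step window through a single GRU cell with hypothetical dimensions and random weights (in practice a framework layer such as PyTorch’s nn.GRU would be trained instead):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, W, b):
    """One GRU update implementing the gate equations above (concatenated-weight form)."""
    concat = np.concatenate([h_prev, x_t])
    z = sigmoid(W["z"] @ concat + b["z"])                                   # update gate
    r = sigmoid(W["r"] @ concat + b["r"])                                   # reset gate
    h_cand = np.tanh(W["h"] @ np.concatenate([r * h_prev, x_t]) + b["h"])   # candidate state
    return (1 - z) * h_prev + z * h_cand                                    # new hidden state

# Hypothetical dimensions: 64-dim CNN feature vectors in, 96 GRU neurons
rng = np.random.default_rng(0)
dim_in, dim_h = 64, 96
W = {k: rng.normal(scale=0.1, size=(dim_h, dim_h + dim_in)) for k in "zrh"}
b = {k: np.zeros(dim_h) for k in "zrh"}

h = np.zeros(dim_h)
for x in rng.normal(size=(12, dim_in)):   # roll a 12-step window through the cell
    h = gru_step(x, h, W, b)
print(h.shape)                            # (96,) final hidden state fed to the output layer
```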
3.6. Hyperparameter Optimization via NGO
To address the inefficiencies of manual hyperparameter tuning, the Northern Goshawk Optimization (NGO) algorithm, which simulates goshawk hunting behavior, was employed [25]. The process begins with the initialization of a population matrix, shown in Equation (16):
$$X = \begin{bmatrix} X_1 \\ \vdots \\ X_i \\ \vdots \\ X_N \end{bmatrix} = \begin{bmatrix} x_{1,1} & \cdots & x_{1,d} \\ \vdots & \ddots & \vdots \\ x_{i,1} & \cdots & x_{i,d} \\ \vdots & \ddots & \vdots \\ x_{N,1} & \cdots & x_{N,d} \end{bmatrix} \tag{16}$$
and a corresponding fitness vector, shown in Equation (17):
$$F = \begin{bmatrix} F_1(X_1) \\ \vdots \\ F_i(X_i) \\ \vdots \\ F_N(X_N) \end{bmatrix} \tag{17}$$
where $X$ is the population matrix, $X_i$ is the position of the i-th individual, $N$ is the population size, $d$ is the problem dimension, and $F$ is the vector of objective function values.
The algorithm then proceeds through two main phases. The first is an exploration phase, which models prey identification and attack to perform a global search across the solution space. This phase is mathematically defined in Equations (18)–(20):
$$P_i = X_k, \quad i = 1, 2, \ldots, N, \; k \in \{1, 2, \ldots, N\}, \; k \neq i \tag{18}$$
$$x_{i,j}^{\mathrm{new},P1} = \begin{cases} x_{i,j} + r\,(p_{i,j} - I\,x_{i,j}), & F_{P_i} < F_i \\ x_{i,j} + r\,(x_{i,j} - p_{i,j}), & F_{P_i} \ge F_i \end{cases} \tag{19}$$
$$X_i = \begin{cases} X_i^{\mathrm{new},P1}, & F_i^{\mathrm{new},P1} < F_i \\ X_i, & F_i^{\mathrm{new},P1} \ge F_i \end{cases} \tag{20}$$
where $P_i$ is the position of the selected prey for the i-th individual, $r$ is a random number in [0, 1], and $I$ is a random integer of 1 or 2.
This is followed by an exploitation phase, which simulates a high-speed chase to conduct a fine-tuned local search in promising regions. This behavior is modeled by Equations (21)–(23):
$$R = 0.02\left(1 - \frac{t}{T}\right) \tag{21}$$
$$x_{i,j}^{\mathrm{new},P2} = x_{i,j} + R\,(2r - 1)\,x_{i,j} \tag{22}$$
$$X_i = \begin{cases} X_i^{\mathrm{new},P2}, & F_i^{\mathrm{new},P2} < F_i \\ X_i, & F_i^{\mathrm{new},P2} \ge F_i \end{cases} \tag{23}$$
where $t$ is the current iteration number, and $T$ is the maximum number of iterations.
The NGO algorithm was specifically configured to optimize three core hyperparameters: the number of GRU neurons (ranging from 32 to 128), the CNN kernel size (from 3 × 3 to 7 × 7), and the learning rate (from 10⁻⁴ to 0.1). The optimization process was governed by two termination conditions: the execution would halt either after reaching a maximum of 100 generations or if the solution’s fitness value showed no significant improvement for 20 consecutive iterations.
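A condensed sketch of how NGO can drive this search is given below; it is a generic minimization loop under the stated bounds and stopping rules, with a hypothetical fitness function that would train the CNN-GRU and return its validation RMSE (integer-valued dimensions such as neuron counts would be rounded inside that function):

```python
import numpy as np

def ngo_minimize(fitness, lower, upper, pop=20, max_iter=100, patience=20, seed=0):
    """Minimal Northern Goshawk Optimization sketch (minimization)."""
    rng = np.random.default_rng(seed)
    lower, upper = np.asarray(lower, float), np.asarray(upper, float)
    X = lower + rng.random((pop, lower.size)) * (upper - lower)   # Eq. (16): population
    F = np.array([fitness(x) for x in X])                         # Eq. (17): fitness vector
    best, stall = F.min(), 0
    for t in range(1, max_iter + 1):
        for i in range(pop):
            # Phase 1 (exploration): move relative to a randomly selected "prey"
            k = rng.choice([j for j in range(pop) if j != i])
            r, I = rng.random(X.shape[1]), rng.integers(1, 3)      # I is 1 or 2
            step = (X[k] - I * X[i]) if F[k] < F[i] else (X[i] - X[k])
            cand = np.clip(X[i] + r * step, lower, upper)
            if (f := fitness(cand)) < F[i]:
                X[i], F[i] = cand, f
            # Phase 2 (exploitation): shrinking local search around the current position
            R = 0.02 * (1 - t / max_iter)
            cand = np.clip(X[i] + R * (2 * rng.random(X.shape[1]) - 1) * X[i], lower, upper)
            if (f := fitness(cand)) < F[i]:
                X[i], F[i] = cand, f
        stall = stall + 1 if F.min() >= best - 1e-9 else 0         # stop on stagnation
        best = min(best, F.min())
        if stall >= patience:
            break
    return X[F.argmin()], F.min()

# Hypothetical usage: bounds for (GRU neurons, kernel size, learning rate)
# best_params, best_rmse = ngo_minimize(train_and_score, [32, 3, 1e-4], [128, 7, 0.1])
```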
3.7. Model Evaluation Metrics
To provide a comprehensive and quantitative assessment of the model’s predictive performance, we employed a suite of standard statistical metrics [26].
Mean Absolute Error (MAE): This metric measures the average absolute difference between the true and predicted values and is commonly used in the prediction of continuous data. A smaller MAE indicates that the predictive ability of the model is stronger. It is calculated as shown in Equation (24):
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left| y_i - \hat{y}_i \right| \tag{24}$$
where $n$ is the number of samples, $y_i$ is the true value, and $\hat{y}_i$ is the predicted value.
Root Mean Squared Error (RMSE): This metric is the square root of the Mean Squared Error (MSE) and is particularly sensitive to large errors. A smaller RMSE value indicates a better model fit. The formula is presented in Equation (25):
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left( y_i - \hat{y}_i \right)^2} \tag{25}$$
where the variables carry the same meaning as in the MAE calculation.
Coefficient of Determination (R²): This metric indicates how well the model is fitted, with a value closer to 1 representing a better fit. It is defined by Equation (26):
$$R^2 = 1 - \frac{\mathrm{SS}_{\mathrm{res}}}{\mathrm{SS}_{\mathrm{tot}}} = 1 - \frac{\sum_{i=1}^{n}\left( y_i - \hat{y}_i \right)^2}{\sum_{i=1}^{n}\left( y_i - \bar{y} \right)^2} \tag{26}$$
where $\mathrm{SS}_{\mathrm{res}}$ is the sum of squares of residuals, which represents the difference between the predicted values of the model and the actual observed values; $\mathrm{SS}_{\mathrm{tot}}$ is the total sum of squares, which represents the difference between the actual observed values and their mean; $y_i$ is the true value; $\hat{y}_i$ is the predicted value; and $\bar{y}$ is the mean of the true values.
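These three metrics can be computed together in a few lines; the sketch below assumes NumPy arrays of true and predicted values and mirrors Equations (24)–(26):

```python
import numpy as np

def evaluate(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    """Compute MAE, RMSE, and R^2 as defined in Equations (24)-(26)."""
    err = y_true - y_pred
    mae = np.mean(np.abs(err))
    rmse = np.sqrt(np.mean(err ** 2))
    ss_res = np.sum(err ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return {"MAE": mae, "RMSE": rmse, "R2": 1 - ss_res / ss_tot}

# Hypothetical usage on a held-out test window
# print(evaluate(y_test, model_predictions))
```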
4. Results
4.1. Correlation Analysis
The Pearson correlation analysis results for all water quality and meteorological variables are presented in a combined matrix in Table 2. The results revealed significant interrelationships among the water quality variables themselves. Notably, a very strong positive correlation was observed between pH and dissolved oxygen (DO) (r = 0.880, p < 0.01), and a strong positive correlation was found between the permanganate index (CODMn) and ammonia nitrogen (NH3-N) (r = 0.600, p < 0.01). Furthermore, the analysis confirmed the influence of meteorological conditions on water quality. A strong negative correlation was identified between DO and air temperature (r = −0.644, p < 0.01), which is consistent with the physical principle of reduced oxygen solubility in warmer water. These findings highlight the complex, interconnected nature of the aquatic chemical environment and justify the inclusion of both water quality and meteorological data in the predictive model.
4.2. Model Training and Validation Performance
To assess the fundamental performance of the proposed NGO-CNN-GRU model, we first evaluated its fitting and generalization capabilities on a representative dataset. We used the Total Nitrogen (TN) data from the Xiaodong station as a case study, with a continuous series of 500 time points for training and the subsequent 200 for testing. The model demonstrated an excellent fit to the training data and generalized well to the unseen test data (Figure 4). The model’s predictions closely tracked the observed values, capturing not only the general trends but also the more abrupt fluctuations and extreme values present in the time series. The quantitative metrics in Table 3 corroborate this visual assessment. The model achieved an exceptionally high goodness-of-fit, with an R² of 0.99483 on the training set and 0.98677 on the test set. The correspondingly low RMSE and MAE values indicate high predictive accuracy and suggest that the model is well-fitted without being overfitted to the training data.
4.3. Comparative Analysis of Model Prediction Results
To establish the superiority of our proposed approach, we conducted a comprehensive comparative analysis of the NGO-CNN-GRU model against two baseline models: a single GRU model and a non-optimized CNN-GRU model. The comparison was performed across all 11 monitoring stations for the six key water quality indicators.
The complete results are presented in a detailed matrix of grouped bar charts in Figure 5. This figure systematically compares the models across three key performance metrics: goodness-of-fit (R²) and prediction accuracy (RMSE and MAE). A visual inspection of the figure clearly and consistently demonstrates that the NGO-CNN-GRU model delivers superior predictive performance across nearly all indicators and stations, consistently achieving the highest R² values and the lowest RMSE and MAE values. This improvement was particularly striking for the NH3-N indicator, where the proposed model dramatically reduced prediction errors compared to the non-optimized models. This increase in accuracy underscores the critical role of the NGO algorithm in optimizing the model’s hyperparameters and validates the effectiveness of the proposed integrated approach for water quality forecasting.
To rigorously validate the superiority of our proposed approach, we conducted a statistical significance assessment in addition to the comparative analysis. Each model (GRU, CNN-GRU, and NGO-CNN-GRU) was independently trained and tested 10 times with different random initializations to account for stochastic variations in the training process. The performance metrics presented in Figure 5 represent the mean values from these repeated trials.
To statistically confirm the observed differences, a one-way analysis of variance (ANOVA) followed by Tukey’s post hoc test was performed on the collected RMSE and MAE results for each indicator. The tests revealed that the performance improvements achieved by the NGO-CNN-GRU model were statistically significant (p < 0.05) compared to both the single GRU and the non-optimized CNN-GRU models in the vast majority of cases, with the NH3-N indicator again showing the most pronounced gains. This robust statistical evidence underscores the critical role of the NGO algorithm in optimizing the model’s hyperparameters and validates the effectiveness of the proposed integrated approach for water quality forecasting.
Figure 5. Comparative performance of the GRU, CNN-GRU, and NGO-CNN-GRU models across all monitoring stations. The bars represent the mean performance values calculated from 10 independent model runs. The grid shows three performance metrics (R², RMSE, and MAE, in rows) for six water quality indicators (CODMn, DO, NH3-N, pH, TN, and TP, in columns). The consistently higher R² and lower RMSE/MAE values demonstrate the superior performance of the NGO-CNN-GRU model. As detailed in Section 4.3, the observed improvements are validated as statistically significant (p < 0.05) through ANOVA testing.
4.4. Model Performance Across Different Time Scales
To comprehensively evaluate the model’s operational viability, its predictive capability was rigorously tested across multiple forecast horizons, from short-term 4 h projections to long-term 168 h (7-day) forecasts. Using dissolved oxygen (DO) predictions at Yong’an Station as a representative case, the results, cataloged in Table 4, show a gradual performance attenuation with extended prediction windows. This decay is driven by three interconnected mechanisms.
First, recurrent neural architectures like the GRU inherently accumulate uncertainty through iterative state transitions, a dynamic known as temporal error propagation. For the GRU component, this error amplification is described by Equation (27):
$$\epsilon_t = \left\lVert \frac{\partial h_t}{\partial h_{t-1}} \right\rVert \epsilon_{t-1} \tag{27}$$
where $\epsilon_t$ is the error at time t and $h_t$ is the hidden state. This compounding effect becomes more pronounced over longer horizons, with our analysis showing Jacobian norms exceeding 1.2 beyond 72 h.
Second, the model’s performance is susceptible to environmental perturbation, as unmodeled exogenous events can induce trajectory deviations. These events include meteorological shocks, such as intense rainfall altering catchment nutrient fluxes; anthropogenic disturbances, like industrial discharge pulses introducing uncalibrated contaminant signatures; and ecological thresholds, such as the onset of an algal bloom that disrupts DO-pH equilibria.
Third, sensor drift and signal degradation in the in situ monitoring systems contribute to a progressive loss of accuracy over time. This degradation varies with the forecast horizon. In the short term (<24 h), minor baseline drift has a negligible impact on R². Over the medium term (72 h), however, biofouling can induce more significant deviations. This effect becomes most pronounced in the long term (168 h), where cumulative polarization can dominate the performance decline. This sensor drift is further exacerbated in high-turbidity zones like the Xijiang River.
Critically, despite these systemic constraints, the model demonstrates exceptional resilience. At the 168 h horizon, DO prediction maintains an R² of 0.87822, surpassing conventional LSTM and process-based models by a significant margin [27]. This robustness confirms its viability for both short-term operational early warnings and long-term strategic trend analysis.
5. Discussion
This study successfully developed and validated an NGO-CNN-GRU model for high-precision water quality prediction in the Xijiang River. The results not only demonstrate the model’s superior performance but also provide valuable insights into the spatial-temporal dynamics of water quality in this critical basin. Our discussion focuses on three key areas: the interpretation of the observed spatial patterns, the significance and practical implications of the model’s performance, and the limitations of the current study with directions for future research.
5.1. Spatial Heterogeneity of Water Quality and Anthropogenic Footprint
A key finding from our model’s predictions is the pronounced spatial heterogeneity of water quality along the Guangdong section of the Xijiang River. The analysis reveals a clear degradation gradient from the cleaner upstream regions to the more polluted downstream estuary. Upstream sites, such as Fengkai Chengshang, consistently showed low concentrations of nutrients like Total Phosphorus (TP) and Total Nitrogen (TN), reflecting a relatively pristine state with less human interference.
In stark contrast, downstream stations near the heavily urbanized Pearl River Delta, including Zhongshan Port Wharf and Zhuhai Bridge, exhibited significantly elevated nutrient levels. For instance, the model predicted a mean TN concentration of 2.75 mg/L at Zhongshan Port Wharf, a level indicative of significant nutrient loading. This spatial pattern strongly suggests a substantial anthropogenic footprint. The lower reaches of the Xijiang River serve as the primary receiving waters for massive volumes of industrial effluent, municipal sewage, and agricultural runoff from megalopolises like Guangzhou, Foshan, and Shenzhen. The predicted low dissolved oxygen levels in these downstream areas (e.g., 6.25 mg/L at Zhongshan Port) further corroborate the impact of pollution, which can induce hypoxia and threaten aquatic ecosystems. Thus, our model provides quantitative, predictive evidence that directly links the intense urbanization of the PRD to the degradation of water quality in the Xijiang River estuary.
5.2. Significance of the Model and Implications for Water Management
The exceptional performance of the NGO-CNN-GRU model carries significant implications for both the academic field and practical water resource management [28]. The model’s ability to consistently achieve high R² values (often >0.96) across 11 diverse monitoring sites confirms its high reliability and strong generalization capability. This represents a notable advancement over simpler, single-architecture models, highlighting the power of hybrid deep learning structures [29], intelligently optimized by metaheuristic algorithms like NGO, in deciphering the complex, non-linear patterns of environmental time series data.
From a practical standpoint, the model’s adaptability across various time scales makes it a highly versatile tool. Its high accuracy in short-term forecasting (e.g., 4 to 24 h) makes it an ideal core for an operational early warning system, enabling environmental managers to take proactive, rather than reactive, measures against pollution events. The model’s stability in longer-term prediction (e.g., one week) also demonstrates its potential for strategic applications, such as evaluating the effectiveness of pollution control policies or managing total maximum daily load (TMDL) allocations.
5.3. Limitations and Future Research Directions
Despite the promising results, we acknowledge several limitations in the current study that open avenues for future research. First, like most data-driven models, the NGO-CNN-GRU is a “black box” [30], meaning that it does not explicitly represent the underlying hydro-chemical mechanisms governing water quality. Its predictive accuracy is fundamentally dependent on the quality and representativeness of the historical data used for training.
Future work should therefore aim to enhance both the model’s interpretability and its predictive power. Incorporating explainability techniques like SHAP (SHapley Additive exPlanations) could help elucidate the key factors driving the model’s predictions [31]. Furthermore, the model could be expanded to include more input variables, such as real-time industrial discharge data or satellite-derived land-use information [32], which could further improve its accuracy. Finally, applying this modeling framework to predict other emerging contaminants of concern would be a valuable next step in developing a truly comprehensive water quality management and early warning system for the region.
6. Conclusions
This study addressed the critical challenge of accurate time-series water quality prediction by developing a novel hybrid deep learning model, the NGO-CNN-GRU. Our primary contribution lies in the intelligent integration of a CNN for spatial feature extraction, a GRU for temporal modeling, and the Northern Goshawk Optimization (NGO) algorithm for automated hyperparameter tuning. This synergistic architecture successfully overcomes the limitations of previous models, establishing a new performance benchmark validated at a large, basin-wide scale across 11 heterogeneous monitoring stations. The model demonstrated exceptional robustness, achieving R² values exceeding 0.98 on test data and significantly outperforming baseline models, with performance gains confirmed to be statistically significant.
Beyond its technical advancements, this research carries substantial practical value and policy relevance for water resource management. In the context of operational management, the model’s high accuracy in short-term forecasting (4–24 h) provides an immediately deployable tool for real-time early warning systems. This allows water management authorities to proactively respond to pollution events, for example, by issuing timely public health advisories, adjusting operations at downstream water treatment plants, or pinpointing illicit discharge sources. For strategic policymaking, the model’s proven stability in long-term forecasting (up to 7 days) and its ability to quantify the anthropogenic pollution footprint across the basin offer a powerful decision-support tool. Policymakers can leverage the model to conduct “what-if” scenario analyses, evaluating the potential downstream impact of new industrial zoning policies or the effectiveness of proposed total maximum daily load (TMDL) regulations before implementation. This data-driven foresight directly supports evidence-based environmental planning and investment.
In conclusion, this work not only presents a highly effective and generalizable model but also provides a validated framework with tangible applications for transforming water management from a reactive to a proactive paradigm. It represents a meaningful step forward in developing intelligent, data-driven solutions essential for safeguarding vital water resources in the face of growing environmental pressures.
Author Contributions
Conceptualization, Y.C.; methodology, H.Z. and X.D.; software, H.Z.; validation, X.D., H.Z. and Y.D.; formal analysis, X.D.; investigation, X.D.; resources, Y.C.; data curation, H.Z.; writing—original draft preparation, X.D.; writing—review and editing, Y.C. and X.D.; visualization, Y.D.; supervision, Y.C.; project administration, Y.C.; funding acquisition, Y.C. All authors have read and agreed to the published version of the manuscript.
Funding
This research was funded by the General Program of the Natural Science Foundation of Guangdong Province (Grant No. 2024A1515011891), the Basic Science Center Project of the Natural Science Foundation of China (Grant No. 52388101), and the National Natural Science Foundation of China for Young Scientists Fund (Grant No. 52309083).
Data Availability Statement
The data presented in this study are available upon request from the corresponding author. The data are not publicly available due to the policies of the data providers.
Acknowledgments
The authors would like to thank the Ministry of Ecology and Environment of the People’s Republic of China and the China Meteorological Administration for providing the data used in this study. During the preparation of this manuscript, the author(s) used Gemini (July 2025 version) for the purposes of language polishing, structural revision, manuscript formatting, and generating illustrative figures. The authors have reviewed and edited the output and take full responsibility for the content of this publication.
Conflicts of Interest
The authors declare no conflicts of interest. The funders had no role in the design of this study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.
Abbreviations
The following abbreviations are used in this manuscript:
CNN | Convolutional Neural Network |
GRU | Gated Recurrent Unit |
NGO | Northern Goshawk Optimization |
PRD | Pearl River Delta |
References
- Mijares, A.C.; Keener, V.W.; Papacostas, C.S. Integrating climate change and water resource management: A review of challenges and opportunities in the Anthropocene. Water Resour. Manag. 2022, 36, 3457–3474. [Google Scholar]
- Ouyang, W.; Guo, H.; Hao, F.; Wang, X.; Huang, H. A review of surface water quality modeling and prediction. Sci. Total Environ. 2023, 878, 163013. [Google Scholar]
- Wang, J.; Liu, G.; Liu, H.; Lam, P.K.S. Microplastics in the Pearl River Delta region of China: A review of their sources, distribution, and potential impacts. Sci. Total Environ. 2021, 755, 142567. [Google Scholar]
- Zhang, J.; Wang, Z.; Liu, Y. Water pollution and its control in the Pearl River Delta, China. J. Environ. Manag. 2009, 90, 3261–3273. [Google Scholar]
- Li, Z.; Chen, Y.; Wang, F. Spatiotemporal analysis of nutrient dynamics in the Xijiang River from 2020–2024 using high-frequency monitoring data. J. Hydrol. 2025, 635, 131210. [Google Scholar]
- The General Office of the Central Committee of the Communist Party of China and the General Office of the State Council. Outline of the National Ecological Environment Monitoring Plan 2020; The General Office of the Central Committee of the Communist Party of China and the General Office of the State Council: Beijing, China, 2020. [Google Scholar]
- Zounemat-Kermani, M.; Stephan, M.; Kisi, O. A review on the applications of data-driven models for river water quality monitoring. Environ. Sci. Pollut. Res. 2022, 29, 51159–51177. [Google Scholar]
- Lu, H.; Ma, X. A review of water quality prediction models for sustainable water resources management. Sustainability 2020, 12, 2855. [Google Scholar]
- Barzegar, R.; Moghaddam, A.A.; Adamowski, J. A review of the application of deep learning in water quality prediction. Expert. Syst. Appl. 2021, 177, 114949. [Google Scholar]
- Shen, C. A transdisciplinary review of deep learning in water resources management. Hydrol. Earth Syst. Sci. 2025, 29, 1–28. [Google Scholar]
- Wu, N.; Zhang, B.; Li, S. Transformer-based models for time series forecasting: A survey. ACM Comput. Surv. 2023, 55, 1–37. [Google Scholar]
- Jia, Y.; Jin, S.; Zhang, J.; Li, W. A review of graph neural networks in water resources management. Water Res. 2024, 251, 121145. [Google Scholar]
- Bai, Y.; Li, Y.; Wang, X. A hybrid CNN-GRU model for water quality prediction based on an attention mechanism. J. Hydrol. 2020, 587, 124976. [Google Scholar]
- Abba, S.I.; Pham, Q.B.; Usman, A.G.; Linh, N.T.T.; Al-Ansari, N.; Abdulkadir, R.A.; Tinh, T.Q.; Tri, D.Q. Emerging evolutionary algorithm for the optimization of a hybrid machine learning model for predicting water quality index. Environ. Sci. Pollut. Res. 2023, 30, 12783–12803. [Google Scholar]
- Wu, Z.; Ahmed, S.E.; Li, Z. A review on hyperparameter optimization for deep learning in environmental modeling. Environ. Model. Softw. 2022, 157, 105523. [Google Scholar]
- Bergstra, J.; Bengio, Y. Random search for hyper-parameter optimization. J. Mach. Learn. Res. 2012, 13, 281–305. [Google Scholar]
- de Boor, C. A Practical Guide to Splines; Springer: New York, NY, USA, 1978. [Google Scholar]
- Iglewicz, B.; Hoaglin, D.C. How to Detect and Handle Outliers; ASQC Quality Press: Milwaukee, WI, USA, 1993. [Google Scholar]
- Saranya, T.; Panda, G. A review of data normalization techniques. Int. J. Comput. Appl. 2014, 97, 23–28. [Google Scholar]
- Pearson, K. Notes on the history of correlation. Biometrika 1920, 13, 25–45. [Google Scholar] [CrossRef]
- LeCun, Y.; Boser, B.; Denker, J.S.; Henderson, D.; Howard, R.E.; Hubbard, W.; Jackel, L.D. Backpropagation applied to handwritten zip code recognition. Neural Comput. 1989, 1, 541–551. [Google Scholar] [CrossRef]
- Nair, V.; Hinton, G.E. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), Haifa, Israel, 21–24 June 2010; Omnipress: Madison, WI, USA, 2010; pp. 807–814. [Google Scholar]
- Cho, K.; Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv 2014, arXiv:1406.1078. [Google Scholar] [CrossRef]
- Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
- Dehghani, M.; Montazeri, Z.; Trojovský, P.; Hubálovský, Š. Northern Goshawk Optimization: A new swarm-based algorithm for solving engineering problems. IEEE Access 2021, 9, 162059–162080. [Google Scholar] [CrossRef]
- Willmott, C.J.; Matsuura, K. Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in model evaluation. Clim. Res. 2005, 30, 79–82. [Google Scholar] [CrossRef]
- Li, Y.; Liu, Y.; Guo, H. A review of process-based models for water quality simulation and TMDL development. Ecol. Model. 2023, 478, 110298. [Google Scholar]
- Jordan, M.I.; Mitchell, T.M. Machine learning: Trends, perspectives, and prospects. Science 2015, 349, 255–260. [Google Scholar] [CrossRef]
- Al-Sudani, Z.A.; Jassim, M.S.; Al-Maliki, L.A.A. A review of hybrid deep learning models for water quality prediction. J. Hydrol. 2023, 620, 129443. [Google Scholar]
- Rudin, C. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat. Mach. Intell. 2019, 1, 206–215. [Google Scholar] [CrossRef] [PubMed]
- Nearing, G.S.; Kratzert, F.; Sampson, A.K.; Pelissier, C.S.; Klotz, D.; Frame, J.M.; Prieto, C.; Gupta, H.V. What is the hydrologic significance of learning in a neural network? Water Resour. Res. 2021, 57, 2020–028703. [Google Scholar]
- Baranval, A.S.C.; Jeyaseelan, C.; Singh, G.C. Comprehensive Review on Application of Machine Learning Algorithms for Water Quality Parameter Estimation using Remote Sensing Data. Sens. Mater. 2020, 32, 3879–3892. [Google Scholar]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).