1. Introduction
Water is the source of life and a fundamental strategic resource for the survival and development of human society. The sustainable utilization of water resources and the maintenance of water environmental sustainability are core components of the United Nations 2030 Agenda for Sustainable Development and global sustainable development goals [1,2,3,4]. With the rapid advancement of global industrialization and urbanization and the intensive development of agricultural activities, water pollution has become an increasingly severe global sustainability challenge, threatening the integrity of aquatic ecosystems, drinking water safety, industrial and agricultural production, human well-being, and overall socioeconomic stability. Frequent sudden water pollution incidents (e.g., industrial spills and algal blooms) and the accumulation of long-term pollution (e.g., eutrophication and heavy metal deposition) have severely impaired the sustainability of water environments and aquatic ecosystems, underscoring the importance of efficient water environmental management, quantitative sustainability monitoring, and early risk warning for achieving sustainable development. Water pollution caused by human activities disrupts the ecological balance of water resources and hinders the achievement of sustainable water resource utilization goals [5,6,7,8]. Therefore, high-precision water quality prediction models are essential for mitigating the impacts of pollution, quantifying water environmental sustainability, and supporting integrated scientific approaches to sustainable development.
The permanganate index is the amount of potassium permanganate consumed when a water sample is oxidized under prescribed conditions, and it is regarded as a critical indicator of water pollution levels in environmental water quality monitoring [9,10,11]. Under the specified experimental conditions, reductive organic compounds and inorganic substances such as nitrites and sulfides present in water react with potassium permanganate and consume it [12,13,14]. Accordingly, the permanganate index is widely adopted as a comprehensive measure of the pollution intensity imposed by reductive components in aquatic environments. Quantifying this index characterizes the abundance and oxidative properties of organic pollutants in water bodies and thereby supports the assessment of their potential ecological risks [15,16,17,18]. Consequently, continuous monitoring and regulation of the permanganate index are of practical importance in industrial production and daily life. Furthermore, analyzing its temporal dynamics facilitates the timely identification of pollution sources and evolutionary trends, providing a scientific foundation for decision-making in environmental water management and remediation.
Existing water quality prediction models fall into two main categories: mechanistic models and non-mechanistic models. Although mechanistic models account for the physical, chemical, and biological factors that drive water quality changes, their complex structures and large data requirements limit their wide application. With technological advances, non-mechanistic models have become a research focus in water quality prediction. These include traditional linear prediction methods such as early statistical regression [19], probabilistic statistics, and gray system theory [20], as well as machine learning (ML) models. Among ML models, shallow architectures such as support vector machines (SVMs) [21], random forest (RF) [22], and extreme gradient boosting (XGBoost) [23] have been widely applied. While they capture certain nonlinear patterns, their ability to model long-term temporal dependencies remains limited.
In recent years, deep learning (DL) models have shown superior performance in time series forecasting. Recurrent neural networks (RNNs) [24], Long Short-Term Memory (LSTM) [25], and Bidirectional LSTM (BiLSTM) [26] can learn sequential dependencies, but they often suffer from vanishing or exploding gradients and insufficient feature extraction when dealing with highly non-stationary signals. Temporal convolutional networks (TCNs) [27] offer an alternative that uses dilated convolutions to capture multi-scale patterns, yet they may still be affected by noise and non-stationarity. Moreover, most existing DL models rely on manually tuned hyperparameters, which is time-consuming and often suboptimal.
Another common limitation is the neglect of signal preprocessing. Ensemble empirical mode decomposition (EEMD) has been shown to reduce non-stationarity by decomposing a signal into intrinsic mode functions (IMFs) [28]. However, few studies integrate EEMD with advanced DL architectures in a systematic and optimized manner for water quality prediction. Furthermore, the hyperparameter optimization problem is rarely addressed in an automated fashion, leaving model performance highly dependent on empirical settings.
Given the above, the key scientific problem is as follows: how to effectively reduce the non-stationarity of water quality time series, extract multi-scale temporal features, capture bidirectional dependencies, and simultaneously obtain optimal hyperparameters to achieve high prediction accuracy and robustness. Existing studies typically address only one or two of these aspects, lacking a holistic solution.
To fill this gap, this study proposes a novel hybrid model named RUN-EEMD-TCN-BiLSTM. Its main novelties are as follows: (1) EEMD preprocesses the raw permanganate index (CODMn) series to mitigate non-stationarity and noise; (2) TCN extracts multi-scale temporal features using dilated convolutions; (3) BiLSTM captures forward and backward temporal dependencies; and (4) the Runge–Kutta optimizer (RUN) automatically searches for the optimal hyperparameters (e.g., number of filters, kernel size, dropout rate, BiLSTM hidden units) to avoid manual trial and error.
The model is systematically validated using data from multiple monitoring sections in the Songhua River Basin. The model achieves MAE = 0.0829, RMSE = 0.1315, and R² = 0.9508, and ablation studies demonstrate the individual contribution of each component. Generalization experiments on three other sections within the same basin show R² > 0.93, confirming cross-site robustness.
2. Materials and Methods
2.1. Ensemble Empirical Mode Decomposition (EEMD)
EEMD is an improved signal decomposition method that extends Empirical Mode Decomposition (EMD) [29]. Its decomposition process is similar to that of EMD, but random noise is introduced in each trial. The specific steps are as follows:
Initialization: Select the original signal as the starting point of the first decomposition and treat it as the current signal to be decomposed into intrinsic mode functions (IMFs).
Iterative decomposition: In each trial, add a different realization of random noise to the signal, decompose the noisy signal to obtain IMFs, and repeat the process multiple times.
Result summarization: Average the corresponding IMFs obtained across all trials to get the final IMFs.
EEMD performs well on nonlinear and non-stationary signals and can better handle noise components in signals. Because a different noise realization is added in each trial, the IMFs obtained from the individual decompositions differ slightly. Averaging these results over the ensemble reduces the influence of noise on the decomposition and makes the final IMFs more stable and reliable [30].
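The noise-assisted ensemble step can be sketched in a few lines of Python. This is a minimal illustration, not the decomposition used in the study: `moving_average_split` is a hypothetical toy stand-in for EMD sifting (a real EMD extracts IMFs from spline envelopes and may return a different number of IMFs per trial), and `n_trials` and `noise_std` are assumed values.

```python
import math
import random


def moving_average_split(signal, window=5):
    """Toy stand-in for EMD sifting: split a signal into a fast and a slow part."""
    n = len(signal)
    half = window // 2
    slow = []
    for t in range(n):
        lo, hi = max(0, t - half), min(n, t + half + 1)
        slow.append(sum(signal[lo:hi]) / (hi - lo))
    fast = [s - m for s, m in zip(signal, slow)]
    return [fast, slow]


def ensemble_decompose(signal, decompose, n_trials=200, noise_std=0.1, seed=0):
    """Add fresh Gaussian noise in each trial, decompose, and average per component."""
    rng = random.Random(seed)
    totals = None
    for _ in range(n_trials):
        noisy = [s + rng.gauss(0.0, noise_std) for s in signal]
        comps = decompose(noisy)
        if totals is None:
            totals = [[0.0] * len(signal) for _ in comps]
        for acc, comp in zip(totals, comps):
            for t, v in enumerate(comp):
                acc[t] += v
    # Ensemble mean of each component over all trials.
    return [[v / n_trials for v in acc] for acc in totals]


signal = [math.sin(0.3 * t) + 0.5 * math.sin(2.0 * t) for t in range(64)]
components = ensemble_decompose(signal, moving_average_split)
reconstructed = [sum(vals) for vals in zip(*components)]
```

Because each trial's components sum to the noisy signal, the averaged components sum to the original signal plus the mean of the added noise, which shrinks as the number of trials grows.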
2.2. Temporal Convolutional Network (TCN)
TCN is a deep learning model specially designed for sequence data and is widely used in time series prediction, sequence classification and other tasks. By introducing causal convolution and dilated convolution, TCN can effectively capture long-term dependencies in sequences. The core idea of TCN is to extend the traditional Convolutional Neural Network (CNN) to process sequence data, replacing recurrent neural networks such as RNN and LSTM, and overcoming their gradient vanishing and explosion problems when processing long sequences.
TCN is an improved form of CNN, composed of causal convolution, dilated convolution and residual modules, which can effectively handle time series problems. Causal convolution ensures the causal time sequence when the feature information of data is extracted; dilated convolution allows interval sampling of convolution inputs, making neurons respond to input data in a wider area, which is conducive to TCN capturing longer time series dependencies; the residual module is used to alleviate the problem of gradient instability, solve the interference caused by the increase in network depth, and improve the accuracy of model prediction.
By introducing the dilation coefficient and causal coefficient, the convolution kernel is expanded to enlarge the receptive field of the model and strengthen its ability to capture long-range sequence dependencies. The computational formula is given in Equation (1) [31]:
$$y[t] = \sum_{m=0}^{K-1} w[m]\, x[t - d \cdot m] \quad (1)$$
In this formulation, y[t] denotes the output of the convolution operation at time step t, w[m] represents the m-th weight of the convolution kernel of size K, x[t − d·m] refers to the corresponding element in the input sequence, and d stands for the dilation rate.
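The dilated causal convolution described above can be sketched in plain Python. The sketch assumes zero padding before the start of the sequence and a single channel; the helper `receptive_field` assumes one convolution per dilation level, which is a simplification of a full TCN residual block.

```python
def dilated_causal_conv(x, w, d):
    """Compute y[t] = sum_m w[m] * x[t - d*m], treating x as zero before t = 0."""
    y = []
    for t in range(len(x)):
        acc = 0.0
        for m, weight in enumerate(w):
            idx = t - d * m
            if idx >= 0:  # causal: only current and past inputs contribute
                acc += weight * x[idx]
        y.append(acc)
    return y


def receptive_field(kernel_size, dilations):
    """Receptive field of stacked dilated convolutions, one layer per dilation."""
    return 1 + (kernel_size - 1) * sum(dilations)


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = dilated_causal_conv(x, w=[1.0, 1.0], d=2)  # y[t] = x[t] + x[t-2]
rf = receptive_field(kernel_size=3, dilations=[1, 2, 4])
```

With a kernel of size 3 and dilations 1, 2, and 4, three stacked layers already cover 15 time steps, which illustrates how dilation widens the receptive field without adding parameters.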
Meanwhile, to mitigate gradient vanishing and network degradation, these layers are stacked with residual connections, allowing the TCN to extract and fuse features across multiple time scales. The ReLU activation function employed in this module is defined in Equation (2), with detailed computational expressions provided in Equation (3). Equation (4) summarizes the convolutional layer stacking and residual connection mechanism of the TCN.
2.3. Bidirectional Long Short-Term Memory (BiLSTM) Network
The LSTM network is an advanced recurrent neural network (RNN) designed to overcome the limitations of traditional RNNs, especially the vanishing gradient problem, and is therefore able to model long-term dependencies in sequential data [32].
As a special RNN structure [33], the LSTM cell consists of a forget gate, an input gate, and an output gate, arranged from left to right. X(t) is the input at time t, h(t) is the hidden-layer output state at time t, and C(t) is the memory cell state at time t. $f_t$, $i_t$, and $O_t$ are the outputs of the forget gate, input gate, and output gate, respectively; tanh is the activation function. These gating mechanisms use the sigmoid function to control the degree of information flow [34].
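For reference, the gate computations named above are conventionally written as follows. This is the textbook LSTM formulation, offered as a sketch rather than a reproduction of the cited equations; the weight matrices $W$, $U$ and biases $b$ are notation assumed here, with $x_t$, $h_t$, and $C_t$ standing for X(t), h(t), and C(t), and $\sigma$ denoting the sigmoid function.

```latex
\begin{aligned}
f_t &= \sigma\left(W_f x_t + U_f h_{t-1} + b_f\right) \\
i_t &= \sigma\left(W_i x_t + U_i h_{t-1} + b_i\right) \\
\tilde{C}_t &= \tanh\left(W_c x_t + U_c h_{t-1} + b_c\right) \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \\
O_t &= \sigma\left(W_o x_t + U_o h_{t-1} + b_o\right) \\
h_t &= O_t \odot \tanh\left(C_t\right)
\end{aligned}
```

The additive update of $C_t$ is what lets gradients flow across many time steps, which is why the LSTM is less prone to vanishing gradients than a plain RNN.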
Although the LSTM model is very effective in predicting nonlinear and extended time patterns, its traditional architecture processes information in a single time direction and relies only on past data, so it may not fully capture the temporal dependencies in the data [35]. To overcome this limitation, the BiLSTM architecture is introduced. BiLSTM consists of two parallel LSTM layers that process the input sequence in opposite directions: one layer processes the sequence from start to finish (forward), and the other processes it from finish to start (backward). This bidirectional approach enables the network to use both forward and backward temporal information at each time step, thereby improving the model’s ability to capture dependencies and ultimately its prediction accuracy. BiLSTM computes the bidirectional hidden state by running the LSTM layers forward and backward along the time axis, and predicts the value at the current moment by combining the features from the two directions. Its dynamics are governed by the following equations [36]:
where, for time step t, $\overrightarrow{h}_t$ is the output of the forward LSTM, $\overleftarrow{h}_t$ is the output of the backward LSTM, and $h_t$ represents the output of BiLSTM at time step t.
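The bidirectional computation just described is commonly written in the following form. The symbols $\overrightarrow{h}_t$, $\overleftarrow{h}_t$, and $h_t$ denote the forward output, backward output, and BiLSTM output at time step t. Combining the two directions by concatenation is an assumption made for this sketch; some formulations instead use a learned weighted sum of the two states.

```latex
\begin{aligned}
\overrightarrow{h}_t &= \mathrm{LSTM}\left(x_t, \overrightarrow{h}_{t-1}\right) \\
\overleftarrow{h}_t &= \mathrm{LSTM}\left(x_t, \overleftarrow{h}_{t+1}\right) \\
h_t &= \left[\overrightarrow{h}_t \,;\, \overleftarrow{h}_t\right]
\end{aligned}
```

Note that the backward recurrence depends on $\overleftarrow{h}_{t+1}$, so within an input window every output position sees the whole window rather than only its past.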
2.4. TCN-BiLSTM Model
To further improve the accuracy and robustness of time series prediction, a joint model of the Temporal Convolutional Network (TCN) and the Bidirectional Long Short-Term Memory network (BiLSTM) is constructed to capture both the local features and the long-term dependencies of time series data. This joint model is composed of three functional modules: the TCN feature extraction module, the BiLSTM temporal modeling module, and the fully connected output module. Specifically, the TCN module adopts a two-layer convolution structure, where the first layer uses standard convolution for feature extraction and the second layer employs dilated convolution to expand the receptive field. Each convolution layer is followed by a ReLU activation function and a Dropout layer to prevent overfitting. The BiLSTM module consists of two LSTM layers (forward and backward) to capture bidirectional temporal features. Finally, the features are mapped to the prediction space through the fully connected layer. The model structure is shown in Figure 1.
2.5. Runge–Kutta Optimization Algorithm (RUN)
The Runge–Kutta optimizer (RUN) is an intelligent optimization algorithm proposed in 2021 [37]. It uses the slope calculation of the Runge–Kutta method as a search mechanism to guide the optimization, and is characterized by strong optimization ability and fast convergence [38,39,40].
The structures of the LSTM and BiLSTM networks are shown in Figure 2 and Figure 3, respectively.
The specific steps for optimizing model hyperparameters using the Runge–Kutta algorithm are as follows:
Step 1: Initialize experimental parameters and hyperparameter population.
Set the core parameters of the RUN algorithm: the population size (each individual in the population corresponds to one hyperparameter combination) and the maximum number of iterations, chosen large enough for the algorithm to converge. Define the search ranges of the hyperparameters, and initialize the hyperparameter combinations by uniform random sampling within these ranges to form the initial population.
Step 2: Calculate fitness values of the initial population.
The fitness function directly reflects the prediction error of the model on the validation set. For each hyperparameter combination in the initial population, train the TCN-BiLSTM model with that combination, predict on the validation set using the trained model, and compute the Mean Absolute Error (MAE) as the fitness value:
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|\hat{y}_i - y_i\right|$$
where $n$ is the number of samples in the validation set, and $\hat{y}_i$ and $y_i$ are the predicted and true values, respectively.
A smaller fitness value indicates a better hyperparameter combination. Record the optimal fitness value in the initial population and its corresponding hyperparameter combination.
Step 3: Iteratively update the hyperparameter population via the RUN algorithm.
Each individual in the population is iteratively updated using the fourth-order Runge–Kutta update formula to guide the hyperparameters toward the optimal direction. The detailed update process is as follows:
Calculate the four slopes $k_1$, $k_2$, $k_3$, and $k_4$ for each individual; the search direction vectors used in this calculation are randomly generated from a normal distribution and have the same dimension as the number of hyperparameters.
Update the individual position (hyperparameter combination) according to the fourth-order Runge–Kutta formula, as shown below:
$$x_i^{t+1} = x_i^{t} + \frac{h}{6}\left(k_1 + 2k_2 + 2k_3 + k_4\right)$$
where $x_i^{t}$ denotes the hyperparameter combination of the $i$-th individual at the $t$-th iteration, $x_i^{t+1}$ denotes the updated hyperparameter combination at the $(t+1)$-th iteration, and $h$ is the step size of the RUN algorithm.
Hyperparameter boundary constraints: Given the definite search ranges of hyperparameters, clip the updated hyperparameter combination to ensure all hyperparameter values fall within the preset ranges, preventing model training failure caused by invalid hyperparameters.
Step 4: Update fitness values and optimal hyperparameters.
After the population update, recalculate the fitness value of each individual. Compare the optimal fitness value of the current population with the historical optimal fitness value: if the current optimal value is smaller, update the historical optimum and its corresponding hyperparameter combination; otherwise, retain the historical optimal value.
Step 5: Judge the iteration termination condition.
Check whether the current number of iterations has reached the maximum number of iterations, or whether the variation in the optimal fitness value is less than the preset threshold. If either condition is satisfied, terminate the iteration; otherwise, return to Step 3 for further iterative updates.
Step 6: Output optimal hyperparameters and train the final model.
Upon iteration termination, output the globally optimal hyperparameter combination. Substitute this set of hyperparameters into the TCN-BiLSTM model, train the model using the full training set, and obtain the final optimized TCN-BiLSTM model for subsequent prediction experiments.
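Steps 1 to 6 can be sketched as a small search loop. This is a simplified illustration, not the full RUN algorithm or the authors' implementation. Several parts are assumptions: the slope rule in `slope` (a normally distributed random pull toward the best individual), the default settings, and `toy_fitness`, which stands in for training a TCN-BiLSTM and returning its validation MAE. The two bounds are hypothetical ranges for a kernel size and a dropout rate.

```python
import random


def add_scaled(x, k, scale):
    """Return x + scale * k, element-wise."""
    return [xi + scale * ki for xi, ki in zip(x, k)]


def run_style_search(fitness, bounds, pop_size=10, max_iter=30, step=0.5,
                     tol=0.0, seed=0):
    rng = random.Random(seed)

    def clip(x):
        # Boundary constraint: keep every hyperparameter inside its range.
        return [min(max(v, lo), hi) for v, (lo, hi) in zip(x, bounds)]

    def slope(point, best):
        # ASSUMPTION: slope = normal random direction scaled by the gap to the best.
        return [rng.gauss(0.0, 1.0) * (b - p) for b, p in zip(best, point)]

    # Step 1: uniform random initial population.
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    # Step 2: initial fitness values and best individual.
    scores = [fitness(x) for x in pop]
    best_score = min(scores)
    best_x = list(pop[scores.index(best_score)])
    history = [best_score]

    for _ in range(max_iter):
        new_pop = []
        for x in pop:
            # Step 3: four staged slopes combined with the classical RK4 weights.
            k1 = slope(x, best_x)
            k2 = slope(add_scaled(x, k1, step / 2), best_x)
            k3 = slope(add_scaled(x, k2, step / 2), best_x)
            k4 = slope(add_scaled(x, k3, step), best_x)
            combined = [a + 2 * b + 2 * c + d
                        for a, b, c, d in zip(k1, k2, k3, k4)]
            new_pop.append(clip(add_scaled(x, combined, step / 6)))
        pop = new_pop
        # Step 4: re-evaluate and keep the historical best.
        scores = [fitness(x) for x in pop]
        previous, current = best_score, min(scores)
        if current < best_score:
            best_score = current
            best_x = list(pop[scores.index(current)])
        history.append(best_score)
        # Step 5: optional early stop; tol=0.0 disables it.
        if abs(previous - best_score) < tol:
            break
    # Step 6: return the best combination found.
    return best_x, best_score, history


bounds = [(1.0, 8.0), (0.0, 0.5)]  # hypothetical: kernel size, dropout rate


def toy_fitness(x):
    # Hypothetical stand-in for the validation MAE of a trained model.
    return abs(x[0] - 3.0) + abs(x[1] - 0.2)


best_x, best_score, history = run_style_search(toy_fitness, bounds)
```

In practice, integer-valued hyperparameters such as the number of filters or the kernel size would be rounded before the model is built, and each fitness evaluation would involve a full training run, which is where most of the computational cost of the search arises.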
2.6. Construction of the River Water Quality Prediction Model
Water quality data usually exhibit high volatility, strong randomness, and a lack of periodicity, requiring a model with strong learning ability that can effectively capture their variation patterns [41]. The TCN-BiLSTM model combines the feature extraction capability of TCN with the bidirectional modeling capability of BiLSTM. This hybrid not only inherits the parallel processing and multi-scale feature extraction of TCN, but also leverages the gated structure of LSTM to handle complex temporal dependencies. To avoid manual parameter tuning, the Runge–Kutta optimization algorithm (RUN) is employed to automatically search for key hyperparameters, including the number of filters, kernel size, dropout rate, and the number of BiLSTM hidden units. RUN improves prediction accuracy and generalization while reducing the uncertainty of manual configuration. The overall technical roadmap of the research is shown in Figure 4. The specific steps of the proposed RUN-EEMD-TCN-BiLSTM model are as follows:
Collect raw water quality data and use linear interpolation to fill missing values and correct outliers.
Two methods are adopted for feature selection: Spearman correlation analysis is used to select features with strong correlations, and SHAP analysis based on the random forest model is used to rank feature importance. The top-ranked features are comprehensively selected as model inputs.
Decompose the permanganate index (CODMn) sequence using EEMD to obtain K intrinsic mode functions (IMFs).
For each IMF component, build a TCN-BiLSTM model whose hyperparameters are optimized by the RUN algorithm. The input to each model consists of the selected external features and the respective IMF component as the target.
Sum the prediction outputs of all IMF components to obtain the final CODMn prediction.
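The first two steps of this workflow, gap filling by linear interpolation and Spearman-based feature screening, can be sketched as follows. The series values and the feature are made-up illustrative numbers, and holding the nearest known value at the ends of the series is an assumption, since the text does not say how leading or trailing gaps are handled.

```python
def fill_linear(series):
    """Fill None gaps by linear interpolation between the nearest known values."""
    known = [i for i, v in enumerate(series) if v is not None]
    if not known:
        raise ValueError("series has no known values")
    filled = list(series)
    for i, v in enumerate(series):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:        # ASSUMPTION: hold the first known value
            filled[i] = series[right]
        elif right is None:     # ASSUMPTION: hold the last known value
            filled[i] = series[left]
        else:
            w = (i - left) / (right - left)
            filled[i] = series[left] + w * (series[right] - series[left])
    return filled


def ranks(values):
    """Average ranks (1-based), with ties sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def pearson(a, b):
    """Pearson correlation; assumes neither input is constant."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5


def spearman(a, b):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))


codmn = [4.1, None, 4.5, 5.0, None, None, 5.6]   # illustrative values only
feature = [9.0, 8.7, 8.5, 8.1, 7.9, 7.4, 7.0]    # hypothetical candidate input
filled = fill_linear(codmn)
rho = spearman(filled, feature)
```

Features whose absolute Spearman coefficient with the target is large would then be kept as candidates and cross-checked against the SHAP ranking described in the second step.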
Figure 4. Technical roadmap of the study.
2.7. SHAP
Machine learning models are often regarded as “black boxes” due to their limited interpretability in practical applications. To address this limitation, explainable artificial intelligence (XAI) algorithms have received increasing attention in recent years. In this study, the SHapley Additive exPlanations (SHAP) method, rooted in cooperative game theory, is employed to quantify the contribution of individual input variables to the model output. SHAP assigns an importance value (i.e., SHAP value) to each feature by calculating its marginal contribution to the model’s prediction. For a given instance, the SHAP value quantifies the extent to which a specific feature increases or decreases the predicted permanganate index relative to the model’s baseline output. By averaging the SHAP values across all samples, we quantify the relative contributions of the predictor variables to the model output and identify the most influential variables for surface water permanganate index prediction. The mathematical formulation of the SHAP-based interpretation is given by
$$\phi_i^{(j)} = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!\,\left(|F| - |S| - 1\right)!}{|F|!}\left[f_{S \cup \{i\}}(x_j) - f_S(x_j)\right]$$
$\phi_i^{(j)}$: the SHAP value of the i-th input feature for the j-th sample, representing the marginal contribution of the feature to the predicted permanganate index;
F: the set of all input features, with |F| denoting the total number of features;
S: a subset of F that excludes the i-th feature;
$f_S(x_j)$: the model’s predicted output for the j-th sample when only the features in subset S are used;
$f_{S \cup \{i\}}(x_j)$: the model’s predicted output for the j-th sample after adding the i-th feature to subset S.
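The weighted sum over feature subsets can be evaluated exactly for a small model, which makes the definition concrete. The sketch below is a brute-force illustration under stated assumptions: features outside the subset S are replaced by a `baseline` value, and `toy_model` is a hypothetical three-feature function rather than the random forest used in the study. Practical SHAP implementations rely on faster model-specific algorithms.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values by enumerating every subset of the other features."""
    n = len(x)

    def value(subset):
        # ASSUMPTION: features outside the subset take their baseline value.
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return model(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def toy_model(z):
    # Hypothetical model with one interaction term between features 0 and 2.
    return 2.0 * z[0] + 1.0 * z[1] + 0.5 * z[0] * z[2]


x = [1.0, 3.0, 4.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
```

The interaction term is split equally between features 0 and 2, and the values sum to the difference between the prediction for `x` and the baseline prediction, which is the additivity property that gives SHAP its name.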
2.8. Evaluation Metrics
MATLAB was used to calculate the coefficient of determination (R²), Root Mean Square Error (RMSE), and Mean Absolute Error (MAE) of the test set to evaluate the prediction accuracy and generalization ability of the model [41]. R² measures the model’s ability to explain data variation; the closer the value is to 1, the better the model fit, and a model with R² exceeding 0.8 is generally considered to have high goodness of fit. RMSE reflects the deviation between the predicted and true values and is more sensitive to large errors; MAE is the average of the absolute values of the prediction errors and is more robust to outliers. Generalization ability usually refers to the prediction performance of a machine learning algorithm on new data or samples not involved in training, reflecting how well the learned model adapts to unseen samples under the independent and identically distributed assumption. The closer the R², RMSE, and MAE of the test set are to those of the training set, the stronger the generalization ability of the model. The calculation formulas of the three metrics are as follows:
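$$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$
where $n$ is the number of samples; $\hat{y}_i$ is the predicted value of the i-th sample; $y_i$ is the measured value of the i-th sample; and $\bar{y}$ is the average value of all measured values.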
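The three metrics are straightforward to compute from paired measured and predicted values. The study reports using MATLAB for this; the Python below is only an illustrative equivalent, and the sample values are made up.

```python
import math


def mae(y_true, y_pred):
    """Mean Absolute Error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root Mean Square Error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [3.0, 4.0, 5.0, 6.0]   # illustrative measured values
y_pred = [2.5, 4.0, 5.5, 6.0]   # illustrative predictions
scores = {"MAE": mae(y_true, y_pred),
          "RMSE": rmse(y_true, y_pred),
          "R2": r2(y_true, y_pred)}
```

For these values the MAE is 0.25, the RMSE is about 0.354, and R² is 0.9. The RMSE exceeds the MAE because squaring gives the two 0.5 errors more weight than the two exact predictions.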
4. Conclusions
Based on water quality monitoring data from the Sandao section of the Songhua River Basin in Mudanjiang City, Heilongjiang Province, this study addresses the problems of non-stationarity, nonlinearity, and strong temporal dependence in water quality time series, which lead to low accuracy of traditional prediction methods. A hybrid prediction model integrating the Runge–Kutta optimization algorithm (RUN), ensemble empirical mode decomposition (EEMD), Temporal Convolutional Network (TCN), and Bidirectional Long Short-Term Memory (BiLSTM) is proposed. The main conclusions are as follows:
(1) The proposed RUN-EEMD-TCN-BiLSTM model effectively overcomes the shortcomings of traditional models, such as gradient vanishing, insufficient feature extraction, and weak generalization ability. The RUN algorithm provides automatic hyperparameter optimization, EEMD reduces data non-stationarity, and BiLSTM captures bidirectional temporal dependencies, offering a new and effective method for accurate water quality prediction.
(2) In the prediction task of the permanganate index (CODMn) at the Sandao section, the proposed model significantly outperforms baseline models (BiLSTM, BP, LSTM, XGBoost, TCN) in terms of MAE, RMSE, and R². Generalization experiments conducted on multiple sections within the Songhua River Basin further verify the stability and applicability of the model, demonstrating its capability to meet the requirements of real-time water quality monitoring and control.
(3) Extracting intrinsic mode functions using EEMD decomposition can effectively mitigate the interference of non-stationarity in water quality data, thereby improving the prediction performance of the model.
(4) Optimization with the RUN algorithm enhances prediction accuracy and generalization ability, while reducing errors caused by manual parameter tuning.
Despite these achievements, certain limitations exist in this study. First, the model was developed and validated only for the permanganate index (CODMn); its predictive performance for other water quality parameters (e.g., total phosphorus, total nitrogen, ammonia nitrogen) has not been examined. Second, the generalization experiments were conducted only on multiple sections within the Songhua River Basin, without validation using data from other basins (e.g., the Yangtze River or Yellow River). Therefore, the cross-basin generalization capability of the model under different hydrological, climatic, and pollution characteristics remains unclear. Third, this study focuses only on short-term prediction and does not incorporate external driving factors such as meteorology, hydrology, or pollution source emissions. The model’s adaptability to medium-to-long-term trends and extreme pollution events remains limited. Additionally, the EEMD decomposition and RUN optimization algorithm introduce certain computational costs.
Future work will extend the model to predict multiple water quality indicators, integrate multi-source spatiotemporal data, develop lightweight network architectures to improve computational efficiency, and validate the approach across more watersheds, thereby enhancing the model’s practicality and applicability.