Article

Comparative Analysis of Machine/Deep Learning Models for Single-Step and Multi-Step Forecasting in River Water Quality Time Series

Hongzhe Fang, Tianhong Li * and Huiting Xian
1 College of Environmental Sciences and Engineering, Peking University, Beijing 100871, China
2 State Environmental Protection Key Laboratory of All Material Fluxes in River Ecosystems, Beijing 100871, China
3 Center for Habitable Intelligent Planet, Institute of Artificial Intelligence, Peking University, Beijing 100871, China
4 Guangzhou Water Affairs Bureau, Guangzhou 610072, China
* Author to whom correspondence should be addressed.
Water 2025, 17(13), 1866; https://doi.org/10.3390/w17131866
Submission received: 16 May 2025 / Revised: 18 June 2025 / Accepted: 22 June 2025 / Published: 23 June 2025

Abstract

There is a lack of a systematic comparison framework that can assess models in both single-step and multi-step forecasting situations while balancing accuracy, training efficiency, and prediction horizon. This study aims to evaluate the predictive capabilities of machine learning and deep learning models in water quality time series forecasting. It used 22 months of data collected at 4 h intervals from two monitoring stations located on a tributary of the Pearl River. Seven models, specifically Support Vector Regression (SVR), XGBoost, K-Nearest Neighbors (KNN), Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM) Network, Gated Recurrent Unit (GRU), and PatchTST, were employed in this study. In single-step forecasting, the LSTM Network achieved superior accuracy for a univariate feature set, attaining an overall 22.0% (Welch’s t-test, p = 3.03 × 10−7) reduction in Mean Squared Error (MSE) compared with the machine learning models (SVR, XGBoost, KNN), while the RNN required significantly less training time. For a multivariate feature set, the deep learning models exhibited comparable accuracy, with no model achieving a significant improvement over the univariate scenario. The KNN model underperformed across the error metrics, with the lowest accuracy, and the XGBoost model exhibited the highest computational cost. In multi-step forecasting, the direct multi-step PatchTST model outperformed the iterated multi-step models (RNN, LSTM, GRU), showing a reduced time-delay effect and a slower decline in accuracy as the prediction length increased, although it still requires task-specific adjustments to be better suited to river water quality time series forecasting. The findings provide actionable guidelines for model selection, balancing predictive accuracy, training efficiency, and forecasting horizon requirements in environmental time series analysis.

1. Introduction

River water quality not only affects the aquatic ecosystem in the river basin but also impacts the health of residents along the riverbanks. With rapid economic development and accelerated urbanization, water pollution in river basins has become increasingly severe. Modeling and accurately predicting changes in river water quality indicators can support river basin environmental management and decision-making through early-warning mechanisms, as one component of a comprehensive suite of approaches [1]. In recent years, automatic water quality monitoring technology has developed rapidly, and the monitored items mainly include water temperature, pH, dissolved oxygen, conductivity, turbidity, permanganate index, total organic carbon, and ammonia nitrogen [2]. According to the Technical Specifications for Automatic Monitoring of Surface Water (HJ 915-2017) [3] issued by the Ministry of Environmental Protection of China, the data collection frequency for automatic surface water quality monitoring is generally stipulated as once every 4 h (except in emergency situations). Such large volumes of multi-dimensional, long-term water quality data place high demands on time series forecasting capability, which is not only a technical tool in environmental science but also a critical link connecting ecological conservation, public health, economic efficiency, and societal governance [4]. The core value of water quality time series forecasting lies in enabling data-driven decision-making, shifting from “reactive remediation” to “proactive prevention,” and providing scientific support for establishing dynamic early-warning mechanisms in water resource management and pollution monitoring.
Traditional water quality models are often limited by extensive data requirements, difficult model operation, and limited applicability to different types of water bodies [5,6], which makes them difficult to apply across different regions at low cost. Machine/deep learning methods, with their flexibility in handling complex nonlinear relationships, have been widely applied in this field [7,8]. Studies have shown that models such as LSTM [9,10] or other neural networks [11,12] can accurately simulate water quality dynamics, and hybrid approaches combining data preprocessing or parameter optimization algorithms can further enhance prediction performance. The rise of Transformer-based models [13,14,15,16], though originally designed for natural language processing, has also brought new perspectives to time series prediction. Subsequently, a simple linear model called DLinear [17] outperformed all the aforementioned Transformer-based models on multiple datasets, which implies that applying these models to time series prediction requires specific techniques rather than merely designing increasingly complex architectures. Specifically in water quality time series prediction, numerous promising applications have emerged, all demonstrating superior performance compared to conventional machine/deep learning models [18,19,20].
A critical gap in the existing research lies in the lack of a systematic comparison framework that evaluates models across both single-step and multi-step forecasting scenarios while balancing accuracy, training efficiency, and prediction horizon. Most studies focus either on short-term predictions (neglecting the long-term planning needs of river management) or on complex model architectures without fully assessing the competitive edge of simpler models. For example, RNNs often perform only slightly worse than models like LSTM in water quality time series forecasting, while their computational efficiency is much higher [21]. In addition, multi-step forecasting holds greater practical value than single-step forecasting but is also more difficult. There are two basic strategies for multi-step forecasting: iterated multi-step (IMS) forecasting and direct multi-step (DMS) forecasting. However, compared with DMS forecasting models, IMS forecasting models often suffer from monotonic prediction biases and rapid accuracy degradation as the forecast length increases, limiting their practical utility for scenarios requiring long-term trend analysis [22].
Against this backdrop, this study conducts a comprehensive performance comparison of common machine/deep learning models using nearly two years of gauged data at a 4 h interval from two monitoring stations in a tributary of the Pearl River. By systematically evaluating the models in both single-step and multi-step forecasting tasks, this study aims to address two key limitations: (1) the lack of clarity on how model simplicity and complexity trade off between accuracy and efficiency across different prediction horizons and (2) the underutilization of direct multi-step forecasting models that can mitigate time-delay effects and maintain stable performance for long-term predictions. The findings seek to provide a data-driven reference for model selection in river water quality forecasting, highlighting the practical value of matching model architecture with specific management needs.

2. Study Framework

2.1. Study Area

The water quality data used in this study were obtained from two automatic water quality monitoring stations located about 4.1 km apart on an approximately 100 m-wide tributary of the Baini River in Guangdong Province, namely the XHY Station (upstream) and the TMH Station (downstream). The locations of the two monitoring stations are shown in Figure 1. The Baini River is part of the Pearl River system, one of the major river systems in China. The total area of the Baini River basin is approximately 1493 square kilometers, and the total length of the main stream within the territory of Guangzhou is 33 km.
The water quality data cover a time series collected at 4 h intervals from August 2021 to December 2023, spanning more than two years. The monitored water quality indicators include transparency, dissolved oxygen, oxidation-reduction potential (ORP), water level, ambient temperature, environmental humidity, and ammonia nitrogen. The collection and analysis methods follow the Environmental Quality Standards for Surface Water (GB 3838-2002) [23].

2.2. Data Preprocessing and Description

Before missing-value processing, the datasets of the downstream station and the upstream station both consisted of 5169 records. Missing values were filled by linear interpolation, i.e., each missing value was replaced with a linear combination of the two closest known data points on either side of it. In this process, 199 and 177 records with missing or invalid values were interpolated for the downstream and upstream stations, respectively. Subsequently, to improve the forecasting performance of specific models, all the data were standardized (Z-score normalization) according to Equation (1):
$$X_{Norm} = \frac{X - X_{Mean}}{X_{Std}}, \tag{1}$$
where $X$ is the raw value, $X_{Mean}$ is the column mean of the raw values, and $X_{Std}$ is the column standard deviation of the raw values ($X_{Norm} \in \mathbb{R}^{samples \times features}$).
The water quality data of each station incorporate 7 water quality indicators, specifically transparency, dissolved oxygen, ORP, ammonia nitrogen (AN), water level, ambient temperature, and humidity. Because the datasets for the indicators other than ammonia nitrogen were less complete, the experiments used the ammonia nitrogen concentration at the downstream station as the prediction target and treated the upstream station’s data as candidate features.
The AN series of the two stations and their descriptive statistics are shown in Table 1. The detection limit of AN is 0.05 mg/L, and values below the detection limit were set to half of the detection limit. As can be seen in Figure 2, the AN concentration at the downstream station exceeds 2.5 mg/L at more than half of the time points. In addition, AN shows an increasing trend in autumn and winter.
Statistical analyses and data interpolation were conducted in Python version 3.9.0 using the packages pandas (version 1.2.4) and scikit-learn (version 1.6.0) [24].
In the present study, only ammonia nitrogen was retained as the input feature to obtain the best performance. In the following experiments, this study compared the models’ performance under the two different feature subsets—one includes only the downstream station’s data (univariate), and the other includes the two stations’ data (multivariate). The dataset was split into training and testing sets; specifically, the first 80% (4098 samples) of time points were used for training, and the subsequent 20% (1039 samples) were used for testing.
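A minimal sketch of this preprocessing pipeline (linear interpolation, Z-score normalization, and the chronological 80/20 split), assuming the raw records are loaded into a pandas DataFrame; the file name and column names used below are hypothetical, since the original data are not public:

```python
import pandas as pd

# Hypothetical file and column names; the original station data are not public.
df = pd.read_csv("water_quality_4h.csv", parse_dates=["time"], index_col="time")

# Linear interpolation: each missing value is replaced by a linear combination
# of the closest known points before and after it.
df = df.interpolate(method="linear", limit_direction="both")

# Z-score normalization (Equation (1)): subtract the column mean, divide by the column std.
df_norm = (df - df.mean()) / df.std()

# Chronological split: first 80% of time points for training, last 20% for testing.
split = int(len(df_norm) * 0.8)
train, test = df_norm.iloc[:split], df_norm.iloc[split:]
```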

2.3. Machine/Deep Learning Models

The machine learning models involved in our research include Support Vector Regression (SVR), K-Nearest Neighbors (KNN), and XGBoost, which generally treat all past observations in a sequence equally. The deep learning models involved in our research include the Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM), and Gated Recurrent Unit (GRU), which are all latent autoregressive models (Figure 3) that maintain a summary $h_t$ of past observations and update both $h_t$ and the prediction $\hat{x}_t$ at each step [25]. A brief introduction to the above models is as follows.

2.3.1. Machine Learning Models

Support Vector Regression (SVR) is a branch of the Support Vector Machine (SVM). SVM was first developed by Vapnik [26] for classification tasks, and SVR extends it to regression analysis. A traditional regression model uses the mean squared error $\frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2$ as its loss function, which directly measures the difference between the predicted and true values. SVR instead tolerates a deviation of $\epsilon$ between the two: a loss is incurred only when $|\hat{y} - y|$ exceeds $\epsilon$. Moreover, according to the mathematical derivation [27], if the offset term is not taken into account, the learned SVR model can always be represented as a linear combination of kernel functions.
The K-Nearest Neighbors (KNN) method predicts the value of a sample based on its neighboring instances in the feature space. For regression tasks, it uses the average value of the k nearest neighbors as the prediction result. KNN does not require an explicit training process—all computations occur during the prediction phase. It makes no assumptions about the data distribution, making it suitable for various data types [27].
The XGBoost model [28] performs well on large datasets and achieves high accuracy. Whereas the random forest sampling strategy gives every sample in the original training dataset the same probability of being selected (1/n), XGBoost builds decision trees sequentially, with each new tree concentrating on the samples that the previous trees predicted poorly by fitting the residual errors of the current ensemble (gradient boosting).
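The three machine learning models can be applied to single-step forecasting by flattening each look-back window into a feature vector. A minimal sketch using scikit-learn and the xgboost package, assuming `train` and `test` are the normalized DataFrames from the preprocessing sketch in Section 2.2 (the column name `an_downstream` and the hyper-parameter values are illustrative, not the Optuna-tuned settings in Table S1):

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from xgboost import XGBRegressor

def make_windows(series, lookback=60):
    """Turn a 1-D series into (samples, lookback) inputs and next-step targets."""
    X = np.array([series[i:i + lookback] for i in range(len(series) - lookback)])
    y = series[lookback:]
    return X, y

X_train, y_train = make_windows(train["an_downstream"].to_numpy())
X_test, y_test = make_windows(test["an_downstream"].to_numpy())

models = {
    "SVR": SVR(kernel="rbf", C=1.0, epsilon=0.1),           # epsilon-insensitive loss
    "KNN": KNeighborsRegressor(n_neighbors=5),               # average of the k nearest windows
    "XGBoost": XGBRegressor(n_estimators=300, max_depth=6),  # boosted regression trees
}
preds = {name: m.fit(X_train, y_train).predict(X_test) for name, m in models.items()}
```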

2.3.2. Deep Learning Models

The computational logic of an RNN at adjacent time steps can be explained by Equation (2): the hidden layer output $H_t$ of each time step depends not only on the current input $X_t$ but also on the hidden layer output $H_{t-1}$ of the previous time step.
$$H_t = \Phi(W_1 X_t + W_h H_{t-1} + b_1), \quad O_t = W_2 H_t + b_2, \tag{2}$$
where $X_t$ is the input feature at time step $t$ and $\Phi$ is an activation function.
LSTM is an attempt to solve the problem of long-term information retention and short-term input loss in RNNs [29]. LSTM resembles the RNN, and its computational logic can be explained by Equation (3):
$$\begin{aligned} F_t &= \sigma(W_{xf} X_t + W_{hf} H_{t-1} + b_f), \\ I_t &= \sigma(W_{xi} X_t + W_{hi} H_{t-1} + b_i), \\ O_t &= \sigma(W_{xo} X_t + W_{ho} H_{t-1} + b_o), \\ \tilde{C}_t &= \tanh(W_{xc} X_t + W_{hc} H_{t-1} + b_c), \\ C_t &= F_t \odot C_{t-1} + I_t \odot \tilde{C}_t, \\ H_t &= O_t \odot \tanh(C_t), \end{aligned} \tag{3}$$
where $X_t$ is the input feature of time step $t$, $\sigma$ is an activation function (usually the sigmoid), and $\odot$ denotes the element-wise (Hadamard) product.
GRU [30] is a slightly simplified variant of the LSTM, which can achieve comparable performance but is often faster to compute. Its computational logic can be explained by Equation (4):
$$\begin{aligned} R_t &= \sigma(W_{xr} X_t + W_{hr} H_{t-1} + b_r), \\ Z_t &= \sigma(W_{xz} X_t + W_{hz} H_{t-1} + b_z), \\ \tilde{H}_t &= \tanh(W_{xh} X_t + W_{hh} (R_t \odot H_{t-1}) + b_h), \\ H_t &= Z_t \odot H_{t-1} + (1 - Z_t) \odot \tilde{H}_t. \end{aligned} \tag{4}$$
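The three recurrent architectures can share a single PyTorch skeleton in which only the cell type changes; a minimal sketch, with the hidden size and optimizer settings chosen for illustration rather than taken from Table S1:

```python
import torch
from torch import nn

class RecurrentForecaster(nn.Module):
    """Latent autoregressive forecaster: a recurrent cell keeps a hidden summary
    of past observations and a linear head maps it to the next value."""
    def __init__(self, n_features=1, hidden_size=64, cell=nn.LSTM):
        super().__init__()
        self.rnn = cell(input_size=n_features, hidden_size=hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x):               # x: (batch, lookback, n_features)
        out, _ = self.rnn(x)            # out: (batch, lookback, hidden_size)
        return self.head(out[:, -1])    # predict the next step from the last hidden state

model = RecurrentForecaster(cell=nn.GRU)   # swap in nn.RNN or nn.LSTM as needed
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
```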
The PatchTST model [31] is a state-of-the-art deep learning architecture designed for time series forecasting. It addresses the limitations of traditional Transformer models in handling long sequences by segmenting the input series into subseries-level patches that serve as input tokens and by modeling each channel independently, enabling efficient capture of both short-term patterns and long-range dependencies.
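For intuition, the patch-partitioning step can be sketched as follows, using a patch length of 16 and a stride of 8 as assumed example values (the settings actually used in this study are the tuned ones in Table S1):

```python
import numpy as np

def make_patches(window, patch_len=16, stride=8):
    """Split a look-back window into overlapping patches (PatchTST-style input tokens)."""
    starts = range(0, len(window) - patch_len + 1, stride)
    return np.stack([window[s:s + patch_len] for s in starts])

patches = make_patches(np.arange(104, dtype=float))   # L = 104, as in Section 2.4
print(patches.shape)   # (12, 16): 12 patch tokens of length 16
```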
The SVR and XGBoost models were implemented in Python version 3.9.0 using the packages scikit-learn (version 1.6.0) [24] and XGBoost (version 1.7.3), respectively. The RNN, LSTM, and GRU models were implemented using the package PyTorch (version 2.0.0), and the PatchTST model was implemented using the package tsai (version 0.3.9) [32].

2.4. Experimental Design

In the single-step forecasting comparison experiment, this study compares the performance of the SVR, XGBoost, KNN, RNN, LSTM, and GRU models. Subsequently, the models exhibiting superior performance in single-step forecasting are selected as the basis for the iterated multi-step forecasting models; specifically, the three deep learning models—RNN, LSTM, and GRU—are employed for the multi-step prediction tasks. All models follow the same experimental setup: look-back window L = 60, prediction length T = 1, testing set proportion S = 0.2, input feature dimension I ∈ {1,2}, and output label dimension O = 1.
In the multi-step forecasting comparison experiment, this study compares the performance of the IMS forecasting models (RNN, LSTM, and GRU) and the DMS forecasting model (PatchTST). All models follow the same experimental setup: look-back window L = 104 for PatchTST and L = 60 for the other models, prediction length T = 12 (covering two days), testing set proportion S = 0.2, input feature dimension I ∈ {1,2}, and output label dimension O ∈ {1,2}, respectively.
Generally, IMS forecasting inevitably suffers from error accumulation effects [16]. Therefore, IMS forecasting is limited to a relatively short prediction length. This study compares their accuracy with different prediction lengths and in different periods and then analyzes the pattern of decreases in accuracy with the increase in prediction length.
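The contrast between the two strategies can be sketched in a few lines: an IMS forecaster predicts one step at a time and feeds each prediction back into its own look-back window, so errors compound over the horizon, whereas a DMS model such as PatchTST emits all T steps in one forward pass. The sketch assumes a scikit-learn-style fitted single-step estimator (`model_1step` is a hypothetical name):

```python
import numpy as np

def iterated_forecast(model_1step, window, horizon=12):
    """IMS: roll the look-back window forward, reusing each prediction as the next input."""
    window = np.asarray(window, dtype=float).copy()
    preds = []
    for _ in range(horizon):
        y_hat = float(model_1step.predict(window.reshape(1, -1))[0])
        preds.append(y_hat)                    # any error here contaminates later steps
        window = np.append(window[1:], y_hat)  # slide the window forward by one step
    return np.array(preds)

# DMS (e.g., PatchTST): one call returns the whole horizon, so no predicted
# value is ever fed back as an input.
# preds = model_dms.predict(window.reshape(1, -1))   # shape (1, horizon), hypothetical
```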
The hyper-parameters of the SVR, XGBoost, KNN, RNN, LSTM, GRU, and PatchTST models in this experiment were optimized using the Optuna framework [33]; the resulting values can be found in Supplementary Materials Table S1.
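A minimal sketch of such an Optuna search, shown for SVR; the search space, trial count, and cross-validation scheme are illustrative assumptions rather than the exact configuration behind Table S1, and `X_train`/`y_train` are the windowed training arrays from the sketch in Section 2.3.1:

```python
import optuna
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score

def objective(trial):
    # Illustrative search space (not the ranges actually used for Table S1).
    model = SVR(
        C=trial.suggest_float("C", 1e-2, 1e2, log=True),
        epsilon=trial.suggest_float("epsilon", 1e-3, 1.0, log=True),
    )
    # Minimize the mean cross-validated MSE on the training windows.
    return -cross_val_score(model, X_train, y_train,
                            scoring="neg_mean_squared_error", cv=3).mean()

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```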
All models were developed and run under the same computational configuration (Supplementary Materials Table S2). The system is powered by an Intel Core i7-9750H CPU (6 cores, 12 threads, base frequency 2.6 GHz), 16 GB of RAM, and an NVIDIA GeForce GTX 1650 GPU with 4 GB of dedicated video memory (VRAM). It runs on the Windows 10 operating system, with the software environment set up using the Anaconda distribution (Python version 3.9.0).

2.5. Model Evaluation

In order to evaluate the models from multiple aspects, this study chose mean squared error (MSE) and mean absolute error (MAE) as the evaluation metrics. According to the literature [34], evaluating time series forecasting results with only MSE or MAE is inadequate. Therefore, this study additionally chose Dynamic Time Warping (DTW) [35] and the Temporal Distortion Index (TDI) [36] as evaluation metrics for multi-step forecasting. A lower DTW means higher similarity between the shapes of the true and predicted curves, and a lower TDI means a smaller time delay between the curve of true values and the curve of predicted values.
Considering the randomness of the initial parameters of the deep learning models, this study repeated each experiment 5 times and calculated the mean metrics for each model.
Calculation of MSE and MAE was conducted in Python version 3.9.0 using package scikit-learn (version 1.6.0) [24]. Calculation of DTW and TDI was conducted in Python version 3.9.0 using package tslearn (version 0.6.3) [37].
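A sketch of how these metrics can be computed for one forecast window; the TDI shown here is an illustrative quantity derived from the DTW alignment path (mean absolute time shift between aligned points), not necessarily the exact formulation of [36], and the arrays hold placeholder values rather than study data:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error
from tslearn.metrics import dtw, dtw_path

# Placeholder values for illustration only; not study data.
y_true = np.array([2.6, 2.8, 3.1, 3.0, 2.9])
y_pred = np.array([2.5, 2.7, 2.9, 3.1, 3.0])

mse = mean_squared_error(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)

# Shape similarity: DTW distance between the true and predicted curves.
dtw_dist = dtw(y_true, y_pred)

# Illustrative time-delay measure from the DTW alignment path (assumption, see text):
# average absolute index offset between matched points.
path, _ = dtw_path(y_true, y_pred)
tdi = float(np.mean([abs(i - j) for i, j in path]))
```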

3. Results and Discussion

3.1. Single-Step Time Series Forecasting

In single-step forecasting, this study compares SVR, KNN, XGBoost, RNN, LSTM, and GRU, each serving as a baseline for the others. The best results achieved by each model are always reported so that the comparison remains fair. The hyper-parameters set for each model in this experiment allow it to reach good performance; the details are provided in Supplementary Materials Table S1.
Table 2 shows the single-step forecasting results for both the univariate (one station) and multivariate (two stations) feature subsets. For the univariate feature subset, LSTM outperforms the other models, but only with a minor advantage over the other deep learning models, RNN and GRU. Considering that RNN requires a much shorter training time, RNN can be regarded as the most practical choice in this context. Quantitatively, compared with the machine learning models (KNN, SVR, and XGBoost), LSTM achieves an overall 22.0% (Welch’s t-test, p = 3.03 × 10−7) reduction in MSE and a 22.3% (Welch’s t-test, p = 5.93 × 10−8) reduction in MAE. In addition, KNN has the highest MSE (i.e., the lowest accuracy), and XGBoost requires the longest training time, which suggests that KNN may not be well suited to river water quality forecasting.
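For reference, such percentage reductions and Welch’s t-tests can be computed directly from the per-run error values; a sketch with placeholder numbers (not the study’s actual records):

```python
import numpy as np
from scipy.stats import ttest_ind

# Placeholder per-run MSE values for illustration only; not the experimental records.
dl_mse = np.array([0.10, 0.11, 0.09, 0.10, 0.11])   # deep learning runs
ml_mse = np.array([0.13, 0.14, 0.12, 0.13, 0.14])   # machine learning runs

reduction = 100 * (ml_mse.mean() - dl_mse.mean()) / ml_mse.mean()   # percent MSE reduction
t_stat, p_value = ttest_ind(dl_mse, ml_mse, equal_var=False)        # Welch's t-test
print(f"MSE reduction: {reduction:.1f}%, p = {p_value:.2e}")
```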
For the multivariate feature subset, consistent with the discussion above, the deep learning models perform almost identically in terms of accuracy, but RNN requires a much shorter training time. Only RNN and GRU achieve a slight reduction in MSE compared with the univariate feature subset, which indicates that directly pairing the two stations’ data at the same time point does not yield a consistent improvement in prediction accuracy. However, the models’ accuracy can be improved by matching the two datasets in a different way (see details in Supplementary Materials S1). Moreover, LSTM and GRU achieve a considerable reduction in training time even with more input data, which can be seen as an advantage of the deep learning models.
Figure 4 provides a visualization of the testing set results, which confirms the strong performance of LSTM and GRU. As can be seen in Figure 4, SVR performs poorly on the 900th to 1000th samples of the testing set for the multivariate feature subset, and KNN exhibits consistently poor performance across the entire testing set. A more detailed visualization of each model is provided in Supplementary Materials Figures S1 and S2.

3.2. Multi-Step Time Series Forecasting

In single-step forecasting, RNN, LSTM, and GRU perform well. Consequently, in multi-step forecasting, this study uses the RNN, LSTM, and GRU IMS forecasting models as baselines for the PatchTST DMS forecasting model. The best results achieved by each model are always reported so that the comparison remains fair. The details are provided in Supplementary Materials Table S1.
The hyper-parameters of the PatchTST model in this experiment were likewise optimized using the Optuna framework [33] and can be found in Supplementary Materials Table S1.
Table 3 shows the multi-step forecasting results for both the univariate and multivariate feature subsets. First, moving from one station’s data to two stations’ data, the PatchTST model demonstrates significant improvements across all metrics: MSE decreases by 11.7%, DTW by 5.1%, and TDI by 9.5%, indicating improved accuracy, better shape matching, and reduced time delay. In contrast, the RNN, LSTM, and GRU models exhibit increased MSE and DTW values, suggesting poorer shape correspondence when additional features are included. However, RNN and LSTM obtain a lower TDI, which suggests that additional features can reduce the time-delay effect but still require a better construction method. Overall, the DMS model PatchTST outperforms all the other models. Notably, PatchTST underperforms the other baselines for the univariate feature subset, yielding the highest MSE (0.409 ± 0.080) and DTW (1.56 ± 0.08). However, for the multivariate subset, PatchTST achieves the best performance across all three metrics. Specifically, its TDI is 15.7% lower (Welch’s t-test, p = 0.006) than the best result attainable by the competing methods, while all the selected models perform nearly the same in terms of MSE for the multivariate subset, which indicates that the PatchTST model can reduce the time-delay effect to some extent.
Figure 5 visualizes the multi-step forecasting results and illustrates the advantages of PatchTST and RNN over the other models. For instance, in Figure 5c–e, PatchTST (red line) accurately captures the turning points of the true values (purple line), demonstrating its ability to predict changes. In addition, in all selected panels (Figure 5a–h), the LSTM and GRU IMS forecasting models perform poorly: they produce only monotonic predictions, which cannot be regarded as a learned ability and achieve a lower prediction error only when the data trend happens to be monotonic. On the other hand, RNN, a deep learning model with a much simpler structure than LSTM and GRU, shows some ability in river water quality multi-step forecasting, as can be seen in Figure 5g,h. However, on most occasions, RNN behaves just like LSTM and GRU, producing only a monotonic (increasing or decreasing) prediction, whereas PatchTST always predicts the time series in its own learned way. Even when PatchTST does not fit the trend of the true values well, it still produces a reasonable prediction.
Table 4 shows the pattern of decreasing accuracy with increasing prediction length for the univariate feature subset. As can be seen in Table 4, accuracy drops sharply (MSE rises steeply) at the beginning of iterated forecasting and then decreases more slowly. In addition, the PatchTST model shows a slower decrease in accuracy with increasing prediction length, which demonstrates the advantage of the DMS forecasting model.
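One plausible way to produce such percentages (this reading of Table 4 is an assumption) is the per-step increase in MSE relative to the first forecast step:

```python
import numpy as np

def degradation_percent(mse_per_step):
    """Percent increase of the MSE at each horizon step relative to step 1."""
    mse = np.asarray(mse_per_step, dtype=float)
    return 100 * (mse - mse[0]) / mse[0]

# Placeholder per-step MSEs for illustration, not the study's values:
print(degradation_percent([0.10, 0.15, 0.19, 0.22]))   # ≈ [0., 50., 90., 120.]
```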

3.3. Discussion

This study conducted a comprehensive comparison of multiple machine and deep learning models for time series forecasting of ammonia nitrogen in rivers. The findings offer several key contributions to the field.
In the single-step time series forecasting experiment, our results revealed that the RNN algorithm outperforms the others when considering both prediction accuracy and time cost. Specifically, in the univariate feature subset scenario, although LSTM had the highest accuracy, RNN achieved a comparable level of accuracy, with a significantly shorter training time. Compared to machine learning models, such as SVR and XGBoost, deep learning models like LSTM demonstrated a remarkable reduction in MSE (22.0%, Welch’s t-test p = 3.03 × 10−7) and MAE (22.3%, Welch’s t-test p = 5.93 × 10−8). This is consistent with the results of ammonia nitrogen prediction in previous research [38]; in addition, Hu et al. [39] proposed a hybrid deep neural network model for river water quality prediction. Their study also found that deep learning-based models showed strong capabilities in handling complex water quality data. However, our research uniquely emphasizes the efficiency–accuracy tradeoff among different deep learning models in single-step forecasting. Moreover, our research found that deep learning models (RNN, LSTM, and GRU) could reduce training time, even with more input data from two stations, highlighting their advantage over machine learning models in handling time series data. All these findings challenge the traditional perception that more complex models always perform better and provide a new perspective for model selection in water quality forecasting.
In multi-step forecasting, our research compared iterated multi-step models (RNN, LSTM, and GRU) with the direct multi-step PatchTST model. The results showed that LSTM and GRU struggled to learn the changing patterns of the time series and often produced monotonic predictions. In contrast, PatchTST not only achieved a lower time lag (at least a 13.8% reduction in TDI compared to the other models, Welch’s t-test p = 0.006) but also maintained relatively stable accuracy within a limited prediction horizon. However, substantial challenges remain, and its forecasting performance deteriorates beyond a certain number of steps. Consequently, while PatchTST shows advantages for mid-term prediction, its applicability to very long-term water quality forecasting is limited. Additionally, despite its simple structure, the RNN’s efficiency (training in seconds) suits high-frequency, low-latency applications, and it performs better than LSTM and GRU in certain multi-step forecasting scenarios.
Our research has significant implications for river water quality management. It provides a practical reference for choosing appropriate models based on different application scenarios. For real-time monitoring and short-term prediction requirements, models like RNN can be considered due to their high efficiency. For long-term trend analysis and multi-step forecasting, PatchTST offers better performance, but at the same time, the current accuracy still falls short, indicating significant room for development in the field of water quality time series prediction.
However, this study also has limitations, as it focused only on ammonia nitrogen concentration and data from two stations. Different water quality parameters exhibit different temporal characteristics, so whether the conclusions drawn here for ammonia nitrogen generalize to other indicators remains an open question. Future research could expand the scope to include more water quality indicators and data from a wider range of regions. Additionally, further exploration of model structure optimization tailored to the characteristics of river water quality data is also needed.
To sum up, the present research contributes to a better understanding of the performance of different machine/deep learning models in river water quality forecasting. It offers valuable insights for environmental managers and researchers, promoting more accurate and efficient water quality prediction and management.

4. Conclusions

This study evaluated seven machine/deep learning models for river water quality forecasting using ammonia nitrogen data from two monitoring stations (4 h intervals, from August 2021 to December 2023), assessing them from both prediction accuracy and training efficiency perspectives while considering two scenarios: single-step and multi-step forecasting. The key findings reveal distinct model advantages across forecasting scenarios. (1) In single-step forecasting, deep learning models (RNN, LSTM, GRU) outperformed traditional methods (SVR, XGBoost, KNN). LSTM achieved the highest accuracy (MSE = 0.085 for univariate data), while RNN demonstrated the best training efficiency. However, multivariate integration (upstream–downstream data) only marginally improved the performance of RNN and GRU, indicating the need for advanced feature engineering. (2) For 12-step multi-step forecasting, the PatchTST model outperformed the iterated methods (LSTM/GRU) by reducing the time-delay impact (TDI lower by at least 13.8%, Welch’s t-test p = 0.006) and avoiding monotonic prediction biases. PatchTST also degraded more slowly (226.1% MSE increase over 12 steps vs. 305.4% for LSTM), indicating an advantage in capturing nonlinear spatiotemporal dynamics. Notably, the simpler RNN architecture occasionally matched the more complex models in multi-step tasks, challenging the assumption that complexity always enhances performance. (3) The findings emphasize the critical balance between predictive accuracy, computational efficiency, and forecasting horizon. For real-time single-step applications requiring low latency, RNN is recommended. For long-term planning, the PatchTST model demonstrates potential but still requires task-specific adjustments to be better suited to river water quality time series forecasting. Additionally, exploring or constructing more features, such as historical climate and weather data, could also be of great help. This study establishes a practical framework for model selection, prioritizing the multi-step forecasting capabilities essential for proactive river basin management.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/w17131866/s1, Figure S1: Prediction curve of testing set of each model for univariate feature subset. (a–f) represent curves of SVR, XGBoost, KNN, RNN, LSTM and GRU respectively; Figure S2: Prediction curve of testing set of each model for multivariate feature subset. (a–f) represent curves of SVR, XGBoost, KNN, RNN, LSTM and GRU respectively; Figure S3: Prediction curve of testing set for univariate feature subset. Plates (a–h) represent prediction curve at different time period in the testing set; Table S1: Optimal hyper-parameters for single-step and multi-step time series forecasting experiment; Table S2: Computing configuration.

Author Contributions

Conceptualization, methodology, formal analysis, H.F. and T.L.; draft preparation, visualization, H.F.; review, editing, supervision, T.L.; investigation, validation, H.X. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China under grant number 52330005.

Data Availability Statement

Due to commercial restrictions, the original data cannot be made publicly available.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Chen, K.; Chen, H.; Zhou, C.; Huang, Y.; Qi, X.; Shen, R.; Liu, F.; Zuo, M.; Zou, X.; Wang, J.; et al. Comparative analysis of surface water quality prediction performance and identification of key water parameters using different machine learning models based on big data. Water Res. 2020, 171, 115454. [Google Scholar] [CrossRef] [PubMed]
  2. China National Environmental Monitoring Centre. Introduction of National Surface Water Quality Automatic Monitoring System. 2017. Available online: http://www.cnemc.cn/zzjj/jcwl/shjjcwl_699/201705/t20170531_645113.shtml (accessed on 15 May 2025).
  3. HJ 915-2017; Technical Specifications for Automatic Monitoring of Surface Water. Ministry of Environmental Protection of the People’s Republic of China: Beijing, China, 2017.
  4. Gao, Z.; Chen, J.; Wang, G.; Ren, S.; Fang, L.; Yinglan, A.; Wang, Q. A novel multivariate time series prediction of crucial water quality parameters with long short-term memory (LSTM) networks. J. Contam. Hydrol. 2023, 259, 104262. [Google Scholar] [CrossRef] [PubMed]
  5. Gao, L.L.; Li, D.L. A review of hydrological/water-quality models. Front. Agric. Sci. Eng. 2014, 1, 267. [Google Scholar] [CrossRef]
  6. Varadharajan, C.; Appling, A.P.; Arora, B.; et al. Can machine learning accelerate process understanding and decision-relevant predictions of river water quality? Hydrol. Process. 2022, 36, e14565. [Google Scholar]
  7. Wang, Y.; Yuan, Y.; Pan, Y.; Fan, Z. Modeling daily and monthly water quality indicators in a canal using a hybrid wavelet-based support vector regression structure. Water 2020, 12, 1476. [Google Scholar] [CrossRef]
  8. Fu, X.; Zheng, Q.; Jiang, G.; Roy, K.; Huang, L.; Liu, C.; Li, K.; Chen, H.; Song, X.; Chen, J.; et al. Water quality prediction of copper-molybdenum mining-beneficiation wastewater based on the PSO-SVR model. Front. Environ. Sci. Eng. 2023, 17, 98. [Google Scholar] [CrossRef]
  9. Baek, S.S.; Pyo, J.; Chun, J.A. Prediction of water level and water quality using a CNN-LSTM combined deep learning approach. Water 2020, 12, 3399. [Google Scholar] [CrossRef]
  10. Zhang, Y.; Li, C.; Jiang, Y.; Sun, L.; Zhao, R.; Yan, K.; Wang, W. Accurate prediction of water quality in urban drainage network with integrated EMD-LSTM model. J. Clean. Prod. 2022, 354, 131724. [Google Scholar] [CrossRef]
  11. Chen, L.; Wu, T.; Wang, Z.; Lin, X.; Cai, Y. A novel hybrid BPNN model based on adaptive evolutionary artificial bee colony algorithm for water quality index prediction. Ecol. Indic. 2023, 146, 109882. [Google Scholar] [CrossRef]
  12. Xu, J.; Wang, K.; Lin, C.; Xiao, L.; Huang, X.; Zhang, Y. FM-GRU: A time series prediction method for water quality based on seq2seq framework. Water 2021, 13, 1031. [Google Scholar] [CrossRef]
  13. Vaswani, A. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar]
  14. Zhou, H.; Zhang, S.; Peng, J.; Zhang, S.; Li, J.; Xiong, H.; Zhang, W. Informer: Beyond efficient transformer for long sequence time-series forecasting. Proc. AAAI Conf. Artif. Intell. 2021, 35, 11106–11115. [Google Scholar] [CrossRef]
  15. Wu, H.; Xu, J.; Wang, J.; Long, M. Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. Adv. Neural Inf. Process. Syst. 2021, 34, 22419–22430. [Google Scholar]
  16. Zhou, T.; Ma, Z.; Wen, Q.; Wang, X.; Sun, L.; Jin, R. Fedformer: Frequency enhanced decomposed transformer for long-term series forecasting. Int. Conf. Mach. Learn. 2022, 162, 27268–27286. [Google Scholar]
  17. Zeng, A.; Chen, M.; Zhang, L.; Xu, Q. Are transformers effective for time series forecasting? Proc. AAAI Conf. Artif. Intell. 2023, 37, 11121–11128. [Google Scholar] [CrossRef]
  18. Lin, Y.; Qiao, J.; Bi, J.; Yuan, H.; Wang, M.; Zhang, J.; Zhou, M. Transformer-based water quality forecasting with dual patch and trend decomposition. IEEE Internet Things J. 2024, 12, 10987–10997. [Google Scholar] [CrossRef]
  19. Yao, S.; Zhang, Y.; Wang, P.; Xu, Z.; Wang, Y.; Zhang, Y. Long-term water quality prediction using integrated water quality indices and advanced deep learning models: A case study of Chaohu Lake, China, 2019–2022. Appl. Sci. 2022, 12, 11329. [Google Scholar] [CrossRef]
  20. Bi, J.; Chen, D.; Yuan, H. Graph attention transformer with dilated causal convolution and Laplacian eigenvectors for long-term water quality prediction. In Proceedings of the 2024 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Sarawak, Malaysia, 6–10 October 2024; pp. 3571–3576. [Google Scholar]
  21. Jaffar, A.; Thamrin, N.M.; Megat, M.S.A. Water quality prediction using LSTM-RNN: A review. J. Sustain. Sci. Manag. 2022, 17, 204–225. [Google Scholar] [CrossRef]
  22. Liu, Y.J. Prediction of Water Quality in the South Source of Qiantang River Basin Based on Single and Multi-Step Models. Master’s Thesis, Wuhan University, Wuhan, China, 2021. [Google Scholar]
  23. GB 3838-2002; Environmental Quality Standards for Surface Water. Ministry of Ecology and Environment of the People’s Republic of China: Beijing, China, 2002.
  24. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830. [Google Scholar]
  25. Zhang, A.; Lipton, Z.C.; Li, M.; Smola, A.J. Dive into deep learning. arXiv 2021, arXiv:2106.11342. [Google Scholar]
  26. Vapnik, V.N. An overview of statistical learning theory. IEEE Trans. Neural Netw. 1999, 10, 988–999. [Google Scholar] [CrossRef]
  27. Bishop, C.M.; Nasrabadi, N.M. Pattern Recognition and Machine Learning; Springer: New York, NY, USA, 2006; pp. 326–343. [Google Scholar]
  28. Chen, T.Q.; Guestrin, C. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794. [Google Scholar]
  29. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef] [PubMed]
  30. Cho, K. On the properties of neural machine translation: Encoder-decoder approaches. arXiv 2014, arXiv:1409.1259. [Google Scholar]
  31. Nie, Y.; Nguyen, N.H.; Sinthong, P.; Kalagnanam, J. A time series is worth 64 words: Long-term forecasting with transformers. arXiv 2022, arXiv:2211.14730. [Google Scholar]
  32. Oguiza, I. tsai—A State-of-the-Art Deep Learning Library for Time Series and Sequential Data. GitHub Repository, 2023. Available online: https://github.com/timeseriesAI/tsai (accessed on 15 May 2025).
  33. Akiba, T.; Sano, S.; Yanase, T.; Ohta, T.; Koyama, M. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA, 4–8 August 2019; pp. 2623–2631. [Google Scholar]
  34. Jhin, S.Y.; Kim, S.; Park, N. Addressing prediction delays in time series forecasting: A continuous GRU approach with derivative regularization. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Barcelona, Spain, 25–29 August 2024; pp. 1234–1245. [Google Scholar]
  35. Sakoe, H.; Chiba, S. Dynamic programming algorithm optimization for spoken word recognition. IEEE Trans. Acoust. Speech Signal Process. 1978, 26, 43–49. [Google Scholar] [CrossRef]
  36. Frías-Paredes, L.; Mallor, F.; Gastón-Romeo, M.; León, T. Assessing energy forecasting inaccuracy by simultaneously considering temporal and absolute errors. Energy Convers. Manag. 2017, 142, 533–546. [Google Scholar] [CrossRef]
  37. Tavenard, R.; Faouzi, J.; Vandewiele, G.; Divo, F.; Androz, G.; Holtz, C.; Payne, M.; Yurchak, R.; Rußwurm, M.; Kolar, K.; et al. Tslearn, a machine learning toolkit for time series data. J. Mach. Learn. Res. 2020, 21, 1–6. [Google Scholar]
  38. Zhou, S.; Song, C.; Zhang, J.; Chang, W.; Hou, W.; Yang, L. A hybrid prediction framework for water quality with integrated W-ARIMA-GRU and LightGBM methods. Water 2022, 14, 1322. [Google Scholar] [CrossRef]
  39. Hu, Y.; Lyu, L.; Wang, N.; Zhou, X.; Fang, M. Application of hybrid improved temporal convolution network model in time series prediction of river water quality. Sci. Rep. 2023, 13, 11260. [Google Scholar] [CrossRef]
Figure 1. The location of the monitoring stations in the Baini river basin.
Figure 2. Variation in ammonia nitrogen content at the downstream station.
Figure 3. A latent autoregressive model (adapted from [25]).
Figure 4. Prediction curve of testing set (a) for univariate feature subset and (b) for multivariate feature subset. Note: The y-axis shows normalized ammonia nitrogen values, while the x-axis represents time steps (each step = 4 h). The solid curve labeled “true” denotes observed data, and the other curves indicate predictions from the different models.
Figure 5. Prediction curve of testing set. Plates (ah) represent prediction curves at different time periods in the testing set. Note: The y-axis shows normalized ammonia nitrogen values, while the x-axis represents time steps (each step = 4 h). The solid curve labeled “true” denotes observed data, and the other curves indicate predictions from the different models. The color difference arises because PatchTST uses 104 input timesteps versus 60 for the other models: red represents the 60 shared steps, while pink shows PatchTST’s extra steps before them.
Table 1. Details of AN datasets at the downstream station and upstream station.

Variable                   Unit    Max     Min    Mean    Std
Ammonia Nitrogen at TMH    mg/L    19.10   0.00   2.60    1.46
Ammonia Nitrogen at XHY    mg/L    11.40   0.00   3.34    1.80
Table 2. Single-step forecasting results averaged over 5 runs (mean ± standard deviation).

Models     Metric   One Station          Two Stations
SVR        MSE      0.109                0.129
           MAE      0.215                0.252
           TIME     1.98                 2.32
XGBoost    MSE      0.130                0.190
           MAE      0.215                0.242
           TIME     126 ± 0.6            165 ± 0.8
KNN        MSE      0.368                0.468
           MAE      0.398                0.496
           TIME     0.001                0.001
RNN        MSE      0.0869 ± 0.0003      0.0859 ± 0.0004
           MAE      0.167 ± 0.000        0.166 ± 0.002
           TIME     23.8 ± 2.7           21.3 ± 0.2
LSTM       MSE      0.0852 ± 0.0008      0.0856 ± 0.0012
           MAE      0.167 ± 0.001        0.169 ± 0.002
           TIME     121 ± 6.7            89.2 ± 0.4
GRU        MSE      0.0867 ± 0.0005      0.0852 ± 0.0006
           MAE      0.168 ± 0.001        0.166 ± 0.000
           TIME     98.7 ± 11.7          76.5 ± 0.8

Note: the best results are in bold, and the second-best results are underlined. TIME represents the training time of the models in seconds.
Table 3. Multi-step forecasting results averaged over 5 runs (mean ± standard deviation).

Models     Metric   One Station        Two Stations
RNN        MSE      0.359 ± 0.006      0.364 ± 0.015
           DTW      1.55 ± 0.01        1.57 ± 0.03
           TDI      1.52 ± 0.03        1.47 ± 0.11
LSTM       MSE      0.345 ± 0.002      0.361 ± 0.014
           DTW      1.50 ± 0.00        1.56 ± 0.05
           TDI      1.53 ± 0.09        1.49 ± 0.06
GRU        MSE      0.349 ± 0.013      0.367 ± 0.011
           DTW      1.51 ± 0.02        1.57 ± 0.03
           TDI      1.54 ± 0.03        1.65 ± 0.07
PatchTST   MSE      0.409 ± 0.080      0.361 ± 0.026
           DTW      1.56 ± 0.08        1.48 ± 0.03
           TDI      1.37 ± 0.02        1.24 ± 0.06

Note: the best results are in bold, and the second-best results are underlined.
Table 4. Decreases in accuracy with the increase in prediction length.

Prediction Step   RNN (Decrease %)   LSTM (Decrease %)   GRU (Decrease %)   PatchTST (Decrease %)
1                 0.0                0.0                 0.0                0.0
2                 51.6               52.1                50.9               38.7
3                 92.6               93.3                90.8               69.9
4                 126.9              127.7               123.7              95.5
5                 156.5              157.0               151.2              117.1
6                 180.7              181.2               173.5              135.8
7                 202.9              203.7               194.0              153.5
8                 224.1              225.5               213.8              170.5
9                 245.0              247.1               233.2              186.6
10                265.4              268.0               252.0              201.5
11                284.3              287.4               269.4              214.5
12                301.7              305.4               285.4              226.1

Note: prediction steps mean time steps (each step = 4 h).
