1. Introduction
With the acceleration of urbanization and the deterioration of existing urban drainage pipelines [1], stable and efficient operation of wastewater treatment plants (WWTPs) has become challenging because influent flow varies dynamically under the influence of pipeline leakage, the intrusion of rainwater and groundwater, and other factors. An accurate wastewater flowrate forecast is crucial for the optimization of WWTP operation [2].
However, traditional methods often rely on extensive historical data and empirical judgement [3], which makes timely and accurate forecasts difficult when wastewater flowrate changes are complex and irregular. Especially during extreme weather events or major festivals [4], traditional forecasting methods cannot capture changes in wastewater flow effectively and quickly, which creates a risk of overloading the WWTP. Therefore, the development of reliable wastewater flowrate forecast models has become a top priority for today’s WWTPs [5].
The current forecasting models of wastewater flow are mainly based on traditional statistical models and deep learning models [6]. Traditional statistical models include Autoregressive Integrated Moving Average (ARIMA), Prophet, and Markov models. ARIMA is a statistical model widely used in time series analysis and forecasting, and it can effectively capture the trend, seasonality, and noise characteristics in time series data [7]. Maleki et al. [8] applied ARIMA to the inflow characteristic time series of a WWTP to create models for short-term (7 days in advance) forecasting. Bień et al. [9] combined ARIMA with the Long Short-Term Memory (LSTM) model, greatly improving the forecast accuracy of sludge production in sewage treatment plants; the Mean Absolute Percentage Error (MAPE) of the hybrid model was 9.4%. The Prophet model can take holiday effects into account in addition to trends, seasonality, and other factors. The Markov model is a mathematical model based on the Markov property and is mainly used to describe stochastic processes. Li et al. [10] used a discrete hidden Markov model to accurately predict water quality.
In addition, deep learning models, such as LSTM and Transformer, have gradually become the mainstream forecast method in recent years. As a special recurrent neural network (RNN), LSTM mitigates the vanishing and exploding gradient problems of traditional RNNs when processing long sequence data and can effectively capture long-term dependencies in sequences, which makes it suitable for nonlinear problems. Farhi et al. [11] proposed a novel machine learning method based on the LSTM architecture that utilizes measurements from bioreactors sampled every minute and combines them with climate measurements to greatly improve forecast accuracy. When predicting nitrogen concentration and nitrate concentration, the AUC increased by 1% and 5%, respectively.
However, LSTM is highly complex: it takes a long time to train and requires a large amount of data [12]. Ali et al. [13] used the Transformer and ensemble models to predict the ammonium nitrogen levels in rivers and achieved good results, with a coefficient of determination ($R^2$) of 0.97 and a Nash–Sutcliffe Efficiency (NSE) of 0.90. However, the Transformer struggles to model local temporal relationships and lacks a physical explanation. In wastewater flow forecasting, a large number of external input features (such as weather data, holidays, and watershed conditions) are used, and the high-dimensional nature of these data may affect the training efficiency and forecast accuracy of the Transformer model.
Hyperparameter optimization is essential to improve model accuracy [14]. Currently, prominent optimization algorithms include WOA, PSO, and SSA. Cui et al. [15] proposed a new load forecast model that integrates the WOA to optimize the hyperparameters of an improved LSTM model, achieving excellent forecast results. Du et al. [16] proposed a model that combines LSTM and kernel density estimation (KDE) and uses the PSO algorithm to optimize KDE’s hyperparameters, demonstrating the superiority of the PSO-optimized model through comparison with numerous hybrid models. Zhang et al. [17] utilized the SSA to optimize Bidirectional Long Short-Term Memory (BiLSTM) and formed the coupled Variational Mode Decomposition (VMD)–SSA–BiLSTM model for predicting the monthly runoff of the Yellow River. Compared with the single BiLSTM model, the $R^2$ increased by 0.53059, which greatly improved the accuracy.
In this study, the ARIMA–Markov model and the ARIMA–LSTM–Transformer model were developed for wastewater flow forecasting over different time horizons. After the two models were compared, the ARIMA–LSTM–Transformer model was chosen. WOA, PSO, and SSA were then each applied to optimize its hyperparameters, and WOA demonstrated the best effect. The resulting WOA–ARIMA–LSTM–Transformer model was built to achieve efficient and accurate wastewater flow forecasting over different time horizons.
The ARIMA–LSTM–Transformer model was chosen because the strengths of its components are complementary. ARIMA is good at capturing linear trends in time series, but it cannot effectively deal with nonlinear features, and it performs poorly in long-term forecasting. Thanks to its memory unit and gating mechanism, LSTM can capture long-term dependencies more accurately, but its stability is poor, so a Transformer module was added. Transformer excels at modeling long-term dependencies and complex nonlinear relationships, but it is more dependent on the amount of data and computing resources. By integrating these three types of models, the shortcomings of each can be alleviated to a certain extent, and the accuracy and stability of wastewater flow prediction can be improved.
The paper is divided into five different parts: 1. Introduction; 2. Materials and Methods, including missing value handling, outlier identification, and the ARIMA–Markov model and ARIMA–LSTM–Transformer model; 3. Results and Discussion, including model validation, ablation experiments, and a stability test; 4. Algorithm Optimization, including parameter selection, accuracy testing, and a stability test; and 5. Conclusions.
In this study, Jiawen Ye contributed to the conceptualization, data curation, investigation, methodology, original draft writing, visualization, review and editing, and funding acquisition. Xulai Meng performed the formal analysis, data curation, methodology, original draft writing, and validation. Haiying Wang contributed to the conceptualization, methodology, supervision, funding acquisition, and review and editing. Qingdao Zhou and Siwei An contributed to the data curation, methodology, and original draft writing. Tong An participated in the investigation and in the review and editing. Pooria Ghorbani Bam and Diego Rosso were involved in the revision.
3. Results and Discussion
3.1. Model Validation
3.1.1. ARIMA–Markov’s Maximized Likelihood Function
Maximum likelihood estimation is a widely used method for parameter estimation in statistics. Given the observed data, it seeks the set of model parameter values under which the probability of observing those data is highest, which helps determine the model configuration that best conforms to the internal structure of the data.
In this study, to ensure that the series fitted by the ARIMA model is stationary for each prediction period, and to avoid information loss or model oversimplification due to excessive differencing, the differencing order $d$ is limited to [0, 3], and the stationarity of the sequence is confirmed by graphical analysis, a unit root test, and the decay of the ACF/PACF. The experimental results show that the differenced sequence is stationary at $d = 2$, so $d = 2$ is finally selected.
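As a minimal illustration of what the differencing order controls, the sketch below applies repeated first differences to a toy series. The function name `difference` and the quadratic toy data are assumptions made for illustration; the unit root test and ACF/PACF checks mentioned above are not reproduced here.

```python
def difference(series, d=1):
    """Apply d-th order differencing: y'_t = y_t - y_(t-1), repeated d times."""
    values = list(series)
    for _ in range(d):
        values = [later - earlier for earlier, later in zip(values, values[1:])]
    return values


# Toy series with a quadratic trend (hypothetical data, not the plant's flow record).
toy_flow = [t * t for t in range(8)]

# Candidate orders in the range [0, 3] considered in the text.
for order in range(0, 4):
    print(order, difference(toy_flow, order))
```

For this toy series the second difference is constant, which is the kind of behaviour a stationarity check would look for; each additional difference also shortens the series by one point, which is one reason to avoid over-differencing.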
For different values of $p$ and $q$, the maximum likelihood functions were calculated for the three types of time periods needed: short term (4 days), medium term (4 weeks), and long term (8 months). In this study, the training-to-test split was set at 4:1; accordingly, the prediction horizons shown in the table are four times the length of the test set, namely 4 days, 4 weeks, and 8 months. Importantly, all parameter tuning and selection were conducted exclusively on the training set, with the test set reserved solely for final model evaluation. By comparing the magnitudes of the log-likelihood values, the highest point of the three-dimensional function graph was selected as the optimal combination of $p$ and $q$. The graph of the maximum likelihood function is shown in Figure 3. The optimal values of $p$ and $q$ corresponding to each period are shown in Table 1.
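The selection procedure described above can be sketched as a chronological 4:1 split followed by a grid search that keeps the order pair with the largest log-likelihood. This is only a sketch: `chronological_split`, `select_order`, and `toy_loglik` are hypothetical names, and `toy_loglik` is a made-up surface standing in for the log-likelihood of an ARIMA model fitted on the training set.

```python
from itertools import product


def chronological_split(series, train_parts=4, test_parts=1):
    """Split a time series in order, without shuffling, at a train:test ratio."""
    cut = len(series) * train_parts // (train_parts + test_parts)
    return series[:cut], series[cut:]


def select_order(loglik, p_values, q_values):
    """Return the (p, q) pair with the largest log-likelihood on the grid."""
    return max(product(p_values, q_values), key=lambda pq: loglik(*pq))


def toy_loglik(p, q):
    """Hypothetical log-likelihood surface; a real one would come from fitting
    ARIMA(p, d, q) on the training set only."""
    return -((p - 2) ** 2) - (q - 1) ** 2


train, test = chronological_split(list(range(120)))
best_p, best_q = select_order(toy_loglik, range(0, 5), range(0, 5))
print(len(train), len(test), best_p, best_q)
```

Keeping the split chronological and passing only training data to the scoring function mirrors the point made in the text that the test set is reserved for final evaluation.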
3.1.2. ARIMA–LSTM–Transformer’s Loss Function
The loss function measures the difference between the predicted outcome of the model and the observed outcome and thus provides a quantitative metric for model evaluation [33]. The loss value indicates how far the model’s forecast deviates from the real situation: the smaller the value, the closer the forecast is to the observed value and the better the performance of the model. There are many kinds of loss functions, among which the mean square error (MSE) is often used for regression problems, and the formula is as follows [Equation (2)]:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 \tag{2}$$

where $y_i$ is the observed value of wastewater flow, $\hat{y}_i$ is the predicted value of the wastewater flow, and $n$ is the total number of samples.
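For concreteness, the mean square error can be computed as in the short sketch below. The function name and the sample values are illustrative assumptions, not data from this study.

```python
def mean_squared_error(observed, predicted):
    """Average of squared differences between observed and predicted values."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(y - y_hat) ** 2 for y, y_hat in zip(observed, predicted)]
    return sum(squared_errors) / len(observed)


# Hypothetical flow values for illustration only.
print(mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```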
Figure 4 shows the change curves of the training loss (loss) and validation loss (val_loss) functions of the ARIMA–LSTM–Transformer model during the training process. From left to right, the first picture shows the validation loss of the test set for 1 day and the training loss of the training set for 4 days. The second picture displays the validation loss of the test set for 1 week and the training loss of the training set for 4 weeks. The third picture presents the validation loss of the test set for 1 month and the training loss of the training set for 4 months. The training loss reflects the forecast error of the model on the training dataset, demonstrating the learning ability of the ARIMA–LSTM–Transformer model for the features of the training data. The validation loss, calculated through a validation dataset independent of the training set, is used to measure the generalization performance of the model on unseen data.
It can be seen that both the training loss and the validation loss are decreasing and tend to stabilize at relatively low values, and the gap between them does not increase significantly. This indicates that the model does not have a serious overfitting problem. Moreover, it can find the optimal hyperparameter values within a finite number of iterations, suggesting that the learning process is effective.
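The visual reading of the loss curves can also be expressed as a simple numerical check on the gap between validation and training loss. The sketch below is one possible heuristic, not the procedure used in this study; the function names, the window size, the tolerance, and the two toy curves are all assumptions.

```python
def generalization_gap(train_loss, val_loss):
    """Per-epoch difference between validation loss and training loss."""
    return [v - t for t, v in zip(train_loss, val_loss)]


def gap_is_widening(train_loss, val_loss, window=3, tolerance=0.05):
    """Compare the mean gap over the last `window` epochs with the first `window`."""
    gaps = generalization_gap(train_loss, val_loss)
    if len(gaps) < window:
        raise ValueError("need at least `window` epochs")
    early = sum(gaps[:window]) / window
    late = sum(gaps[-window:]) / window
    return late - early > tolerance


# Hypothetical curves: both losses fall and the gap stays roughly constant.
healthy_train = [1.00, 0.60, 0.40, 0.30, 0.25, 0.22]
healthy_val = [1.10, 0.70, 0.50, 0.40, 0.35, 0.33]

# Hypothetical curves: training loss keeps falling while validation loss rises.
overfit_train = [1.00, 0.50, 0.20, 0.10, 0.05, 0.02]
overfit_val = [1.10, 0.70, 0.60, 0.70, 0.80, 0.90]

print(gap_is_widening(healthy_train, healthy_val))
print(gap_is_widening(overfit_train, overfit_val))
```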
3.2. Ablation Experiments
An ablation experiment is an experimental method commonly used in scientific research [34], especially in machine learning, deep learning, and related fields. It analyzes the impact of individual components or factors on overall model performance by gradually removing components or features from the model and observing how performance changes, which provides insight into the specific role and contribution of each component or feature. The experiment was conducted five times for each time interval, and the most accurate and least accurate forecasts were discarded. The average predicted wastewater flow values of the ARIMA model and ARIMA–Markov model were compared with the observed values (as shown in Figure 5). The residual plot of the predicted flowrates for each model is shown in Figure 6.
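The repeat-and-average step can be sketched as follows. The text does not state which metric ranks the runs, so ranking by mean absolute error is an assumption here, as are the function names and the toy numbers.

```python
def mean_absolute_error(observed, predicted):
    """Average absolute difference between observed and predicted values."""
    return sum(abs(y - y_hat) for y, y_hat in zip(observed, predicted)) / len(observed)


def trimmed_average_forecast(observed, runs):
    """Drop the most and least accurate runs, then average the rest pointwise."""
    if len(runs) < 3:
        raise ValueError("need at least three runs to drop the best and the worst")
    ranked = sorted(runs, key=lambda run: mean_absolute_error(observed, run))
    kept = ranked[1:-1]  # discard the best (first) and worst (last) run
    return [sum(step) / len(kept) for step in zip(*kept)]


# Hypothetical observed values and five repeated forecasts.
observed = [10.0, 10.0]
runs = [
    [10.0, 10.0],
    [11.0, 11.0],
    [12.0, 12.0],
    [13.0, 13.0],
    [20.0, 20.0],
]
print(trimmed_average_forecast(observed, runs))
```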
It can be seen from the graph that the predicted values of the ARIMA–Markov model are the closest to the observed values. Nevertheless, to quantify the comparative performance of the ARIMA–Markov and ARIMA models, the evaluation indicators of the models were calculated, and the results are shown in Table 2.
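As a reference for how such indicators are usually computed, the sketch below evaluates the coefficient of determination, the mean absolute error (MAE), and the root mean square error (RMSE). It assumes that these three are the indicators reported in the tables; the function name and the sample values are illustrative.

```python
import math


def evaluate(observed, predicted):
    """Return R2, MAE, and RMSE for paired observed and predicted values."""
    n = len(observed)
    errors = [y - y_hat for y, y_hat in zip(observed, predicted)]
    ss_res = sum(e * e for e in errors)               # residual sum of squares
    mean_obs = sum(observed) / n
    ss_tot = sum((y - mean_obs) ** 2 for y in observed)  # total sum of squares
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "MAE": sum(abs(e) for e in errors) / n,
        "RMSE": math.sqrt(ss_res / n),
    }


# Hypothetical values for illustration only.
scores = evaluate([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 5.0])
print(scores)
```

A higher R2 and lower MAE and RMSE indicate a closer fit; RMSE penalizes large individual errors more heavily than MAE.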
As can be seen in Table 2, the ARIMA model’s highest $R^2$ is 0.7501 and its lowest MAE and RMSE are 193.1590 and 296.5314, respectively, all of which occur when the forecast period is 1 day. These data indicate that the forecast of a single ARIMA model is more accurate for short intervals of time. However, it has difficulty capturing medium- and long-term data patterns, and there is room for improvement.
After adding the Markov module to correct the residuals, the forecast accuracy for the short, medium, and long intervals improves significantly, with an average increase of 17.52% in $R^2$ and average decreases of 29.67% and 28.81% in MAE and RMSE, respectively. The long-term forecast improves the most, with $R^2$ increasing by 27.37% and MAE and RMSE decreasing by 43.60% and 43.06%, respectively. These results show that the hybrid model greatly improves on the ability of a single ARIMA model to forecast long-term data and verify the accuracy and feasibility of fusing ARIMA and Markov.
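One common way to use a Markov chain for residual correction is sketched below: the ARIMA residuals are discretized into states, a transition matrix is estimated from consecutive states, and the expected next residual is added to the ARIMA forecast. This is a generic sketch under stated assumptions (equal-width states, residual defined as observed minus predicted, hypothetical function names and toy data); the state definition used in this study is given in the methods section and may differ.

```python
def discretize(residuals, n_states):
    """Map residuals to equal-width states; return (states, state centers)."""
    lo, hi = min(residuals), max(residuals)
    width = (hi - lo) / n_states or 1.0  # avoid zero width for constant residuals
    states = [min(int((r - lo) / width), n_states - 1) for r in residuals]
    centers = [lo + (k + 0.5) * width for k in range(n_states)]
    return states, centers


def transition_matrix(states, n_states):
    """Estimate row-normalized transition probabilities from consecutive states."""
    counts = [[0] * n_states for _ in range(n_states)]
    for current, following in zip(states, states[1:]):
        counts[current][following] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # States never left in the sample fall back to a uniform row.
        matrix.append([c / total if total else 1.0 / n_states for c in row])
    return matrix


def markov_corrected_forecast(arima_forecast, residuals, n_states=3):
    """Add the expected next residual (given the last state) to the ARIMA forecast."""
    states, centers = discretize(residuals, n_states)
    matrix = transition_matrix(states, n_states)
    row = matrix[states[-1]]
    expected_residual = sum(p * c for p, c in zip(row, centers))
    return arima_forecast + expected_residual


# Hypothetical residuals (observed minus ARIMA prediction) that alternate in sign.
toy_residuals = [-2.0, 2.0, -2.0, 2.0, -2.0]
print(markov_corrected_forecast(100.0, toy_residuals, n_states=2))
```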
Another ablation experiment was based on the ARIMA, LSTM, and Transformer models. The experiment was again conducted five times for each time interval, and the most accurate and least accurate forecasts were discarded. The average predicted wastewater flow values of the LSTM, Transformer, LSTM–Transformer, and ARIMA–LSTM–Transformer models were compared with the observed values (as shown in Figure 7). The residual plot of the predicted flowrates for each model is shown in Figure 8.
It can be seen from the graph that the predicted values of the ARIMA–LSTM–Transformer model are the closest to the observed values. The evaluation indicators of these models were calculated, and the results are shown in Table 3.
As can be seen from Table 3, both the single Transformer and LSTM models achieve the highest accuracy when the prediction interval is 2 months. Specifically, the $R^2$ of the Transformer is as high as 0.8546, with the MAE and RMSE being 189.9022 and 265.6617, respectively. This indicates that a single Transformer or LSTM model forecasts long-term conditions more accurately in this experiment but has difficulty capturing short-term local data features, so there is room for improvement.
After the Transformer and LSTM models are fused, the accuracy is greatly improved compared to either single model, especially in the medium-term forecast: the $R^2$ of the LSTM–Transformer combined model is 8.36% higher than that of the single Transformer model and 12.13% higher than that of the single LSTM model. The LSTM–Transformer hybrid model achieves 24.36% and 17.37% lower MAE and RMSE, respectively, compared to the single Transformer model, and 28.46% and 22.19% reductions in MAE and RMSE, respectively, relative to the single LSTM model.
However, the accuracy of the hybrid model in short- and long-term forecasts does not improve significantly. Compared to the single Transformer model, the $R^2$ of the hybrid model increases by only 0.73% and 1.43% for short- and long-term forecasts, respectively.
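The comparisons in this section mix percentage changes and absolute differences, so it helps to keep the two apart. The sketch below assumes that the percentages are relative changes with the single model as the baseline, which the text does not state explicitly; the function names and example values are hypothetical.

```python
def relative_change_percent(baseline, new):
    """Percentage change of `new` relative to `baseline` (negative means a reduction)."""
    return (new - baseline) / baseline * 100.0


def absolute_change(baseline, new):
    """Plain difference between the new value and the baseline."""
    return new - baseline


# Hypothetical error values: an error metric falling from 200.0 to 150.0.
print(relative_change_percent(200.0, 150.0))
print(absolute_change(200.0, 150.0))
```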
Considering that the forecast model may have some limits with respect to short-term forecasts, the ARIMA model was added. The ARIMA–LSTM–Transformer hybrid model performs well in the short-, medium-, and long-term forecasts. Specifically, its short-term $R^2$ is 4.32% higher than that of the LSTM–Transformer model, and its medium-term and long-term forecasts also exhibit outstanding results, with an average improvement of 4.24%.
Overall, the ARIMA–LSTM–Transformer model emerges as the optimal choice—especially for long-term forecasting tasks. Moreover, because the medium- and long-term dataset used in this study features high precipitation amounts, these results further demonstrate the model’s strong resistance to interference and validate the applicability and superiority of integrating ARIMA, LSTM, and Transformer techniques.
Having compared each hybrid model with its own components, the two hybrid models were then compared with each other. In the one-day short-term forecast, ARIMA–Markov is more accurate: compared to the ARIMA–LSTM–Transformer model, its $R^2$ is higher by 0.0293, and its MAE and RMSE are lower by 28.9061 and 43.7782, respectively, probably because the sample size of the short-term forecast dataset is smaller. ARIMA directly models the linear relationship between the current value and recent historical values through the autoregressive term, which makes it more sensitive to short-term trends and naturally suited to small samples and local dependence.
However, deep learning models such as LSTM and Transformer are good at capturing complex nonlinear relationships in long sequences, which requires a large amount of training data. A small amount of data may lead to overfitting and other problems that affect accuracy. Therefore, in the medium- and long-term forecasts, the accuracy of ARIMA–LSTM–Transformer is higher: compared to ARIMA–Markov, the average $R^2$ increases by 0.0247, and the MAE and RMSE decrease by 14.1839 and 28.3393, respectively. Overall, the accuracy of ARIMA–LSTM–Transformer is higher.
3.3. Stability Test for ARIMA–LSTM–Transformer Model
Of the above two hybrid models, ARIMA–Markov involves the Markov transition probability matrix, and there is randomness in the transitions between states; the ARIMA–LSTM–Transformer model involves the weight matrix and the Adam optimization algorithm. Both models therefore involve a certain random process, so stability is also an important indicator of the strengths and weaknesses of a forecast model. To test the stability of ARIMA–Markov and ARIMA–LSTM–Transformer, the three datasets selected above were used, and each dataset was predicted ten times with exactly the same parameters. The $R^2$, MAE, and RMSE of these ten forecasts were each collected into an array of ten values, and the standard deviation of each array was calculated as a measure of stability, as shown in Table 4.
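The stability measure can be sketched as the standard deviation of each indicator across repeated runs. Whether the population or the sample standard deviation was used is not stated, so the population form is an assumption here; the function name and the ten toy values are hypothetical.

```python
import statistics


def stability(runs):
    """Standard deviation of each indicator across repeated forecasts.

    `runs` is a list of dicts, one per repeated forecast, sharing the same keys.
    """
    return {
        name: statistics.pstdev([run[name] for run in runs])  # population std (assumed)
        for name in runs[0]
    }


# Hypothetical MAE values from ten repeated forecasts with identical parameters.
toy_runs = [{"MAE": value} for value in
            [190.0, 192.0, 191.0, 189.0, 190.5, 191.5, 190.0, 192.5, 189.5, 191.0]]
print(stability(toy_runs))
```

A smaller standard deviation means the repeated forecasts agree more closely, that is, the model is more stable under its own randomness.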
From Table 4, it can be concluded that the stability of ARIMA–LSTM–Transformer is better than that of the ARIMA–Markov model in short-term, medium-term, and long-term forecasting, and ARIMA–LSTM–Transformer has the highest stability when predicting long-term data. From the above, it can be seen that the accuracy of ARIMA–LSTM–Transformer is comparable to that of ARIMA–Markov in short-term forecasting and better in both medium-term and long-term forecasting. Therefore, considering both accuracy and stability, the ARIMA–LSTM–Transformer model performs better in wastewater forecasting.
5. Conclusions
To address the complex problem of urban wastewater flowrate forecasting, WOA, ARIMA, LSTM, and Transformer were combined into a hybrid forecasting model, which achieved good results.
The experimental results show that ARIMA can accurately grasp the local characteristics of short-term wastewater flow changes, capture trends, and pass them on to LSTM as one of the characteristics. On this basis, LSTM can capture more complex temporal dependencies, and features such as holidays were also quantified to achieve multivariate regression analysis. Transformer greatly enhances the forecast accuracy of the hybrid model for medium- and long-term data, and WOA optimizes the number of LSTM layers, Encoder layers, and Decoder layers in LSTM and Transformer, reducing the computational cost and the possibility of overfitting the hybrid model.
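To make the role of WOA concrete, the sketch below implements a simplified variant of the Whale Optimization Algorithm (the usual expansion of WOA) over three integer layer counts. It is not the configuration used in this study: the search bounds, population size, iteration count, and especially `toy_validation_loss`, which stands in for training the hybrid model and reading its validation loss, are placeholders.

```python
import math
import random


def woa_minimize(objective, bounds, n_whales=10, n_iter=30, seed=0, b=1.0):
    """Simplified WOA: continuous positions, rounded to integers for evaluation."""
    rng = random.Random(seed)

    def to_candidate(position):
        # Round each coordinate and clip it into its integer bounds.
        return tuple(int(min(max(round(x), lo), hi))
                     for x, (lo, hi) in zip(position, bounds))

    whales = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_whales)]
    best = min((to_candidate(w) for w in whales), key=objective)
    best_score = objective(best)

    for it in range(n_iter):
        a = 2.0 - 2.0 * it / n_iter  # decreases linearly from 2 towards 0
        for i, whale in enumerate(whales):
            A = 2.0 * a * rng.random() - a
            C = 2.0 * rng.random()
            if rng.random() < 0.5:
                # Encircle the best solution when |A| < 1, else explore near a random whale.
                target = best if abs(A) < 1 else rng.choice(whales)
                moved = [t - A * abs(C * t - x) for t, x in zip(target, whale)]
            else:
                # Spiral-shaped move around the best solution.
                ell = rng.uniform(-1.0, 1.0)
                moved = [abs(t - x) * math.exp(b * ell) * math.cos(2 * math.pi * ell) + t
                         for t, x in zip(best, whale)]
            whales[i] = [min(max(x, lo), hi) for x, (lo, hi) in zip(moved, bounds)]
            candidate = to_candidate(whales[i])
            score = objective(candidate)
            if score < best_score:  # greedy update, a simplification of the usual loop
                best, best_score = candidate, score
    return best, best_score


def toy_validation_loss(layers):
    """Placeholder objective; a real one would train the model and return its loss."""
    lstm_layers, encoder_layers, decoder_layers = layers
    return (lstm_layers - 2) ** 2 + (encoder_layers - 3) ** 2 + (decoder_layers - 1) ** 2


layer_bounds = [(1, 6), (1, 6), (1, 6)]  # hypothetical ranges for the three layer counts
best_layers, best_loss = woa_minimize(toy_validation_loss, layer_bounds)
print(best_layers, best_loss)
```

Because every objective evaluation corresponds to a full training run in practice, the population size and iteration count directly set the computational budget of the search.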
After the introduction of the WOA, ARIMA, and LSTM modules, the $R^2$ increased by 10.86%, the MAE was reduced by 25.21%, and the RMSE was reduced by 30.02% compared to a single Transformer, demonstrating a significant improvement. The comparison between single and hybrid models shows that the latter achieve higher accuracy and stability. However, the hybrid model still has some limitations: it may not be able to predict flowrate changes caused by high summer temperatures and water evaporation, or abnormal wastewater flow caused by water and power outages in specific areas of the city, which should be the focus of subsequent research.
Future work includes (1) further optimizing the structure and performance of the model, since its short-term forecast of wastewater flow is still not sufficiently accurate even though the proposed model performed well in the current study, and (2) enhancing the anti-interference ability of the model by incorporating more factors that affect wastewater flow, such as urban population migration and urban water and power outages.