1. Introduction
Population growth, accelerated industrialization, and intensified climate change have heightened the demand for freshwater and exacerbated the global scarcity of water resources [1,2]. Reclaimed water inflows refer to the volume of treated wastewater that is reintroduced into water systems for reuse. Effective management of reclaimed water inflows is essential for supporting various levels of wastewater management and planning [3]. With the proliferation of IoT devices and smart sensors, continuous monitoring and real-time decision-making have become possible, enhancing system responsiveness and management efficiency. This process involves real-time treatment of wastewater to remove contaminants, ensuring it can be safely used for irrigation, industrial processes, and replenishing natural water bodies, while also meeting different water volume demands and maintaining water quality standards. The inflows of reclaimed water are influenced by network factors, demographic characteristics, and meteorological events [4]. During heavy rainfall, wastewater flows can surge dramatically, and variations in flow rates of more than two orders of magnitude within combined sewer systems are not uncommon [5]. Such fluctuations in flow can exceed the hydraulic capacity of treatment plants, potentially disrupting biological treatment processes. Reclaimed water inflows help alleviate freshwater scarcity by reducing reliance on limited freshwater resources. At the same time, they enable the recycling of water resources, reducing the pressure on natural water bodies and supporting long-term environmental sustainability and ecological balance. Therefore, accurately predicting the inflow volumes of reclaimed water plants is essential not only for enhancing operational efficiency and achieving precise control of water quality but also for reducing the demand for freshwater and promoting sustainability in water use [6]. This study assumes strict adherence to water recycling and plant management standards in post-treatment water quality, focusing primarily on the accurate prediction of water volumes.
Over the past few decades, research has primarily focused on developing models based on physical and data-driven approaches to simulate and manage the water cycle processes [7,8,9]. Although physical models are capable of simulating the water cycle, they exhibit limitations in handling the complexity of wastewater flow volumes [10,11]. Operators often rely on empirical judgments or complex physical models for predictions, which poses challenges in practical applications. In contrast, data-driven models, particularly Time Series Statistics (TSS) [12] and Machine Learning (ML) [13] technologies, demonstrate advantages in processing large datasets and adapting to new information [14,15]. Machine learning models such as Decision Trees (DT) [16], Linear Regression (LR) [17], Gradient Boosting Regression (GBR) [18], Support Vector Regression (SVR) [19], eXtreme Gradient Boosting (XGBoost) [20] and Long Short-Term Memory networks (LSTM) have proven effective in revealing complex patterns within reclaimed water volume (RWV) data [21,22]. However, standalone ML models may struggle with highly non-stationary time series that contain substantial noise components without prior raw data preprocessing [23,24,25]. Appropriate time series preprocessing can simplify the original signals, extracting multi-scale features and enabling ML techniques to effectively analyze the raw signals and uncover hidden information within the dataset [26,27]. Common decomposition techniques such as Empirical Mode Decomposition (EMD) [28], Ensemble Empirical Mode Decomposition (EEMD) [29], Complete Ensemble Empirical Mode Decomposition with Adaptive Noise (CEEMDAN) [30], and Variational Mode Decomposition (VMD) [31] enhance ML model performance by decomposing non-stationary signals into simpler sub-signals. Recent research has introduced a novel hybrid predictive model that employs a two-stage decomposition approach (including CEEMDAN and VMD) combined with deep learning algorithms [32]. That study uses these techniques to extract major fluctuations and trends from time series, then employs the Long Short-Term Memory (LSTM) deep learning algorithm to explore the dynamic characteristics of each sub-sequence. Consequently, decomposition integration models that combine time series preprocessing with machine learning techniques have been developed to predict both the quantity and quality of wastewater treatment simultaneously [33,34].
Furthermore, the application of the Transformer model and its attention mechanism in the field of wastewater prediction offers an enhanced method by prioritizing important parts of the data. The Transformer model, introduced by Vaswani et al. [35], has revolutionized the field of Natural Language Processing (NLP) and rapidly become a significant milestone in deep learning research. However, its application in time series prediction, particularly in wastewater treatment, has not been extensively explored. Huang et al. [36] were pioneers in applying the Transformer model to wastewater treatment tasks, achieving a sample prediction accuracy of up to 96.8%. This marked the first application of the Transformer model in the wastewater treatment domain, breaking the constraints of traditional models in terms of predictive accuracy. The study by Peng and Fanchao [37] focused on fault detection rather than direct water volume prediction, but the models built using the Transformer network provided effective solutions for complex issues in wastewater treatment processes, potentially inspiring other applications, including water volume prediction. Although the Transformer model has demonstrated potential in certain aspects of wastewater treatment, current research primarily concentrates on enhancing the model’s predictive performance. There remains a relative scarcity of in-depth studies addressing specific challenges faced by the model in predicting wastewater volumes, including handling seasonal and non-stationary features, as well as sensitivity to peaks and anomalies. This necessitates a dual focus: not only on enhancing the accuracy of the Transformer model but also on ensuring the model can respond sensitively to the non-stationarity and complexity of the data, thereby ensuring robustness and precision in predictions [38,39]. In the realm of convolutional neural networks implementing fuzzy time series predictions, utilizing differencing methods can attenuate the non-stationarity of time series, thereby facilitating low-error predictions for non-stationary time series. Although studies indicate that differencing can improve model efficiency, how to effectively integrate differencing sequences and deep learning models, including the aforementioned decomposition techniques, particularly within the Transformer model, and effectively handle the seasonal changes and non-stationarity of wastewater volume data remains a challenge yet to be fully addressed [40,41]. The self-attention mechanism in long-sequence time predictions can lead to the loss of temporal information, causing existing models to perform poorly in capturing subtle changes in time series data. When predicting wastewater volumes, the Transformer model may ignore the continuity of temporal information due to the characteristics of the self-attention mechanism, resulting in a failure to capture fine variations in water volume. The dot-product self-attention mechanism in the classic Transformer architecture is insensitive to local context, making it unable to accurately distinguish turning points and anomalies, and thus unable to promptly handle unexpected situations.
To address these challenges and enhance the efficiency of time series prediction with the Transformer, this study introduces the ML-CEEMDAN-TSTF differential decomposition integration model, incorporating the following enhancements to the base Transformer model:
(1) A differential transformation was implemented on time series data to explicitly capture crucial information, significantly enhancing the model’s ability to recognize dynamic changes within the data.
(2) The introduction of a Time-Aware Outlier-Sensitive Loss Function allows the model to place greater emphasis on anomalies and abrupt changes during the loss calculation. Consequently, this design significantly enhances the model’s adaptability to key changes and the precision of its predictions.
(3) The adoption of an adaptive sliding window mechanism enables the model to dynamically adjust the size of the processing window based on the specific features of the time series data. Consequently, this approach effectively reveals long-term dependencies and key patterns, thereby enhancing both the efficiency and accuracy of the model in handling complex time series data.
(4) By utilizing an enhanced Transformer model and its decomposition integration technology, the model efficiently distills multidimensional features while reducing white noise. This significantly augments the model’s capability for comprehensive analysis and processing of time series data.
In summary, the primary goal of this research is to establish a precise and reliable real-time forecasting decomposition integration model, while also identifying the optimal model parameters that are well-suited for complex, nonlinear, non-stationary time series, and to train the model to accurately predict wastewater volumes. Specifically, this study aims to demonstrate the effectiveness and reliability of the proposed ML-CEEMDAN-TSTF model through comparative analysis with single ML models and various hybrid forecasting models. Furthermore, the research focuses on categorizing experiments and discussing solutions to address issues of slow time perception and differencing forgetfulness, ensuring optimal model accuracy.
The remainder of this paper is organized as follows: Section 2 provides a detailed description of four single ML models, differencing decomposition algorithms, factor selection methods, and the customized Transformer hybrid forecasting model. Section 3 describes a case study, including the basic introduction, data preparation, and model parameter settings. Section 4 further analyzes the effects of the model. Section 5 emphasizes the advantages and limitations of the methods proposed in this paper. Finally, conclusions are drawn in Section 6.
  2. Materials and Methods
  2.1. Research Framework
This study has developed a differential decomposition integration prediction model based on the TS-Transformer, aimed at accurately forecasting time series data. 
Figure 1 illustrates the framework of the entire decomposition integration model. Initially, three machine learning models were selected: LR, SVR, and GBR. These models are capable of capturing key features of the time series from different dimensions. Specifically, they extract the lagged sequences most correlated with the target sequence from historical data, providing a solid foundation for model training. To address the issue of long-term dependencies and to capture non-linear features within the time series, we further integrated LSTM. Subsequently, to enhance the model’s accuracy and stability, we employed the CEEMDAN differencing decomposition algorithm to perform multimodal decomposition of the series. This step eliminates white noise from the differenced series and isolates components strongly related to the prediction target, thereby strengthening the model’s recognition of seasonal and trend features within the time series. The decomposition by CEEMDAN reveals hidden cyclical and structural information in the data, providing a clearer basis for subsequent analysis. Finally, an integration using the TS-Transformer was performed to combine features generated by the previous prediction and decomposition models, further refining the predictive capability. The TS-Transformer, through its self-attention mechanism, enhances the model’s ability to handle long-distance dependencies within the time series and is trained using a custom time-aware outlier-sensitive loss function, which increases sensitivity to the most recent observations and outliers. Consequently, the integrated model constructed in this study is designed to capture the complex dynamics of time series data, providing accurate predictions for various application scenarios.
  2.2. Single Forecast Models
  2.2.1. Regression Models
In this study, three machine learning regression models were utilized for predicting time series data: Linear Regression (LR), Support Vector Regression (SVR), and Gradient Boosting Regression (GBR). These models were selected due to their respective strengths in handling complex environmental data, and their complementarity during the analytical process.
LR is a foundational statistical method that models the relationship between a dependent variable and one or more independent variables by fitting a linear equation to the observed data. The simplicity of this model lies in its assumption that there is a straight-line relationship between the variables. Its primary advantage is its simplicity and interpretability, making it a preferred choice for initial exploratory analysis. To implement LR, (1) collect and preprocess the dataset; (2) designate the dependent and independent variables; (3) calculate the coefficients using the least squares method; and (4) use the derived equation to make predictions.
SVR operates by finding a hyperplane that best fits the data, aiming to limit the errors within a certain threshold. Its efficacy arises from its ability to manage non-linear data using kernel functions. It offers robustness against outliers and has the flexibility to model non-linear relationships. To deploy SVR, (1) scale and preprocess the data; (2) choose an appropriate kernel function (linear, polynomial, or radial basis function); (3) train the model by solving the optimization problem; and (4) predict values using the trained model.
GBR is an ensemble learning method that builds multiple decision trees sequentially, where each tree corrects the errors of its predecessor. It works effectively because it optimizes a loss function, reducing errors iteratively. GBR stands out for its ability to handle missing data, its resistance to overfitting with adequate tree depth, and its capacity to model complex non-linear relationships. To utilize GBR, (1) initialize the model with parameters like learning rate and number of trees; (2) train the model by fitting it to the data, allowing each tree to learn from the residuals of the previous one; and (3) use the ensemble of trees for predictions.
The selection of the three aforementioned regression models is justified by their unique data processing characteristics and their complementary analytical dimensions, which are vital for the construction of the final integrated model. Linear regression, based on its simplicity, provides a clear starting point for predictions and reveals the linear relationships between variables. SVR, on the other hand, introduces the capacity to handle complex non-linear data, adapting to a more diverse data structure. In contrast, GBR, through its foundation in decision trees, can capture more subtle features and patterns within the data. Integrating these three models into a more advanced model aims to enhance the final prediction’s accuracy and robustness by synthesizing the strengths of different models, thereby providing a more comprehensive data foundation for the final integrated model.
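As an illustration of the four LR steps listed above, the following minimal sketch fits an ordinary least squares model on lagged values of a series. The synthetic inflow data, the 7-day lag depth, and the helper names `make_lagged`, `fit_linear`, and `predict_linear` are assumptions made for this example and are not taken from the study.

```python
import numpy as np

def make_lagged(series, n_lags):
    """Build (X, y): row j of X holds the n_lags values preceding y[j]."""
    series = np.asarray(series, dtype=float)
    n = len(series)
    # Column k-1 contains the series lagged by k steps.
    X = np.column_stack(
        [series[n_lags - k : n - k] for k in range(1, n_lags + 1)]
    )
    y = series[n_lags:]
    return X, y

def fit_linear(X, y):
    """Least squares coefficients; coef[0] is the intercept."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_linear(coef, X):
    return coef[0] + X @ coef[1:]

# Synthetic daily inflow with a weekly cycle (illustrative data only).
rng = np.random.default_rng(0)
t = np.arange(400)
inflow = 100 + 10 * np.sin(2 * np.pi * t / 7) + rng.normal(0, 1, t.size)

X, y = make_lagged(inflow, n_lags=7)
split = len(y) // 2                      # first half for fitting, second for checking
coef = fit_linear(X[:split], y[:split])
pred = predict_linear(coef, X[split:])
rmse = float(np.sqrt(np.mean((pred - y[split:]) ** 2)))
```

SVR or GBR could be substituted for `fit_linear` and `predict_linear` without changing how the lagged inputs are built.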
  2.2.2. Long Short-Term Memory (LSTM) Networks
This study incorporates LSTM networks to address the challenges posed by traditional Recurrent Neural Networks (RNNs) in processing extended temporal sequences, a critical aspect in the realm of predictive analytics [42,43]. LSTM networks are specifically deployed to complement the Transformer model, enhancing the overall predictive accuracy of the integrated system.
At the core of LSTM’s effectiveness is its internal architecture, which coordinates memory cells and hidden states. This design captures both gradual and abrupt temporal shifts in data, a key feature for accurate forecasting. The LSTM model is equipped with three gates: input, forget, and output. These gates modulate the inflow, retention, and outflow of information within each cell, thereby maintaining the integrity and relevance of data through the sequence. The application of LSTM in hydrological modeling, as explored in depth by Kratzert et al. [44], provides a foundation for its deployment in this study. The integration of LSTM networks into this study’s forecasting system not only mitigates the limitations of traditional sequence modeling approaches but also enriches the integrated model’s capacity to handle intricate temporal dynamics. This synergy enhances the overall robustness and precision of the predictive analytics framework developed for time series forecasting.
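To make the role of the three gates concrete, the sketch below implements one time step of a standard LSTM cell in NumPy. The weight shapes, the random initialization, and the input dimension are assumptions for illustration. This is the textbook cell, not the trained network used in this study.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step.

    W: (4H, D) input weights, U: (4H, H) recurrent weights, b: (4H,) bias,
    stacked in the order input gate, forget gate, output gate, candidate.
    """
    H = h_prev.size
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[0:H])            # input gate: how much new content enters
    f = sigmoid(z[H:2 * H])        # forget gate: how much old memory is kept
    o = sigmoid(z[2 * H:3 * H])    # output gate: how much memory is exposed
    g = np.tanh(z[3 * H:4 * H])    # candidate cell content
    c_t = f * c_prev + i * g       # updated memory cell
    h_t = o * np.tanh(c_t)         # updated hidden state
    return h_t, c_t

rng = np.random.default_rng(0)
D, H = 3, 7                        # 3 input features, 7 hidden units (illustrative)
W = rng.normal(0, 0.1, (4 * H, D))
U = rng.normal(0, 0.1, (4 * H, H))
b = np.zeros(4 * H)

h, c = np.zeros(H), np.zeros(H)
for x_t in rng.normal(size=(10, D)):   # run over a 10-step input sequence
    h, c = lstm_step(x_t, h, c, W, U, b)
```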
  2.3. Differencing Decomposition Based on the CEEMDAN Algorithm
When addressing the complex relationships between time series and their influencing factors, relying solely on raw data sequences often fails to reveal the direct connections among these variables. Therefore, an efficient decomposition algorithm is necessary to eliminate white noise from the differenced series and to identify factors strongly correlated with the time series. Traditionally, the EEMD method has been used for this purpose, but it has a flaw: the added white noise is not completely removed by ensemble averaging, so the reconstructed signal retains residual noise, and reducing this error requires a large number of ensemble trials, resulting in lengthy computation times. In contrast, the CEEMDAN method adopted in this study introduces adaptive white noise during the decomposition process. This method effectively reduces the reconstruction error to near zero with fewer ensemble trials, thereby optimizing the data processing procedure.
The original time series data are transformed into a differenced form:
$$D(t) = X(t) - X(t-1) \qquad (1)$$
where $D(t)$ is the differenced series, $X(t)$ is the original series, and $t$ represents the time step, yielding the sequence at time $t$.
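The first-order differencing described above, and the inverse step that turns a differenced forecast back into a volume, can be sketched as follows. The function names and the sample values are hypothetical.

```python
import numpy as np

def difference(x):
    """First-order difference: d[t-1] = x[t] - x[t-1]."""
    x = np.asarray(x, dtype=float)
    return x[1:] - x[:-1]

def restore_one_step(prev_actual, diff_forecast):
    """Invert the differencing for a one-step forecast."""
    return prev_actual + diff_forecast

inflow = np.array([120.0, 123.5, 119.0, 130.2])   # illustrative values
d = difference(inflow)                            # signed day-to-day changes
rebuilt = restore_one_step(inflow[:-1], d)        # recovers inflow[1:]
```

The differenced series is one element shorter than the original and takes both positive and negative values, which matters later when choosing a normalization range.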
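Employing the CEEMDAN algorithm, the differenced data series $D(t)$ undergoes multimodal decomposition as follows. Here, $E_k(\cdot)$ denotes the operator that extracts the $k$-th modal component by EMD, $\omega^{j}(t)$ is the $j$-th realization of Gaussian white noise, and $N$ is the number of realizations.
Step 1: First, we calculate the first modal component, similarly to the first independent component in EMD:
$$IMF_1(t) = \frac{1}{N}\sum_{j=1}^{N} E_1\left(D(t) + \varepsilon_0\,\omega^{j}(t)\right) \qquad (2)$$
Step 2: Next, we compute the first unique residual signal:
$$r_1(t) = D(t) - IMF_1(t) \qquad (3)$$
Step 3: Decompose the remaining signals, repeating the experiments $N$ times, until the second modal component is obtained:
$$IMF_2(t) = \frac{1}{N}\sum_{j=1}^{N} E_1\left(r_1(t) + \varepsilon_1\,E_1\left(\omega^{j}(t)\right)\right) \qquad (4)$$
In this process, $\varepsilon_i$ represents the noise coefficient.
Step 4: For each subsequent step $i = 2, \ldots, K-1$, calculate the $i$th residual signal $r_i(t)$ in the same manner as Step 3. Continue the decomposition until no further modal components can be obtained:
$$r_i(t) = r_{i-1}(t) - IMF_i(t), \qquad IMF_{i+1}(t) = \frac{1}{N}\sum_{j=1}^{N} E_1\left(r_i(t) + \varepsilon_i\,E_i\left(\omega^{j}(t)\right)\right) \qquad (5)$$
Step 5: At each step, determine if the number of extrema in the residual signal is at most two. If this condition is satisfied, the algorithm stops, and the final $K$ modal components are obtained. The final residual signal can be expressed as follows:
$$R(t) = D(t) - \sum_{i=1}^{K} IMF_i(t) \qquad (6)$$
Thus, the original differenced data $D(t)$ can ultimately be decomposed into $K$ modal components and the residual signal, as shown in Equation (7):
$$D(t) = \sum_{i=1}^{K} IMF_i(t) + R(t) \qquad (7)$$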
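The control flow of Steps 1 to 5 can be sketched in a few lines. To keep the example self-contained, the EMD sifting operator is replaced by a crude stand-in (the signal minus its moving average), so this is a structural illustration of the noise-assisted ensemble averaging and not a faithful CEEMDAN implementation. The function names, the noise scale `eps`, and the synthetic series are all assumptions.

```python
import numpy as np

def first_mode(x, width=5):
    """Stand-in for the EMD operator E_1: signal minus its moving average."""
    kernel = np.ones(width) / width
    padded = np.pad(x, width // 2, mode="edge")       # width must be odd
    return x - np.convolve(padded, kernel, mode="valid")

def kth_mode(x, k, width=5):
    """Stand-in for E_k: peel off modes one by one and return the k-th."""
    residual = x.copy()
    for _ in range(k):
        mode = first_mode(residual, width)
        residual = residual - mode
    return mode

def n_extrema(x):
    """Count local extrema via sign changes of the first difference."""
    s = np.sign(np.diff(x))
    return int(np.sum(s[1:] * s[:-1] < 0))

def ceemdan_sketch(d, max_modes=6, n_trials=50, eps=0.2, seed=0):
    d = np.asarray(d, dtype=float)
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal((n_trials, d.size))
    scale = eps * d.std()
    # Step 1: first mode as the ensemble mean over noisy copies of the signal.
    modes = [np.mean([first_mode(d + scale * w) for w in noise], axis=0)]
    residual = d - modes[0]                            # Step 2
    for k in range(1, max_modes):                      # Steps 3 and 4
        if n_extrema(residual) <= 2:                   # Step 5 stopping rule
            break
        imf = np.mean(
            [first_mode(residual + scale * kth_mode(w, k)) for w in noise],
            axis=0,
        )
        modes.append(imf)
        residual = residual - imf
    return np.array(modes), residual

# Differenced synthetic inflow (illustrative data only).
rng = np.random.default_rng(1)
t = np.arange(300)
inflow = 100 + 10 * np.sin(2 * np.pi * t / 7) + rng.normal(0, 1, t.size)
d = np.diff(inflow)
modes, residual = ceemdan_sketch(d)
```

Because each residual is defined by subtraction, the modes and the final residual sum back to the input, as in the final decomposition identity. In practice, a tested EMD library would take the place of `first_mode`.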
  2.4. Decision Tree for Factor Importance Measurement
In the evaluation of the importance of influencing factors, decision trees are widely utilized due to their intuitiveness and ease of interpretation across various data analysis scenarios. As depicted in Figure 2, a decision tree is visualized as a flowchart, where the root and subsequent internal nodes conduct attribute tests, with branches indicating possible outcomes of these tests. Each terminal leaf node defines a category label or a predicted value, and every path from the root to the leaves constitutes a set of rules for classification or regression decisions. With such a structure, decision trees not only streamline the decision-making process but also offer a transparent perspective on how features influence prediction outcomes.
The construction of a decision tree begins with the analysis of the input time series data, comprising a series of influencing factors. During the data processing stage, the decision tree model progressively reveals, much like assembling a puzzle, which factors are essential to the output result. At each node, the model evaluates and selects an influencing factor to test, dividing the data into two subsets corresponding to the different value ranges of the factor.
These subsets are further tested and divided until no further splitting is possible, culminating in the formation of leaf nodes. Each leaf node contains a set of factors that make a decisive contribution to the prediction outcome. The paths leading to these leaf nodes represent a series of decision rules from input data to the prediction result. Through this recursive partitioning process, the decision tree effectively filters out the time series factors with the greatest impact on the prediction target.
In this process, the model assesses the effectiveness of each influencing factor in differentiating the data, converting it into a score that reflects the importance of each factor. Ultimately, those factors contributing most significantly to the score are considered the most important influencing factors. Thus, we can identify key time series data points for model prediction, bridging historical information to future forecasts. The decision tree makes the complex process of data analysis coherent and manageable, enabling us to easily identify and focus on those key factors that have the most substantial impact on the results.
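The scoring idea described above can be illustrated with a depth-one simplification: for each factor, find the single threshold split that reduces the variance of the target the most, and normalize these gains into importance scores. A full decision tree accumulates such gains over all of its splits. The function names and synthetic data here are assumptions for illustration.

```python
import numpy as np

def best_split_gain(x, y):
    """Largest reduction in squared error from one threshold split on x."""
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    total = np.var(y) * y.size
    best = 0.0
    for i in range(1, y.size):
        if x_sorted[i] == x_sorted[i - 1]:
            continue                     # no threshold separates equal values
        left, right = y_sorted[:i], y_sorted[i:]
        sse = np.var(left) * left.size + np.var(right) * right.size
        best = max(best, total - sse)
    return best

def factor_importance(X, y):
    """Normalized split gains, one score per column of X."""
    gains = np.array([best_split_gain(X[:, j], y) for j in range(X.shape[1])])
    return gains / gains.sum()

# Three candidate factors; only the first one drives the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 5.0 * (X[:, 0] > 0) + rng.normal(0, 0.1, 300)
importance = factor_importance(X, y)
```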
  2.5. Ensemble Model
  2.5.1. Time-Aware Outlier-Sensitive Transformer (TS-Transformer)
Currently, Transformer models have achieved notable success in the fields of image, text, audio, and time-series data processing, renowned for their capabilities in parallel computing and capturing long-range dependencies. In this study, we have employed a customized Transformer model, the Time-Aware Outlier-Sensitive Transformer. This model adheres to the original Transformer architecture, consisting of two primary modules: the encoder and the decoder. The structure of our Transformer prediction model is presented in Figure 3.
The initial layer of the encoder is the input embedding, which maps each feature of the original time series into a high-dimensional vector space. This process is formulated as
$$E = XW_e + b_e$$
where $X$ is the input feature matrix, $W_e$ is the weight matrix of the embedding layer, and $b_e$ is the bias term.
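After embedding, each vector undergoes positional encoding to retain the sequence order. The encoding for each position $pos$ and dimension $i$ in a vector of dimension $d_{model}$ is given by
$$PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$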
Each encoder layer comprises a self-attention sub-layer followed by a feed-forward sub-layer, each with a subsequent normalization step. The self-attention mechanism in the $l$-th layer is calculated as
$$\mathrm{Attention}\left(Q^{(l)}, K^{(l)}, V^{(l)}\right) = \mathrm{softmax}\left(\frac{Q^{(l)} {K^{(l)}}^{\top}}{\sqrt{d_k}}\right) V^{(l)}$$
where $Q^{(l)}$, $K^{(l)}$, $V^{(l)}$ are the query, key, and value matrices derived from the input to the $l$-th layer, and $d_k$ is the dimension of the key vectors. This is the scaled dot-product attention.
The decoder’s initial input aligns with the last sample point of the encoder, ensuring a seamless transition into the forecasting phase.
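In addition to self-attention, each decoder layer implements a cross-attention sub-layer that focuses on the encoder’s output, formulated as
$$\mathrm{CrossAttention}^{(l)} = \mathrm{softmax}\left(\frac{Z^{(l)} H^{\top}}{\sqrt{d_k}}\right) H$$
where $H$ is the output from the final encoder layer, $Z^{(l)}$ is the output from the self-attention sub-layer of the $l$-th decoder layer, and $d_k$ is the dimension of the key vectors.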
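The encoder input pipeline and the two attention sub-layers described above can be traced in NumPy as a minimal sketch. It uses a single attention head, random weights, and no feed-forward or normalization layers. All dimensions and variable names are assumptions for illustration.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal encoding: sine on even indices, cosine on odd (d_model even)."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angle = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)      # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention; returns the output and the weights."""
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, n_features, d_model = 30, 12, 16

# Input embedding followed by positional encoding.
X = rng.normal(size=(seq_len, n_features))
W_e = rng.normal(0, 0.1, (n_features, d_model))
b_e = np.zeros(d_model)
H0 = X @ W_e + b_e + positional_encoding(seq_len, d_model)

# Encoder self-attention: queries, keys, and values all come from H0.
W_q, W_k, W_v = (rng.normal(0, 0.1, (d_model, d_model)) for _ in range(3))
enc_out, enc_weights = attention(H0 @ W_q, H0 @ W_k, H0 @ W_v)

# Decoder cross-attention: the query comes from the decoder state (here the
# last encoder sample), while keys and values come from the encoder output.
C_q, C_k, C_v = (rng.normal(0, 0.1, (d_model, d_model)) for _ in range(3))
dec_state = H0[-1:]
cross_out, cross_weights = attention(dec_state @ C_q, enc_out @ C_k, enc_out @ C_v)
```

Each row of the weight matrices is a probability distribution over time steps, which is what lets the model emphasize some parts of the input window over others.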
To capture the varying influence of different time points, a time attention module assigns weights to each timestep, enhancing the model’s sensitivity to temporal dynamics.
To further enhance the model’s performance in accuracy, we have integrated a custom loss function, ‘Time-Aware Outlier-Sensitive Loss’, which dynamically adjusts its focus on recent data points and outliers during the training process. This loss function is formulated as
          
          where 
 and 
 are the actual and predicted values at time 
, 
 is a linearly decreasing time weight, 
 and 
 are the mean and standard deviation of the errors, and 
 and 
 are trainable parameters. This advanced loss function enhances the model’s ability to prioritize recent observations and adaptively respond to outliers.
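A loss with these ingredients can be sketched as follows. The functional form is an assumption chosen for illustration and is not taken from this paper: squared errors are scaled by a time weight that decreases linearly with the age of the sample, and errors lying more than `k` standard deviations from the mean error receive an extra penalty. In a deep learning framework `alpha` and `beta` would be trainable; here they are plain floats.

```python
import numpy as np

def time_aware_outlier_loss(y_true, y_pred, alpha=1.0, beta=1.0, k=2.0):
    """Hypothetical time-weighted, outlier-weighted squared error."""
    err = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    n = err.size
    age = np.arange(n)[::-1]                 # 0 for the most recent sample
    w = 1.0 - age / max(n - 1, 1)            # weight falls linearly with age
    mu, sigma = err.mean(), err.std()
    outlier = np.abs(err - mu) > k * sigma   # flag unusually large errors
    weight = (1.0 + alpha * w) * (1.0 + beta * outlier)
    return float(np.mean(weight * err ** 2))

# Illustrative errors: unit noise plus one abrupt miss at the end.
rng = np.random.default_rng(0)
y_true = rng.normal(100, 5, 50)
y_pred = y_true + rng.normal(0, 1, 50)
y_pred[-1] = y_true[-1] - 10.0

loss = time_aware_outlier_loss(y_true, y_pred)
mse = float(np.mean((y_true - y_pred) ** 2))
```

With `alpha = beta = 0` the expression reduces to the ordinary mean squared error, which gives a convenient baseline for checking how much the two weights change the objective.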
Additionally, to optimize the training of our Transformer model, we employ the Adam optimizer, a method known for its efficiency and effectiveness in handling sparse gradients and large-scale data operations. The Adam optimizer, an algorithm that combines the advantages of two other popular methods, Adagrad [45] and RMSprop [46], is particularly suitable for our application due to its adaptive learning rate capabilities. The optimizer adjusts the learning rates of each parameter through moment estimation, enhancing the convergence speed and performance of the model.
The integration of a Time-Aware Outlier-Sensitive Loss Function and the Adam optimizer into the Transformer model has substantially enhanced its capability to identify and interpret complex temporal patterns in wastewater flow data. By effectively utilizing dynamically tunable parameters for temporal weighting and outlier sensitivity within the model’s loss function, this approach has consequently led to a significant improvement in overall predictive performance for time series forecasting.
  2.5.2. Real-Time Forecast
The time series dataset was divided into two halves: a calibration set and a validation set, each representing 50% of the original dataset’s total length. Four predictive models were employed: LR, SVR, GBR, and LSTM. The LSTM model’s input selection focused on time-lagged factors. From all related factors, decision tree analysis was employed to evaluate the effectiveness of each factor in distinguishing the data. The 10 sequences with the highest correlation were selected as inputs, with the original sequence volume serving as the target sequence. This approach enables the identification and selection of the most influential factors based on their capability for classification and accurate prediction of the target sequence. The model training process was tuned to maximize the NSE value to ensure the model’s optimal performance.
The original sequence was transformed into a differenced sequence, which was then decomposed using the CEEMDAN algorithm, yielding nine sequences of varying frequencies, labeled imf1 through imf9. These IMFs, along with the regression sequences and LSTM-generated forecast sequences, were selected as forecasting factors, with the differenced sequence as the target sequence. A fixed time step preceding the forecast point was selected, and the TS-Transformer model was employed to integrate these sequences of diverse dimensions and frequencies. The Time-Aware Outlier-Sensitive Loss was used to dynamically balance the sensitivity to the latest data points and outliers, thereby optimizing the model’s performance in time series forecasting.
In this study, a dynamic training set window was adopted to train the Transformer model, simulating a real-time prediction environment. Initially, the differenced data set was divided into three parts: model training, model validation, and the rest assigned to the test set for performance evaluation. The TS-Transformer model predicted one period for each differenced subsequence of the inflow. After each prediction, the differenced forecast value was added to the previous day’s actual value, yielding the final predicted volume. Simultaneously, the sliding window was advanced one time step, ensuring the inclusion of the most recent data points. As new observation points emerged during the iterative training process, they were continuously integrated into the training set, thereby updating the model’s knowledge base. This approach led to a gradual increase in the training set’s length and a corresponding decrease in the test set, maintaining a constant size of the overall dataset. This strategy emulated a continuous learning environment, allowing the model to adapt to the latest data trends and ensuring consistency in the evaluation process, as well as real-time verification of the model’s predictive capabilities. The flowchart of the whole real-time forecast experiment is shown in Figure 4.
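The rolling procedure described above (predict the next difference, add it to the previous day's actual value, then advance the window by one step) can be sketched as a loop. The predictor is a deliberately simple stand-in, the mean of the recent differences, used in place of the TS-Transformer. The function names, window length, and data are assumptions.

```python
import numpy as np

def mean_diff(history):
    """Stand-in predictor: average of the differences in the window."""
    return float(np.mean(history))

def rolling_forecast(inflow, window, predict_diff, start):
    """One-step-ahead forecasts for days start..end using only past data."""
    inflow = np.asarray(inflow, dtype=float)
    diff = np.diff(inflow)                    # diff[t-1] = inflow[t] - inflow[t-1]
    preds = []
    for t in range(start, len(inflow)):       # requires start >= 2
        # Differences observed before day t, limited to the sliding window.
        history = diff[max(0, t - 1 - window): t - 1]
        d_hat = predict_diff(history)         # forecast of the next difference
        preds.append(inflow[t - 1] + d_hat)   # add back the previous actual value
    return np.array(preds)

# Illustrative series: slow upward trend plus noise.
rng = np.random.default_rng(0)
series = 100 + 0.5 * np.arange(60) + rng.normal(0, 0.2, 60)
preds = rolling_forecast(series, window=7, predict_diff=mean_diff, start=30)
```

A trainable model would be refit, or at least updated, inside the loop as each new observation arrives. The stand-in here has no parameters, so only the window moves.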
  2.6. Performance Evaluation Criteria
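To ensure an objective evaluation of predictive performance, the experiments used the following metrics to quantitatively assess the accuracy of the models: Root Mean Square Error (RMSE), Nash–Sutcliffe Efficiency Coefficient (NSE), Mean Relative Error (MRE), and the Correlation Coefficient (Corr). The calculation formulas for each metric are presented below.
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2} \qquad (15)$$
$$\mathrm{NSE} = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2} \qquad (16)$$
$$\mathrm{MRE} = \frac{1}{n}\sum_{i=1}^{n}\frac{\left|y_i - \hat{y}_i\right|}{y_i} \qquad (17)$$
$$\mathrm{Corr} = \frac{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)\left(\hat{y}_i - \bar{\hat{y}}\right)}{\sqrt{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2 \sum_{i=1}^{n}\left(\hat{y}_i - \bar{\hat{y}}\right)^2}} \qquad (18)$$
In these formulas, $n$ represents the number of samples, $y_i$ is the observed value, $\hat{y}_i$ is the predicted value, and $\bar{y}$ and $\bar{\hat{y}}$ are the means of the observed and predicted values, respectively. The model is considered to perform optimally when the NSE and Corr approach 1, and both the RMSE and MRE approach 0, reflecting high accuracy and reliability of the predictions.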
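The four metrics can be computed directly from their standard definitions, as in the sketch below. The function names and the sample values are illustrative, and the MRE is written for strictly positive observations, as is the case for inflow volumes.

```python
import numpy as np

def rmse(obs, pred):
    return float(np.sqrt(np.mean((obs - pred) ** 2)))

def nse(obs, pred):
    """Nash-Sutcliffe efficiency: 1 minus error variance over observed variance."""
    return float(1.0 - np.sum((obs - pred) ** 2) / np.sum((obs - obs.mean()) ** 2))

def mre(obs, pred):
    """Mean relative error; assumes no observation is zero."""
    return float(np.mean(np.abs(obs - pred) / np.abs(obs)))

def corr(obs, pred):
    return float(np.corrcoef(obs, pred)[0, 1])

obs = np.array([10.0, 12.0, 14.0, 16.0])
pred = obs + 1.0                      # a forecast with a constant bias of 1

scores = {
    "RMSE": rmse(obs, pred),
    "NSE": nse(obs, pred),
    "MRE": mre(obs, pred),
    "Corr": corr(obs, pred),
}
```

The biased forecast keeps a correlation of 1 while the NSE drops to 0.8, which shows why the correlation coefficient alone is not a sufficient accuracy measure.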
  3. Case Study
  3.1. Study Area and Data
Beijing, as the capital and the pivotal hub of the coordinated development of the Jing-Jin-Ji region, grapples with inherent water scarcity, a situation exacerbated by its categorization as a severely water-deficient area. Water resources constitute the foundation and lifeline of the capital’s development. As of 2022, Beijing boasts a permanent population exceeding 21 million, with an average daily water usage of 163.22 L per capita and a water supply coverage rate reaching 99.81%. Consequently, the use of recycled water plays a vital role in addressing Beijing’s water resource challenges. To maximize the effective use of the city’s recycled water sources, numerous recycling plants have been constructed, significantly mitigating the water shortage in Beijing. In 2022, the city’s total water supply amounted to 14.99 billion cubic meters, marking a marginal decrease of 0.17% compared to previous years, while the utilization and allocation of recycled water exceeded 1.2 billion cubic meters. In this context, investigating the precision and robustness of key technologies in water recycling plants is of paramount importance.
This study focuses on the CH and YF reclaimed water plants in the Haidian District of Beijing to validate the CEEMDAN-TAOST model. The plants’ historical inflow data span 2435 days, from 1 January 2015 to 31 August 2021. This study revolves around the actual historical and meteorological data specific to the CH and YF plants and the Haidian District of Beijing, conducting an in-depth analysis of the plants’ water inflow mechanisms, with rainfall data obtained from the WQ Rainfall Station. A comprehensive historical dataset encompassing various factors was established, from which key predictive elements were extracted. Utilizing these elements, a predictive model focusing on the day-to-day difference decomposition and integrated deep learning has been developed, concentrating on the accurate prediction of water inflow to the CH and YF plants.
  3.2. Data Normalization
Given the nature of time series prediction, which is composed of sequences across multiple dimensions and exhibits significant numerical disparity between decomposed and predicted sequences, it is therefore necessary to normalize the data during the preprocessing stage. In this study, all forecasting factors have been standardized to a common scale of [−1, 1]. The predictors are selected from historical lagged information in measured data and real-time forecasting data, and are categorized into three groups: water quantity data, including the influent flow rate of the wastewater treatment plant; climatic factors, comprising maximum temperature, minimum temperature, and rainfall; and other factors, such as holidays and residential water consumption. For the forecasting model in this study, the first round of forecasting models normalizes using Equation (19), while the Transformer model during decomposition integration normalizes using Equation (20):
        where $x'$ is the normalized value and $x$ is the target sequence, with $x_{\max}$ and $x_{\min}$ representing the maximum and minimum values in the original sequence, respectively. Considering that the differenced series of observations includes both positive and negative amplitudes, Equation (20) is used for normalization in the integrated model. The above equations ensure that all features are scaled equivalently, circumventing gradient issues caused by inconsistencies in feature dimensions, and thereby promoting numerical stability during the model training process.
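As a minimal sketch of the two scalings, assuming Equation (19) is the standard min-max mapping to [0, 1] and Equation (20) its rescaling to [−1, 1] (both forms, and the function names `minmax_01`, `minmax_pm1`, and `inverse_pm1`, are assumptions made for illustration rather than the authors' code):

```python
import numpy as np

def minmax_01(x):
    """Assumed form of Eq. (19): (x - x_min) / (x_max - x_min), range [0, 1]."""
    x = np.asarray(x, dtype=float)
    x_min, x_max = x.min(), x.max()
    # A constant series (x_max == x_min) would divide by zero and needs
    # separate handling; it is left out of this sketch.
    return (x - x_min) / (x_max - x_min)

def minmax_pm1(x):
    """Assumed form of Eq. (20): 2 * (x - x_min) / (x_max - x_min) - 1, range [-1, 1]."""
    return 2.0 * minmax_01(x) - 1.0

def inverse_pm1(z, x_min, x_max):
    """Map values scaled to [-1, 1] back to the original units."""
    z = np.asarray(z, dtype=float)
    return (z + 1.0) / 2.0 * (x_max - x_min) + x_min

# Hypothetical differenced inflow values with positive and negative amplitudes.
diffs = np.array([-12.0, 3.5, 0.0, 40.0, -8.0])
scaled = minmax_pm1(diffs)
restored = inverse_pm1(scaled, diffs.min(), diffs.max())
```

In practice, the minimum and maximum would be taken from the training portion only and reused for validation and test data, so that no information from later periods leaks into the scaling.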
  3.3. Model Configuration
To assure an impartial appraisal of the predictive performance, four forecasting schemes were established for comparative analysis: the ML model, decomposition integration model, comprehensive model under the Transformer, and comprehensive model under differencing scenarios. These evaluations utilized daily RWV data from the CH and YF reclaimed water plants, as detailed in Section 3.1.
All ML model inputs incorporated historical and real-time information, including rainfall, holidays, and maximum and minimum temperatures, along with measured RWV data. A decision tree was used to select influential factors from these inputs, and different combinations were tested in real-time forecasting to enhance model performance. For the RWV data, sequences lagged from 1 to 10 days were taken as pre-inputs, combining historical factors with influencing factors over these time frames (i.e., lags of 1, 2, 3, 4, 5, 6, 7, 8, 9, and 10 days). From these inputs, the 10 sequences most strongly correlated with the time series were selected to predict future data points. This approach enabled the models to capture key time series features, thereby improving forecast accuracy.
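The lag construction and correlation-based selection described above can be sketched as follows. This is an illustration under stated assumptions: Pearson correlation is assumed as the ranking criterion, and the helper names `lagged_matrix` and `top_correlated` are hypothetical. The decision-tree factor selection is not reproduced here.

```python
import numpy as np

def lagged_matrix(series, max_lag=10):
    """Return (X, y) where column k-1 of X is the series lagged by k days."""
    s = np.asarray(series, dtype=float)
    y = s[max_lag:]  # targets start once all lags are available
    X = np.column_stack(
        [s[max_lag - k: len(s) - k] for k in range(1, max_lag + 1)]
    )
    return X, y

def top_correlated(candidates, target, n_keep=10):
    """Indices of the n_keep columns with the largest |Pearson r| to target.

    Constant columns yield NaN correlations and should be dropped beforehand.
    """
    r = np.array(
        [np.corrcoef(candidates[:, j], target)[0, 1]
         for j in range(candidates.shape[1])]
    )
    order = np.argsort(-np.abs(r))  # strongest correlation first
    return order[:n_keep], r

# Hypothetical daily inflow and rainfall series of equal length.
rng = np.random.default_rng(0)
inflow = np.cumsum(rng.normal(size=200)) + 100.0
rain = rng.gamma(shape=1.0, scale=2.0, size=200)

X_flow, y = lagged_matrix(inflow, max_lag=10)
X_rain, _ = lagged_matrix(rain, max_lag=10)
candidates = np.hstack([X_flow, X_rain])   # 20 lagged candidate sequences
keep, r = top_correlated(candidates, y, n_keep=10)
selected = candidates[:, keep]             # the 10 retained inputs
```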
To accommodate computational resource limitations and identify optimal model parameters, we adjusted the hyperparameters of LSTM through empirical methods, focusing on optimizing the number of hidden layers and neuron units, which were set to 1 and 7, respectively, while the batch size remained at 1000; other LSTM hyperparameters adopted default settings. Experimental analysis indicated that a single hidden layer outperformed multiple hidden layers on both the original series and the decomposed subsequences; consequently, we selected a single-layer architecture. The parameters for LR, SVR, and GBR models retained their default values. The differenced series was decomposed into nine subsequences with distinct features through CEEMDAN, each independently input into the Transformer model for prediction.
For the foundational configuration of the Transformer model, the historical input window was set at 30 days, with the future period for predictions fixed at 1 day. The number of training epochs was set to 500 to ensure the model adequately learned the complex patterns in the data. The number of input features was 14, encompassing the differenced series of historical observations, the outputs of the ML models (LR, SVR, GBR, and LSTM), and the CEEMDAN-decomposed differenced series. The differenced dataset was divided into three parts: 50% for model training, 10% for model validation, and the remaining 40% for the test set. For the core configuration of the Transformer model, after empirical testing, the embedding dimension was set at 290 to capture a rich representation of input features; the number of neurons in the hidden layer was 319, providing sufficient model complexity; the number of heads was set to 29, allowing the model to learn different aspects of the data in parallel; and the number of blocks for both encoder and decoder was fixed at 3 to form a deep network structure. Additionally, a dropout rate of 0.05 was used during model training to prevent overfitting, the learning rate was set at 0.0001 to ensure stable convergence, and the batch size was maintained at 28. Finally, the predicted differenced series was combined with historical observations to obtain the final forecasting results, which were evaluated using the metrics outlined in Section 2.6.
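For reference, the reported settings can be collected in a small configuration object together with the 50/10/40 split. This is a sketch only: the class and field names are hypothetical, and the split is assumed to be chronological (no shuffling), which is the usual choice for time series but is not stated explicitly in the text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TransformerConfig:
    window: int = 30            # days of history per input sample
    horizon: int = 1            # days ahead to predict
    epochs: int = 500
    n_features: int = 14        # 1 differenced series + 4 ML outputs + 9 CEEMDAN subsequences
    d_model: int = 290          # embedding dimension
    d_hidden: int = 319         # hidden-layer neurons
    n_heads: int = 29
    n_blocks: int = 3           # encoder and decoder blocks
    dropout: float = 0.05
    learning_rate: float = 1e-4
    batch_size: int = 28

    @property
    def head_dim(self) -> int:
        """Per-head width; multi-head attention needs d_model divisible by n_heads."""
        if self.d_model % self.n_heads != 0:
            raise ValueError("d_model must be divisible by n_heads")
        return self.d_model // self.n_heads

def chronological_split(n, train=0.5, val=0.1):
    """Return slices for train/validation/test, keeping time order."""
    n_train = int(n * train)
    n_val = int(n * val)
    return (slice(0, n_train),
            slice(n_train, n_train + n_val),
            slice(n_train + n_val, n))

cfg = TransformerConfig()
train_idx, val_idx, test_idx = chronological_split(2434)  # 2435 days -> 2434 differences
```

With these values each attention head works on 290 / 29 = 10 dimensions, and the 14 input features are consistent with one differenced series, four ML model outputs, and nine CEEMDAN subsequences.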
  5. Discussion
  5.1. Advantages of ML-CEEMDAN-TSTF
The ML-CEEMDAN-TSTF model presented in this paper boasts several key advantages, including the integration strengths of Transformer, the reinforced training benefits of the custom time-aware outlier-sensitive loss function, the feature extraction capabilities of CEEMDAN differencing decomposition, and the strategic selection of differenced sequences as the target for prediction.
The ensemble advantage of Transformer is manifested in its robust feature extraction capacity, particularly in capturing short-term dependencies and long-term trends in time series. As observed in Section 4.2, while the Transformer’s performance as a fundamental forecasting model was unremarkable, Section 4.3 demonstrated its formidable parallel capabilities in integrating various models’ features and strengths. This integration is crucial for addressing the challenges faced by conventional deep learning models when dealing with time series data characterized by seasonality and non-stationarity. By reducing the demand for training data volume, the Transformer not only diminishes reliance on computational resources but also significantly improves the efficiency of model training.
The enhanced training benefits of the time-aware outlier-sensitive loss function enable the model to predict sequences with anomalies or significant changes more accurately. The design of this loss function overcomes a limitation of traditional models, which struggle to account for the difference between the predicted time point and the preceding one, and thereby improves the model’s responsiveness to unforeseen events and its prediction accuracy.
The feature extraction prowess of differencing decomposition lies in its ability to meticulously dissect time series, identifying and eliminating non-stationarity and white noise, while also uncovering strong correlations with key influencing factors. This approach furnishes Transformer with purer and more representative input features, laying the groundwork for precise predictions.
Lastly, the benefit of choosing differenced sequences as the target for prediction lies in its efficacy in enhancing the model’s capacity to capture seasonal trends and cyclical patterns. By integrating several machine learning models capable of extracting lagged features before Transformer, past time points are effectively recognized and utilized, further bolstering the model’s ability to identify cyclical patterns. A detailed comparative analysis of these advantages will be the focus of Section 5.2.
In conclusion, ML-CEEMDAN-TSTF, with its distinctive design and structure, has successfully addressed several challenges faced by conventional time series forecasting models. It not only offers theoretical innovation but has also demonstrated superior performance in experimental settings, particularly when processing complex time series data exhibiting seasonal and non-stationary characteristics.
  5.2. Advantages of Differencing in Decomposition Integration Models
Historically, two misconceptions have often arisen in forecasting. It is well known that the factors influencing the current period’s data points are essential for forecasting results; however, in daily forecasting, using current-period meteorological information might suggest foreknowledge, while using data from the previous moment as influencing factors is often believed to yield suboptimal outcomes. Even if forecast data for the day can be obtained early, running the models in real projects takes time, as do the actions taken based on the forecasts (warnings, drills, and contingency plans); therefore, daily forecasts often cannot be applied in time to formulate corresponding strategies. The belief that previous-moment data are of little use is a misconception; factors from the previous moment can significantly affect the actual model operation [49]. The second misconception is that models often overlook the significant impact of differences between previous days and the forecast day on the results, a problem we refer to as ‘differencing forgetfulness’. Many researchers and practitioners assume that models inherently possess the ability to recognize and utilize such temporal scale differences [50,51]. In reality, however, the complexity of time series data often exceeds the processing capabilities of basic model structures, especially when dealing with environmental data characterized by strong seasonality and sudden events. Therefore, this capability often needs to be explicitly introduced through specific data processing and model design. With these two misconceptions in mind, the key issue becomes: how can we enable models to extract the relationship between the differences from the previous day to the forecast day and the inflow volume?
To address this issue, an effective strategy is to introduce the differencing information as an explicit feature into the model. This means that instead of simply inputting sequential data points into the model, we calculate the differences between adjacent time points to provide the model with information about dynamic changes directly. The core of this approach is that it allows the model to focus directly on key change points within the time series, rather than being overwhelmed by stable or minor changes. Through this method, the model can more sensitively respond to key changes in the actual environment, such as sudden weather changes leading to sharp increases or decreases in flow, thereby enhancing the accuracy of predictions. Additionally, this method also helps the model better understand the nonlinear characteristics and non-stationarity of time series, further improving the model’s adaptability to complex environmental data and forecasting performance.
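A minimal sketch of first-order differencing and of recovering a one-step forecast from a predicted difference is given below. The function names and the sample values are hypothetical; the only assumption about the method is that the forecast for day t equals the observation on day t−1 plus the predicted difference, in line with the statement that predicted differences are combined with historical observations.

```python
import numpy as np

def difference(series):
    """First-order differences d_t = y_t - y_(t-1); output is one element shorter."""
    s = np.asarray(series, dtype=float)
    return s[1:] - s[:-1]

def restore_one_step(prev_obs, pred_diff):
    """One-step forecast: previous observation plus the predicted difference."""
    return np.asarray(prev_obs, dtype=float) + np.asarray(pred_diff, dtype=float)

# Hypothetical inflow with a rainfall-driven jump on the fourth day.
flow = np.array([100.0, 102.0, 101.0, 140.0, 138.0])
d = difference(flow)                        # [2., -1., 39., -2.]

# With perfect difference predictions, the original series is recovered.
recovered = restore_one_step(flow[:-1], d)  # equals flow[1:]
```

The jump of 39 units stands out far more clearly in the differenced series than in the raw levels, which is the property the differencing strategy relies on.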
To validate this theoretical approach, we conducted comparative experiments using multiple groups of models, including ML-CEEMDAN-TSTF, ML-CEEMDAN-Transformer, and ML-TSTF, to forecast using both differenced sequences and original water volume sequences, with results illustrated in Figure 8. Forecasts using differenced sequences demonstrated improvements in NSE of no less than 3.16% across different models, yielding better fitting outcomes. Moreover, using differenced sequences as the forecasting sequence still achieved a 5.81% increase in accuracy in tasks without differencing decomposition, clearly indicating that the concept of differencing to enhance model accuracy proposed in this experiment applies not only to differencing decomposition but also as a training target for forecasting models.
  5.3. Impact of Sliding Window Length on Forecast Accuracy in Real-Time Prediction
To address the demands of real-time forecasting, Transformer incorporates a sliding window mechanism that allows the model to continuously receive and update data for each batch to respond in real-time. Through its self-attention mechanism, Transformer can capture dependencies between different time points within the sliding window. This means that when processing sequence data provided by the sliding window, the model not only considers the positional information of each element within the sequence but also understands and utilizes the dynamic interactions among these elements, thereby achieving accurate predictions of future events.
Consequently, in our study’s ML-CEEMDAN-TSTF model, the selection of the sliding window length has been demonstrated to have a critical impact on improving forecast accuracy. To understand how the sliding window length affects model performance, we designed a series of experiments comparing the impact of different sliding lengths, ranging from 1 to 60 days, on model forecasting accuracy. These lengths were chosen to cover different time scales from short to long term, assessing the model’s ability to capture short-term dependencies and long-term trends in the time series.
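The way a sliding window turns a multivariate series into training samples can be sketched as below. The helper `make_windows` and the random data are hypothetical, and the sketch covers only the data preparation, not the Transformer itself.

```python
import numpy as np

def make_windows(features, target, window=30, horizon=1):
    """Build (X, y) pairs: X[i] holds `window` consecutive days of features,
    y[i] is the target `horizon` days after the last day in X[i]."""
    features = np.asarray(features, dtype=float)
    target = np.asarray(target, dtype=float)
    n = len(target) - window - horizon + 1
    if n <= 0:
        raise ValueError("series too short for this window and horizon")
    X = np.stack([features[i:i + window] for i in range(n)])
    y = np.array([target[i + window + horizon - 1] for i in range(n)])
    return X, y

# Hypothetical data: 400 days, 14 input features, one differenced target.
rng = np.random.default_rng(1)
feats = rng.normal(size=(400, 14))
targ = rng.normal(size=400)

# Candidate window lengths within the 1 to 60 day range explored in the text.
shapes = {}
for w in (1, 7, 10, 30, 60):
    X, y = make_windows(feats, targ, window=w, horizon=1)
    shapes[w] = X.shape   # (samples, window, features)
```

Longer windows give each sample more history but leave fewer samples overall, which is one practical side of the trade-off examined in this section.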
Figure 9 shows the results of four evaluation metrics for the ML-CEEMDAN-TSTF model with sliding window lengths ranging from 1 to 60 days, highlighting the best result among them. Experimental results indicate that when the sliding window was set to 30 days, the forecast results achieved the lowest values for MSE and RMSE, and the highest for NSE and Corr, indicating optimal forecast precision. This finding reveals the model’s effectiveness in capturing monthly cyclical changes in the time series and their predictive impact on future water quantity variations. Compared to shorter lengths, such as 7 days and 10 days, a 30-day window provides sufficient historical information, allowing the model to learn richer temporal dependencies, thus enhancing prediction outcomes. However, increasing the window length further to 60 days, although providing access to longer historical data, introduces more noise and unnecessary information, potentially obscuring critical short-term and medium-term dependencies, leading to a decline in forecast accuracy.
 The experimental results from this section emphasize the need to balance the sufficiency of historical information and the risk of introducing excessive noise in practical applications, to find the sliding window length that best fits specific data characteristics for optimal forecast results.
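The four metrics used in this comparison can be computed as in the sketch below. It assumes the standard definitions (mean squared error, its square root, Nash-Sutcliffe efficiency, and the Pearson correlation coefficient), which may differ in detail from those given in Section 2.6; the function name `evaluate` and the sample values are hypothetical.

```python
import numpy as np

def evaluate(obs, pred):
    """Return MSE, RMSE, NSE, and Pearson correlation for paired series."""
    obs = np.asarray(obs, dtype=float)
    pred = np.asarray(pred, dtype=float)
    err = obs - pred
    mse = float(np.mean(err ** 2))
    rmse = float(np.sqrt(mse))
    # NSE compares the squared error with the variance of the observations:
    # 1 is a perfect fit, 0 is no better than predicting the observed mean.
    nse = float(1.0 - np.sum(err ** 2) / np.sum((obs - obs.mean()) ** 2))
    corr = float(np.corrcoef(obs, pred)[0, 1])
    return {"MSE": mse, "RMSE": rmse, "NSE": nse, "Corr": corr}

# Hypothetical observed and predicted inflows.
observed = np.array([1.0, 2.0, 3.0, 4.0])
biased = observed + 1.0          # right shape, constant offset of +1
scores = evaluate(observed, biased)
```

In this example the correlation is 1 even though every prediction is off by one unit, while NSE drops to 0.2; this is why NSE and Corr are reported together rather than interchangeably.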
  5.4. Limitations and Prospects
To enhance the reliability of the predictive model and reduce the impact of uncertainty in meteorological forecasts, this study uses historical observations as model inputs and explicitly excludes information from the forecast day itself, which avoids overreliance on the accuracy of meteorological forecasts. However, advances in technology and modeling have made recent meteorological forecasts considerably more accurate, with daily predictions of rainfall and temperature exceeding 90% accuracy. Although the methodology used in this paper ensures the accuracy of the input data, it also introduces certain limitations. Primarily, relying on historical data means that the most current meteorological forecast information cannot be fully utilized, which may restrict the model’s timeliness and forward-looking capabilities in practical applications. Additionally, this approach may cause the model to adapt too closely to specific patterns in historical data and overlook the actual impact of changing meteorological conditions on reclaimed water output, thus affecting the model’s adaptability and forecast accuracy when weather conditions change in the future. Moreover, practical applications often require daily predictions over a longer future period rather than only the next time point. Whether to calibrate the model using meteorological forecast information or the model’s own predictions remains to be explored through experimental analysis in future research.
In this study, our proposed ML-CEEMDAN-TSTF differencing model has made significant progress in enhancing the predictive capability for reclaimed water inflow, achieving forecast accuracy above 98.2%. However, the model faces challenges in algorithm optimization and model simplification. Due to its integration of various complex data processing and forecasting techniques, which have enhanced its ability to handle complex time series, the model also incurs higher operational times and resource consumption. Additionally, calibrating the model’s parameters requires substantial effort, a limitation that becomes particularly evident in scenarios dealing with large datasets or requiring real-time predictions, potentially restricting the model’s widespread application and dissemination. Future research should thus focus on developing more efficient algorithm optimization techniques and model simplification strategies, such as by streamlining model structures and optimizing algorithmic processes to significantly enhance operational efficiency without significantly sacrificing forecast accuracy. Furthermore, exploring the design of lightweight models, such as by reducing the number of model parameters or employing model compression techniques, could not only reduce the demand for computational resources but also simplify the model’s tuning and maintenance processes, enhancing usability and practicality.
This research includes a variety of predictive approaches ranging from traditional machine learning models like support vector regression and gradient boosting regression to deep learning models based on neural networks like Transformer and LSTM. During the experimental process, we found that although neural network-based models excel at learning deep features and complex dimensions of data, capturing more detailed data patterns, they do not match the computational stability of traditional machine learning models. Specifically, we observed that neural network-based models exhibit significant fluctuations in results during continuous training, indicating a lack of robustness. This variability could stem from deep learning models’ high sensitivity to noise and outliers in the data and instability in optimizing within a large parameter space. In contrast, traditional machine learning models, due to their relatively simple and well-defined mathematical frameworks, typically provide more stable and predictable forecast results. Future research should therefore focus on how to improve the robustness of neural network-based models. Developing hybrid predictive models that combine the advantages of deep learning models and traditional machine learning models might be an effective direction, aiming to utilize the high representational capabilities of deep learning and the stability of traditional models to achieve higher forecast accuracy and better stability.
A key model in this study, Transformer, features its self-attention mechanism as one of its core attributes, allowing it to effectively capture dependencies between elements when processing sequence data. The successful application of the self-attention mechanism spans multiple domains, including natural language processing, time series forecasting, and image processing. Although the self-attention mechanism enables the model to capture long-distance dependencies, its computational complexity increases significantly with sequence length, limiting its application in processing very long sequence data. Future research could explore new variants of attention mechanisms, such as sparse self-attention and local self-attention, to reduce computational costs and enhance the model’s capability to handle long sequences. Additionally, multimodal learning has become a recent research hotspot and is highly compatible with the Transformer model. The self-attention mechanism shows tremendous potential in integrating information from different modalities, such as combining text, images, and audio data for comprehensive understanding and analysis. Future work could further explore how to optimize the self-attention mechanism to more effectively blend and process multimodal data, for instance, whether surface water flow, changes in water body areas, or sound signals related to water flow can be incorporated as key data information into the model, further enhancing its performance.
  6. Conclusions
The ML-CEEMDAN-TSTF differencing decomposition integration model has been proposed, which is a time series forecasting framework that incorporates machine learning, CEEMDAN differencing decomposition, and the TS-Transformer to enhance the overall performance, reliability, and efficiency of the decomposition integration model. It particularly addresses the issue of differencing forgetfulness. This model is designed to tackle key challenges in forecasting inflow volumes at reclaimed water plants, including handling seasonal and non-stationary features, as well as improving sensitivity and prediction accuracy for peak and anomalous conditions. To validate the ML-CEEMDAN-TSTF model, extensive comparative analyses were conducted, including non-decomposed ML models, time-aware outlier-sensitive loss functions, TSTF integrated models, TS-Transformer decomposition integrated models, and TS-Transformer differencing decomposition integrated models. The study utilized daily RWV data from the CH and YF reclaimed water plants in Beijing from February 2018 to February 2021. The main findings can be summarized as follows:
(1) The introduction of time-scale differencing information as an explicit feature resolved the differencing forgetfulness issue, enhancing the model’s sensitivity to temporal changes and its dynamic adaptability to predictions;
(2) TS-Transformer demonstrated significant improvements in predictive performance and accuracy over traditional Transformers, particularly in handling outliers and peaks within time series;
(3) Differencing decomposition techniques provided rich decision-making information, significantly enhancing the model’s overall prediction accuracy, especially in analyzing intrinsic features of the time series;
(4) Although Transformer exhibited average performance in standalone prediction tasks, it demonstrated superior feature extraction and data handling capabilities within integrated model applications, effectively enhancing the predictive performance of the integrated models.
In summary, the ML-CEEMDAN-TSTF differencing decomposition integration model has significantly improved the dynamic adaptability and predictive accuracy of RWV in practical scenarios. Due to the ongoing improvement of supporting equipment and the need for coordination among multiple departments for real-time monitoring data, the model and results are expected to be applied in the future once these issues are resolved. Future research will explore more efficient algorithm optimization strategies and attention mechanism techniques to further enhance the model’s computational efficiency and real-time prediction capabilities, expanding its potential applications in multimodal data fusion and analysis.