Comparison of Forecasting Models for Real-Time Monitoring of Water Quality Parameters Based on Hybrid Deep Learning Neural Networks

: Accurate real-time water quality prediction is of great signiﬁcance for local environmental managers to deal with upcoming events and emergencies to develop best management practices. In this study, the performances in real-time water quality forecasting based on different deep learning (DL) models with different input data pre-processing methods were compared. There were three popular DL models concerned, including the convolutional neural network (CNN), long short-term memory neural network (LSTM), and hybrid CNN–LSTM. Two types of input data were applied, including the original one-dimensional time series and the two-dimensional grey image based on the complete ensemble empirical mode decomposition algorithm with adaptive noise (CEEMDAN) decomposition. Each type of input data was used in each DL model to forecast the real-time monitoring water quality parameters of dissolved oxygen (DO) and total nitrogen (TN). The results showed that (1) the performances of CNN–LSTM were superior to the standalone model CNN and LSTM; (2) the models used CEEMDAN-based input data performed much better than the models used the original input data, while the improvements for non-periodic parameter TN were much greater than that for periodic parameter DO; and (3) the model accuracies gradually decreased with the increase of prediction steps, while the original input data decayed faster than the CEEMDAN-based input data and the non-periodic parameter TN decayed faster than the periodic parameter DO. Overall, the input data preprocessed by the CEEMDAN method could effectively improve the forecasting performances of deep learning models, and this improvement was especially signiﬁcant for non-periodic parameters of TN.


Introduction
As the most important water source for human life and industry production, surface water is extremely vulnerable to being polluted. By quantifying different types of parameters, water quality monitoring can help us to develop best management practices to protect water source safety and improve aquatic habitats [1]. With the development of the social economy, many regions of China are facing severe water pressure due to increasing water demand and decreasing available water resources as the result of surface water pollution [2]. Therefore, precisely and timely monitoring and prediction of surface water quality is particularly critical to local environmental managers for designing pollution reduction strategies and responding to environmental emergencies. In recent years, many studies have estimated water quality parameters through different methods based on artificial intelligence tools. For example, Fijani et al. designed a hybrid two-layer decomposition performances; and (3) compare the results of different prediction steps and analyze the impact of prediction steps on different parameters and different models.

Study Area and Data Description
The case study area in this paper is Xin'anjiang River, which is a transboundary river that crosses Anhui Province and Zhejiang Province. The Xin'anjiang River originated in Huangshan City, Anhui Province, and flows into Thousand Island Lake through Chun'an County, Zhejiang Province ( Figure 1). As the most important water source flowing into the Thousand Island Lake, the water quality of Xin'anjiang River is concerned with the health of 115.91 million people in both Anhui and Zhejiang provinces [19]. The Jiekou monitoring station is located at the junction of these two provinces, which is mainly used to monitor the water quality of Xin'anjiang River (Figure 1). Jiekou station automatically analyzes water samples and records data every 4 h to realize real-time monitoring of water quality. We collected water quality data including dissolved oxygen (DO) and total nitrogen (TN) from 0:00 on 6 May 2015 to 20:00 on 20 January 2020. The DO and TN datasets both had 10,326 records, which were used to conduct the predictive models. To perform the model development and evaluation, the full dataset was split into training, validating, and testing subsets. The data division criterion in this study was shown in Figure 2. The training, validating, and testing sub-sets included 6609 records (nearly 64% of the dataset, from 0:00 on 6 May 2015 to 8:00 on 11 May 2018), 1652 records (nearly 16% of the dataset, from 12:00 on 11 May 2018 to 20:00 on 10 February 2019), and 2065 records (nearly 20% of the dataset, from 0:00 on 11 February 2019 to 20:00 on 20 January 2020), respectively. The descriptive statistics of the training set, validating set, and testing set for these two parameters were summarized in Table 1.

Complementary Ensemble Empirical Mode Decomposition with Adaptive Noise
CEEMDAN is an improved adaptive signal time-frequency analysis method [20], which is developed following the empirical mode decomposition (EMD) [21] and ensemble empirical mode decomposition (EEMD) [22]. EMD could decompose complex signal into intrinsic mode functions (IMFs) and a residue, so as to better understand the linear and nonlinear characteristics of the signal. However, since either a single IMF is composed of very different scales, or similar scales reside in different IMF components, the original EMD has the disadvantage of mode mixing [3]. The EEMD method is proposed to solve the problem of mode mixing by adding white noise [22]. However, the added white noise could not be completely eliminated during the decomposition process, so the interaction between the signal and the noise may produce different modes [23]. The CEEMD algorithm completely eliminates the residual white noise by adding positive and negative auxiliary white noise pairs to the original signal, and further improves the performance of the decomposition algorithm [24]. By introducing the adaptive noise component, the CEEMDAN algorithm not only maintains the ability to eliminate mode mixing and residual noise, but also has fewer iterations and higher convergence performance [20]. In this study, CEEMDAN was used to decompose the original DO and TN data series to reduce their complexities.

Convolutional Neural Network
The convolutional neural network (CNN) is a multi-layer feedforward neural network that can be used to process data of multiple dimensions, such as time series data (which can be considered as a 1-D grid sampled at regular intervals) and image data (which can be considered as pixel 2-D grid) [25]. With the application of CNNs, many variants of convolutional network structure have appeared. However, the basic structures of most networks are similar, including input layer, convolution layer, pooling layer, fully connected layer, and output layer [16,26]. The input data needs to enter the convolutional neural network through the input layer. According to the observed data and study objectives, the input layer has a variety of formats, including 1-D, 2-D, and even n-D, and these data usually contain 1 to n channels. In this study, the original time series could be viewed as 1-D grid data to build a 1-D convolutional network. Moreover, the original data preprocessed by CEEMDAN were decomposed into multi-frequency IMFs, and all of them were put into CNN as input data in the form of image. These images had only one channel, and the CNN built based on them could be considered a 2-D model.
Through a series of comparative experiments, the basic structure of the CNN models used in this study was determined. After the input layer, the second layer was a convolutional layer with a number of learnable convolution kernels. The application of these kernels was to obtain multiple feature maps by learning different features of the input data. The third layer was the activation layer that used rectified linear unit (ReLU) as the activation function. The fourth layer was a pooling layer using the average function. The pooling layer reduced the dimensionality and processing time by sub-sampling the feature maps extracted from the convolutional layer. The fully connected layer, which was the last part before the output layer, connected the output of the pooling layer to a one-dimensional row vector. Finally, for the regression problem of this study, the regression output layer was used as the last layer of CNN. In addition to these basic structures, in order to reduce the problem of overfitting, dropout and batch-normalization methods were used in this study [27]. Dropout randomly set the input elements to zero with a given probability in each training case, so that a large number of different networks could be trained in a reasonable time to effectively reduce the problem of overfitting [28]. Batch normalization can mitigate the internal covariate shift and the dependence of the gradient on the parameter or its initial value, so as to achieve the purpose of accelerating training and producing more reliable models [29].

Long Short-Term Memory Neural Network
The long short-term memory (LSTM) network is a special variant of the recurrent neural network (RNN). In traditional feedforward neural networks, information can only flow from the input layer to the hidden layer and finally to the output layer in one direction. The biggest difference between RNN and the feedforward neural network is that RNN has a recurrent hidden unit to implicitly maintain historical states about all past elements in the sequence [30,31]. That is to say, in RNN, the final output is based on the feedback of the input layer and the historical states of all hidden units. However, when the RNN is trained using the gradient descent method, the gradient may show an exponential increase or decay, which will cause the gradient to explode or disappear [31,32]. LSTM introduces the concept of input and output gates to improve RNN. LSTM applies memory cells, which could filter and process historical states and information instead of the recurrent hidden units of the RNN, and the basic structure of three consecutive memory cells is shown in Figure 3. Each memory cell contains an input gate (i t ), a forget gate (f t ), and an output gate (O t ) to control the flow of information. The input gate (i t ) is used to determine how much input data at the current moment (t) needs to be stored in the cell state, and an intermediate value Ct is used to update the state in this process.
Water 2021, 13, 1547 6 of 20 ting, the dropout layer was inserted before the fully connected layer. All of the algorithms were performed by programming codes in MATLAB R2020b, and a function developed by Torres et al. [20] was used to design the CEEMDAN model. The Deep Learning Toolbox including the basic framework of multiple deep neural networks was applied to design the CNN, LSTM, and CNN-LSTM models.  Figure 3. The architecture of long short-term memory (LSTM) neural network.
The forget gate (f t ) determines how much of the cell state from the previous moment (t−1) needs to be retained to the current moment.
By deleting part of the old information and adding the filtered intermediate value, the cell state is updated from C t−1 to C t .
The output gate controls how much of the current cell state needs to be output to the new hidden state.
In the above formula, W * and b * , respectively, represent the relevant weight matrices and bias vectors, while σ(·) and tanh(·) are the sigmoid function and hyperbolic tangent function. The basic structure of LSTM models used in this study included the input layer, LSTM layer, fully connected layer, and regression output layer. To prevent overfitting, the dropout layer was inserted before the fully connected layer.
All of the algorithms were performed by programming codes in MATLAB R2020b, and a function developed by Torres et al. [20] was used to design the CEEMDAN model. The Deep Learning Toolbox including the basic framework of multiple deep neural networks was applied to design the CNN, LSTM, and CNN-LSTM models.

The Hybrid Forecasting Models Development
Before constructing the forecasting models, the Lyapunov exponents were calculated first to determine whether there were nonlinear and dynamic characteristics in the DO and TN data series. When the maximum Lyapunov exponent is positive, it is a powerful indicator of the chaotic features of the data series. The values of the Lyapunov exponent for DO and TN were 0.36 and 0.44, respectively, indicating that the monitoring data of these two parameters were both chaotic time series. For chaotic nonlinear dynamic systems, the data may be corrupted by noise or their internal features may be completely unknown, so deep learning methods are very suitable for prediction and feature extraction of such data. In this study, we investigated the potential uses of single models and hybrid models for multi-step ahead forecasting of DO and TN. The procedures of the water quality prediction models based on CNN and LSTM are described as follows.
Step 1: The explanatory variables and target variable were decided. Although some commonly used analysis methods (such as autocorrelation function and partial autocorrelation function) have been tried to determine the input variables, no significant number of lags has been found. Taking into account the monitoring frequency and effectiveness of the data, twelve records in two days were selected as input variables, and six records in the following day were used as output variables. Taking DO as an example, the input and output variables are shown in Table 2.
Step 2: DO and TN data series were decomposed into several different IMFs and a residual component. These components can provide detailed information from high frequency to low frequency contained in the input data series. Previous studies indicated that the added noise level and the number of realizations in the CEEMDAN modeling process can be adjusted according to the application [33]. According to some empirical algorithms in the previous studies (for example, the recommended white noise amplitude is about 20% of the standard deviation, etc.) and the results of trial and error, a noise level of 0.1, a realization of 100, and a maximum of 2000 sifting iterations were set [3,22,34].
Step 3: The data series were divided into training, validating, and testing sets, respectively. It is worth noting that although the testing subset is completely independent of the training process and is used to compare the forecasting performances of the models, the decomposition process of CEEMDAN needs to be applied to the entire dataset. If the training subset and the testing subset are decomposed separately, the obtained IMFs and residual component are inconsistent and the trained model becomes invalid for the testing data.
Step 4: The training subset was used to train each model for multi-step ahead forecast. This study contained two types of input data: the original one-dimensional time series and the two-dimensional gray-scale image based on CEEMDAN decomposition. The original one-dimensional input was a set of 12 × 1 data sequences, while the CEEMDAN-based input was a set of 12 × n two-dimensional images (the value of n was decided by the results of decomposition process). The one-dimensional and two-dimensional images are shown in Figure 4, respectively. Each type of data was used to train three deep neural networks: CNN, LSTM, and CNN-LSTM.
Step 5: The testing subset was used to obtain multi-step ahead forecasting results and compare the performances of different models. Since the testing data were not involved in the training process at all, the forecasting results of testing data set could well reflect the generalization ability of the models. At the same time, the results of different ahead steps can be used to evaluate the effective forecasting duration of the models.

Evaluation Criteria
To quantitatively evaluate the performances of the aforementioned models, three statistical evaluation criteria were selected for comparison, including the coefficient of efficiency (CE), root mean square error (RMSE), and mean absolute percentage error (MAPE). The CE was also known as the Nash-Sutcliffe coefficient and defined as follows: The RMSE was calculated as: The MAPE was defined as: where n was the number of input samples, O i and P i were observed and predicted value of sample i, respectively. O was the mean value of observed data. Generally, the larger the CE value and the smaller the RMSE and MAPE values, the smaller the difference between the predicted value and the actual value, that is, the higher the prediction accuracy of the model.

Results of DO Data Series
Based on the CEEMDAN preprocessing method, the time series of the DO data set were decomposed into 13 IMFs and one residual term, shown in Figure 5. These IMFs reflected the oscillating modes of DO in different periods, and the residual term indicated the trend of the data. In this study, the input data had two forms, one is the original time series, and the other is the gray image based on the CEEMDAN results ( Figure 4). The original input was a set of 12 × 1 data sequences, while the CEEMDAN-based input was a set of 12 × 14 two-dimensional images. Obviously, compared to a single time series, the CEEMDAN-based input data could reflect more information. Moreover, increasing the Water 2021, 13, 1547 9 of 20 dimensionality of the input data was beneficial to the deep learning neural networks to extract data features, thereby improving the prediction accuracy. The CE, RMSE, and MAPE statistics of single and hybrid deep learning neural networks in the testing period for the DO dataset are shown in Table 3. For targets with different ahead steps, the best error measures were highlighted in red. According to the results, it can be found that the CEEMDAN-CNN-LSTM model had the best prediction accuracy for all targets. Whether for the separate CNN and LSTM model or the CNN-LSTM hybrid model, the prediction accuracies of CEEMDAN-based input data were better than the original input data. As the forecasting step increased, the accuracy difference between these two sets of input data gradually increased. For example, the differences of CE, RMSE, and MAPE between DO t and DO t+5 for CNN-LSTM and CEEMDAN-CNN-LSTM were 0.09 (|0.97-0.88|), 0.30 (|0.35-0.65|), and 2.61 (|3.06-5.67|) and 0.04 (|0.98-0.94|), 0.22 (|0.26-0.48|), and 2.01 (|2.55-4.56|), respectively. In other words, as the forecasting step increased, the decrease of the prediction accuracy of CEEMDAN-based input data was slower than the original input data. The six models, especially the CEEMDAN-CNN-LSTM model, could capture the general trend of the testing data series across different forecasting steps. However, it was necessary to evaluate the model accuracy based on the forecasts of the peak and lowest values. In this study, the MAPE between observations and predictions of the 10% lowest and 10% highest values of the testing data was analyzed. Figure 6a,c showed the MAPE of lowest values from Step 1 to 6 based on original and CEEMDAN-based input. The MAPEs of the six models increased as the forecasting step became longer, and the MAPEs based on the original input data were larger than the CEEMDAN-based input. The MAPEs of the CNN models were the largest, LSTM models were the second, and CNN-LSTM models were the smallest, which indicated that the CNN-LSTM model was superior to the separate models in forecasting the lowest values. For the prediction of peak values (Figure 6b,d), the MAPEs of CEEMDAN-based inputs were smaller than the original inputs, and the MAPEs increased as the forecasting step became longer. Being different from the lowest values prediction, LSTM was the model with the largest MAPEs for the peak values prediction, followed by the CNN model, and CNN-LSTM was the smallest. Therefore, the CEEMDAN-CNN-LSTM was the best forecasting model for both the lowest and highest values.

Results of TN Data Series
The decomposition results of the TN data series using the CEEMDAN method are shown in Figure 7. The TN data set was also decomposed into 13 IMFs and one residual term. The CE, RMSE, and MAPE statistics of single and hybrid deep learning neural networks in the testing period for the TN dataset are shown in Table 4. The results indicated that the CEEMDAN-CNN-LSTM model had the best performances across different forecasting steps. However, for the prediction of TN, the differences between the results obtained from the original input and the CEEMDAN-based input were very significant. Especially with the increase of the time steps, the performance of the original input deteriorated rapidly, while the CEEMDAN-based input showed much better stability for TN forecasting. For CNN, LSTM, and CNN-LSTM models, the CEs of TN n+3 , TN n+4 , and TN n+5 were all less than 0.5, which was not within the range of satisfactory model simulations [35]. In other words, when the time step exceeded 3, no model could reliably predict the TN concentration based on the time series data alone.   The MAPEs of different models for the 10% peak values and 10% lowest values of the TN testing data set are shown in Figure 8. The prediction of the lowest values showed that the CNN-LSTM was better than the CNN, while the LSTM performed the worst among the three models (Figure 8a,c). For the original input and CEEMDAN-based input, the average MAPEs of the three models across different time steps were 39.20, 41.71, and 51.72 and 26.07, 28.12, and 31.30, respectively. Obviously, the prediction of the lowest values based on the original input data had a greater deviation from the observed values. According to the prediction results of the peak values, LSTM was the worst performing model, followed by the CNN model, and the CNN-LSTM model performed best (Figure 8b,d). In addition, as the time step increased, the MAPEs gradually became larger, that is, the deviation between the predicted values and the observed values became larger. The change curves of MAPE were nearly the same; however, the MAPEs obtained from the original input were about twice the MAPEs obtained from the CEEMDAN-based input. In general, CEEMDAN-CNN-LSTM was the best model for forecasting TN extreme values, while LSTM was the least accurate model for extreme values prediction.

Discussion
Numerous studies have applied single and hybrid deep learning models to predict hydrological and water quality data [15,16,36]. Among them, the most commonly used method was to decompose the data series at first, then make predictions for each IMF separately, and finally add the results to get the reconstructed time series. However, data decomposition was a data-dependent method, so it needed to be repeated as the data was updated. Generally, huge calculation cost was accompanied by long calculation time, so the commonly used method was too time-consuming for real-time data prediction. In this study, we tried to input the decomposed data series as two-dimensional images for multi-step water quality prediction. The results indicated that compared with the original one-dimensional time series input data, the two-dimensional input data decomposed by the CEEMDAN method could improve the prediction accuracies. However, for water quality parameters with different fluctuation characteristics, the two-dimensional input data improved the prediction accuracies differently. It can be found from Figure 4 that the DO data had very obvious seasonal changes. The DO concentrations in spring and summer were higher than those in autumn and winter. Due to the strong periodicity of DO data, the standalone model and hybrid model with two types of input data had satisfactory performances. However, the performances of the hybrid models were slightly better than that of the standalone models, so the results were compared and discussed based on the hybrid models.  Figure 7 showed that the TN data, which was mainly affected by human activities, had no periodic fluctuation characteristics. The results based on the CEEMDAN input data were significantly better than the results based on the original input data. For TN t , the CE, RMSE, and MAPE of CNN-LSTM and CEEMDAN-CNN-LSTM were 0.76 vs. 0.92, 0.18 vs. 0.10, and 10.06 vs. 6.63, respectively. The CEEMDAN-CNN-LSTM model performed significantly better than the CNN-LSTM, which indicated that decomposing the input data was more beneficial to improve the prediction accuracy of non-periodic time series. The CEEMDAN method added adaptive white noise in each decomposition to smooth the impulse interference, so that the decomposition of the data was very complete [37]. For non-periodic parameters, the complete decomposition could better separate the noise component and periodic components of the data. Among them, the periodic components could be much better predicted, so the prediction accuracies of the data series were also improved.
Increasing the forecasting steps of each round of prediction may help to reduce computational consumption, while relevant previous studies rarely discussed the impact of the increase in forecasting steps on model performances [14,15]. In this study, all of the models were tested for one-step-ahead (4 h) to six-step-ahead (1 day) forecasting, which were aimed to analyze the impact of the forecasting steps on the performances of the models. As shown in Figures 9 and 10, with the increase of prediction steps, the CE of all models decreased, while the RMSE and MAPE gradually increased. Some previous studies on hydrological data prediction have also obtained similar results on a monthly scale. For example, in a wavelet analysis-ANN-model-based multi-scale monthly groundwater level prediction study, the prediction results of 2 and 3 months ahead were worse than the results of 1 month ahead [38]. In the study using a two-phase hybrid model to predict the runoff in Yingluoxia watershed, the model's accuracy for shorter lead times (1 month) were better than the 3-and 6-month horizon [34]. However, in addition to the conclusion that the prediction accuracy would decrease with the increase of prediction steps, we found that the accuracies' attenuation rates of different water quality parameters were different. Taking the indicator CE as an example, the CE attenuation rates of the DO models were much lower than those of the TN models as the prediction steps increased. Based on the CE at time t (i.e., one-step-ahead), the CE attenuation rates of CNN-LSTM and CEEMDAN-CNN-LSTM models at time t+1, t+2, t+3, t+4, and t+5 were calculated, and the results are shown in Figure 11. For DO, the attenuation rates of CNN-LSTM models and CEEMDAN-CNN-LSTM models were 3.09~9.28 and 0~4.08, respectively. For TN, the attenuation rates of CNN-LSTM models and CEEMDAN-CNN-LSTM models were 13.16~48.68 and 2.17~14.13, respectively. Obviously, the attenuation rates of the periodic parameter DO were lower than those of the non-periodic parameter TN, and the attenuation rates based on two-dimensional input data were lower than those of one-dimensional input data. It was interesting to note that the prediction accuracy decreased as the number of prediction steps increased. With the continuous flow of the river, the waters that have experienced different influences gradually flowed through the monitoring station. As the time interval became longer, the differences of water quality became greater. Therefore, the prediction accuracies based on the same set of input data gradually decreased. Moreover, the TN concentrations were mainly affected by human activities. The TN emission sources were always changing, as the prediction steps increased, the sources may appear or disappear. Therefore, the attenuation rates of TN were faster than DO.

Conclusions
In this study, the performances of the deep neural networks CNN, LSTM, and CNN-LSTM with and without the preprocessing method CEEMDAN were compared for the water quality parameters' forecasting. A real-time water quality monitoring dataset with 10,326 records from 2015 to 2020 was collected as the input data. According to the completeness of the data and the fluctuation characteristics of the parameters, DO and TN were selected as prediction targets. The DO and TN data series were decomposed by the CEEMDAN method at first, and two decomposed datasets with 13 IMFs and one residual term were obtained. Taking into account the monitoring frequency (every 4 h) of the data, twelve records in two days were selected as input variables, and six records in the following day were used as output variables. Then, a set of 12 × 1 data sequences was prepared as the original input data, while the CEEMDAN-based input data were a set of 12 × 14 two-dimensional grey images. The two types of input data were used to train and test CNN, LSTM, and CNN-LSTM models, respectively. The results indicated that the performances of hybrid model CNN-LSTM were superior to the standalone model CNN and LSTM. The models used CEEMDAN-based two-dimensional input data performed much better than the models used the original input data. Moreover, the improvements of CEEMDAN-based data for non-periodic parameter TN were much greater than that for periodic parameter DO. In addition, the impacts of the prediction steps on the forecasting accuracies of the models were analyzed in this study. The results showed that the model forecasting accuracies gradually decreased with the increase of prediction steps. The attenuation rates presented the features that the original input data decayed faster than the CEEMDAN-based input data and the non-periodic parameter TN decayed faster than the periodic parameter DO. In general, the input data preprocessed by the CEEMDAN method could effectively improve the forecasting performances of deep neural networks, and this improvement was especially obvious for non-periodic parameter. Overall, compared with a separate monitoring system, the incorporated monitoring and modeling framework may provide better knowledge, which could help decision-making departments respond to upcoming events and outbreaks.