Daily Runoff Forecasting Model Based on ANN and Data Preprocessing Techniques

Many models have been used to simulate the rainfall-runoff relationship. In this study, the artificial neural network (ANN) model was selected to investigate how data preprocessing can improve daily runoff forecasting accuracy. Singular spectrum analysis (SSA) was adopted as a data preprocessing technique for the model inputs, and the SSA-ANN model was developed. The proposed model was compared with the original ANN model without data preprocessing and with a nonlinear perturbation model (NLPM) based on ANN, i.e., the NLPM-ANN model. Eight watersheds were selected for calibrating and testing these models. The comparative study shows that the learning and training ability of ANN models can be improved significantly by the SSA and NLPM techniques, and that the SSA-ANN model performs much better than the NLPM-ANN model, with high forecasting accuracy. The SSA-ANN1 model, which considers only rainfall as model input, was compared with the SSA-ANN2 model, which considers both rainfall and previous runoff as model inputs. The Nash-Sutcliffe criterion of the SSA-ANN2 model is much higher than that of the SSA-ANN1 model, which means that properly selected previous runoff data can significantly improve rainfall-runoff model performance, since runoff series are usually highly autocorrelated.


Introduction
Real-time hydrological forecasting plays an important role in flood control and reservoir operation, and higher forecasting precision can increase the utilization efficiency of water resources. Traditionally, hydrological simulation modeling systems are classified into three main groups, namely empirical black-box, lumped conceptual, and distributed physically based models [1]. The last two groups focus on understanding hydrological processes and involve various physical phenomena. Owing to the complexity of the rainfall-runoff process, these physical process simulations and model calibrations require large amounts of hydrological data. In contrast, black-box modeling does not require deep knowledge of the underlying physics and can also cope with data scarcity. Several [...] yield a "filtered" time series, the model performance could be improved. Sivapragasam et al. [25] proposed a prediction technique based on SSA coupled with support vector machines to predict runoff and rainfall, and showed that it yields significantly higher prediction accuracy than the nonlinear prediction method. Wu and Chau [29] also found that SSA can considerably improve the performance of the rainfall-runoff model and that it is promising for hydrological forecasting.
In this paper, an approach to improving daily runoff forecasting accuracy through data preprocessing and the selection of predictive factors is discussed. The artificial neural network (ANN) is used for rainfall-runoff simulation. The SSA and NLPM techniques are adopted for data preprocessing. SSA-ANN models are then developed and compared with the NLPM-ANN model based on daily data from the eight watersheds used by Pang et al. [16]. A comparative study is also conducted on two types of model inputs, namely rainfall only, and both rainfall and runoff.

NLPM-ANN Model
The structure of the NLPM-ANN model, shown in Figure 1, was proposed by Pang et al. [16] to consider the influence of seasonal changes and the nonlinearity of the rainfall-runoff process. The model input is divided into two parts. The first is the series of seasonal expectations of the input ($p_d$), which is transformed into the series of seasonal expectations of the output ($q_d$) through an undefined relation. The second part, the input perturbations ($P_i - p_d$), is transformed into the output perturbations ($Q_i - q_d$) through the ANN. The total output is the sum of the seasonal expectations of the output and the output perturbations.
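The split of a series into seasonal expectations and perturbations can be sketched as follows. The text does not state how the seasonal expectation is estimated, so this sketch assumes it is the plain mean over all years for each calendar day; the function names and the toy data are hypothetical.

```python
from collections import defaultdict


def seasonal_expectation(values, day_of_year):
    """Mean of `values` for each calendar day, averaged across years (assumed estimator)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for v, d in zip(values, day_of_year):
        sums[d] += v
        counts[d] += 1
    return {d: sums[d] / counts[d] for d in sums}


def split_perturbations(values, day_of_year):
    """Return (seasonal part, perturbation part); the two add up to `values`."""
    expect = seasonal_expectation(values, day_of_year)
    seasonal = [expect[d] for d in day_of_year]
    perturb = [v - s for v, s in zip(values, seasonal)]
    return seasonal, perturb


# Toy data: two "years" of three days each (hypothetical values).
rain = [2.0, 0.0, 4.0, 4.0, 2.0, 0.0]
days = [1, 2, 3, 1, 2, 3]
p_d, p_pert = split_perturbations(rain, days)
```

In the NLPM-ANN structure described above, only the perturbation part would be passed to the ANN, and the seasonal expectation of the output would be added back to the ANN output.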

Singular Spectrum Analysis
Singular spectrum analysis (SSA) is a suitable method for studying periodic oscillatory behavior. It is a statistical technique that starts from a dynamic reconstruction of the time series and is associated with the empirical orthogonal function (EOF). Generally, SSA can be considered a special application of EOF decomposition. Its main purpose is to convert a one-dimensional time series into a multi-dimensional matrix with a given window length and then to obtain the orthogonal decomposition of this matrix. If clear pairs of eigenvalues are produced and the corresponding EOF is almost periodic or orthogonal, this EOF can be considered to represent the oscillatory behavior of the time series.
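The operating procedure of SSA is summarized as follows. Assume a nonzero series $F = \{f_0, f_1, \ldots, f_{N-1}\}$ ($f_i \neq 0$) of length $N$ ($N > 2$). Given a window length $L$, the one-dimensional time series is converted into a sequence of $L$-dimensional vectors $X_i = (f_{i-1}, \ldots, f_{i+L-2})^T$, $i = 1, \ldots, K$, with $K = N - L + 1$. The $K$ vectors $X_i$ form the columns of the $(L \times K)$ trajectory matrix:

$$X = [X_1, \ldots, X_K] = \begin{pmatrix} f_0 & f_1 & \cdots & f_{K-1} \\ f_1 & f_2 & \cdots & f_K \\ \vdots & \vdots & \ddots & \vdots \\ f_{L-1} & f_L & \cdots & f_{N-1} \end{pmatrix} \qquad (1)$$

Then the singular value decomposition (SVD) of the trajectory matrix $X$ is conducted. Let $S = XX^T$. The eigenvalues and eigenvectors of $S$ are calculated, and the eigenvalues are arranged in decreasing order of magnitude. According to the conventional computation of EOF, the matrix $X$ is expanded as:

$$x_{ij} = \sum_{k=1}^{L} a_j^k E_i^k, \qquad 1 \le i \le L,\ 1 \le j \le K \qquad (2)$$

where $x_{ij} = f_{i+j-2}$ is the element of $X$ in row $i$ and column $j$, $E^k$ is the $k$th eigenvector, denoted by T-EOF, and $a^k = X^T E^k$ is the corresponding time principal component, denoted by T-PC. The key step of SSA is to reconstruct a new one-dimensional series of length $N$ from each pair of T-PC and T-EOF by averaging over the entries of the $k$th term of the expansion that correspond to the same time index:

$$f_n^k = \frac{1}{|A_n|} \sum_{(i,j) \in A_n} a_j^k E_i^k, \qquad A_n = \{(i,j) : i + j - 2 = n,\ 1 \le i \le L,\ 1 \le j \le K\}, \qquad n = 0, \ldots, N-1 \qquad (3)$$

Equation (3) produces an $N$-length time series $F_k = \{f_0^k, \ldots, f_{N-1}^k\}$, thus the initial series $F$ is decomposed into the sum of $L$ series:

$$F = \sum_{k=1}^{L} F_k \qquad (4)$$

If the number of contributing components is $p$, then the filtered series is the sum of these $p$ series:

$$\tilde{F} = \sum_{k \in P} F_k \qquad (5)$$

where $P$ is the set of indices of the $p$ contributing components. The sum of the remaining series is noise. As mentioned above, these reconstructed components can be associated with the trend, oscillations, or noise of the original time series with proper choices of $L$ and $p$.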
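The procedure above can be sketched with NumPy as follows. This is a minimal illustration, not the authors' implementation: it builds the trajectory matrix, takes the eigenvectors of $S = XX^T$ in order of decreasing eigenvalue, and reconstructs each component by averaging over the anti-diagonals of the corresponding rank-one matrix, which is assumed here to be the form of the reconstruction step. The function names `ssa_decompose` and `ssa_filter` are hypothetical.

```python
import numpy as np


def ssa_decompose(series, L):
    """Return an (L, N) array whose row k is the reconstructed component F_k."""
    f = np.asarray(series, dtype=float)
    N = f.size
    K = N - L + 1
    # Trajectory matrix: column i holds f[i], ..., f[i + L - 1].
    X = np.column_stack([f[i:i + L] for i in range(K)])
    # Eigen-decomposition of S = X X^T, sorted by decreasing eigenvalue.
    eigvals, eigvecs = np.linalg.eigh(X @ X.T)
    order = np.argsort(eigvals)[::-1]
    components = np.zeros((L, N))
    for k, idx in enumerate(order):
        E = eigvecs[:, idx]        # T-EOF (length L)
        a = X.T @ E                # T-PC (length K)
        Xk = np.outer(E, a)        # k-th rank-one term of the expansion of X
        # Diagonal averaging: entries with row + column == n map to time n.
        for n in range(N):
            lo = max(0, n - K + 1)
            hi = min(L - 1, n)
            rows = np.arange(lo, hi + 1)
            components[k, n] = Xk[rows, n - rows].mean()
    return components


def ssa_filter(series, L, keep):
    """Sum the components listed in `keep` (0-based, ordered by eigenvalue)."""
    return ssa_decompose(series, L)[list(keep)].sum(axis=0)


# Toy series: a sine wave plus a linear trend (hypothetical data).
t = np.arange(40)
series = np.sin(2 * np.pi * t / 8) + 0.1 * t
comps = ssa_decompose(series, L=8)
```

Because the eigenvectors of $S$ form an orthonormal basis, the $L$ components add up to the original series, which gives a quick check on the decomposition.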

Artificial Neural Network
ANNs can be categorized as single-layer, bilayer, and multilayer according to the number of layers, and as feed-forward, recurrent, and self-organizing according to the direction of information flow and processing [9]. Among these architectures, multilayer feed-forward networks, which consist of an input layer, several hidden layers, and an output layer, have been widely used. Each layer has a different number of nodes, and the number of hidden layers and the number of nodes in each hidden layer are usually determined by trial and error.
For a three-layer ANN denoted by $m \times h \times 1$, where $m$ is the number of input nodes, namely the number of predictive factors, and $h$ is the number of nodes in the hidden layer, the ANN prediction model can be formulated as:

$$\hat{y}_{t+T} = f(X_t, w, \theta, m, h) = \theta_0 + \sum_{j=1}^{h} w_j^{out}\, \varphi\!\left( \sum_{i=1}^{m} w_{ji}\, x_{t,i} + \theta_j \right)$$

where $X_t = (x_{t,1}, \ldots, x_{t,m})$ is the input data; $T$ is the length of lead time; $\varphi$ denotes the transfer functions; $w_{ji}$ are the weights linking the $i$th node of the input layer and the $j$th node of the hidden layer; $\theta_j$ are the biases associated with the $j$th node of the hidden layer; $w_j^{out}$ are the weights associated with the connection between the $j$th node of the hidden layer and the node of the output layer; and $\theta_0$ is the bias at the output node. The Levenberg-Marquardt algorithm is chosen to adjust the values of $w$ and $\theta$ in this study [32].
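The forward pass of such an $m \times h \times 1$ network can be sketched as below. This is only an illustration of the output computation: the hyperbolic tangent is assumed as the hidden-layer transfer function, the weights are arbitrary hypothetical values, and the Levenberg-Marquardt training step is not shown.

```python
import math


def ann_forecast(x, w_hidden, theta_hidden, w_out, theta_out, phi=math.tanh):
    """Output of an m x h x 1 feed-forward network for one input vector x.

    w_hidden[j][i] links input node i to hidden node j; theta_hidden[j] is the
    bias of hidden node j; w_out[j] links hidden node j to the output node.
    """
    hidden = [
        phi(sum(w_ji * x_i for w_ji, x_i in zip(w_j, x)) + theta_j)
        for w_j, theta_j in zip(w_hidden, theta_hidden)
    ]
    return theta_out + sum(w * h for w, h in zip(w_out, hidden))


# A 2 x 2 x 1 network with hypothetical weights.
y = ann_forecast(
    x=[0.2, 0.2],
    w_hidden=[[0.5, -0.5], [1.0, 1.0]],
    theta_hidden=[0.0, 0.0],
    w_out=[1.0, 2.0],
    theta_out=0.1,
)
```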

Proposed SSA-ANN Models
The SSA-ANN models are proposed with the aim of analyzing the effect of data preprocessing. The flowchart of the SSA-ANN models is illustrated in Figure 2. The original series is first decomposed into oscillations and noise by SSA. The reconstructed series is then selected as the ANN model input. If the input is the rainfall data series only, the SSA-ANN1 model is built to simulate the relationship between rainfall and runoff. If the input contains both the rainfall and runoff data series, the SSA-ANN2 model is built to simulate the relationship between rainfall, previous runoff, and forecasted runoff.

Evaluation of Model Performances
Two criteria are selected to evaluate the prediction performance based on the Chinese hydrological forecasting guidelines (2008):

(1) Nash-Sutcliffe efficiency coefficient ($R^2$)

$$R^2 = \left[ 1 - \frac{\sum_{t=1}^{n} (Q_t - Q'_t)^2}{\sum_{t=1}^{n} (Q_t - \bar{Q})^2} \right] \times 100\%$$

(2) Water balance coefficient (WB)

$$WB = \frac{\sum_{t=1}^{n} Q'_t}{\sum_{t=1}^{n} Q_t}$$

where $n$ is the number of years, $Q_t$ and $Q'_t$ are the observed and forecasted inflows, respectively, and $\bar{Q}$ is the average value of the observed flow. The closer the values of $R^2$ and WB are to one, the better the prediction results.
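The two criteria can be computed as in the sketch below. It assumes that the first criterion is the Nash-Sutcliffe efficiency expressed in percent and that WB is the ratio of total forecasted volume to total observed volume; both forms are assumptions, and the function names and toy data are hypothetical.

```python
def nash_sutcliffe(observed, forecast):
    """Nash-Sutcliffe efficiency R^2, in percent."""
    mean_obs = sum(observed) / len(observed)
    sse = sum((o - f) ** 2 for o, f in zip(observed, forecast))
    sst = sum((o - mean_obs) ** 2 for o in observed)
    return (1.0 - sse / sst) * 100.0


def water_balance(observed, forecast):
    """Total forecasted volume divided by total observed volume (assumed form)."""
    return sum(forecast) / sum(observed)


# Toy flows (hypothetical values).
obs = [1.0, 2.0, 3.0, 4.0]
sim = [2.0, 2.0, 3.0, 3.0]
r2 = nash_sutcliffe(obs, sim)
wb = water_balance(obs, sim)
```

In this toy case the forecast preserves the total volume (WB equals one) while still missing individual values, which is why the two criteria are reported together.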

Data
To compare the proposed SSA-ANN models with the NLPM-ANN model, the eight watersheds in China used by Pang et al. [16] were selected as case studies in this paper. The data include daily rainfall and runoff. Each data series is divided into three parts, i.e., a training set, a cross-validation set, and a testing set. The training set is used to train the network, and the cross-validation set is used to check the progress of the network and to implement early stopping in order to avoid over-fitting the training set. The testing set is used for model evaluation. Table 1 lists statistical information about all watersheds, including the mean (μ), standard deviation (Sx), maximum (Xmax), and minimum (Xmin). As shown in Table 1, the range of the training data does not fully cover that of the cross-validation or testing data. In order to ensure the extrapolation ability of the ANN and avoid numerical difficulties during calculation, all data are scaled to the interval [−0.9, 0.9] by normalization.
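A linear scaling to [−0.9, 0.9] can be sketched as follows. The text does not say which minimum and maximum define the mapping, so the sketch takes them as explicit arguments; the function name `make_scaler` and the toy data are hypothetical.

```python
def make_scaler(x_min, x_max, low=-0.9, high=0.9):
    """Build forward and inverse linear maps between [x_min, x_max] and [low, high]."""
    span = x_max - x_min
    width = high - low

    def scale(x):
        return low + width * (x - x_min) / span

    def unscale(z):
        return x_min + span * (z - low) / width

    return scale, unscale


# Toy runoff values (hypothetical data).
runoff = [0.0, 5.0, 20.0, 10.0]
scale, unscale = make_scaler(min(runoff), max(runoff))
scaled = [scale(q) for q in runoff]
```

Keeping the target interval narrower than the [−1, 1] range of a tanh-type transfer function leaves some headroom for values outside the range seen in training, and the inverse map returns network outputs to flow units.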

Determination of Model Inputs
Suitable predictive factors have an important impact on model performance. If the model input is only rainfall, the model can be expressed as:

$$y_t = f(x_{t-1}, x_{t-2}, \ldots, x_{t-n})$$

where $x$ is the rainfall series, $y$ is the runoff series, and $n$ is the number of antecedent rainfall components. In Pang et al.'s paper [16], only rainfall was selected as model input, so the SSA-ANN1 model, which uses only rainfall as model input, was developed. To ensure the comparability of model performance, the same $n$ values were selected for the SSA-ANN1 model and the NLPM-ANN model. From Pang et al.'s results for the NLPM-ANN model [16], the values of $n$ are 8, 6, 6, 8, 10, 8, 6, and 10 for Jiahe, Laoguanhe, Baohe, Mumahe, Nianyushan, Gaoguan, Shimen, and Tiantang, respectively. The autocorrelation of the runoff series is strong and the impact of previous runoff on current runoff cannot be ignored, so the SSA-ANN2 model, which uses rainfall and runoff as model inputs, was developed in this paper. It can be expressed as:

$$y_t = f(x_{t-1}, x_{t-2}, \ldots, x_{t-n}, y_{t-1}, y_{t-2}, \ldots, y_{t-m})$$

where $m$ is the number of previous runoff data. The values of $n$ for the SSA-ANN2 model are the same as for the SSA-ANN1 model. For convenience of operation and simplicity of computation, the autocorrelation function (ACF) is used to determine $m$. The smaller the correlation value, the weaker the relationship. Figure 3 plots the ACF values of the runoff series at the one-step prediction horizon. On this basis, $m$ is taken as 5, 5, 5, 3, 2, 3, 2, and 1 for Jiahe, Laoguanhe, Baohe, Mumahe, Nianyushan, Gaoguan, Shimen, and Tiantang, respectively. The number of previous daily runoff values is evidently related to the watershed area.
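The input selection can be illustrated with the sketch below, which computes the sample ACF of a series and assembles input vectors from $n$ antecedent rainfall values and $m$ previous runoff values. The lag indexing (values at $t-1$ back to $t-n$ and $t-m$) is an assumption, as are the function names; setting `m=0` gives the rainfall-only form.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at the given lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((v - mean) ** 2 for v in series)
    cov = sum((series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n))
    return cov / var


def build_samples(rain, runoff, n, m):
    """Pair [x_{t-1..t-n}, y_{t-1..t-m}] with the target y_t for every valid t."""
    start = max(n, m)
    inputs, targets = [], []
    for t in range(start, len(runoff)):
        past_rain = [rain[t - i] for i in range(1, n + 1)]
        past_flow = [runoff[t - j] for j in range(1, m + 1)]
        inputs.append(past_rain + past_flow)
        targets.append(runoff[t])
    return inputs, targets


# Toy series (hypothetical data).
rain = [1.0, 2.0, 3.0, 4.0, 5.0]
runoff = [10.0, 20.0, 30.0, 40.0, 50.0]
acf1 = autocorrelation(rain, 1)
X, y = build_samples(rain, runoff, n=2, m=1)
```

The ACF values at increasing lags would then be inspected, as in Figure 3, to decide how many previous runoff values to keep; the text does not give a numerical cutoff, so none is coded here.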

Data Preprocessing
According to the theory of SSA, the decomposition procedure requires identifying the parameter L. An appropriate value of L should clearly resolve the different oscillations hidden in the original signal. In the current study, a small interval of [2, 12] is examined to choose L [28]. A value of L is accepted only if its singular spectrum can be markedly distinguished [33]. Figures 4 and 5 present the relation between singular values and singular numbers for the rainfall and runoff series, respectively, where the singular values associated with the appropriate L are highlighted by the dotted solid line. L is selected as 8, 8, 8, 8, 9, 10, 9, and 7 for the rainfall series, and as 9, 8, 9, 10, 9, 10, 9, and 7 for the runoff series in the Jiahe, Laoguanhe, Baohe, Mumahe, Nianyushan, Gaoguan, Shimen, and Tiantang watersheds, respectively. Once the original series is decomposed into L components, the subsequent task is to identify noise, choose the contributing components, and reconstruct a new series as model input. This paper applies the cross-correlation function (CCF) to find the number of contributing components p (≤ L). From the perspective of linear correlation, a positive or negative CCF value indicates that the component makes a positive or negative contribution to the model output. Table 2 lists all CCF values between each decomposed component and the original series for all watersheds. Take the Jiahe rainfall series as an example: the last four components have positive CCF values, which means that they are positively correlated with the original series. The number of contributing components p is therefore equal to 4, and the sum of the last four components is the reconstructed series. The reconstructed series of the other time series are obtained in the same way.
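The component selection can be sketched as below. The sketch assumes that the CCF value in question is the lag-zero Pearson correlation between a decomposed component and the original series, and it keeps every component whose correlation is positive; the function names and the toy components are hypothetical.

```python
import math


def correlation(a, b):
    """Lag-zero Pearson correlation between two equal-length series."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    std_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    std_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return cov / (std_a * std_b)


def reconstruct(components, original):
    """Keep components positively correlated with `original` and sum them."""
    keep = [k for k, c in enumerate(components) if correlation(c, original) > 0]
    filtered = [sum(components[k][t] for k in keep) for t in range(len(original))]
    return keep, filtered


# Two toy components that add up to the original series (hypothetical data).
c0 = [1.0, 2.0, 3.0, 4.0]
c1 = [0.5, 0.0, -0.5, 0.0]
original = [u + v for u, v in zip(c0, c1)]
keep, filtered = reconstruct([c0, c1], original)
```

Here the second component is negatively correlated with the original series and is treated as noise, so p equals 1 and the filtered series is the first component alone.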

Results Analysis
Table 3 summarizes the model performances for each watershed during the calibration and testing periods. The ANN model is the benchmark, in which the input is the original rainfall series without data preprocessing. The results show that model performance is improved significantly by the data preprocessing techniques. During the testing period, the mean values of $R^2$ and WB over the eight watersheds are 70.16% and 0.879 for ANN, 75.86% and 1.155 for NLPM-ANN, and 80.62% and 1.04 for SSA-ANN1. In the Tiantang watershed, the NLPM-ANN and SSA-ANN1 models improve performance markedly: the $R^2$ value increases from 59.79% to 81.96% and 79.54%, respectively, during the testing period. For the SSA-ANN1 model, the mean $R^2$ values are 82.08% and 80.62%, and the mean WB values are 1.0 and 1.04, during the calibration and testing periods, respectively, which are much better than those of the NLPM-ANN model. This means that the reconstructed series obtained by SSA has strong regularity and is easy to simulate. It also suggests that noise in the hydrological time series affects model performance more than seasonal hydrological behavior does. Therefore, SSA is an effective way to improve runoff forecasting accuracy. The mean values of $R^2$ for the SSA-ANN2 model are 92.97% and 91.52%, which are much better than those of the SSA-ANN1 model. It is concluded that considering previous runoff as a model input can greatly improve model efficiency.
To compare the NLPM-ANN, SSA-ANN1, and SSA-ANN2 models more closely, one year during the testing period was selected as an example for four watersheds, and the observed and simulated runoff hydrographs produced by the three models for the Jiahe, Laoguanhe, Baohe, and Shimen watersheds are plotted in Figures 6-9, respectively. These figures show that the runoff hydrograph simulated by the SSA-ANN2 model is much closer to the observed one. The peak and minimum flows simulated by the SSA-ANN2 model are the best among these models. Therefore, the SSA-ANN2 model can predict daily runoff very well in practice.

Figure 3. Autocorrelation function (ACF) values of runoff series for all watersheds.

Figure 4. Singular values as a function of different window length L for rainfall series.

Figure 5. Singular values as a function of different window length L for runoff series.

Figure 6. Observed and simulated runoff hydrographs by three models for Jiahe.

Figure 7. Observed and simulated runoff hydrographs by three models for Laoguanhe.

Figure 8. Observed and simulated runoff hydrographs by three models for Baohe.

Figure 9. Observed and simulated runoff hydrographs by three models for Shimen.

Table 1. List of the watershed statistical information.

Table 2. Cross-correlation function (CCF) values between each decomposed component and original series.

Table 3. Summary of model performances during calibration and testing periods.