Streamflow Forecasting Using Empirical Wavelet Transform and Artificial Neural Networks

Abstract: Accurate and reliable streamflow forecasting plays an important role in various aspects of water resources management such as reservoir scheduling and water supply. This paper shows the development of a novel hybrid model for streamflow forecasting and demonstrates its efficiency. In the proposed hybrid model for streamflow forecasting, the empirical wavelet transform (EWT) is firstly employed to eliminate the redundant noises from the original streamflow series. Secondly, the partial autocorrelation function (PACF) values are explored to identify the inputs for the artificial neural network (ANN) models. Thirdly, the weights and biases of the ANN architecture are tuned and optimized by the multi-verse optimizer (MVO) algorithm. Finally, the simulated streamflow is obtained using the well-trained MVO-ANN model. The proposed hybrid model has been applied to annual streamflow observations from four hydrological stations in the upper reaches of the Yangtze River, China. Parallel experiments using non-denoising models, the back propagation neural network (BPNN) and the ANN optimized by the particle swarm optimization algorithm (PSO-ANN) have been designed and conducted to compare with the proposed model. Results obtained from this study indicate that the proposed hybrid model can capture the nonlinear characteristics of the streamflow time series and thus provides more accurate forecasting results.


Introduction
Assessing and estimating the water available in a basin over long periods (from months to years) is an important precondition for efficient reservoir scheduling and water supply. For this reason, constructing a forecasting model for long-term streamflow time series is essential [1]. At present, methods for long-term streamflow forecasting can be grouped into two main categories: physically-based models and data-driven models. Physically-based models usually consider multiple factors including rainfall, runoff, evaporation, soil moisture, infiltration, snow cover and melt, etc., while data-driven models take advantage of the implicit information contained in the historical streamflow time series to forecast future streamflow [2]. Physically-based models can suffer from the drawback that the physical relationships between different hydrological features are complex, and the lack of hydrological data sometimes makes physical modeling difficult to complete [3].
In the last few decades, numerous data-driven models for hydrological time series forecasting have been proposed to increase forecasting accuracy. The autoregressive moving average (ARMA) approach, since first being proposed by Box and Jenkins [4] and then popularized by Carlson et al. [5], has been one of the most widely-used methods for hydrological forecasting [6][7][8]. The ARMA technique assumes the time series to be stationary and to follow the normal distribution [4]. However, streamflow time series are usually characterized by both nonlinearity and instability; thus, linear time series forecasting techniques are not sufficient to capture the characteristics of hydrological time series [9]. In this case, artificial neural networks (ANNs) have been put forward and widely exploited for hydrological forecasting to deal with the nonlinearity and instability of hydrological time series. Hornik et al. [10] demonstrated that an ANN can approximate any measurable function to arbitrary accuracy. One of the most obvious advantages of the ANN technique is that one does not need a well-defined process for algorithmically converting inputs into outputs [11]. The advantages and disadvantages of ANNs in hydrologic research have been discussed in a comprehensive review by Govindaraju [12,13].
Although much has been achieved with single ANN models, there is still considerable room for improvement. One traditional approach is to optimize the weights and biases of the ANN. In the training process, the target is to find the optimal weights and biases that minimize the learning error. Since the behavior of an ANN depends highly on the initial values of its weights and biases, employing heuristic search algorithms in the training process has become popular [14][15][16][17][18]. Mirjalili et al. [14] employed the gravitational search algorithm (GSA) and PSOGSA (a hybrid of particle swarm optimization and the gravitational search algorithm) to train neural networks. Khan and Sahai [15] demonstrated the superiority of the new metaheuristic bat algorithm (BA) over the genetic algorithm (GA), particle swarm optimization (PSO), the "standard" back propagation (BP) algorithm and the Levenberg-Marquardt algorithm. In Ref. [16], the GA was exploited to search for the optimal weights and biases of the back propagation neural network (BPNN). Cheng et al. [17] proposed an ANN model based on quantum-behaved PSO (ANN-QPSO) for daily runoff forecasting. In Ref. [18], a gravitational search algorithm-based radial basis function approach (GSA-RBF) was exploited for predicting debris flow velocity.
Another trend for improving the single ANN model is to take advantage of data pre-analysis methodologies, such as eliminating stochastic volatilities from the original time series [19,20], using decomposed subseries as inputs [21,22] and employing a decompose-predict-ensemble framework [23,24]. Wu et al. [19] combined moving average (MA), singular spectrum analysis (SSA), wavelet multi-resolution analysis (WMRA) and ANNs to improve the forecasting accuracy of daily flows. In Ref. [20], the MA method, principal component analysis (PCA) and SSA were coupled with ANNs to predict rainfall time series. Besides the aforementioned data preprocessing-based techniques, wavelet transform and empirical mode decomposition (EMD) have been the most commonly-used data pre-analysis techniques in recent years. Kisi et al. [21] developed a wavelet-support vector regression (WSVR) model in which the discrete wavelet transform (DWT) was employed as a data preprocessing tool and was shown to improve the forecasting accuracy. Budu et al. [22] decomposed the observed reservoir inflow time series into subseries using DWT with different mother wavelet functions; appropriate subseries were then used as inputs for constructing ANN models. Fu et al. [23] developed a hybrid EMD-RBFNN model for predicting regional groundwater depth time series. Zhao et al. [24] combined a chaotic least squares support vector machine (CLSSVM) with EMD for annual runoff forecasting; the EMD-CLSSVM model attained significant improvement over the single CLSSVM model in long-term runoff time series forecasting. Nourani et al. [25] provided a comprehensive introduction to the application of wavelet-artificial intelligence (AI) conjunction models in simulating important processes in the hydrologic cycle.
Although wavelet transform (WT) and EMD have been used widely in recent years, the behavior of different WT-based forecasting models depends highly on the selection of mother wavelets, and EMD suffers from the drawbacks of a lack of mathematical theory and sensitivity to both noise and sampling. In this case, Gilles [26] proposed a new adaptive data analysis method, empirical wavelet transform (EWT), to overcome the drawbacks of these decomposition techniques. The efficiency of EWT has been demonstrated by a few benchmarks in Ref. [26]. Additionally, EWT has been applied to several forecasting models successfully since being proposed in 2013. Hu and Wang [27] combined the EWT with Gaussian process regression (GPR) for forecasting short-term wind speed. Wang and Hu [28] proposed a GPR model combined with autoregressive integrated moving average (ARIMA), extreme learning machine (ELM), SVM and LSSVM for short-term wind speed forecasting. In both Ref. [27] and Ref. [28], EWT was employed as a data pre-analysis technique to filter the noises of the original time series.
Multi-verse optimizer (MVO) is a novel nature-inspired optimization algorithm, which was originally introduced by Mirjalili et al. [29]. In Ref. [29], the efficiency of MVO has been demonstrated by 19 benchmark functions and five real engineering optimization problems. Results have proven that the MVO algorithm outperforms the best algorithms proposed in the literature on the majority of the test beds. Additionally, the MVO algorithm has exhibited good performance in optimizing the weights and biases of a feedforward multilayer perceptron neural network [30], as well as the parameters of SVM [31].
In consideration of the good performance of EWT and MVO, a hybrid EWT-MVO-ANN model is proposed for more accurate annual streamflow forecasting, in which EWT is employed to decompose the streamflow time series into subseries and eliminate the noises, the MVO algorithm is adopted to optimize the parameters of the ANN, and the ANN trained with MVO is employed for time series forecasting. Before constructing the MVO-ANN models, the input variables are selected according to the PACF (partial autocorrelation function) values. Parallel experiments with BPNN, PSO-ANN and MVO-ANN, with and without the EWT denoising process, have been designed and applied to annual streamflow data of four hydrological stations in China to highlight the superiority of the proposed EWT-MVO-ANN model. The results demonstrate that EWT is an effective way to filter the noises of annual streamflow series and thus improve prediction accuracy, and that the MVO algorithm exhibits better performance than the Levenberg-Marquardt (LM) and PSO algorithms in training ANN models for annual streamflow forecasting. The results also demonstrate that the proposed hybrid model provides an effective modeling approach for capturing the nonlinear characteristics of annual streamflow series, thus providing more satisfactory forecasting results.

Empirical Wavelet Transform
The empirical wavelet transform (EWT) was first introduced by Jerome Gilles [26] in 2013. The main idea of this method is to extract the amplitude-modulated-frequency-modulated (AM-FM) components of a given signal by adaptively designing an appropriate wavelet filter bank. EWT produces intrinsic mode functions (IMFs) according to the information contained in the spectrum of the signal. The decomposition of a signal x(t) using EWT can be described in three steps:
Step 1: Compute the Fourier spectrum F(ω) of the original time series x(t) using the fast Fourier transform (FFT) algorithm.
Step 2: Extract different modes by proper segmentation of the Fourier spectrum and determine the boundaries.
Step 3: After the boundaries are determined by appropriate segmentation, calculate the approximation and detail coefficients by applying the scaling function and the empirical wavelets corresponding to each detected segment.
Equipped with the obtained set of local maxima plus 0 and π, the boundaries ω_i of each segment are defined as the centers between two consecutive maxima. Let {f_0, f_1, f_2, ..., f_N} be the set of frequencies corresponding to the set of local maxima, with f_0 = 0 and f_N = π. The boundaries ω_i are given as:

ω_i = (f_i + f_{i+1}) / 2, i = 1, 2, ..., N − 1,

and then the Fourier segments are [0, ω_1], [ω_1, ω_2], ..., [ω_{N−1}, π]. A transition phase T_n of width 2τ_n is defined centered around each ω_n [26].
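As a concrete illustration, the boundary-detection step can be sketched in a few lines of Python. The function name `ewt_boundaries` and the simple largest-peaks rule are illustrative assumptions, not the exact refinement procedure of Ref. [26]:

```python
import numpy as np

def ewt_boundaries(x, n_modes):
    """Segment the Fourier spectrum for EWT: find local maxima of the
    magnitude spectrum, keep the n_modes largest, and place each boundary
    at the center between two consecutive maxima (minimal sketch)."""
    spectrum = np.abs(np.fft.rfft(x))               # magnitude spectrum over [0, pi]
    freqs = np.linspace(0.0, np.pi, len(spectrum))
    # local maxima of the spectrum (DC bin excluded)
    peaks = [i for i in range(1, len(spectrum) - 1)
             if spectrum[i] > spectrum[i - 1] and spectrum[i] >= spectrum[i + 1]]
    # keep the n_modes largest peaks, then order them by frequency
    peaks = sorted(sorted(peaks, key=lambda i: spectrum[i])[-n_modes:])
    f = [freqs[i] for i in peaks]
    # boundaries: midpoints between consecutive peak frequencies
    return [(f[i] + f[i + 1]) / 2.0 for i in range(len(f) - 1)]
```

For a two-tone signal this returns a single boundary separating the two spectral peaks, which together with 0 and π defines the two Fourier segments.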
After determining the set of boundaries, a bank of N wavelet filters, comprising one low-pass filter and N − 1 band-pass filters, is defined based on the detected boundaries. The empirical wavelets ψ_n(ω) and the scaling function φ_1(ω) are given by Gilles [26], as shown in Equations (3) and (4), respectively:

ψ_n(ω) = 1, if ω_n + τ_n ≤ |ω| ≤ ω_{n+1} − τ_{n+1};
ψ_n(ω) = cos[(π/2) β((|ω| − ω_{n+1} + τ_{n+1}) / (2τ_{n+1}))], if ω_{n+1} − τ_{n+1} ≤ |ω| ≤ ω_{n+1} + τ_{n+1};
ψ_n(ω) = sin[(π/2) β((|ω| − ω_n + τ_n) / (2τ_n))], if ω_n − τ_n ≤ |ω| ≤ ω_n + τ_n;
ψ_n(ω) = 0, otherwise, (3)

φ_1(ω) = 1, if |ω| ≤ ω_1 − τ_1;
φ_1(ω) = cos[(π/2) β((|ω| − ω_1 + τ_1) / (2τ_1))], if ω_1 − τ_1 ≤ |ω| ≤ ω_1 + τ_1;
φ_1(ω) = 0, otherwise, (4)

where β(x) is an arbitrary function such that β(x) = 0 for x ≤ 0, β(x) = 1 for x ≥ 1 and β(x) + β(1 − x) = 1 for x ∈ [0, 1]. The widely-used function β(x) according to Ref. [32] is β(x) = x⁴(35 − 84x + 70x² − 20x³). Additionally, τ_n = γω_n with 0 < γ < 1, where γ can be selected such that γ < min_n [(ω_{n+1} − ω_n) / (ω_{n+1} + ω_n)]. The detail coefficients are given by the inner products with the empirical wavelets, W(n, t) = ⟨x, ψ_n⟩ = F⁻¹[F(ω) ψ̄_n(ω)], while the approximation coefficients are given by the inner products with the scaling function, W(0, t) = ⟨x, φ_1⟩ = F⁻¹[F(ω) φ̄_1(ω)], where F⁻¹ denotes the inverse Fourier transform. Finally, the reconstruction of the original signal is obtained according to x(t) = W(0, t) ∗ φ_1(t) + Σ_{n=1}^{N−1} W(n, t) ∗ ψ_n(t), where ∗ denotes convolution.

Multi-Verse Optimizer
The multi-verse optimizer (MVO) algorithm, proposed by Mirjalili et al. [29], is inspired by the theory of the multi-verse in physics, which holds that there is more than one big bang and that each big bang causes the birth of a universe [29]. MVO builds a mathematical model of white/black hole tunnels and wormholes that connect universes and exchange their objects through a roulette wheel mechanism. At every iteration, the universes are sorted according to their inflation rates, and a universe is selected by the roulette wheel, in proportion to its inflation rate, to have a white hole. MVO is mathematically modeled as follows. Assume the set of universes is the n × d matrix U = [x_i^j], where d is the number of parameters (variables) and n is the number of universes (candidate solutions). The exchange of objects through white/black hole tunnels is modeled as:

x_i^j = x_k^j, if r1 < NI(U_i); x_i^j, otherwise,

where x_i^j denotes the j-th parameter of the i-th universe U_i, x_k^j denotes the j-th parameter of a universe U_k selected by the roulette wheel, NI(U_i) is the normalized inflation rate of the i-th universe and r1 is a random number in [0, 1].
Assuming that there is a wormhole tunnel between a universe and the best universe obtained so far, the transportation of objects can be described as follows: if r2 < WEP,

x_i^j = X_j + TDR × ((ub_j − lb_j) × r4 + lb_j), when r3 < 0.5,
x_i^j = X_j − TDR × ((ub_j − lb_j) × r4 + lb_j), when r3 ≥ 0.5,

and x_i^j remains unchanged otherwise, where X_j denotes the j-th parameter of the best universe; WEP and TDR are the wormhole existence probability and travelling distance rate, respectively; lb_j and ub_j denote the lower and upper bounds of the j-th variable, respectively; and r2, r3, r4 are random numbers in [0, 1]. The formulae for WEP and TDR are defined as follows:

WEP = min + l × (max − min) / L, (13)
TDR = 1 − l^(1/p) / L^(1/p), (14)

where min and max are the minimum and maximum, defined as 0.2 and 1, respectively; l denotes the current iteration; L denotes the maximum number of iterations; and p, the exploitation accuracy over the iterations, is defined as 6. The number of universes is defined as 30, and the maximum number of iterations is 500 in this study.
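The WEP/TDR schedules of Equations (13) and (14) and the wormhole transport can be sketched as follows; `wormhole_update` is a hypothetical helper illustrating one universe's position update, not the full MVO loop:

```python
import random

def wep_tdr(l, L, wep_min=0.2, wep_max=1.0, p=6):
    """Eq. (13): WEP grows linearly from 0.2 to 1;
    Eq. (14): TDR decays from about 0.6 toward 0 over the iterations (p = 6)."""
    wep = wep_min + l * (wep_max - wep_min) / L
    tdr = 1.0 - (l ** (1.0 / p)) / (L ** (1.0 / p))
    return wep, tdr

def wormhole_update(universe, best, lb, ub, wep, tdr):
    """Wormhole transport of each parameter toward the best universe,
    clipped to the search bounds (a minimal sketch of the update rule)."""
    new = []
    for j, xj in enumerate(universe):
        r2, r3, r4 = random.random(), random.random(), random.random()
        if r2 < wep:
            step = tdr * ((ub[j] - lb[j]) * r4 + lb[j])
            xj = best[j] + step if r3 < 0.5 else best[j] - step
        new.append(min(max(xj, lb[j]), ub[j]))
    return new
```

With L = 500 and p = 6, the first iteration gives TDR ≈ 0.64, consistent with the TDR range quoted in the parameter settings below.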

Artificial Neural Network
As a typical intelligent learning paradigm, the artificial neural network has been widely used in many areas of science and engineering, such as daily water level forecasting [33], rainfall-runoff simulation [34] and wind speed forecasting [35]. As can be seen in Figure 1, a typical three-layer feed-forward ANN consists of an input, a hidden and an output layer [36]. Each layer contains nodes that are connected to nodes of the adjacent layer(s), and each connection is associated with a weight. For such a simplified three-layer network, the output of the j-th hidden node is given by Equation (15):

H_j = f( Σ_{i=1}^{n} w_ij x_i + a_j ), j = 1, 2, ..., l, (15)

where f(x) = 1/(1 + exp(−x)) is the transfer function of the hidden layer; n denotes the number of nodes in the input layer; l denotes the number of nodes in the hidden layer; w_ij represents the connection weight from the i-th input node to the j-th hidden node; and a_j represents the bias of the j-th hidden node.

After calculating the outputs of the hidden layer, the final output is given as follows:

O_k = Σ_{j=1}^{l} w_jk H_j + b_k, k = 1, 2, ..., m, (16)

where m denotes the number of nodes in the output layer, w_jk denotes the connection weight from the j-th hidden node to the k-th output node and b_k is the bias of the k-th output node.
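The hidden- and output-layer computations above can be sketched as a single forward pass; the function name and shape conventions are illustrative assumptions:

```python
import numpy as np

def forward(x, W1, a, W2, b):
    """Three-layer feed-forward pass: sigmoid hidden layer, Eq. (15),
    followed by a linear output layer, Eq. (16).
    Shapes: x is (n,), W1 is (l, n), a is (l,), W2 is (m, l), b is (m,)."""
    h = 1.0 / (1.0 + np.exp(-(W1 @ x + a)))   # hidden-node outputs H_j
    return W2 @ h + b                          # output-node values O_k
```

With all weights and biases at zero except unit output weights, each of the l hidden sigmoids outputs 0.5, so the single output equals l/2; this gives a quick sanity check of the wiring.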

ANN Trained with MVO
Training of neural networks is a form of supervised learning in which the weights and biases of the network are updated to reduce the error function whenever the training outputs do not match the expected accuracy. Unlike traditional local optimization training algorithms, such as the gradient descent method and the Levenberg-Marquardt (LM) algorithm, global optimization algorithms can overcome the drawback of local minima and obtain more efficient solutions. Here, the MVO algorithm is exploited to find an optimal combination of connection weights and biases that minimizes the fitness error. The MVO iteration begins with a set of random universes; when training the neural network, the weights and biases of the network are represented by the parameters of the universes, which are optimized in each iteration of MVO.

In the MVO-ANN model, the ANN model is trained with MVO. To evaluate the performance of a candidate solution, the root mean square error (RMSE) is adopted as the fitness function. Defining the number of training samples as q, the fitness function can be described as follows:

fitness = sqrt( (1/q) Σ_{i=1}^{q} Σ_{k=1}^{m} (Y_i^k − O_i^k)² ),

where Y_i^k and O_i^k denote the predicted and observed values of the k-th output node in the i-th training sample, respectively.
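A minimal sketch of this fitness evaluation, assuming each MVO universe is a flat parameter vector decoded in the order (W1, a, W2, b); the decoding order and helper name are illustrative assumptions:

```python
import numpy as np

def rmse_fitness(theta, shape, X, Y):
    """Decode a flat MVO candidate `theta` into ANN weights/biases and
    return the RMSE fitness over the q training samples in (X, Y)."""
    n, l, m = shape                              # input, hidden, output sizes
    i = 0
    W1 = theta[i:i + l * n].reshape(l, n); i += l * n
    a = theta[i:i + l]; i += l
    W2 = theta[i:i + m * l].reshape(m, l); i += m * l
    b = theta[i:i + m]
    H = 1.0 / (1.0 + np.exp(-(X @ W1.T + a)))    # (q, l) hidden outputs
    O = H @ W2.T + b                              # (q, m) network outputs
    return float(np.sqrt(np.mean((O - Y) ** 2)))
```

MVO then simply minimizes `rmse_fitness` over candidate vectors `theta`, treating each vector as one universe's parameters.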

Partial Autocorrelation Function
Selecting appropriate input variables for ANN models is an essential precondition for ANN modeling, as the relationship between inputs and outputs has a significant impact on the forecasting accuracy. To address this matter, PACF values of the streamflow time series are adopted to identify the input variables. Assume x_t (t = 1, 2, ..., n) to be the target variable. If the PACF values at lags greater than k fall inside the 95% confidence interval, then x_{t−1}, x_{t−2}, ..., x_{t−k} are employed as input variables. If all PACF values are in the 95% confidence interval, the input variables are chosen by a trial and error method, systematically increasing the number of sequential data from the latest year back to the sixth year. The calculation steps of the PACF values of the target variable x_t (t = 1, 2, ..., n) can be briefly described as follows [37]. Firstly, let γ_k denote the covariance at lag k (γ_0 represents the variance):

γ_k = (1/n) Σ_{t=1}^{n−k} (x_t − x̄)(x_{t+k} − x̄).

Secondly, the autocorrelation coefficient at lag k, denoted by ρ_k, is calculated:

ρ_k = γ_k / γ_0.

Finally, the partial autocorrelation coefficient at lag k, denoted as α_kk, is obtained recursively:

α_11 = ρ_1,
α_kk = ( ρ_k − Σ_{j=1}^{k−1} α_{k−1,j} ρ_{k−j} ) / ( 1 − Σ_{j=1}^{k−1} α_{k−1,j} ρ_j ),
α_{k,j} = α_{k−1,j} − α_kk α_{k−1,k−j}, j = 1, 2, ..., k − 1,

where k = 1, 2, ..., M.
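The three steps above (covariance, autocorrelation, partial autocorrelation via the Durbin-Levinson recursion) can be sketched in Python; the function name `pacf` is illustrative:

```python
import numpy as np

def pacf(x, max_lag):
    """Sample PACF values alpha_kk for lags 1..max_lag, computed from the
    sample autocorrelations with the Durbin-Levinson recursion."""
    x = np.asarray(x, float)
    n, xm = len(x), np.mean(x)
    # covariance at lag k (gamma[0] is the variance)
    gamma = [np.sum((x[:n - k] - xm) * (x[k:] - xm)) / n for k in range(max_lag + 1)]
    rho = [g / gamma[0] for g in gamma]          # autocorrelation coefficients
    phi = np.zeros((max_lag + 1, max_lag + 1))
    alpha = [0.0] * (max_lag + 1)
    phi[1, 1] = alpha[1] = rho[1]
    for k in range(2, max_lag + 1):
        num = rho[k] - sum(phi[k - 1, j] * rho[k - j] for j in range(1, k))
        den = 1.0 - sum(phi[k - 1, j] * rho[j] for j in range(1, k))
        phi[k, k] = alpha[k] = num / den
        for j in range(1, k):                     # update intermediate coefficients
            phi[k, j] = phi[k - 1, j] - phi[k, k] * phi[k - 1, k - j]
    return alpha[1:]
```

At lag 1 the PACF equals the lag-1 autocorrelation ρ_1, which gives an easy hand check on short series.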

The Hybrid EWT-MVO-ANN Model
The overall structure of the proposed hybrid EWT-MVO-ANN model is shown in Figure 2. As can be seen in Figure 2, the detailed procedure of the proposed hybrid EWT-MVO-ANN model can be described in five steps. Firstly, the EWT is exploited to decompose the original streamflow series into several detailed components and an approximated component. The approximated component corresponds to the general trend of the original series, while the detailed components are high-frequency IMFs that belong to different frequency bands of the streamflow series. Secondly, the residual series is discarded to eliminate the noises and smooth the original streamflow series; the sum of the remaining modes produces the reconstructed series. Thirdly, the PACF values are calculated to determine the best combination of input variables. Fourthly, the MVO algorithm is exploited to find the optimal weights and biases of the ANN architecture that minimize the fitness error. Finally, the forecasting results of the streamflow series are obtained from the well-trained MVO-ANN model.


Model Performance Evaluation
It is important to apply multiple error measures when evaluating the forecasting ability of the developed models. This paper considers four statistical measures including RMSE, mean absolute error (MAE), mean absolute percent error (MAPE) and the coefficient of correlation (R). Among the four statistical measures, RMSE is sensitive to extremely large or small values of a time series and reflects the degree of variation; MAE reflects the actual forecasting error from a more balanced perspective; and MAPE is a unitless measure of accuracy. The RMSE, MAE and MAPE are defined as follows:

RMSE = sqrt( (1/N) Σ_{i=1}^{N} (q_p(i) − q_o(i))² ),
MAE = (1/N) Σ_{i=1}^{N} |q_p(i) − q_o(i)|,
MAPE = (1/N) Σ_{i=1}^{N} |(q_p(i) − q_o(i)) / q_o(i)| × 100%,

where q_p(i) and q_o(i) denote the predicted and observed annual streamflow series, respectively, and N denotes the length of the data. R shows the degree to which two datasets are related to each other and ranges from −1 to 1; the larger the absolute value of R, the more closely the predicted and observed data are related. The coefficient of correlation (R) is defined as:

R = Σ_{i=1}^{N} (q_o(i) − q̄_o)(q_p(i) − q̄_p) / sqrt( Σ_{i=1}^{N} (q_o(i) − q̄_o)² Σ_{i=1}^{N} (q_p(i) − q̄_p)² ),

where q̄_p and q̄_o denote the average values of the predicted and observed runoffs, respectively.
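The four measures can be sketched directly from their definitions; the function names are illustrative:

```python
import numpy as np

def rmse(qp, qo):
    """Root mean square error between predicted and observed series."""
    return float(np.sqrt(np.mean((np.asarray(qp, float) - np.asarray(qo, float)) ** 2)))

def mae(qp, qo):
    """Mean absolute error."""
    return float(np.mean(np.abs(np.asarray(qp, float) - np.asarray(qo, float))))

def mape(qp, qo):
    """Mean absolute percent error (unitless, in %)."""
    qp, qo = np.asarray(qp, float), np.asarray(qo, float)
    return float(np.mean(np.abs((qp - qo) / qo)) * 100.0)

def corr_r(qp, qo):
    """Pearson coefficient of correlation R."""
    return float(np.corrcoef(np.asarray(qp, float), np.asarray(qo, float))[0, 1])
```

Smaller RMSE, MAE and MAPE and an R closer to 1 indicate better forecasts, which is the criterion used in the comparisons below.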

Study Area and Data Collection
Annual streamflow time series of four hydrological gauging stations located in two river basins in the upper reaches of the Yangtze River in China, i.e., Xiangjiaba, Panzhihua, Luoduxi and Sanleiba, are studied in this paper.
The first river basin is the Jinsha River Basin. The Jinsha River flows through the provinces of Qinghai, Sichuan and Yunnan in western China. Forming the uppermost reaches of the Yangtze River, the Jinsha River has a length of 3486 km, accounting for 77% of the total length of the upper Yangtze River, and a drainage area of 480,000 km², accounting for 50% of the total area of the upper Yangtze River. The second river basin is the Jialing River Basin. The Jialing River is a major tributary of the Yangtze River in the Sichuan Basin. The Jialing River Basin has a main stream length of 1120 km and a drainage area of 160,000 km², accounting for about 17% of the drainage area of the Yangtze River Basin. Two hydrological gauging stations in this basin, Luoduxi and Sanleiba, are selected in this study. Annual streamflow data from 1954 to 2008 (55 observations) for both Luoduxi and Sanleiba are studied. Data from 1954 to 1996 (43 observations) are used for training, whilst those from 1997 to 2008 (12 observations) are used for validation. Figure 3 provides a schematic of the Jinsha River Basin and the Jialing River Basin, as well as the four hydrological gauging stations researched in this study. Annual streamflow data of the four stations are presented in Figure 4. Basic statistical information, such as the mean value, standard deviation (SD), minimum (Min.), maximum (Max.), skewness (Skew.) and kurtosis (Kurt.), of these datasets is given in Table 1.

Decomposing and Reconstruction Using EWT
Since EWT borrows the framework of the classic WT in the decomposition and reconstruction processes [27], and three-level decomposition has been considered the most appropriate in various studies, such as wind speed forecasting [27,28], drought index prediction [38] and daily river flow forecasting [39], the decomposition level is set to three in this study. Graphical representations of the decomposed subseries using EWT for the four study stations are illustrated in Figure 5. As can be seen in Figure 5, the original annual streamflow series is decomposed into three independent modes and a residual for each of the four stations.


Among these modes, Mode 1 is a trend term that describes the general tendency of the original annual streamflow series. The modes present changing frequencies, amplitudes and wavelengths. For all stations, Mode 1 has the lowest frequency, the longest wavelength and contributes the most to the original time series. For the subsequent subseries, the frequency increases, while the wavelength and contribution decrease.
After decomposing the original streamflow series, the residual is discarded because it mainly contains noisy information, and the remaining modes are reconstructed into a new series. Graphical representations of the reconstructed series are shown in Figure 6. As can be seen in Figure 6, the reconstructed series is smoother after the denoising process.


Input Determination
To obtain the input variables for each model, the PACF values between each streamflow series and their antecedent values at lag k are calculated using the method described in Section 2.3, as well as the PACF values between the reconstructed streamflow series and their antecedent values. Graphical representations of the PACF values are shown in Figure 7 for the four stations. According to the input variable selection method described in Section 2.3, the input variables of the models for the original and the reconstructed series of the four stations are finally determined as described in Table 2.
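The lag-selection rule of Section 2.3 can be sketched as follows, assuming PACF values are supplied for lags 1, 2, ... and the 95% confidence band is approximated by ±1.96/√n; the function name is illustrative:

```python
import numpy as np

def select_lags(pacf_vals, n):
    """Pick input lags 1..k, where k is the largest lag whose PACF value
    lies outside the approximate 95% confidence band of +/- 1.96/sqrt(n).
    Returns an empty list when no lag is significant (trial-and-error case)."""
    bound = 1.96 / np.sqrt(n)
    significant = [k + 1 for k, v in enumerate(pacf_vals) if abs(v) > bound]
    return list(range(1, max(significant) + 1)) if significant else []
```

An empty result corresponds to the case in Section 2.3 where all PACF values fall inside the confidence interval and the inputs are instead chosen by trial and error.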

Table 2. Input determination for the original and reconstructed series (columns: Stations; Inputs Combination for the Original and the Reconstructed series).

Parameter Settings
To verify the efficiency of the proposed MVO-ANN model, the back propagation neural network (BPNN) trained with the LM algorithm and an ANN trained with the PSO algorithm (PSO-ANN) have been constructed for annual streamflow forecasting.
After determining the optimal combination of input variables using the method described in Section 2.3, the streamflow data are normalized to [0, 1]. In most cases, the sigmoid function is used as the transfer function in the hidden layer, while the linear activation function is used in the output layer of an ANN [40]. In this study, all of the ANN models exploit the tan-sigmoid function and the purelin function as the activation functions for the hidden layer and the output layer, respectively. For the single BPNN models, the LM training function (TrainLM) is used to speed up the training process. For the MVO-ANN models, the number of universes is set to 30 and the maximum number of iterations to 500, following Ref. [29]. A maximum of 1000 iterations was also tried for the MVO-ANN model during the experiments; the results showed no significant improvement in accuracy, while the experiment time was much longer. The WEP is increased linearly from 0.2 to 1, while the TDR is decreased from 0.6 down to zero, according to Equations (13) and (14), respectively. For the PSO-ANN models, the population size is set to 30 and the maximum number of iterations to 500 for a fair comparison with the MVO-ANN model. The optimal numbers of hidden nodes for all ANN models, including BPNN, PSO-ANN and MVO-ANN with and without EWT, are determined using the grid search (GS) algorithm; the search range is set following Refs. [2,29] and the grid step is set to one [35]. Each experiment is repeated 10 times, and the results with the best accuracy are presented.

Results and Analysis
To evaluate the performance of these models, four statistical indices (RMSE, MAE, MAPE and R) are adopted. It is clear from Tables 3-6 that the proposed EWT-MVO-ANN model obtains the smallest RMSE, MAE and MAPE and the largest R compared with the other five models. Comparison between the models with and without the EWT algorithm reveals that combining EWT with the ANN model is an effective way to increase the forecasting accuracy. Although the proposed EWT-MVO-ANN model exhibits much better performance than the other models, its configuration is more complex on account of data pre-processing and parameter optimization. Thus, the required computation time is much longer than that of the single BPNN model without the denoising and optimization processes. Detailed analyses are presented to further evaluate the performance of these models. Table 7 compares the forecasting performance of the EWT-PSO-ANN and EWT-MVO-ANN models. It can be concluded from Table 7 that the EWT-MVO-ANN models outperform the EWT-PSO-ANN models in RMSE, MAE, MAPE and R for the four stations in both the training and validation periods, which demonstrates the superiority of the MVO algorithm in training ANNs. To further illustrate the effectiveness of the EWT algorithm in enhancing the accuracy of ANN models for streamflow forecasting, the forecasting results of the MVO-ANN and EWT-MVO-ANN models for the four stations are shown in Figure 8: line charts with markers on the left side represent the forecasted and observed values of streamflow, and scatter plots with a linear fitting line of the predicted against the observed values are shown on the right side. The subgraphs on the right side of Figure 8 are drawn using the "Fit Linear" tool in the OriginPro environment, with the intercepts of the linear fitting lines fixed at zero to better display the differences between the two models.
From the left side of Figure 8, it is obvious that the prediction lines of the EWT-MVO-ANN models approximate the observed values better than those of the MVO-ANN models. From the right side of Figure 8, it can be seen that the scatters of the EWT-MVO-ANN models are distributed a little more tightly around the linear fitting line compared with the MVO-ANN models. These observations demonstrate the effectiveness of EWT in the forecasting of annual streamflow time series. Generally, it can be concluded from the experimental results of this study that the ANN model trained by LM cannot give a satisfactory performance in the forecasting of annual streamflow series, and that the MVO method achieves better performance in training ANN models than the LM and PSO algorithms. Compared with the MVO-ANN model, the EWT-MVO-ANN model leads to a much better performance in terms of the four evaluation indices, which demonstrates that EWT is an effective decomposition technique for improving the forecasting accuracy of annual streamflow time series.
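The four evaluation indices used throughout the comparisons can be computed as follows. The formulas are the standard definitions of RMSE, MAE, MAPE (in percent) and the Pearson correlation coefficient R, which the paper's indices are assumed to follow; the function name is illustrative.

```python
import numpy as np

def evaluate(obs, sim):
    """Return (RMSE, MAE, MAPE in %, R) for observed vs. simulated streamflow."""
    obs = np.asarray(obs, dtype=float)
    sim = np.asarray(sim, dtype=float)
    err = sim - obs
    rmse = np.sqrt(np.mean(err ** 2))            # root mean square error
    mae = np.mean(np.abs(err))                   # mean absolute error
    mape = 100.0 * np.mean(np.abs(err / obs))    # mean absolute percentage error
    r = np.corrcoef(obs, sim)[0, 1]              # Pearson correlation coefficient
    return rmse, mae, mape, r

print(evaluate([1.0, 2.0, 4.0], [1.2, 1.9, 4.3]))
```

Smaller RMSE, MAE and MAPE and a larger R indicate better agreement, which is the criterion used to rank the six models above.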

Conclusions
This paper develops a hybrid EWT-MVO-ANN model for streamflow time series forecasting. EWT is exploited to extract useful information and eliminate stochastic volatility from the original streamflow series, and an ANN trained with the MVO method is employed as the predictor. To fully demonstrate the effectiveness of the proposed model, data from four hydrological stations with different levels of discharge and different volatilities in the upper reaches of the Yangtze River, China, are studied. Among the six models, including BPNN, PSO-ANN, MVO-ANN, EWT-BPNN, EWT-PSO-ANN and EWT-MVO-ANN, the EWT-MVO-ANN model performs the best according to four statistical indices (RMSE, MAE, MAPE and R). The following conclusions have been drawn:
1. Compared with the BPNN model, the ANN models trained by intelligent algorithms obtain better forecasting results, and the MVO algorithm performs better than the PSO algorithm.
2. By discarding the residue of the subseries, the forecasting ability of the EWT-MVO-ANN model is improved, which demonstrates that EWT is suitable for data pre-analysis of streamflow time series.
3. The proposed model is appropriate for forecasting annual streamflow series with different magnitudes and fluctuations, as shown for the four hydrological gauging stations in China.
This study exploits a novel evolutionary algorithm to optimize the weights and biases of a traditional ANN, but the performance of ANNs with different network structures, such as the radial basis function network, the general regression neural network and the ELM, has not been studied. More attention will be paid to the performance of different ANNs for forecasting in future studies.