Data-Driven Approach for Rainfall-Runoff Modelling Using Equilibrium Optimizer Coupled Extreme Learning Machine and Deep Neural Network

Abstract: Rainfall-runoff (R-R) modelling is used to study the runoff generation of a catchment. The quantity or rate of change of the hydrological variable called runoff is important for environmental scientists to accomplish water-related planning and design. This paper proposes (i) an integrated model, named EO-ELM (an integration of the equilibrium optimizer (EO) and the extreme learning machine (ELM)), and (ii) a deep neural network (DNN) for one-day-ahead R-R modelling. The proposed R-R models are validated at two different benchmark catchment stations, the river Teifi at Glanteifi and the river Fal at Tregony in the UK. Firstly, a partial autocorrelation function (PACF) is used to select the optimal number of lag inputs for the proposed models. Six other well-known machine learning models, namely ELM, kernel ELM (KELM), particle swarm optimization-based ELM (PSO-ELM), support vector regression (SVR), artificial neural network (ANN) and gradient boosting machine (GBM), are used to benchmark the two proposed models in terms of prediction efficiency. Furthermore, to increase the performance of the proposed models, a discrete wavelet-based data pre-processing technique is applied to the rainfall and runoff data. The performance of wavelet-based EO-ELM and DNN is compared with wavelet-based ELM (WELM), KELM (WKELM), PSO-ELM (WPSO-ELM), SVR (WSVR), ANN (WANN) and GBM (WGBM). An uncertainty analysis and a two-tailed t-test are carried out to ensure the trustworthiness and efficacy of the proposed models. The experimental results for two different time series datasets show that EO-ELM with the optimal number of lags performs better than the others. In the case of wavelet-based daily R-R modelling, the proposed models performed better and showed robustness compared to the other models used. Therefore, this paper shows the efficient applicability of EO-ELM and DNN in R-R modelling, which may be used in the hydrological modelling field.


Introduction
The rainfall-runoff (R-R) modelling process is conducted by hydrologists for forecasting hydrological information (discharge data) which may be helpful in water resources engineering [1,2], flood mitigation planning [3], production of hydroelectric power [4],

Particle Swarm Optimization
Particle swarm optimization (PSO) is a population-based stochastic optimization technique proposed by Russell Eberhart and James Kennedy in 1995, inspired by the social behavior of bird flocking and fish schooling [57]. It mimics the navigation and foraging of a flock of birds or a school of fish. PSO uses a number of particles (individuals) moving around the problem space to find the best solution along their paths over the course of the run. In other words, PSO has a group of particles that move around the problem space. Particles are influenced by their own best past position and the best past position of the whole population (their neighbours). These concepts are modelled mathematically by Equations (1) and (2):

v_{i,n}^{j+1} = v_{i,n}^{j} + c_1 r_1 (p_{i,n}^{j} - x_{i,n}^{j}) + c_2 r_2 (p_{g,n}^{j} - x_{i,n}^{j}) (1)

x_{i,n}^{j+1} = x_{i,n}^{j} + v_{i,n}^{j+1} (2)

where v_{i,n}^{j+1} indicates the new velocity of the i-th particle; c_1 and c_2 specify the weighting coefficients for the local best and global best positions, respectively; x_{i,n}^{j+1} indicates the particle's new position; p_{i,n}^{j} is the i-th particle's best-known position; p_{g,n}^{j} indicates the best position known to the whole swarm; and r_1 and r_2 are uniform random numbers between 0 and 1. The authors of this paper utilize PSO to find optimal values of the ELM learning parameters in the training phase.
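The velocity and position updates in Equations (1) and (2) can be sketched as follows. The inertia weight w is not part of the equations above but is a common stabilising addition; the sphere objective, bounds, and all parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def pso_minimize(f, dim, n_particles=20, iters=100, c1=1.5, c2=1.5, w=0.7):
    x = rng.uniform(-5, 5, (n_particles, dim))   # particle positions
    v = np.zeros((n_particles, dim))             # particle velocities
    p_best = x.copy()                            # personal best positions
    p_best_f = np.array([f(xi) for xi in x])
    g_best = p_best[p_best_f.argmin()].copy()    # global best position
    for _ in range(iters):
        r1 = rng.random((n_particles, dim))
        r2 = rng.random((n_particles, dim))
        # Equation (1): new velocity from personal and global bests
        # (w is an added inertia weight, not in the text's equation)
        v = w * v + c1 * r1 * (p_best - x) + c2 * r2 * (g_best - x)
        # Equation (2): new position
        x = x + v
        fx = np.array([f(xi) for xi in x])
        improved = fx < p_best_f
        p_best[improved], p_best_f[improved] = x[improved], fx[improved]
        g_best = p_best[p_best_f.argmin()].copy()
    return g_best, float(p_best_f.min())

# toy objective: sphere function
best, val = pso_minimize(lambda z: np.sum(z ** 2), dim=3)
```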

Equilibrium Optimizer (EO)
The underlying nature of EO is based on the law of conservation of mass: the change of mass in time equals the mass entering a system, plus the mass generated inside the system, minus the mass leaving the system. More details about the inspiration of EO are given in [39]. Like other population-based optimization techniques, EO has a number of particles (the population) as candidate solutions, and a particle's position is termed its concentration. Each particle's fitness is evaluated to determine the equilibrium states of the population for the selection of the best solution. The steps of EO are as follows:
Step 1: Initialization of particle concentrations. The concentration of each particle is initialized randomly using the upper and lower bounds of the decision variables:

C_i^{init} = C_{lower_bound} + rand_i (C_{upper_bound} - C_{lower_bound}), i = 1, 2, ..., N (3)

where C_i^{init} is the i-th particle's initial concentration, C_{upper_bound} and C_{lower_bound} are the upper and lower bounds of the concentrations, respectively, rand_i is a random number generated between 0 and 1, and N is the population size.
Step 2: Candidate solutions in the equilibrium pool P_{eq,pool}. The equilibrium pool is a collection of the best candidate solutions and consists of five global optimum candidates. In the initial optimization period the positions of the equilibrium candidates are unknown; they are determined in each run of the main loop to provide knowledge about the search pattern. Among the five candidates, the four best equilibrium candidates found so far are selected in each run, together with a fifth that is the arithmetic mean of the four best solutions. The four best solutions are useful for the exploration task and the average one helps in the exploitation phase of the EO algorithm. The number of candidate solutions in the equilibrium pool may depend on the nature of the evaluated function; in this work, the authors choose five equilibrium candidates for the problem space. The equilibrium pool is:

P_{eq,pool} = {P_{eq(1)}, P_{eq(2)}, P_{eq(3)}, P_{eq(4)}, P_{eq(ave)}} (4)

Each particle updates its concentration using a candidate selected at random from the pool with equal probability, so during the optimization process all candidates take part in approximately the same number of updates.
Step 3: Particle concentration updating. After the random selection of an equilibrium candidate, the main term in the concentration updating formula is the exponential term F (Equation (5)):

F = e^{-λ(t - t_0)} (5)
where λ is the turnover rate, i.e., the rate of change in concentration within a control volume, which varies with time and is defined as a random vector between 0 and 1. The variable t (time) is a function of the iteration count and decreases with each iteration (Equation (6)):

t = (1 - Iteration/Max_iteration)^{(a_2 · Iteration/Max_iteration)} (6)
where Iteration and Max_iteration are the present and maximum values of the iteration counter, respectively, and a_2 is a manually specified constant used to control the exploitation task. To guarantee convergence and a balance between intensification and diversification, t_0 is considered as:

t_0 = (1/λ) ln(-a_1 sign(r - 0.5) [1 - e^{-λt}]) + t (7)

where a_1 is a manually specified constant used to control the exploration task. A higher value of a_1 means better exploration ability and lower exploitation efficiency. Equally, a higher value of a_2 means better exploitation ability and lower exploration efficiency. The sign(r - 0.5) component specifies the direction of intensification and diversification of the particles, and r is defined as a random vector between 0 and 1.
The final expression of F is obtained by substituting Equation (7) into Equation (5):

F = a_1 sign(r - 0.5) [e^{-λt} - 1] (8)

The next important term of EO is the generation rate G, which is used to improve the exploitation task and steers the search towards the accurate solution. The generation rate is considered a function of time:

G = G_0 e^{-λ(t - t_0)} = G_0 F (9)

GCP = 0.5 r_1 if r_2 ≥ GP, and GCP = 0 otherwise (10)

G_0 = GCP (C_{eq} - λ C) (11)

where r_1 and r_2 are random values between 0 and 1, and GCP, called the generation rate control parameter, is a vector generated by replicating the same value. The GCP controls the probable contribution of the generation term in the updating process; this probability specifies how many particles update their positions with the generation term, and the generation probability (GP) is responsible for that act. Equations (10) and (11) determine this process. If GP = 0, then all concentrations of a particle are updated with the generation term G; GP = 0.5 provides a good balance between the exploitation and exploration phases.
The final concentration updating rule is:

C = C_{eq} + (C - C_{eq}) F + (G / (λ V)) (1 - F) (12)

where F is defined in Equation (8) and V is assigned unit value. Equation (12) consists of three terms. The first is the concentration of an equilibrium candidate. The second and third define the variations in a particle's concentration. The second term specifies large variations in concentration (the difference between the equilibrium candidate and the candidate particle) and controls exploration in EO. The third term manages exploitation by fine-tuning the concentration based on the generation rate (Equation (9)). Depending on the candidate particle's concentration, the equilibrium candidate particles, and the turnover rate λ, the final two terms may have the same or opposite signs: the same sign indicates global search and opposite signs indicate local search. The overall procedure is summarized in Algorithm 1.

Algorithm 1: EO
1. Initialize the particle concentrations C_i (i = 1, ..., N) using Equation (3)
2. Assign the parameter values a_1, a_2 and GP
3. Initialize the fitness of the four equilibrium candidates [fitness(P_{eq(1)}), fitness(P_{eq(2)}), fitness(P_{eq(3)}), fitness(P_{eq(4)})] to a large value
4. for it = 1 to maximum iteration number do
5.   for i = 1 to N do
6.     Estimate the fitness of the i-th particle
7.     if its fitness improves on any of the four equilibrium candidates, replace that candidate
8.   end for
9.   Set P_{eq(ave)} to the mean of the four candidates and form P_{eq,pool}
10.  Compute t (Equation (6))
11.  for i = 1 to N do
12.    Randomly select an equilibrium candidate from the equilibrium pool
13.    Generate the random vectors λ and r; compute F (Equation (8))
14.    Compute GCP (Equation (10)), G_0 (Equation (11)) and G (Equation (9))
15.    Update the particle's concentration (Equation (12))
16.  end for
17. end for

Because of the random selection from P_{eq,pool} and the sign(r - 0.5) term, in the starting period the pool helps particles follow global search patterns, while in the ending period it helps them follow local search patterns.
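As a rough illustration of Steps 1-3, the following sketch implements the EO update loop under stated assumptions: V = 1, clipping to the variable bounds, a simple merge-and-sort maintenance of the four equilibrium candidates, and a toy sphere objective in place of a real fitness function.

```python
import numpy as np

rng = np.random.default_rng(1)

def eo_minimize(f, dim, n_particles=30, max_iter=300,
                a1=2.0, a2=1.0, GP=0.5, lb=-5.0, ub=5.0):
    C = rng.uniform(lb, ub, (n_particles, dim))      # Step 1: initial concentrations
    best4 = np.zeros((4, dim))
    best4_f = np.full(4, np.inf)                     # fitness of 4 eq. candidates
    for it in range(1, max_iter + 1):
        fit = np.array([f(c) for c in C])
        # Step 2: keep the four best solutions seen so far
        allC = np.vstack([best4, C])
        allf = np.concatenate([best4_f, fit])
        order = np.argsort(allf)
        best4, best4_f = allC[order[:4]].copy(), allf[order[:4]].copy()
        pool = np.vstack([best4, best4.mean(axis=0)])        # equilibrium pool
        t = (1.0 - it / max_iter) ** (a2 * it / max_iter)    # Equation (6)
        # Step 3: concentration update for each particle
        for i in range(n_particles):
            Ceq = pool[rng.integers(5)]                      # random pool member
            lam, r = rng.random(dim), rng.random(dim)
            F = a1 * np.sign(r - 0.5) * (np.exp(-lam * t) - 1.0)   # Equation (8)
            r1, r2 = rng.random(), rng.random()
            GCP = 0.5 * r1 if r2 >= GP else 0.0              # Equation (10)
            G0 = GCP * (Ceq - lam * C[i])                    # Equation (11)
            G = G0 * F                                       # Equation (9)
            # Equation (12) with V = 1; guard against lam = 0
            C[i] = Ceq + (C[i] - Ceq) * F + (G / np.maximum(lam, 1e-12)) * (1.0 - F)
            C[i] = np.clip(C[i], lb, ub)
    return best4[0], float(best4_f[0])

best, val = eo_minimize(lambda z: np.sum(z ** 2), dim=3)
```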

Discrete Wavelet Transforms
Appl. Sci. 2021, 11, 6238

A wavelet function is used for the localization of a given function in the time and space scales [58,59]. It is generally utilized to acquire the evolutionary behaviour that describes oscillating daily discharge data [60]. The wavelet transform has the ability to simultaneously acquire information on the location, time and frequency of a signal. The discrete wavelet transform (DWT) is defined through the mother wavelet family:

ψ_{k,l}(x) = x_0^{-k/2} ψ((x - l y_0 x_0^k) / x_0^k) (13)

where ψ is the mother wavelet, and l and k are integers that generate the wavelet translation and dilation/scale, respectively. x_0 indicates the specified fine-scale step (x_0 > 1) and y_0 indicates the location parameter (y_0 > 0). Generally, the best choices for x_0 and y_0 are 2 and 1, respectively. The dyadic grid arrangement is one of the simplest and most efficient cases for practical purposes; it is defined by power-of-two logarithmic scaling of the translations and dilations [56]. Substituting x_0 = 2 and y_0 = 1, the DWT becomes:

DWT(k, l) = 2^{-k/2} Σ_{t=0}^{U-1} Q(t) ψ(2^{-k} t - l) (14)

where DWT(k, l) indicates the wavelet coefficient for the discrete wavelet of scale x = 2^k and location y = 2^k l. Q(t) indicates a finite discharge/precipitation time series (t = 0, 1, 2, ..., U - 1), with U an integer power of 2 (i.e., U = 2^M); k and l range over 1 ≤ k ≤ M and 0 ≤ l ≤ 2^{M-k} - 1, respectively. The decomposed detail sub-series represent the high-frequency information (high-pass filter) and the approximation shows the slowest variation (low-pass filter) of the original series. The sub-series represent fluctuations of period 2^n (dyadic translation/periodicity), where n indicates the sub-series component [58].
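The dyadic decomposition described above can be illustrated with a minimal filter-bank sketch. The Haar (db1) filter pair is used here purely for brevity (the paper itself uses a Db7 wavelet), and the synthetic series stands in for Q(t).

```python
import numpy as np

# Orthonormal Haar (db1) analysis filters, a stand-in for Db7
lo = np.array([1.0, 1.0]) / np.sqrt(2.0)   # low-pass (approximation)
hi = np.array([1.0, -1.0]) / np.sqrt(2.0)  # high-pass (detail)

def dwt_decompose(q, levels):
    """Return [D1, D2, ..., D_levels, A_levels] dyadic sub-series."""
    details, approx = [], np.asarray(q, dtype=float)
    for _ in range(levels):
        # pair consecutive samples: convolve and downsample by 2
        # (dyadic grid, x_0 = 2 and y_0 = 1)
        pairs = approx.reshape(-1, 2)
        details.append(pairs @ hi)   # high-frequency detail coefficients
        approx = pairs @ lo          # slow-variation approximation
    return details + [approx]

# synthetic length-64 series (U = 2^M with M = 6) in place of discharge data
q = np.sin(np.linspace(0, 8 * np.pi, 64)) + 0.1
subs = dwt_decompose(q, levels=3)
```

Because the Haar filters are orthonormal, the total energy of the coefficients equals the energy of the original series, which is a quick sanity check on the decomposition.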

Deep Neural Network (DNN)
DNNs are artificial neural networks with multi-layered architectures [61]. The layers between input and output may automatically learn the non-linear patterns of the data through several levels of abstraction. A DNN uses the backpropagation (BP) learning algorithm to learn complicated patterns in datasets. BP updates the learning parameters of the DNN so that each layer's representation is computed from the representation of the preceding layer; it back-propagates the output errors to fine-tune the network weights. A DNN comprises an input layer, a number of hidden layers, and an output layer. DNNs are sensitive to their architecture and hyperparameter values, such as wider versus deeper networks, the neuron count in the hidden layers, the activation function at each layer, the optimizer, the batch size, the loss function, and the number of epochs. DNNs are prone to underfitting and overfitting. Underfitting may be resolved by increasing the capacity of the network, while regularization methods such as weight decay and early stopping, together with dropout and a weight constraint, may cope with overfitting. Figure 1 shows the two-hidden-layer architecture of the DNN for runoff prediction.
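A toy sketch of the setup described above: a two-hidden-layer network with relu hidden units and a linear output, trained by backpropagation with a hand-rolled Adam update (first and second moments of the gradient). The layer sizes, learning rate, and synthetic sine target are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(2)

def relu(z):
    return np.maximum(z, 0.0)

def forward(x, params):
    """Forward pass; returns activations and pre-activations per layer."""
    acts, zs = [x], []
    for i, (W, b) in enumerate(params):
        z = acts[-1] @ W + b
        zs.append(z)
        acts.append(z if i == len(params) - 1 else relu(z))  # linear output layer
    return acts, zs

def grads(x, y, params):
    """Backpropagation of the mean-squared-error loss."""
    acts, zs = forward(x, params)
    delta = 2.0 * (acts[-1] - y) / x.shape[0]            # dLoss/dz at output
    gs = []
    for i in range(len(params) - 1, -1, -1):
        gs.append((acts[i].T @ delta, delta.sum(axis=0)))
        if i > 0:
            delta = (delta @ params[i][0].T) * (zs[i - 1] > 0)  # relu derivative
    return gs[::-1]

# architecture: 1 input -> two hidden layers of 16 relu units -> 1 linear output
sizes = [(1, 16), (16, 16), (16, 1)]
params = [(rng.normal(0.0, 0.5, s), np.zeros(s[1])) for s in sizes]
m = [(np.zeros(s), np.zeros(s[1])) for s in sizes]   # Adam first moments
v = [(np.zeros(s), np.zeros(s[1])) for s in sizes]   # Adam second moments
beta1, beta2, lr, eps = 0.9, 0.999, 0.01, 1e-8

x = rng.uniform(-1.0, 1.0, (128, 1))
y = np.sin(3.0 * x)                                  # toy target series
mse0 = float(np.mean((forward(x, params)[0][-1] - y) ** 2))

for step in range(1, 501):                           # Adam parameter updates
    gs = grads(x, y, params)
    for i in range(len(params)):
        for p, g, mi, vi in zip(params[i], gs[i], m[i], v[i]):
            mi[:] = beta1 * mi + (1 - beta1) * g             # first moment
            vi[:] = beta2 * vi + (1 - beta2) * g * g         # second moment
            mhat = mi / (1 - beta1 ** step)                  # bias correction
            vhat = vi / (1 - beta2 ** step)
            p -= lr * mhat / (np.sqrt(vhat) + eps)           # adaptive step

mse = float(np.mean((forward(x, params)[0][-1] - y) ** 2))
```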

Extreme Learning Machine (ELM)
ELM is a type of least-squares-based single-hidden-layer feedforward network (SLFN) that has shown significant performance in classification and regression tasks [62,63]. Huang et al. [63] designed ELM by replacing a large number of neurons in the hidden layer with a kernel function. This technique provides good generalization performance with faster learning than some traditional machine learning algorithms [62]. Before training the network, the output weights of the ELM are calculated from arbitrarily generated input weights and biases, given a defined number of hidden layer neurons and an activation function. Figure 2 shows a single-hidden-layer neural network with n input layer neurons, l hidden layer neurons and m output layer neurons. For example, if the training dataset is {X_i, Y_i}, then the input dataset is X_i = [X_{i1}, X_{i2}, ..., X_{in}]^T ∈ R^n and the output dataset is Y_i = [Y_{i1}, Y_{i2}, ..., Y_{im}]^T ∈ R^m with i = 1, 2, ..., l. The network of ELM can be mathematically modelled as:

t_j = Σ_{k=1}^{l} β_k g(w_k, b_k, z_j), j = 1, 2, ..., q (15)

where w_k is the input weight vector and b_k is the bias factor of the k-th hidden node, which are the network learning parameters; z_j = [z_{1j}, z_{2j}, ..., z_{nj}]^T; β_k = [β_{k1}, β_{k2}, ..., β_{km}]^T is the output weight of the k-th hidden node to the output nodes; g(w_k, b_k, z_j) is the k-th hidden node output with respect to the input z_j; t_j is the predicted output for the corresponding input z_j; and j runs over the number of training samples, denoted q.

ELM produces w_k and b_k randomly and analytically estimates β using Equation (16).
The output weights are calculated using a linear system [64]:

β = H† Y (16)

where H is the hidden layer output matrix (Equation (17)), H† is the Moore-Penrose generalized inverse of H, and Y = [y_1, y_2, ..., y_q]^T holds the actual target values of the training dataset:

H = [g(w_1, b_1, z_1) ... g(w_l, b_l, z_1); ... ; g(w_1, b_1, z_q) ... g(w_l, b_l, z_q)]_{q×l} (17)

This paper uses the radial basis function kernel ELM (RBF-ELM), where the number of hidden units need not be defined in advance and the feature mapping of the hidden layer (g(w_k, b_k, z_j)) is hidden from the user [64,65].
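The basic (non-kernel) ELM procedure, random w_k and b_k followed by the Moore-Penrose solution of Equation (16), can be sketched as follows; the sigmoid activation, the toy target, and 20 hidden units are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

def elm_fit(Z, Y, n_hidden=20):
    """Random input weights/biases; output weights via pseudoinverse."""
    n_features = Z.shape[1]
    W = rng.uniform(-1, 1, (n_features, n_hidden))   # input weights w_k
    b = rng.uniform(-1, 1, n_hidden)                 # hidden biases b_k
    H = 1.0 / (1.0 + np.exp(-(Z @ W + b)))           # sigmoid hidden outputs
    beta = np.linalg.pinv(H) @ Y                     # Equation (16): beta = H† Y
    return W, b, beta

def elm_predict(Z, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(Z @ W + b)))
    return H @ beta

# toy regression problem standing in for lagged rainfall-runoff inputs
Z = rng.uniform(-1, 1, (200, 3))
Y = 2.0 * Z[:, 0] - np.sin(Z[:, 1]) + 0.5
W, b, beta = elm_fit(Z, Y, n_hidden=20)
pred = elm_predict(Z, W, b, beta)
rmse = float(np.sqrt(np.mean((pred - Y) ** 2)))
```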

Support Vector Regression (SVR), Artificial Neural Network (ANN) and Gradient Boosting Machine (GBM)
SVR is a kernel-based non-linear regression method derived from support vector classification (SVC) by Boser et al. in 1992 [66,67]. It uses the principle of structural risk minimization in order to minimize the upper bound of the generalization error instead of minimizing the prediction error on the training set. Basically, it tries to find the regression hyperplane with the smallest structural risk in a high-dimensional feature space. More details about SVR can be found in [67].
The idea of the ANN is derived from the known functionality of the human brain, where a very large number of biological neurons are interconnected through links. The characteristics of biological neural networks simulated by the ANN include the ability to handle non-linearity, noise and fault tolerance, massive parallelism, learning, and generalization capability. The ANN consists of an input layer, an output layer, and one or more hidden layers. The neurons in these layers are connected through links whose weights are adjusted during the learning process. In this paper, a backpropagation algorithm is used to tune the parameters of the ANN. The ANN may be able to model highly non-linear systems where the relationships among the variables are complex. A single-hidden-layer neural network is sufficient for hydrological modelling [68].
GBM is a numerical optimization algorithm derived from gradient boosting for classification by Friedman (2001). The main objective of GBM is to minimize the loss function by iteratively adding a new decision tree (weak learner) at each step [69]. The new decision tree, fitted to the current residual, is added to the previous model in order to update the residual. This process continues until it reaches the maximum number of iterations provided by the researcher. At each iterative step the contribution of the added decision tree is shrunk using a parameter known as the learning rate, whose value lies between 0 and 1. In order to improve the predictive accuracy of the GBM, some randomization is added to the fitting process, such as using a randomly selected subsample instead of the full training dataset.
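The boosting loop described above, a shrunk weak learner fitted to the current residual at each step on a random subsample, can be sketched as follows. Depth-1 regression stumps, squared-error loss, and the toy sine target are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

def fit_stump(x, r):
    """Best single-split stump minimising squared error on residual r."""
    best = (np.inf, None, r.mean(), r.mean())
    for s in np.unique(x)[:-1]:
        left, right = r[x <= s], r[x > s]
        err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if err < best[0]:
            best = (err, s, left.mean(), right.mean())
    return best[1:]                    # (split, left_value, right_value)

def gbm_fit(x, y, n_trees=100, lr=0.1, subsample=0.8):
    pred = np.full_like(y, y.mean())   # initial constant model
    model = [y.mean()]
    for _ in range(n_trees):
        # randomization: fit the weak learner on a random subsample
        idx = rng.choice(len(x), int(subsample * len(x)), replace=False)
        r = y - pred                   # current residual
        split, lv, rv = fit_stump(x[idx], r[idx])
        update = np.where(x <= split, lv, rv)
        pred = pred + lr * update      # contribution shrunk by learning rate
        model.append((split, lv, rv))
    return model, pred

x = np.linspace(0.0, 1.0, 200)
y = np.sin(2.0 * np.pi * x)            # toy target
model, pred = gbm_fit(x, y)
gbm_mse = float(np.mean((pred - y) ** 2))
```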

Study Area and Dataset Used
The daily rainfall and runoff data (15 years (2000-2015)) from two benchmark stations of two catchments in the UK were obtained from the UK national river flow archive website (https://nrfa.ceh.ac.uk/ (accessed on 5 August 2019)). These two stations are part of the UK benchmark network (UKBN2) and are well suited to the interpretation and identification of long-term hydrological variability and change. The UK benchmark stations cover the full flow regime: low, medium and high flows. These catchments are reasonably free from human disturbances such as river engineering, urbanization, and water abstraction; they are therefore near-natural catchments with climate-driven alterations in river flows.
The catchment of the river Teifi at Glanteifi lies in a wetter, cloudy, and windy region of south-west Wales, where potential evapotranspiration is low relative to rainfall [70]. The winter months are wetter than the warm summer months. Monthly rainfall increases from August to a peak in November-January and then falls from February onwards. The catchment area at the outlet is 893.6 km² and the river gauge is at 5.2 m AOD (above Ordnance Datum). The primary river flow data measuring authority is Natural Resources Wales. The catchment has minor flow abstractions due to upland reservoirs and negligible agricultural demands, so the flow regime is natural. The factors affecting runoff are (i) reservoirs and (ii) public water supply. The major portions of the catchment land cover are grassland (79%), woodland (12%), horticulture (4.74%), mountain (1.80%) and urban extent (2.83%). The catchment is mainly impermeable Ordovician and Silurian deposits. On the land, dairy farming predominates in the south and hill farming in the upper catchment. There is some forest with peaty soils on the hills, which are seasonally wet; most of the lower areas have soils with a permeable substrate.
The catchment of the river Fal at Tregony lies in a dry, sunny region of Cornwall in south-west England, with mild winters and cool summers, where potential evapotranspiration is high relative to rainfall [70]. Monthly rainfall increases from September to January [30]. The catchment area at the outlet is 87 km² and the river gauge is at 6.9 m AOD. The primary river flow data measuring authority is the Environment Agency. Due to steep topography, the catchment responds quickly to rainfall events. The runoff coefficient is given at a daily time step. The factors affecting runoff are (i) increases by effluent returns and (ii) reductions by industrial/agricultural use. The major portions of the catchment land cover are grassland (41.43%), horticulture (21.52%), woodland (16.22%), mountain (0.18%) and urban extent (4.98%). The land use is low-grade agriculture and pasturage with some woodland. The catchment has no major changes. The statistics of the catchment data are given in Table 2. The daily precipitation (mm per day) dataset was obtained from the Centre for Ecology and Hydrology gridded estimates of areal rainfall (CEH-GEAR) [71,72], which were interpolated to 1 km from the rain gauge data using the natural neighbour interpolation approach [71]. The uncertainty of the associated rainfall value depends on the mean distance to the nearest rain gauge station and on the spatial variability of the rainfall. More details about the rainfall data can be found in [71]. Figure 3 shows the daily discharge and rainfall data of the two basins.
A partial autocorrelation function (PACF) was used to select the number of preceding days of runoff data as input features, along with the previous-day rainfall value. The proposed models predict the next-day runoff value.
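Lag selection via the PACF can be sketched as below: the lag-k partial autocorrelation is computed as the last coefficient of an AR(k) Yule-Walker fit, and lags whose PACF exceeds the 95% confidence bound are retained. The synthetic AR(2) series is a stand-in assumption for the real discharge data.

```python
import numpy as np

rng = np.random.default_rng(5)

def acovf(x, max_lag):
    """Sample autocovariances gamma(0..max_lag)."""
    x = x - x.mean()
    return np.array([(x[: len(x) - j] * x[j:]).mean() for j in range(max_lag + 1)])

def pacf(x, max_lag):
    """PACF(k) = last coefficient of the AR(k) Yule-Walker solution."""
    g = acovf(x, max_lag)
    out = []
    for k in range(1, max_lag + 1):
        R = np.array([[g[abs(i - j)] for j in range(k)] for i in range(k)])
        phi = np.linalg.solve(R, g[1 : k + 1])   # Yule-Walker equations
        out.append(phi[-1])
    return np.array(out)

# synthetic AR(2) "discharge" series in place of the observed record
n = 4000
q = np.zeros(n)
for t in range(2, n):
    q[t] = 0.6 * q[t - 1] + 0.25 * q[t - 2] + rng.normal()

p = pacf(q, max_lag=10)
bound = 1.96 / np.sqrt(n)                        # 95% confidence bound
selected = [k + 1 for k, val in enumerate(p) if val > bound]
```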
Furthermore, the highest correlated preceding runoff series and previous day rainfall series are pre-processed using discrete wavelet transform technique. The transformed time series data were used to predict the next day runoff value for both catchments. Further details of selected runoff and rainfall series with wavelet pre-processing are described in the model development section (Section 4).

Proposed EO-ELM and DNN
This study proposes a new empirical R-R model, called EO-ELM, where EO optimizes ELM learning parameters to find an optimal configuration of ELM for better prediction of runoff values. Here, the concentrations of EO are ELM learning parameters. The root mean square error (RMSE) is considered as objective function for EO. The best equilibrium candidate found by EO is considered as optimal configuration of ELM for prediction tasks.

In EO-ELM, initially all particles have no knowledge of the solution space. The collaboration of the five equilibrium candidates assists the concentration-updating process of the particles. In the initial iterations, the equilibrium candidates are diverse and the exponential term (Equation (8)) produces large random values, which helps particles cover the entire solution space. Similarly, towards the end of the iterations, particles are surrounded by equilibrium candidates that are in near-optimal positions with similar configurations. At that stage, the exponential term (Equation (8)) produces small random values, which helps fine-tune the candidate solutions. The algorithm of EO-ELM is shown in Algorithm 2.
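The fitness evaluation inside EO-ELM can be sketched as follows: a particle's concentration vector is decoded into ELM input weights and hidden biases, the output weights are solved analytically, and the training RMSE is returned as the objective. The sigmoid activation, input sizes, and toy dataset are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)

n_in, n_hidden = 6, 20          # e.g. lagged discharge plus rainfall inputs

def elm_rmse(concentration, Z, Y):
    """Decode an EO particle into ELM parameters and return training RMSE."""
    W = concentration[: n_in * n_hidden].reshape(n_in, n_hidden)  # input weights
    b = concentration[n_in * n_hidden :]                          # hidden biases
    H = 1.0 / (1.0 + np.exp(-(Z @ W + b)))       # sigmoid hidden-layer outputs
    beta = np.linalg.pinv(H) @ Y                 # Moore-Penrose output weights
    return float(np.sqrt(np.mean((H @ beta - Y) ** 2)))   # objective for EO

# toy training data standing in for the catchment dataset
Z = rng.uniform(-1, 1, (300, n_in))
Y = Z @ rng.uniform(-1, 1, n_in)

# one EO particle: concatenated input weights and hidden biases
particle = rng.uniform(-1, 1, n_in * n_hidden + n_hidden)
fitness = elm_rmse(particle, Z, Y)
```

An EO loop such as the one sketched earlier would minimise `elm_rmse` over such particle vectors, and the best equilibrium candidate would fix the final ELM configuration.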

Algorithm 2: EO-ELM
Input:
1. Training and testing datasets
2. Hidden-unit input weights and biases of the ELM, encoded as the initial EO population
3. EO parameter values (a_1 = 2, a_2 = 1, GP = 0.5), optimized over several trials of the proposed model
Output:
4. Optimized input weights and hidden biases of the ELM from the best-fitness EO equilibrium candidate
5. Output weights of the ELM obtained using the Moore-Penrose inverse
6. Testing of the EO-optimized ELM on the test dataset
Begin EO-ELM training
7. Initialize the fitness of the four equilibrium candidates [fitness(P_{eq(1)}), fitness(P_{eq(2)}), fitness(P_{eq(3)}), fitness(P_{eq(4)})]
8. for it = 1 to maximum iteration number do
9.   for i = 1 to P do
10.    Estimate the fitness (training RMSE) of the i-th particle
11.    if the fitness improves on an equilibrium candidate, replace that candidate
12.  end for
13.  Form the equilibrium pool P_{eq,pool} from the four best candidates and their mean
14.  for i = 1 to P do
15.    Randomly select an equilibrium candidate from the equilibrium pool
16.    Evaluate F (Equation (8)), GCP (Equation (10)), G_0 (Equation (11)) and G (Equation (9))
17.    Update the particle's concentration (Equation (12))
18.  end for
19. end for
20. Set the ELM optimal input weights and hidden biases using P_{eq(1)} (best equilibrium candidate)
21. ELM testing

The authors propose and develop a DNN model for R-R modelling. Two hidden layers are considered for the prediction of runoff values, based on a trial-and-error approach. The gradient descent optimizer "adam" is used to tune the learning parameters (input weights and hidden weights and biases) of the DNN. The "adam" optimizer updates each learning parameter of the DNN based on the square of the gradients; it adapts the learning rate for each weight by estimating the first and second moments of the gradient. The activation function used in the hidden units is "relu". The process flowchart of the DNN is shown in Figure 4.

Model Development and Performance Metrics
The consideration of rainfall and runoff lags for the model input combination is an important task in order to achieve better prediction performance. In Nourani et al. [18] and Roy et al. [30], the best performance of the ML models was obtained for an input combination in which a two-day lag of discharge (Q_{t−2}, Q_{t−1}), the current-day discharge (Q_t), and the current-day precipitation (P_t) were used to forecast the next-day discharge (Q_{t+1}) in daily R-R modelling. In this paper, the auto-correlogram (ACF) and the partial autocorrelation function (PACF) are used to select the optimal number of lags from the discharge series [72]. The optimal lags of the discharge series and the current-day precipitation (P_t) are used to predict the one-day-ahead discharge value in the initial model development phase; the lag values of precipitation are already reflected in the lagged discharge series. Figure 5a,c show the ACF plots with 95% confidence bounds for the discharge series of Fal at Tregony and Teifi at Glanteifi, respectively; both indicate that the correlation effect weakens at subsequent lag values. In order to find a direct relationship (serial correlation) between the number of lags and the corresponding observation, PACF plots with 95% confidence bounds (Figure 5b,d for Fal at Tregony and Teifi at Glanteifi, respectively) are used in this study. The PACF plots (Figure 5b,d) show little correlation beyond some specific lag values, i.e., only some specific lag values are highly correlated with the corresponding observation. In this paper, the authors consider the positively correlated lag values from the PACF plots (Figure 5b,d). For the catchment of Fal at Tregony (Figure 5b), lags 1, 2, 3, 4, 5, and 6 (six preceding days) are considered as model input from the discharge series, together with the current-day precipitation value (here, a 1-day lag).
For the catchment of Teifi at Glanteifi (Figure 5d), lags 1, 3, 4, 5, and 6 (five preceding days) are considered as model input from the discharge series, together with the current-day precipitation value (here, a 1-day lag).

The original rainfall and runoff data are decomposed by the DWT method. The decomposed rainfall and runoff data are given as model input, and the next-day discharge data are the corresponding output. Therefore, the next-day discharge value is predicted from antecedent information of rainfall and runoff (the 2^n-day modes and a maximum-level approximation mode, where n = 0, 1, ..., 9). The runoff series to decompose (one preceding lag) is selected from the PACF plots, where the previous-day lag of runoff shows (Figure 5b,d) the highest positive correlation (greater than 0.8) with the next-day runoff value for both catchments. Since the previous lags of precipitation are already reflected in the runoff series, the one-day-lagged precipitation is selected as the precipitation series to decompose. In this paper, the mother wavelet most commonly used in hydrometeorological studies, the smooth Daubechies (Db) wavelet, is selected for model deployment. The number of decomposition levels is determined by Equation (19), which is based on the mother wavelet and the number of sample data points [73].
where D L is the maximum decomposition level, N is the number of sample data points, and K indicates the number of vanishing moments of a Db wavelet. A wavelet with broader support and higher vanishing moments is more appropriate for irregular and non-linear hydrological time series [74]. Therefore, the irregular wavelet Db7, with 7 vanishing moments, is applied in this study. D L is estimated from the two equally sized data series for Fal at Tregony and Teifi at Glanteifi (N = 5838); with K = 7, the estimated value of D L for both time series is found to be 9. It has been shown that for K ≥ 7, all approximations are similar [73]. The whole dataset of each catchment is divided into two parts: (i) a training dataset (80%), and (ii) a testing dataset (20%). The training dataset is used in the model calibration period, and the calibrated model is used to predict the test dataset.
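The Db7 decomposition into nine detail sub-series plus one approximation can be sketched with PyWavelets. The series length, wavelet, and 9 levels follow the paper; the data themselves are a synthetic stand-in for the runoff record, and the exact decomposition scheme (boundary handling, coefficient alignment) used by the authors is not specified here.

```python
# Minimal sketch of the Db7 decomposition described above, assuming
# PyWavelets; `x` is synthetic stand-in data, not the catchment record.
import numpy as np
import pywt

n = 5838                                                 # sample points (N)
x = np.cumsum(np.random.default_rng(1).normal(size=n))   # synthetic runoff

# wavedec returns [A9, D9, D8, ..., D1]: one approximation plus nine
# detail sub-series, matching the 2^n-day modes (n = 0..9) in the text.
coeffs = pywt.wavedec(x, 'db7', level=9)
print(len(coeffs))   # 10 components: A9 and D1..D9
```

Each of the ten components then becomes one input feature stream for the wavelet-based models (WELM, WEO-ELM, WDNN, etc.).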
In the training phase, the configuration of the ELM is undertaken by initializing the input weights (w l ) and hidden neuron biases (b i ) using the population of EO. After the decision variables are optimized by EO, the output weights (β l ) are calculated analytically by the basic ELM procedure. Since the number of hidden neurons (l) influences ELM performance, a trial of the number of neurons l versus root mean square error (RMSE) (Equation (26)) was conducted to fix l. On the training dataset, 20 hidden neurons showed the best model performance for the lags selected using the PACF plots (Figure 5b,d). The final architecture of the ELM with the optimal number of discharge lags and P t was set to n input neurons (n = 7 and 6 for Fal at Tregony and Teifi at Glanteifi, respectively), 20 hidden neurons, and the number of training targets as output neurons. The main purpose of the use of EO and PSO is to optimize the input weight matrix (7 or 6 × 20) and hidden biases (20 × 1) over the training samples so as to minimize the error between the predicted and observed discharge values. That error value is calculated by Equation (20).
where Y j and T j are the observed and predicted target values of the j th sample, and q is the total number of training samples. The total number of concentrations or features for the EO and PSO particles is 160 (7 × 20 + 20) for Fal at Tregony and 140 (6 × 20 + 20) for Teifi at Glanteifi, which are tuned during EO-ELM and PSO-ELM development for each run of EO and PSO, respectively. A population size of 20 with 100 iterations gave the best model performance for both catchments' data with the optimal discharge lags and current day precipitation. The optimal EO parameters are set to a 1 = 2, a 2 = 1 and GP = 0.5. The optimal PSO parameters are set to C 1 = 1 and C 2 = 2. In the case of ELM, 20 hidden neurons are selected based on trials for both catchments. For KELM, a kernel parameter of 10 and a regularization coefficient C = 1 are used based on trials for both datasets. A two-layered DNN model is considered for runoff prediction. After several trials, the numbers of hidden neurons in each layer are selected for the best performance of the DNN: the first and second hidden layers consist of 10 and 5 neurons, respectively, for both catchments. The optimal batch size is 10 and the number of epochs is set to 100 for the DNN. For the wavelet-based sub-series decomposition of rainfall and discharge, nine sub-series and one approximation component feed into the proposed models; the high-pass and low-pass filtered rainfall and runoff are shown in Figure 6a. In the evaluation metrics, Y E i symbolizes the i th estimated daily discharge of a model, Y O i is the i th observed daily discharge, their respective averages over the record are also used, and q is the total number of observations. Furthermore, to ensure the trustworthiness and efficacy of the eight models (ELM, KELM, PSO-ELM, EO-ELM, DNN, SVR, ANN and GBM), a quantitative assessment [76,77] is carried out.
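The EO/PSO fitness evaluation described above can be sketched as follows. This is a hypothetical illustration of the general EO-ELM scheme, not the authors' code: a candidate solution packs the input weights and hidden biases (7 × 20 + 20 = 160 values for Fal at Tregony), the output weights are solved analytically, and the MSE of Equation (20) is returned as the fitness to be minimized.

```python
# Hypothetical sketch of the ELM fitness used inside EO/PSO; the training
# data are synthetic stand-ins for the lagged-discharge inputs.
import numpy as np

n_in, n_hid = 7, 20      # input neurons (6 lags + P_t) and hidden neurons

def elm_fitness(particle, X, T):
    W = particle[: n_in * n_hid].reshape(n_in, n_hid)   # input weights
    b = particle[n_in * n_hid:]                          # hidden biases
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))               # sigmoid hidden layer
    beta = np.linalg.pinv(H) @ T                         # analytic output weights
    Y = H @ beta                                         # predicted discharge
    return np.mean((Y - T) ** 2)                         # error, Equation (20)

rng = np.random.default_rng(2)
X = rng.normal(size=(200, n_in))                         # stand-in inputs
T = rng.normal(size=200)                                 # stand-in targets
particle = rng.uniform(-1, 1, n_in * n_hid + n_hid)      # 160 decision variables
err = elm_fitness(particle, X, T)
print(err)
```

EO (or PSO) repeatedly proposes new particles and keeps those that lower this fitness; only the 160 (or 140) input-side parameters are searched, since the output weights are always obtained in closed form.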

Results and Analysis
The proposed EO-ELM and DNN models are applied to R-R modelling, and six other well-known models, namely ELM, KELM, PSO-ELM, SVR, ANN and GBM, are considered for comparison. To evaluate the performance of the models, six evaluation indicators, MAE, MAPE, NSE, R 2 , RMSE and VAF, are used. Further, the ranking of the models is assigned based on the value of each performance evaluation metric, i.e., the better the value, the higher the ranking. In the case of identical metric values, the corresponding model ranking is decided based on the value priority R 2 > RMSE > NSE > MAE.

Optimal Lags-Based
Initially, the optimal number of runoff lags from the PACF plots and the current day precipitation are used to predict the next-day runoff value.
The performance of the models for the smaller catchment Fal at Tregony is shown in Table 3. It is apparent that the proposed EO-ELM exhibits better prediction in the train and test phases compared to the other applied models in terms of all measured metrics (Table 3). Furthermore, the prediction accuracy of DNN is lower compared to ELM, KELM, PSO-ELM, ANN and SVR, and better compared to GBM (Table 3). The predictions summarized in Table 3 are further illustrated in Figure 9. The Taylor plot (Figure 10a,b) confirms the better prediction performance of EO-ELM in terms of statistical comparison of all the models.
The performance of the models for the larger catchment of Teifi at Glanteifi is shown in Table 4. It is apparent that the proposed EO-ELM exhibits better prediction in the test phase compared to the other applied models in terms of all measured metrics (Table 4). Furthermore, the prediction accuracy of DNN is lower compared to ELM, KELM, PSO-ELM and ANN, and better compared to SVR and GBM (Table 4). In this initial period (PACF-based selected lags) of model development, the models give distinct predictive performances for the two catchments, whose different sizes and characteristics result in different hydrological responses to precipitation; the statistical parameters of the datasets are shown in Table 2. The EO-ELM performed better in the large catchment Teifi at Glanteifi (R 2 = 0.935), which has a sensibly natural flow regime with minor flow abstractions due to the upland reservoirs and negligible agricultural demands, than in the catchment Fal at Tregony (R 2 = 0.91), which has moderate modification to flow due to steep topography, increased runoff from effluent returns, and runoff reduced by industrial and agricultural abstractions; the smaller watershed's characteristics make it more sensitive than the larger watershed.
Figures 11 and 12 (see Supplementary Material) exhibit the better prediction performance of EO-ELM in the train (Figure 11d) and test (Figure 12d) cases, respectively, compared to the other models, together with the linear fit equations between predicted and measured values. The Taylor plot (Figure 13a,b) confirms the better prediction performance of EO-ELM in terms of statistical comparison of all models.

Discrete Wavelet Transform (DWT)-Based
In UK catchments, discharge prediction using rainfall and runoff modelling is a challenging task [60,78,79]. In order to achieve better runoff prediction in both catchments, the authors used the DWT-based time series data pre-processing technique on the rainfall and runoff dataset. Both the current day rainfall and the current day runoff series (the most highly correlated runoff data, with correlation greater than 0.8, observed from the PACF plots for both catchments (Figure 5b,d)) are decomposed into the estimated optimal number of sub-series (D1 to D9) with an approximation (A9) using Equation (19). The decomposed series of rainfall and discharge data are fed into the models, and the next-day discharge value is the predicted output. The performance of the wavelet-based models for the catchment Fal at Tregony is shown in Table 5. It is apparent that the proposed WEO-ELM and WDNN exhibit better prediction in the train and test phases compared to the others in terms of all measured metrics (Table 5).
For this catchment, the prediction accuracy of WDNN is comparable to WEO-ELM and better compared to WELM, WKELM, WPSO-ELM, WSVR, WANN and WGBM (Table 5). The performance of the wavelet-based models for the catchment Teifi at Glanteifi is shown in Table 6. It is apparent that the proposed WDNN exhibits better prediction in the train and test phases compared to the others in terms of all measured metrics (Table 6). For this catchment, the prediction accuracy of WEO-ELM follows WDNN and is better compared to WKELM, WPSO-ELM, WELM, WANN, WSVR, and WGBM (Table 6). Figures 17 and 18 exhibit the better prediction performance of WDNN in the train and test cases, respectively, compared to the other models, together with the linear fit equations between predicted and measured values. Interestingly, the EO algorithm optimized the ELM parameters more adequately than the PSO algorithm did, and enhanced the performance of ELM accordingly.
The Taylor plot (Figure 19a,b) confirms the better prediction performance of WDNN followed by WEO-ELM in terms of statistical comparison of all models.

Uncertainty Analysis (UA) of Models
Both training and testing observed samples are considered for a like-for-like comparison of the predicted discharge values of these models. The quantification of uncertainty and its analysis using an appropriate variable are useful for understanding the technical contribution in a decision-making environment. In this study, the knowledge base for the analysis comprises the observed and predicted discharge values. Initially, the error (e i ) between the observed (Y O i ) and predicted (Y E i ) values is formulated using Equation (27).
Then the mean (e µ ), standard deviation (SD) (σ e ), standard error (SE), upper bound (UB), lower bound (LB) and width of confidence bound (WCB) of the error are calculated. The margin of error (ME) is estimated for a 95% confidence interval at the 0.05 level of significance (α), i.e., a (1 − α) confidence level. Smaller values of mean, UB, LB, SE, ME and WCB indicate better accuracy of a prediction model.
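The uncertainty measures above can be computed as in the following sketch, assuming the standard normal-approximation confidence interval (z = 1.96 at α = 0.05); `obs` and `pred` are synthetic stand-ins for the observed and predicted discharge series.

```python
# Sketch of the uncertainty analysis (UA) quantities described above.
import numpy as np

rng = np.random.default_rng(3)
obs = rng.normal(10.0, 2.0, size=500)          # stand-in observed discharge
pred = obs + rng.normal(0.0, 0.5, size=500)    # stand-in predicted discharge

e = obs - pred                    # Equation (27): prediction error
mean_e = e.mean()                 # mean error
sd_e = e.std(ddof=1)              # standard deviation (SD) of the error
se = sd_e / np.sqrt(len(e))       # standard error (SE)
me = 1.96 * se                    # margin of error (ME) at alpha = 0.05
ub, lb = mean_e + me, mean_e - me # upper/lower confidence bounds (UB, LB)
wcb = ub - lb                     # width of confidence bound (WCB)
print(mean_e, se, me, wcb)
```

Smaller values of these quantities for one model's errors than another's indicate the tighter, less uncertain predictions discussed in the figures below.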

Lag-Based Models UA
The visualization of the interrelationships of the uncertainty variables between the lag-based models for both catchments is shown in Figure 20 (Fal at Tregony) and Figure 21 (Teifi at Glanteifi). Figures 20a–c and 21a–c show the interrelationships of LB, UB and mean; SE and ME; and WCB, respectively, for the models. For both catchments, LB, UB and mean (Figures 20a and 21a) and SE (Figures 20b and 21b) are lower for EO-ELM compared to the other models. The ME (Figures 20b and 21b) is almost similar for all models except SVR and GBM (which are lower compared to the others). From the UA, it is observed that WCB (Figures 20c and 21c) is also lower for EO-ELM. The UA thus confirms that EO-ELM has better prediction accuracy for the runoff variable for both catchments.

Wavelet-Based Models UA
The visualization of the interrelationships of the uncertainty variables between the wavelet-based models for both catchments is shown in Figure 22 (Fal at Tregony) and Figure 23 (Teifi at Glanteifi). For the catchment Fal at Tregony, LB, UB and mean (Figure 22a) and SE and ME (Figure 22b) are lower for WEO-ELM compared to the other models. It can be observed that WCB (Figure 22c) is also lower for WEO-ELM. WDNN is comparable with WEO-ELM in terms of SE and ME (Figure 22b) and WCB (Figure 22c) for Fal at Tregony. The UA thus confirms that WEO-ELM, along with WDNN, has better prediction accuracy for the runoff variable for Fal at Tregony. For the catchment of Teifi at Glanteifi, LB, UB and mean (Figure 23a) and SE and ME (Figure 23b) are lower for WDNN compared to the other models. It can be observed that WCB (Figure 23c) is also lower for WDNN. The prediction accuracy of WEO-ELM follows WDNN and is better than that of the other six models for Teifi at Glanteifi. The UA confirms that WDNN has better prediction accuracy for the runoff variable for Teifi at Glanteifi.

Statistical Test: Two-Tailed t-Test
This study further investigates a statistical test, a two-tailed t-test, which is used to determine whether two sample means are equal [80]. This test is most suitable when the sample drawn from a population is sufficiently large (specifically, more than 30), assuming that the data are normally distributed. In a two-sample t-test, the null hypothesis (H 0 ) states that the true mean difference between the paired samples is zero, i.e., µ 1 − µ 2 = 0, while the alternate hypothesis (H A ) states that the true mean difference between the paired samples is not equal to zero, i.e., µ 1 − µ 2 ≠ 0. Therefore, the hypotheses for the two-tailed t-test are: null hypothesis H 0 : µ 1 − µ 2 = 0; alternate hypothesis H A : µ 1 − µ 2 ≠ 0, where µ 1 and µ 2 are the means of the two samples. The test statistic is calculated as t = (x̄ 1 − x̄ 2 )/(S P √(1/n 1 + 1/n 2 )), with pooled standard deviation S P = √(((n 1 − 1)S 1 ^2 + (n 2 − 1)S 2 ^2 )/(n 1 + n 2 − 2)), where x̄ 1 and x̄ 2 are the sample means under consideration, n 1 and n 2 are the numbers of observations, and S 1 and S 2 are the standard deviations of the two samples. Calculated results at a 5% significance level (i.e., α = 0.05), assuming equal variances, are presented in Tables 7 and 8 for Fal at Tregony and Teifi at Glanteifi, respectively. Note that absolute values of t Stat less than 1.96 (for the two-tailed test) indicate that there is no significant difference in mean between the samples; moreover, an obtained value closer to the critical value (i.e., z 0.025 = 1.96 ≈ |t Stat|) indicates a more reliable model. As can be seen, EO-ELM and WDNN satisfy these conditions in R-R modelling in the Teifi at Glanteifi testing phase, as well as in the Fal at Tregony testing phase.
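The pooled two-sample t-test above can be sketched as follows; the observed and predicted series are synthetic stand-ins for a model's outputs, and the manual computation is cross-checked against SciPy's equal-variance implementation.

```python
# Sketch of the pooled (equal-variance) two-sample t-test described above.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
observed = rng.normal(12.0, 3.0, size=1000)            # stand-in observed
predicted = observed + rng.normal(0.0, 0.8, size=1000) # stand-in predicted

n1, n2 = len(observed), len(predicted)
s1, s2 = observed.std(ddof=1), predicted.std(ddof=1)
# Pooled standard deviation S_P (equal variances assumed).
sp = np.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
t_stat = (observed.mean() - predicted.mean()) / (sp * np.sqrt(1/n1 + 1/n2))

# |t| < 1.96 -> no significant difference in means at alpha = 0.05.
print(abs(t_stat) < 1.96)

# Cross-check against SciPy's pooled two-sample t-test.
t_ref, p_value = stats.ttest_ind(observed, predicted, equal_var=True)
```

A |t Stat| below the 1.96 critical value, as reported for EO-ELM and WDNN in Tables 7 and 8, indicates that the predicted series is statistically indistinguishable in mean from the observed one.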
From the above analysis, it may be concluded that the proposed EO-ELM, with or without wavelet pre-processing, is a better-performing integrated machine learning alternative, giving better accuracy and generalization capability for R-R modelling than ELM, KELM, PSO-ELM, SVR, ANN and GBM in the UK catchments studied. However, in the wavelet-based R-R modelling approach, WDNN performed better compared to the other models, because the increased number of input features in the dataset provides better efficacy for the proposed WDNN.

Conclusions
This paper developed two data-driven daily R-R modelling approaches, called EO-ELM and DNN, for two different benchmark catchment stations in the UK. ELM, KELM, PSO-ELM, SVR, ANN and GBM were considered to validate the prediction performance of the proposed models. Initially, a PACF plot was used to determine an optimal number of input features to feed the models. The main contributions of this paper are: (i) the use of a new metaheuristic algorithm (EO) to optimize the input weights and hidden neuron biases of the ELM, which proved adequate for better prediction performance by reducing the prediction error; (ii) the use of the DWT to decompose the current day rainfall series (the previous days' rainfall is already reflected in the current day runoff series) and a runoff series (the most highly correlated lag from the PACF plot) to enhance the prediction performance of the proposed models (EO-ELM and DNN); and (iii) the UA and two-tailed t-test, which confirm that EO-ELM performs best in the optimal lag-based scenario and WDNN best in the wavelet-based scenario, with WEO-ELM performing better compared to the other six models. For lag-based input, the experimental results showed that the one-day-ahead discharge forecast performance of EO-ELM was better than that of ELM, KELM, PSO-ELM, DNN, SVR, ANN and GBM. For wavelet-based input, the results showed that WDNN performed better due to the enhancement of input features; however, WEO-ELM out-performed WELM, WKELM, WPSO-ELM, WSVR, WANN and WGBM.
In the future, the proposed models should be evaluated on data from other catchments, and their performance should be compared against other machine learning models to check their prediction capability. The proposed models can also be applied to the prediction of other hydrological time series variables.