Estimation of Instantaneous Peak Flow Using Machine-Learning Models and Empirical Formula in Peninsular Spain

The design of hydraulic structures and flood risk management is often based on instantaneous peak flow (IPF). However, available flow time series with high temporal resolution are scarce and of limited length. A correct estimation of the IPF is crucial to reducing the consequences derived from flash floods, especially in Mediterranean countries. In this study, empirical methods to estimate the IPF based on maximum mean daily flow (MMDF), artificial neural networks (ANN), and adaptive neuro-fuzzy inference system (ANFIS) have been compared. These methods have been applied in 14 different streamflow gauge stations covering the diversity of flashiness conditions found in Peninsular Spain. Root-mean-square error (RMSE), and coefficient of determination (R2) have been used as evaluation criteria. The results show that: (1) the Fuller equation and its regionalization is more accurate and has lower error compared with other empirical methods; and (2) ANFIS has demonstrated a superior ability to estimate IPF compared to any empirical formula.


Introduction
Flash floods are one of the most significant natural hazards in Europe, especially in the Mediterranean countries [1].In recent years, flash floods have caused many economic losses and loss of life throughout Peninsular Spain.As can be seen in Barredo (2007) [2], Spain is the country in Europe that has been the most affected by flash floods from 1950 to 2005.Estimation of the frequency and magnitude of the instantaneous peak flow (IPF) is crucial for the design of hydraulic structures and floodplain management [3].As happens in many countries, Spanish basin management agencies record data relating to mean daily flow (MDF) while the availability of IPF time series is less frequent.The application of techniques that reduce uncertainties associated with IPF estimations is needed because of the damage that flash floods cause.
Several methods for estimating IPF based on MDF have been developed.From an empirical point of view, there are two types of approaches to estimating IPF based on MDF.The first type of approach establishes a relationship between IPF and MDF using the physiographic characteristics of the basin and the second type of approach calculates IPF using the sequence of mean daily flow.In the first group, the method by Fuller (1914) [4] is included; he conducted one of the first studies related to obtaining the IPF from the maximum MDF (MMDF) using drainage area.Other studies, such as those by Silva (1997) [5] and Silva and Tucci (1998) [6], also used the physiographic characteristics to estimate the IPF.Taguas et al. (2008) [7] proposed an equation to estimate IPF from MMDF, drainage area and mean annual rainfall in Southeastern Spain.Among the methods that use the second approach, there are two pioneering methods [8,9] described by Linsley et al. (1949) [10] and the Sangal (1983) [11].Many studies adjusted Fuller's formula for use in their regions.Fill and Steiner (2013) [12] summarized many of these regional formulas in their research.In Spain, the Spanish Centre for Public Works Studies and Experimentation CEDEX [13] adjusted the Fuller formula to obtain thirteen regional equations, which cover Peninsular Spain.In this research, four of these empirical methods were evaluated.Two of them, Fuller's (1914) [4] and the regionalized formula obtained by CEDEX, are included in the first approach group and Sangal (1983) [11] and the Fill and Steiner (2013) [12] methodologies are included in the second group.
According to recent studies [14], new methods have increased the accuracy associated with estimating IPF through the application of data-driven techniques, such as adaptive neuro-fuzzy inference systems (ANFIS) and artificial neural networks (ANN).Therefore, in this work, we have also employed the ANN and ANFIS, which are machine-learning methods used widely in order to compare the results obtained by applying the empirical formula.ANNs reproduce the learning process of the human brain [15].An ANN is a powerful and efficient mathematical model for linear and nonlinear approximations and is often known as a universal approximator [16].Mustafa et al. (2012) [17] examined the effectiveness of ANNs in solving different hydrologic problems and concluded that appropriate ANN modelling is advantageous compared with conventional modelling techniques.Generalizability and forecast accuracy are some advantages of ANNs [18].These properties make ANNs suitable for solving problems of estimation and prediction in hydrology [19].ANNs have the capability of obtaining the relationship between the predictor variables (in this case, MMDF) and the estimated variables (here, IPF) of a process [16,19].We have also used the ANFIS model to estimate IPF from MMDF.ANFIS is another powerful technique for modeling a nonlinear system and it integrates fuzzy logic into neural networks.Therefore, ANFIS has the ANN learning ability [20].The ANFIS model is a fusion of ANN and a Fuzzy Inference System (FIS) and possesses the advantages of both systems.The benefit of ANNs is that it learns independently and adapts itself to changing environments and the advantage of FIS is that it systematically generates unknown fuzzy rules from given information (inputs/outputs) [21].Therefore, this combination allows a FIS to learn from the data to create models.This is an efficient model for determining the behaviour of imprecisely defined complex dynamical systems [22].This model has also been accepted as an efficient alternative technique for modeling and prediction in hydrology [23].Some researchers who have applied ANFIS in hydrological modelling are Dastorani et al. (2010) [24] and Seckin (2011) [25].
Although ANN and ANFIS have great advantages, there are also certain disadvantages [26,27], such as: (1) Neural networks are a black box and do not clarify the functional relationship between the input and output values; (2) a neural network has to be trained for each problem to obtain the adequate architecture, and this requires greater computational resources; and (3) ANFIS is more complicated than FIS and is not available for all FIS options.
Many hydrological studies have shown that ANFIS was more efficient than other models as recurrent neural networks or fuzzy logic [28,29].Shabani et al. (2012) [30] and Dastorani et al. (2013) [31] applied ANFIS and ANNs to estimate IPF from MMDF and compared their results with the methods of Fuller [4], Sangal [11], and Fill-Steiner [12].They found that ANFIS increased the accuracy of the estimation of the IPF.The aim of this study is to identify a method to estimate IPF with greater accuracy in fourteen watersheds covering the diversity of flashiness conditions found in Peninsular Spain.In those basins, longer MDF data series exist, but the IPF data series are shorter.Nevertheless, at least 30 years of IPF data are available in the selected basins in order to compare the performance between estimated and measured IPF.

Study Area and Data
Spain shows a wide range of climatic characteristics due to its position between the European temperature zone and the subtropical zone.It also includes some of the rainiest areas in Europe in the northeast as well as the driest ones in the southeast, with a marked long drought period in summer.A set of 14 flow gauging stations distributed over peninsular Spain was selected to serve as a case study.The basins were selected based on several criteria.It was intended to have (1) a wide diversity of various flow regimes representative of the diversity of the conditions across Peninsular Spain; (2) a sufficiently long time series (more than 30 years) from gauging stations located in near-natural basins; (3) basin areas not exceeding 1000 km 2 .As shown in Figure 1, the basins used in this study are well distributed over Peninsular Spain, covering the three main climatic zones distinguished in Peninsular Spain: the Mediterranean climate, which is characterised by dry and warm summers and cool to mild, wet winters; the oceanic climate, which is located in the northern part of the country; and the semiarid climate, which is present in the centre and southeastern parts of the country, where in contrast to the Mediterranean climate, the dry season continues beyond the end of summer.

Study Area and Data
Spain shows a wide range of climatic characteristics due to its position between the European temperature zone and the subtropical zone.It also includes some of the rainiest areas in Europe in the northeast as well as the driest ones in the southeast, with a marked long drought period in summer.A set of 14 flow gauging stations distributed over peninsular Spain was selected to serve as a case study.The basins were selected based on several criteria.It was intended to have (1) a wide diversity of various flow regimes representative of the diversity of the conditions across Peninsular Spain; (2) a sufficiently long time series (more than 30 years) from gauging stations located in nearnatural basins; (3) basin areas not exceeding 1000 km 2 .As shown in Figure 1, the basins used in this study are well distributed over Peninsular Spain, covering the three main climatic zones distinguished in Peninsular Spain: the Mediterranean climate, which is characterised by dry and warm summers and cool to mild, wet winters; the oceanic climate, which is located in the northern part of the country; and the semiarid climate, which is present in the centre and southeastern parts of the country, where in contrast to the Mediterranean climate, the dry season continues beyond the end of summer.Table 1 lists the set of 14 basins, showing basin areas ranging from 29 to 837 km 2 with an average area of 307 km 2 , altitudes vary from 16 to 1278 m and streamflow data covering a period ranging from 38 to 70 years.According to the Koppen climate classification system [32], among the fourteen studied basins, six of them are considered warm-summer Mediterranean climates (Csb), four of them are considered oceanic climates (Cfb), three of them are considered hot-summer Mediterranean climates (Csa), and there is only one basin representing semi-arid climate (Bsk).There is no southwestern basin in this study due to the lack of data in this area, which has been studied for less than ten years in most gauging stations.Besides, according to Senent-Aparicio et al. (2016) [33], Southern Spain is one of the most water-stressed regions of Europe, and this is why it is very difficult to find near-natural basins in this part of Spain.In order to evaluate the flashiness of the basins selected, the Richards-Baker Flashiness Index (R-B Index) [34] has been obtained.This index reflects the velocity and frequency of short term changes in streamflow in response to storm events and can be calculated as shown in Equation (1).Table 1 lists the set of 14 basins, showing basin areas ranging from 29 to 837 km 2 with an average area of 307 km 2 , altitudes vary from 16 to 1278 m and streamflow data covering a period ranging from 38 to 70 years.According to the Koppen climate classification system [32], among the fourteen studied basins, six of them are considered warm-summer Mediterranean climates (Csb), four of them are considered oceanic climates (Cfb), three of them are considered hot-summer Mediterranean climates (Csa), and there is only one basin representing semi-arid climate (Bsk).There is no south-western basin in this study due to the lack of data in this area, which has been studied for less than ten years in most gauging stations.Besides, according to Senent-Aparicio et al. (2016) [33], Southern Spain is one of the most water-stressed regions of Europe, and this is why it is very difficult to find near-natural basins in this part of Spain.In order to evaluate the flashiness of the basins selected, the Richards-Baker Flashiness Index (R-B Index) [34] has been obtained.This index reflects the velocity and frequency of short term changes in streamflow in response to storm events and can be calculated as shown in Equation (1).
where, i is the time step, n is the total number of time step and q is the daily flow.
As shown in Table 1, the R-B Index in the basins selected range from 0.08 to 0.47, covering the diversity of flashiness conditions found in Peninsular Spain.Higher values for this index indicate higher flashiness (flashy streams), whereas lower values indicate stable streams.Climates, topography, geology, percentages of forest cover, catchment area and shape, land use and other catchment attributes influence the streamflow regime and, hence, the flashiness index [34].Depending on the case study, different correlations between flashiness index and catchment attributes can be found.We have found a negative correlation with the mean catchment elevation, and this is similar to results obtained by Holko et al. (2011) [35].The daily flow data and streamflow gauging stations were collected from the CEDEX [36].

Empirical Formulas
The only way to obtain instantaneous flows accurately is to measure them.If these have not been measured, any attempt to obtain the instantaneous flow afterwards will result in an approximate value.Although the relationship between MDF and IPF is logically variable from one flood to another, in most watersheds, this relationship is usually more or less constant or, at least, it fluctuates within a relatively narrow range of values [13].This has led to the application of empirical formulas to calculate the unknown values of IPF from the known values of MDF.The following are the different empirical methods used in this study.

Fuller
Fuller [4] studied flood data of 24 watersheds in The United States with basin areas between 3.06 and 151,592 km 2 and suggested an equation where IPF is calculated from MMDF as a function of the drainage area.Fuller formula (Equation (2)) is the most important and widely accepted due to its simplicity [11].
where IPF is the estimated instantaneous peak flow (m 3 /s), MMDF is the maximum observed mean daily flow (m 3 /s), and A is the drainage area (km 2 ).The coefficients present in Equation (2) are regression coefficients that were obtained in Fuller's study [4].

CEDEX Regionalization of Fuller´s Formula
In 2011, CEDEX [13] published a technical report about methodologies used in maximum streamflow mapping of the different river basin districts of Spain.CEDEX proposed twelve regional formulas to transform MMDF data into the corresponding IPF based on Fuller's method.Each formula corresponds to a river basin district; these are shown in Table 2.In these regional formulas, IPF is the estimated instantaneous peak flow (m 3 /s), MMDF is the maximum observed mean daily flow (m 3 /s), A is the drainage area (km 2 ) and the coefficients have been obtained by regression for each region.

Sangal
In his study, Sangal [11] realised several calculations based on a triangular hydrograph and proposed the following formula (Equation ( 3)) where the variables are the mean daily flow of three consecutive days.
where IPF is the estimated instantaneous peak flow (m 3 /s), MMDF is the maximum observed mean daily flow (m 3 /s), Q1 is the mean daily flow on the preceding day (m 3 /s), and Q3 is the mean daily flow on the following day.Sangal tested his method using data from 387 gauging stations in Ontario (Canada) for basin areas measuring less than 1 km 2 to more than 100,000 km 2 .He obtained good estimations in the majority of basins, although in small basins the peak flow could be underestimated.

Fill and Steiner
Fill and Steiner [12] created a study based on Sangal's formula and obtained values of estimated IPF that were higher than the observed values in basin areas greater than 1000 km 2 .This problem of overestimating instantaneous peak flow led Fill and Steiner to propose an improvement to Sangal's method.They used data from 14 stations of basins with drainage areas between 84 and 687 km 2 in Brazil and developed a simple formula (Equation ( 4)), suitable for drainage areas from 50 to 700 km 2 , similar to Sangal's (Equation ( 3)) to obtain the IPF from the mean daily flow of three consecutive days.
where IPF is the estimated instantaneous peak flow (m 3 /s), MMDF is the maximum observed mean daily flow (m 3 /s), Q1 is the mean daily flow on preceding day (m 3 /s) and Q3 is the mean daily flow on posterior day.Further details about Equation ( 4) are available in Fill and Steiner (2003) [12].

Artificial Neural Network (ANN)
To estimate IPF data, we used the feedforward multilayer perceptron network (MLP), the most popular ANN in hydrology [37].The MLP network includes an input layer, an output layer, and one or more hidden layers.The first layer receives the input data, the hidden layers process data, and the last layer obtains the output data [38].Each layer contains one or more neurons connected with all neurons of the next immediate layer through vertically aligned interconnections.The output y of a neuron j is obtained by computing the following Equation ( 5) [39]: where, f is an activation function, X is a vector of inputs, W j is a vector of connection weights from neurons in the preceding layer to neuron j, and b j is a bias associated with neuron j.
During the training process with a backpropagation algorithm, the output errors are repeatedly fed back into the network to adjust connection weights and biases until optimal values are obtained [40].The number of hidden neurons and the number of hidden layers is often determined by trial and error [41,42].In this study, one or two hidden layers with a number of neurons between two and twenty are considered.The optimal network configuration has been determined using an iterative process, evaluating the performance for different network structures.In this process, the data sets are randomly divided into three subsets: training set (70%), validation set (15%), and test set (15%).The number of maximum training iterations (epochs) was 1000 and the Levenberg-Marquardt backpropagation algorithm [43,44] is applied to adjust the appropriate weights and minimize error.
The ANN structure used in this work is shown in Figure 2, where the input data is the maximum mean daily flow and the output data is the instantaneous peak flow.The tangent sigmoid transfer function in the hidden layers and linear transfer function in the output layer were used.We implemented and built the ANNs using MATLAB ® software (version 8.2.0.701 (R2013b), The Mathworks, MA, USA).

Artificial Neural Network (ANN)
To estimate IPF data, we used the feedforward multilayer perceptron network (MLP), the most popular ANN in hydrology [37].The MLP network includes an input layer, an output layer, and one or more hidden layers.The first layer receives the input data, the hidden layers process data, and the last layer obtains the output data [38].Each layer contains one or more neurons connected with all neurons of the next immediate layer through vertically aligned interconnections.The output y of a neuron j is obtained by computing the following Equation ( 5) [39]: where, f is an activation function, X is a vector of inputs, Wj is a vector of connection weights from neurons in the preceding layer to neuron j, and bj is a bias associated with neuron j.
During the training process with a backpropagation algorithm, the output errors are repeatedly fed back into the network to adjust connection weights and biases until optimal values are obtained [40].The number of hidden neurons and the number of hidden layers is often determined by trial and error [41,42].In this study, one or two hidden layers with a number of neurons between two and twenty are considered.The optimal network configuration has been determined using an iterative process, evaluating the performance for different network structures.In this process, the data sets are randomly divided into three subsets: training set (70%), validation set (15%), and test set (15%).The number of maximum training iterations (epochs) was 1000 and the Levenberg-Marquardt backpropagation algorithm [43,44] is applied to adjust the appropriate weights and minimize error.
The ANN structure used in this work is shown in Figure 2, where the input data is the maximum mean daily flow and the output data is the instantaneous peak flow.The tangent sigmoid transfer function in the hidden layers and linear transfer function in the output layer were used.We implemented and built the ANNs using MATLAB ® software (version 8.2.0.701 (R2013b), The Mathworks, MA, USA).

Adaptive Neuro-Fuzzy Inference System (ANFIS)
ANFIS models nonlinear functions and makes a nonlinear map from input space to output space using fuzzy if-then rules, with each rule describing the local behaviour of the mapping.The parameters of these rules determine the efficiency of the FIS [45] and describe the shape of the Membership Functions (MF).
In this study, the Sugeno-type FIS [46,47] was used.In this learning process, a hybrid learning algorithm-a combination of the least-squares method and the backpropagation gradient descent method-is used to emulate a given training data set and estimate the parameters of the FIS.The ANFIS architecture is based on the work of Jang (1993) [48] and it is composed of five layers.In the first layer, every node is an adaptive node and acts as an MF.Different MFs were used in this study

Adaptive Neuro-Fuzzy Inference System (ANFIS)
ANFIS models nonlinear functions and makes a nonlinear map from input space to output space using fuzzy if-then rules, with each rule describing the local behaviour of the mapping.The parameters of these rules determine the efficiency of the FIS [45] and describe the shape of the Membership Functions (MF).
In this study, the Sugeno-type FIS [46,47] was used.In this learning process, a hybrid learning algorithm-a combination of the least-squares method and the backpropagation gradient descent method-is used to emulate a given training data set and estimate the parameters of the FIS.The ANFIS architecture is based on the work of Jang (1993) [48] and it is composed of five layers.In the first layer, every node is an adaptive node and acts as an MF.Different MFs were used in this study and the models with the generalized bell and sigmoidal membership function obtained the more accurate results in the testing phase.MMDF was used as input data.

Evaluation Criteria
In this work, the performance of the different methods was evaluated using three evaluation tools.Firstly, coefficient of determination (R 2 ) (Equation ( 6)) was used to describe the degree of collinearity between estimated and observed data and the proportion of variance in observed data explained by the model [49].
where O i is the ith observed data, O is the mean of the observed data, E i is the ith estimated data, E is the mean of the estimated data, n is the total number of observations, S O is the variance of the observed data and S E is the variance of the estimated data.Secondly, root mean square error (RMSE) (Equation ( 7)) was used.RMSE is an error index commonly used for quantifying the estimation error and is useful in assessing the errors in the units of the data [49].
Finally, an analysis of observed-estimated plot with the identity line (1:1 line) was used.

Results and Discussion
As shown in Table 3, to compare the results of the empirical formulas, R 2 and RMSE values were calculated for the fourteen gauge stations studied.The best results for each basin are highlighted in black in Table 3.The estimation with a higher R 2 value and a lower RMSE value is the best.According to the results obtained, in most of the stations, Fuller or the regionalization of Fuller's equation (CEDEX) have the higher coefficient of determination and the lowest amount of error.It can also be concluded that the regionalization of the Fuller equation improves the estimation of the IPF, especially in basins with a high flashiness index, as demonstrated in the results obtained in AND, COT, TRE, GAR, and CUE.Besides, the Fill and Steiner formula obtained more accurate results in PER and PIT, basins characterized by a low flashiness index.The Sangal formula exhibits very regular behaviour in general, with more accurate results compared with Fill and Steiner but worse results compared with the Fuller equation or its regionalization.With regard to climate zones, a clear pattern is not observed indicating a preference of one method over another.Moreover, according to the plots in Figure 3, where line slope and correlation coefficient between observed and measured values of IPF obtained with ANN and ANFIS techniques are compared, ANFIS seems to be slightly better in most of the gauge stations studied because those results concentrate closer to the identity line (perfect line) and the correlation coefficient is closer to 1.However, the differences between the techniques increase when the RMSE results are analysed, as shown in Table 4.
Water 2017, 9, 347 9 of 12 Moreover, according to the plots in Figure 3, where line slope and correlation coefficient between observed and measured values of IPF obtained with ANN and ANFIS techniques are compared, ANFIS seems to be slightly better in most of the gauge stations studied because those results concentrate closer to the identity line (perfect line) and the correlation coefficient is closer to 1.However, the differences between the techniques increase when the RMSE results are analysed, as shown in Table 4.

Conclusions
Estimation of instantaneous peak flow is essential for flood management and the design of hydraulic structures, especially in countries like Spain, where flash floods are a common occurrence and can cause significant damage.In this research, empirical formulas found in the literature and some learning-machine algorithms that have been used recently to estimate many different hydrological variables were applied to estimate IPF from MMDF.Our results show that the use of machine-learning models of the fuzzy type, such as ANFIS, are more accurate in general than ANN.These conclusions suggest that ANFIS is the more accurate method for increasing the accuracy of the estimation of IPF when a long-time series of MDF is available but the availability of IPF is shorter.The machine-learning method is superior to empirical formulas due to the data used from the basins in the case study, and future extensive studies with more data would be needed to obtain better estimators.On the other hand, the nonlinear dynamics of the relationship between IPF and MDF justifies the results obtained by the ANFIS method.
The main drawback of these machine-learning methods was the time consumed to model them.Finding the optimal structure of a neural network, the appropriate MF, and the shape of each variable in ANFIS is difficult and is determined through the process of trial and error.So, they require long tests and much greater computational resources than empirical formulas.

Figure 1 .
Figure 1.Location of the selected basins.

Figure 1 .
Figure 1.Location of the selected basins.

Figure 2 .
Figure 2. Structure of feedforward multilayer perceptron network (MLP) network used in this research.

Figure 2 .
Figure 2. Structure of feedforward multilayer perceptron network (MLP) network used in this research.

Figure 3 .
Figure 3. Scatter plots of observed instantaneous peak flow (IPF) (m 3 /s) and the estimated IPF (m 3 /s) obtained with ANN versus ANFIS for basins (test phase).The letters on the top left of each subfigure show the basin codes (seeTable1).
Figure 3. Scatter plots of observed instantaneous peak flow (IPF) (m 3 /s) and the estimated IPF (m 3 /s) obtained with ANN versus ANFIS for basins (test phase).The letters on the top left of each subfigure show the basin codes (seeTable1).

Figure 3 .
Figure 3. Scatter plots of observed instantaneous peak flow (IPF) (m 3 /s) and the estimated IPF (m 3 /s) obtained with ANN versus ANFIS for basins (test phase).The letters on the top left of each subfigure show the basin codes (seeTable1).
Figure 3. Scatter plots of observed instantaneous peak flow (IPF) (m 3 /s) and the estimated IPF (m 3 /s) obtained with ANN versus ANFIS for basins (test phase).The letters on the top left of each subfigure show the basin codes (seeTable1).

Table 1 .
Summary of the main characteristics of the selected basins.

Table 3 .
Evaluation of the results of empirical formulas based on R 2 and root mean square error (RMSE) criteria.
Note: The best results for each basin are bold in black.