Estimation of Daily Stage–Discharge Relationship by Using Data-Driven Techniques of a Perennial River, India

Modeling the stage-discharge relationship in river flow is crucial in controlling floods, planning sustainable development, managing water resources and economic development, and sustaining the ecosystem. In the present study, two data-driven techniques, namely wavelet-based artificial neural networks (WANN) and a support vector machine with linear and radial basis kernel functions (SVM-LF and SVM-RF), were employed for daily discharge (Q) estimation. The hydrological data of daily stage (H) and discharge (Q) from June to October for 10 years (2004–2013) at the Govindpur station, situated in the Burhabalang river basin, Orissa, were considered for analysis. For model construction, an optimum number of inputs (lags) was extracted using the partial autocorrelation function (PACF) at a 5% level of significance. The outcomes of the WANN, SVM-LF, and SVM-RF models were appraised against the observed values of Q based on performance indicators, viz., root mean square error (RMSE), Nash–Sutcliffe efficiency (NSE), Pearson's correlation coefficient (PCC), and Willmott index (WI), and through visual inspection (time variation, scatter plot, and Taylor diagram). Results of the evaluation showed that the SVM-RF model (RMSE = 104.426 m³/s, NSE = 0.925, PCC = 0.964, WI = 0.979) outperformed the WANN and SVM-LF models with the combination of three inputs, i.e., current stage, one-day antecedent stage, and discharge, during the testing period. In addition, the SVM-RF model was found to be more reliable and robust than the other models, with important implications for water resources management at the study site.


Introduction
River discharge and water level observation is an essential issue in hydrological and hydraulic modeling; in addition, it represents a vital source of information for water resources planning and management. For instance, accurate stage-discharge estimation is crucial for estimating design flows for different hydraulic infrastructures, such as bridges, culverts, and canals [1]. In very dynamic or compound rivers, direct measurements of flow discharge are very often difficult or not feasible [2]. Moreover, in some cases, neither discharge nor water level may be available or have the same data series record. Therefore, in such circumstances, flow rating curves (FRCs) are the standard and most common approach for discharge estimation. The present study instead considers data-driven models, including WANN, SVM-LF, and SVM-RF. To the best of our knowledge, these models have not previously been used for stage-discharge relationship estimation; moreover, they have rarely been applied to other hydrologic- or hydraulic-related issues. Therefore, this study attempts to bring to researchers in the water resources community a set of new data-driven models for potential applications in solving different complex problems in the field of hydraulics and hydrology.
The objectives of this study are (i) to indicate the reliability and precision of the applied data-driven models, (ii) to investigate their performance on stage-discharge relationship estimation, and finally (iii) to compare model fits employing some known comparison criteria. The numerical results demonstrate the efficiency of all the proposed models on the real datasets considered. The paper is organized as follows: Section 2 presents a brief description of the study site, data acquisition, and the methodological approach, including descriptions of the data-driven models; Section 3 discusses the main results and findings; finally, concluding remarks and recommendations are presented in Section 4.

Study Area and Data Collection
The study site, the NH-5 road bridge at Govindpur (commonly known as Govindpur), is located in the Balasore district of Orissa State (India) at latitude 21°32′52″ N and longitude 86°55′14″ E. The study site is on the mainstream of the Burhabalang river, which is an east-flowing river and also a part of the Subarnarekha river basin located in Orissa State. The contributing area of the drainage basin is 4495 km². Figure 1 illustrates the location map of the study area. The basin is strongly dominated by the south-west monsoon, which starts in June and withdraws by mid-October. The average annual rainfall in the basin is about 1800 mm. The maximum temperature in the plains of the basin varies between 42 and 49 °C during May and drops to 8–14 °C during December–January. Geologically, the basin belongs mostly to Archean terrains. The rocks in the basin include gneiss, schist, quartzite, and amphibolite. Igneous rocks are also seen in the riverbed at some places.
The hydrological data, including the daily stage (m) and discharge (m³/s) of 10 years (1st June 2004–31st October 2013), were obtained from the India-Water Resources Information System (WRIS) portal. The time series plot of the total available datasets of stage and discharge versus time is shown in Figure 2. The whole dataset was divided into two parts: (i) a training dataset consisting of 70% (1st June 2004 to 31st October 2010) of the total data, which was used for the development of the models, and (ii) the remaining 30% (1st June 2011 to 31st October 2013) of the total data, which was used for testing to check the prediction capability of the applied models (Figure 2). Figure 3 shows the relationship between stage and discharge through the rating curve at the study site. Figure 4, meanwhile, illustrates the flowchart of the adopted methodology for discharge estimation at the Govindpur site.
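The 70/30 chronological partition described above can be sketched as follows. The study splits on calendar dates (June 2004–October 2010 vs. June 2011–October 2013); the index-fraction split and the synthetic series below are purely illustrative, not the authors' code.

```python
import numpy as np

def chronological_split(data, train_frac=0.7):
    """Split a time series into training and testing blocks
    without shuffling, preserving temporal order."""
    n_train = int(round(len(data) * train_frac))
    return data[:n_train], data[n_train:]

# Illustrative: ten monsoon seasons of 153 daily values each.
rng = np.random.default_rng(0)
series = rng.gamma(shape=2.0, scale=50.0, size=1530)  # synthetic discharge
train, test = chronological_split(series, 0.7)
print(len(train), len(test))  # 1071 459
```

Keeping the split chronological (no shuffling) matters here: shuffling daily flows would leak information from the testing monsoons into training.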

Wavelet Transforms
Wavelet analysis (WA) is a promising time-frequency technique for signal processing with more advantages than Fourier analysis [14]. WA is an enhanced version of Fourier transformation used to detect time features in data [47,48]. Generally, discrete wavelet transformation (DWT) is preferred over continuous wavelet transformation (CWT) for data decomposition: CWT computes wavelet coefficients at every possible scale, which is time-consuming and produces highly redundant data. DWT is better suited for analysis because it restricts the scaling and shifting factors of the fundamental wavelet function to discrete values while maintaining analytical exactness. DWT has been notably used in recent years as a computing tool to extract information from non-stationary signals [47,49,50].

In the present study, the DWT method was employed for daily discharge estimation. The wavelet transform decomposes the original input time series of stage and discharge into different frequency components. Three levels of the Haar à trous decomposition algorithm were used in this study. The decomposed sub-series then act as inputs for the ANN; the hybridization of the decomposed wavelet values with an ANN constitutes a wavelet artificial neural network (WANN). Detailed information about the ANN can be found in [57]. The Levenberg-Marquardt algorithm was utilized for the training of the model, and the hyperbolic tangent sigmoid transfer function was used to calculate a layer's output from its net input.
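As an illustrative sketch (the study's exact filter and boundary handling are not specified here), a three-level redundant Haar ("à trous") decomposition can be written in a few lines of numpy. The resulting approximation and detail series are what would feed the ANN in a WANN setup; function and variable names are ours.

```python
import numpy as np

def haar_a_trous(x, levels=3):
    """Redundant (a trous) Haar decomposition: returns the smooth
    approximation at the deepest level plus one detail series per level.
    By construction the original signal equals a_J + sum of the details."""
    x = np.asarray(x, dtype=float)
    approx = x.copy()
    details = []
    for j in range(1, levels + 1):
        shift = 2 ** (j - 1)
        # Shift with edge padding (repeat the first sample at the boundary).
        shifted = np.concatenate([np.full(shift, approx[0]), approx[:-shift]])
        smooth = 0.5 * (approx + shifted)   # Haar low-pass at scale 2^j
        details.append(approx - smooth)     # detail = what the smoothing removed
        approx = smooth
    return approx, details

# Synthetic noisy signal standing in for a stage/discharge series.
sig = np.sin(np.linspace(0, 6 * np.pi, 128)) \
      + 0.1 * np.random.default_rng(1).standard_normal(128)
a3, ds = haar_a_trous(sig, levels=3)
recon = a3 + sum(ds)
print(np.allclose(recon, sig))  # True: the decomposition is exactly invertible
```

The redundant (undecimated) form keeps every sub-series at the original length, which is convenient when stacking them as ANN input columns.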

Support Vector Machine (SVM)
Vapnik [58] developed the idea of the support vector machine (SVM). The SVM constructs an optimal hyperplane in the input space that separates a given training dataset while maximizing the distance (margin) on both sides of the hyperplane to the nearest instances. The data points defining the maximum margin are referred to as support vectors; in regression analysis, these are the dataset points whose approximation errors are equal to or greater than the tube size of the SVM. When the training data are not linearly separable, a non-linear separating boundary must be constructed. Creating a non-linear boundary requires mapping the original space to a higher-dimensional space, called the feature space; a kernel function defines this mapping from a given input space. For optimization of the model, a penalty factor (c) is introduced for misclassification, and the total penalty is obtained by summing the penalties over all misclassifications. Several useful applications of the SVM technique have been found in water resources engineering [59][60][61][62][63][64].
When the SVM algorithm is applied to classification problems, it is called support vector classification (SVC), and when applied to regression problems, it is called support vector regression (SVR) [65,66]. The use of a kernel function makes this technique attractive, giving excellent generalization and applicability to the approximation of both linear and non-linear datasets. A unique optimal solution is guaranteed by the convex nature of the target function and its constraints. The SVM works on the principle of structural risk minimization, which mitigates the generalization error rather than the training error. Consider a training dataset, T, represented using Equation (3):

T = {(x_i, y_i), i = 1, 2, ..., N}, x_i ∈ X ⊂ R^n, y_i ∈ Y ⊂ R (3)

where x_i are the training inputs and y_i are the training outputs. Assume a non-linear function f(x) is given by Equation (4):

f(x) = w·φ(x) + b (4)

where w is the weight vector, b is the bias, and φ linearly maps x into a high-dimensional feature space. Therefore, Equation (4) is transformed into a constrained optimization problem using Equations (5) and (6) as:

minimize: (1/2)||w||² + c Σ_{i=1}^{N} (ξ_i + ξ*_i) (5)

subject to: y_i − w·φ(x_i) − b ≤ ε + ξ_i; w·φ(x_i) + b − y_i ≤ ε + ξ*_i; ξ_i, ξ*_i ≥ 0 (6)

where ξ_i and ξ*_i are the loose (or slack) parameters, c (>0) is the penalty variable, and ε is the tube size that represents the maximum acceptable deviation. Lagrangian multipliers are used to solve the constrained optimization problem [67,68]. The final expansion of the SVM is defined using Equation (7) as [69]:

f(x) = Σ_{i=1}^{N} (α_i⁺ − α_i⁻) K(x_i, x) + b (7)

where α_i⁺ and α_i⁻ are the Lagrangian multipliers, and K(x_i, x_j) is the kernel function. The kernel function of the SVM technique allows non-linear approximations to be solved as linear functions. The kernel functions used in this study were [69][70][71]:
• Linear kernel function: the simplest type of kernel function, written using Equation (8) [72]:

K(x_i, x_j) = x_i·x_j (8)

• Radial basis function (RBF): a Gaussian bell-shaped mapping, expressed using Equation (9) [72]:

K(x_i, x_j) = exp(−γ||x_i − x_j||²) (9)

where γ is the width of the Gaussian RBF kernel parameter.
The RBF is the most widely used of all kernel functions in the SVM technique because of its outstanding ability to tackle both linear and non-linear input-output mappings effectively. The optimization of the SVM in the training phase largely depends on the c, γ, and ε parameters.
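The linear and RBF kernels of Equations (8) and (9) reduce to simple Gram-matrix computations, as the sketch below shows. This is illustrative only; in practice an SVR implementation (e.g., LIBSVM or scikit-learn's SVR) would be used with tuned c, γ, and ε.

```python
import numpy as np

def linear_kernel(X1, X2):
    """K(x_i, x_j) = x_i . x_j  (Equation (8))."""
    return X1 @ X2.T

def rbf_kernel(X1, X2, gamma=0.5):
    """K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2)  (Equation (9))."""
    sq_dist = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dist)

# Two toy input vectors (e.g., lagged stage/discharge features).
X = np.array([[1.0, 2.0], [3.0, 4.0]])
print(linear_kernel(X, X))          # Gram matrix [[5, 11], [11, 25]]
print(rbf_kernel(X, X, gamma=0.5))  # diagonal entries are exactly 1
```

Note how γ controls the width of the RBF: larger γ makes the off-diagonal similarities decay faster, which is why γ must be tuned jointly with c and ε during training.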

Model Development and Performance Indicators
The current-day streamflow depends not only on the current-day conditions but also on those of previous days [73]. In this context, lagged input variables are very important in time series modeling. However, it is challenging to determine the optimal number of lagged input variables. PACF analysis provides a sound basis for selecting the optimal number of lags/input variables; the regression of the time series against its past lagged values serves to remove any dependence on intermediate elements within the lags [41,[74][75][76]. In the present study, the time-series data of discharge and stage were lagged based on PACF analysis, so that the actual pattern of the PACF among the data could be understood (Figure 5). It was observed that the first three lags exert the greatest influence on discharge and stage at the 5% significance level. Based on this, lags 1, 2, and 3 of H and Q were selected, and the following three scenarios were developed in Equations (10)-(12):

Q_t = f(H_t, H_{t−1}, Q_{t−1}) (10)

Q_t = f(H_t, H_{t−1}, H_{t−2}, Q_{t−1}, Q_{t−2}) (11)

Q_t = f(H_t, H_{t−1}, H_{t−2}, H_{t−3}, Q_{t−1}, Q_{t−2}, Q_{t−3}) (12)
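As a hedged illustration (not the authors' code), the PACF-based lag selection can be sketched with an OLS formulation: the PACF at lag k is the coefficient on the k-th lag when the series is regressed on its first k lags, judged against the 5% significance bound ±1.96/√N.

```python
import numpy as np

def pacf_ols(x, max_lag):
    """Sample PACF: the value at lag k is the coefficient on the k-th lag
    when x_t is regressed on its first k lags (ordinary least squares)."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    n = len(x)
    vals = []
    for k in range(1, max_lag + 1):
        y = x[k:]                                             # x_t
        X = np.column_stack([x[k - j:n - j]                   # x_{t-j}, j=1..k
                             for j in range(1, k + 1)])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        vals.append(beta[k - 1])                              # last coefficient
    return np.array(vals)

# Synthetic AR(1) series: only lag 1 should come out significant.
rng = np.random.default_rng(42)
ar = np.zeros(2000)
for t in range(1, 2000):
    ar[t] = 0.8 * ar[t - 1] + rng.standard_normal()

p = pacf_ols(ar, max_lag=3)
bound = 1.96 / np.sqrt(len(ar))  # 5% significance limits
print(np.abs(p) > bound)         # lag 1 clearly exceeds the bound
```

Significant lags (those whose PACF magnitude exceeds the bound) would then be carried forward as model inputs, mirroring the three-lag selection in Figure 5.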
Scenario 1 has the minimum number of inputs, viz., the current-day stage and the previous 1-day stage and discharge (Equation (10)). Scenario 2 comprises an intermediate number of inputs, viz., the current-day stage and the previous 1- and 2-day stage and discharge (Equation (11)). Meanwhile, scenario 3 includes the maximum number of inputs, namely the current-day stage and the previous 1-, 2-, and 3-day stage and discharge (Equation (12)). All the models were formulated to predict the current-day discharge (Q_t) at the study site.
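A minimal sketch of how the three input scenarios could be assembled from the stage and discharge series; `make_inputs`, `stage_lags`, and `flow_lags` are illustrative names, not from the original study.

```python
import numpy as np

def make_inputs(H, Q, stage_lags, flow_lags):
    """Build a design matrix of the current stage H_t plus the requested
    stage and discharge lags, with target Q_t (cf. Equations (10)-(12))."""
    H, Q = np.asarray(H, float), np.asarray(Q, float)
    m = max(stage_lags + flow_lags)                       # rows lost to lagging
    cols = [H[m:]]                                        # H_t
    cols += [H[m - L:len(H) - L] for L in stage_lags]     # H_{t-L}
    cols += [Q[m - L:len(Q) - L] for L in flow_lags]      # Q_{t-L}
    return np.column_stack(cols), Q[m:]                   # (X, Q_t)

# Toy series to make the alignment visible.
H = np.arange(10.0)                 # stage: 0, 1, ..., 9
Q = np.arange(100.0, 110.0)         # discharge: 100, 101, ..., 109

# Scenario 1: Q_t = f(H_t, H_{t-1}, Q_{t-1})
X1, y1 = make_inputs(H, Q, stage_lags=[1], flow_lags=[1])
print(X1[0], y1[0])  # [1. 0. 100.] 101.0

# Scenario 3: Q_t = f(H_t, H_{t-1..3}, Q_{t-1..3})
X3, y3 = make_inputs(H, Q, stage_lags=[1, 2, 3], flow_lags=[1, 2, 3])
print(X3[0], y3[0])  # [3. 2. 1. 0. 102. 101. 100.] 103.0
```

The same matrix construction feeds all three techniques (WANN, SVM-LF, SVM-RF), so the scenarios differ only in how many lag columns are included.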
The performance of the scenarios mentioned above was evaluated statistically using root mean square error (RMSE), Nash-Sutcliffe efficiency (NSE), Pearson's correlation coefficient (PCC), and the Willmott index (WI), and through graphical interpretation (time series plot, scatter plot, and Taylor diagram). The advantages and disadvantages of RMSE, NSE, PCC, and WI, with definitions, are discussed subsequently. The RMSE measures the difference between observed and estimated values (Equation (13)). The RMSE is reported in the same units as the model output and illustrates the size of a typical error. For continuous long-term simulation, RMSE performs well. The RMSE tends to give more weight to high values than to low values, because errors at high values are generally larger in absolute terms than errors at low values. The RMSE ranges from zero to infinity (0 ≤ RMSE < ∞), so the lower the RMSE, the better the model performance [77,78]. NSE was initially proposed by Nash and Sutcliffe [79] and is widely used to evaluate hydrologic models [78,80,81]. It is the ratio of the mean square error to the variance of the observed data during the period under examination, subtracted from unity (Equation (14)). The major limitation of NSE is that the differences between observed and estimated values are calculated as squared values; consequently, it cannot help to identify model bias, differences in magnitudes of peak flows, or the shape of recession curves. Similarly, it cannot be used for single-event simulation [78,80,81]. NSE ranges from minus infinity to one (−∞ < NSE ≤ 1), so the closer to 1, the better the fit. An NSE of zero shows that the observed mean is as good a predictor as the model, while negative values indicate that the observed mean is a better predictor than the model [78,80,81].
The PCC, also known as the correlation coefficient or coefficient of correlation, is used to measure the degree of collinearity between the observed and estimated variables in hydrological studies [78,81]. The PCC is oversensitive to extreme values and insensitive to additive and proportional variances between model predictions and observed data [81,82]. The PCC varies from minus one to plus one (−1 ≤ PCC ≤ 1), so a value close to one means a perfect fit (Equation (15)).
The WI, also known as the index of agreement, was developed by Willmott [83] to overcome the insensitivity of NSE and the coefficient of determination (R²) to the differences in observed and estimated means and variances [81,82]. It represents the ratio of the mean square error to the potential error [83]. The WI varies between zero and one (0 < WI ≤ 1): a value near 1 means a perfect agreement/fit, while a value approaching 0 means complete disagreement between the observed and estimated data (Equation (16)). The main disadvantage of WI is over-sensitivity to extreme values due to the squared differences; high values of WI have been reported even for poor model fits [81,82].
Finally, the RMSE [31,69,77,78], NSE [79], PCC [38,78,81,84], and WI [83] are written as:

RMSE = √[ (1/N) Σ_{i=1}^{N} (Q_obs,i − Q_est,i)² ] (13)

NSE = 1 − [ Σ_{i=1}^{N} (Q_obs,i − Q_est,i)² / Σ_{i=1}^{N} (Q_obs,i − Q̄_obs)² ] (14)

PCC = Σ_{i=1}^{N} (Q_obs,i − Q̄_obs)(Q_est,i − Q̄_est) / √[ Σ_{i=1}^{N} (Q_obs,i − Q̄_obs)² Σ_{i=1}^{N} (Q_est,i − Q̄_est)² ] (15)

WI = 1 − [ Σ_{i=1}^{N} (Q_obs,i − Q_est,i)² / Σ_{i=1}^{N} (|Q_est,i − Q̄_obs| + |Q_obs,i − Q̄_obs|)² ] (16)

where N is the number of data points, Q_obs,i and Q_est,i are the observed and estimated discharge values for the ith observation, and Q̄_obs and Q̄_est are the means of the observed and estimated discharge values.
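The four indicators in Equations (13)-(16) are straightforward to implement; the numpy sketch below mirrors the standard definitions above (function names are ours, not from the study).

```python
import numpy as np

def rmse(obs, est):
    """Root mean square error, Equation (13)."""
    return float(np.sqrt(np.mean((obs - est) ** 2)))

def nse(obs, est):
    """Nash-Sutcliffe efficiency, Equation (14)."""
    return float(1 - np.sum((obs - est) ** 2)
                   / np.sum((obs - obs.mean()) ** 2))

def pcc(obs, est):
    """Pearson's correlation coefficient, Equation (15)."""
    return float(np.corrcoef(obs, est)[0, 1])

def wi(obs, est):
    """Willmott index of agreement, Equation (16)."""
    denom = np.sum((np.abs(est - obs.mean()) + np.abs(obs - obs.mean())) ** 2)
    return float(1 - np.sum((obs - est) ** 2) / denom)

# Small worked example with observed vs. estimated discharge (m³/s).
obs = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
est = np.array([12.0, 18.0, 33.0, 39.0, 52.0])
print(rmse(obs, est), nse(obs, est), pcc(obs, est), wi(obs, est))
```

For a perfect fit (est equal to obs) these return 0, 1, 1, and 1 respectively, matching the ideal values quoted in the text.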

Statistical Analysis
The statistical analysis of the stage (H) and discharge (Q) datasets for the training, testing, and entire periods is given in Table 1, which includes various statistical parameters such as the mean, median, minimum and maximum values, standard deviation (Std. Dev.), coefficient of variation (CV), and skewness. These statistical parameters show the variability of the data over time. When dividing the dataset into training and testing subsets, it is necessary to cross-validate that both subsets represent the same statistical population. A high skewness coefficient can have a considerable negative effect on model performance; here, the skewness coefficients are low for both the calibration (1.3012) and validation (1.3441) sets at the given station, which is appropriate for discharge estimation at the study site. The standard deviations of the datasets are far from zero, indicating that the variability of the data about the mean value is high.
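The Table 1 parameters can be reproduced with a short numpy routine. Note that the moment-based skewness below omits any small-sample correction, which may differ slightly from the software the authors used; names and the synthetic data are illustrative.

```python
import numpy as np

def summary_stats(x):
    """Mean, median, min, max, standard deviation, coefficient of
    variation, and moment-based skewness (no small-sample correction)."""
    x = np.asarray(x, dtype=float)
    mu, sd = x.mean(), x.std(ddof=1)
    skew = np.mean((x - mu) ** 3) / np.mean((x - mu) ** 2) ** 1.5
    return {"mean": float(mu), "median": float(np.median(x)),
            "min": float(x.min()), "max": float(x.max()),
            "std": float(sd), "cv": float(sd / mu), "skewness": float(skew)}

# Positively skewed synthetic flows, qualitatively like monsoon discharge.
rng = np.random.default_rng(7)
q = rng.gamma(shape=2.0, scale=100.0, size=1000)
stats = summary_stats(q)
print(stats["skewness"] > 0)  # gamma(2) is right-skewed: True
```

Running this separately on the training and testing subsets is a quick check that the two blocks come from a comparable statistical population, as the paragraph above requires.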

Evaluation of Results from Various Trials
In the selection process of the best model, several trials were performed for a single output. The trials of WANN were performed with different numbers of neurons in the hidden layers, while the trials of SVM-LF and SVM-RF were performed by taking several values of the SVM-g, SVM-c, and SVM-e parameters across scenarios 1 to 3. The best four trials are listed in Tables 2-4 based on the testing results. Trial-2, trial-1, and trial-4 of WANN-1, SVM-LF-1, and SVM-RF-1 (Table 2); trial-3, trial-2, and trial-4 of WANN-2, SVM-LF-2, and SVM-RF-2 (Table 3); and trial-2 of WANN-3, SVM-LF-3, and SVM-RF-3 (Table 4) were found to be more promising than the other trials. From these trials, a total of nine models were shortlisted based on the techniques and input selections and further evaluated to find the optimal one for daily discharge estimation at the study site (Table 5).

Quantitative and Qualitative Evaluation of Results
The RMSE, NSE, PCC, and WI values of all nine screened models over scenario 1 (S-1), scenario 2 (S-2), and scenario 3 (S-3) are given in Table 5. The model performance was classified as very good (PCC > 0.95) following the criteria of [85] and Paul and Negahban-Azar [86]. After considering the best trials of all the techniques across the three scenarios (S-1 to S-3), it was noted that the SVM-RF model performed better than the WANN and SVM-LF models based on the quantitative performance evaluation indicators. It was also observed that the performance of the SVM-RF model declined as the number of input variables increased. Table 5 confirmed the superiority of the SVM-RF model with M-1 (inputs H_t, H_{t−1}, Q_{t−1}), having the lowest RMSE (104.426 m³/s) and the highest NSE (0.925), PCC (0.964), and WI (0.979), closely followed by the SVM-RF-2 model.
The results of the optimal nine models in the three different scenarios were plotted as observed versus estimated discharge values in the form of time variation and scatter plots in Figures 6-8. It was noted from these figures that high discharge values (>180 m³/s) are under-estimated, whereas low discharge values (<180 m³/s) are over-estimated by the WANN, SVM-LF, and SVM-RF models during the testing period. The amount of explained variation out of the total variation (R²: coefficient of determination) was excellent for the SVM-RF-1 and SVM-RF-2 models, and R² values allow the models to be ordered from very satisfactory to unsatisfactory [87]. Figure 9a-c demonstrates the Taylor diagrams of WANN, SVM-LF, and SVM-RF corresponding to S-1, S-2, and S-3 during the testing period at the study site. The concept of the Taylor diagram was introduced by Taylor [88] to represent the spatial distribution of estimated values (i.e., the test field) with respect to the observed values (the reference field) by compiling the RMSE, standard deviation, and correlation coefficient in a polar system. It can be seen from these figures that the SVM-RF model is closest to the observed (reference) field. Moreover, the SVM-RF-1 model has the lowest RMSE, a smaller standard deviation, and a higher correlation in comparison to the other models, and is nominated as the optimal model for daily discharge estimation with H_t, H_{t−1}, Q_{t−1} inputs at the study site.
Further, to support the findings of this study, the results were compared with the recent literature [89][90][91][92][93][94]. Adnan et al. [95] applied the group method of data handling-neural network (GMDH-NN), dynamic evolving neural-fuzzy inference system (DENFIS), and multivariate adaptive regression splines (MARS) for monthly streamflow prediction at the Kalam and Chakdara stations of the Swat river basin, Pakistan, and found better performance of the DENFIS at the Kalam site. Tripura et al. [99] forecasted hourly streamflow of the Barak river basin, Assam (India), by employing the standalone co-active neuro-fuzzy inference system (CANFIS) and hybrids of CANFIS optimized with the genetic algorithm (CANFIS-GA) and the firefly algorithm (CANFIS-FA); they found that the CANFIS-FA model provides better results than the other models. The results of these studies support the application of artificial intelligence (AI) techniques in monthly and daily streamflow/discharge prediction. Likewise, the results of the current research are in fair agreement with the utility of the SVM-RF technique for daily discharge prediction at Govindpur station.

Conclusions
Prediction of discharge on daily, weekly, and monthly timescales is vital for short- and long-term water resources management, particularly during extreme events such as floods and droughts. Thus, the present study aimed to predict the daily stage-discharge relationship at Govindpur station, located in the Burhabalang river basin, Orissa (India), by employing wavelet-based artificial neural networks (WANN) and support vector machines (SVM) with linear and radial basis kernel functions. PACF analysis provides an appropriate basis for selecting the optimum number of input variables in time series-based modeling. Data with more variability were chosen for training, and the remaining data were used to test model performance. Based on the performance indicators and visual inspection, the results revealed that the SVM-RF model with H_t, H_{t−1}, Q_{t−1} inputs performs better than the WANN and SVM-LF models for daily discharge estimation during the monsoon season at the study site. It was also noted that as the number of input variables increases, the computation process becomes more difficult and time-consuming, and sometimes the results deteriorate. The strong performance of the SVM-RF technique can help researchers use highly variable discharge data for such modeling in the future. Researchers are also advised to run as many trials as possible to avoid bias and the related problems of over- and under-estimation for highly variable data.