An Improved LSSVM Model for Intelligent Prediction of the Daily Water Level

Daily water level forecasting is of significant importance for the comprehensive utilization of water resources. An improved least squares support vector machine (LSSVM) model was introduced by including an extra bias error control term in the objective function. The tuning parameters were determined by cross-validation. Both the conventional and the improved LSSVM models were applied to short-term forecasting of the water level in the middle reaches of the Yangtze River, China. Both models were evaluated with RMSE (Root Mean Squared Error), MAPE (Mean Absolute Percentage Error) and the index of agreement (d). The improved model produced more accurate forecasts, although the improvement is regarded as moderate. The results indicate the capability and flexibility of LSSVM-type models in resolving time series problems. The improved LSSVM model is expected to provide useful water level information for the management of hydroelectric resources in rivers.


Introduction
Time series forecasting has been recognized as one of the classical problems in both energy engineering and science [1], among which daily water level forecasting is closely related to hydroelectric resource utilization [2]. Considerable effort has been devoted to obtaining accurate and reliable water level forecasts, and substantial progress has been achieved in various ways.
In view of the complex intrinsic mechanisms and multiple influencing factors, artificial intelligence methods, e.g., the adaptive-network-based fuzzy inference system [1,2] and neural networks [3,4], have been accepted and extensively applied to time series forecasting problems. For fuzzy inference systems, the definition of the membership function and rule system is important with regard to model reliability [5] and accuracy [6]. In contrast to artificial neural network models, grey models use only a small amount of historical data and do not require a mathematical relationship between variables [7]. However, long-term characteristics of hydrological data, such as seasonality and cyclical variations, need to be considered and handled more carefully. Artificial neural network (ANN) methods have also been applied successfully. Palani et al. [8] applied an ANN model for water quality estimation. Nourani et al. [9] established an ANN model for groundwater level prediction. Ivan and Gilja [10] showed good performance of ANNs for hydraulic parameter prediction. However, the model accuracy varies with the neuron structure, and parameter calibration can be time-consuming.
The support vector machine (SVM) has been used to address short-term forecasting problems since the 1990s, on the basis of which the LSSVM (least squares support vector machine) was put forward to overcome drawbacks of SVM (e.g., computation cost [11] and uncertainties in structural parameter determination [12]). LSSVM models solve a linear matrix equation with fewer constraint conditions and have been utilized in a variety of applications, e.g., forecasting of groundwater level fluctuations [13], river stage [14], and watershed runoff [15]. In the case of monthly flow forecasting, Noori et al. [16] discussed the influence of parameter selection on the model performance. Hybrid models, such as SVM combined with the wavelet transform [17], have also proved to be effective.
Although LSSVM models provide favorable solutions to hydrological forecasting problems, issues such as the choice of kernel function and unbalanced features need to be carefully explored. Cheng et al. [18] improved LSSVM by integrating an adaptive time function, so that the dynamic nature of the time series is considered by assigning appropriate weights in cash flow prediction for construction projects. To cope with low efficiency, Cong et al. [19] incorporated the fruit fly optimization algorithm (FOA) to find appropriate parameter values for LSSVM; comparison between LS-SVM-FOA and other models indicated the superiority of the improved model. Ghorbani et al. [20] modelled river discharge time series using SVM and ANN and concluded that both have an edge over the conventional RC (Rating Curve) and MLR (Multiple Linear Regression) models, most visibly for peak value predictions. The authors also presented a critical view on inter-comparison studies through a model establishment process and uncertainty analysis. Guo et al. [21] proposed an improved SVM model with an adaptive insensitive factor, in which the wavelet de-noising method and phase-space reconstruction theory are applied to eliminate noise and determine the structure of the prediction model; the feasibility and performance of the model were evaluated through a case study of monthly streamflow forecasting. LSSVM combined with self-organizing maps (SOM) has also been applied in time series forecasting [22], and this two-stage architecture provides a promising tool for resolving the time series forecasting problem.
The present study focuses on short-term forecasting of the daily water level by using an improved LSSVM model. The data source and pre-processing are presented in Section 2, followed by a detailed description of the improved LSSVM methodology in Section 3. The predictive capability of the improved LSSVM is verified and compared with the classic version in Section 4. Concluding remarks are drawn in Section 5.

Data Source
The boom in water transport in the middle reaches of the Yangtze River has resulted in a rapid increase in vessel traffic flow. Meanwhile, human activities (e.g., sand mining, waterway regulation projects) and operations of upstream dams contribute to the complexity of the temporal characteristics of the daily water level [23]. Accurate forecasting of the daily water level is essential for waterway capacity evaluation as well as maritime risk assessment. Historical hydrological data obtained from the Changjiang Waterway Bureau (MOT, China) are used for model training and testing. The time series of the daily water level spans 2010 to 2016, and the layout of the research region is presented in Figure 1. The flow runs from the Yichang station (upstream) to the Hankou station (downstream).
As aforementioned, the temporal variations of the daily water level in the middle reaches of the Yangtze River are affected by both natural factors and human activities. Seasonal and cyclical characteristics are readily observed in the raw time sequence of the daily water level (as shown in Figure 2). Noise in the sample data needs to be eliminated before LSSVM model training. To eliminate noise (e.g., high-frequency fluctuations) in the hydrological time series, wavelet decomposition and reconstruction theory [24] was adopted in the data pre-processing for all five stations. The threshold is determined by the unbiased likelihood estimation method during the de-noising process.
Since missing data can occasionally occur due to technical failures, linear interpolation was applied to ensure the integrity of the input data. The de-noised daily water level data are then used for the model training of LSSVMc (subscript c denotes the conventional LSSVM model) and LSSVMi (subscript i denotes the improved LSSVM model).
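The two pre-processing steps described above can be illustrated with a minimal Python sketch. It is not the procedure used in the study: the wavelet family, the decomposition level and the threshold rule are not restated here, so a single-level Haar transform with a user-supplied soft threshold is assumed purely for illustration. The function names `fill_gaps_linear` and `haar_denoise` are hypothetical.

```python
import math


def fill_gaps_linear(values):
    """Fill missing entries (None) by linear interpolation between known neighbours.

    Assumption: gaps at the very start or end are filled with the nearest
    known value, since there is no second neighbour to interpolate from.
    """
    filled = list(values)
    known = [i for i, v in enumerate(filled) if v is not None]
    if not known:
        raise ValueError("series contains no observations")
    # Edge gaps: copy the nearest observation.
    for i in range(known[0]):
        filled[i] = filled[known[0]]
    for i in range(known[-1] + 1, len(filled)):
        filled[i] = filled[known[-1]]
    # Interior gaps: straight line between the two bracketing observations.
    for left, right in zip(known, known[1:]):
        step = (filled[right] - filled[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = filled[left] + step * (i - left)
    return filled


def haar_denoise(series, threshold):
    """One-level Haar decomposition, soft-thresholding of details, reconstruction.

    Assumption: `threshold` is chosen by the caller; the threshold selection
    rule of the source is not reproduced here.
    """
    if len(series) % 2:
        raise ValueError("series length must be even for this sketch")
    root2 = math.sqrt(2.0)
    out = []
    for k in range(0, len(series), 2):
        approx = (series[k] + series[k + 1]) / root2
        detail = (series[k] - series[k + 1]) / root2
        # Soft threshold: shrink the detail coefficient towards zero.
        shrunk = math.copysign(max(abs(detail) - threshold, 0.0), detail)
        out.append((approx + shrunk) / root2)
        out.append((approx - shrunk) / root2)
    return out
```

With a threshold of zero the series is reconstructed unchanged, while a large threshold replaces each pair of neighbouring values by their mean, which is the smoothing effect the de-noising step aims for.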

Conventional LSSVM Model
The conventional support vector machine (SVM) is a machine learning method [25]. Its principle is to minimize the structural risk and to achieve data classification or regression by applying a kernel function and high-dimensional data simplification schemes (as presented in Equation (1)). The least squares support vector machine (LSSVM), in turn, uses least squares as the basic algorithm to pursue structural risk minimization. The basic equations of LSSVMc are written as Equation (2).
where J is the risk bound, $X_i$ is the input vector, $Y_i$ is the binary target, w is the weight matrix, b is the bias, $\xi_i$ is the slack variable, and $e_i$ is the error variable; $\gamma$ denotes a regularization constant; $\phi(X_i)$ is the nonlinear mapping associated with the kernel function.
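For orientation, the textbook primal problems of the soft-margin SVM and of the regression form of LSSVM are written out below. This is a sketch under the assumption that Equations (1) and (2) follow these standard forms. The penalty constant $c$ and the sample count $n$ are symbols introduced here for illustration.

```latex
\begin{align}
% Soft-margin SVM (assumed standard form)
\min_{w,b,\xi}\; J(w,\xi) &= \frac{1}{2} w^{T} w + c \sum_{i=1}^{n} \xi_i,
& \text{s.t.}\;\; Y_i\left[w^{T}\phi(X_i) + b\right] &\ge 1 - \xi_i,\;\; \xi_i \ge 0 \\
% LSSVM, regression form (assumed standard form)
\min_{w,b,e}\; J(w,e) &= \frac{1}{2} w^{T} w + \frac{\gamma}{2} \sum_{i=1}^{n} e_i^{2},
& \text{s.t.}\;\; Y_i &= w^{T}\phi(X_i) + b + e_i
\end{align}
```

The key difference is that LSSVM replaces the inequality constraints and slack variables of SVM with equality constraints and squared errors, which turns the quadratic programme into a linear system.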

Improved LSSVM Model
In order to obtain an unbiased estimation for the forecasting model, an extra bias error control term ($\frac{1}{2} a b^{2}$) is added to the objective function of LSSVMi, and the aforementioned Equation (2) is re-organized accordingly, where a is a penalty factor for the bias b exceeding the allowable range. To solve the optimization problem, the Lagrangian function is constructed [16], where $\alpha_i$ are the Lagrangian multipliers. By taking the derivatives with respect to w, b, e and $\alpha$ and setting them all to zero (i.e., Equation (5)), the following equations are derived.
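The steps described above can be written out as follows, assuming the regression form of the LSSVM constraint, $Y_i = w^{T}\phi(X_i) + b + e_i$. This is an assumption, as the constraint form is not spelled out in the text; the sign convention of the Lagrangian is one common choice, and $n$ denotes the number of training samples.

```latex
\begin{align}
% Objective with the extra bias error control term
\min_{w,b,e}\; J(w,b,e) &= \frac{1}{2} w^{T} w + \frac{1}{2} a b^{2}
  + \frac{\gamma}{2} \sum_{i=1}^{n} e_i^{2},
  \qquad \text{s.t.}\;\; Y_i = w^{T}\phi(X_i) + b + e_i \\
% Lagrangian
L(w,b,e,\alpha) &= J(w,b,e)
  - \sum_{i=1}^{n} \alpha_i \left[ w^{T}\phi(X_i) + b + e_i - Y_i \right] \\
% Stationarity conditions
\frac{\partial L}{\partial w} = 0 &\;\Rightarrow\; w = \sum_{i=1}^{n} \alpha_i \phi(X_i) \\
\frac{\partial L}{\partial b} = 0 &\;\Rightarrow\; a\,b = \sum_{i=1}^{n} \alpha_i \\
\frac{\partial L}{\partial e_i} = 0 &\;\Rightarrow\; \alpha_i = \gamma\, e_i \\
\frac{\partial L}{\partial \alpha_i} = 0 &\;\Rightarrow\; w^{T}\phi(X_i) + b + e_i - Y_i = 0 \\
% Eliminating w, b and e, with K(X_i, X_j) = \phi(X_i)^{T}\phi(X_j)
\sum_{j=1}^{n} \alpha_j \left[ K(X_i, X_j) + \frac{1}{a} \right] + \frac{\alpha_i}{\gamma} &= Y_i,
  \qquad i = 1, \dots, n
\end{align}
```

Under this form the bias follows directly from the multipliers, $b = \frac{1}{a}\sum_i \alpha_i$, and the regression function is $f(X) = \sum_i \alpha_i K(X, X_i) + b$. In the conventional LSSVM, by contrast, the condition for $b$ reads $\sum_i \alpha_i = 0$.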

Energies 2019, 12, 112
A linear system of equations is therefore obtained.
where the kernel function is defined as: The least squares method is introduced to solve the above equation, on the basis of which the least squares regression function is therefore derived.
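A minimal pure-Python sketch of this solution step is given below. It assumes the regression form of the constraint, under which the multipliers satisfy $(K + \frac{1}{a}\mathbf{1}\mathbf{1}^{T} + \frac{1}{\gamma} I)\,\alpha = Y$ and the bias is $b = \frac{1}{a}\sum_i \alpha_i$. The linear kernel, the function names and the toy data are illustrative assumptions, not taken from the study.

```python
def linear_kernel(u, v):
    """Illustrative kernel choice (assumption): plain dot product."""
    return sum(ui * vi for ui, vi in zip(u, v))


def solve_linear_system(A, rhs):
    """Solve A x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [row[:] + [rhs[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def fit_lssvm_bias_penalty(X, y, gamma, a, kernel=linear_kernel):
    """Return (alpha, b) for an LSSVM with the extra term 0.5 * a * b**2."""
    n = len(X)
    # System matrix: K + (1/a) * ones + (1/gamma) * identity.
    A = [[kernel(X[i], X[j]) + 1.0 / a + (1.0 / gamma if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    alpha = solve_linear_system(A, y)
    b = sum(alpha) / a
    return alpha, b


def predict(X_train, alpha, b, x_new, kernel=linear_kernel):
    """Evaluate f(x) = sum_i alpha_i * K(x_i, x) + b."""
    return sum(al * kernel(xi, x_new) for al, xi in zip(alpha, X_train)) + b
```

For example, fitting the toy data `X = [[0.0], [1.0], [2.0], [3.0]]`, `y = [1.0, 3.0, 5.0, 7.0]` with `gamma=1e4` and `a=1e-2` gives a prediction close to 9 at `[4.0]`. Larger values of `a` pull the bias towards zero, which is the effect of the extra control term.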
Both LSSVMc and LSSVMi are trained using historical daily water level data (2010-2015), on the basis of which short-term forecasting is carried out for the year 2016. The training results for the Jianli and Chenglingji stations are presented in Figure 3. The error rate of model training is further presented and discussed in Section 4.

Model Performance Metrics
To evaluate the forecasting accuracy of both LSSVMc and LSSVMi, three metrics were employed: the root mean square error (RMSE, Equation (13)), the mean absolute percentage error (MAPE, Equation (14)), and the index of agreement (d, Equation (15), by Willmott [26]). RMSE is a frequently used estimator of the difference between observations and model predictions. MAPE quantifies the ratio between the deviation and the observations and is thus scale independent. The index of agreement (d) was developed as a standardized measure of the model forecasting error and varies between 0 (no agreement at all) and 1 (perfect match). Suppose the water level observations are $\{X_{o1}, X_{o2}, \ldots, X_{on}\}$ and the corresponding model predictions are $\{X_{p1}, X_{p2}, \ldots, X_{pn}\}$, and let $\bar{X}_o$ be the mean value of the observed time sequence. All metrics are calculated as follows:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(X_{oi} - X_{pi}\right)^{2}} \qquad (13)$$
$$\mathrm{MAPE} = \frac{100}{n}\sum_{i=1}^{n}\left|\frac{X_{oi} - X_{pi}}{X_{oi}}\right| \qquad (14)$$
$$d = 1 - \frac{\sum_{i=1}^{n}\left(X_{oi} - X_{pi}\right)^{2}}{\sum_{i=1}^{n}\left(\left|X_{pi} - \bar{X}_o\right| + \left|X_{oi} - \bar{X}_o\right|\right)^{2}} \qquad (15)$$
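The three metrics can be computed with a few lines of Python. The sketch below follows the standard definitions of RMSE, MAPE (in percent) and Willmott's index of agreement. The function names are illustrative, and the observations are assumed to be non-zero so that MAPE is defined.

```python
import math


def rmse(obs, pred):
    """Root mean square error between observations and predictions."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))


def mape(obs, pred):
    """Mean absolute percentage error, in percent (observations must be non-zero)."""
    return 100.0 / len(obs) * sum(abs((o - p) / o) for o, p in zip(obs, pred))


def index_of_agreement(obs, pred):
    """Willmott's index of agreement d (1 = perfect match)."""
    mean_obs = sum(obs) / len(obs)
    num = sum((o - p) ** 2 for o, p in zip(obs, pred))
    den = sum((abs(p - mean_obs) + abs(o - mean_obs)) ** 2
              for o, p in zip(obs, pred))
    return 1.0 - num / den
```

For observations `[10.0, 20.0, 30.0]` and predictions `[12.0, 18.0, 33.0]`, the squared errors sum to 17, so RMSE is about 2.38, MAPE is about 13.3% and d is about 0.98.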

[Figure: water level [m] versus date, Jan 2010 to Jan 2016]

LSSVMi Forecasting of Daily Water Level
The water level forecasts of LSSVMi are presented for different stations together with field observations and LSSVMc predictions (Figures 4 and 5). The model forecasting is satisfactory overall. Some minor deviations were noted in June and October for the Shashi station, which is located downstream of the Three Gorges Dam and the Gezhouba Dam. This could be attributed to the joint operation of the multi-reservoir system, especially during the summer season when rainfall generally increases. In addition, the influence of the river confluence was evident, e.g., at Chenglingji, which is situated downstream of the Yangtze River-Dongting Lake confluence reach. The majority of the discrepancies between LSSVMi predictions and field observations appear in the summer season (e.g., June-August).

Model Performance Evaluation
The tuning parameters of LSSVMi are determined by cross-validation (Table 1). The model performance was then assessed using the three metrics introduced in Section 3.3, and the results are tabulated in Table 2. LSSVMi provides more accurate forecasts of the daily water level, although the improvement is generally moderate. Moreover, RMSE was calculated for the model training results and the comparison is shown in Figure 6. The model residuals are comparable for the training and testing stages, indicating that LSSVMi does not suffer from over-fitting.
Hydrological processes typically show seasonal fluctuations. RMSE was therefore calculated on a monthly basis and is presented in Figure 7. The forecasting accuracy is improved by LSSVMi. Similar temporal variation patterns are observed at Chenglingji, whereas the pattern at Jianli station is quite different.
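The cross-validation used above to select the tuning parameters can be sketched as a k-fold grid search. The number of folds, the use of contiguous (unshuffled) folds and the candidate grid are assumptions for illustration, as none of them is stated in the text. `fit_predict` is a hypothetical callable that trains a model on the training part and returns predictions for the validation inputs.

```python
def k_fold_slices(n_samples, k):
    """Split indices 0..n_samples-1 into k contiguous folds of near-equal size."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for f in range(k):
        size = base + (1 if f < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validate(fit_predict, X, y, params, k=5):
    """Mean squared validation error of one parameter set over k folds."""
    total, count = 0.0, 0
    for fold in k_fold_slices(len(X), k):
        held_out = set(fold)
        X_tr = [x for i, x in enumerate(X) if i not in held_out]
        y_tr = [t for i, t in enumerate(y) if i not in held_out]
        X_va = [X[i] for i in fold]
        preds = fit_predict(X_tr, y_tr, X_va, **params)
        total += sum((p - y[i]) ** 2 for p, i in zip(preds, fold))
        count += len(fold)
    return total / count


def grid_search(fit_predict, X, y, grid, k=5):
    """Return the parameter dict in `grid` with the lowest cross-validated error."""
    return min(grid, key=lambda params: cross_validate(fit_predict, X, y, params, k))
```

For an LSSVM-type model, `grid` would typically be a list of dictionaries such as `{"gamma": ..., "a": ...}`, and `fit_predict` would wrap model training and prediction for one such setting.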
The qualified rate, defined as the proportion of predicted values with a relative error below 20%, is widely used in practice [21]. For one-day-ahead forecasting of the daily water level, the qualified rate was calculated for both LSSVMi and LSSVMc (Table 3). The results show clear increases in the qualified rate for Yichang, Shashi and Jianli, while fully qualified forecasts are obtained for Chenglingji and Hankou.
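The qualified rate can be computed directly from paired observations and predictions. The sketch below assumes that the relative error is taken with respect to the observed value, which the text does not state explicitly. The function name is illustrative.

```python
def qualified_rate(obs, pred, tolerance=0.20):
    """Percentage of predictions whose relative error is strictly below `tolerance`.

    Assumption: relative error = |prediction - observation| / |observation|.
    """
    hits = sum(1 for o, p in zip(obs, pred) if abs(p - o) / abs(o) < tolerance)
    return 100.0 * hits / len(obs)
```

For observations `[10.0, 10.0, 10.0, 10.0]` and predictions `[11.0, 12.5, 9.0, 7.0]`, the relative errors are 10%, 25%, 10% and 30%, so the qualified rate is 50%.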

Influence of Forecast Lead Time
The characteristics of the model accuracy are further explored as the forecast lead time increases. Examples with different forecast lead times are presented for Jianli (Figure 8) using LSSVMi. The corresponding RMSE, MAPE and d values are compared in Table 4. The model accuracy is acceptable overall: although it decreases gradually as the forecast lead time increases, LSSVMi still yields relatively high accuracy. The decline also implies that the proposed LSSVMi model should be further improved to yield reliable and effective forecasts of the daily water level in the Yangtze River, for example by adopting alternative kernel functions (e.g., RBF: radial basis function) or integrated algorithms (e.g., Wavelet-LSSVM).
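One way to set up forecasts with different lead times is to pair a window of past water levels with the value observed a given number of days after the window. The sketch below is an assumption about the input structure, which is not described in the text. The function name, the window length `n_lags` and the parameter `lead_time` are illustrative.

```python
def make_lagged_samples(series, n_lags, lead_time):
    """Build (inputs, targets) pairs for direct forecasting at a given lead time.

    Each input is a window of `n_lags` consecutive values; the target is the
    value `lead_time` steps after the last value of the window.
    """
    X, y = [], []
    last_start = len(series) - n_lags - lead_time
    for start in range(last_start + 1):
        window_end = start + n_lags  # index just past the input window
        X.append(list(series[start:window_end]))
        y.append(series[window_end + lead_time - 1])
    return X, y
```

With `series = [1, 2, 3, 4, 5, 6]` and `n_lags=3`, a lead time of 1 pairs `[1, 2, 3]` with 4, whereas a lead time of 2 pairs `[1, 2, 3]` with 5. Longer lead times leave fewer samples and a weaker link between inputs and target, which is consistent with the gradual loss of accuracy reported above.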

Conclusions
Daily water level forecasting is of significant importance for maritime administration and water transport safety. The temporal and spatial variations of the daily water level have been recognized as non-linear and non-stationary processes, and least squares support vector machine (LSSVM) models have proved to be an effective tool for such problems. In the present study, an improved LSSVMi model was proposed through a bias error control scheme.
The performance of LSSVMi in short-term forecasting of the daily water level was evaluated and compared with the conventional LSSVMc model. Both models were trained using historical hydrological data (2010-2015) to provide forecasts for 2016. The results yielded by the LSSVMi model are generally satisfactory, although the precision is inevitably affected by seasonality and forecast lead time. Meanwhile, the influence of the joint operation of the multi-reservoir system and of the river confluence was noted at the Shashi and Chenglingji stations, respectively. Although the forecasting accuracy decreases gradually as the forecast lead time increases, it is improved most of the time by LSSVMi. The present study indicates the capability and flexibility of LSSVM-type models in resolving time series problems. LSSVMi proves to be a promising alternative for daily water level forecasting in the Yangtze River (China), while optimization of the forecast extrapolation and error control scheme is still required in future research.