A Hybrid Method for the Run-Of-The-River Hydroelectric Power Plant Energy Forecast: HYPE Hydrological Model and Neural Network

Abstract: The increasing penetration of non-programmable renewable energy sources (RES) is reinforcing the need for accurate power production forecasts. Among hydroelectric plants, Run-of-the-River (RoR) plants belong to the class of non-programmable RES. Data-driven models are nowadays the most widely adopted methodologies in hydropower forecasting. Among them, the Artificial Neural Network (ANN) has proved highly successful in production forecasting. Widely adopted and equally important for hydropower generation forecasting is the HYdrological Predictions for the Environment (HYPE) model, a semi-distributed hydrological Rainfall–Runoff model. A novel hybrid method, which provides the HYPE sub-basin flow computation as input to an ANN, is introduced here and tested both with and without a decomposition approach. In the former case, two ANNs are trained to forecast the trend and the residual of the production, respectively; these are then summed to the previously extracted seasonality component to obtain the power forecast. The results are compared with those obtained from an ANN with rainfall as input, again with and without the decomposition approach. The methods are assessed by forecasting the energy of Run-of-the-River hydroelectric power plants for the year 2017. In addition, the output forecasts of 15 power plants are compared on an equal basis in order to identify the most accurate forecasting technique. The proposed hybrid method (HYPE and ANN) proved to be the most accurate in all the considered study cases.


Introduction
In a context of increasing penetration of renewable energy plants, accurate and reliable energy forecasts of wind/solar/hydro power and electric loads are required by diverse types of end users (e.g., utilities, TSOs, energy traders, producers), to be implemented for different time horizons, depending on the specific application, to significantly improve their profitability [1]. Hydropower is the largest source of renewable electricity in the world, generating 16.4% of the world's electricity production and providing 71% of renewable energy delivered to the grid, for a total capacity of 1200 GW installed power in 2016 [2].
Hydropower plants, and Run-of-the-River plants in particular, show a seasonal pattern in their production time series [26]. Therefore, a decomposition approach is often investigated to overcome the difficulties in forecasting hydropower production: in the so-called "decomposition and ensemble" approach [24], the different components of the time series are separated, forecast individually, and subsequently recomposed to obtain the final prediction. In [27], it is shown how an ensemble empirical mode decomposition, hybridized with an ARIMA approach, improves the ARIMA forecast performance. The authors of [28] analyze the effect of detrending and deseasonalization on neural network performance, highlighting both the difficulty ANNs have in handling different components of the data simultaneously and the beneficial effect of time series preprocessing. In [29], alternatives to the classical moving average decomposition for seasonality extraction, to be coupled with simple forecasting techniques, are proposed. Further analyses of decomposition approaches are presented in [24,30].
In the present paper, a new hybrid method for hydropower production forecasting is introduced. The proposed approach combines a HYPE model, which relates weather forecasts to river flows, with an ANN. Additionally, the model is validated considering a decomposition of the time series into three main components: trend, seasonality, and residual. The ANN with the whole time series as input is taken as the benchmark. Intermediate combinations (ANN with time series decomposition, and the hybrid method without decomposition) are also evaluated to highlight the contributions of the two proposed approaches.
The paper is outlined as follows. Section 2 presents the available dataset and the method developed to analyze it and improve the power generation forecast. Section 3 reports the results obtained from the implementation of the proposed method, in comparison to a traditional ANN forecast, and discusses the results while Section 4 sums up the work performed.

Materials and Methods
This work aims at evaluating the forecasting performance of the proposed hybrid method, composed of a HYPE hydrological model and an ANN. The model is applied to the power production time series decomposed into trend, residual, and seasonality. This approach (Hybrid plus Decomposition) is then compared with three benchmark models: Hybrid, ANN, and ANN plus Decomposition.
This section is structured as follows. At first, the available dataset for the proposed method validation is presented and described. In the second part, the error metrics adopted to evaluate the performance of the investigated methods are discussed. Finally, in the last subsection the novel hybrid method for RoR hydroelectric production forecast is detailed.

Available Dataset
The dataset adopted to validate the proposed method refers to Slovenia and consists of two parts: flow data and measured hydroelectric production, covering the period from January 2010 to December 2017.
Slovenia is modeled by dividing it into 86 sub-basins. The conditions and geographic features of each sub-basin are exploited by the HYPE model to return the inflow of the drainage basin (an area of land where precipitation collects and drains off into a common outlet, such as a river, bay, or other body of water). In our specific case, only precipitation data are provided to the HYPE model, since rain is the main driver of hydropower generation. Daily rainfall data, measured in [mm], have been collected from three Slovenian meteorological stations. Climate data are taken from the "Copernicus ERA5" meteorological database and are in the public domain. For each sub-basin, geographical data are provided and a basin map based on the sequentiality of sub-basins has been drawn: in particular, three groups have been identified, and these groups of points represent the river basins of the three main Slovenian watercourses: River 1, River 2, and River 3, shown in Figure 1.
The second part of the dataset consists of the hydroelectric energy produced [MWh] from 2010 to 2017 and the Hydro-States, which report the status of each plant (working or not). The final purpose of the analysis is to predict the electric production of hydroelectric power plants, and specifically of large power plants (rated power P_el > 20 MW). The dataset is cleaned by discarding those plants with zero production for more than 10% of the total number of samples: three plants have been excluded, leaving a total of 15 RoR plants distributed over three rivers: 3 plants on River 1, 5 on River 2, and 7 on River 3. This preprocessing aims at testing the proposed method on a significant set of data, with few missing values, in order to evaluate its performance.
The null power production observed in a plant could be due to human regulation rather than a lack of minimal flow in the river, which decouples the production from the rainfall or Correlated Sub-Basin (CSB) flow forecast. These cases cannot be predicted by the algorithm, since they do not derive from a physical process, and we have therefore decided to exclude them.
Production data are converted to a daily basis so as to have the same time step as the flow data. To validate the proposed method, all the available plants have been studied.
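The cleaning rule described above (discarding plants with zero production for more than 10% of the samples) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the plant names and production values are invented, and the function names are hypothetical rather than part of the original workflow.

```python
def zero_fraction(series):
    """Share of samples in which the plant produced nothing."""
    return sum(1 for value in series if value == 0) / len(series)


def filter_plants(production, max_zero_fraction=0.10):
    """Keep the plants whose zero-production share does not exceed the threshold."""
    return {
        name: series
        for name, series in production.items()
        if zero_fraction(series) <= max_zero_fraction
    }


# Invented daily production values [MWh], for illustration only.
production = {
    "plant_a": [5.0, 6.1, 0.0, 5.8, 6.3, 6.0, 5.5, 5.9, 6.2, 6.4],  # 10% zeros: kept
    "plant_b": [0.0, 0.0, 3.2, 0.0, 3.1, 2.9, 3.0, 3.3, 3.1, 3.2],  # 30% zeros: dropped
}
kept = filter_plants(production)
print(sorted(kept))
```

A plant sitting exactly at the 10% limit is retained here, since the text excludes only periods "greater than" 10%.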

Performance Measurement
In the present work, a series of indicators has been defined to evaluate the goodness of the prediction and to address different issues. Among the existing performance metrics, we selected the normalized Mean Absolute Error (nMAE [31]) (1) to evaluate the overall performance, where N is the total number of data samples; C is the maximum power measured over the period under analysis; and W_fore,k and W_obs,k are, respectively, the forecast power and the measured (observed) power at each time point k. In addition, two further metrics have been introduced. The Nash-Sutcliffe Efficiency Index E_f (2) is commonly used to assess the performance of rainfall-runoff models [32]; it varies in the range (−∞, 1], where unity is obtained in the case of a perfect forecast. The last metric considered is the MASE (Mean Absolute Scaled Error) (3), which is little influenced by outliers [22].
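As a reading aid, the three metrics can be sketched in Python using their commonly adopted textbook definitions. These are assumptions, since the exact expressions (1) to (3) are not reproduced in this text: the nMAE is returned as a fraction (multiply by 100 for a percentage), and the MASE is scaled by the mean absolute error of a one-step persistence forecast on the same observed series, which is one common convention and may differ from the one actually used.

```python
def nmae(forecast, observed, capacity):
    """Mean absolute error normalized by the maximum measured power C."""
    n = len(observed)
    return sum(abs(f - o) for f, o in zip(forecast, observed)) / (n * capacity)


def nash_sutcliffe(forecast, observed):
    """Nash-Sutcliffe efficiency: 1 for a perfect forecast, unbounded below."""
    mean_obs = sum(observed) / len(observed)
    squared_error = sum((o - f) ** 2 for f, o in zip(forecast, observed))
    variance_sum = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - squared_error / variance_sum


def mase(forecast, observed):
    """MAE scaled by the MAE of a one-step persistence forecast (assumed scaling)."""
    n = len(observed)
    mae = sum(abs(f - o) for f, o in zip(forecast, observed)) / n
    naive_mae = sum(abs(observed[k] - observed[k - 1]) for k in range(1, n)) / (n - 1)
    return mae / naive_mae


# Invented values, for illustration only.
observed = [10.0, 12.0, 14.0, 16.0]
forecast = [11.0, 12.0, 13.0, 18.0]
print(nmae(forecast, observed, capacity=max(observed)))
print(nash_sutcliffe(forecast, observed))
print(mase(forecast, observed))
```

With this scaling, a MASE below 1 means the forecast beats simple persistence on average.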

HYPE Model and Hybrid Forecast Method
The Hydrological Predictions for the Environment (HYPE) model is a dynamic, semi-distributed, and process-based model built on well-known hydrological and nutrient transport concepts. Developed at the Swedish Meteorological and Hydrological Institute during the period 2005-2007 [33], it can be used for both small- and large-scale assessments of water resources and water quality. In HYPE model applications that simulate water flows [34][35][36], the model domain may be divided into sub-basins, which can be either independent or connected by rivers and a regional groundwater flow, as exemplified in Figure 2. The model receives climate variables as input, rainfall and ambient temperature being among the most important, and returns the sub-basins' water flow with a daily time step as output. An overview of the typical input data for the HYPE model is provided in [33], while [37] reviews the input parameters that most influence the performance of the model. The HYPE model is particularly appropriate for simulating ungauged catchments, is built on simple conceptual and empirical equations, and represents one of the best options for large-scale continental or multi-basin simulations [9]. Therefore, the HYPE model was selected as a component of the hybrid method presented here.
Figure 3 displays the adopted method and the process required to train the hybrid network. The yellow parallelograms highlight the input/output datasets obtained, while the light blue rectangles highlight the processes performed. First, an analysis is conducted to identify the most suitable inputs to be fed into the neural network. A correlation analysis is therefore performed to select the sub-basins highly correlated with the considered plant (Correlated Sub-Basins, CSB). The rainfall forecast associated with the identified sub-basins is then fed into the HYPE model, which provides the CSB outflow forecast. In parallel, the production time series of the considered plant is decomposed into its Trend, Seasonality*, and Residual components. The seasonality term is used later in the forecast process (Figure 4). The trend and residual components, together with the past production of the considered plant and the CSB flow forecast, are exploited first to perform the hyperparameter optimization that sizes the Trend and Residual ANNs, respectively, and then to train the networks themselves.
Once the proper network structure is identified, the forecast is performed according to the scheme reported in Figure 4. The CSB flow forecast and the real production of the past six days are exploited to obtain the trend forecast. This forecast is then fed, together with the previous inputs, into the residual ANN to obtain the residual forecast. The resulting production components are then summed to the seasonality profile to obtain the production forecast. The results are compared with those of the ANN (without decomposition), which receives as input the same elements as the HYPE model, i.e., the input weather data from national databases. Additional configurations are considered, namely Hybrid (without decomposition) and ANN with decomposition, to highlight the contributions of the decomposition and of the hybrid network, respectively.
Analyzing the process in detail, an analysis of the correlation factors is first conducted in order to establish the input-output connection and to identify the proper input layer for the ANN. The aim is to understand how the energy production is linked with the discharge flow rate of all basins, in order to find the groups of sub-basins that are most relevant for the production; once these groups are identified, only the data of the highly correlated sub-basins are used to feed the neural network.
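The forecast chain of Figure 4 (trend ANN first, then residual ANN fed with the trend forecast, then recomposition with the seasonality profile) can be sketched as follows. The two networks are replaced by placeholder callables, and the function name, argument names, and numerical values are hypothetical; the sketch only illustrates the order in which the inputs and outputs are combined.

```python
def forecast_day(csb_flow, past_production, seasonal_profile, day_index,
                 trend_net, residual_net):
    """One-day production forecast following the trend -> residual -> sum chain."""
    base_inputs = list(csb_flow) + list(past_production)
    trend_hat = trend_net(base_inputs)
    # The residual network sees the same inputs plus the trend forecast.
    residual_hat = residual_net(base_inputs + [trend_hat])
    seasonal = seasonal_profile[day_index % len(seasonal_profile)]
    return trend_hat + residual_hat + seasonal


# Placeholder "networks" (assumptions): a six-day mean and a constant correction.
def toy_trend_net(inputs):
    return sum(inputs[-6:]) / 6


def toy_residual_net(inputs):
    return 0.25


prediction = forecast_day(
    csb_flow=[30.0, 12.0],                             # invented CSB flow forecast
    past_production=[10.0, 11.0, 12.0, 11.0, 10.0, 12.0],  # past six days
    seasonal_profile=[0.5, 0.2, 0.1, 0.0, -0.1, -0.2, -0.5],
    day_index=6,
    trend_net=toy_trend_net,
    residual_net=toy_residual_net,
)
print(prediction)
```

In a real setting the two callables would be the trained Trend and Residual ANNs; everything else in the chain stays the same.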
The correlation coefficient ρ (4) measures the strength of the linear relationship between two variables x and y. A value of ρ near zero indicates the absence of a linear relationship. Generally, the correlation between two variables is considered strong when 0.8 ≤ ρ ≤ 1, weak when 0 ≤ ρ ≤ 0.5, and moderate otherwise [38]. The analysis made it possible to associate with each power plant a list of the sub-basins that contribute most, from the correlation point of view, to the hydroelectric generation, and to identify the most correlated plants. In particular, it emerges that the plants on River 3 are highly correlated with very few sub-basins, which largely correspond to the basins contributing to River 3 from the sequentiality point of view. Instead, the plants on River 1 and River 2 are both highly correlated with many more sub-basins; in particular, they have many correlated sub-basins in common, which do not correspond to those identified by the sequentiality logic.
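The selection of the correlated sub-basins can be illustrated with a short Python sketch. It assumes that ρ is the standard Pearson coefficient and uses the ρ > 0.5 selection threshold quoted in the results; the sub-basin names and data are invented, and the helper names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson linear correlation coefficient between two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


def select_csb(flows, production, threshold=0.5):
    """Names of the sub-basins whose flow correlates with production above the threshold."""
    return [name for name, flow in flows.items() if pearson(flow, production) > threshold]


# Invented daily values, for illustration only.
production = [1.0, 2.0, 3.0, 4.0, 5.0]
flows = {
    "sub_basin_01": [2.0, 4.0, 6.0, 8.0, 10.0],  # rho = 1.0
    "sub_basin_02": [5.0, 1.0, 4.0, 2.0, 3.0],   # rho = -0.3
}
correlated = select_csb(flows, production)
print(correlated)
```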
A decomposition procedure is then introduced to separate the different contributions of the series. Decomposition is used in time series analysis to describe the trend and seasonal factors in a time series. One of the main objectives of a decomposition is to estimate seasonal effects that can be used to create and present seasonally adjusted values [39]. One of the strengths of ANNs is their ability to infer nonlinear relationships between input and output [40], as they rely on nonlinear activation functions; therefore, a simple correlation analysis and a moving average decomposition approach [41] have been adopted, the latter reported in (5), where the observed series is expressed as the sum of trend, seasonality, and residual.
The trend is detected by applying an asymmetrical moving average, with a moving window covering the past N days; a methodology to find the best number of past days for the moving average is therefore investigated. A statistical approach exploiting confidence intervals is adopted. In particular, an adaptive confidence band is implemented, where the band is time-variant and is computed on the basis of the N past days. The objective is to find the values of N for which 90% of the real data are contained in the confidence band and, among them, the one that guarantees the narrowest band. First, the trend is extracted by taking the mean of the N past samples for each day D of the dataset (6), where Observed is the real production dataset and D is the sample index (from 1 to 366). Subsequently, the error is computed, defined as the difference between the real production and the extracted trend, evaluated from the preceding N days up to day D (7). Finally, the confidence band amplitude is obtained on the basis of the standard deviation σ of the error, which is proportional to the error committed in the past N days; C is a multiplicative factor that guarantees the inclusion of 90% of the real data in the confidence band (8). For the sake of simplicity, the number N of preceding days is varied from 3 to 8.
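One possible reading of this window-selection procedure is sketched below. Several details are assumptions, because expressions (6) to (8) are not reproduced in this text: the trend on day D is the mean of the N preceding samples, σ is the population standard deviation of the errors on the N preceding days, the band is trend ± C·σ, and C is the smallest factor covering 90% of the days. The synthetic series and all names are hypothetical.

```python
import math
import random
import statistics


def band_for_window(observed, n, coverage=0.90):
    """Return (C, mean band width) for a moving window of n past days."""
    days = range(n, len(observed))
    trend = {d: sum(observed[d - n:d]) / n for d in days}
    error = {d: observed[d] - trend[d] for d in days}
    rows = []
    for d in range(2 * n, len(observed)):
        sigma = statistics.pstdev([error[k] for k in range(d - n, d)])
        if sigma > 0:
            rows.append((abs(error[d]), sigma))
    # Smallest C such that |error| <= C * sigma on the requested share of days.
    ratios = sorted(e / s for e, s in rows)
    c = ratios[math.ceil(coverage * len(ratios)) - 1]
    mean_width = 2 * c * sum(s for _, s in rows) / len(rows)
    return c, mean_width


# Synthetic production with a weekly pattern and noise (illustration only).
rng = random.Random(0)
observed = [50 + 10 * math.sin(2 * math.pi * d / 7) + rng.gauss(0, 2) for d in range(366)]

results = {n: band_for_window(observed, n) for n in range(3, 9)}
best_n = min(results, key=lambda n: results[n][1])
print(best_n, results[best_n])
```

The window retained is the one with the narrowest average band; on real production data the outcome depends on the series, so this synthetic example is not meant to reproduce the value adopted in the paper.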
The number of samples to be included in the moving average has been set to 6 days, according to the sensitivity analysis performed (Figure 5). The difference between the original time series (Observed_i) and the new trend function (Trend_i) is called Ripple (9), and it contains the sum of the seasonal and residual components.
The ripple dataset is then divided into as many subsets as the width of the moving-average window, 7 in our case (the 6 previous days plus the day to be forecast), forming 7 datasets that each contain the ripple component of one day of the week over the whole dataset, as exemplified in (10).
To obtain the seasonal component, the arithmetic mean of each Ripple subset is computed (11). The seasonality dataset is thus made of 7 terms repeated periodically, each being the average deviation between the real production and the trend.
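The steps just described (six-day trailing moving average, ripple, seven ripple subsets, and their means as the seasonal profile) can be written out by hand as a small Python sketch. It is a plain re-implementation of the described procedure for illustration, under the assumption that the trend on each day is the mean of the six preceding days; the synthetic series and the function name are hypothetical.

```python
import math


def decompose(observed, window=6, period=7):
    """Additive decomposition: observed = trend + seasonal + residual."""
    days = range(window, len(observed))
    # Trend: mean of the `window` samples preceding each day.
    trend = [sum(observed[d - window:d]) / window for d in days]
    # Ripple: what is left after removing the trend.
    ripple = [observed[d] - t for d, t in zip(days, trend)]
    # One subset per position in the weekly cycle, then its arithmetic mean.
    profile = [sum(ripple[p::period]) / len(ripple[p::period]) for p in range(period)]
    seasonal = [profile[i % period] for i in range(len(ripple))]
    residual = [r - s for r, s in zip(ripple, seasonal)]
    return trend, seasonal, residual


# Synthetic series with a weekly cycle and a slow drift (illustration only).
observed = [20 + 3 * math.sin(2 * math.pi * d / 7) + 0.01 * d for d in range(70)]
trend, seasonal, residual = decompose(observed)
print(seasonal[:7])
```

The first `window` days have no trend value, so the three returned components are aligned with `observed[window:]`.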
It is important to underline that "seasonal" refers to a "statistical seasonality". Figure 6 shows the seasonal component of three power plants on the three different rivers. The plant on River 3 is characterized by a seasonality profile with a more pronounced amplitude than the other two; it is therefore reasonable to expect that this plant will benefit more than the other two from a decomposition approach. The remaining component is the residual, which contains information about the "irregularities" found in the time series decomposition. Figure 7 shows graphically the extraction of the three components from the observed production of a selected plant. It is evident that the trend is characterized by a smoother profile, with less noise, while the residual absorbs all the "irregularities". The seasonal component is used to analyze and understand the dynamics of the electric production and the behavior of the river on which the hydro plant lies. The decomposition described above is applied by means of the Python function seasonal_decompose, imported from the "statsmodels" library (which is taken from the "stats" package of the R language).
On top of the obtained decomposition, the training of the ANN can be set up. The proposed novel hybrid approach is characterized by two ANNs whose output layers are the trend in one case and the residual in the other; their outputs are then summed to the previously extracted seasonality component to recompose the function, according to (5).
A preliminary analysis is conducted in order to assess the order of magnitude of the main parameters of the network: the size of the training set, the number of hidden layers, and the number of neurons in each hidden layer [42,43]. A hyperparameter optimization is performed by varying the training set size and the number of neurons, selecting the combination of parameters that minimizes the MAE. The number of neurons is varied in the range [2, 40] with a step of 2.
The three models selected for the comparison also underwent a hyperparameter optimization to properly design their structure.
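The hyperparameter optimization can be pictured as an exhaustive search over the two parameters. In the sketch below only the neuron range [2, 40] with step 2 comes from the text; the candidate training set sizes, the scoring function, and all names are hypothetical stand-ins for training a network and measuring its validation MAE.

```python
from itertools import product


def grid_search(evaluate, training_sizes, neuron_counts=range(2, 41, 2)):
    """Return (best MAE, training set size, neurons) over the full grid."""
    best = None
    for tss, neurons in product(training_sizes, neuron_counts):
        score = evaluate(tss, neurons)
        if best is None or score < best[0]:
            best = (score, tss, neurons)
    return best


def toy_mae(tss, neurons):
    """Stand-in for 'train an ANN and return its validation MAE' (assumption)."""
    return abs(tss - 100) / 100 + abs(neurons - 4) / 10


best_mae, best_tss, best_neurons = grid_search(toy_mae, training_sizes=[50, 100, 150, 200])
print(best_mae, best_tss, best_neurons)
```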

Results and Discussion
The analysis is carried out on the selected hydro plants and tested over the period from 1 January 2017 to 31 December 2017. The networks are trained on the days of the year 2016. The observed power production of the six days preceding the day to be forecast is provided as input to the networks (moving window approach).
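The moving window approach amounts to building, for each day, an input vector from the exogenous data of that day and the production of the six preceding days. A minimal sketch follows, in which the exogenous values stand in for the CSB flow forecast (or the rainfall, in the benchmark ANN); the data and the function name are hypothetical.

```python
def build_samples(production, exogenous, lags=6):
    """Pair each day's target with its exogenous inputs and the `lags` previous days."""
    inputs, targets = [], []
    for d in range(lags, len(production)):
        inputs.append(list(exogenous[d]) + list(production[d - lags:d]))
        targets.append(production[d])
    return inputs, targets


# Invented series: production equals the day index, one exogenous value per day.
production = [float(d) for d in range(10)]
exogenous = [[100.0 + d] for d in range(10)]

inputs, targets = build_samples(production, exogenous)
print(inputs[0], targets[0])
```

The first six days cannot be used as targets, because their window of preceding production values is incomplete.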
Based on the analysis of the correlation factors, it has been decided to provide as input to the neural network only the sub-basins highly correlated with the chosen plant (ρ > 0.5), in order to discard the basins scarcely correlated with the plant production. The hyperparameter optimization identifies the optimal neural network configuration for carrying out the forecast. Table 1 summarizes the results of the analysis and the parameter settings adopted: the number of inputs is a function of the number of CSB identified. Before the training process starts, the network automatically normalizes the inputs, to speed up the simulation and reduce the computational burden. The activation function implemented in the ANN is the tan-sigmoid. It can be noticed that the trend forecast requires a very low number of neurons and a short training set, just 50 days, owing to its smoother profile. The residual forecast relies on a larger training set size (TSS) of 100 days and on a two-layer network, each layer made of four neurons. Figure 8 shows the structure of the Residual ANN of plant 2, River 3: it takes as input the CSB flow (CSB xy), the production of the previous six days (W obs xy), and the trend forecast (P trend).
To obtain the overall forecast production, the trend and residual forecasts are summed to the seasonality extracted in the decomposition procedure. Figure 9 reports the three phases of the decomposition approach: the trend forecast (Figure 9a) and the residual forecast (Figure 9b), which, once summed to the extracted seasonality, give the overall production forecast (Figure 9c). As can be observed, the network is able to predict the trend behavior quite accurately, while the residual forecast is less precise and aims at detecting sudden increases or decreases in energy production, for instance, around day 120 (positive and negative peaks) or between days 200 and 250 (negative peaks).
In the recomposed forecast (Figure 9c), the majority of the peaks, which are the most difficult to predict, are now well captured; the negative peaks up to day 120 are well represented thanks to the seasonality contribution, as they occur on Sundays. RoR plants can have a small storage capacity (pondage), which allows a limited amount of power regulation: the producer can generate more electric power when electricity prices are higher (mid-week) and less when prices are lower (weekend). The objective of the seasonal component is to detect these periodic recurrences and adjust the forecast. To evaluate the performance improvement associated with the hybrid forecast methodology (HYPE + ANN) with decomposition, three alternative configurations are considered: a Hybrid method (without decomposition) and an ANN with the same inputs as the HYPE model (rainfall precipitation), without and with time series component extraction. Table 2 reports the optimal parameter settings resulting from the hyperparameter optimization. Instead of the HYPE output, the ANNs receive as input the precipitation measured by three climate stations located in the region, in [mm], in addition to the past real production. This type of input layer has been selected after the hyperparameter optimization, as it proved to lead to higher performance in these specific configurations.
Table 3 displays the results obtained for the proposed hybrid method and for the three models taken as benchmarks. As an example, one plant for each considered river is reported here; the results for the other plants under analysis can be found in Appendix A. The investigated hybrid methods (Hybrid and Hybrid with decomposition) show the highest performance (green or olive color) in all the considered metrics and plants. Indeed, in all the cases analyzed, the traditional neural network (ANN), with rainfall as input, shows the worst performance (orange or magenta color). The hybrid method without decomposition provides better results than the decomposition approach in nine plants out of fifteen. The hybrid approach could therefore be a valuable alternative to the ANN method for improving forecast performance, even though it requires the implementation of a hydrological model, the HYPE.
Table 3. Performance comparison of the proposed method and the three benchmarks considered (in order: Hybrid with decomposition, Hybrid, ANN, and ANN with decomposition) for three of the analyzed plants: (a) plant 1 on River 1, (b) plant 1 on River 2, and (c) plant 1 on River 3. Color legend from the best to the worst result: green, olive, orange, and magenta.
Table 4 reports the aggregated performance, in terms of mean and standard deviation, per river ((4a) to (4f)) and over all the considered plants ((4g), (4h)). In all the analyzed plants, the hybrid method performed better than the ANN method: considering the nMAE metric, for example, the error is more than halved when moving from the ANN to the hybrid model. The adoption of the decomposed hybrid approach seems to be beneficial only for the plants located on River 3, while in all the other cases the hybrid approach without decomposition achieved the highest forecast accuracy.
Recalling Figure 6, the plants on River 3 present, on average, a wider seasonality profile than the plants on the other rivers: this could explain the different performance of the proposed method on the analyzed plants. Taking into account the standard deviation of the considered performance metrics, in most cases the hybrid method is characterized by a smaller variability than the ANN.
Figure 10 reports the scatter plots for the two approaches, hybrid with decomposition (orange dots) and ANN model (blue diamonds), together with the associated regression lines (same colors) and the R² indicator, to evaluate the degree of accuracy of the forecast over the whole year. The ideal regression line, corresponding to a perfect forecast, is drawn in black. The x-axis reports the measured production, while the y-axis reports the forecast production. R² measures the goodness of the prediction: it reaches a maximum value of 1, and the closer it is to 1, the more precise the prediction can be considered. The most distant points are always under the black line, in the right part of the chart; therefore, the most relevant errors are made when the electric production is high, as highlighted by the greater dispersion of the points in the top right. Instead, when production is low, as in the left part of the chart, the forecast reaches high levels of accuracy: for smaller values of measured power, a lower dispersion of the points around the black line can be observed. The larger errors at high production may be due to the small number of observations of very high production. In plant 1, River 1 (Figure 10a) and plant 1, River 3 (Figure 10c), the regression lines of the ANN and hybrid models almost coincide, while in plant 1, River 2 (Figure 10b) there is a small discrepancy in favor of the hybrid method. Focusing on the coefficient of determination R², the hybrid model with decomposition explains the variability in the measured data better than the ANN approach. The improvement is particularly significant for plant 1, River 3, where the R² index doubles in the hybrid approach with respect to the benchmark case.
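The regression line and the R² indicator of such a scatter plot can be computed as in the sketch below. It assumes that R² is the coefficient of determination of the least-squares line fitted to the (measured, forecast) pairs, which is the usual convention for scatter plots but is not stated explicitly in the text; the data and function names are hypothetical.

```python
def linear_fit(x, y):
    """Least-squares slope and intercept of y against x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(x, y):
    """Coefficient of determination of the fitted regression line."""
    slope, intercept = linear_fit(x, y)
    mean_y = sum(y) / len(y)
    ss_res = sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return 1.0 - ss_res / ss_tot


# Invented measured vs. forecast production, for illustration only.
measured = [10.0, 20.0, 30.0, 40.0]
forecast = [12.0, 18.0, 33.0, 37.0]
slope, intercept = linear_fit(measured, forecast)
print(slope, intercept, r_squared(measured, forecast))
```

Under this definition, R² measures the scatter around the fitted line only. A systematically biased forecast can still score close to 1, which is why the fitted line is also compared with the ideal black line of unit slope.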

Conclusions
In this work, different approaches for daily hydroelectric production forecasting, applied to fifteen RoR plants along three different rivers, have been investigated. In particular, the effect on forecast accuracy of different input data to the artificial neural network has been analyzed: sub-basin flow data derived from a hydrological model (HYPE) and precipitation data from three climate stations have been alternatively provided to a neural network in order to establish which is more suitable for hydro-production forecasting. Furthermore, the effect of decomposing the production into its three components (trend, seasonality, and residual) has been analyzed. The hybrid approach, where precipitation is processed by the HYPE model to provide sub-basin flows as input to the ANN, outperformed the ANN model, both with and without decomposition. In addition, the hybrid forecast performed better than the decomposed hybrid method in most of the analyzed plants.
Even though the presented case study concerns RoR power plants, the method could be applied to other hydroelectric power plants characterized by the same features (seasonality, strong dependence on rainfall and sub-basin flows, etc.).

Conflicts of Interest:
The authors declare no conflicts of interest.

Abbreviations
The following abbreviations are used in this manuscript: