Evaluation of Evaporation from Water Reservoirs in Local Conditions at Czech Republic

Abstract: Evaporation is an important factor in the overall hydrological balance. It is usually derived as the difference between runoff, precipitation and the change in water storage in a catchment. The magnitude of actual evaporation is determined by the quantity of available water and heavily influenced by climatic and meteorological factors. Currently, statistical methods such as linear regression, random forest regression and other machine learning methods are available to calculate evaporation. However, deriving these relationships requires observations from evaporation stations. In the present study, linear regression and random forest regression were used to calculate evaporation, with some of the models designed manually and the others built by stepwise regression. Observed data from 24 evaporation stations and ERA5-Land climate reanalysis data were used to create the regression models. The proposed regression formulas were tested on 33 water reservoirs. The results show that manual regression is a more appropriate method for calculating evaporation than stepwise regression, with the caveat that it is more time consuming. Linear and random forest regression differ in the variance of their results; random forest regression is better able to fit the observed data. On the other hand, the results of linear regression are simpler to interpret. The study showed that ERA5-Land reanalysis products, used with the random forest regression method, are suitable for calculating evaporation from water reservoirs in the conditions of the Czech Republic.


Introduction
Water management, changes in the natural water regime and sustainable landscape have become important topics of social discussion and policy, not only in the Czech Republic but around the world [1]. Global and local climatic conditions are clearly changing and will affect the water management sector, so they deserve the highest attention. Evaporation in the Czech Republic is also changing [2].
However, not only the climatic conditions are changing, but also the technology and knowledge that can be used in water management and specifically in hydrology. With the rapid development of remote sensing tools over recent decades, easy-to-use, high-quality products have become available to both professionals and the public in the field of water resources.
In recent years, there has been significant growth in the supply of Earth remote sensing information usable in water management, and not only for professionals [3-5]. Another option is, for example, the use of globally available climate reanalyses or other available data sources. Despite the development of data availability and modelling tools, a question arises: How significant is the impact of the ongoing cli-e.g., Least-Squares Spectral Analysis, Least-Squares Wavelet Analysis, Least-Squares Cross Wavelet Analysis [29].
Other evaluation methods may include parametric and non-parametric trend tests, which are also used in machine learning [30,31]. Parametric methods (logistic regression, linear discriminant analysis and simple neural networks) use a fixed number of parameters to build models and require fewer variables, and their results may be affected by outliers. Non-parametric methods (Mann-Kendall, Spearman's Rho and k-nearest neighbors) use a flexible number of parameters and can use both variables and attributes in the models, and their results are not affected by outliers.
In this paper, we explore relationships for calculating evaporation from the water surface in the Czech Republic using reanalyzed climate data, linear models (LM) and random forest models (RFM). Evaporation estimated from the derived models was compared with observed evaporation from evaporation stations. Finally, the derived relationships were applied to the selected water reservoirs.
Specifically, we aim to answer the following questions: Which statistical method for calculating evaporation performs better, linear regression or random forest regression? How many variables are important for determining the formula for calculating evaporation? How important is geomorphological information (elevation and location) for calculating evaporation using linear and non-linear models? The main objective of estimating evaporation from the water surface was to derive a universal relationship for the whole territory of the Czech Republic.
This paper is structured as follows: Section 2 introduces the area of interest and the input data. The statistical methods for evaluating evaporation, assessed with respect to goodness-of-fit (GOF) in the R environment [32], are described in Section 3. The results and discussion are in Section 4, along with a detailed evaluation of the goodness-of-fit (GOF) of the regressions for evaporation stations and subsequently for water reservoirs. The paper is concluded in Section 5.

Study Area and Data
The study area is defined by the state border of the Czech Republic. Within the region (51°03′ N to 48°33′ N latitude and 12°05′ E to 18°51′ E longitude), the long-term (1981-2010) mean annual precipitation total is 709.5 mm, the mean annual air temperature is 7.9 °C and the mean runoff is 205.5 mm [33]; the long-term runoff coefficient is thus 0.29 (29% of the precipitation total runs off).
[Figure 1]
Figure 2 shows the selected 24 evaporation stations and 33 water reservoirs. The evaporation stations were assigned to water reservoirs based on the Quitt classification and the elevation [35]. The elevation differences between the evaporation stations and the water reservoirs do not exceed 100 m. The Quitt classification divides the Czech Republic into three climatic regions (cold, moderately warm and warm), and a reservoir was always assigned an evaporation station in the same climatic region. The observed evaporation at the evaporation stations was recorded between 1957 and 2019 (most stations have records from 2005 onward). The data from the evaporimeter (EWM) were provided by the Czech Hydrometeorological Institute and Palivový kombinát Ústí, state-owned enterprise. The T. G. Masaryk Water Research Institute, public research institution (TGM WRI, p.r.i.) provided data from the floating evaporator and data from the evaporation station Hlasivo.
Observed data from the evaporation stations were aggregated to a monthly step and then used to evaluate evaporation from the water reservoir surface, because the measured daily values are affected by random error [36]. The observed evaporation (May-October) is from 459 [mm·year
The relationships for calculating evaporation from the water surface were developed using linear and nonlinear regression. Measured evaporation from the evaporation stations serves as the dependent variable. ERA5-Land climate reanalysis data from 1981 to 2019 were used for the independent variables.

Climate Reanalysis
The purpose of the reanalysis is to provide an estimate of quantities describing atmospheric, climatic and hydropedological processes and behavior of oceans with global coverage and relatively high spatiotemporal resolution.
The reanalyses are outputs of various models, usually including a hydrological, atmospheric and ocean model and a model of the Earth's surface. Their advantage is that they provide multidimensional, spatially complete and coherent information about the global circulation and hydroclimatic quantities. Climate reanalyses are generated in a similar manner to numerical weather forecasts, where prediction models that evolve the climate system from an initial state are used to predict the future state of the atmosphere. The initial state of the climate is a key input into the forecast and determines the future development of the model simulation. Data assimilation is used to estimate the initial state that best matches the available data, while taking model errors into account. A climate reanalysis is produced with a single version of the data assimilation system, including the prediction model [37].
The reanalysis uses a combination of modeled data and observed data with emphasis on the laws of physics. The data are stored in the ECMWF archive and copied to the Copernicus Climate Data Store archive, from where they are freely downloadable using the CDS catalog or the CDS API application in the GRIB or NetCDF format.
The data were downloaded in NetCDF, which is a common format in drought or flood forecasting [38]. The spatial resolution is 0.1° × 0.1°, which corresponds to a grid of approximately 9 km × 9 km.
Relative humidity was calculated using the August-Roche-Magnus approximation [39], where the input data were dew point and temperature.
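Relative humidity can be derived from dew point and air temperature with the August-Roche-Magnus approximation. The following is a minimal sketch of that calculation, not the authors' implementation: the function name and the coefficient values (17.625 and 243.04 °C, a commonly used parameter set) are assumptions, since the paper does not state which coefficients were used. Inputs are expected in degrees Celsius, so temperatures stored in kelvin would first need 273.15 subtracted.

```python
import math

# Magnus coefficients: ASSUMED values (a commonly used set), not taken from the paper.
MAGNUS_A = 17.625
MAGNUS_B = 243.04  # degrees Celsius


def relative_humidity(t_air_c, t_dew_c):
    """Relative humidity in percent from air and dew point temperature (deg C)."""
    # Ratio of actual to saturation vapour pressure; the common leading
    # constant of the Magnus formula cancels out in the ratio.
    e_actual = math.exp(MAGNUS_A * t_dew_c / (MAGNUS_B + t_dew_c))
    e_saturated = math.exp(MAGNUS_A * t_air_c / (MAGNUS_B + t_air_c))
    return 100.0 * e_actual / e_saturated


# Hypothetical example: air temperature 20 deg C, dew point 10 deg C.
print(round(relative_humidity(20.0, 10.0), 1))
```

When the dew point equals the air temperature, the function returns 100%, which is a quick sanity check of the formula.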
In the final dataset preparation, evaporation data from evaporation stations and geomorphological variables (elevation, latitude and longitude) were added to the reanalyzed data.

Methods
Statistical methods of linear and non-linear regression (random forest regression) were used to evaluate evaporation from the water reservoirs. The main objective of the regression is to determine the best fit between the observed values from the evaporation stations and the variables from the ERA5-Land project. The resulting linear and nonlinear models were evaluated using cross validation and goodness-of-fit (GOF) criteria: mean absolute error (MAE), root mean squared error (RMSE), coefficient of determination (R²) and relative error (RERR). This section describes how the linear and non-linear models were built and evaluated.

Linear Regression
Linear regression attempts to explain the values of a dependent variable through other quantities. In our case, the dependent variable (evaporation value or evaporation rate from evaporation stations and EWM evaporimeters) was explained using other variables (air temperature, surface temperature, wind, surface net solar radiation, dew point, pressure, latitude and altitude, evaporation type distribution). In total, 18 linear models were created: 8 manually by sequential testing and 10 by stepwise regression.
The first set of models (built manually) was evaluated based on the Akaike Information Criterion (AIC) [40], and the QQ plot was used for visual diagnostics [41]. The AIC value is the sum of two terms: the first is proportional to the logarithm of the residual sum of squares, and the second is proportional to the complexity of the model (the number of its parameters). When building LM models, adding more independent variables often reduces the residual sum of squares (improving the fit of the model to the observed data), but this can result in an overfitted LM. The part of the AIC that penalizes model complexity is intended to prevent overfitting. The QQ plot of residuals helps to verify the model assumptions (normality of residuals). In this plot, two sets of quantiles are plotted against each other: the theoretical quantiles of the assumed distribution and the quantiles of the actual model residuals.
The second group of linear models was developed using stepwise regression with the R packages caret, leaps and MASS [42]. The caret package follows machine learning principles, and the leaps package performs the stepwise regression itself. The caret function train() implements sequential selection of predictors for linear regression; in this work, backward selection was used. The hyperparameter nvmax corresponds to the maximum number of predictors included in the model; in this work, 11 predictors were used. The validation method can also be configured; in this work it was cross validation with 500 iterations.
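The two-term structure of the AIC can be illustrated with the least-squares form n·ln(RSS/n) + 2k. This is a minimal sketch under the assumption of normally distributed residuals; it omits terms that are constant for a fixed dataset, so its absolute values may differ from those reported by statistical software, while the ordering of models fitted to the same data should be unaffected. The function name and the residual values are hypothetical.

```python
import math


def aic_least_squares(residuals, n_params):
    """AIC of a least-squares fit, up to an additive constant.

    First term: grows with the logarithm of the residual sum of squares.
    Second term: penalizes the number of fitted parameters.
    """
    n = len(residuals)
    rss = sum(r * r for r in residuals)
    return n * math.log(rss / n) + 2 * n_params


# Hypothetical residuals: the larger model fits slightly better,
# but needs three more parameters.
simple_model = aic_least_squares([1.0, -1.0, 1.0, -1.0], n_params=2)
complex_model = aic_least_squares([0.9, -0.9, 0.9, -0.9], n_params=5)

# The lower AIC is preferred; here the small gain in fit does not
# outweigh the complexity penalty.
print(simple_model < complex_model)
```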

Random Forest Regression
Random forest (RF) is an ensemble learning method for classification and regression that builds multiple decision trees during training. For classification, it outputs the mode (most frequent value) of the classes returned by the individual trees; for regression, the resulting regression function is defined as a weighted average of the regression functions of the individual trees. Regression forests belong to the so-called committee or ensemble methods, whose main idea is to combine several separate models into a single ensemble and thus reach a collective decision [26,43]. A random forest consists of a set of trees T_1, ..., T_N whose classification or regression functions can be expressed as follows:

$$T_k = h(X, \Theta_k), \quad k = 1, \dots, N,$$

where h is a function, X is a predictor and Θ_1, ..., Θ_N are independent, identically distributed random vectors. For the random forest method, binary trees of the CART type [44] are used. As with the creation of individual trees or other calibrations, the data are split into test and training sets. The R package randomForest [27] was used in this work.
Random forest is an approach to building predictive models for both classification and regression tasks. It combines weaker baseline models to obtain a better predictive model. Because of their simplicity, few assumptions and high performance, RF models have been widely used in machine learning. The term "forest" refers to a set of decision trees that are themselves "weak" learners. A stand-alone regression tree does not have the same predictive power as a regression forest: because a single tree splits on a single criterion at each node, it is very sensitive to changes in the data. RF models rank variables by their importance to achieve the best RF model [45].
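The ensemble idea, many weak trees whose predictions are averaged, can be sketched with a toy example. This is a deliberately simplified illustration and not the algorithm of the randomForest package: each "tree" here is a single-split regression stump fitted to a bootstrap sample on one randomly chosen feature, whereas a real random forest grows full CART trees and samples candidate features at every split. All function names and the data are hypothetical.

```python
import random
from statistics import mean


def fit_stump(X, y, feature):
    """Fit a one-split regression tree on one feature; return a predict function."""
    values = sorted({row[feature] for row in X})
    best = None  # (sum of squared errors, threshold, left mean, right mean)
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2
        left = [t for row, t in zip(X, y) if row[feature] <= thr]
        right = [t for row, t in zip(X, y) if row[feature] > thr]
        m_left, m_right = mean(left), mean(right)
        sse = (sum((t - m_left) ** 2 for t in left)
               + sum((t - m_right) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, thr, m_left, m_right)
    if best is None:  # feature is constant in this sample: predict the mean
        m_all = mean(y)
        return lambda row: m_all
    _, thr, m_left, m_right = best
    return lambda row: m_left if row[feature] <= thr else m_right


def fit_forest(X, y, n_trees=50, seed=0):
    """Fit stumps on bootstrap samples, each using one random feature."""
    rng = random.Random(seed)
    n, n_features = len(X), len(X[0])
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        feature = rng.randrange(n_features)
        trees.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feature))
    return trees


def predict_forest(trees, row):
    """Forest prediction: the average of the individual tree predictions."""
    return mean(tree(row) for tree in trees)


# Hypothetical data: the response grows with the first feature only.
X = [[float(i), float(i % 3)] for i in range(20)]
y = [2.0 * i for i in range(20)]
forest = fit_forest(X, y, n_trees=50, seed=1)
print(predict_forest(forest, [1.0, 0.0]), predict_forest(forest, [18.0, 0.0]))
```

Each stump alone gives a coarse two-level prediction; averaging many of them, each trained on a different resample, is what smooths the response and reduces sensitivity to individual data points.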

Evaluation of Regression
Cross validation is used to improve the quality of regression models [46]. Depending on the method chosen, cross validation is divided into k-fold cross validation and leave-one-out cross validation. In our experiment, leave-one-out validation was selected. The dataset was split into training and test data, with one subset withheld from the training data. The dataset consisted of the selected stations, and the withheld subset always consisted of one station; with 24 stations in total, this resulted in 24 iterations. Goodness-of-fit (GOF) criteria were used for further evaluation.
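The leave-one-station-out procedure can be sketched as a loop that withholds all records of one station, fits the model to the remaining stations and predicts the withheld station. This is a minimal sketch with hypothetical names and data; a one-predictor least-squares line stands in for the LM and RF models of the study, and any pair of fit and predict functions could be passed instead.

```python
def leave_one_station_out(records, fit, predict):
    """records: list of (station_id, x, y) tuples.

    Returns {station_id: [(observed, predicted), ...]} where each station
    is predicted by a model trained on all other stations.
    """
    stations = sorted({station for station, _, _ in records})
    results = {}
    for held_out in stations:
        train = [(x, y) for s, x, y in records if s != held_out]
        test = [(x, y) for s, x, y in records if s == held_out]
        model = fit(train)
        results[held_out] = [(y, predict(model, x)) for x, y in test]
    return results


def fit_line(pairs):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    return (mean_y - slope * mean_x, slope)


def predict_line(model, x):
    intercept, slope = model
    return intercept + slope * x


# Hypothetical records for three stations following y = 3 + 2x exactly.
records = [("A", 1.0, 5.0), ("A", 2.0, 7.0),
           ("B", 3.0, 9.0), ("B", 4.0, 11.0),
           ("C", 5.0, 13.0), ("C", 6.0, 15.0)]
cv = leave_one_station_out(records, fit_line, predict_line)
print(sorted(cv))  # one entry per withheld station
```

With 24 stations the loop would run 24 times, matching the number of iterations described above.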

Evaluation of Regression by Goodness-of-Fit (GOF)
The linear regression and random forest regression models were evaluated based on their GOF (R² [47], RMSE [48], MAE [48] and RERR [49]). The aim is to identify the model most suitable for calculating evaporation in the Czech Republic.
(i) The R² is given by:

$$R^2 = 1 - \frac{RSS}{TSS},$$

where RSS is the residual sum of squares and TSS is the total sum of squares, computed from the predicted evaporation values Ep and the cross-validation test data Et.
• R² measures the quality of the regression model and expresses the proportion of variability in the dependent variable that the model explains. It may attain a maximum value of 1, which means perfect prediction of the dependent variable. Conversely, a value of 0 means that the model provides no information for understanding the dependent variable and is useless.
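(ii) RMSE is given by:

$$RMSE = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(Ep_i - Et_i\right)^2},$$

where Ep_i is the predicted evaporation value for the i-th case, Et_i is the corresponding test value from cross validation and N is the total number of simulated values.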
• It was used as the standard statistical metric providing a relatively high weight to large errors.
(iii) MAE is given by:

$$MAE = \frac{1}{N}\sum_{i=1}^{N}\left|Ep_i - Et_i\right|.$$

The mean absolute error (MAE) is calculated as the average of the absolute differences between the predicted evaporation values Ep_i and the test data from cross validation Et_i.
• MAE measures how close the predictions or forecasts are to the final results. 'Absolute' means that negative differences are converted to positive values. The error is less sensitive to occasional very large errors because it does not square them.
(iv) RERR is given by:

$$RERR = \frac{Ep - Et}{Ep},$$

i.e., the ratio of the error between the predicted evaporation values Ep and the test data Et to the predicted evaporation values Ep.
• It is a dimensionless quantity that can be given in percent and may attain both positive and negative values. Relative error can be used to compare quantities with different dimensions.
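The four criteria can be computed together from paired predicted and test values. The sketch below uses the standard definitions of R², RMSE and MAE. The RERR line is an assumption: it averages the signed error relative to the predicted value, following the verbal description above, and the authors' exact formula may differ. The function name and the numbers are hypothetical.

```python
import math


def goodness_of_fit(predicted, tested):
    """Return R2, RMSE, MAE and RERR for paired predicted (Ep) and test (Et) values."""
    n = len(tested)
    mean_tested = sum(tested) / n
    rss = sum((t - p) ** 2 for p, t in zip(predicted, tested))  # residual sum of squares
    tss = sum((t - mean_tested) ** 2 for t in tested)           # total sum of squares
    mae = sum(abs(p - t) for p, t in zip(predicted, tested)) / n
    # ASSUMPTION: mean signed error relative to the predicted value.
    rerr = sum((p - t) / p for p, t in zip(predicted, tested)) / n
    return {
        "R2": 1.0 - rss / tss,
        "RMSE": math.sqrt(rss / n),
        "MAE": mae,
        "RERR": rerr,
    }


# Hypothetical monthly evaporation values in mm.
gof = goodness_of_fit(predicted=[2.0, 4.0, 6.0], tested=[1.0, 4.0, 7.0])
print(gof)
```

Because RMSE squares the errors before averaging, it is never smaller than MAE on the same data; the gap between the two indicates how unevenly the errors are distributed.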

Final Evaluation of Regression Models
The last step of the evaluation was to create a scoring matrix and successively remove models from the end (from the worst models to the best). For this removal, the individual models first had to be ranked (from best to worst) or standardized using a GOF criterion. The final evaluation was based on this procedure.

Results and Discussion
In this section, a detailed evaluation of linear and random forest regression with respect to GOF (R², RMSE, MAE and RERR) is presented. After all GOF criteria were evaluated, RMSE was selected as the ranking criterion. The best evaporation formulas were then selected from the group of linear models (LM) and random forest models (RFM). The selected models were used to calculate evaporation from the water reservoirs.

Evaluation of Regression Models
Regression models LM and RFM were evaluated by cross validation. The cross-validation procedure was as follows: (i) One of the 24 stations was withheld from the training data, and the relationships derived from the remaining 23 stations were validated on this station. (ii) Validation was carried out successively for all stations and models. (iii) For each validation, the goodness-of-fit criteria R², RMSE, MAE and RERR were calculated. (iv) Based on the RMSE, the R [32] function rank() was used, which returns the rank of each value of a vector in ascending order. After a unique identifier was created, a matrix was built with the models on the x-axis and the stations on the y-axis. Based on this matrix, the best models were selected.
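The ranking step can be sketched as follows: for each station, the models are ranked by RMSE (rank 1 is the lowest RMSE), and the ranks are then averaged over the stations. This is only an illustration of the idea, not the authors' script; the helper below mimics the default tie handling of R's rank() (tied values share their average rank), and the model names and RMSE values are hypothetical.

```python
def rank(values):
    """Ascending ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def mean_ranks(rmse_by_station, model_names):
    """Average per-station rank of each model (lower is better)."""
    totals = [0.0] * len(model_names)
    for station_rmse in rmse_by_station.values():
        for m, r in enumerate(rank(station_rmse)):
            totals[m] += r
    n_stations = len(rmse_by_station)
    return {name: total / n_stations for name, total in zip(model_names, totals)}


# Hypothetical RMSE matrix: rows are stations, columns are models.
models = ["LM_a", "RFM_b", "LM_c"]
rmse = {
    "station_1": [0.6, 0.3, 0.9],
    "station_2": [0.5, 0.4, 0.7],
    "station_3": [0.2, 0.5, 0.8],
}
scores = mean_ranks(rmse, models)
best = min(scores, key=scores.get)
print(scores, best)
```

Ranking within each station before averaging keeps a single poorly fitted station from dominating the comparison, which a plain average of RMSE values would not.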
The models were evaluated and compared using GOF (see Figure 3). The results show that RF models can fit the data better than LM models. RF models are more consistent than LM models for all criterion functions. It can also be seen from the graph and results that for some stations the models do not achieve a good fit.
Outliers (the worst 10% of GOF values) are present in all LM models and also occur in RF models, but on a smaller scale. The outliers corresponded to 70% of the maximum value, which was therefore set as the limit value for the selected GOF. Table 1 shows the evaporation stations that exceeded the limit values for the selected GOF.
Using the R function rank() to sort the models, 3 linear models and 3 random forest models (RFM) were selected. The selected regression models with their average GOF values are presented in Table 2. The top 3 linear models according to all criterion functions are LM1, LM7 and LM8, and the top 3 RFM are RFM4, RFM5 and RFM15. The selected models are shown in Figure 4, where the green line represents linear models and the blue line represents random forest models. The average RMSE of the selected linear models is 0.57 and the minimum is 0.22. The selected RFM had an average RMSE of 0.51 and a minimum of 0.18. The models built manually on the basis of data analysis achieved better results than the models designed using stepwise regression; however, some stepwise models achieved good results in some cases, with less demanding inputs. The linear models were further supplemented with LM12, which also showed good results and whose derived equation is more useful in practice because of its simplicity. All regression models are presented in Appendix A, Table A1.

Model Application to Water Reservoirs
For testing, the best LM models (LM1, LM7, LM8 and LM12) and RF models (RFM4, RFM5 and RFM15) described above were applied to selected reservoirs in the Czech Republic for the period May-October. This period was selected because evaporation is not measured in the winter months due to freezing. The calculated evaporation values for the water reservoirs are presented in Appendix A, Table A2.
The differences and seasonality in evaporation between the water reservoirs are shown in Figure 5, where green lines represent linear models (LM), blue lines represent random forest models (RFM) and red lines represent observed data. The average across all data is represented by the bold lines. The mean value from LM models and RFM models over the period (1981-2020) for reservoirs for May-October is 546.54 [mm·year
Top models LM1 and RFM12 are compared with elevation for the water reservoirs. Figure 6 shows the relationship between elevation and evaporation, where the green line represents the linear regression model and the blue line represents the random forest model. The elevation of the water reservoirs ranges from 170.54 to 781.91 m a.s.l. Evaporation decreases with elevation above sea level. Both models are influenced by local conditions because both take geographic coordinates and elevation as inputs. The results of the study will be implemented in the hydrological model Bilan [50,51] and used in climate change assessment studies in the Czech Republic [52].

Concluding Remarks
The main objective of the estimation of evaporation from the water reservoirs was to derive a universal relationship for the whole territory of the Czech Republic.
The estimation of evaporation from water reservoirs is complicated because a large number of water reservoirs do not have observed evaporation data. In this work, Quitt's climate classification was used to assign an evaporimeter station that is not near a reservoir to that reservoir based on climate region and elevation. Within the Czech Republic, the evaporation value for water reservoirs is determined on the basis of a handling order, which is established according to a Czech technical standard that is based on old climatic data and does not account for climate change. For this reason, the determination of evaporation from water reservoirs is based on estimation using statistical methods rather than exact measurement.
The ERA5-Land climate reanalysis data were used for the derivation and were chosen for their comprehensiveness, availability, high spatial resolution, long time series and ease of handling. Relative humidity, calculated with the August-Roche-Magnus approximation, was included in the results. The climate reanalysis data were exported for stations and water reservoirs.
The derivation of the relationship for evaporation was based on multiple linear regression, in which the values of the dependent variable (evaporation) are estimated from two or more variables (predictors: air temperature, surface temperature, wind speed, surface net solar radiation, dew point, surface pressure, altitude, latitude, longitude and calculated humidity). The models were constructed (i) manually, with evaluation using the AIC and visual diagnostics using the quantile-quantile (QQ) plot, which was time consuming, and (ii) using stepwise regression, in which the predictors are entered sequentially and models with one to X selected variables are generated, which is not time consuming. Random forest regression was used to account for non-linear relationships. Linear and random forest regression models were cross-validated and evaluated using criterion functions (R², RMSE, MAE and RERR). Finally, 3(+1) LM models and 3 RF models were selected. The models contained a large number of independent variables (6-7), possibly leading to model overfitting. Therefore, another model was selected that performed best for the RMSE criterion function, is based on only 4 independent variables and is thus more user friendly.
Geomorphological information (elevation, location) appeared more often in the manually derived models than in the models constructed using stepwise regression. When linear models (LM) and random forest models (RFM) were compared, LM was found to have much more variability in the outcome than RFM. The advantage of RFM is their adaptability, but the subsequent interpretation of the results can be a problem. This was shown both in the design of LM and RFM and in the application of the proposed models to water reservoirs.
Evaporation values for the period 1981-2019 were calculated for the selected water reservoirs and selected formulas based on ERA5-Land climate reanalysis data.
Both LM and RFM models were used for the evaluation of evaporation. Among the best linear regression models, LM1 from the manual linear regression group and LM12 from the stepwise regression group were used. Model LM1, with six predictors, was selected as the best model. It can be replaced by the alternative model LM12, which also performed satisfactorily with four predictors.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: