1. Introduction
Water management, changes in the natural water regime and sustainable landscapes have become important topics of public discussion and policy, not only in the Czech Republic but around the world [1]. Global and local climatic conditions are changing and will affect the water management sector, so they deserve close attention. Evaporation in the Czech Republic is changing as well [2].
However, it is not only the climatic conditions that change, but also the technology and knowledge available to water management and, specifically, hydrology. With the rapid development of remote sensing tools over recent decades, easy-to-use, high-quality products have become available to both professionals and the public in the field of water resources.
In recent years, there has been a significant increase in the supply of Earth remote sensing information usable in water management, and not only by professionals [3,4,5]. Another option is, for example, the use of globally available climate reanalyses or other available data sources. Despite the growth in data availability and modelling tools, a question arises: How significant is the impact of ongoing climate change on the components of the hydrological balance, and consequently on water management [6]?
The hydrological balance is tied to rainfall-runoff processes, which are driven by climatic, geographical and geomorphological factors. The climatic factors include meteorological factors affecting evaporation and evapotranspiration from the catchment, such as precipitation, humidity, soil moisture, evaporation, air temperature, wind speed and direction, and atmospheric pressure [7].
A number of recent studies have pointed out that evapotranspiration plays a key role in the hydrological balance, e.g., [8,9,10,11]. It is now widely recognized that evaporation plays a crucial role in the hydrological cycle over most of the Earth’s surface.
The study [12] illustrates the impacts of climate change on the water cycle, which may affect total evaporation, precipitation, atmospheric humidity and horizontal moisture transport at the global scale.
There are many methods for calculating evaporation, which can be determined for a free water surface, the soil surface or vegetation over a period of time. Evaporation can be evaluated by direct methods, namely measurement, or by indirect methods: empirical methods, remote sensing of the Earth on regional or global scales [13,14], or models, which are classified as fully physically based combination models, semi-physically based models or black-box models [15].
The total evaporation can be divided into actual, potential or reference evapotranspiration. Potential evaporation can be determined by empirical relationships or by measurement; the empirical relationships may differ in their input data or time step [8,16]. The calculation of reference evapotranspiration follows the FAO methodology, with the reference surface defined in [17].
The studies [18,19] evaluated evapotranspiration calculated on the basis of empirical equations, which were divided into three categories: mass-transfer, radiation-based and temperature-based methods. The best equations from each category were then selected and compared on the basis of the FAO and Penman–Monteith equations [20].
Reference evapotranspiration was estimated in the study [21], where the temperature-based Penman–Monteith equation achieved the best rating because it preserves the physical basis of the Penman–Monteith method. The method was applied at a global scale, using the Köppen climate classification system and a world dataset covering different climate conditions. Calculating reference evapotranspiration by indirect methods can provide acceptable results when direct measurements are not available [15].
Since most empirical formulas are tied to a geographical location, the empirical calculation of evapotranspiration differs between regions because of their different climatic conditions [17]. National standards, legislation and expertise also play a role, so different methods are preferred in different countries, e.g., the Netherlands: Makkink’s method [22], Slovakia: Budyko’s method [23], Bulgaria: Delibaltov–Hristov–Tsonev method [24].
The Penman–Monteith method is considered the sole standard for calculating reference evapotranspiration. The inputs to the equation are climatic data: solar radiation, air temperature, humidity and wind speed. It allows evapotranspiration to be calculated at different times of the year and in different regions, yet a precise measurement at a given location can easily replace the simplified Penman–Monteith equation [17].
Other methods of calculating evapotranspiration rely on empirical relationships, e.g., the relationship between evaporation observed at evaporation stations and meteorological quantities. These relationships can be linear or non-linear [25,26] and can be derived using machine learning algorithms [27,28] such as linear regression or random forest regression.
The assessment of long-term climate variables can be based on time series. A time series is a sequence of measurements recorded over time that can be analysed using, e.g., Least-Squares Spectral Analysis, Least-Squares Wavelet Analysis or Least-Squares Cross Wavelet Analysis [29].
Other methods for evaluation include parametric and non-parametric methods, among them trend tests and methods used in machine learning [30,31]. Parametric methods (logistic regression, linear discriminant analysis and simple neural networks) use a fixed number of parameters to build models and require fewer variables, but the result may be affected by outliers. Non-parametric methods (the Mann–Kendall test, Spearman’s Rho and k-nearest neighbors) use a flexible number of parameters, both variables and attributes can be used in the models, and the result is not affected by outliers.
In this paper, we explore relationships for calculating evaporation from the water surface in the Czech Republic, using reanalyzed climate data to construct linear models (LM) and random forest models (RFM). Evaporation estimated from the derived models was compared with evaporation observed at evaporation stations. Finally, the derived relationships were applied to selected water reservoirs.
Specifically, we aim to answer the following questions: Which statistical method for calculating evaporation performs better, linear regression or random forest regression? How many variables are important for determining the formula for calculating evaporation? How important is geomorphological information (elevation and location) for calculating evaporation using linear and non-linear models? The main objective of estimating evaporation from the water surface was to derive a universal relationship for the whole territory of the Czech Republic.
This paper is structured as follows: Section 2 introduces the area of interest and the input data. The statistical methods for evaluating evaporation and the goodness-of-fit (GOF) criteria, evaluated in the R environment [32], are described in Section 3. The results and discussion are presented in Section 4, along with a detailed GOF evaluation of the regression models for evaporation stations and subsequently for water reservoirs. The paper is concluded in Section 5.
2. Study Area and Data
The study area is defined by the state border of the Czech Republic. Within the region (
N to
N latitude and
E to E
longitude) the long–term (1981–2010) mean annual precipitation totals at 709.5 mm, mean annual air temperature is
°C, mean runoff is 205.5 mm [
33] and the long-term runoff coefficient is thus 0.29 (29% of the precipitation total runs off).
Figure 1 shows the long-term temperature and evaporation trends at the Hlasivo evaporation station. The Hlasivo station provides a consistent time series of 58 years; evaporation is measured by a 20 m² benchmark evaporator. Other observed variables are: air temperature at 2 m [°C], water surface temperature in the evaporimeter [°C], relative humidity [%], global solar radiation [W·m⁻²] and wind speed at 2 m [m·s⁻¹] [34].
Figure 2 shows the selected 24 evaporation stations and 33 water reservoirs. The evaporation stations were assigned to water reservoirs based on the Quitt classification and elevation [35]. The elevation differences between the evaporation stations and the water reservoirs do not exceed 100 m. The Quitt classification divides the Czech Republic into three climatic regions (cold, moderately warm and warm), and a reservoir was always assigned an evaporation station in the same climatic region. Evaporation at the stations was observed between 1957 and 2019 (at most stations, records begin in 2005).
The data from the evaporimeters (EWM) were provided by the Czech Hydrometeorological Institute and Palivový kombinát Ústí, state-owned enterprise. The T. G. Masaryk Water Research Institute, public research institution (TGM WRI, p.r.i.) provided data from the floating evaporator and from the Hlasivo evaporation station.
Observed data from the evaporation stations were aggregated to a monthly step, because the measured daily values are affected by random error [36]; the monthly values were then used to evaluate evaporation from water reservoir surfaces. The observed evaporation (May–October) ranges from 459 [mm·year⁻¹] (Pec pod Sněžkou) to 760 [mm·year⁻¹] (Holešov), with a mean over the evaporation stations of 627 [mm·year⁻¹]. The minimum mean daily rate (1.38 [mm·day⁻¹]) occurs in October and the maximum mean daily rate in July (4.53 [mm·day⁻¹]), with an overall maximum in June 2017 (5 stations exceeded 6.5 [mm·day⁻¹]).
The relationships for calculating evaporation from the water surface were developed using linear and non-linear regression. Evaporation measured at the evaporation stations served as the dependent variable. ERA5-Land climate reanalysis data from 1981 to 2019 were used for the independent variables.
Climate Reanalysis
The purpose of the reanalysis is to provide an estimate of quantities describing atmospheric, climatic and hydropedological processes and behavior of oceans with global coverage and relatively high spatiotemporal resolution.
The reanalyses are outputs of various models, usually including a hydrological, an atmospheric and an ocean model and a model of the Earth’s surface. Their advantage is that they provide multidimensional, spatially complete and coherent information about the global circulation and hydroclimatic quantities. Climate reanalyses are generated in a similar manner to numerical weather forecasts, in which prediction models evolve the climate system from an initial state to predict the future state of the atmosphere. The initial state of the climate is a key input to the forecast and determines the future development of the model simulation. Data assimilation is used to estimate the initial state that best matches the available data, while taking model errors into account. A climate reanalysis is produced with a single version of the data assimilation system, including the prediction model [37].
The reanalysis uses a combination of modeled data and observed data with emphasis on the laws of physics. The data are stored in the ECMWF archive and copied to the COPERNICUS Climate Data Store archive, from where they are freely downloadable using the CDS catalog or the CDS API application in the GRIB or NetCDF format.
The data were downloaded in NetCDF, which is a common format in drought or flood forecasting [
38]. The spatial resolution is
×
, which represents approximately a grid of 9 km × 9 km.
The data set selected for calculating evaporation from water reservoirs consisted of 2 m temperature [K], skin temperature [K], 2 m dew-point temperature [K], 10 m v-component of wind [m·s⁻¹], surface pressure [Pa] and surface net solar radiation [J·m⁻²]. Temperatures were converted from [K] to [°C], and energy from [J·m⁻²] to [W·m⁻²] by dividing the values by the accumulation time expressed in seconds. Relative humidity [%] was calculated using the August–Roche–Magnus approximation [39], with dew point and temperature as input data.
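The conversions described above can be sketched as follows. This is a minimal Python illustration, not the authors' R code: the function names are hypothetical, the Magnus coefficients (17.625 and 243.04 °C) are one commonly used set and are an assumption here, and the accumulation period is left as an input because it depends on how the reanalysis fields were retrieved.

```python
import math

# Assumed Magnus coefficients (one commonly used set); the paper does not state which were used.
MAGNUS_A = 17.625
MAGNUS_B = 243.04  # [degC]


def kelvin_to_celsius(t_kelvin):
    """Convert temperature from [K] to [degC]."""
    return t_kelvin - 273.15


def accumulated_to_flux(energy_j_m2, accumulation_seconds):
    """Convert accumulated energy [J m-2] to a mean flux [W m-2]."""
    return energy_j_m2 / accumulation_seconds


def relative_humidity(t_celsius, dewpoint_celsius):
    """Relative humidity [%] from air and dew-point temperature [degC].

    Ratio of the Magnus saturation vapour pressures at the dew point
    and at the air temperature.
    """
    gamma_dew = MAGNUS_A * dewpoint_celsius / (MAGNUS_B + dewpoint_celsius)
    gamma_air = MAGNUS_A * t_celsius / (MAGNUS_B + t_celsius)
    return 100.0 * math.exp(gamma_dew - gamma_air)
```

When the dew point equals the air temperature the function returns 100%, and the value decreases as the dew point falls below the air temperature.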
In the final dataset preparation, evaporation data from evaporation stations and geomorphological variables (elevation, latitude and longitude) were added to the reanalyzed data.
3. Methods
Linear regression and non-linear regression (random forest regression) were used to evaluate evaporation from water reservoirs. The main objective of the regression is to find the best fit between the values observed at the evaporation stations and the variables from the ERA5-Land project. The resulting linear and non-linear models were evaluated based on cross validation and goodness-of-fit (GOF) criteria: mean absolute error (MAE), root mean squared error (RMSE), coefficient of determination (R²) and relative error (RERR). This section describes how the linear and non-linear models were built and evaluated.
3.1. Linear Regression
Linear regression attempts to explain the values of a dependent variable through other quantities. In our case, the dependent variable (evaporation value or evaporation rate from evaporimeter stations and EWM evaporimeters) was explained using other variables (air temperature, surface temperature, wind, surface net solar radiation, dew point, pressure, latitude and altitude, evaporation type distribution). A total of 18 linear models were created: 8 manually by sequential testing and 10 by stepwise regression.
The first set of models (built manually) was evaluated based on the Akaike Information Criterion (AIC) [40], and the QQ plot was used for visual diagnostics [41]. The AIC value is the sum of two terms: the first is proportional to the logarithm of the residual sum of squares, and the second is proportional to the complexity of the model (the number of its terms). When building the LM models, adding independent variables often reduces the residual sum of squares (improving the fit of the model to the observed data), but this can result in an overfitted LM. The part of the AIC that penalizes model complexity is meant to prevent overfitting. The QQ plot of the residuals helps to verify the assumptions of the model (normality of residuals). In this plot, two sets of quantiles are plotted against each other: the theoretical quantiles of the assumed distribution and the quantiles of the actual model residuals.
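The two-term structure of the AIC described above can be illustrated with a short sketch. It assumes the common least-squares form AIC = n·ln(RSS/n) + 2k, which may differ by an additive constant from values reported by statistical software; the function name and the residual values in the usage note are hypothetical.

```python
import math


def aic_least_squares(residuals, n_params):
    """AIC for a least-squares fit: n * ln(RSS / n) + 2 * k.

    The first term rewards a small residual sum of squares (RSS); the
    second penalizes the number of estimated parameters k.
    """
    n = len(residuals)
    rss = sum(r * r for r in residuals)
    return n * math.log(rss / n) + 2 * n_params
```

For example, `aic_least_squares([1, -1, 1, -1], 2)` gives 4.0, because RSS/n = 1 makes the first term zero and only the complexity penalty remains. Among candidate models fitted to the same data, the one with the lower AIC is preferred.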
The second set of linear models was developed using stepwise regression with the R-packages caret, leaps and MASS [42]. The caret package provides a machine learning framework, and the leaps package performs the stepwise regression. The caret package has a function that allows sequential selection of predictors, for which one of the following linear regression selection methods is chosen:
leapBackward,
leapForward,
leapSeq.
In this work, backward selection was used. The hyperparameter nvmax corresponds to the maximum number of predictors included in the model; 11 predictors were used here. The validation method can also be configured; in this work it was cross validation with 500 iterations.
3.2. Random Forest Regression
Random forest (RF) is an ensemble learning method for classification and regression that builds multiple decision trees during training. For classification it outputs the mode (most frequent value) of the classes returned by the individual trees; for regression, the resulting regression function is defined as a weighted average of the regression functions of the individual trees. Regression forests belong to the so-called committee or ensemble methods, whose main idea is to combine several separate models into a single ensemble, thus making use of a so-called collective decision [26,43]. A random forest consists of a set of trees $T_1, \dots, T_K$ whose classification or regression functions can be expressed as follows:
$$T_k(x) = h(x, \Theta_k), \quad k = 1, \dots, K,$$
where $h$ is a function, $x$ is a predictor and $\Theta_1, \dots, \Theta_K$ are independent, identically distributed random vectors. The random forest method uses binary trees of the CART type [44]. As with the construction of individual trees and other calibrations, the data are split into training and test sets. The R-package randomForest [27] was used in this work.
Random forest is an approach to building predictive models for both classification and regression tasks. It combines weaker baseline models to obtain a better predictive model. Owing to their simplicity, few assumptions and high performance, RF models are widely used in machine learning. The term “forest” refers to a set of decision trees that are themselves “weak” learners. A stand-alone regression tree does not have the same predictive power as a regression forest: because a single tree splits on one criterion at a time, it is very sensitive to changes in the data. RF models also rank variables by their importance, which helps in arriving at the best RF model [45].
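The idea of combining weak trees by averaging can be sketched in a few lines. This is a deliberately simplified illustration, not the randomForest implementation used in the paper: each "tree" is a one-split stump on a single predictor, trees are fitted to bootstrap samples, and predictions are combined by an unweighted mean. A full random forest additionally grows deep trees and samples a random subset of predictors at each split. All function names are hypothetical.

```python
import random


def fit_stump(xs, ys):
    """Fit a one-split regression tree (stump) on a single predictor."""
    best = None
    # Candidate thresholds are the observed x values, except the largest,
    # so that both sides of the split are non-empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        mean_l = sum(left) / len(left)
        mean_r = sum(right) / len(right)
        sse = (sum((y - mean_l) ** 2 for y in left)
               + sum((y - mean_r) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, mean_l, mean_r)
    if best is None:  # all x values identical: predict the overall mean
        mean_y = sum(ys) / len(ys)
        return lambda x: mean_y
    _, threshold, mean_l, mean_r = best
    return lambda x: mean_l if x <= threshold else mean_r


def fit_forest(xs, ys, n_trees=50, seed=0):
    """Average the predictions of stumps fitted to bootstrap samples."""
    rng = random.Random(seed)
    n = len(xs)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: sum(tree(x) for tree in trees) / len(trees)
```

Because each stump sees a different bootstrap sample, the averaged prediction is less sensitive to individual observations than any single stump.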
3.3. Evaluation of Regression
Cross validation is used to assess and improve the quality of regression models [46]. Depending on the method chosen, cross validation is divided into k-fold cross validation and leave-one-out. In our experiment, leave-one-out validation was selected. The dataset was split into training and test data by removing one subset from the training data. The dataset consisted of the selected stations, and the removed subset consisted of one station at a time; with a total of 24 stations, this resulted in 24 iterations. Goodness-of-fit (GOF) criteria were used for further evaluation.
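The station-wise leave-one-out procedure can be sketched as below. This is an illustrative Python outline under stated assumptions: the record layout `(station, x, y)` and the `fit`/`predict` callables are hypothetical placeholders for whatever regression model is being validated.

```python
def leave_one_station_out(records, fit, predict):
    """Hold out each station in turn and predict its observations.

    records: list of (station, x, y) tuples.
    fit:     callable taking a list of (x, y) pairs and returning a model.
    predict: callable taking (model, x) and returning a prediction.
    Returns a list of (station, observed, predicted) tuples.
    """
    stations = sorted({station for station, _, _ in records})
    results = []
    for held_out in stations:
        train = [(x, y) for s, x, y in records if s != held_out]
        test = [(x, y) for s, x, y in records if s == held_out]
        model = fit(train)  # the held-out station is never seen in training
        for x, y in test:
            results.append((held_out, y, predict(model, x)))
    return results
```

With 24 stations the loop runs 24 times, and the collected pairs of observed and predicted values are what the GOF criteria are then computed from.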
3.4. Evaluation of Regression by Goodness-of-Fit (GOF)
The sets of linear regression and random forest regression models were evaluated based on their GOF (R² [47], RMSE [48], MAE [48] and RERR [49]). The aim is to identify the model best suited to calculating evaporation in the Czech Republic.
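- (i)
The R² is given by:
$$R^2 = 1 - \frac{SS_{\mathrm{res}}}{SS_{\mathrm{tot}}}, \qquad SS_{\mathrm{res}} = \sum_{i=1}^{N} (E_i - \hat{E}_i)^2, \qquad SS_{\mathrm{tot}} = \sum_{i=1}^{N} (E_i - \bar{E})^2,$$
where $SS_{\mathrm{res}}$ is the residual sum of squares and $SS_{\mathrm{tot}}$ the total sum of squares, computed from the predicted evaporation values $\hat{E}_i$ and the tested data of cross validation $E_i$, with $\bar{E}$ the mean of the tested data and $N$ the number of values. R² measures the quality of the regression model: it expresses the proportion of variability in the dependent variable explained by the model. It may attain a maximum value of 1, which means perfect prediction of the dependent variable. Conversely, a value of 0 means that the model provides no information for understanding the dependent variable and is useless.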
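- (ii)
RMSE is given by:
$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (\hat{E}_i - E_i)^2},$$
where $\hat{E}_i$ is the predicted evaporation value for the $i$-th case, $E_i$ is the tested data from cross validation and $N$ is the total number of simulated values.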
- (iii)
The mean absolute error (MAE) is calculated as the average of the absolute differences between the predicted evaporation values $\hat{E}_i$ and the tested data from cross validation $E_i$:
$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} |\hat{E}_i - E_i|.$$
- (iv)
RERR is given by:
$$\mathrm{RERR} = \frac{|\hat{E}_i - E_i|}{E_i},$$
that is, the ratio of the absolute error between the predicted evaporation values $\hat{E}_i$ and the tested data $E_i$ to the true value $E_i$.
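The four criteria can be computed from paired observed and predicted values as in the following sketch. It is an illustration in Python rather than the R code used in the paper; the function names are hypothetical, and the relative error is assumed to be taken per case with the tested (observed) value in the denominator.

```python
import math


def r_squared(obs, pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(obs) / len(obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot


def rmse(obs, pred):
    """Root mean squared error."""
    return math.sqrt(sum((p - o) ** 2 for o, p in zip(obs, pred)) / len(obs))


def mae(obs, pred):
    """Mean absolute error."""
    return sum(abs(p - o) for o, p in zip(obs, pred)) / len(obs)


def rerr(obs, pred):
    """Per-case relative error |pred - obs| / obs (assumed denominator: obs)."""
    return [abs(p - o) / o for o, p in zip(obs, pred)]
```

For observed values `[2, 4, 6]` and predictions `[2, 5, 5]`, the residual sum of squares is 2 and the total sum of squares is 8, so R² is 0.75 and MAE is 2/3.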
3.5. Final Evaluation of Regression Models
The last step of the evaluation was to create a scoring matrix and consecutively remove the models from the end (order of removal was from the worst models to the best). In order for the removal to occur, the individual models had to be ranked (from best to worst) or standardized using a GOF. Based on this procedure, the final evaluation was performed.
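One way to build such a scoring matrix is to rank the models separately under each criterion and sum the ranks. The sketch below assumes this rank-sum scheme, which the text does not specify in detail; the function name, the model names and the numbers in the usage note are purely illustrative and are not results from the paper. Ties are not treated specially.

```python
def rank_models(scores, higher_is_better=("R2",)):
    """Order models from best to worst by the sum of their per-criterion ranks.

    scores: {model_name: {criterion_name: value}}, same criteria for all models.
    Criteria listed in higher_is_better are ranked in descending order;
    all others (error measures) in ascending order.
    """
    criteria = sorted(next(iter(scores.values())))
    totals = {model: 0 for model in scores}
    for criterion in criteria:
        descending = criterion in higher_is_better
        ordered = sorted(scores, key=lambda m: scores[m][criterion],
                         reverse=descending)
        for rank, model in enumerate(ordered, start=1):
            totals[model] += rank  # rank 1 is the best under this criterion
    return sorted(totals, key=lambda m: totals[m])
```

Given three hypothetical models where `model_a` is best on R² and RMSE and second on MAE, the function places `model_a` first; models can then be removed from the end of the returned list, worst first.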
5. Concluding Remarks
The main objective of the estimation of evaporation from the water reservoirs was to derive a universal relationship for the whole territory of the Czech Republic.
The estimation of evaporation from water reservoirs is complicated because a large number of reservoirs have no observed evaporation data. In this work, Quitt’s climate classification was used to assign an evaporimeter station that is not near a reservoir to that reservoir based on climate region and elevation. Within the Czech Republic, the evaporation value for a water reservoir is determined by its handling order, which follows a Czech technical standard that is based on old climatic data and does not account for climate change. For this reason, the determination of evaporation from water reservoirs relies on estimation using statistical methods rather than on exact measurement.
The ERA5-Land climate reanalysis data were used for the derivation and were chosen for their comprehensiveness, availability, high spatial resolution, long time series and ease of handling. Relative humidity, calculated with the August–Roche–Magnus approximation, was added to the data. The climate reanalysis data were exported for the stations and the water reservoirs.
The relationship for evaporation was derived using multiple linear regression, in which the values of the dependent variable (evaporation) are explained by two or more variables (predictors: air temperature, surface temperature, wind speed, surface net solar radiation, dew point, surface pressure, altitude, latitude, longitude and calculated humidity). The models were constructed (i) manually, with evaluation based on the AIC and the quantile–quantile (QQ) plot used for visual diagnostics, which was time consuming, and (ii) using stepwise regression, in which the predictors are entered sequentially and models with one up to the selected maximum number of variables are generated, which is not time consuming. Random forest regression was used to account for non-linear relationships. The linear and random forest regression models were cross-validated and evaluated using criterion functions (R², RMSE, MAE and RERR). Finally, 3(+1) LM models and 3 RF models were selected. The selected models contained a large number of independent variables (6–7), possibly leading to overfitting; therefore another model was selected, which performed best for the RMSE criterion function, is based on only 4 independent variables and is thus more user friendly.
Geomorphological information (elevation, location) appeared more often in the manually derived models than in the models constructed using stepwise regression. When comparing linear models (LM) and random forest models (RFM), the LM showed much more variability in the outcome than the RFM. The advantage of RFM is their adaptability, but the subsequent interpretation of the results can be a problem. This was apparent both in the design of the LM and RFM and when applying the proposed models to water reservoirs.
Evaporation values for the period 1981–2019 were calculated for the selected water reservoirs and selected formulas based on ERA5-Land climate reanalysis data.
For the evaluation of evaporation, models from both the LM and RFM groups were used. Of the best linear regression models, LM1 from the manual linear regression group and LM12 from the stepwise regression group were used. Model LM1, with six predictors, was selected as the best model. It can be replaced by the alternative model LM12, which also performed satisfactorily with four predictors.