Evapotranspiration Response to Climate Change in Semi-Arid Areas: Using Random Forest as Multi-Model Ensemble Method

: Large ensembles of climate models are increasingly available either as ensembles of opportunity or perturbed physics ensembles, providing a wealth of additional data that is potentially useful for improving adaptation strategies to climate change. In this work, we propose a framework to evaluate the predictive capacity of 11 multi-model ensemble methods (MMEs), including random forest (RF), to estimate reference evapotranspiration (ET 0 ) using 10 AR5 models for the scenarios RCP4.5 and RCP8.5. The study was carried out in the Segura Hydrographic Demarcation (SE of Spain), a typical Mediterranean semiarid area. ET 0 was estimated in the historical scenario (1970–2000) using a spatially calibrated Hargreaves model. MMEs obtained better results than any individual model for reproducing daily ET 0 . In validation, RF resulted more accurate than other MMEs (Kling–Gupta efﬁciency (KGE) M = 0.903, SD = 0.034 for KGE and M = 3.17, SD = 2.97 for absolute percent bias). A statistically signiﬁcant positive trend was observed along the 21st century for RCP8.5, but this trend stabilizes in the middle of the century for RCP4.5. The observed spatial pattern shows a larger ET 0 increase in headwaters and a smaller increase in the coast.


Introduction
Concerns about anthropogenic Climate Change (CC) have increased in the last decades. It has been studied by many researchers and institutions, such as the Intergovernmental Panel on Climate Change (IPCC). Despite the uncertainties, the warming of the climate system is unequivocal, and many of the changes observed in climate variables since the 1950s are unprecedented in recent centuries. Data on land and ocean surface temperatures show an average warming of 0.85 • C over the period 1880-2012, with each of the last three decades being successively warmer than any previous decade since 1850 [1]. The Fifth IPCC Report stresses that temperature rising will continue throughout the 21st century at a global scale together with a precipitation decrease in the mid-latitudes [1].
Climate projections depending on different scenarios have been generated to know the foreseeable CC and its environmental impacts, and also to implement adaptation measures. The coupled ocean-atmosphere general circulation models (AOGCM) are the basic tool for making these projections. However, their low spatial resolution (>100 km) prevents them from being used to study CC impact and adaptation at a regional scale. To solve this problem, regionalization or downscaling techniques, which can be statistical or dynamic, have been used. These techniques adapt global projections to regional or local characteristics, which are highly influenced by orography, land-water contrast and land use, among other variables. The CORDEX [2] project, and its regional portals such as EuroCORDEX, and the ENSEMBLES [3] project in Europe are prominent examples. In Spain, within the framework of the Climatic Change National Adaptation Plan (PNACC), various regionalizations have been developed with the aim of obtaining high-resolution series. The project ESCENA produced a monthly dynamic downscaling with a spatial resolution of 25 km, and ESTCENA does statistical downscaling based on weather stations data [4]. The Spanish State Agency of Meteorology (AEMET) has produced daily series [5] based on scenarios defined in the Fifth Assessment Report of the IPCC (IPCC-AR5) [6]. More information and links to access these data are described in AEMET [7].
One of the major limitations of using CC data projections is the uncertainties that affect all steps in their generation process, including those related with the data used and those derived from uncontrolled local factors in the regionalizations [5,8]. In Spain, Amblar Francés et al. [5] found that, when generating regionalized precipitation series, uncertainties associated with the models and the regionalization are more important than those associated with emissions, while it is the opposite with temperatures, which also have a better fit with the historical reference series. Even so, the use of these sources of information is the best way to understand the impact of CC on water resources and to develop adaptation strategies.
As Devineni et al. [9] point out, errors resulting from climate forecasting are primarily of two types: (1) uncertainty in initial and boundary conditions and (2) model error [10]. Multi-model ensemble (MME) methods (also called forecast combinations, multi-model forecasts, or multi-model ensemble projections) may reduce both errors. The advantage of combining several estimations from different models is that they convey different representations of physical processes and models. This way, it is possible to reduce the implicit uncertainty of the models by combining the results of several of them [11].
Traditionally, MMEs projections have weighted equally all regionalized projections, with the Simple Average combination (SA) being the most used for its simplicity. In recent years, various methods have been proposed to weight differently each of the projections in the final series. This weighting is based on the accuracy of the used climate models in instrumental or reference periods. The Reliability Ensemble Averaging (REA) method [12] and its variants, such as those proposed in [13][14][15], or the Bayesian Model Averaging (BMA) [16,17] use this approach.
Machine Learning (ML) has been recently introduced, due to the increasing computational power and the development of powerful algorithms, as a suitable generic framework to tackle a wide range of problems related to CC [18,19]. In the specific case of MMEs, multimodel projections of various variables, such as temperatures or precipitation, are being obtained using different machine learning algorithms such as Random Forest (RF), Support Vector Machines (SVM) and its generalization for regression problems (SVR), neural networks (ANN), K-Nearest Neighbor (KNN), or Relevance Vector Machines (RVM) [20][21][22][23]. In these studies, the forecasts obtained through SA are compared with those obtained through machine learning methods, obtaining the later better results in all cases. However, Wang et al. [20] conclude that the results obtained vary depending on the variable analyzed; better results are obtained with temperatures, while for precipitation, due to its greater variability, the results are worse.
In the specific case of potential evapotranspiration (ET P ) and reference evapotranspiration (ET 0 ), the regionalized projections of CC, together with those of precipitation, are the starting point for studies aimed at the management of water resource systems. This type of studies are essential to evaluate potential changes in water resources in semi-arid areas [24]. Having ET 0 series in semi-arid areas is important as evapotranspiration consumes a high percentage of the scarce precipitation [25].
This study have two main objectives. First, to propose a methodology for obtaining robust ET 0 daily series to complement the Spanish National Climate Change Adaptation Plan (PNACC) data and the regionalized projections generated in its third report [5] for the IPCC-AR5 data [6]. To do this, we propose to estimate and compare the regionalized projections of 16 global models and 11 MMEs whose accuracy usually depends on the analyzed variable. Second, the application of such methodology to the Segura Hydrographic Demarcation (DHS) as a case study, with the aim of evaluating the temporal and spatial impact of CC on the ET 0 in a semi-arid Mediterranean region.
To tackle both objectives, we propose a framework centered on the use of Random Forest (RF) as a machine learning algorithm, both as a daily series ensemble technique, and as part of a Random Forest Regression Kriging (RK) interpolation model (RFRK). Its accuracy will be evaluated and compared with that of the other models and other widely used ensemble methods.
For this purpose, we obtained multi-model projections of ET 0 for the 21st century and two emission scenarios Representative Concentration Pathway 4.5 (RCP4.5) and RCP8.5 [6]. To estimate ET 0 , the maximum and minimum daily temperature variables and the Hargreaves method were used and calibrated according to the methodology proposed in Gomariz-Castillo et al. [26]. Once obtained the series for each regionalized model (16 for each emission scenario) we made ensembles using two machine learning algorithms, two geometric ensembles, four regression based ensembles, and three simple ensemble, evaluating the performance of all of them in validation. After that, the best of them (RF resulted significantly more accurate than the others) was used to interpolate the values of ET 0 using easily obtained auxiliary variables that can act as proxies of those aforementioned variables involved in the evapotranspiration process, but not included in the Hargreaves model.

Reference Evapotranspiration
The FAO Penman-Monteith equation [27] is usually recommended for estimating ET P and ET 0 ; however, it requires data on solar radiation, air temperature, atmospheric humidity, and wind speed. Nowadays, EuroCORDEX has data series for these variables, with a spatial resolution of about 12.5 km. Leaving aside the substantial resolution difference, there are other problems that must be solved: (a) some studies, such as Boe et al. [28], suggest that the current EuroCORDEX ensemble does not capture the upper part of the climate change uncertainty range, as it simulates a much smaller increase in shortwave radiation reaching Earth surface therefore smaller temperature and ET 0 values, and (b) because of the spatial resolution, the effect of local factors may not be taken into account. Consequently, to obtain future projections of ET 0 , regionalization techniques such as the Statistical Downscaling Method (SDSM) [28,29] or the Delta method [30] must be used. For example, Izquierdo-Miñano et al. [31] study the fit of EuroCORDEX in the DHS for relative humidity and the daily maximum and minimum temperature and perform a statistical downscaling based on random forest because high biases were detected in all of them.
The alternative is to use approximations of ET 0 that only require input variables available in the projections. Hargreaves method [32,33] is a temperature-based method that only requires maximum and minimum temperature data. Despite this, it is one of the models that best approximate the real ET 0 [34][35][36]. In this way, it is possible to use downscaling to produce regionalized series, such as those used in this study [7].
However, it underestimates ET 0 in dry or high regions and overestimates it in low areas [37]; moreover, in semi-arid areas, such as the DHS, it generally underestimate ET 0 in the cold months and overestimate it in the warm months [38]. Among the variables whose non-inclusion in the Hargreaves model could reduce its accuracy are radiation [39], the ratio of atmospheric pressure to sea level pressure [40], precipitation [37], wind, and the regional T/∆T ratio. To solve this problem, Hargreaves' method is usually calibrated to approximate the ET 0 estimations obtained with the Penman-Monteith equation FAO. After comparing various calibration procedures, Gomariz-Castillo et al. [26] proposed a simple regression equation for its simplicity and goodness of fit to observed data. Both parameters are then interpolated using Random Forest Regression Kriging (RKRF); in this way, it is possible to extrapolate the calibration parameters to each weather station used in this study. The modified equation for the estimation of ET 0 (mm day −1 ), in a given regionalized model i and weather station, can be defined as where ET 0ti is a ET 0 (mm day −1 ) at day t in a given i regional model. b 0est and b 1est are the interpolated parameters for a given weather station obtained in Gomariz-Castillo et al. [26], and ET 0Hit (mm day −1 ) is the ET 0 estimated from the Hargreaves equation for a weather station and regionalized model data. ET 0Hit can be obtained from the following expression, T MXit and T MNit ( • C day −1 ) are the daily maximum and minimum temperature at day t in a regionalized model i, respectively; C T is an empirical coefficient whose value is 17.8; and Ra t,est is an extraterrestrial radiation (mm day −1 ) considered for a given day t and latitude for weather station est [41].
Although the impact of climate change on ET 0 has been less studied than on other variables, in the last decade some studies have been developed about the spatial-temporal variation of predicted ET 0 along the 21st century in several areas of the planet, using different regionalization techniques [28][29][30]42,43]. The previous works are generally based on the regionalized projections obtained for the model that best fit the reference period (generally the last 30 years). However, Xing et al. [44] use MME projections based on a Bayesian approach. In general, these works obtain positive and significant trends of the ET 0 during the 21st century, with important spatial differences in the magnitude of these trends. Machine learning-based MMEs have not been used to forecast ET 0 series.
The technical report produced by the Spanish Center for Public Works Studies and Experimentation (CEDEX) [45] analyses the changes predicted for the 21st century in precipitation, ET 0 , and total runoff, for three periods distributed throughout the 21st century: 2011-2040, 2041-2070, and 2071-2100, using the period 1961-1990 as a historical reference scenario. Six regionalized projections, obtained from different global models and regionalization techniques, were used. The results refer to two emission scenarios (RCP4.5 and RCP8.5) of the IPCC-AR5 [6]. In addition, Gimenez et al. [25] analyzed the expected evolution of ET 0 in different areas of the Iberian Peninsula for the first part of the 21st century. In this work, MME projections of maximum and minimum temperatures are obtained, based on different regional climate models from the ENSEMBLES [3] project, using the modification of the REA method proposed by Gimenez et al. [15].
Regarding the spatial estimation of ET P and ET 0 , point estimates made from data observed in weather stations can be useful for irrigation management, but are not representative of wide areas in heterogeneous regions [46]. Therefore, some studies have tried to interpolate values of ET 0 from regression models using different geographical variables as predictors [26,47,48].

Study Area
The Segura Hydrographic Demarcation (DHS) (Figure 1) is located in the southeast of the Iberian Peninsula. It has an emerged area of 19,025 km 2 with a large orographic variability, alternating mountain systems that exceed 2000 m.a.s.l. in altitude, with high plateaus, valleys, and coastal depressions [49].
The spatial distribution of temperature is strongly influenced by the orography, with an increase in the average annual temperature from the Northwest mountains (10 • C) to the coast (18 • C). The annual temperature regime presents a winter minimum in the months of December and January, while the maximums correspond to the months of July and August [24].
The average annual rainfall in the period 1980/81-2011/12 was 375 mm [24], making it one of the driest regions in continental Europe. The mountain ranges located in the Northwest average annual rainfall slightly higher than 1100 mm. Precipitation drops drastically in a northwest-southeast direction, with average annual precipitation values around 200 mm in the Southwestern coast. Gomariz-Castillo et al. [26] obtained for the period 2003-2014 an average ET 0 of 1258 mm × year −1 , ranging in most of the DHS between 1200 mm and 1300 mm without a clear spatial pattern.
The climate shows great spatial contrasts and a great interannual variability, with relatively frequent torrential events, interannual droughts, heavy frosts, and maximum temperatures that locally can surpass 40 • C. Its location and spatial heterogeneity makes it an ideal case study to evaluate the impact of CC in strongly anthropized semi-arid areas of the Mediterranean. The combination of factors favorable to agriculture (high solar radiation and mild temperatures) allowed the development throughout the twentieth century of a very productive agriculture comparing with the rest of Spain [50]. However, due to the aridity of the DHS and the high water deficit 400 hm 3 /year agriculture is highly dependent on irrigation, which has required a large degree of regulation [51], the transfer of water from neighboring areas (the most important being the Tagus-Segura transfer [52]) and the overexploitation of aquifers [45,53]. This overexploitation, together with the decrease in snowfall, due to the increase in temperature related to CC, has meant that 46% of the springs existing in 1916 in the Segura basin have now disappeared [54].
The IPCC-AR5 RCP4.5 scenario [7] estimates a stabilization of temperature increases for the second half of the 21st century, around 1.5 • C with respect to the reference period 1961-1990. On the other hand, the RCP8.5 scenario estimates a significant increase in global temperatures of around 3 to 5 • C at the end of the 21st century with respect to the period 1961-1990. As a result, in addition to the increase in the ET 0 , significant reductions in precipitation are expected, together with an increase in adverse phenomena such as floods and the intensification of droughts, which will cause significant alterations in water availability both in the DHS and in inter-basin transfers [55].

Data Sources
Three CC scenarios were used: HISTORICAL, ET 0 for the period 1970-2000, and the IPCC-AR5 [6] RCP4.5 and RCP8.5 scenarios for the period 2020-2099. It is thus possible to compare the CC impact on the variable studied with respect to the first scenario. RCP4.5 and RCP8.5 cover time series of emissions and concentrations of the full range of greenhouse gases (GHGs), aerosols and chemically active gases, as well as land use and land cover [6]. RCP4.5 is an intermediate emissions scenario, which estimates a stabilization of GHG emissions from 2050; RCP8.5 is a high GHG emissions scenario, which considers a significant increase throughout the 21st century.
We tried to maximize reproducibility implementing the process described in the methodology in R-CRAN [56], an Open Source program for data analysis that can handle and analyze large volumes of information, and that has a large number of packages available to different types of the calculations. Therefore, it can be applied to one or more time series (one or more weather stations) and for different datasets. The Open Data series of the AEMET [7], regionalized for the Iberian Peninsula with more than 2500 weather stations, are used as a starting point.
AEMET regionalized projections of 16 models (Table 1) were used to estimate daily ET 0 and to generate the ensembles. Such projections were carried out within the framework of the PNACC, whose objective was, among others, to generate daily series of maximum and minimum temperatures and precipitation, considering local factors in such estimation. To this end, they generated daily series associated with the network of AEMET weather stations based on a regionalization using the statistical method of analogs from the global models participating in the IPCC-AR5 [6] and the project Coordinated Regional Downscaling EXperiment (CORDEX) [2]. The analogs method we used to obtain the series is described in [5]. It is based on the search for historical synoptic situations similar to the days that are regionalized. Its advantage is that, when using time patterns, it keeps the spatial covariance structure of local time, and does not make assumptions about the shape of the probability distribution of the variables. Amblar Francés et al. [5] carry out such regionalization using the following process: (a) to obtain a synoptic classification of atmospheric situations, estimating their similarity using Euclidean distance, and (b) to use as temperature at 2 m, temperatures at three pressure levels (850, 700, and 500 hPa), the zonal and meridian components of the wind at sea level, the pressure at sea level, the temperature at 2 m of the previous day, the specific humidity at 700 hPa and the insolation as predictors.  [57] x x x x ACCESS1_3 [57] x x bcc_csm1_1 [58] x x x x bcc_csm1_1_m [58] x x BNU_ESM [59] x x x x CMCC_CESM [60] x x CMCC_CM [61] x x CMCC_CMS [62] x CNRM_CM5 [63] x x x x inmcm4 [64] x x x x IPSL_CM5A_MR [65] x x MIROC_ESM [66] x x x x MIROC5 [66] x x x x MPI_ESM_LR [67] x x x x MPI_ESM_MR [67] x x x x MRI_CGCM3 [68] x x x x TOTAL 16 12 13 10 To apply the study to the DHS, maximum and minimum temperature data associated to 48 weather stations of the AEMET network have been used (Figure 1). The emission scenarios selected in this work are those defined in the IPCC-AR5 for intermediate emissions (Representative Concentration Pathway 4.5-RCP4.5) and high emissions (RCP8.5) [6] for the period 2020-2099.
As can be seen in Table 1, 16 regionalized models are available, although this number is not the same for RCP4.5 and RCP8.5; therefore, to simplify the process, the 16 regionalized models have been evaluated with respect to the observed data, but the MMEs have been generated for the 10 regionalized models available in both RCP scenarios (10 in total).
A third historical scenario is also included with daily data for the period 1961-2000 (HISTORICAL scenario). This scenario includes the series of the previous regionalized models and a series of observed data for that period, used to evaluate goodness of fit and to generate the MME models. For this purpose we used data from the data set Iberia01 [69] that has daily rainfall and temperature series for the Iberian Peninsula obtained from the interpolation of 3486 weather stations (including the weather stations used in this study). The advantage of this type of data over the observed data is that anomalous values and inhomogeneities are removed and the data are spatially distributed in a grid with a resolution of about 12 km.
The series of ET 0 were estimated from the maximum and minimum daily temperature data associated to each of the 16 models. A total of 16 regionalized models (Table 1)  To train and validate the MMEs, the ET 0 series of the HISTORICAL scenario was divided into a training dataset (1 January 1961 to 31 August 1987) and a test dataset (1 September 1987 to 31 December 2000). As the main objective of this study is the estimation of the future series, the last period was selected for validation. The ET 0 series estimated from the daily maximum and minimum temperature obtained in Spain02v5 was used as reference series.

Multi-Model Ensemble
Eleven MME methods to combine forecasts of individual regionalized models were analyzed: three simple ensembles, four regression-based ensembles, two geometric-based ensembles, and two machine learning (ML) based ensembles. Simple ensembles, regressionbased ensembles, and geometric-based ensembles are implemented in the R package GeomComb [70]. Bayesian Model Averaging (BMA) analysis in the R package BMS [71]. Random Forest (RF) and Support Vector Regression (SVR) in the R packages Random Forest [72] and kernlab [73], respectively. The parameter calibration of the last two methods was carried out using the R package caret [74]. Detailed descriptions of the algorithms are available in the previous references.
The simple ensembles used were Simple Average forecast combination (SA), Median forecast combination (MED), more robust to outliers than SA [75], and Trimmed Average forecast combination (TA). This last method uses a trim factor to eliminate a percentage of the extreme data on both sides of the distribution, being less sensitive to sampling outliers and asymetric distributions than SA [76]. In this study, the trim factor has been estimated by minimizing the RMSE of the predictions with respect to the training data-set.
Regression-based ensembles and geometric-based ensembles are also linear combinations of the members of the ensemble but the weights are optimal in terms of a risk function. Ordinary Least Squares (OLS) regression, first used to build ensembles in Granger and Ramanathan [77], uses ordinary least squares to estimate both the weights and an intercept: where f it is the value of the ensemble member or predictor i at time t; b 0 is the intercept, which acts as a bias correction; and w OLS i is the weight assigned to predictor i. Constrained Least Squares (CLS) forecast combination [77] is conceptually similar to OLS (Equation (3)), but subject to ∑ N i=1 w CLS i = 1 and b 0 = 0. It s main advantage over OLS is that CLS produces better estimations when the predictors are highly correlated but, having no b 0 , it does not correct bias. Least Absolute Deviation (LAD) [78] is another lineal regression-based method (equal to Equation (3)) but w LAD(N) i are estimated by minimizing absolute errors instead of squared errors, as in OLS, for this reason it is more robust if outliers are present.
Bayesian Model Averaging (BMA) [79,80] produces a weighted average of probability density functions (PDFs), combining the models while calibrating the weights from the training data set. In this method, the weights are estimated using the posterior probability of each of the individual members. It is a relatively new approach with two main advantages [80,81]: (a) the estimation of weights based on a statistical a priori analysis may ensure objectivity and (b) the idea of assigning higher weights to forecasts that provide unique information is consistent with the recommendation of combining forecasts that incorporate diverse information [76]. This is because systematic and random errors of individual forecasts are more likely to cancel out in the ensemble if the individual forecasts convey different information. Hinne et al. [82] provides a detailed description of its general framework and practical utility, and Wang et al. [20] applies it to CC scenarios.
Standard eigenvector (EIG1) and bias-correct eigenvector (EGI2) proposed in Hsiao and Wan [83] are the two geometric ensembles included in this study. Both estimate the linear combination weights from the sample mean squared prediction error (MSPE) matrix. The main difference among them is that EIG2 uses the centered MSPE matrix [70]; therefore, EIG2 is more advisable if any of the components of the combination is skewed.
Random Forest (RF) and Support Vector Regression (SVR) are the two machine learning approaches tested in this study. Both algorithms are fully explained in Hastie et al. [84] or Kuhn and Johnson [74] and its application to MMEs with CMIP5 data problem is presented in Wang et al. [20].
RF is a nonparametric algorithm that builds an ensemble of decision trees [85]. Each tree is calibrated using a bootstrapped subsample of cases, and the features to perform each split in the trees are selected from a random subsample of the whole feature set. When all trees have been trained, without pruning, each new case is analyzed by all trees and the final prediction is obtained averaging the results. RF uses two parameters: number of regression trees (n tree ) to grow, and number of features randomly sampled in each split (m try ). Once the model is trained, the prediction can be obtained aŝ where M is the number of trees, T m denotes single decision tree, and f t is a vector of predictors. In this study, n tree = 1000, as higher values do not generally produce any improvement [72,84] and m try has been calibrated using cross validation without repetition and minimizing RMSE. To do this, while saving computing time due to the high number of observations and variables, a cross-validation based on a partition of the training time series by decades in three blocks was carried out. The resulting RMSE values for each block were calculated, and finally the m try value that minimized the average RMSE was selected.
RF importance was estimated using a model-based approach, more related to the actual model performance than the traditional methods, that takes into account the correlation structure between the predictors [74]. The metric used is based on the estimation of MSE on the out-of-bag data for each tree, and then the same computed after permuting a variable. The differences are averaged and normalized by the standard error. Once done, the values obtained have been rescaled to percentages in order to compare the results between the various estimated models. In this way, it is possible to evaluate the contribution of each series to the final ensemble and to determinate which of them are more important in the forecast.
SVR [86] is an extension of Support Vector Machines (SVM) for regression problems. Instead of minimizing the sum of squared errors, it selects points, the so-called support vectors, whose residual absolute values |e| are larger than a threshold . These points will contribute linearly, instead of quadratically, to the objective error function to be minimized. The problem is then solved as a quadratic optimization problem. A regularization parameter C penalizes large residuals. SVR is generalized to nonlinear regression problems by including in the formulation the so-called kernel functions. In this way, the prediction of ET 0 can be obtained from the following general formulation, where K(X n , X) is the kernel function. In this study a radial basis function (RBF) was used as kernel function. SVR with the RBF kernel requires setting two parameters: C, the cost parameter, and γ, a parameter of the RBF kernel function that controls the degree of nonlinearity of the model. To reduce the computational cost, γ has been estimated automatically from the training data, following the methodology in Caputo et al. [87], as advised by authors such as Kuhn and Johnson [74]; after that, C has been calibrated from cross-validation, following the same procedure as for RF. Values of γ close to 0 produce similar results to the linear model (they reduce the degree of nonlinearity). A large C produces a more flexible but also more likely to overfit [74].

Evaluation and Comparison of Individual Regional Models and Multi-Model Ensembles
Three statistics were used to estimate error validation: R-squared (R 2 ), bias percentage (PBIAS), to measure the mean tendency of a model to overestimate or infra-estimate ET 0 , and Kling-Gupta efficiency (KGE) that summarizes several aspects of the model performance in fitting reference data.
R 2 can be interpreted as the proportion of the variance of the reference variable reproduced by the simulation. It is defined as where sim i and re f i are the ET 0 (mm day −1 ) simulated by a specific model and the reference value, respectively, on a given day, and re f is the arithmetic mean of the reference values. Although it is the most widely used fitting statistic, it is very sensitivity to extreme values [88] and insensitive to constant biases (additive or multiplicative).
PBIAS (bias percentage) measures the average tendency of the simulated data to be larger or smaller than their reference values [89]. It is defined as Positive values indicate underestimation bias and negative values indicate overestimation bias [89].
KGE [90] is increasingly being used in calibration and validation studies due to its advantages over other metrics: it normalizes model performance into an interpretable scale and addresses deficiencies of other metrics (as the Nash-Sutcliffe efficiency and R 2 ) due to its formulation [90], summarizes correlation, bias and variability. In this study, we used the modified version proposed in Kling et al. [91] that ensures the no correlation of bias and variability ratios: where r is the lineal correlation between simulated and reference values; CV sim and CV re f are the coefficients of variation in simulated values and reference values (variability ratio in equation), respectively; and µ sim and µ re f are the mean in simulated values and reference values (bias ratio in equation). Its interpretation is similar to R 2 , KGE ranges from -Inf to 1; KGE = 1 indicates perfect agreement between simulations and observations [90]. Two one-way ANOVA tests were performed on the three goodness of fit statistics estimated with the test data-set to identify any statistically significant differences in the individual regionalized models and MMEs. The mean values of each statistic were compared using the observations in the 48 weather stations: • ANOVA 1. Its objective is to evaluate performance between the different series and forecast methods (Series factor). This variable factor consists of k = 27 classes (16 regionalized models, 3 simple ensembles, 2 geometric-based ensembles, 4 regressionbased ensembles, and 2 machine learning ensembles) and n = k · 48 stations = 1296 observations. • ANOVA 2. In addition, to better interpret the results, a second analysis was designed, with the same data but grouping the series according to their type (regionalized models, simple ensembles, regression-based ensembles, geometric ensembles, and machine learning ensembles).
When the existence of an effect is discovered using ANOVA, the effects and significance of differences between the different models are evaluated with multiple pair contrasts based on a Tukey-Kramer contrast. Such contrasts allow to find homogeneous groups of classes. To evaluate the statistical asssumptions of ANOVA models (normality and homoscedasticity), we used the Kolmogorov-Smirnov test to evaluate normality and the Levene test [92] to evaluate homoscedasticity in the residuals. KS was not significant, but the significance of the Levene contrasts revealed heteroscedasticity. In this case, the covariance matrix of the estimated parameters is not robust enough, and the Tukey-Kramer contrast is less reliable. To correct the heteroscedasticity we used a heteroscedasticity-consistent covariance matrix of the parameters (HC3) [93] in ANOVA and the Tukey-Kramer contrast.

Temporal and Spatial Patterns of ET 0
In addition to obtaining daily series of ET 0 , the spatio-temporal characterization of the CC impact on ET 0 has been carried out by an analysis of the yearly ET 0 in the HISTORICAL scenario and in the RF estimations in the RCP4.5 and RCP8.5 scenarios, as RF resulted to be the most accurate method.
The temporal trend of the annual ET 0 and its significance have been calculated for the average series of all 48 weather stations included in the study. This trend has been estimated using a robust regression based on the Theil-Sen estimator [94] and its significance with Mann-Kendall trend test. These are non-parametric methods with great robustness if statistical assumptions, such as time dependence, are not met [95]. The Theil-Sen estimator is a generalized-based median method that fits a regression line by choosing the median of the slopes of all lines through pairs of points as slope estimation (β 1 ); the sign of β 1 represents the direction of change and its value indicates the steepness of change [96]. Mann-Kendall monotonic trend test [97] is based on the correlation between the ranks of a time series and their time order.
In order to analyze the spatial distribution of the expected changes of annual ET 0 , average values of ET 0 at annual scale were obtained for the 48 stations included in the study area for the period 1971-2000, used as reference period, and for the periods 2041-2070 and 2071-2100. These annual values were then interpolated for both periods using Regression Kriging, a technique that combines a global regression model to establish the relation between the dependent variable and different predictors with a local interpolation of its residuals using, in this case, ordinary kriging. Due to the good results obtained in previous works [26] Random Forest was used as the regression model.
Seven geographical variables were initially to be used as predictors for RF: (1) elevation above sea level (m), using a DEM with resolution of 25 m (DTM25) [98]; (2) Euclidean distance to the coast (m) (COASTDIST); (3) mean annual potential irradiation (W/m 2 year 1 ) (POTI), obtained according to the methodology proposed in Hofierka and Suri [99]; the (4) inverse and (5) logarithmic transformation of the elevation; and the (6) inverse and (7) logarithmic transformation of distance to the coast. These variables can act as proxies for some of the meteorological variables relevant in the calculation of ET 0 but not included in Hargreaves' equation. For example, POTI is a proxy of the global radiation; DTM25 is a proxy of the ratio of atmospheric pressure to atmospheric pressure at sea level or of wind magnitude, while COASTDIST acts as a proxy of the ratio T/DeltaT, humidity, or cloudiness [26].
The multicollinearity between predictors was evaluated using the variance inflation factor (VIF), eliminating in an iterative manner the variables with a threshold greater than 10, as proposed in Kutner et al. [100]. After performing the VIF, the logarithmic transformation of elevation and distance to the coast were eliminated due to high collinearity, so the final number of variables used as predictors was 5.
Once the spatial distribution of the annual ET 0 of the three scenarios was obtained, the differences were evaluated for significance using ANOVA and a posteriori contrasts described in Section 3.4.

Summarized Workflow
The conceptual scheme in Figure 2 summarizes the calculation process. A sample code for a weather station is included as Supplementary Materials. It computes (a) the estimation of ET 0 from the daily maximum and minimum temperatures associated with the individual regionalized models, (b) the validation of the individual regionalized models and the multi-ensemble models, and (c) the predictions for the RCP scenarios. The processes and code use are described in the file README.pdf, included in the Supplementary Materials. Figure 2A shows the downloading and preprocessing steps, described in Section 3.2. The maximum and minimum daily temperatures (TMX and TMN) of the weather stations included in the study area are downloaded and processed for the three considered scenarios and the 16 individual regionalized models; in addition, the observed data are downloaded and processed. Then, for each station, scenario and model (including the observed data), the ET 0 is obtained using the Hargreaves calibrated equation (see Section 2).
Once all the series of ET 0 are obtained, as summarized in Figure 2B, we start with the estimation of the 11 MMEs in an iterative way for each station (48) in the HISTORICAL scenario (see Section 3.3). For this purpose, 2/3 of the data series was used to fit the models (train data, from 1961 to 1987), and 1/3 as test data (from 1988 to 2000). Additionally, for the machine learning algorithms, a cross-validation is performed to obtain validation data in each K iteration. Once the MMEs are estimated, they are validated and the validation results compared (see Section 3.4) using one-way ANOVA and Tukey-Kramer post hoc tests using HC3 heterocedasticity correction. After that, the method that best fits the observed values is selected and the predicted series are generated from that method.
Finally, Figure 2C summarizes the characterization of the hypothetical impact of CC on the temporal and spatial ET 0 patterns in the study area. On the one hand, the Theil-Sen and Mann-Kendal regression is used to analyze the ET 0 trend (mm year −1 ) in the three scenarios, obtaining the average annual increase in each of them.

Performance of Individual Models and Multi-Model Ensembles
The ANOVA analyses performed on the goodness-of-fit statistics of MMEs and regionalized models show significant differences in the three analyzed statistics, both when each series is considered separately (F(26, 1269), p < 0.0001 in all three cases) and when the series are aggregated by type (F(4, 1291), p < 0.0001 in all three cases). Figure 3 shows the value distribution of such statistics, and also the groupings obtained with the Tukey-Kramer post hoc test. Tables A1 and A2 show their estimation both in calibration and validation. In general, the regionalized individual models (green boxes in Figure 3) have a good fit to the observed data. Figure 3 shows that R 2 > 0.6 in all weather stations except for one in BNU_ESM, although on average they underestimate ET 0 (negative bias in Figure 3b) except for bcc_csm1_1_m. This may be because the method we used to estimate ET 0 , take into account maximum and minimum daily temperatures, which are accurately estimated in their regionalization. Thus, Amblar Frances et al. [5] indicate, when carrying out the regionalization for all Spain, that the cause of the good fit may be that when generating the series, the uncertainty associated with the emissions predominates over those associated with the models and processes of regionalization that were used. An interesting pattern appears in the correlation coefficient R 2 (Figure 3a): although RF is the best, all MMEs methods are grouped as a (no significant differences are observed between them) including the simple ensembles MMEs; therefore, this statistic does not discriminate adequately the performance of the different methods. Traditionally, this statistic has been used to interpret the goodness of fit, but it may induce errors in the final conclusions (Figure 3b,c). In this case, the problem is that R 2 is insensitive to constant deviations (bias) [88]. Therefore, KGE does a better estimation of the true performance of the series, resulting in the machine learning and regression based ensembles, labeled by the post hoc tests as a to b, as the most accurate estimation method. KGE is a better statistic because it integrates the correlation, the variability and the bias (Equation (8)) [90,91].
In general, MMEs outperformed regionalized individual models, which is in agreement with the rest of the literature on MMEs [20,101]. It is, however, noteworthy that  Figure 3c shows a KGE outlier associated to weather station 7117; even in that case, RF is significantly better than the rest of ensembles: KGE around 0.80 in BMA, OLS, SVR, and LAD, while in RF KGE = 0.847. Although there are few studies where RF and SVM are used, for the purpose of our work, we can highlight that our results are consistent with those obtained in Wang et al. [20] for monthly temperatures with respect to the other two methods it used (BMA and SA). In our study the RF performance is even better, significantly better than SVM, BMA, or SA, even though we analyzed daily data.
Regression-based MMEs (orange color in Figure 3) also show a good performance. In general, they are not significantly different from SVM; it is noteworthy that, although no significant differences are observed in R 2 and PBIAS with respect to other methods such as geometric or simple ensembles, in KGE it is differentiated from the rest as b category. Its good performance may be because ET 0 estimated in this study is based on temperatures, and as pointed out by authors such as Amblar Frances et al. [5], temperature fields are usually smooth, and their statistical behavior is closer to normal than other variables, so regression methods obtain more accurate results. Among the methods used, LAD obtains slightly better performance probably because it is less sensitive to outliers and more stable, when predictors are highly correlated, than regression based on least square estimations, in which low fluctuations in the sample can cause large changes in the coefficients, producing poor predictive performance [102]. The worst performing method is CLS, which in the case of KGE is outside the b grouping of the post hoc tests. This is because it does not include a parameter b 0 acting as a correction of the systematic bias that occurs in the individual regionalized models, as OLS does.
The Eigenvector-based MMEs have a much poorer fit to the observed data than the rest of the MMEs. As pointed out in Hsiao and Wan [83], EIG2 is better than EIG1 due to the high bias in regionalized models. In fact, unlike what happened in Hsiao and Wan [83], EIG2 is significantly different to EIG1 in absolute PBIAS. In addition, this author points out its usefulness over the models based on regression when the performance of the different members are similar, but in our study there are significant differences among them.
Simple methods can be more robust than more complex techniques, due to their simplicity (no need to estimate any parameter or weighting of the members included in the analysis [75,103]) even with a small number of observations or underlying changes in the timeline [104]. That is the reason why they are included in the a post hoc group in Figure 3a. However, PBIAS and KGE values are poorer because they produce large errors when the predictors are biased [103]. Even so, it is interesting to use SA as a baseline with which to compare the outputs of the other methods [105,106]. SA appears in the cd − ce groups of the post hoc tests for KGE; in this case, all MMEs perform better than SA except EIG1. MED and TA slightly better on average in KGE and PBIAS, although without significant differences, probably due to their greater robustness against outliers [75]. Figure 4 shows, as an example, the prediction of ET 0 (mm day −1 ) in weather station 7152, located in the center of the DHS. The figure shows a hypothetical year (a), estimated from the average daily values or the test data, and the results for 1996 (b). Its goodness of fit in validation is good and similar to the average values in Figure 3. It includes the forecasts of the two best MMEs (RF and LAD) and SA as a reference series.  The multi-model ensemble methods RF and LAD follow the observed seasonal pattern. In general, the models underestimate the reference series in the months with the highest temperature (and therefore the highest amount of energy) and tend to overestimate it in the colder months. However, the percentage of bias of the whole RF (PBIAS = −0.55%) and LAD (PBIAS = −0.6%) validation series are lower than the rest of the series: in SA PBIAS = −1.2% and in inmcm4 percent bias is positive (PBIAS = 2.3%). Figure 5 represents the importance of the individual models in the RF ensemble (a summary is included in Table A3); it shows a relatively high dispersion of relative importance, reflecting differences in the behavior of different regionalized individual models depending on the location of weather stations, and therefore possible local effects. The results highlight MIROC5, CNRM_CM5, and MPI_ESM_LR as the most important, with mean relative importance values de 76.67%, 70.57%, y 68.86%, respectively. It is interesting that none of the three stand out for their performance in Figure 3, except MIROC5 and only in R 2 , moreover all three have a high negative bias. This behavior may indicate that there are patterns in them that are not contemplated in the rest of the series, and therefore, although are not highly accurate individually, they are useful in the forecasting ensemble.  Table 2) in both the HISTORICAL scenario (Z MK = 3.773, p = 0.0002) and the two RCP scenarios (Z MK = 9.842 and 11.920, p < 0.0001). For the RCP8.5 scenario, a significant increase (1.70 mm/year) is predicted in the ET 0 over the entire 21st century, according with the significant temperature increase predicted for that scenario. For the RCP4.5 scenario, the projected increase of ET 0 is considerably less than that of RCP8.5 (0.71 mm/year). This increase would be mainly concentrated in the first half of the period, followed by an stabilization by the end of the century, as can be seen in Figure 6. Annual evolution and trend of ET0. Historical and projected series Figure 6. Evolution of the observed ET 0 in the historical series and in the two climate change scenarios considered. The linear trend models have been obtained using the Theil-Sen estimator and their significance was calculated using the Mann-Kendal test. Figure 7 shows the distribution of ET 0 for the historical period and for the two periods contemplated for the RCP4.5 and RCP8.5 scenarios. These distributions are shown at both annual and seasonal scales, and also show the results of the posteriori contrast carried out. The ANOVAs performed for the five situations show significant differences in all cases (Annual: F(4, 235) = 19.687, p < 0.0001; Summer: F(4, 235) = 5.724, p < 0.0002; Winter: F(4, 235) = 12.802, p < 0.0001; Autumn: F(4, 235) = 15.257, p < 0.0001; Spring: F(4, 235) = 30.975, p < 0.0001). According to the values obtained for the F statistics in the different seasons, the highest increments of ET 0 are observed for spring, while the lowest variation has been obtained for summer. Tomas-Burguera et al. [107] analyzed the evolution of the ET 0 in Spain for the period 1961-2014, and they also obtain the highest increments of ET 0 for the study area in spring. The results obtained in both works for that season may have or may be already having repercussions in water resources availability, these repercussions should be analyzed in future studies.
With regard to the ex-post contrast carried out to compare the ET 0 distributions obtained for the different periods and scenarios, in all the cases analyzed, with the exception of some cases of summer, there are significant differences between the historical period and the periods corresponding with the two climate change scenarios. In summer, only the ET 0 values obtained for the period 2071-2100 are significantly different from the HISTORICAL scenario, another indication that summer is the season with lowest ET 0 increases. In autumn, differences are not great, although enough for all the scenarios to be significantly different from HISTORICAL; in this case, RCP8.5 for 2071-2099 is significantly different from the rest of the scenarios. On the other hand, for the RCP4.5 scenario no significant differences have been obtained between both periods in any of the five situations analyzed, which is a consequence of the stabilization of the increase in temperature and therefore of the ET 0 foreseen for that scenario. On the other hand, for the RCP8.5 scenario there are significant differences between both periods for the annual scale as well as for spring and autumn, which agrees with the important and continuous increase of temperature foreseen for the RCP8.5 scenario throughout the 21st century. Furthermore, it should be noted that for the period 2041-2070 none of the five situations analyzed show significant differences between both scenarios. On the other hand, for the period 2071-2100 significant differences have been obtained between both scenarios both on an annual scale and for autumn and spring, which also agrees with the evolution of the estimated temperature for both scenarios. Table A4 shows the average, the standard deviation and the standard error of ET 0 values for each of the five cases analyzed, both on an annual and on a seasonal scale. Figure 8 shows the spatial distribution of the ET 0 estimated variation for the two scenarios analyzed in the periods 2041-2070 and 2071-2100 with respect to the reference period. The same spatial pattern exists for the four scenarios represented, with the largest increases of ET 0 in the basin headwaters (northwestern of the study area) and the smallest increases in coastal areas.

Spatial Distribution of Annual Variation in ET 0
For the RCP4.5 scenario, only in the northwest and north end of the study area, an increase larger than 200 mm/year in both analyzed periods is observed. In the rest of the basin the increase ranges between 50 and 200 mm/year. Moreover, for this scenario, hardly any changes are observed between the two analyzed periods, which confirms what was mentioned in previous sections. For the RCP8.5 scenario, the highest increases of ET 0 are estimated in a large area in the northwest of the DHS, especially for the period 2071-2100, with increases larger than 250 mm/year, while the increase exceeds 150 mm in a large part of the DHS. In this case, changes are observed between both periods. This spatial pattern could mean a decrease in the water resources available in the study area, as most of them are produced in the northwest sector, due to a larger rainfall and permeability of the rocks. Moreover, the outstanding increase of ET 0 observed in the northwestern sector may produce an increase of the evaporation losses in the reservoirs, a reduction of the channel runoff and a reduction of the aquifer recharges. Tomas-Burguera et al. [107] also obtain the largest increases of ET 0 for the period 1961-2014 in the northwest zone of the study area, with a significant increase of 15-30 mm/decade. Table 3 compares the average variation in ET 0 estimated both in this work and in CEDEX [45] for the DHS (as the difference in percentage between the HISTORICAL scenario and the two climate change scenarios and the two periods analyzed in this study). Although the methodology used is not the same in both cases, it shows the similarity between both studies. The variation of estimated ET 0 in this work is greater than the one estimated in CEDEX [45], with the exception of scenario RCP8.5 for the period 2071-2100. The differences between both studies may be due to the different methodologies used to obtain the ET 0 values, to the fact that the data shown for CEDEX [45] are a simple average of the ET 0 variations obtained for six regionalized projections and to the difference in the control periods used, although the influence of the latter would be minimal according to what is observed in Figure 6. In summary, for the RCP4.5 scenario the estimated ET 0 average increase with respect to the period 1971-2000 is 10.7%, for the period 2041-2070, and 11.8% for the period 2071-2100, while for the RCP8.5 scenario this increase is 11.2% for the period 2041-2070 and 15.3% for the period 2041-2070.

Conclusions
Multi-model ensemble methods generally outperform all or almost all individual series. The observed fit in validation, especially considering that the forecasts are have been carried out for a daily time step, justifies the use of this type of method and its usefulness to characterize the climatic variables for different CC scenarios.
We conclude that machine learning MMEs are efficient and highly useful methods, with ability to obtain better predictions, with high precision and better reproduction of historical climatic variables. In our study, RF is salient because of two aspects: (a) its accuracy is significantly better than the rest of individual regionalized series and multimodel ensembles methods, also reducing the bias compared to the rest of the methods, which generally underestimate ET 0 in the warmer months and overestimate it in the colder months, and (b) that RF conveys the different representations of physical processes in individual models, reducing their implicit uncertainty. In addition, RF has several advantages as a predictive model. It is robust to outliers and nonlinear data, produces an estimation of variable importance that allows to weight or select variables, works efficiently with large volumes of data, produces an unbiased estimation of generalization error and is computationally lighter than other machine learning algorithms like SVM. For all these reasons, RF can be useful when analyzing other climate variables, such as precipitation.
However, it is necessary to highlight the importance of using and comparing various methods as model accuracy can differ between variables, climatic zones, or even periods. This paper proposes a framework based on Open Source data analysis software to objectively compare different models and select the most accurate. It can be used with different datasets and study areas, and include other multi-model ensembles, such as REA, inverse Rank, or other machine learning algorithms in futures studies.
Using alternative statistics to evaluate goodness of fit is also useful. Traditionally, R 2 has been used to evaluate correlation, and CC-oriented studies have given special importance to bias. The main advantage of KGE decomposition is that it evaluates, in an integrated way, three fundamental aspects: correlation, variability and bias. In this study, KGE does a better discrimination of the performance of the different ensembles than R 2 .
ET 0 trend reveals a statistically significant increase in the three analyzed scenarios, although higher in RCP4.5, where it tends to stabilize towards the second part of the century. This increase, more accentuated in spring, is also observed when carrying out an intra-annual analysis. This trend can cause serious problems in this semi-arid area because of its high water deficit and strong anthropic pressure. The problem might be even exacerbated by the expected reduction in rainfall. The strong water demand by irrigated crops, a large population density, and several touristic resorts increases precisely in the months with a larger increase in ET 0 . Moreover, the largest ET 0 increases are expected in the headwaters, the area where most of the DHS water resources are produced.
Supplementary Materials: The following are available online at https://www.mdpi.com/2073-4 441/13/2/222/s1, Includes the reproducible code for a weather station: (a) the estimation of ET 0 from the daily maximum and minimum temperatures associated with the individual regionalized models, (b) the validation of the individual regionalized models and the multi-ensemble models, and (c) the forecasts for the RCP scenarios. A complete directory is included with the project implemented in the R language; the document "README.pdf" includes the description and help of the implemented functions. Funding: This research was funded by Spanish Ministry of Economy, Industry and Competitiveness/Agencia Estatal de Investigación/FEDER (Fondo Europeo de Desarrollo Regional) grant number CGL2017-84625-C2-2-R.

Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.

Data Availability Statement:
The datasets used in this study were downloaded from publicly available sources. Daily series of maximum and minimum temperatures for individual regionalized projections (Historical and RCP scenarios) [5,7] are available on the Spanish State Agency of Meteorology (AEMET) data repository (http://www.aemet.es/es/serviciosclimaticos/cambio_climat/datos_ diarios). Daily series of maximum and minimum temperature historical observations (reference data, Iberia01 [69]) are available on the Spanish National Research Council (CSIC) data repository (https://digital.csic.es/handle/10261/183071).

Conflicts of Interest:
The authors declare no conflict of interest. Table A1. Goodness of fit statistics in calibration for the different series. N = 1296 observations (n k = 48 weather stations by class).