Fusion of Five Satellite-Derived Products Using Extremely Randomized Trees to Estimate Terrestrial Latent Heat Flux Over Europe

: An accurate estimation of spatially and temporally continuous latent heat flux (LE) is essential in the assessment of surface water and energy balance. Various satellite-derived LE products have been generated to enhance the simulation of terrestrial LE, yet each individual LE product shows large discrepancies and uncertainties. Our study used Extremely Randomized Trees (ETR) to fuse five satellite-derived terrestrial LE products to reduce uncertainties from the individual products and improve terrestrial LE estimations over Europe. The validation results demonstrated that the estimation using the ETR fusion method increased the R 2 of five individual LE products (ranging from 0.53 to 0.61) to 0.97 and decreased the RMSE (ranging from 26.37 to 33.17 W/m 2 ) to 5.85 W/m 2 . Compared with three other machine learning fusion models, Gradient Boosting Regression Tree (GBRT), Random Forest (RF), and Gaussian Process Regression (GPR), ETR exhibited the best performance in terms of both training and validation accuracy. We also applied the ETR fusion method to implement the mapping of average annual terrestrial LE over Europe at a resolution of 0.05 ◦ in the period from 2002 to 2005. When compared with global LE products such as the Global Land Surface Satellite (GLASS) and the Moderate Resolution Imaging Spectroradiometer (MODIS), the fusion LE using ETR exhibited a relatively small gap, which confirmed that it is reasonable and reliable for the estimation of the terrestrial LE over Europe.


Introduction
The latent heat flux (LE) governs the associated heat flux of the interaction between land surface and its atmosphere [1], including vegetation transpiration, soil evaporation, and plant canopies interception evaporation [2]. In general, LE returns approximately 60% of rain back to the atmosphere and also helps to cool the land surface by consuming an enormous amount of heat [3]. Europe makes up the western fifth of the Eurasian landmass. Thus, an accurate LE estimation over Europe plays a key role in many climatic, hydrologic, and agricultural applications [4]. As a confederation of regional observation networks, FLUXNET routinely provides long-term eddy covariance (EC) flux measurements of carbon, water vapor, and energy exchange over America, Europe, Asia, Africa, and Australia. About one-third of FLUXNET's EC sites are located in Europe. However, as result of the spatial heterogeneity, point-based measurements of terrestrial LE cannot be applied for continuous monitoring on a large scale [5].
Satellite-based observations can provide us with a more readily available monitoring for land surface and atmospheric properties. Combined with well-known flux equations [6,7], we can effectively derive spatially and temporally continuous LE estimates over large areas. Currently, various satellite-derived LE products at moderate spatial resolution are generated, including the Moderate Resolution Imaging Spectroradiometer (MODIS) LE product (MOD16) with a resolution of 0.5 km / 1 km and 8 days [8,9], the Global Land Evaporation Amsterdam Model (GLEAM) LE product with a resolution of 0.25 • and 1 day [10], the Global Land Surface Satellite (GLASS) LE product with a resolution of 1 km / 5 km and 8 days [11], or the Breathing Earth System Simulator (BESS) LE product with a resolution of 1 km and 8 days [12]. However, when intercomparing and evaluating with long-term ground measurements from in situ flux networks, satellite-derived LE products showed large discrepancies. Previous studies over China [13,14], Brazil [15], Spain [16], and South Africa [17] found that in the arid and semiarid climate, MOD16 underestimated the LE of the irrigated crop in the growing season [18]. Additionally, some reanalysis and data assimilation LE products have a relatively high temporal resolution but a relatively coarse spatial resolution, such as the Global Land Data Assimilation System (GLDAS) datasets with 0.25 degree spatial resolution and 3 hour temporal resolution [19,20]. Validation against tower flux measurements indicates that LE datasets with a relatively coarse spatial resolution tend to contain large uncertainties regarding the heterogeneous terrestrial biosphere [2]. Therefore, deriving terrestrial LE estimates accurately with both a high spatial resolution (e.g., 1 km) and a reasonable temporal resolution (e.g., daily) remains a central challenge.
To meet this demand, many satellite-derived algorithms are designed to implement LE terrestrial estimates, mainly using process-based algorithms and empirical/semi-empirical algorithms, etc. Traditional process-based algorithms, such as the Surface Energy Balance System (SEBS) [21], the Single-Source models [21][22][23][24], the Two-Source models [25][26][27][28][29][30], the Penman-Monteith (PM) equation [8,9], and the Priestley-Taylor (PT) algorithm [7,[31][32][33], calculate terrestrial LE using satellite-derived datasets related to meteorological observations based on the surface energy balance (SEB) equation [34]. However, as a result of the uncertainties that exist in different model structures, the simulation results of each algorithm show large discrepancies [35]. Empirical/semi-empirical algorithms are convenient to apply but the need to determine the site-specific parameters makes them difficult to implement accurately with variable surface conditions [36]. As the most viable empirical algorithms, machine learning methods have demonstrated great success in predicting complex problems and have been widely used to estimate LE. For instance, Wang [37] used artificial neural network (ANN) combined with meteorological parameters, remote sensing variable (NDVI), and ground-measured LE from 85 EC flux tower sites to predict LE over North America. Bodesheim et al. [38] obtained a global half-hourly LE product using FLUXNET observations, remotely sensed and meteorological data with the Random Forests (RF) method. Xu et al. [39] generated daily ET from in situ flux tower sites on the Heihe River Basin scale at a spatial resolution of 1km by using remote sensing variables (LAI, land cover), air temperature (Ta), relative humidity (RH), solar radiation (Rs), and precipitation (P) with the RF method. However, simulation under complex heterogeneity conditions by these data-driven methods remains questionable because of the spatial representation limitation of sparse training data at certain sites.
Far from discouraging the development of LE simulation methods over large scales, the discrepancies and uncertainties lie in the individual LE products, thus providing an opportunity to foster further research. In recognition of above challenges, the approach of fusing multiple LE products would be preferable. Aiming to develop effective fusion methods, these efforts have ranged from the simple model averaging method (SMA) [40] to more complex approaches such as Bayesian model averaging (BMA) [11], empirical orthogonal function (EOF) [41], or integration methodologies considering consistency with water cycle products [42]. Yao et al. generated the GLASS LE product by fusing of five LE products using the BMA method to enhance terrestrial daily LE estimates [11]. Zhu et al. also reported that merging ET products using the BMA method can achieve more reliable estimations than the SMA method across north China [43]. However, a limitation associated with these fusion methods concerning the weight calculating efficiency of the individual products has somewhat affected their wider application [2]. Regarding this, approaches involving multiple LE product ensembles based on machine learning techniques have the potential to provide accurate terrestrial LE estimates. Extremely Randomized Trees (ETR) as a new tree-based machine learning ensemble method has shown better robustness and regression accuracy compared with other traditional models [44]. Experiments show that ETR is generally competitive and even superior to RF in terms of accuracy [45]. Although ETR has been substantially applied to regression approaches, there is a lack of experiments on datasets fusion problems, especially in improving terrestrial LE estimates by merging multiple LE products With the aim of reducing uncertainties in individual satellite-derived LE products, in this study, we used Extremely Randomized Trees (ETR) to enhance terrestrial LE estimations over Europe by fusing five individual LE products. We had three major objectives: (1) to evaluate individual satellitederived terrestrial LE products produced by five classic LE algorithms using ground-measured EC data from the FLUXNET; (2) to evaluate the ETR fusion method by comparing it with three other machine learning methods using the EC observations and two other global LE products; and (3) to apply the ETR method to map the mean annual terrestrial LE with 0.05° spatial resolution in the period from 2002 to 2005 based on MODIS and Modern Era Retrospective Analysis for Research and Applications (MERRA) meteorological data over Europe.

Data
We produced individual satellite-derived terrestrial LE products using five traditional LE algorithms. Table 1 describes the five LE products in detail. The forcing data included MODIS Fraction of Absorbed Photosynthetically Active Radiation (FPAR) data with 500m spatial resolution, MODIS NDVI and GLASS LAI data with a 0.05 degrees spatial resolution, MERRA meteorological data with a 1/2 * 2/3 degrees spatial resolution. The daily surface net radiation (Rn), shortwave radiation (Rs), relative humidity (RH), air temperature (Ta), vapor pressure (e), and wind speed (WS) from MERRA were used in this study. The FPAR and MERRA meteorological data were spatially interpolated into 0.05 degrees by the bilinear method in order to be consistent with the MODIS pixel size. The individual LE products are briefly described in the following. The RS-PM-based LE product was generated by an improved Penman-Monteith equation [9]. The total LE consisted of four components: the canopy transpiration (LEc), interception evaporation (LEi), saturated wet soil evaporation (LEws), and unsaturated soil evaporation (LEds). FPAR was used as an improved vegetation cover fraction. The calculations of aerodynamic, boundary-layer, and canopy resistance were also modified compared to the beta version [11]. A more detailed description of the RS-PM algorithm can be found from Mu et al. (2011) [9]. The input variables include Rn, Ta, Tmin, RH, and e from the MERRA data, FPAR derived from the MOD15C2 product, and LAI derived from the GLASS product. The daily RS-PM-based LE product has a 0.05 degrees resolution and covers Europe from 2000 to 2006.

Shuttleworth-Wallace dual-source (SW)-based LE product
The SW-based LE product was generated by the Shuttleworth-Wallace dual-source (SW) model [46]. On the basis of the energy balance theory, the SW algorithm divided the total LE into two components: the vegetation transpiration (LEv) and the soil evaporation (LEs). Assuming aerodynamic mixing occurring at a mean canopy source height within the canopy [35], the LEv and LEs can be estimated separately using two Penman-Monteith equations. A more detailed description of the SW algorithm can be obtained from Shuttleworth and Wallace (1985) [46]. The input variables include WS, Rn, Ta, RH, and e from the MERRA data and GLASS LAI. The daily SW LE product covers Europe with a 0.05 degrees spatial resolution for the period from 2000 to 2006.

Priestley-Taylor of the Jet Propulsion Laboratory (PT-JPL)-based LE product
The PT-JPL-based LE product was generated by a novel Priestley-Taylor equation proposed by Fisher et al. [31]. The atmospheric constraints (RH, Vapor Pressure Deficit) and ecological constraints (LAI, FPAR) were associated to determine dynamic coefficients to transform potential ET to actual ET [7]. The extended empirical parameters were determined without using any ground measurements. More formal information about the PT-JPL can be obtained from Fisher et al. (2008) [31]. The input variables to generate the PT-JPL-based LE product include Rn, RH, Ta, and e from the MERRA data, NDVI and FPAR derived from the MODIS product, and LAI derived from the GLASS product. The daily PT-JPL-based LE product covers Europe during the period from 2000 to 2006 with a spatial resolution of 0.05 degrees.

Modified satellite-based Priestley-Taylor (MS-PT)-based LE product
The MS-PT-based LE product was generated by a modified satellite-based PT (MS-PT) model proposed by Yao et al. [32]. The MS-PT algorithm separated LE into four components: the canopy transpiration (LEc), vegetation interception evaporation (LEi), unsaturated soil evaporation (LEds), and saturated wet soil evaporation (LEws). The diurnal air temperature range (DT) was used to quantify the apparent thermal inertia (ATI), which parameterized surface soil moisture (SM) constraints [2]. It only requires four parameters as input: the surface net radiation, air temperature, DT, and NDVI. A more formal introduction to the MS-PT algorithm can be obtained from Yao et al. (2013) [32]. The LE product input variables include Rn, Ta, Tmax, and Tmin from the MERRA data and MODIS NDVI data. The MS-PT LE product has the same spatial and temporal resolution as the PT-JPL-based LE product from 2000 to 2006.

Semi-empirical Penman algorithm (SEMI-PM)-based LE product
The SEMI-PM-based LE product was generated by a Semi-empirical Penman algorithm (SEMI-PM) equation proposed by Wang et al. [47]. The obvious difference between this algorithm and the other four is that the SEMI-PM used wind speed to make the terrestrial LE estimation, which may influence the annual or decadal LE variability [11]. The empirical coefficients for this algorithm were calibrated by 64 global flux tower sites, and the average correlation coefficient of this algorithm was up to 0.94. More formal information about SEMI-PM can be acquired from Wang et al. (2010) [47]. The SEMI-PM LE product requires Rs, Ta, WS, and RH from the MERRA data and MODIS NDVI data. This daily LE product with a 0.05 degrees spatial resolution is also available over Europe during the period from 2000 to 2006.

Eddy covariance data
Our study used FLUXNET eddy covariance (EC) data to evaluate and validate the five satellitederived terrestrial LE products and four machine learning fusion methods. The ground measurements were mainly gathered from 76 Europe flux tower sites ( Figure 1) and each site spanned more than one growing season from 2000 to 2009. These sites were covering five major plant function types (PFTs): deciduous broadleaf forest (DBF, 12 sites), evergreen needleleaf forest (ENF, 18 sites), grassland (GRA, 27 sites), mixed forest (MIF, 2 sites), cropland (CRO, 17 sites), and these PTFs represent the major terrestrial biomes in Europe. Evenly distributed in space, half of the sites were uniformly chosen to train the machine learning models, and the rest of the sites were used to validate the model fusion performance. These ground-measured data included half-hourly or hourly Rn, ground heat flux (G), Rs, WS, Ta, RH, e, sensible heat flux (H), and LE. Because the EC observation method suffers a problem related with energy imbalance, the sum of obtained H and LE is generally less than the total available energy [48]. Therefore, the measured LE can be corrected as follows [49]: is the corrected LE, and the and are the uncorrected H and LE, respectively. Figure 1. Distribution of the 76 FLUXNET eddy covariance (EC) sites for different terrestrial biomes over Europe. The "Train" represents the training sites and the "Test" represents the validation sites.

Extremely Randomized Trees
Extremely Randomized Trees (Extra-Trees, ET) proposed by Geurts is a tree-based machine learning ensemble model for supervised learning of classification and regression [50]. Extra-Trees Regression (ETR) constructs an ensemble of regression trees based on a classical top-down procedure. The obvious discrepancy with other ensemble models is that ETR constructs the trees with all the learning sample instead of a bootstrap replica, and it selects a random cut-point for each feature under consideration rather than computing a locally optimal one [50]. ETR has three important parameters: the K, nmin, and M. These three parameters can be adapted to the specific case either manually or automatically, using cross-validation for example.
The parameter K denotes the number of random splits and its available range is 1 to n, where n denotes the number of attributes. The smaller the K is, the stronger the randomization of the trees will be. Experiments have shown that the optimal values of K are K = √n for classification and K = n for regression [50].
The parameter nmin denotes the sample counts to split a node. Lager nmin leads to a higher bias and smaller of both trees and variances. Optimal default values of nmin depend, in principle, on the output dataset noise. Fully grown trees will generate a very noisy output which always leads to an over-fit problem.
The parameter M is the count of trees. The more trees, the better accuracy, in principle [51]. The compromise between computational requirements and accuracy is the deciding factor for the choice of an appropriate value of M.
The score measure in ETR is the relative variance reduction. For a sample and a split , the score is defined as follows: where and denote the two subsets of cases from corresponding to the two outcomes of a split . The | represents the variance of the output in the sample . The ETR method has no need for the optimization of the discretization thresholds, so it is competitive with other ensemble models in terms of the accuracy, computing times, and ease of implementation.

Gradient Boosting Regression Tree
The Gradient Boosted Regression Tree (GBRT) model, also known as the gradient boosting decision tree, is a boosting regression model consisting of an ensemble of decision trees [52]. GBRT builds the model by a stage-wise way, and it generalizes them by optimizing an arbitrary loss function. The approximation function for GBRT can be expressed as where denotes the weight for an individual decision tree ℎ( ; ). Each individual decision tree can be defined as follows: where represents the input variables, and denotes the classifier of each decision tree. The input dataset would be parted into regions by the trees. The represents the predicted constant for the corresponding region.
The GBRT can generate an appropriate nonlinear relationship automatically by the ensemble trees [53] and process skewed variables without transformations. GBRT can overcome the problem of over-fitting by combining hundreds of weak decision trees. The computational robustness and high scalability of GBRT make it a superior model as compared to a single decision tree [54].

Random Forests
Random Forests (RF) [51] as an enhancement of bagging [55] is an ensemble of single trees. The independent tree predictor in the ensemble is generated by randomly selecting input variables using the bootstrap [56] sampling method. The optimal split can be found either from all input variables or a preset-sized random subset during the construction of a single tree. The estimate variance of the trees can be reduced by increasing the randomness. Experiments found that RF methods significantly outperform other decision tree methods because the single trees typically generate high variance and are prone to over-fit [50]. The randomness injected into forests when generating decision trees may decouple prediction errors. Some errors can be cancelled out by taking an average of those predictions. As a result of the low correlation between independent trees, the RF can avoid falling into over-fitting problems in the process of regression [39].

Gaussian Process Regression
The Gaussian Process Regression (GPR) algorithm is a classical stochastic process method in probability theory. The GPR is a combination of random variables obeying the normal distribution [57]. The kernel of GPR is the Gaussian Process (GP) learning framework which is competitive in handling nonlinear relationships [58]. The kernel function of GPR is GP, which is a collection of finite random variables with a Gaussian distribution that can be used to describe the distribution of functions. The output data y and the input vector can be expressed as where is Gaussian noise. The prior distribution of the observation y after considering the noise can be written as where the ( , ) is the auto-covariance matrix of the input vector and the is the variance of noise. The union prior distribution of the observation y and the prediction * can be written as * ~ 0, ( , ) + ( , * ) ( * , ) ( * , * ) GPR can obtain both predictive mean and predictive variance, and it can obtain the optimal estimates through the predictive distribution of sample data.

Evaluation metrics
To evaluate the accuracy of individual satellite-derived LE products and the four machine learning fusion methods, the R 2 , Bias, and RMSE were selected as the evaluation metrics. R 2 is the square of correlation coefficient R and it is a metric to assess the agreement between estimates and observations. Bias is the average value of the differences between the estimates and ground measurements. Bias can be written as where denotes the number of samples, and denote the ith estimates and ground-measured LE, respectively. The RMSE is the root mean square error between estimates and observations; it measures the predictive skill and the closeness with the target. RMSE is calculated as

Experimental setup
Before fusion model construction, we produced five individual LE products using the RS-PM, SW, PT-JPL, MS-PT, and SEMI-PM algorithms based on the remote sensing variables (NDVI, LAI and FPAR) and MERRA meteorological data. EC ground measurements at all 76 flux tower sites were collected to evaluate the accuracy of these LE products.
Our study constructed the ETR, GBRT, and RF model by using the Ensemble modules on the Python platform. The main parameters to fit these three modules included n_estimators ， max_features, max_depth，learning_rate, and subsample. These optimal parameters can be adapted to the specific case either manually or automatically using cross-validation for example. The GPR was implemented using the Gaussian Processes for Machine Learning (GPML) toolbox in MATLAB, developed by Carl Edward Rasmussen and Hannes Nickisch. Different mean functions and covariance functions were evaluated and the one with lowest error was used. For all four machine learning models, ten-fold cross-validation was used to find the optimal parameter.
To compare the fusion methods, we trained and validated ETR and three other machine learning models using the same EC flux site data. Considering the influences of the representation of multiple PFTs on the model construction, for each PFT, our study chose half of the sites for training and the rest sites for inversion. As for DBF/ ENF/GRA/MIF /CRO, there were 6/9/14/1/9 sites for training and 6/9/13/1/8 sites for inversion. Finally, we trained the ETR model using all 76 available EC flux tower sites and enhanced the terrestrial LE estimates over Europe by the fusion of the five satellite-derived LE products.

Evaluation of satellite-derived terrestrial LE products
To evaluate the five satellite-derived LE products, the corresponding estimates extracted from the individual products were directly compared with ground measurements from the 76 EC flux tower sites. Figure 2 shows the scatter plots for the daily LE observations and estimates of different LE products. The results show that the SEMI-PM product demonstrated the best performance with the highest R 2 (0.61) and a small RMSE (26.42 W/m 2 ) among the five LE products. The R 2 of the SW is the smallest at 0.53, and the R 2 of three other LE products vary from 0. 55   The five LE products showed large discrepancies and uncertainties for different PFTs.  The results show that the accuracies of each LE product for different PFTs vary greatly and there are significant uncertainties between individual LE products. Moreover, we found that none of the individual satellite-derived terrestrial LE products can provide the most accurate LE estimates for all PFTs.

Model development using 39 training flux tower sites
To implement the fusion of five satellite-derived LE products using ETR and other three machine learning methods, the ground measurements collected at the 39 training sites were used to train the models. Figure 4 presents the scatter plots for the observations and LE training results using ETR and three other fusion models. ETR yields the best training performance with the highest R 2 of 0.98 and the lowest RMSE of 4.93 W/m 2 among the four fusion methods, followed by RF with an R 2 of 0.97 and RMSE of 5.94 W/m 2 . The descending training performance order of the four algorithms is ETR, RT, GBRT, and GPR.  Figure 5 presents the statistical summary of the evaluation parameters of the four fusion methods at the 39 training sites for each PFT. One can notice that the fusion estimates using ETR for different PFTs have the highest R 2 and lowest RMSE compared to the three other methods. For all the PFTs of the 39 training sites, the performance of the RF method is lower than ETR, but behaves better than the other two models, and GPR yields the lowest simulation accuracy.  Figure 6 presents the scatter plots for the LE observations at the 37 validation sites and the LE inversion results using ETR and the three other fusion models. The results show that ETR yields the best training performance with the highest R 2 (0.76) and the lowest RMSE (16.87 W/m 2 ) among the four fusion models. At the specific site scale, the performance of the ETR fusion method is significantly superior to the best performance any individual LE product (SEMI-PM with an R 2 of 0.61 and RMSE of 26.42 W/m 2 ). The accuracy of the GPR fusion method is lower than ETR but much higher than RF and GBRT with an R 2 of 0.76 and RMSE of 16.88 W/m 2 . The GBRT yields the lowest R 2 (0.72) and the highest RMSE (18.25 W/m 2 ) compared with the three other machine learning fusion methods, but it is still superior to the best performance of any individual LE product (SEMI-PM). The results show that the machine learning fusion estimates are superior to the five individual LE products, and among the four machine learning fusion methods, the ETR exhibits the best fusion performance.  Figure 7 shows the R 2 , Bias, and RMSE statistics of ETR and the three other fusion models for different PFTs at the 37 validation sites. Overall, whether it is R 2 (varying from 0.74 to 0.84), Bias (varying from -1.71 to 7.53 W/m 2 ), or RMSE (varying from 13.56 to 18.27 W/m 2 ), ETR is superior to the other three fusion models at most PFTs. The performance of the GPR method is lower than ETR, but better than RF and GBRT. GBRT yields the lowest validation accuracy with an R 2 ranging from 0.69 to 0.80, and RMSE ranging from 14.90 W/m 2 to 19.87 W/m 2 . For all the PFTs, MIF has the highest R 2 from 0.80 to 0.84 and the lowest RMSE from 13.35 to 14.90 W/m 2 , but yields the highest Bias, ranging from 6.93 to 7.53 W/m 2 . This may be affected by the original individual LE products, which have the same estimate tendency. The evaluation results of satellite-derived terrestrial LE products are shown in Figure 3. The DBF validation sites behave better than the three other PFTs with an R 2 ranging from 0.77 to 0.82 and RMSE ranging from 16.43 to 18.02 W/m 2 . CRO, ENF, and GRA with lower R 2 , varying from 0.69 to 0.75 compared with other PTFs, are also superior to the individual LE products which obtained R 2 varying from 0.53 to 0.64.

Model evaluation against 37 validation flux tower sites
Overall, compared to the other three machine learning fusion methods, ETR exhibits a relatively high validation accuracy and stability. The enhanced accuracy of the ETR fusion method makes it a reasonable option for improving LE terrestrial estimates over Europe.

Implementation of fusing five LE products using Extremely Randomized Trees
To implement the fusion of five satellite-derived terrestrial LE products using ETR over Europe, we retrained the ETR model based on the EC ground measurements at all 76 sites from 2000 to 2006 and the corresponding LE estimates extracted from five individual LE products. Figure 8 shows the statistics (R 2 , RMSE, and Bias) of the comparison between multiple individual LE products and the fusion LE estimates using ETR for different PFTs. In general, whether it is the R 2 (0.97), RMSE (5.85 W/m 2 ), or the Bias (approximately equal to 0 W/m 2 ), the fusion LE estimates are superior to the five satellite-derived terrestrial LE products. For all five PFTs, fusion LE estimates reduced the RMSE by 20.52 to 27.31 W/m 2 , representing 77.80% to 82.35% of the corresponding RMSE of the individual LE products. The Bias of individual LE products vary from 3.08 to 16.55 W/m 2 , and the fusion LE estimates reduced the Bias to about 0 W/m 2 . Therefore, the accuracy of the individual LE products was significantly improved by the fusion of multiple satellite-derived terrestrial LE products using the ETR method. The results showed that the ETR fusion method in this study could be implemented to generate a reasonable and stable regional terrestrial LE product.  Figure 9 shows the maps of annual terrestrial LE averaged from 2002 to 2005 for the RS-PM-, SW-, PT-JPL-, MS-PT-, and SEMI-PM-based LE product over Europe. All of the five products yielded lower LE estimates over the northern portion of Europe. In the high latitudes, the highest annual LE was approximately less than 20 W/m 2 , appearing on the area with latitudes higher than 65°N. As the latitude decreases, LE shows a trend from low to high. The most prominent difference between the five LE products was the RS-PM-and SW-based LE products estimated higher LE values in the central and southern portions of Europe. Compared with SEMI-PM-based LE product, the PT-JPL and MS-PT yielded higher LE estimation in Austria and Bulgaria. These discrepancies maybe caused by the spatial heterogeneity and the uncertainty existing in the structures of different LE product algorithms. We applied the fusion of five individual terrestrial LE products using ETR, GBRT, RF, and GPR to estimate the mean annual LE in the period from 2002 to 2005 with a resolution of 0.05° over Europe, respectively ( Figure 10). Four fusion products showed highly consistent spatial characteristics over Europe. When compared with the fusion terrestrial LE product using ETR, the three other fusion products generate some spatial differences ( Figure 11). The GBRT yields a higher annual LE than ETR over northern Europe. As the latitude increases, the discrepancy reaches up to 6 W/m 2 . The fusion LE product generated by RF shows consistent spatial characteristics with ETR. The difference between the two fusion LE products was within 1 W/m 2 in most parts of Europe. The GPR fusion product yields lower annual terrestrial LE in Spain, Austria, Romania, Belarus, and a higher LE in Norway, Poland, Hungary, and Italy. The difference between the two LE products was approximately within 4 W/m 2 in most areas. This is possibly caused by the uncertainty existing in the structures of different machine learning fusion methods.  Figure 12 shows the seasonal variation both of ground-measured LE and the estimate LE by the four fusion methods for different PFTs. For each PFT, we randomly selected a site which completely contains at least one year of ground measurements, and the corresponding estimate LE data was compared to this. The daily observed data and estimate LE were replaced by the eight-day average value. The five sites were BE-Lon (CRO), FR-Hes (DBF), DE-Har (ENF), DE-Gri (GRA), and BE-Bra (MIF). Figure 12 illustrates that all the four machine learning fusion methods' LE values showed great consistency with the observations for each PFT in 2005. For the DE-Har (ENF) and DE-Gri (GRA) sites, the GBRT, RF, and the GPR estimations were lower than the observations on most days of 2005. However, for the BE-Bra (MIF) and FR-Hes (DBF) sites, the three fusion models overestimated the LE with the same trend. In comparison with the GBRT, RF, and GPR methods, the seasonal LE variations generated by ETR were closest to the observations. Moreover, the RF estimate performed closely to ETR.

The performance of the Extremely Randomized Trees fusion method
Of all the five sites in different PFTs, the BE-Lon (CRO) site showed a poor performance in the growing season. Studies showed that differences in crop types, and irrigation and fertilization practices may lead to noticeable influence on the accurate estimation of LE [59]. For the FR-Hes (DBF) and DE-Har (ENF) sites, a greater overlap of leaves resulting in a lower bare soil exposure ratio [60], NDVI saturates, and clouds cover [61] all contributed to the inaccurate LE estimates in these two PFTs. All four machine learning methods yielded poor performance in the BE-Bra (MIF) site, mainly because there were fewer samples available (only two sites, one for training and one for validation), which lowered the learning performance of the fusion models. In addition, all the four models presented significantly higher estimations with the same trend in BE-Bra (MIF). This may be caused by the original higher estimation of the five individual LE products at the MIF sites, which can be seen in Figure 3.  Figure 13 illustrates the probability density distributions of estimate errors in the five satellitederived LE products and four machine learning fusion LE products, respectively. For the single LE products, the error distributions of the MS-PT are more closely centered on zero, and compared with MS-PT, the other four LE products show the same trend of overestimation. The SEMI-PM showed relatively small errors maybe because the regression coefficients of this product algorithm were calibrated using 64 global flux tower sites [47]. Overall, MS-PT and SEMI-PM yield better performance than the other products, which is also consistent with the global validation result from Yao [11].
For the fusion LE products, all the four fusion predictive errors are closely centered on zero and even the GPR method, which got the largest biases, performs better than the MS-PT and SEMI-PM. The ETR method decreased the substantial positive and negative biases. Consistent with the experiments from Geurts [50], the ETR method showed a competitive and even superior fusion performance to RF. This may be due to the ETR method having the advantage of removing the need for the optimization of the discretization thresholds which leads to its efficient implementation [45]. Therefore, the ETR strategy can accurately capture the LE variance and exhibits the best prediction performance. a) b) Figure 13. Probability density distributions of the predictive errors in a) five satellite-derived LE product algorithms and b) four machine learning fusion methods, respectively. Figure 14 shows the mean annual spatial differences between the GLASS LE product and the fusion LE estimates using ETR from 2002 to 2005. As indicated in Figure 14, relative to the GLASS LE product, the fusion result yielded lower LE (-15 ~ 10 W/m 2 ) in most southern portions of Europe, and higher LE (0 ~ 5 W/m 2 ) in northern Europe. This may be caused by insufficient site data and representation, as well as the characteristics of the ETR algorithm. The discrepancies between the two LE products were within 15 W/m 2 in most portions of Europe. As indicated in Figure 15, when compared with the MODIS LE product (MOD16), the fusion LE using ETR had lower estimates in central and northeast Europe and higher estimates in most parts of Spain, Portugal, United Kingdom, and Iceland. This may be caused by land surface heterogeneity and spatial continuity underrepresentation of MOD16. The difference between the two products was small and no more than 15 W/m 2 in most areas. Given the accuracy of the GLASS and MOD16 LE products, we can conclude that the fusion LE using ETR exhibits a relatively small gap compared with them, and it proved to be reasonable and reliable for the terrestrial LE estimation using the ETR fusion method over Europe.

Uncertainties of the merged LE estimates
Several studies have shown that the uncertainties in the MERRA meteorology data and satellitebased observations (e.g., LAI, NDVI, or FPAR), errors in tower EC observations, and mismatched spatial scales between different data sources, as well as structural differences between machine learning models all lead to the inaccuracy of LE estimates [8,9].
Firstly, studies showed that there are large biases in the MERRA data and satellite-based vegetation parameter products when validated by ground measurements at flux tower sites [62,63]. Therefore, the uncertainty of the satellite-derived LE estimates would be inherited through errors from both the MERRA and satellite-based data inputs [2]. Secondly, Foken found that large eddies cannot be measured with the EC method, which leads to the energy imbalance [48]. Although we corrected the measured LE using the method proposed by Twine [49], the corrections based on the limited understanding of the nature of the energy imbalance still result in large errors in EC measurements [64]. In addition, Hui found that gap filling of EC measurements from hourly and halfhourly data to daily averages also generates uncertainties for LE estimations [65]. Thirdly, the spatial scale mismatch among the different data sources may have introduced errors into the LE estimations. As a result of the limitation of the spatial representation of ground observation technology, no effective means exist of evaluating spatially distributed regional LE products at scales greater than a few kilometers-particularly over nonhomogeneous surfaces [66]. Finally, all machine learning methods may fall into local optimization, which results in a poor generalization performance [35]. Moreover, the fitting performance of the machine learning methods is also closely dependent on the representativeness of the training data.

Conclusions
We applied the Extremely Randomized Trees method to implement the fusion of five satellitederived terrestrial LE products (RS-PM, SW, PT-JPL, MS-PT, and SEMI-PM) based on flux tower observations. By evaluating against ground measurements, we found that there are substantial uncertainty and discrepancies in the individual LE products.
On the basis of five satellite-derived terrestrial LE products, we trained and validated ETR and three other machine learning models at 76 EC flux tower sites to compare the model prediction performance. We found the ETR algorithm achieved the best fusion accuracy with the highest R 2 (0.98 and 0.76) and the lowest Bias (0 W/m 2 and -0.55 W/m 2 ) and RMSE (4.93 W/m 2 and 16.87 W/m 2 ) for training and validation, respectively. Furthermore, the performance of the ETR fusion method is significantly superior to the best performance of an individual LE product at the specific site scale. When compared with moderate spatial resolution satellite-derived global LE products, such as GLASS and MOD16, the fusion LE using ETR exhibits relatively small spatial differences.
Overall, we conclude that the ETR method improved daily LE estimates over Europe by fusing five satellite-derived terrestrial LE products driven by MERRA meteorology and MODIS products. The improved accuracy of the ETR fusion method makes it a reasonable and reliable option for the estimation of terrestrial LE over Europe.