Integrated Validation of Coarse Remotely Sensed Evapotranspiration Products over Heterogeneous Land Surfaces

: Validation of remotely sensed evapotranspiration (RS_ET) products is important because their accuracy is critical for various scientiﬁc applications. In this study, an integrated validation framework was proposed for evaluating RS_ET products with coarse spatial resolution extending from homogenous to heterogeneous land surfaces. This framework was applied at the pixel and river basin scales, using direct and indirect validation methods with multisource validation datasets, which solved the spatial mismatch between ground measurements and remotely sensed products. The accuracy, rationality of spatiotemporal variations, and error sources of RS_ET products and uncertainties during the validation process were the focuses in the framework. The application of this framework is exempliﬁed by validating ﬁve widely used RS_ET products (i.e., GLEAM, DTD, MOD16, ETMonitor, and GLASS) in the Heihe River Basin from 2012 to 2016. Combined with the results from direct (as the priority method) and indirect validation (as the auxiliary method), DTD showed the highest accuracy (1-MAPE) in the vegetation growing season (75%), followed by ETMonitor (71%), GLASS (68%), GLEAM (54%), and MOD16 (44%). Each product reasonably reﬂected the spatiotemporal variations in the validation dataset. ETMonitor exhibited the highest consistency with the ground truth ET at the basin scale (ETMap) (R = 0.69), followed by GLASS (0.65), DTD (0.63), MOD16 (0.62), and GLEAM (0.57). Error sources of these RS_ET products were mainly due to the limitations of the algorithms and the coarse spatial resolution of the input data, while the uncertainties in the validation process amounted to 15–28%. This work is proposed to effectively validate and improve the RS_ET products over heterogeneous land surfaces.


Introduction
Evapotranspiration (ET) is a keystone variable linking the water, energy, and carbon cycles [1]. Remotely sensed ET (RS_ET), which provides spatio-temporal continuous information over the land surface, can be beneficial to understand the water and energy budgets

Study Area and Experiment
The Heihe River Basin (HRB), with an area of approximately 1.43 × 10 5 km 2 and located in the arid and semiarid regions of Northwest China, was selected as the study area. The HRB is a complicated watershed system with the coexistence of cold and arid regions and diverse landscapes [39] (shown in Figure 1). Over this basin, the mean annual air temperatures are 0.4, 7.3 and 8.2 • C, and annual precipitation is 322, 130 and 30 mm in the upstream, midstream and downstream regions, respectively [17]. In the upstream region, glaciers, permafrost, alpine meadows and forests are the dominant surface types. The midstream region is characterized by an artificial oasis (including farmland and shelter forests) and desert landscape. In the downstream region, the landscape is a natural oasis surrounding a large desert area, and riparian ecosystems are distributed along the river [40]. The HRB has diverse landscapes, which make it an ideal study area for the validation of RS_ET products from homogeneous to heterogeneous land surfaces.
The Heihe integrated observatory network was established in the HRB from 2007, during the WATER (Watershed Allied Telemetry Experimental Research) experiment (2007-2011 [41]), and was completed in 2013 during the HiWATER (Heihe Watershed Allied Telemetry Experimental Research) experiment (2012-2015 [39,42]). This network includes a maximum of 23 observation stations and it currently has 11 operating stations (three superstations and eight ordinary stations). Additionally, one other flux tower site (Linze site [43]) was used in the study. Figure 1a shows the flux tower site locations used in this study (site [1][2][3][4][5][6][7][8][9][10][11][12][13][14][15]. From 3 May to 21 September 2012, the Multi-Scale Observation Experiment on Evapotranspiration over Heterogeneous Land Surfaces (MUSOEXE) was successfully conducted in the midstream of the HRB [23,44]. Two nested flux observation matrices were involved in HiWATER-MUSOEXE, including a large matrix (30 km × 30 km) and a kernel matrix (5.5 km × 5.5 km). The kernel matrix was initially divided into 17 sample plots (shown in Figure 1b) in the oasis. There was one EC system and one automatic weather station (AWS) in each plot to observe surface heat fluxes and meteorological elements [23]. There were four groups of large-aperture scintillometers (LAS) installed in the kernel matrix (the footprints of the LAS systems can cover three pairs of 3 × 1 MODIS pixels, named LAS1, LAS2 and LAS3 from west to east, respectively; one group covers one pair 2 × 1 MODIS pixels, named LAS4). Moreover, in the Ejina Oasis over the downstream region of the HRB, a flux observation matrix (3 km × 2 km) comprising five observation sites operated from 2013 to 2015. Two groups of LAS systems were installed in one pair of 2 × 2 MODIS pixels (LAS 5 and LAS6, shown in Figure 1c), which operated from 2013 to 2014, but the group of LAS located in one pair of 2 ×1 MODIS pixels (LAS7, shown in Figure 1c) left after 2015. In the upstream, there is one group of LAS and its footprint can cover one pair of 2 × 1 MODIS pixels (LAS 8, shown in Figure 1(a1)). Details of the observation instruments can be found in [23,39].  (9). Shenshawo, (10) Linze, (11) Sidaoqiao, (12) Populus euphratica, (13) mixed forest, (14) barren land, (15) desert.

Remotely Sensed Evapotranspiration Products
In this study, five widely used coarse RS_ET products were evaluated according to the proposed ET validation framework. These ET products were grouped based on their intrinsic mechanisms: the first group included GLEAM [6] and DTD [17], which were derived from the surface energy balance; the second group included MOD16 [5] and ETMonitor [12], which were produced based on the eco-physiological processes of vegetation; the third group included GLASS [9], which was produced based on the integration of multiple algorithms. Details of the coarse RS_ET products are listed in Table  1. The DTD, ETMonitor and GLEAM ET products were aggregated to 8-day temporal resolution, while the GLEAM was resampled to 1-km spatial resolution.  (9). Shenshawo, (10) Linze, (11) Sidaoqiao, (12) Populus euphratica, (13) mixed forest, (14) barren land, (15) desert.

Remotely Sensed Evapotranspiration Products
In this study, five widely used coarse RS_ET products were evaluated according to the proposed ET validation framework. These ET products were grouped based on their intrinsic mechanisms: the first group included GLEAM [6] and DTD [17], which were derived from the surface energy balance; the second group included MOD16 [5] and ET-Monitor [12], which were produced based on the eco-physiological processes of vegetation; the third group included GLASS [9], which was produced based on the integration of multiple algorithms. Details of the coarse RS_ET products are listed in Table 1. The DTD, ETMonitor and GLEAM ET products were aggregated to 8-day temporal resolution, while the GLEAM was resampled to 1-km spatial resolution. The ground truth ET at the satellite pixel scale will help to solve the spatial mismatch between ground measurements and RS_ET products. Depending on the heterogeneity of the land surface, the most appropriate methods were selected to acquire the ground truth ET at the satellite pixel scale using multiple flux tower measurements from June to September in the flux observation matrix of the midstream (LAS1-LAS4) during 2012 and the flux observation matrix of the downstream (LAS5-LAS7) during 2014-2015 in the HRB [26]. The dataset (daily, 1 km) can be downloaded from the National Tibetan Plateau Data Center (http://data.tpdc.ac.cn/ (accessed on 1 June 2018)).
Additionally, the satellite pixel-scale ground truth ET of the flux tower-located satellite pixels was also derived using the single flux tower measurements. The flux towers included three superstations (AR (Figure 1 (Figure 1(a15)) over typical underlying surfaces in the HRB [27]. The dataset (daily, 1 km) can be downloaded from the National Tibetan Plateau Data Center (http://data.tpdc.ac.cn/ (accessed on 10 December 2018)). The details of the ground truth ET at pixel scale are summarized in Table 2.
Regional ET at the watershed scale can be estimated by the water balance method, expressed as where P is precipitation, R is runoff and ∆S is the change in terrestrial water storage. In this study, R was the difference in the water inflow and outflow, and ∆S was derived from the groundwater and reservoir storages. The water balance-based ET over the HRB during 2012-2016 was calculated according to these data with Equation (1) over the upstream, midstream and downstream, respectively. The results can be used as ground truth ET at the basin scale to validate RS_ET products. All the hydrologic data used in the water balance calculation were collected from the National Tibetan Plateau Data Center (http://data.tpdc.ac.cn/ (accessed on 10 December 2018)), which records data from hydrological observations over the HRB. Detailed information about the water balance calculation can be found in [45]. The ETMap was used as ground truth ET at the river basin scale to evaluate the RS_ET products under various land surfaces. The ETMap was obtained by applying the random forest method to train the daily ET and its explanatory variables over 36 flux tower sites (65 site years), and then we extended the results to the whole HRB. The variables related to ET, including leaf area index, solar radiation, precipitation, air temperature and relative humidity, were considered. The ETMap has spatial and temporal resolutions of 1 km and daily, respectively, from 2012 to 2016 [35], which can be downloaded from the National Tibetan Plateau Data Center (http://data.tpdc.ac.cn/ (accessed on 10 December 2018)).

ET Influence Factor Data and Auxiliary Dataset
The land cover map had a resolution of 30 m and high accuracy of over 90% [46]. Precipitation and air temperature datasets were used as influencing factors of ET to analyze the spatiotemporal trends of the RS_ET products. The 5 km/1 h atmospheric forcing dataset generated by Weather Research and Forecasting (WRF) [47] was used as the input of the hydrological-scale model in the HRB. The dataset of groundwater level was acquired by a downstream automatic water gauge in 2015. These datasets can be downloaded from the National Tibetan Plateau Data Center (http://data.tpdc.ac.cn/ (accessed on 10 December 2018)) [39].

Validation Framework
The proposed validation framework has three progressive types of content, including the analysis of the accuracy and spatiotemporal variation trends of the RS_ET products, discussion of the error sources of the RS_ET products and assessment of the uncertainty in the validation process ( Figure 2).

Accuracy Evaluation Method
The Taylor diagram [49] was used to quantify the performance of the RS_ET products at the pixel scale. The Taylor diagram provides a suitable overview of the performance of different products in a single diagram by considering three statistics: the correlation coefficient (R), centered root mean square error (RMSE) and standard deviation (SD). Moreover, the mean absolute percentage error (MAPE) was utilized to quantify the accuracy of the RS_ET products. When validated by ETMap at the basin scale, the MAPE,  The first aspect of the framework is implemented to derive the accuracy of the RS_ET products using a direct validation method and indirect validation method. Specifically, in Remote Sens. 2022, 14, 3467 9 of 26 the direct validation method, the ground truth ET at the pixel and basin scales can be used to assess the accuracy and spatiotemporal trends of the RS_ET products. Meanwhile, in the indirect validation method, the three-cornered hat (TCH) approach is applied to obtain the relative accuracy among the RS_ET products at the basin scale, and the rationality of the spatiotemporal variations of the RS_ET products is explained by the ET impact factors. The second aspect involves tracking the error sources of the RS_ET products, which may be associated with the algorithms and the input datasets. Finally, the third aspect is to assess the uncertainty in the validation process using the generalized polynomial chaos (gPC) method [48].

Accuracy Evaluation Method
The Taylor diagram [49] was used to quantify the performance of the RS_ET products at the pixel scale. The Taylor diagram provides a suitable overview of the performance of different products in a single diagram by considering three statistics: the correlation coefficient (R), centered root mean square error (RMSE) and standard deviation (SD). Moreover, the mean absolute percentage error (MAPE) was utilized to quantify the accuracy of the RS_ET products. When validated by ETMap at the basin scale, the MAPE, BIAS and R were utilized to synthetically analyze the accuracy of the RS_ET products pixel by pixel in the HRB. The calculation methods for these evaluation indexes are listed in Appendix A.
For indirect validation, the TCH method was employed to evaluate the relative accuracy among these RS_ET products. This method is effective for characterizing the relative accuracy of more than three products. The TCH method relies on the removal of common signals from the RS_ET products and subsequently provides the relative errors [37,50]. The calculation result (variance) of TCH was squared and then divided by the average value of each product to obtain the relative error [27]. This is important to confirm the relative accuracy of the RS_ET products without ground truth ET. In this study, this method was employed to estimate the relative accuracy of five RS_ET products without any prior knowledge. The details of the TCH method are explained in Appendix C. Moreover, based on the ET impact factor data, such as the land cover type, air temperature and precipitation, the latitudinal profiles between RS_ET products and these factors were used to analyze the rationality of the spatiotemporal variation trend of the RS_ET products.

Uncertainty Evaluation Method
Quantitative evaluation of the uncertainty is an important part of the RS_ ET products' validation. The EC measurements can represent the ground truth ET value at the pixel scale when the land surface is relatively homogeneous. Here, the uncertainty of ground truth ET is mainly determined by the ET observation error. In the method proposed by Beyrich [51], the maximum difference between LAS observations and EC observations, or the error of the EC measurements, is used to quantitatively evaluate the uncertainty of the ground truth ET at the pixel scale on a homogeneous underlying surface. When the land surface is moderately or highly heterogeneous, the uncertainty of the upscaled ground truth ET is mainly introduced from the errors of ET observation, auxiliary data and the upscaling method. In this study, the gPC method [48] was used to obtain the uncertainty introduced to the upscaled ground truth ET at the pixel scale [26,27]. The gPC method involves representing the inputs and outputs of a system under consideration through series approximations using standard random variables, thereby resulting in a computationally efficient means of uncertainty propagation through complex numerical models [52]. The gPC calculation result (variance) was squared and then was divided by the average value of ground truth ET at the pixel scale to obtain the uncertainty [27]. In this study, the input data, parameters and output data generated in the upscaling method are used to construct the polynomial chaos expansion model when calculating the uncertainty of the ground truth ET. The stochastic collocation method (SCM) is selected and treats the original model as a black box, and the coefficients of the polynomial chaos expansion terms are finally obtained by iteratively solving for selected collocation points in the input variables.
Finally, the mean and variance output from the gPC method are used to calculate the uncertainty of the ground truth ET [27]. More details about the uncertainty quantification methods are given in Appendix D. For the ground truth ET at the basin scale (ETMap), the difference between ETMap values and LAS observations is calculated and then divided by the LAS observation [35], which can be used to determine the uncertainty of ground truth ET at the regional scale. The performance of the RS_ET products at the pixel scale was analyzed over six typical underlying land cover types, namely grassland, Qinghai spruce, cropland, wetland, riparian forest and desert, in the HRB. Grassland includes subalpine (represented by the AR site) and marsh alpine meadows (represented by the DSL site). Qinghai spruce (represented by the GT site) is an evergreen needle leaf forest and is the dominant land cover of forests in the HRB. The land cover of the cropland includes the DM and flux observation matrix in the midstream, YK and LZ sites. Wetland is represented by the WD site. In this study, barren land or sparsely vegetated areas belong to deserts (represented by the HZZ, BJT, SSW, DS and BL sites). Populus euphratica and Tamarix (belonging to deciduous broadleaf forest and shrub, respectively) in the downstream region are collectively called riparian forests (represented by the downstream flux observation matrix, SDQ, MF and PE sites). Figures 3 and 4 show the Taylor diagrams and time series of the RS_ET products against the ground truth ET at the pixel scale, respectively. Table 3 presents the accuracy of the RS_ET products at the pixel scale, assessed by the MAPE during the vegetation growing season (from June to September) and the entire year.

Validation Results of Coarse RS_ET Products
Remote Sens. 2022, 14, x FOR PEER REVIEW 10 of 26 might be that the grasslands are relatively uniform, while in the riparian forest, the land surface is more heterogeneous, and the RS_ET models would yield large errors.   truth ET. For instance, in July and August, the values of these RS_ET products reached their peak of approximately 30 mm/mon, while those of the ground truth ET reached 50 mm/mon. GLEAM and MOD16 showed significant underestimation; they are unable to accurately reflect the seasonal and inter-annual variation of ET over riparian forest. The value of GLEAM and MOD16 was approximately 10 mm/mon during 2014-2016, while the ground truth ET could reach a peak value of 35 mm/mon.

Validation at the Basin Scale
In the previous sections, differences in the magnitude of the RS_ET products at the pixel scale were observed. However, over heterogeneous surfaces, the pixel-scale validation results at limited sites may not represent the accuracy and spatiotemporal distribution pattern of the RS_ET products over the whole HRB. Therefore, it is necessary to conduct validation at the basin scale. The validation at the basin scale can be divided into two parts. First, the RS_ET products were validated using the watershed ET calculated by the water balance equation in the upstream, midstream, and downstream regions over the whole HRB, respectively. Second, the ground truth ET at the basin scale (ETMap) was used for the pixel-wise validation of the RS_ET products over the whole HRB.
Based on the ET calculated by the water balance equation in the upstream, midstream, and downstream regions, the validation results of GLEAM, GLASS, MOD16 (upstream), and ETMonitor are shown in Figure 5. All of the four RS_ET products performed best in the upstream region, with MAPE values less than 22%. In the midstream region, the MAPE values among these products are quite different, ranging from 24% for ETMonitor to 70% for GLEAM. In the downstream, the RS_ET products showed poor performance. ETMonitor, with a MAPE value of 28%, performed better than the other RS_ET products, and the MAPE values of GLASS and GLEAM were higher than . Temporal variability between ground truth ET at the pixel scale (the spatial resolution is 1 km) and six RS_ET products (MOD16 data were unavailable in desert areas). At the pixel scale, the five RS_ET products exhibit differences in performance, especially during the growing season ( Figure 3). The MAPE values of the whole year and the growing season were consistent (Table 3). Among these RE_ET products, ETMonitor and DTD performed better for most land cover types in the HRB, while GLEAM, MOD16 and GLASS performed well on Qinghai spruce, croplands or grasslands, respectively. According to the validation results, the MAPE value of the RS_ET products is relatively low on grassland, approximately ranging from 16.35 to 21.86% (in the growing season). In contrast, the RS_ET products had high MAPE values in the riparian forest, which ranged from 27.01 to 70.07% (in the growing season). The potential reason for this result might be that the grasslands are relatively uniform, while in the riparian forest, the land surface is more heterogeneous, and the RS_ET models would yield large errors. Figure 4 shows the seasonal and inter-annual performance of the RS_ET products over typical land cover types. The five RS_ET products generally show similar seasonal and inter-annual variations and agree well with the ground truth ET, but the consistency is variable and associated with the land surface type. DTD and ETMonitor remained consistently better with ground truth ET in variation trends and magnitude over most land cover types. In general, the best agreement with ground truth ET at the pixel scale was found over the grassland, Qinghai spruce, and desert. Over the cropland, wetland, and riparian forest, their differences from the ground truth ET were obvious. Within a year, ET reaches its maximum value in July and August, and decreases to the minimum in winter. The variation trend of the DTD product has higher ET values in grasslands and it reached its peak in 2015 (50 mm/mon), which was significantly higher than the ground truth ET. The MOD16 product overestimated the ET values in the non-growing season of 2010-2011 over Qinghai spruce, and the ET values in December and January were overestimated by approximately 8 mm/mon. The seasonal variation amplitudes of GLEAM, MOD16, and GLASS over wetlands were significantly lower than the ground truth ET. For instance, in July and August, the values of these RS_ET products reached their peak of approximately 30 mm/mon, while those of the ground truth ET reached 50 mm/mon. GLEAM and MOD16 showed significant underestimation; they are unable to accurately reflect the seasonal and inter-annual variation of ET over riparian forest. The value of GLEAM and MOD16 was approximately 10 mm/mon during 2014-2016, while the ground truth ET could reach a peak value of 35 mm/mon.

Validation at the Basin Scale
In the previous sections, differences in the magnitude of the RS_ET products at the pixel scale were observed. However, over heterogeneous surfaces, the pixel-scale validation results at limited sites may not represent the accuracy and spatiotemporal distribution pattern of the RS_ET products over the whole HRB. Therefore, it is necessary to conduct validation at the basin scale. The validation at the basin scale can be divided into two parts. First, the RS_ET products were validated using the watershed ET calculated by the water balance equation in the upstream, midstream, and downstream regions over the whole HRB, respectively. Second, the ground truth ET at the basin scale (ETMap) was used for the pixel-wise validation of the RS_ET products over the whole HRB.
Based on the ET calculated by the water balance equation in the upstream, midstream, and downstream regions, the validation results of GLEAM, GLASS, MOD16 (upstream), and ETMonitor are shown in Figure 5. All of the four RS_ET products performed best in the upstream region, with MAPE values less than 22%. In the midstream region, the MAPE values among these products are quite different, ranging from 24% for ETMonitor to 70% for GLEAM. In the downstream, the RS_ET products showed poor performance. ETMonitor, with a MAPE value of 28%, performed better than the other RS_ET products, and the MAPE values of GLASS and GLEAM were higher than 40%. Overall, ETMonitor exhibited better performance than other RS_ET products across the basin.
To analyze the pixel-wise product performance, further investigation was conducted by directly comparing each product with ETMap pixel by pixel over the whole basin. The spatial distributions of MAPE, BIAS (RS_ET-ETMap), and R of the RS_ET products are shown in Figure 6. We found that some conclusions were consistent with the above validation results. Over grassland areas, some pixels from DTD showed values of MAPE and BIAS higher than 30% and 20 mm/mon, respectively. DTD showed relatively low values of MAPE and BIAS in most cropland and riparian forest pixels. In contrast, the MAPE and BIAS of GLEAM showed relatively low values in most grassland and Qinghai spruce pixels, while these values increased over cropland and riparian forest. For MOD16, GLASS, and ETMonitor, values of MAPE and BIAS increased from upstream to downstream in most regions. Overall, each RS_ET product maintained good consistency with ETMap according to the high correlation coefficient values. The DTD had the lowest value of MAPE in the vegetation growing season (25%), followed by ETMonitor (29%), GLASS (32%), GLEAM (46%), and MOD16 (56%). ETMonintor tended to show certain underestimation (BIAS = −20 mm/mon), leading to a relatively high MAPE (MAPE > 30%) over the grassland. GLASS overestimated the ET in midstream desert regions (BIAS > 23 mm/mon). On the whole, DTD and ETMonitor exhibited better performance than other RS_ET products, with relatively low MAPE and BIAS values and high correlation coefficients in most pixels over the whole HRB.  To analyze the pixel-wise product performance, further investigation was conducted by directly comparing each product with ETMap pixel by pixel over the whole basin. The spatial distributions of MAPE, BIAS (RS_ET-ETMap), and R of the RS_ET products are shown in Figure 6. We found that some conclusions were consistent with the above validation results. Over grassland areas, some pixels from DTD showed values of MAPE and BIAS higher than 30% and 20 mm/mon, respectively. DTD showed relatively low values of MAPE and BIAS in most cropland and riparian forest pixels. In contrast, the MAPE and BIAS of GLEAM showed relatively low values in most grassland and Qinghai spruce pixels, while these values increased over cropland and riparian forest. For MOD16, GLASS, and ETMonitor, values of MAPE and BIAS increased from upstream to downstream in most regions. Overall, each RS_ET product maintained good consistency with ETMap according to the high correlation coefficient values. The DTD had the lowest value of MAPE in the vegetation growing season (25%), followed by ETMonitor (29%), GLASS (32%), GLEAM (46%), and MOD16 (56%). ETMonintor tended to show certain underestimation (BIAS = -20 mm/mon), leading to a relatively high MAPE (MAPE > 30%) over the grassland. GLASS overestimated the ET in midstream desert regions (BIAS > 23 mm/mon). On the whole, DTD and ETMonitor exhibited better performance than other RS_ET products, with relatively low MAPE and BIAS values and high correlation coefficients in most pixels over the whole HRB. The seasonal variation of ET values in typical land cover classes was compared between the RS_ET products and ETMap over the HRB, as shown in Figure 7. These comparisons show that all RS_ET products have captured the seasonal and inter-annual trends of ET relatively well against ETMap during most of the study period (2012-2016). DTD and ETMonitor have better consistency with ETMap in terms of magnitude and variation trend over most of the land cover types.
The largest inconsistencies were obtained over high-heterogeneity land surfaces such as riparian forest, especially for GLEAM, MOD16, and GLASS, and yielded negative bias. They produced peak ET values that were underestimated by nearly two times compared to the values of ETMap during the growing season. In 2012 and 2014, the peak value of DTD products was higher than that of ETMap over grassland, which was consistent with the results of the validation at the pixel scale. The ET values of GLASS had a certain overestimation in the desert during the non-growing season, while the GLEAM and MOD16 products underestimated the peak values over the cropland. The seasonal variation of ET values in typical land cover classes was compared between the RS_ET products and ETMap over the HRB, as shown in Figure 7. These comparisons show that all RS_ET products have captured the seasonal and inter-annual trends of ET relatively well against ETMap during most of the study period (2012-2016). DTD and ETMonitor have better consistency with ETMap in terms of magnitude and variation trend over most of the land cover types.
The largest inconsistencies were obtained over high-heterogeneity land surfaces such as riparian forest, especially for GLEAM, MOD16, and GLASS, and yielded negative bias. They produced peak ET values that were underestimated by nearly two times compared to the values of ETMap during the growing season. In 2012 and 2014, the peak value of DTD products was higher than that of ETMap over grassland, which was consistent the results of the validation at the pixel scale. The ET values of GLASS had a ce overestimation in the desert during the non-growing season, while the GLEAM MOD16 products underestimated the peak values over the cropland.

Cross-Validation
The cross-validation is implemented by using the TCH method to obtain the rel error among the five RS_ET products. The distribution of the relative error value fo RS_ET products is shown in Figure 8. The overall relative error variations can be gene listed as follows: grassland < Qinghai spruce < desert < cropland< wetland < rip forest. The validation results of TCH were generally consistent with the results o direct validation. Among the five RS_ET products, DTD exhibited better perform achieving a low relative error value (lower than 22%) over the HRB, except for s grassland and Qinghai spruce pixels. Then, it was followed by ETMonitor (26%), w had relatively low relative error values in grasslands, Qinghai spruce, and croplands high relative error values in deserts and riparian forests. GLASS presented a mod level of relative error in most pixels (approximately 30%) and had a high relative err desert and riparian forests. GLEAM had a low relative error value in grassland Qinghai spruce pixels (approximately 15%), while it had a high relative error valu cropland, desert regions, and riparian areas. MOD16 yielded a higher relative error i HRB and it increased from upstream to downstream (greater than 35%).

Cross-Validation
The cross-validation is implemented by using the TCH method to obtain the relative error among the five RS_ET products. The distribution of the relative error value for the RS_ET products is shown in Figure 8. The overall relative error variations can be generally listed as follows: grassland < Qinghai spruce < desert < cropland< wetland < riparian forest. The validation results of TCH were generally consistent with the results of the direct validation. Among the five RS_ET products, DTD exhibited better performance, achieving a low relative error value (lower than 22%) over the HRB, except for some grassland and Qinghai spruce pixels. Then, it was followed by ETMonitor (26%), which had relatively low relative error values in grasslands, Qinghai spruce, and croplands, but high relative error values in deserts and riparian forests. GLASS presented a moderate level of relative error in most pixels (approximately 30%) and had a high relative error in desert and riparian forests. GLEAM had a low relative error value in grassland and Qinghai spruce pixels (approximately 15%), while it had a high relative error value in cropland, desert regions, and riparian areas. MOD16 yielded a higher relative error in the HRB and it increased from upstream to downstream (greater than 35%).

Spatiotemporal Variation Analysis
In this study, latitudinal profiles were adopted to analyze the rationality of the spatial-temporal changes among these products. The latitudinal profiles (Figure 9b) were

Spatiotemporal Variation Analysis
In this study, latitudinal profiles were adopted to analyze the rationality of the spatialtemporal changes among these products. The latitudinal profiles (Figure 9b) were drawn for the five RS_ET products along with the air temperature and precipitation. From upstream to downstream, with increasing latitude, the surface types changed from snow and ice, grassland/Qinghai spruce, cropland, and wetland/desert to riparian forest and cropland/desert. The air temperature values rose gradually and the precipitation and ET values decreased gradually, but high ET values were observed in the midstream and downstream oasis areas. The upstream region generates the runoff of the HRB; there is abundant precipitation over this area that results in greater precipitation than the ET. The midstream region consumed most of the water resources to irrigate the crops and yielded a higher ET than precipitation. There was rarely precipitation and low vegetation cover in the downstream region; thus, there were very low ET values over the broad desert but a slightly higher ET in the nature oases. All RS_ET products reasonably reflected the spatial changes in the ET values from upstream to downstream regions. The results also showed that MOD16 underestimated while DTD overestimated the ET values in the upstream region, respectively. GLEAM underestimated the ET in the downstream region, and it yielded an unreasonable result in which the rainfall was greater than the ET, even over the frequently irrigated agricultural areas.

Error Sources of the RS_ET Products
The error sources of the RS_ET products are introduced in terms of two aspects: the limitations of the algorithm and errors of input data. For RS_ET products yielded based on the surface energy balance model, the largest limitation is the accuracy of the satellitebased land surface temperature (LST) data. Based on the validation results of DTD in the HRB, it exhibited obvious overestimation in the upstream region. A potential reason is that the suspected LST data were used in this model under cloudy and rainy weather conditions in the upstream region.
The validation of GLEAM showed an underestimation over the cropland and

Error Sources of the RS_ET Products
The error sources of the RS_ET products are introduced in terms of two aspects: the limitations of the algorithm and errors of input data. For RS_ET products yielded based on the surface energy balance model, the largest limitation is the accuracy of the satellite-based land surface temperature (LST) data. Based on the validation results of DTD in the HRB, it exhibited obvious overestimation in the upstream region. A potential reason is that the suspected LST data were used in this model under cloudy and rainy weather conditions in the upstream region.
The validation of GLEAM showed an underestimation over the cropland and riparian forest (P. euphratica and Tamarix) in the downstream region, and it had a low correlation with ETMap over the cropland. In the GLEAM algorithm, a constant value of α = 0.8 was used to parameterize the tall canopy fraction, and a value of α = 1.26 was applied in both the short vegetation and bare soil fractions. However, previous studies have highlighted that α varies greatly over the growing season and with crop species, soil moisture availability, and climate conditions [53]. Here, we re-calculated the P-T coefficient values using the ground measurements at cropland (DM sites) and riparian forest (SDQ in the downstream) sites during the growing season ( Figure 10). The results showed that the value of α varied with the land surface characteristics of different sites and was influenced by the seasonal variation. The value of α ranged from 1.2 to 1.8 at the DM site and from 0.2 to 0.6 at the SDQ site. The unreliable setting of the α value may potentially cause the discrepancy in GLEAM products over cropland and riparian forest pixels. The eco-physiological process models often have difficulties in the parameterization of soil-plant-atmosphere interactions and other bio-physiological constraints [19]. MOD16 shows underestimation in most regions of the HRB. The parameterization of the MOD16 model is vital in controlling the ET; in particular, the surface resistance parameterization is essential to the model [4,[54][55][56]. The application of VPD or relative humidity (RH) to represent the surface soil moisture status may be questionable. The VPD (or RH) is highly influenced by large-scale atmospheric conditions, while the soil moisture conditions over a large region are variable due to differences in precipitation, irrigation, underground water level, soil texture types, vegetation cover, and topography. Thus, the spatial variability of soil moisture can be stronger than that of VPD (or RH) [57].
ETMonitor is based on the physiological and ecological processes of vegetation, and it performed well over most land cover types in the HRB. However, it showed relatively low accuracy in the downstream region. This may due to the model having no corresponding mechanism to reflect the absorption of deep groundwater by vegetation, thus causing the underestimation of ETMonitor over riparian forests (such as P. euphratica and Tamarix). Figure 11 displays the groundwater table level and ET of the PE, SDQ, and MF sites in 2015. The groundwater level changes significantly within the range of 2-3 m and decreases immediately at the start of the growing season (June), due to the water The eco-physiological process models often have difficulties in the parameterization of soil-plant-atmosphere interactions and other bio-physiological constraints [19]. MOD16 shows underestimation in most regions of the HRB. The parameterization of the MOD16 model is vital in controlling the ET; in particular, the surface resistance parameterization is essential to the model [4,[54][55][56]. The application of VPD or relative humidity (RH) to represent the surface soil moisture status may be questionable. The VPD (or RH) is highly influenced by large-scale atmospheric conditions, while the soil moisture conditions over a large region are variable due to differences in precipitation, irrigation, underground water level, soil texture types, vegetation cover, and topography. Thus, the spatial variability of soil moisture can be stronger than that of VPD (or RH) [57].
ETMonitor is based on the physiological and ecological processes of vegetation, and it performed well over most land cover types in the HRB. However, it showed relatively low accuracy in the downstream region. This may due to the model having no corresponding mechanism to reflect the absorption of deep groundwater by vegetation, thus causing the underestimation of ETMonitor over riparian forests (such as P. euphratica and Tamarix). Figure 11 displays the groundwater table level and ET of the PE, SDQ, and MF sites in 2015. The groundwater level changes significantly within the range of 2-3 m and decreases immediately at the start of the growing season (June), due to the water consumed by the plants, along with increased ET values. This indicates that groundwater provides water for the ET over riparian forests (such as P. euphratica and Tamarix). Relatively low accuracy was found in some desert and riparian forest pixels (P. euphratica and Tamarix) in the downstream. The ET accuracy of individual algorithms has a significant impact on the accuracy of the BMA method used in the production of GLASS. In the downstream region with sparse vegetation, each algorithm used in GLASS may not guarantee relatively high accuracy, thus affecting the overall integration accuracy. Moreover, using normal densities to calculate weights for the ET algorithms for different land cover types, without considering the weight differences for different growing seasons and climates, can also result in low accuracy [9].

Uncertainties in the Validation Process
The RS_ET products have been validated using ground measurements, which are derived from the eddy covariance observations under a homogenous surface. However, this ignores the issue of spatial mismatch between the spatial representatives of the eddy covariance system and the image grid. Here, the ETmonitor with a spatial resolution of 1 km was assessed using the ground observation and ground truth ET, respectively, over a land surface ranging from homogenous to heterogeneous. This showed that they yielded very similar results under a homogenous surface, but their difference increased by as much as 10% when the land surface tended to be heterogeneous ( Figure 12). This means that the ground truth ET urgently requires coarse RS_ET validation, especially for GLEAM. It has a spatial resolution of 0.25° but there are no ground measurements that possess a spatial representative to align with this coarse resolution. Hence, the GLEAM ET was resampled to 1 km and then validated using ground measurements and other RS_ET, which could have yielded unpredicted uncertainties during the validation. Relatively low accuracy was found in some desert and riparian forest pixels (P. euphratica and Tamarix) in the downstream. The ET accuracy of individual algorithms has a significant impact on the accuracy of the BMA method used in the production of GLASS. In the downstream region with sparse vegetation, each algorithm used in GLASS may not guarantee relatively high accuracy, thus affecting the overall integration accuracy. Moreover, using normal densities to calculate weights for the ET algorithms for different land cover types, without considering the weight differences for different growing seasons and climates, can also result in low accuracy [9].

Uncertainties in the Validation Process
The RS_ET products have been validated using ground measurements, which are derived from the eddy covariance observations under a homogenous surface. However, this ignores the issue of spatial mismatch between the spatial representatives of the eddy covariance system and the image grid. Here, the ETmonitor with a spatial resolution of 1 km was assessed using the ground observation and ground truth ET, respectively, over a land surface ranging from homogenous to heterogeneous. This showed that they yielded very similar results under a homogenous surface, but their difference increased by as much as 10% when the land surface tended to be heterogeneous ( Figure 12). This means that the ground truth ET urgently requires coarse RS_ET validation, especially for GLEAM. It has a spatial resolution of 0.25 • but there are no ground measurements that possess a spatial representative to align with this coarse resolution. Hence, the GLEAM ET was resampled to 1 km and then validated using ground measurements and other RS_ET, which could have yielded unpredicted uncertainties during the validation. Uncertainties associated with the validation process always exist but they have rarely been reported. The accuracy of the ground truth ET is primarily affected by the quantity, quality, and representativeness of the ET observation datasets [58]. EC measurements are important training data for ground truth ET [26,35]. It has been reported that the ET observation error from EC measurements ranges from 10 to 30 W/m 2 , and different EC data processing software show differences of 5 and 10% during flux data processing [59]. There are also errors yielded from ground observation-based ET upscaling, which is mainly related to the upscaling methods and their inputs. For example, in research by Li et al. [26] and Xu et al. [35], upscaling methods' inputs, such as LST, LAI, and net radiation, were closely related to the accuracy of the upscaling results.
In this study, the method proposed by Beyrich et al. [51] was used to quantitatively evaluate the uncertainty of ground truth ET at the pixel scale under a homogeneous surface, while the gPC method was used for heterogeneous surfaces [26,27]. The uncertainty for ground truth ET at the pixel and regional scales is shown in Figure 13. We found that the uncertainty could be associated with the spatial heterogeneity of the land surface, which is consistent with recent studies [23,27]. EC measurements and ground observation-based ET upscaling methods tend to have relatively high accuracy in relatively homogeneous land cover types, such as grasslands and deserts, which had lower uncertainties in our study (15-18%). Over heterogeneous land cover types, such as wetlands and riparian forests, the error of EC measurements and upscaling methods tends to increase, reaching values as high as 22% to 28%. Moreover, the uncertainty (MAPE) of ETMap throughout the basin is approximately 20.69% when compared with LAS observations, which can represent the ground truth ET at satellite pixels [35].  Uncertainties associated with the validation process always exist but they have rarely been reported. The accuracy of the ground truth ET is primarily affected by the quantity, quality, and representativeness of the ET observation datasets [58]. EC measurements are important training data for ground truth ET [26,35]. It has been reported that the ET observation error from EC measurements ranges from 10 to 30 W/m 2 , and different EC data processing software show differences of 5 and 10% during flux data processing [59]. There are also errors yielded from ground observation-based ET upscaling, which is mainly related to the upscaling methods and their inputs. For example, in research by Li et al. [26] and Xu et al. [35], upscaling methods' inputs, such as LST, LAI, and net radiation, were closely related to the accuracy of the upscaling results.
In this study, the method proposed by Beyrich et al. [51] was used to quantitatively evaluate the uncertainty of ground truth ET at the pixel scale under a homogeneous surface, while the gPC method was used for heterogeneous surfaces [26,27]. The uncertainty for ground truth ET at the pixel and regional scales is shown in Figure 13. We found that the uncertainty could be associated with the spatial heterogeneity of the land surface, which is consistent with recent studies [23,27]. EC measurements and ground observation-based ET upscaling methods tend to have relatively high accuracy in relatively homogeneous land cover types, such as grasslands and deserts, which had lower uncertainties in our study (15-18%). Over heterogeneous land cover types, such as wetlands and riparian forests, the error of EC measurements and upscaling methods tends to increase, reaching values as high as 22% to 28%. Moreover, the uncertainty (MAPE) of ETMap throughout the basin is approximately 20.69% when compared with LAS observations, which can represent the ground truth ET at satellite pixels [35].
Remote Sens. 2022, 14, x FOR PEER REVIEW 1 Figure 12. The divergence for the remotely sensed ET validation results using the ground tr vs. ground observation ET from homogeneous to heterogeneous surfaces at the spatial res of 1 km.
Uncertainties associated with the validation process always exist but they have been reported. The accuracy of the ground truth ET is primarily affected by the qu quality, and representativeness of the ET observation datasets [58]. EC measureme important training data for ground truth ET [26,35]. It has been reported that t observation error from EC measurements ranges from 10 to 30 W/m 2 , and differe data processing software show differences of 5 and 10% during flux data processin There are also errors yielded from ground observation-based ET upscaling, wh mainly related to the upscaling methods and their inputs. For example, in research et al. [26] and Xu et al. [35], upscaling methods' inputs, such as LST, LAI, and net rad were closely related to the accuracy of the upscaling results.
In this study, the method proposed by Beyrich et al. [51] was used to quantit evaluate the uncertainty of ground truth ET at the pixel scale under a homoge surface, while the gPC method was used for heterogeneous surfaces [26,27 uncertainty for ground truth ET at the pixel and regional scales is shown in Figure  found that the uncertainty could be associated with the spatial heterogeneity of th surface, which is consistent with recent studies [23,27]. EC measurements and g observation-based ET upscaling methods tend to have relatively high accur relatively homogeneous land cover types, such as grasslands and deserts, whic lower uncertainties in our study (15-18%). Over heterogeneous land cover types, s wetlands and riparian forests, the error of EC measurements and upscaling methods to increase, reaching values as high as 22% to 28%. Moreover, the uncertainty (MA ETMap throughout the basin is approximately 20.69% when compared with observations, which can represent the ground truth ET at satellite pixels [35].

Conclusions
In this study, a validation framework was established to evaluate RS_ET products extending over homogenous to heterogeneous land surfaces. The validation framework was applied to assess the GLEAM, DTD, MOD16, ETMonitor, and GLASS products from 2012 to 2016 in the HRB. The main conclusions are summarized as follows.
First, a validation framework was proposed for evaluating coarse RS_ET products over heterogeneous land surfaces with multisource validation datasets (ground truth ET at the pixel and basin scales, other RS_ET products, and ET impact factors), multiple validation methods (direct and indirect validation methods), and multiple scales (pixel and basin scales). In this validation framework, we report the accuracy and the rationality of the spatiotemporal variations of the RS_ET products, the error sources of the products, and the uncertainty of the validation process.
Second, the validation results were consistent among the direct and indirect methods at different scales. The DTD had the highest accuracy (1-MAPE) in the vegetation growing season (75%), followed by ETMonitor (71%), GLASS (68%), GLEAM (54%), and MOD16 (43.59%). All RS_ET products had the capability to maintain the consistency of the spatiotemporal trends of ET and its impact factors. ETMonitor and DTD performed relatively well (with R values higher than 0.66), followed by MOD16 (0.62) and GLASS (0.65), and, finally, GLEAM (0.57). These validation results could also be affected by the accuracy of the ground truth ET dataset. The results showed that the uncertainty of the ground truth ET at the pixel scale was approximately 15% over relatively homogeneous land surfaces, while it can be up to 28% over highly heterogeneous land surfaces.
Validating coarse RS_ET products over heterogeneous surfaces is still a critical research issue. The validation framework proposed in this study could contribute to understanding the accuracy and rationality of spatiotemporal variations in the RS_ET products over heterogeneous surfaces, promoting their application. In the future, we could apply this validation framework and ground truth ET acquisition method at the pixel/basin scale over more climates, surface types, and regions.
where P i represents the RS_ET product, O i is the ground truth ET at the pixel or regional scale, P is the mean RS_ET value, O is the mean ground truth ET, and n is the number of samples.

Appendix B
The input data of the coarse RS_ET products used in this study are presented in Table A1. Note: Ta refers to the air temperature, Tmax refers to the maximum temperature, Tmin refers to the minimum temperature, q refers to the atmosphere-specific humidity, WS refers to the wind speed, LST refers to the land surface temperature, LAI refers to the leaf area index, NDVI refers to the normalized vegetation index, FPAR refers to the fraction of absorbed photosynthetically active radiation, and VOD refers to the vegetation optical depth.

Appendix C
In the three-cornered hat (TCH) method, five RS_ET products were stored in the array {X i } i=1,2,...,N , where N (N = 5) is the number of RS_ET products, and i is the ith RS_ET products. Any RS_ET product can be expressed as where X t is the true value and ε i is the error of the ith time series. The symbol ∀ is called the universal quantifier. As no true estimate of X t is available, any single time series is arbitrarily chosen as the reference. The series of differences matrix can be obtained by calculating the difference between each time series and the reference. The corresponding covariance matrix S of the series of difference matrices is thus obtained. Notably, the valuation of uncertainties is not related to the selection of the reference series. The unknown N × N covariance matrix of the individual noise R is introduced, and R is related to S as follows: where J is described as follows: reasonable method to obtain a unique solution. Galindo and Palacio (1999) [60] proposed a constraint function to meet |R| > 0 as follows: where r 1N,..., r NN are the elements of the corresponding R and K = N−1 |S| is introduced to better obtain a numerical solution. This constraint function constrains the free parameters within the solution domain but is insufficient to determine a unique solution for the free parameters. The following objective function is also used to provide the optimal selected criterion to obtain the unique parameter solution as follows: R is calculated using the objective function under the constraint condition by combining it with Equation (A6). The uncertainty of the time series {X i } i=1,2,...,N , denoted {σ i } i=1,2,...,N , can be obtained by calculating the square root of the diagonal values in R (i.e., {r ii } i=1,2,...,N ). The relative uncertainty is the ratio of σ i to the mean X i .

Appendix D
According to the method of Beyrich et al. [23], the uncertainty of the ground truth ET over a homogeneous underlying surface is calculated via ∆ET = max(σ r , abs(ET_EC − ET LAS )) (A10) where ∆ET represents the uncertainty of the ground truth ET, ET_EC represents the ET observed by EC, ET LAS represents the LAS observed value, and σ r represents the error of the EC observation, which can be obtained from the EC data processing software (EddyPro). However, uncertainty introduced from the eddy covariance instrument measurements must also be considered as the uncertainty of ground truth ET due to the dearth of LAS observations. In this study, we adopted the generalized polynomial chaos (gPC) method to quantitatively evaluate the uncertainty of ground truth ET over moderately and highly heterogeneous underlying surfaces. The gPC method involves representing the inputs and outputs of a system under consideration through series approximations using standard random variables, thereby resulting in a computationally efficient means of uncertainty propagation through complex numerical models [52].
In this approach, the same set of random variables used to represent the input stochasticity is used for representing the output(s). For uniformly distributed random inputs, an equivalent reduced model for the output can be expressed in the form of a series expansion comprising multidimensional Legendre polynomials of uniform random variables as where y refers to an output metric, ξ i 1 , ξ i 2 , . . . are i.i.d. uniform random variables, Γ q ξ i 1 , ξ i 2 , . . . , ξ i q is the Legendre polynomial of degree q, and α i 1 , α i 1 i 2 , . . . are the corresponding coefficients. For notational simplicity, the series can be written as where the series is truncated to a finite number of terms and there exists a correspondence between Γ q ξ i 1 , ξ i 2 , . . . , ξ i q , Φ(ξ), and their corresponding coefficients. The unknown coefficients can be determined by projecting each state variable onto the polynomial chaos basis (i.e., the Galerkin projection method). Once the reduced-order model is formulated (using orthonormal basis functions), the mean and variance can be directly obtained as The uncertainty obtained by the gPC method is the variance of the polynomial chaotic coefficients. In order to ensure the unity of units, the standard deviation is derived to express the final uncertainty. In addition, in order to evaluate the uncertainty of the ground truth ET, we can adopt the following formula: where U is the uncertainty of the ground truth ET obtained by the uncertainty analysis method, gt represents the abbreviation of the name of the ground truth ET, and ET gt is the mean value of the ground truth ET.