Assessing Satellite, Land Surface Model and Reanalysis Evapotranspiration Products in the Absence of In-Situ in Central Asia

Shortfalls in regular evapotranspiration (ET) monitoring and evaluation pose a huge challenge to agricultural water resource distribution in arid Central Asia (CA). In this study, a first detailed regional assessment of GLEAM, ERA5, MERRA2, CLSM and NOAH ET products in CA was performed by systematically implementing the triple collocation (TC) method, in which about 36,936 grid cells for each ET data (within a six-triplet design) were collocated, at 0.25° and with monthly resolutions during 2003–2020. The reliability of the strategy adopted was confirmed in four arid biomes using standard evaluation metrics (R, RMSE and BIAS), and by spatiotemporal cross-validation of the six ET triplets across CA. Results show that the systematic TC method produced more robust ET product assessment metrics with reduced RMSEs compared to the initial ET product validation using in-situ, which showed weak-positive correlation and high negative bias-range (i.e., −21.02 ≤ BIAS < 16 mm) in the four arid biomes of CA. The spatial cross-validation by TC showed that the magnitude of ET random errors significantly varies, and confirms the systematic biases with site-scale measurements. The highest ET uncertainties by CLSM (27.43%), NOAH (29.16%), MERRA2 (38.28%), ERA5 (36.75), and GLEAM (41%) were more evident in the shrubland, cropland, grassland, cropland again, and desert biomes, respectively. Moreover, error magnitudes in high altitudes (Tianshan Mountain range) are generally lower than in plain-desert areas. All ET products spatially captured ET dynamics over CA, but none simultaneously outperformed the other. These findings are invaluable in the utilization of the assessed ET products in supporting regional water resource management, particularly in CA.


Introduction
Terrestrial evapotranspiration (ET) controls land surface-atmosphere interactions and is critical to the physical processes of regional hydrology [1], and agricultural water productivity [2]. ET is a key climate change indicator [3,4], however, it is difficult to calculate, Geographically, Central Asia (CA) is located in the heart of the Eurasian continent covering an expanse of 566 × 104 km 2 , between longitude 46 • 30 to 96 • 30 E and latitude 30 • 20 to 50 • 30 N ( Figure 1). CA comprises Kazakhstan, Kyrgyzstan, Tajikistan, Turkmenistan, Uzbekistan and Xinjiang in northwest China [39]. The region is characterized by large portions of deserts (~74%, e.g., the Taklamakan and Karakum deserts which are habitats to peculiar water stressed shrubs and herbs), and a small fraction of humid regions (~26%) containing croplands, broadleaf and needle leaf vegetation types [40,41]. The general vegetation type is grass. The snowcapped Mt. system in CA extends from the Altai Mt. in the north to Pamir Mt. in the south, with the Tianshan Mt. located in-between them ( Figure 1). This Mt. system interacts with the dominant wind system (i.e., the westerlies) to regulate air mass circulation, control rainfall distribution and atmospheric pressure from low to mid-latitude of CA and beyond [40]. Due to water scarcity in CA, intermittent runoff rivers provide water resources at all levels to support the irrigation activities which also helps to boost the food security status of the region [6]. Largely, rainfall is low in CA (mean annuals < 250 mm), and combines with increased temperature anomalies to impact the region's local climate which further induces high vapor flux transfer from different land cover types [2]. Therefore, water imbalances remain a huge challenge in the region.

State-of-the-Art ET Products
Five global ET data sets were comprehensively evaluated in this study (Table 1). They include Global Land Evaporation Amsterdam Model (GLEAM) satellite ET [11], two Global Land Data Assimilation System (GLDAS)-Land Surface Model (LSM) ET products, i.e., Catchment LSM [42] and NOAH LSM [10], and two reanalysis ET data sets, i.e., European Centre for Medium-Range Weather Forecasts Reanalysis-5 (ERA-5) [14] and the Modern-Era Retrospective Analysis for Research and Application Land version 2 reanalysis product (MERRA2) [15]. We assessed the GLEAM actual land evaporation dataset at a monthly time scale and at the 0.25° × 0.25° resolution from 2003 to 2020 July [11]. It was produced using a set of algorithms, such as the Priestly-Taylor formulation with explicit soil moisture stress and the Gash interception model method and forced by different satellite data [43]. The assessed CLSM's parametrization scheme maintains a temporal consistency between the ECMWF-forced (2003 to present) and the Princeton-forced  simulations, with the available data starting from February 2003 till date, at 0.25° spatial resolution [42]. The meteorological forcings used for simulating NOAH include air temperature, humidity, precipitation, surface pressure, downwelling and upwelling shortwave and longwave radiation, and wind speed [10]. We focused on the newer Level 4 (L4) NOAH v2.1 (0.25° × 0.25°) at the monthly temporal resolution and was evaluated during 2003-2020. We assessed the updated ERA5 with an advanced radiative transfer model based on a 10-member ensemble 4DVar [14], at 0.25° × 0.25° spatial scale from 2003 to 2020. ERA5 was updated from the ERA-Interim reanalysis [44]. Lastly, the recent MERRA2 [15] derived from energy balance at the land surface and was reinforced with the CLSM [45] was also evaluated. To align CLSM and NOAH with GLEAM, ERA5 and MERRA2, the measurements were converted from W m −2 to mm/m. Both ERA5 and

State-of-the-Art ET Products
Five global ET data sets were comprehensively evaluated in this study (Table 1). They include Global Land Evaporation Amsterdam Model (GLEAM) satellite ET [11], two Global Land Data Assimilation System (GLDAS)-Land Surface Model (LSM) ET products, i.e., Catchment LSM [42] and NOAH LSM [10], and two reanalysis ET data sets, i.e., European Centre for Medium-Range Weather Forecasts Reanalysis-5 (ERA-5) [14] and the Modern-Era Retrospective Analysis for Research and Application Land version 2 reanalysis product (MERRA2) [15]. We assessed the GLEAM actual land evaporation dataset at a monthly time scale and at the 0.25 • × 0.25 • resolution from 2003 to 2020 July [11]. It was produced using a set of algorithms, such as the Priestly-Taylor formulation with explicit soil moisture stress and the Gash interception model method and forced by different satellite data [43]. The assessed CLSM's parametrization scheme maintains a temporal consistency between the ECMWF-forced (2003 to present) and the Princetonforced (1948-2012) simulations, with the available data starting from February 2003 till date, at 0.25 • spatial resolution [42]. The meteorological forcings used for simulating NOAH include air temperature, humidity, precipitation, surface pressure, downwelling and upwelling shortwave and longwave radiation, and wind speed [10]. We focused on the newer Level 4 (L4) NOAH v2.1 (0.25 • × 0.25 • ) at the monthly temporal resolution and was evaluated during 2003-2020. We assessed the updated ERA5 with an advanced radiative transfer model based on a 10-member ensemble 4DVar [14], at 0.25 • × 0.25 • spatial scale from 2003 to 2020. ERA5 was updated from the ERA-Interim reanalysis [44]. Lastly, the recent MERRA2 [15] derived from energy balance at the land surface and was reinforced with the CLSM [45] was also evaluated. To align CLSM and NOAH with GLEAM, ERA5 and MERRA2, the measurements were converted from W m −2 to mm/m. Both ERA5 and MERRA2 were resampled to 0.25 • × 0.25 • to match with LSM and satellite ET data sets. A summary of these ET products is listed in Table 1.  [15] 2003-2020 Priestly-Taylor algorithm with canopy interception Martens, Gonzalez Miralles [11] 2003-July 2020

In-Situ Measurements
As at the time of this study, CA still does not have a single flux tower instrument in the global Flux Network (FLUXNET) system [38], which is the comprehensive platform for verifying satellite and alternative ET products [37]. Therefore, this study made use of in situ observations from four independent EC flux towers located in the crop, desert, grass and shrubland biomes in CA (red dots in Figure 1). Detailed preprocessing of the flux data sets has already been elaborated and documented [41,46]. The flux data have been extensively utilized by previous researchers, of which some are coauthors in this study [2,41,46]. The flux towers are currently managed by the Institute of Ecology and Geography, Chinese Academy of Sciences, China. Details of these flux towers are listed in Table 2.

Evaluation Approach
First, five state-of-the-art ET products obtained from different sources (Table 1) from February 2003 to July 2020 were bilinearly interpolated to align them to the same spatial grid of 0.25 • resolution (Section 2.2). Then, the Triple Collocation (TC) method [31,47] was applied to derive the representativeness errors in three parallel analyses of six different ET triplets (ET1, . . . ET6), including the independent uncertainties of each triplet component (i.e., CLSM, ERA5, GLEAM, MERRA2 and NOAH), in four arid biomes in CA (Section 2.5). Figure 2 shows the structural design of the triplets, while Figure 3 is the summary of the study's methodological framework. Due to insufficient in situ data in CA, we considered the reanalysis products (i.e., ERA5 and/or MERRA2) as a reasonable proxy in the triplet combinations; based on their practical hydrological applications [48,49] and performances in previous ET-based studies [50][51][52]. Figure 4 shows that the triplet design is both temporally and spatially reliable.
was performed to confirm their consistency, (ii) spatial cross-validation of the triplets containing any one of the reference products (i.e., ET1 or ET2, and ET3) against a related reference-based triplet (i.e., ET4 or ET5), and then against the validation triplet (i.e., ET6). Note that GLEAM ET provides actual evaporation estimates [11,43], therefore, we combined GLEAM and the two reference data sets to give us the validation triplet-ET6. So, the spatial cross-validation was constructed as follows: ET1 vs. ET5, ET2 vs. ET4, and ET3 vs. ET6, respectively.

Figure 2.
The Triplet design for the TC implementation. Left part of the figure shows the three ET measurement systems, including land surface model ET products (GLDAS NOAH and CLSM), satellite ET model (GLEAM), and reanalysis ET products (ERA5 and MERRA2). The triplet combinations comprise either two or three measurement systems as shown on the first column of the right panel of the figure. The blue background area shows the triplets comprising three different ET measurement systems (which is the ideal combination in TC analysis), the yellow background area consists of the triplets designed with two measurement systems (LSM and Reanalysis), and another triplet combination also comprised of two measurement systems (satellite and reanalysis) with pink background. Overall, the six triplets (ET1, …ET6) comprised of one or two reanalyses (ERA5 or MERRA2) which formed the reference component in the triplets for the TC analysis. Then the last triplet was used for validation purposes, however, all six triplets were independently evaluated to achieve a comprehensive assessment of the individual ET product in the study area.
We also assessed the accuracy of the ET products before applying the TC method (Section 3.3). Regardless of the ET products' positions in the triplet design ( Figure 2), their comprehensive spatial and temporal random error variances in the diverse arid ecosystem biomes were realized for each triplet (ET1, …ET6), as well as for each individual ET product (Sections 3.1, 3.2, 3.4 and 3.5). Similarly, the TC error structures for each product were assessed independently in the biomes using standard evaluation metrics (Section 3.6), and the relative uncertainties of each ET product were calculated using the TC-based standard error variances. Hence, allowing for a comprehensive assessment of the ET products in arid CA (Section 3.7). We defined the ET product uncertainty in this work as the RMSE expressing the variance of residual errors (Θ ) of each ET product obtained from Equation (1) [32]. The study methodological framework is shown in Figure 3, while details of the procedures are presented in Section 2.5. The blue background area shows the triplets comprising three different ET measurement systems (which is the ideal combination in TC analysis), the yellow background area consists of the triplets designed with two measurement systems (LSM and Reanalysis), and another triplet combination also comprised of two measurement systems (satellite and reanalysis) with pink background. Overall, the six triplets (ET1, . . . ET6) comprised of one or two reanalyses (ERA5 or MERRA2) which formed the reference component in the triplets for the TC analysis. Then the last triplet was used for validation purposes, however, all six triplets were independently evaluated to achieve a comprehensive assessment of the individual ET product in the study area.
Next, to ensure that a single product error from any one of the six triplets is stable and reliable [53], a two-step validation procedure was separately performed on the reference data sets; (i) direct linear comparison of the two references (i.e., ERA5 and MERRA2) was performed to confirm their consistency, (ii) spatial cross-validation of the triplets containing any one of the reference products (i.e., ET1 or ET2, and ET3) against a related referencebased triplet (i.e., ET4 or ET5), and then against the validation triplet (i.e., ET6). Note that GLEAM ET provides actual evaporation estimates [11,43], therefore, we combined GLEAM and the two reference data sets to give us the validation triplet-ET6. So, the spatial cross-validation was constructed as follows: ET1 vs. ET5, ET2 vs. ET4, and ET3 vs. ET6, respectively.
We also assessed the accuracy of the ET products before applying the TC method (Section 3.3). Regardless of the ET products' positions in the triplet design ( Figure 2), their comprehensive spatial and temporal random error variances in the diverse arid ecosystem biomes were realized for each triplet (ET1, . . . ET6), as well as for each individual ET product (Sections 3.1, 3.2, 3.4 and 3.5). Similarly, the TC error structures for each product were assessed independently in the biomes using standard evaluation metrics (Section 3.6), and the relative uncertainties of each ET product were calculated using the TC-based standard error variances. Hence, allowing for a comprehensive assessment of the ET products in arid CA (Section 3.7). We defined the ET product uncertainty in this work as the RMSE expressing the variance of residual errors Θ 2 of each ET product obtained from Equation (1) [32]. The study methodological framework is shown in Figure 3

Characterizing Error Variances of the ET Products without In-Situ
We used the triple collocation (TC) method [47] which has the ability to estimate random error variances from point-scale observation and from coarse-resolution satellite data to characterize the ET products error variances [31]. Here, we mainly describe the direct application of TC to our ET data sets, as TC has been effectively used in numerous studies with detailed mathematical derivations already extensively documented [30,31,43,49]. Thus, we applied the TC technique which adopts a linear relationship between the ET measurement systems (Θ → ∈ [ , , ], i. e., the satellite, LSM and/or reanalysis ET products of the same grid cells: 25 × 25 km 2 ) and their supposed true state (Θ , i. e. , hypothetical true value) to estimate the error variances (Θ ) of the ET products, as in Equation (1): where, and are the systematic additives and multiplicative biases, and is the random error/additive zero-mean random noise [31,43]. The TC technique (Equation (1)) does not require high-quality reference data set [54]. However, the TC application in this study was guided by the following caveats: structural errors from the collocations (ET triplets) were uncorrelated, orthogonal and independent with respect to the truth [31,55]. So, to estimate the TC error statistics, a total of six combinations (i.e., ET1, …ET6) were realized from the five ET products based on a systematic tree selection (Figure 2), which produced several hundreds of collocated samples of the data sets, sufficient enough for the required assessment. A minimum of a hundred samples has been recommended [30,31].  The collocated data sets have the same spatial and temporal scales, and it is assumed that their error structure is Gaussian. However, where they solve for dissimilar scales (which we try to minimize during the interpolation analysis), the common variance to the lesser spatial and/or temporal scales forms part of the variances of the residual errors. These variances are representativeness errors [33,34]. Therefore, the spatiotemporally collocated triplets of the ET products were averaged by their cross-multiplied differences (Equation (2)), which is a representation of the diagonal elements of the error covariance matrix of the supposed reference data i. e. , = σ , as follows: . . ET6) which shows that triplet combination with two reference ET products (ERA5 and MERRA2) are consistent in the triplet design (i.e., ET4 and ET5).

Characterizing Error Variances of the ET Products without In-Situ
We used the triple collocation (TC) method [47] which has the ability to estimate random error variances from point-scale observation and from coarse-resolution satellite data to characterize the ET products error variances [31]. Here, we mainly describe the direct Remote Sens. 2021, 13, 5148 8 of 24 application of TC to our ET data sets, as TC has been effectively used in numerous studies with detailed mathematical derivations already extensively documented [30,31,43,49]. Thus, we applied the TC technique which adopts a linear relationship between the ET measurement systems (Θ m → i ∈ [x, y, z], i.e., the satellite, LSM and/or reanalysis ET products of the same grid cells: 25 × 25 km 2 ) and their supposed true state (Θ t , i.e., hypothetical true value) to estimate the error variances Θ 2 of the ET products, as in Equation (1): where, α and β are the systematic additives and multiplicative biases, and is the random error/additive zero-mean random noise [31,43]. The TC technique (Equation (1)) does not require high-quality reference data set [54]. However, the TC application in this study was guided by the following caveats: structural errors from the collocations (ET triplets) were uncorrelated, orthogonal and independent with respect to the truth [31,55]. So, to estimate the TC error statistics, a total of six combinations (i.e., ET1, . . . ET6) were realized from the five ET products based on a systematic tree selection ( Figure 2), which produced several hundreds of collocated samples of the data sets, sufficient enough for the required assessment. A minimum of a hundred samples has been recommended [30,31].
The collocated data sets have the same spatial and temporal scales, and it is assumed that their error structure is Gaussian. However, where they solve for dissimilar scales (which we try to minimize during the interpolation analysis), the common variance to the lesser spatial and/or temporal scales forms part of the variances of the residual errors. These variances are representativeness errors [33,34]. Therefore, the spatiotemporally collocated triplets of the ET products were averaged by their cross-multiplied differences (Equation (2)), which is a representation of the diagonal elements of the error covariance matrix of the supposed reference data i.e., R i = σ 2 i , as follows: where the ET triplet data variances are represented by σ 2 X , σ 2 Y and σ 2 Z , respectively. While the random error variances of the i ∈ [x, y, z] are represented by σ 2 x , σ 2 y and σ 2 z , respectively, also known as the covariance of the measurements [31]. The aim of the TC method (Equation (1)) is to estimate the RMSE (hereafter, RMSE TC ) by measuring the variance of residual errors for each ET product [32]. After estimating the RMSE TC according to Equation (1), we calculated for the correlation coefficient-R (hereafter, R TC ) between each of the ET products to determine their relationship and the supposed truth, according to Equation (3).
where x, y, z are the independent ET data, while σ 2 X and σ y,z are dataset variances of the dataset x and covariance between dataset x and dataset y, respectively. Derivation of R TC , as adapted from the classical TC method [56] is shown by McColl, Vogelzang [22] to extend the practical application of the TC in all terrains. Next, we approximated the relative uncertainty of each ET product by expressing the variance of their residual errors (i.e., RMSE TC ) derived from Equation (1) as the ratio of standard random error Θ 2 from the ET product's mean values at each of the validation site (biome), as in Equation (4) [21,32].
Remote Sens. 2021, 13, 5148 9 of 24 Due to the limited in situ observation which may require rescaling of the ET data to sufficient reference data space [31], we did not present the absolute uncertainties to avoid misinterpretation. However, as shown above, it was possible to independently derive the absolute error structures (RMSE TC and R TC ) following Equations (1) and (3), and the standard deviation std[−] of random error for each of the representative triplets ( ET ) (Equation (4)), representing the product's characteristic inconsistencies (uncertainties) with their corresponding reference [21]. Finally, the relative uncertainties were statistically analyzed according to biome type (Section 3.7).

Validation
The averages of the TC-based absolute error structures were validated spatially (continental) by cross-validation of the triplets (Section 3.2), and by using the statistical metrics of the standard evaluation method (SM) at the biomes (Section 3.6). Specifically, the point scale TC-based metrics at each of the validation sites (biomes) were compared with the standard metrics derived from the same biomes. In the grid point-by-point spatial analysis, a total of 36,936 grid cells for each ET product data were collocated for the period 2003-2020 for each time of the calculation using ERA5 or MERRA2 as the reference in the TC analysis. Then the TC-based grid cells of the ET product representativeness errors by the triplets were spatially compared against those of the validation triplet ET grid cells (i.e., ET1 vs. ET5 and ET2 vs. ET3, against ET3 vs. ET6), which revealed almost similar spatial error patterns, indicating the reliability of the triplet design used for the ET assessment in situ CA (Section 3.1).

Evaluation Metrics
Standard model performance metrics, such as the root mean square error (RMSE) and Pearson correlation coefficient (R) was used to evaluate the consistency and error biases of the ET product estimates against the in-situ measurements in the four arid ecosystem biomes (Equations (5) and (6)).
where n is the total number of observations, x is the ET product value, y is the in-situ observation value from either EC, while x = 1 n ∑ n i=1 x i . . . n and y = 1 n ∑ n i=1 y i . . . n. Additional statistical outputs, such as the product bias and systematic errors (∞) of the ET products, obtained by calculating the standard deviation of the product from the actual mean ET in each biome are presented in Appendix A. More importantly, the TC-based method was validated by detailed statistical intercomparison with those of the standard metrics (SM) in CA (Section 3.6). Figure 5 shows the intercomparison of the triplets used in assessing ET in CA. ET4 and ET5 are identical (Figure 4b). So, Figure 5 only displays the distribution of the R TC performance of the triplets comprising of three separate measurement systems of ET and is regarded as the optimal combination style when using the TC method (e.g., LSM, satellite and reanalysis) [31]. These triplets are ET1, ET2 and ET3 ( Figure 5). Triplet ET6 which contains only two measurement systems (i.e., satellite and reanalysis) was specifically designed for validation and cross-validation purposes. However, ET6 was also independently evaluated alongside ET4 and ET5 to achieve a comprehensive assessment of the ET products using both the three measurement systems and two measurement systems of ET ( Figure 2). The detailed statistics of all the triplets are shown in Table 3. The RC TC values of the main triplets ET1, ET2 and ET3 were 0.77, 0.69 and 0.75, respectively ( Figure 5). Compared to other triplets previously illustrated in Figure 4, ET6 produced a very high R TC value (~0.80), which most closely agree with its evaluation components (i.e., ERA5 and MERRA2). This qualifies ET6 to serve as a reliable validation triplet to ET1, ET2 and ET3, though they were all independently evaluated ( Figure 5).

Evaluation of the TC Method in Assessing ET Errors in CA
Remote Sens. 2021, 13, x 10 of 25 contains only two measurement systems (i.e., satellite and reanalysis) was specifically designed for validation and cross-validation purposes. However, ET6 was also independently evaluated alongside ET4 and ET5 to achieve a comprehensive assessment of the ET products using both the three measurement systems and two measurement systems of ET ( Figure 2). The detailed statistics of all the triplets are shown in Table 3. The RCTC values of the main triplets ET1, ET2 and ET3 were 0.77, 0.69 and 0.75, respectively ( Figure 5). Compared to other triplets previously illustrated in Figure 4, ET6 produced a very high RTC value (~0.80), which most closely agree with its evaluation components (i.e., ERA5 and MERRA2). This qualifies ET6 to serve as a reliable validation triplet to ET1, ET2 and ET3, though they were all independently evaluated ( Figure 5). It is seen that triplets ET4 and ET5 outperformed others given their high median values ( Figure 4b); however, ET6 showed less deviation from the mean ET compared to ET4 and ET5 (Table 3) 21], contrary to ET2 [-] based on its RTC value of 0.10 (p < 0.001). Based on the triplets with the ideal combination measurement systems ( Figure 5), it is expected that assessing the ET product characteristics in CA using ET2 might reveal a different performance than other triplets (ET1 and ET3). A cross-validation examination of the triplets was conducted to further determine the probability of the addictive impact (Section 3.2).  It is seen that triplets ET4 and ET5 outperformed others given their high median values ( Figure 4b); however, ET6 showed less deviation from the mean ET compared to ET4 and ET5 (  Figure 5), it is expected that assessing the ET product characteristics in CA using ET2 might reveal a different performance than other triplets (ET1 and ET3). A cross-validation examination of the triplets was conducted to further determine the probability of the addictive impact (Section 3.2). Figure 6 shows the spatial cross-validation of all the triplets. Based on the 36,936 grid sample cells generated for each triplet in the RMSE TC computation, triplets containing a corresponding reference component (i.e., ERA5 or MERRA2) were spatially cross-validated. In other words, the statistical matching is as follows: ET1 vs. ET5, ET2 vs. ET4, and ET3 vs. ET6 ( Figure 6). In terms of RMSE TC , the cross-validation illustrates a consistent distribution pattern of random errors, indicating that the TC approach is effective in measuring ET errors in the absence of in-situ observation in CA. Figure 6 shows the spatial cross-validation of all the triplets. Based on the 36,936 grid sample cells generated for each triplet in the RMSETC computation, triplets containing a corresponding reference component (i.e., ERA5 or MERRA2) were spatially cross-validated. In other words, the statistical matching is as follows: ET1 vs. ET5, ET2 vs. ET4, and ET3 vs. ET6 ( Figure 6). In terms of RMSETC, the cross-validation illustrates a consistent distribution pattern of random errors, indicating that the TC approach is effective in measuring ET errors in the absence of in-situ observation in CA.

Evaluation of the TC Method by Triplet Cross-Validation Performance in CA
Comparatively, the plain areas of CA, which extend from the Aral Sea to the southwestern parts have the highest error variabilities compared to other sections of the regions, though, at different rates ( Figure 6). Figure 6c shows that the high error variabilities also extend to the northeast of CA, which includes parts of north-west Kazakhstan and north Xinjiang in China (Figure 6c). According to Figure 6a, b, this area has a low or moderate RMSETC. North Kazakhstan and the Tianshan Mt. ranges show the least RMSETC (Figure 6a, b), whereas the rest of the CA has moderate or low RMSETC (Figure 6a, b). The cross-validation results all together revealed a consistent pattern of ET error distribution in CA.   Comparatively, the plain areas of CA, which extend from the Aral Sea to the southwestern parts have the highest error variabilities compared to other sections of the regions, though, at different rates ( Figure 6). Figure 6c shows that the high error variabilities also extend to the northeast of CA, which includes parts of north-west Kazakhstan and north Xinjiang in China (Figure 6c). According to Figure 6a, b, this area has a low or moderate RMSE TC . North Kazakhstan and the Tianshan Mt. ranges show the least RMSE TC (Figure 6a, b), whereas the rest of the CA has moderate or low RMSE TC (Figure 6a, b). The cross-validation results all together revealed a consistent pattern of ET error distribution in CA. Figure 7 present the validation scatterplots for the comparison of CLSM, ERA5, GLEAM, MERRA2, and NOAH using EC observation. The analysis was performed for the periods with available ET measurements in the desert, grassland, shrubland and cropland sites. The best performing ET models have high R values and low RMSE values, while the worst performing products have low R values and high RMSE values at the sites. Accordingly, the best performing ET models with the least RMSEs are NOAH (0.67 mm and 1.02 mm in grassland and shrubland biomes, respectively), MERRA2 (1.02 mm in cropland biome) and GLEAM (2.24 mm in the desert biome). CLSM and ERA5 showed their best performances in the desert (4.13 mm) and shrubland biomes (5.73 mm), respectively (Figure 7). The corresponding R values for these performances are 0.91 and 0.79 for NOAH in the grassland and shrubland sites, respectively; MERRA2 has an R value of 0.86 in the cropland site, while GLEAM has 0.79 in the desert site. CLSM and ERA5 have 0.80 and 0.78 in the shrubland biome, respectively. The performances varied across arid ecosystem biomes.

Assessment of the ET Products Using EC Flux Tower Measurements
their best performances in the desert (4.13 mm) and shrubland biomes (5.73 mm), respectively (Figure 7). The corresponding R values for these performances are 0.91 and 0.79 for NOAH in the grassland and shrubland sites, respectively; MERRA2 has an R value of 0.86 in the cropland site, while GLEAM has 0.79 in the desert site. CLSM and ERA5 have 0.80 and 0.78 in the shrubland biome, respectively. The performances varied across arid ecosystem biomes. Note that all the ET products show a relatively stable ET trend during the period of study, albeit with varying amplitudes in different seasons from 2003 to 2020. These might account for the differences in the ET product performance in the various biomes. However, the scatterplots show that the ET products have a weak-positive correlation with the in-situ measurements. CLSM, ERA, MERRA2 and NOAH, for example, overestimated the ET measurements in the shrubland site (Figure 7). In the cropland, GLEAM clearly underestimated ET, as did ERA5 in the desert biome (Figure 7). It is worth noting that each of the validation sites had EC flux measurements for at least one or more growing season (s). As such, validation using the in-situ observations in the corresponding months with those Note that all the ET products show a relatively stable ET trend during the period of study, albeit with varying amplitudes in different seasons from 2003 to 2020. These might account for the differences in the ET product performance in the various biomes. However, the scatterplots show that the ET products have a weak-positive correlation with the in-situ measurements. CLSM, ERA, MERRA2 and NOAH, for example, overestimated the ET measurements in the shrubland site (Figure 7). In the cropland, GLEAM clearly underestimated ET, as did ERA5 in the desert biome (Figure 7). It is worth noting that each of the validation sites had EC flux measurements for at least one or more growing season (s). As such, validation using the in-situ observations in the corresponding months with those of the ET products provided the chance to compare the results with those derived from the TC analysis (Section 3.7).
Additional statistical analysis was performed to better understand the relationship between the ET model performance at the site scale in each biome. The clarification here is that there exist some obvious systematic errors (uncertainties) with the EC measurements, as revealed by the high ET model deviation from the actual mean ET values, resulting in higher negative biases in the biomes. The additional statistics are provided in the Appendix A ( Table 1). The triplet SD provided in Table 3  with the TC-based metrics clearly confirms that the EC measurements have some system biases, as will be seen later in Section 3.7. In fact, the TC method exhibited less variance from the average ET values, as is evident in the SD (Table 3). Figure 8 shows the calculated TC metrics (RMSE TC and R TC ) stratified by triplet comparison in the four validation sites in CA. Each Triplet component error can be deduced independently from the bar ranking, on the basis of higher/lower R TC value and lower/higher RMSE TC value indicating a reasonable/poor performance of ET product in a particular biome (Figure 8). ET1 shows that CLSM and GLEAM have low RMSE TC in the desert (0.12 and 0.14 mm, respectively), and CLSM in the cropland with RMSE TC of 0. 21 mm. Conversely, ET2 showed a better performance (reduced RMSE TC ) of GLEAM ET than ET1 in the desert (0.12 mm), grassland (0.15 mm) and shrubland (0.10mm) biomes, but with high RMSE TC in cropland. Still based on the low RMSE TC values, ET2 shows that NOAH and MERRA2 ET products have good performance (0.23 mm and 0.14 mm, respectively) in the shrubland. In ET3, ERA5 showed the best performance with low RMSE TC in grassland (0.40 mm), shrubland (0.41 mm) and cropland (0.78 mm) with correspondingly high R TC values in these biomes, respectively. NOAH only performed well in the grassland (0.70 mm) biome with the highest R TC value (0.85). In general, this result clearly shows the strength and limitation of each of the triplets in determining the ET product error variance and the product's ability to retrieve/capture ET dynamics at the site-level in CA.  Figure 9 shows the ET products' spatial distribution based on TC error structures (RMSETC and RTC), while Figure 10 shows their component temporal statistics, illustrated with boxplots. They provide useful insights into the consistencies and discrepancies between the triplets in reproducing ET spatial and temporal patterns by each of the ET prod-  Figure 9 shows the ET products' spatial distribution based on TC error structures (RMSE TC and R TC ), while Figure 10 shows their component temporal statistics, illustrated with boxplots. They provide useful insights into the consistencies and discrepancies between the triplets in reproducing ET spatial and temporal patterns by each of the ET products from 2003 to 2020. The spatial trends are not entirely consistent, since all the products perform well temporally over a variety of locations (Figure 9). However, they generally conform with the spatial distribution of the validation product-ET6 (Figure 7). Compared to the remote sensing ET product (GLEAM), the CLSM and NOAH products showed considerably high RMSETC values over the western (Kazakhstan) part of the CA and Southwest Turkmenistan, respectively (Figure 9a,b); where their RTC values had almost identical scales as those of MERRA2 and ERA5, respectively (Figure 9j,k). It suggests that using ERA5 or MERRA2 as a reference in the triplets (Figure 2) produces similar distribution patterns of RMSETC and RTC. GLEAM exhibited low RMSETC values in the west and central parts (Northwest Kazakhstan, Uzbekistan, Turkmenistan and parts of Tajikistan), which corresponds to high RTC values in these regions (Figure 9c,i). In the eastern areas, GLEAM is reasonably distributed and indicates high RMSTC that corresponds to low RTC values (Figure 9c,i). These analyses suggest that the three measurement systems suffer different types of degradation in ET retrieval in CA. In the plain areas (western parts occupied by grasslands), the reference combination with LSM (CLSM and NOAH) produced high error variabilities (Figure 9a,b), while the TC reference combination with the remote sensing product produced low error variabilities, which was the opposite in the snowcapped mountains and dry desert lands in southern Xinjiang (Figure 9c,i). Figure 10 presents a detailed comparison of the calculated RMSETC temporal error patterns with their corresponding RTC, derived from the TC point-to-point grid analysis, Figure 9. TC-based spatial distribution outputs for the ET products in CA during 2003-2020. Columns 1 and 2 (i.e., (a-f)) are the RMSE TC distribution pattern, and while columns 3 and 4 (i.e., (g-l)) are the corresponding R TC distribution pattern. ET1, ET3, and ET5 used ERA5 as the triplet component, while ET2, ET4, and ET6 used MERRA2 as triplet component.

Spatiotemporal Assessment of the Triplets and Their Individual Components at Continental Scale
Compared to the remote sensing ET product (GLEAM), the CLSM and NOAH products showed considerably high RMSE TC values over the western (Kazakhstan) part of the CA and Southwest Turkmenistan, respectively (Figure 9a,b); where their R TC values had almost identical scales as those of MERRA2 and ERA5, respectively (Figure 9j,k). It suggests that using ERA5 or MERRA2 as a reference in the triplets (Figure 2) produces similar distribution patterns of RMSE TC and R TC . GLEAM exhibited low RMSE TC values in the west and central parts (Northwest Kazakhstan, Uzbekistan, Turkmenistan and parts of Tajikistan), which corresponds to high R TC values in these regions (Figure 9c,i). In the eastern areas, GLEAM is reasonably distributed and indicates high RMS TC that corresponds to low R TC values (Figure 9c,i). These analyses suggest that the three measurement systems suffer different types of degradation in ET retrieval in CA. In the plain areas (western parts occupied by grasslands), the reference combination with LSM (CLSM and NOAH) produced high error variabilities (Figure 9a,b), while the TC reference combination with the remote sensing product produced low error variabilities, which was the opposite in the snowcapped mountains and dry desert lands in southern Xinjiang (Figure 9c,i).
Remote Sens. 2021, 13, x 16 of 25 reanalysis ET products, ERA5 had a better performance than MERRA2 as in ET3. These interpretations are focused on triplets with three different ET measurement systems, with each having its independent errors. However, there exist some temporal differences in the error statistics among all the five ET products. Therefore, the type of reference data utilized in the triplet combination can have a significant impact on the TC analytical performance of the ET products in CA.  Figure 11 present the intercomparison of the two evaluation methods used in this study. Figure 11a shows the RMSE values outputted by the two evaluation methods, while Figure 11b is their corresponding R values. As demonstrated in Figure 11b, the R values of both techniques are generally high. The RMSE values, on the other hand, reveal the systematic differences between the two approaches ( Figure 11a). To examine the efficacy of both approaches in measuring ET accuracy in CA, an error (RMSE) threshold value of 2.50 mm was arbitrarily selected (red dotted line in Figure 11a).

Intercomparison and Evaluation of the TC Analysis against Standard Metric
In general, the TC method shows that ERA5 has the least RMSE value (0.61 mm) with the highest R value (0.93) in the shrubland, while MERRA2 has the highest RMSE value (5.75 mm) and lowest R value (0.52) in the desert. In the grassland site, the best performing ET model by the SM method is NOAH, which has the lowest RMSE (0.67 mm) and the highest R value (0.91). Except for GLEAM, all the ET models in the cropland, have RMSE values < 2.2 mm according to the TC method. Whereas CLSM, ERA5 and GLEAM, all have high RMSE values of 8.51 mm, 4.18 mm and 16.42 mm, respectively, according to the SM method. Figure 10. RMSE TC and R TC temporal distribution for the ET products in CA during 2003-2020. y-axis of RMSE TC end at different values to improve visualization. The boxplots show the continental-scale media and interquartile range metrics for each ET product during in CA. Interpret these results by comparing the key boxplot metrics (the median, the 25th and 75th percentiles, and the interquartile range bounded by the top and bottom whiskers). Typically, ET products with low RMSE TC and high R TC values indicate a good performance. Figure 10 presents a detailed comparison of the calculated RMSE TC temporal error patterns with their corresponding R TC , derived from the TC point-to-point grid analysis, using ERA5 or MERRA2 as a reference during 2003-2020. CLSM outperformed NOAH (high R TC and low RMSE TC ) among the LSM parent products, as in ET1. The equivalent performance of NOAH and MERRA2 in east Xinjiang, CA, as illustrated by Figure 9b,d, may be attributable to a substantial dependency on the triplet component. This could be the reason why CLSM outperformed NOAH as shown in Figure 10. The remote sensing product (GLEAM) performed best (lowest RMSE TC and highest R TC ), as in ET2. For the reanalysis ET products, ERA5 had a better performance than MERRA2 as in ET3. These interpretations are focused on triplets with three different ET measurement systems, with each having its independent errors. However, there exist some temporal differences in the error statistics among all the five ET products. Therefore, the type of reference data utilized in the triplet combination can have a significant impact on the TC analytical performance of the ET products in CA. Figure 11 present the intercomparison of the two evaluation methods used in this study. Figure 11a shows the RMSE values outputted by the two evaluation methods, while Figure 11b is their corresponding R values. As demonstrated in Figure 11b, the R values of both techniques are generally high. The RMSE values, on the other hand, reveal the systematic differences between the two approaches ( Figure 11a). To examine the efficacy of both approaches in measuring ET accuracy in CA, an error (RMSE) threshold value of 2.50 mm was arbitrarily selected (red dotted line in Figure 11a). In the shrubland site, only MERRA2 exhibited a high TC RMSE (2.80 mm) value of above 2.50 mm, whereas the SM RMSE values of CLSM, ERA5, GLEAM and MERRA2 were above 2.50 mm (Figure 11a). Additionally, SM exhibited similar behavior for these four ET models in the grassland. In the desert site, GLEAM showed an RMSE value of 2.24 mm according to the SM (Figure 11a). The red dotted error margin in Figure 11a clearly shows that the TC technique produced low RMSE values in all the locations and outperformed the traditional evaluation method (SM). In Figure 12, we present a more comprehensive summary of the evaluation illustrating the degree of significance in the analyses by TC and SM methods in CA. The boldly highlighted values, with large magnitudes as indicated by the blue and red color scales, are the ET evaluation metrics with significantly better performance in the different arid ecosystem biomes ( Figure 12). In general, the TC method shows that ERA5 has the least RMSE value (0.61 mm) with the highest R value (0.93) in the shrubland, while MERRA2 has the highest RMSE value (5.75 mm) and lowest R value (0.52) in the desert. In the grassland site, the best performing ET model by the SM method is NOAH, which has the lowest RMSE (0.67 mm) and the highest R value (0.91). Except for GLEAM, all the ET models in the cropland, have RMSE values < 2.2 mm according to the TC method. Whereas CLSM, ERA5 and GLEAM, all have high RMSE values of 8.51 mm, 4.18 mm and 16.42 mm, respectively, according to the SM method.

Intercomparison and Evaluation of the TC Analysis against Standard Metric
In the shrubland site, only MERRA2 exhibited a high TC RMSE (2.80 mm) value of above 2.50 mm, whereas the SM RMSE values of CLSM, ERA5, GLEAM and MERRA2 were above 2.50 mm (Figure 11a). Additionally, SM exhibited similar behavior for these four ET models in the grassland. In the desert site, GLEAM showed an RMSE value of 2.24 mm according to the SM (Figure 11a). The red dotted error margin in Figure 11a clearly shows that the TC technique produced low RMSE values in all the locations and outperformed the traditional evaluation method (SM). In Figure 12, we present a more comprehensive summary of the evaluation illustrating the degree of significance in the analyses by TC and SM methods in CA. The boldly highlighted values, with large magnitudes as indicated by the blue and red color scales, are the ET evaluation metrics with significantly better performance in the different arid ecosystem biomes ( Figure 12). Dark red > red > light red > lighter red represents the color codes from the lowest R and corresponding higher RMSE values, and vice-versa. Therefore, the fading blue and red scales represent degradation in the performance of the ET product in the biomes in CA. This clearly shows that the TC method can reliably assess ET with reduced uncertainty in regions without available in situ, such as CA. Figure 13 present the relative uncertainty of each ET model in cropland, desert, grassland and shrubland biomes. Here, we concentrate on the ET product uncertainty based on the TC standard error variances, which are expressed as the ratio of the RMSE to the mean of the ET in each biome. The highest uncertainty was associated with GLEAM in the cropland biome at 41.30%. However, it has a low uncertainty ratio in the shrubland biome, at 15%.

Uncertainty Assessment of the ET Products Based on TC Technique
In descending order, the next products with high uncertainties are MERRA2 in the desert biome (38.28%), ERA5 in both grassland and desert biomes (36.79% and 31.90%, respectively), and MERRA2 again in the grassland biome (31.24%). NOAH depicted the highest ET uncertainties among the LSM ET products in the cropland and desert biomes, at 29.16% and 28.50%, respectively. The uncertainty ratio by CLSM ET ranges from 21.06% in the desert biome to 25.82% in the cropland biome. In general, ET products with the least uncertainty are ERA5 in the shrubland (11.78%), MERRA2 in the cropland (11.88%) and GLEAM in the shrubland (15.00%). Dark red > red > light red > lighter red represents the color codes from the lowest R and corresponding higher RMSE values, and vice-versa. Therefore, the fading blue and red scales represent degradation in the performance of the ET product in the biomes in CA. This clearly shows that the TC method can reliably assess ET with reduced uncertainty in regions without available in situ, such as CA. Figure 13 present the relative uncertainty of each ET model in cropland, desert, grassland and shrubland biomes. Here, we concentrate on the ET product uncertainty based on the TC standard error variances, which are expressed as the ratio of the RMSE to the mean of the ET in each biome. The highest uncertainty was associated with GLEAM in the cropland biome at 41.30%. However, it has a low uncertainty ratio in the shrubland biome, at 15%.

Discussion
A point-by-point grid assessment of ET dynamics using six triplets of collocated raster maps without in situ reference components was achieved in this study. Previous studies implementing TC methods in regions with sufficient in situ observations tend to include in situ component in the triplet analysis [21,49], however, LSM or reanalysis satellite reference can also be independently used and/or alongside with in situ in the TC analysis In descending order, the next products with high uncertainties are MERRA2 in the desert biome (38.28%), ERA5 in both grassland and desert biomes (36.79% and 31.90%, respectively), and MERRA2 again in the grassland biome (31.24%). NOAH depicted the highest ET uncertainties among the LSM ET products in the cropland and desert biomes, at 29.16% and 28.50%, respectively. The uncertainty ratio by CLSM ET ranges from 21.06% in the desert biome to 25.82% in the cropland biome. In general, ET products with the least uncertainty are ERA5 in the shrubland (11.78%), MERRA2 in the cropland (11.88%) and GLEAM in the shrubland (15.00%).

Discussion
A point-by-point grid assessment of ET dynamics using six triplets of collocated raster maps without in situ reference components was achieved in this study. Previous studies implementing TC methods in regions with sufficient in situ observations tend to include in situ component in the triplet analysis [21,49], however, LSM or reanalysis satellite reference can also be independently used and/or alongside with in situ in the TC analysis to achieve a robust assessment [34]. The implementation of six ET triplets in this study was based on a systematic triplet design, which combined three ET measurement systems (i.e., LSM, satellite, and reanalysis, hence ET1, ET2 and ET3) and Two ET measurement systems (i.e., satellite and reanalysis, hence ET4, ET5 and ET6) from 2003 to 2020 ( Figure 2). Figure 4 shows the effectiveness of the triplet design conceived for the study area. The consistency of the triplet reference components (i.e., ERA5 and MERRA2) were ascertained by their high correlation coefficients of 0.96, 0.93, 0.97 and 0.95 in the cropland, desert, grassland and shrubland biomes, respectively ( Figure 4a); then, by their high matching median values of 0.93 and 0.92 as revealed by the parent triplets, as in ET4 and ET5, respectively. Figure 4b shows the performance of all the triplets, while Figure 5 shows the good performance of the triplets with the ideal combination (i.e., ET1, ET2 and ET3), and the triplet used for validation of other triplets (ET6). All the triplets were implemented independently to assess the individual ET characteristics and random error variances in the absence of a harmonized actual observation in CA. The spatial cross-validation of the triplets produced a consistent pattern of random error distribution over the study area ( Figure 6).
Moreover, evaluation of the individual ET products using EC flux tower observation produced both high/low correlation coefficients and low/high RMSEs, depending on the biome type where a product was validated. Figure 7 shows, however, that the linear relationship between the ET models and in situ is not strong enough, but weakly correlated. Some products underestimated/overestimated ET in the various arid ecosystem biomes. For example, CLSM, ERA5, MERRA2 and NOAH overestimated ET in the shrubland site with an RMSE value range between 1.14-5.73 mm/m and R-value range between 0.70-0.90; whereas GLEAM underestimated ET in the shrubland and cropland biomes with R and RMSE values of 0.70 and 4.24 mm/m and 0.46 and 16.42 mm/m, respectively. Despite the underestimation of GLEAM in cropland and shrubland sites, we note that it performed reasonably well in both desert (0.79 R value and 2.24 mm RMSE) and grassland (0.73 R value and 3.52 mm RMSE) biomes.
The grassland result of this study is consistent with the report by Yang, Yong [18] that GLEAM ET was closer to the EC observation in the three grassland sites in China than in the cropland site. The result of this study exclusively pertains to the ecosystem biomes in CA, however, it suggests that GLEAM can also perform well under less water availability in drylands depending on the biome characteristics or plant functional type [57]. Previous studies found that the covariational differences between ERA5, GLEAM and GLDAS-NOAH with PET, Precipitation (P), and P plus groundwater suggest a significant amount of energy exchange during low P conditions, which is a major reason for inconsistencies and errors in these ET products [23,58]. Another explanation is that the saturation of satellite signals of the forcing drivers or model calibration strategy can greatly affect ET product performance, especially over heterogeneous areas [24,59]. To a large extent, these issues can be resolved through the choice of forcing data used in the ET products' parametrization.
Unlike in tropical ecosystems, for instance, in East Asia, Khan, Liaqat [21] found that GLEAM and GLDAS-NOAH tend to show better correlation with in situ in cropland and tropical forest biomes than in the arid ecosystems of CA. Similarly, in Australia, Baik, Liaqat [19] reported that GLEAM and GLDAS showed better performance (R~0.8) over crop and forest areas, respectively. In this study, the R-values for GLEAM and GLDAS-NOAH in situ in the cropland are 0.46 and 0.90, respectively. One major reason for ET products' general underestimation in arid biomes is the limited water resources, which is a major factor in CA, and tend to change the desert plant functional types while also constraining their effective transpiration even in the grown season [57]. The implication is that hydrological processes, for example, sublimation and snowmelt, which are common physical processes that impact water cycling and energy transfer in other sections of CA (e.g., forest and Tianshan mountain areas) can be weakened during the peak of evaporation in growing seasons [60,61].
TC analysis in this study provides in-depth intercomparisons of the ET products statistics ( Figure 8). First, we noticed that all triplet components (i.e., the five ET products) exhibited different magnitudes of variability in all the biomes, which is expected. Note that the bias and gain may not be site (biome) dependent, because the capability to reproduce error variability by the ET measurement systems should ideally be performed with respect to a sufficient amount of actual observation [43], as has been demonstrated with the satellite soil moisture data sets [48,49]. This is not the case in this study (Figure 4), as we used triplet collocations without in situ but reference data ( Figure 2).
In However, the point-wise TC analysis predicts much better results of land surface fields (e.g., soil moisture), because it allows the gains to change from point to point, and with the ERA-Land reanalysis, better performance is guaranteed [49].
As shown in Figure 6, the spatial distribution of the cross-validation outputs shows a consistent pattern of ET error patterns over CA, but at varying magnitudes ( Figure 6). It explains the spatial discrepancies by the ET models in different parts of CA (Figure 9). These impacts might have resulted from topographic and climatic constraints prevalent across CA [40]. CA has two distinct landscape types: the plain areas dominated by desert grasslands and the elevated mountain ranges with patches of vegetated zones. Regions occupied by these two physical environments are typified by different climatology that produces varying hydrological processes and energy exchanging patterns [60], which can affect the performance of the different ET products in the arid environment as evident in Figure 10.
In summary, none of the ET products performed well in all the biomes at the same time, indicating their biome-specificity ( Figure 13). Different forms and causes of uncertainties [8,29] can be attributed to the performance of these ET products in CA. Besides that, CA is poorly observed, the physical investigation of different land surface fields (notably ET) is usually compounded by the region's remoteness and complicated geography. All the ET products, particularly the LSM products (i.e., CLSM and NOAH) exhibited the best stable monthly temporal pattern with each other, but not exactly with the observation in the shrubland.
However, because the TC method does not calculate biases [43], so the additional statistical analysis revealed both the biases and the systematic errors (mean ± SD) associated with the EC flux tower measurements ( Table 1). The site-scale bias range, starting from the worst performing are −10 ≤ bias ≤ 16, −21 ≤ bias ≤ 4.43, −14.92 ≤ bias 4.19, and −17.38 bias 2.27 in cropland, desert, grassland and shrubland biomes, respectively. As a result of the systematic errors connected with EC flux tower measurements (Table 1), describing the absolute uncertainties associated with the five ET products in the various biomes without some type of expert preconception is challenging. Therefore, we can only attribute the inherent inconsistencies displayed by the ET products in the four arid ecosystem biomes as only illustrative of their corresponding ET product biases, as demonstrated by the TC analysis [21].

Conclusions
This study spatially and temporarily evaluated the random error variances in five stateof-the-art ET products, including CLSM, ERA5, GLEAM, MERRA2 and NOAH, from 2003 to 2020 in CA. Both TC and EC methods were thoroughly tested under different conditions of data availability. The first major caveat was to ensure the applicability of the TC method to assess ET data sets without in situ observations in CA. The spatiotemporal collocations in the TC analysis, which was based on a systematic triplet design encompassing three and two ET measurement systems was constructed and implemented, yielding six independent ET triplets (i.e., ET1 . . . ET6, Figure 2).
The correlation of the reanalysis reference (ERA5 or MERRA2) used in the triplets against in situ was very high in the cropland, desert, grassland and shrubland biomes (i.e., 0.96, 0.93, 0.97 and 0.95, respectively) ( Figure 4a). The validation of the triplet types which were based on spatial and temporal intercomparison of the ET measurement systems revealed that triplets with the ideal combinations (i.e., Satellite, LSM and Reanalysis) do not necessarily outperform triplets with only two ET measurement systems (i.e., Reanalysis, LSM, and Reanalysis, or Satellite, Reanalysis and Reanalysis) when assessing ET dynamics in the absence of in situ; this is especially true for the biomes in CA. Overall, all the ET products have the capacity to capture ET dynamics at different levels of accuracy, however, none of the products simultaneously performed well in all the four arid biomes in CA.
This study provides the first evaluation metrics on the newer/updated versions of the five ET products in CA without using in situ, even though in situ evaluation was performed but with less coverage. Therefore, the study reiterates the need to extend energy flux observation towers to the remote parts of the Eurasian continent to better understand ET dynamics in the region, including its ecological divergences with other regions of the world. Besides supporting regional water resource management, the effort will greatly enhance global change and climate change research projects; especially when these uncharted biomes located in unique climate regions of the world are specifically captured in future studies.  Data Availability Statement: The source of data used in this study is provided in the manuscript. thank the Xinjiang Institute of Ecology and Geography, CAS for providing the eddy covariance flux tower data. We thank the editors and anonymous reviewers for their constructive comments and suggestions which helped to improve the paper.

Conflicts of Interest:
The authors declare no conflict of interest or state.
Appendix A Table 1. Additional statistics of ET product evaluation using in situ measurements in the four arid biomes in Central Asia.