Synthesizing a Regional Territorial Evapotranspiration Dataset for Northern China

Actual evapotranspiration (ET) plays a vital role in the energy balance and the hydrological cycle and is relevant to many agricultural, ecological and water resource management studies. The available global and regional ET products differ in temporal range, spatial resolution and calculation method (algorithms, inputs, parameterization, etc.), which introduces varying degrees of uncertainty. Northern China is the main agriculturally productive region supporting the whole country; thus, understanding the spatial and temporal changes in ET is essential to ensure water resource and food security. We developed a synthesis ET dataset for Northern China at a 1000 m spatial resolution and a monthly temporal resolution, covering the period from 1982 to 2017, based on an in-depth assessment of several ET products. Specifically, the ET products were assessed against in situ ET measured at eddy covariance (EC) observation towers at the site-pixel scale, over interannual months and under different land cover types, climatic zones and elevation levels, to select the best-performing ET products for the synthesized ET dataset. The assessment sheet involved eight indicators under 21 conditions, and the occurrences of the different ET products in the sheet and the corresponding ratios were analyzed to select the best-performing ET products, which were then combined into the synthesis ET dataset using the weighted mean method. The weights were determined by the Taylor skill score (TSS), calculated from the ET products and the EC ET observation data.
Based on the assessment results, the Penman–Monteith–Leuning (PML_v2), ETWatch and Operational Simplified Surface Energy Balance (SSEBop) datasets were selected for the synthesis ET dataset from 2003 to 2017, while the Global Land Evaporation Amsterdam Model (GLEAM) v3.3a, complementary relationship (CR) ET and Numerical Terradynamic Simulation Group (NTSG) datasets were chosen for the synthesis ET dataset from 1982 to 2002. The weighted mean synthesized results from 2003 to 2017 performed well when compared to the in situ measured EC ET values under all of the above conditions, while the synthesized results from 1982 to 2002 performed well in a water balance assessment in the Heihe River Basin. These results can provide more stable ET estimations for Northern China, which can contribute to relevant agricultural, ecological and hydrological studies.


Introduction
Evapotranspiration (ET) is the process by which liquid or solid water that has reached the ground as precipitation is converted into water vapor and returned to the atmosphere during the hydrological cycle [1-3]; it mainly involves surface water evaporation, soil water evaporation and vegetation canopy transpiration [4,5]. From a global perspective, two-thirds of precipitation is returned to the atmosphere by evapotranspiration [6]. Therefore, ET is a critical

Yang et al. [63] applied the BMA method to merge eight satellite-based ET datasets and attributed the highest accuracy to an ensemble of four models (Reg2, PT-JPL, RRS-PM and MODIS ET). Niyogi et al. [67] applied different statistical metrics, including STS, to compare satellite-based and land surface-based ET datasets and found that the combination of MODIS ET and the GLEAM dataset generated the most accurate ET estimates in the United States. Other studies (e.g., Yao et al. [65], Vinukollu et al. [61], Mueller et al. [68], Badgley et al. [69], and Jiang et al. [70]) have applied multi-dataset synthesis and generated regional and global ET datasets with different degrees of bias. This research aims to provide an assessment-based synthesis scheme with Taylor skill score (TSS)-based weights and to establish a synthesis ET dataset for Northern China. Validations at the site-pixel scale and intercomparisons among different global and local ET products are carried out with in situ EC measurements over different land cover types, elevation levels and climatic zones to select the ET datasets with the best performance over these varied conditions for further synthesis, which also provides a better understanding of the validity and characteristics of the different ET datasets in Northern China. In turn, the selected ET products represent the best-performing products within Northern China and can be synthesized using TSS-based weights into monthly ensemble ET datasets.

Study Area
Eighteen Chinese provinces were taken into consideration as the defined study area, which comprises northern and northeastern China and parts of the northwestern, eastern and central regions of China. The provinces of Heilongjiang, Liaoning, Jilin, Xinjiang, Inner Mongolia, Hebei, Shanxi, Shandong, Ningxia, Beijing and Tianjin are fully encompassed, while the Qinghai, Gansu, Shaanxi, Henan, Hubei, Anhui and Jiangsu Provinces are partially covered. The spatial extent covers latitudes from 28.02° N to 54.21° N and longitudes from 71.37° E to 136.72° E. The land area covers 5.53 million km², accounting for 57.6% of the country's total area. Most of the study area is located in the temperate zone, and the climatic zone is humid, semihumid, semiarid and arid from east to west. Generally, the land cover types in Northern China are unique and complex, with mountains, plateaus, basins, hills and plains and many different types of vegetation (Figure 1). In addition, Northern China is agriculturally vital and is where a majority of the wheat and corn resources are produced [32].

Evapotranspiration
Eleven global and two local actual ET datasets were collected for this research. The ET estimations from FLDAS, GLDAS_V20 and GLDAS_V21 are all LSM-based, and the Penman-Monteith (PM) or Priestley-Taylor (PT) equations are involved in the processing of the other 10 ET products. Among these 10 ET products, MOD16A2 (both Collection 6 (C6) and Collection 5 (C5)), PML_V2 and NTSG are mainly based on a specific canopy conductance model, while SEBS, SSEBop and ETWatch are mainly based on the surface energy balance. In particular, GLEAM (both 3.3a and 3.3b) utilizes a soil stress factor in the conversion from potential ET to actual ET, and CR_ET mainly utilizes the complementary relationship (CR) between actual and potential ET. The collected ET products have different spatiotemporal resolutions and temporal ranges, and their specifications are summarized in Table 1.

ET tower observations from 18 EC flux sites were processed monthly as reference data for validating and assessing the ET products described above. Among them, five sites are from AsiaFlux (https://www.asiaflux.net/, accessed on 10 March 2021), five sites are from ChinaFlux (https://www.chinaflux.org/, accessed on 10 March 2021), four sites are from FluxNET (https://fluxnet.fluxdata.org/, accessed on 10 March 2021) and four sites are from the Aerospace Information Research Institute, Chinese Academy of Sciences (AIRCAS). The periods of flux EC data range from 1 year (12 months) to 8 years (96 months), while the processed monthly ET values of all the sites span the period from 2002 to 2017. In total, there are 782 site months. The EC flux sites are distributed across different underlying surfaces, marked by various land cover types using the International Geosphere-Biosphere Programme (IGBP) classification system.
The 18 EC flux sites mainly cover four vegetation types: deciduous needle leaf forest (DNF, one site), mixed forest (MF, two sites), cropland (CRO, six sites) and grassland (GRA, nine sites). The information for each flux site is summarized in Table 2. The collected EC flux data are half-hourly averages in a text format, and the gap-filling method is applied throughout the process [71]. Then, the gap-filled half-hourly averaged latent heat flux is aggregated to obtain the monthly ET. The conversion between latent heat flux and ET is calculated using Equation (1) [72]:

$$ET = \frac{LE}{\lambda} \tag{1}$$

where LE (W·m⁻²) is the latent heat flux, and λ is the latent heat of evaporation. Air temperature is the main factor influencing λ, but variability in air temperature only causes minor changes in λ [73]. In view of the limited influence of air temperature on the ET values estimated from LE [74,75], a constant value of 2.45 MJ·kg⁻¹ is used in Equation (1) [73].
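The conversion in Equation (1), followed by the aggregation to monthly totals, can be sketched as below. This is a minimal illustration rather than the processing chain used in the study: the function names and the example flux values are hypothetical, and it assumes that each gap-filled LE value is a half-hourly mean in W·m⁻² and that 1 kg·m⁻² of water corresponds to a 1 mm layer.

```python
# Sketch: convert gap-filled half-hourly latent heat flux (LE) to monthly ET.
# All names and example values are illustrative assumptions.

LAMBDA_J_PER_KG = 2.45e6       # constant latent heat of evaporation (2.45 MJ/kg)
SECONDS_PER_HALF_HOUR = 1800   # length of one averaging interval


def le_to_et_mm(le_w_m2, seconds=SECONDS_PER_HALF_HOUR):
    """ET (mm) accumulated over one interval from the interval-mean LE (W/m^2).

    LE / lambda is a water flux in kg m^-2 s^-1; multiplying by the interval
    length gives kg m^-2, which is numerically equal to mm of water.
    """
    return le_w_m2 * seconds / LAMBDA_J_PER_KG


def monthly_et_mm(half_hourly_le):
    """Sum the interval ET over all half-hourly LE values of one month."""
    return sum(le_to_et_mm(le) for le in half_hourly_le)


# Hypothetical example: a constant 100 W/m^2 over a 30-day month.
example_le = [100.0] * (48 * 30)
example_et = monthly_et_mm(example_le)
print(round(example_et, 1))
```

A constant LE of 100 W·m⁻² sustained over 30 days corresponds to roughly 106 mm of monthly ET under these assumptions, which gives a feel for the magnitudes involved.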

Auxiliary Data
The main auxiliary data involved in this study are digital elevation model (DEM), aridity index (AI) and gridded precipitation datasets. DEM data are extracted from the 1 arc-second void-filled Shuttle Radar Topography Mission (SRTM) v3 product provided by NASA JPL, which covers the land surface area between 56° S and 60° N in latitude, accounting for approximately 80% of the total land area of Earth [76]. The AI is defined as the mean annual precipitation divided by the mean annual potential evapotranspiration; the former is extracted from the WorldClim global climate dataset, and the latter is simulated by the Hargreaves equation [77]. The spatial resolution of the AI dataset is 30 arc-seconds (https://cgiarcsi.community/data/global-aridity-and-pet-database, accessed on 10 March 2021). Precipitation data are extracted from the gridded precipitation dataset produced by the China Meteorological Data Service Center (CMDC) [78], which is generated by Thin Plate Spline (TPS) spatial interpolation of the latest precipitation data from 2472 stations at a spatial resolution of 0.5°, spanning from 1961 to the present. DEM and AI data were mainly used for the stratification of flux sites, while precipitation data were used for the water balance assessment.
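As an illustration of how the AI could be used to stratify flux sites into climatic zones, the sketch below computes the index and assigns a class. The thresholds (0.20, 0.50 and 0.65) are the commonly cited UNEP class boundaries and are an assumption here, since the text does not list them; the function names and the example site values are hypothetical.

```python
# Sketch: aridity index and climatic zone assignment for a flux site.
# Thresholds are assumed (commonly cited UNEP boundaries), not taken from the text.


def aridity_index(mean_annual_precip_mm, mean_annual_pet_mm):
    """AI = mean annual precipitation / mean annual potential ET."""
    if mean_annual_pet_mm <= 0:
        raise ValueError("potential ET must be positive")
    return mean_annual_precip_mm / mean_annual_pet_mm


def climate_class(ai):
    """Map an AI value to a climatic zone label."""
    if ai < 0.20:
        return "arid or drier"
    if ai < 0.50:
        return "semiarid"
    if ai < 0.65:
        return "dry subhumid"
    return "humid"


# Hypothetical site: 400 mm precipitation and 1000 mm potential ET per year.
site_ai = aridity_index(400.0, 1000.0)
site_class = climate_class(site_ai)
print(site_ai, site_class)
```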

Assessment Method
Due to the complexity of land-atmosphere interactions and the various mechanisms used to simulate them, ET estimates vary considerably over time and space between the different products, and their performances can be very diverse under different underlying surface conditions, so it is necessary to use ground observation data for a comprehensive evaluation [62,76,79,80]. A set of indicators was applied in the comprehensive assessment of the different ET products. Mean error (ME), mean absolute error (MAE) and root mean square error (RMSE) are the most commonly used error measures. ME and MAE, used as bias indicators, represent the average error and the average absolute error between the ET product values and the tower observed values, respectively, while RMSE represents the sample standard deviation of the difference between the ET product values and the observed values, reflecting the accuracy of ET products. The ME results indicate whether overestimation or underestimation occurs, whereas the MAE results avoid the mutual cancelation of errors and therefore reflect the actual magnitude of the error. As mentioned in Rim's article [81], RMSE is more sensitive to outliers than MAE, because the contribution of a single error to RMSE increases quadratically, whereas MAE is a more direct measurement that involves no exponentiation; however, RMSE is generally more amenable to in-depth statistical analyses of error than MAE. Despite these differences, ME, MAE and RMSE all measure the average difference, and each index is suitable for use. Because of the variability of ET measurements, it is difficult to evaluate the accuracy of ET products using only direct error measures such as ME, MAE and RMSE; therefore, their relative values, namely the relative mean error (RME), relative mean absolute error (RMAE) and relative root mean square error (RRMSE), are reported alongside them [80,82,83].
Moreover, the Pearson correlation coefficient (R) is used as a statistical indicator to measure the strength of the relationship between the different ET products and observed ET values [58], and Willmott's index of agreement (d) is used to describe how well the model-calculated results simulate the observed data [84]. The indicators are calculated using Equations (2)-(9), where n is the number of records; i denotes the i-th record; $\bar{X}$ is the mean value of the observed ET dataset and $\bar{Y}$ is the mean ET of a given ET product. The aridity index, elevation and ET values from the different product datasets are extracted as data records at the position of each flux tower. To carry out a comprehensive assessment, the accuracy of the different ET products is evaluated using the eight indicators mentioned above from the perspectives of climatic zones, land cover types and elevation levels with the following stratifications (Table 3). The aridity index (AI) is classified into three classes (dry subhumid, humid and semiarid) using the United Nations Environment Programme definitions [85] to describe the different climatic zones. The IGBP land cover type of every flux tower position is aggregated into three types: forest, grassland and cropland. The elevation is aggregated into three levels: low, medium and high. The performances of the ET products determined using different indicators under different conditions are displayed in the assessment sheet, which is described in detail in Section 4.2.
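A compact sketch of the eight indicators is given below. ME, MAE, RMSE, R and Willmott's d follow their standard definitions; the relative indicators (RME, RMAE, RRMSE) are expressed here as percentages of the observed mean, which is an assumption about the normalization and may differ from the exact form of Equations (2)-(9). The function name and sample values are hypothetical.

```python
import math

# Sketch: the eight assessment indicators for one ET product at one stratum.
# Relative indicators are normalized by the observed mean (an assumption).


def assessment_indicators(obs, est):
    """Return ME, RME, MAE, RMAE, RMSE, RRMSE, R and d for paired records."""
    n = len(obs)
    if n == 0 or n != len(est):
        raise ValueError("obs and est must be non-empty and of equal length")
    x_bar = sum(obs) / n   # mean observed EC ET
    y_bar = sum(est) / n   # mean product ET
    diffs = [y - x for x, y in zip(obs, est)]

    me = sum(diffs) / n
    mae = sum(abs(e) for e in diffs) / n
    rmse = math.sqrt(sum(e * e for e in diffs) / n)

    # Pearson correlation coefficient
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(obs, est))
    sxx = sum((x - x_bar) ** 2 for x in obs)
    syy = sum((y - y_bar) ** 2 for y in est)
    r = sxy / math.sqrt(sxx * syy)

    # Willmott's index of agreement
    potential = sum((abs(y - x_bar) + abs(x - x_bar)) ** 2
                    for x, y in zip(obs, est))
    d = 1.0 - sum(e * e for e in diffs) / potential

    return {
        "ME": me, "RME": 100.0 * me / x_bar,
        "MAE": mae, "RMAE": 100.0 * mae / x_bar,
        "RMSE": rmse, "RRMSE": 100.0 * rmse / x_bar,
        "R": r, "d": d,
    }


# Hypothetical monthly ET values (mm) for one site.
observed = [10.0, 20.0, 30.0, 40.0]
product = [12.0, 18.0, 33.0, 37.0]
result = assessment_indicators(observed, product)
print(result)
```

In this toy case the positive and negative errors cancel in ME while MAE stays at 2.5 mm, which is the cancelation effect described in the text.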

Synthesis and Validation Method
Through a series of comprehensive assessments, the highly ranked ET products can be selected according to the assessment results. To harmonize the spatiotemporal resolution of the selected ET datasets, the ET datasets with a temporal resolution finer than one month were all aggregated to a monthly time step. Furthermore, ET datasets with a spatial resolution finer than 1 km were resampled to 1 km using pixel averaging, while coarser ET datasets were resampled to 1 km using the nearest-neighbor method. For the synthesis of the selected datasets, the weighted mean strategy was applied. The weight of each selected ET dataset is determined by its calculated Taylor skill score (TSS) [86]. The weights of the individual ET products are proportional to their TSS values and sum to 1 [65]. The calculation of the TSS and weights for the different ET datasets and the TSS-based synthesis method are expressed as Equations (10)-(12), where TSS_i is the Taylor skill score for ET product i; n is the number of ET products; R_i is the Pearson correlation coefficient between ET product i and the in situ measured EC ET; R_max represents the maximum correlation coefficient and is set to 1 in this research; δ_i is the ratio of the standard deviation of ET product i to that of the in situ measured EC ET; w_i is the TSS-based weight for ET product i; ET_i is the i-th ET product; and ET_syn is the synthesis result, i.e., the weighted mean of the ET products. The TSS values range from 0 to 1, where 0 and 1 indicate the least skillful and most skillful datasets, respectively. For the validation of the synthesized results, EC ET was used with the assessment sheet proposed in Section 3.1, while the gridded precipitation was used in the water balance assessment of the synthesized results (Section 4.4).
The overall methodology of this study is summarized in the following flow chart (Figure 2).

Figure 3 shows the multiyear monthly average of each ET product under different conditions as well as the observed flux EC ET. The EC ET shows obvious seasonal changes. For the average of all sites (Figure 3A), the ET values from April to September are relatively high (higher than 40 mm), and the peak value is 91.36 mm in July. Nearly all of the ET products captured the seasonal changes. FLDAS, GLDAS_v20, GLDAS_v21, MOD16A2_C6 and NTSG reported maximum values in August ranging from 66.10 mm (MOD16A2_C6) to 108.89 mm (FLDAS), while the other ET products had maximum values in July ranging from 63.95 mm (SEBS) to 105.68 mm (ETWatch). Winter wheat is widely cultivated throughout the plain area of Northern China and is harvested in June. Then, corn is sown following winter wheat. Therefore, the ET in June in these areas is typically smaller than that in the adjacent two months, which is clearly captured in Figure 3D,H. From Figure 3B-J, it is apparent that MOD16A2_C6 generally reported the smallest values from June to July, excluding sites under the conditions of grassland, dry subhumid regions and elevations greater than 1500 m. In contrast, the SSEBop dataset significantly overestimated ET in cropland areas from May to August, while NTSG also had large deviations from EC ET in the dry subhumid regions and regions with elevations greater than 1500 m from June to September.

Figure 5 plots the monthly assessment indicators of the monthly ET products against the monthly EC ET, considering all records from all sites. The monthly patterns of R and d are similar for every ET product and show that, in general, all products performed relatively better from April to May and from September to October.
Interestingly, SSEBop showed a better performance from November to May in terms of R and d, a pattern that is quite distinct from those of the other ET products. ME and RME show that all ET products overestimate or underestimate ET in different months to varying degrees. For instance, MOD16A2_C5 had a large negative ME from May to September but had the largest positive RME in January and December, meaning that MOD16A2_C5 had the largest positive relative deviation from EC ET in January and December in comparison with the other ET products. The monthly patterns of MAE and RMSE are similar to each other. From the perspectives of MAE and RMSE, MOD16A2_C6, MOD16A2_C5, FLDAS_ET, GLDAS_v20_ET, GLDAS_v21_ET, SEBS and SSEBop did not perform well compared to the other ET products from May to August. The largest MAE and RMSE values were from MOD16A2_C6 and SSEBop. RMAE and RRMSE also shared nearly the same pattern, but this pattern was quite different from that of MAE and RMSE. The largest RMAE and RRMSE values were reported by MOD16A2_C5 in January and December, which results from the large overestimation in these two months, as confirmed by ME and RME.

Figure 6 shows the assessment results by land cover type. According to ME and RME, only MOD16A2_C6, CR_ET and ETWatch underestimated ET over forest areas; all ET products except NTSG underestimated ET over grassland areas; and, for cropland areas, only SSEBop overestimated ET. The other ET products reported the opposite trends for the three land cover types. Over forest areas, NTSG and SSEBop had the most accurate estimates of ET, with R values greater than 0.93 and the smallest MAE (RMAE) and RMSE (RRMSE) values. PML_v2 reported the highest d value over grassland of 0.94, while PML_v2, GLEAM 3.3a and GLEAM 3.3b reported the highest R value. In addition, the aforementioned three ET products also had the best MAE (RMAE) and RMSE (RRMSE) values.
For cropland areas, the best R (d) values were from the ETWatch dataset at 0.90 (0.94). Low MAE and RMSE values were also reported by ETWatch, NTSG, GLEAM 3.3b and PML_v2, for which all MAE values were lower than 20 mm/month and all RMSE values were lower than 25 mm/month.
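The TSS-based weighting and synthesis described above can be sketched as follows. The skill score uses one of the forms given by Taylor (2001), S = 4(1 + R) / [(δ + 1/δ)²(1 + R_max)], which is an assumption about the exact form of Equation (10); the weights proportional to TSS and the weighted mean follow the description in the text. Product names, statistics and ET values are hypothetical.

```python
# Sketch: Taylor skill score, TSS-proportional weights and weighted-mean synthesis.
# The TSS form below is one of Taylor's (2001) variants and is assumed here.


def taylor_skill_score(r, sigma_ratio, r_max=1.0):
    """TSS from correlation r and standard deviation ratio (product / observed)."""
    return 4.0 * (1.0 + r) / ((sigma_ratio + 1.0 / sigma_ratio) ** 2 * (1.0 + r_max))


def tss_weights(scores):
    """Weights proportional to TSS that sum to 1."""
    total = sum(scores.values())
    return {name: score / total for name, score in scores.items()}


def synthesize(et_values, weights):
    """Weighted mean of the product ET values for one pixel and month."""
    return sum(weights[name] * et_values[name] for name in weights)


# Hypothetical (R, sigma ratio) pairs for three products.
stats = {"product_a": (0.95, 1.0), "product_b": (0.90, 0.8), "product_c": (0.85, 1.2)}
scores = {name: taylor_skill_score(r, ratio) for name, (r, ratio) in stats.items()}
weights = tss_weights(scores)

# Hypothetical monthly ET (mm) of the three products at one pixel.
et_syn = synthesize({"product_a": 80.0, "product_b": 70.0, "product_c": 90.0}, weights)
print(weights, et_syn)
```

With this form, a product that matches the observed variance (δ = 1) and correlates perfectly (R = R_max = 1) scores exactly 1, and the score falls as either the correlation drops or the variance ratio departs from 1.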

Overall Assessment Result
In general, the different ET datasets are evaluated based on a constructed two-level assessment sheet, in which the ET products are compared with each other under different conditions. The sheet has eight assessment indicators as rows (ME, RME, MAE, RMAE, RMSE, RRMSE, R and d) and 21 assessment conditions as columns, including the 9 aforementioned conditions (land cover types: l01-l03; climatic zones: c01-c03; elevation levels: e01-e03) and the individual interannual months (m01-m12). There are, in total, 168 cells in the assessment sheet. Every cell represents a comparison among all ET products under the column-specified condition according to the row-specified assessment indicator and is labeled with the name of the ET product that has the best performance. To determine the performance of each ET product, lower MAE, RMAE, RMSE and RRMSE values and higher R and d values are preferred. Level-1 assessment refers to the selection of the ET product with the highest R and d values and the lowest ME, RME, MAE, RMAE, RMSE and RRMSE values, while level-2 assessment aims to select the ET product with the second highest R and d values and the second lowest ME, RME, MAE, RMAE, RMSE and RRMSE values. It is noteworthy that the absolute values of ME and RME are used during the assessment. Next, the occurrences of the different ET products under each condition (m01-m12, l01-l03, c01-c03 and e01-e03) in the assessment sheet are counted. The number of occurrences and the corresponding proportion are listed for every ET product, and the ET products are arranged in descending order accordingly. Figure 9 presents the two-level assessment sheet covering all ET products, while Figure 10 presents the comprehensive assessment results.
It can be concluded from Figure 10 that PML_V2, ETWatch, GLEAM 3.3a, CR_ET and SSEBop are the five most highly ranked ET products within the level-1 assessment, while PML_V2, GLEAM 3.3a, ETWatch, GLEAM 3.3b and NTSG are ranked in the top five in the level-2 assessment. All of the top five products in the two-level assessment occupy more than 10% of the cells in the assessment sheet. Considering the two-level assessment together, PML_V2, ETWatch, GLEAM 3.3a, CR_ET and SSEBop are the five best-performing ET products. Together with Figure 10
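The construction of the two-level assessment sheet and the occurrence statistics can be sketched as below. The nested dictionary layout (condition, then indicator, then product) and the toy values are hypothetical; the ranking rules follow the text: higher R and d are better, lower values are better for the other indicators, and ME and RME are compared by absolute value.

```python
from collections import Counter

# Sketch: fill a two-level assessment sheet and count product occurrences.
# The data layout and values are illustrative assumptions.

HIGHER_IS_BETTER = {"R", "d"}
USE_ABSOLUTE = {"ME", "RME"}


def rank_products(indicator, values):
    """Order product names from best to worst for one sheet cell."""
    def sort_key(name):
        value = values[name]
        if indicator in USE_ABSOLUTE:
            value = abs(value)
        return -value if indicator in HIGHER_IS_BETTER else value
    return sorted(values, key=sort_key)


def build_assessment_sheet(scores):
    """Count how often each product is best (level 1) or second best (level 2)."""
    level1, level2 = Counter(), Counter()
    for by_indicator in scores.values():
        for indicator, values in by_indicator.items():
            ranked = rank_products(indicator, values)
            level1[ranked[0]] += 1
            if len(ranked) > 1:
                level2[ranked[1]] += 1
    return level1, level2


def proportions(counter):
    """Share of cells won by each product, in descending order of count."""
    total = sum(counter.values())
    return {name: count / total for name, count in counter.most_common()}


# Toy sheet with three cells instead of the full 168.
toy_scores = {
    "l01": {"R": {"A": 0.9, "B": 0.8, "C": 0.7},
            "ME": {"A": -5.0, "B": 1.0, "C": 3.0}},
    "m07": {"RMSE": {"A": 12.0, "B": 20.0, "C": 15.0}},
}
level1, level2 = build_assessment_sheet(toy_scores)
print(proportions(level1), proportions(level2))
```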

Synthesis Result
According to the overall assessment result, PML_V2, ETWatch and SSEBop are selected from the top-five ET products for the synthesis dataset from 2003 to 2017, because they have finer spatial resolutions (from 500 m to 1 km) than GLEAM 3.3a and CR_ET. PML_V2 was resampled from 500 m to 1 km using pixel averaging, while ETWatch and PML_V2 were aggregated to a monthly time step from daily and 8-day time steps, respectively. For the synthesis dataset from 1982 to 2002, GLEAM 3.3a, CR_ET and NTSG were selected, because they ranked in the top three among the ET products with a longer temporal range. These three ET products were resampled to 1 km using the nearest-neighbor technique, while GLEAM 3.3a was aggregated to a monthly time step. The Taylor skill scores of the three selected ET products were calculated with 50% of the in situ measured site-month data records (randomly selected for calibration) and were used to determine their weights for synthesis (Figure 11).

It can be seen from Figure 12 that the monthly average trend of the synthesized ET under all conditions is basically consistent with that of EC ET. Generally, the maximum synthesized ET values appear in July, except over forests and humid regions, while the harvesting of winter wheat and sowing of summer maize are well captured from May to July (Figure 12D,H). From the perspective of the average ET from 1982 to 2017 (Figure 13E), the spatial distribution of the synthesized ET of Northern China has obvious regional characteristics and is similar to the "northwest-southeast" belt-like distribution of dry and wet zones divided by multiyear precipitation, where the low-value area (<100 mm) is mainly concentrated in the arid area of the northwest regions, and the high-value area (>500 mm) is mainly concentrated in the eastern monsoon climate regions.
Based on the changes in the decadal average ET values (Figure 13A-D), the spatial distribution of the synthesized ET of Northern China is quite consistent over time: the low-value areas (<100 mm) exhibit only very minor changes, and the high-value areas (>500 mm) expand to some degree, especially in the plain area of Northern China and Northeast China.
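The two resampling strategies mentioned in this section (pixel averaging for finer grids and nearest neighbor for coarser grids) can be illustrated on small nested lists. This sketch assumes perfectly aligned grids and an integer resolution ratio and ignores map projection and no-data handling, so it is only a conceptual stand-in for the geospatial tools that would be used in practice. The function names and grid values are hypothetical.

```python
# Sketch: block averaging (fine -> coarse) and nearest-neighbor refinement
# (coarse -> fine) for aligned grids with an integer resolution ratio.


def block_average(grid, factor):
    """Average each factor x factor block, e.g. 500 m -> 1 km with factor 2."""
    rows, cols = len(grid), len(grid[0])
    if rows % factor or cols % factor:
        raise ValueError("grid size must be a multiple of the factor")
    coarse = []
    for r in range(0, rows, factor):
        coarse_row = []
        for c in range(0, cols, factor):
            block = [grid[r + i][c + j]
                     for i in range(factor) for j in range(factor)]
            coarse_row.append(sum(block) / len(block))
        coarse.append(coarse_row)
    return coarse


def nearest_neighbor_refine(grid, factor):
    """Repeat every coarse cell factor x factor times on the finer grid."""
    fine = []
    for row in grid:
        fine_row = [value for value in row for _ in range(factor)]
        fine.extend(list(fine_row) for _ in range(factor))
    return fine


# Hypothetical 4 x 4 fine grid of monthly ET (mm).
fine_grid = [[1.0, 3.0, 5.0, 7.0],
             [1.0, 3.0, 5.0, 7.0],
             [2.0, 2.0, 6.0, 6.0],
             [2.0, 2.0, 6.0, 6.0]]
coarse_grid = block_average(fine_grid, 2)
refined_grid = nearest_neighbor_refine([[1.0, 2.0]], 2)
print(coarse_grid, refined_grid)
```

Nearest-neighbor refinement leaves the coarse values unchanged and only replicates them, so it adds no spatial detail below the native resolution of the coarse product.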

Assessment of the Synthesized ET Dataset
The assessment was carried out using the other 50% of the in situ measured EC ET monthly data records as the validation dataset. Figure 14 shows that the synthesized ET agreed well with the observed EC ET data over the interannual months and under the conditions of different land cover types, climatic zones and elevation levels. R was greater than 0.90, while d was greater than 0.95, excluding that of grassland regions, dry subhumid regions and regions with elevation levels >1500 m. Based on ME (RME), the synthesized ET underestimated the flux EC ET under all conditions to varying degrees with no exceptions. The synthesized ET performed best over forested regions, humid regions and regions with elevation levels of <500 m among the different land cover types, climatic zones and elevation levels, respectively, according to the MAE (RMAE) and RMSE (RRMSE) values. The assessment sheet method was applied here for comparison among the component ET products (PML_V2_ET, ETWatch and SSEBop) and the synthesized results from 2003 to 2017. Figure 15 presents a summary of the assessment sheet. It is obvious that the synthesis result occupied most of the cells in the two-level assessment sheet, showing comprehensive advantages over all other component ET products.
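The random 50/50 division of site-month records into a calibration half (used for the TSS weights) and a validation half (used here) can be sketched as follows. The record layout, the function name and the fixed seed are hypothetical; the study does not state how the random selection was implemented.

```python
import random

# Sketch: random split of site-month records into calibration and validation halves.
# The seed and the record layout are illustrative assumptions.


def split_site_months(records, calibration_fraction=0.5, seed=0):
    """Shuffle the records reproducibly and split them into two subsets."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_calibration = round(len(shuffled) * calibration_fraction)
    return shuffled[:n_calibration], shuffled[n_calibration:]


# Hypothetical records: (site identifier, month) pairs for three sites.
records = [("site_%02d" % site, month)
           for site in range(1, 4) for month in range(1, 13)]
calibration, validation = split_site_months(records)
print(len(calibration), len(validation))
```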
The Heihe River Basin, a closed inland river basin without outflows in the arid region of northwestern China, is chosen here to conduct a water balance assessment comparing the component ET products (GLEAM v3.3a, NTSG and CR_ET) with the synthesized results from 1982 to 2002. As mentioned above, the annual mean ET can be approximately regarded as equal to the mean annual precipitation, because there is no runoff exchange between the closed Heihe River Basin and adjacent regions. Although this approach has limitations because it does not consider factors such as soil moisture and irrigation, it is still utilized in some closed river basins [49]. Table 4 presents a summary of the results of the water balance assessment. The annual precipitation ranges from 113.2 to 172.9 mm over the period from 1982 to 2002, and the mean annual precipitation is 142.7 mm. The mean annual synthesis ET from 1982 to 2002 is 136.3 mm, with a bias of −4.0% relative to the mean annual precipitation. As for the component ET products, NTSG, CR_ET and GLEAM v3.3a have biases of 6.5%, 8.3% and −25.3%, respectively. It is evident that the synthesis ET outperformed the three component ET products.
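The water balance check for a closed basin reduces to comparing the multiyear mean ET with the multiyear mean precipitation. A minimal sketch is given below; the bias is expressed as a percentage of mean annual precipitation, which is an assumption about how the bias in Table 4 is defined, and the annual values are made up for illustration rather than taken from the study.

```python
# Sketch: water balance bias of an ET product in a closed basin, assuming
# long-term mean ET should approximately equal long-term mean precipitation.
# The annual values below are illustrative, not data from the study.


def mean(values):
    return sum(values) / len(values)


def water_balance_bias_percent(annual_et_mm, annual_precip_mm):
    """Relative bias (%) of mean annual ET against mean annual precipitation."""
    mean_precip = mean(annual_precip_mm)
    return 100.0 * (mean(annual_et_mm) - mean_precip) / mean_precip


annual_precip = [120.0, 150.0, 180.0]   # hypothetical basin-mean precipitation
annual_et = [118.0, 140.0, 165.0]       # hypothetical basin-mean ET
bias = water_balance_bias_percent(annual_et, annual_precip)
print(round(bias, 1))
```

A negative value means the product returns less water to the atmosphere than the basin receives on average, which in a basin with no outflow points to underestimated ET or to unaccounted storage changes.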

Discussion
As described in the introduction, acquiring the spatial and temporal characteristics of land surface ET is of vital significance to addressing water resource security and food security issues in Northern China. There have been many global and regional ET products covering Northern China developed using different mechanisms with remote sensingbased models, land surface models (LSMs) and hydrological models with differing model inputs, parameterizations and algorithms and varied spatiotemporal resolutions and temporal ranges. To date, no single ET product can provide relatively accurate longterm ET estimations at a fine spatial resolution. Therefore, this study utilized a two-level assessment scheme to select the best-performing ET products with relatively high spatial resolutions; we presented a synthesized ET dataset with a 1 km spatial resolution covering the period ranging from 1982 to 2017 with the weighted mean of the selected ET products, which can be used directly in relevant studies for Northern China.
When referring to the assessment, the in situ measured ET from EC tower observations was used to carry out the assessment at a site-pixel scale for every selected ET product and aimed to select the optimal ET products with which to build a long-term synthesized ET dataset. Then, a two-level assessment sheet with eight indicators under 21 conditions was built, and statistics describing the different ET product occurrences and corresponding ratios were calculated to select high-resolution ET products with optimal performances. Regarding the overall assessment result, only NTSG overestimated EC ET, while other ET products all underestimated EC ET. For NTSG, the generally overestimated net radiation was the main causes of uncertainties in the estimation of ET [7]. Several ET products had relatively larger deviation to EC ET. The severe underestimation of MOD16 ET products was found across all types of the underlying surface except cropland [87], the reason of which might be attributed to structural characteristics of underlying surfaces and surface conductance parameterizations. The LSM based ET products also showed relative larger deviation to EC ET mainly due to the reanalysis driving data, which was reported to have lower accuracy than satellite driven products [88]. In addition, the uncertainties of SEBS was caused by the situation of the heat transfer simulation within the roughness sublayer (RSL), which often occurs over heterogeneous underlying surfaces like forest [89].
Observations from heterogeneously distributed flux EC sites are a preferable reference for obtaining accurate ET estimations regionally. Even though are limited flux sites in Northern China, this sort of exercise can still contribute to the knowledge of different ET product performances. Though there is still a certain deficiency of in situ EC ET measurements, such as the limited spatial representativeness and the issue of energy balance closure [90][91][92], it is still the most common way to describe the flux exchange occurring between the surface and atmosphere [93] and is still implemented in various relevant studies for the assessment of ET estimations at a site-pixel scale [10,62,68].
After the selection of ET products, the Taylor skill score (TSS) of each ET product was calculated to determine their relative weights. Then, the weighted mean was treated as the synthesis result. Recently, Elnashar et al. [94,95] generated a global synthesis ET product using a simple mean method with an assessment scheme that considered six indicators under 26 conditions. The simple mean of the selected ET products in this research and the aforementioned global synthesis ET product, together with the TSS-based weighted mean results, were assessed with the assessment sheet method using the validation dataset (from 2003 to 2017) and water balance method in Heihe River Basin (from 1982 to 2002), as mentioned in Section 4.4, and the results are as follows ( Figure 16 and Table 5). Figure 16. Count of cells and corresponding percentages of the simple mean of the selected three ET products and the global synthesis ET product generated by Elnashar et al. [94,95]. The weighted mean results in this research with level-1 and level-2 assessment and the total count of cells and their corresponding percentages are shown here. From Figure 16, it can be inferred that the simple mean and weighted mean of the three selected ET products generally outperformed the global synthesis ET product from 2003 to 2017. The weighted mean result had the best performance in the total count (percentage) of cells in the assessment sheet, but the simple mean had a better performance as determined by the level-2 assessment, indicating that the simple mean method had some advantages. The weighted mean method utilized TSS-based weights to ensure more contributions from the better-performing ET products to the synthesis result, which makes it a better choice than the simple mean method. From Table 5, the simple mean slightly outperformed the weighted mean by 0.5% of the bias and substantially outperformed the global synthesis ET product from 1982 to 2017. 
The performance of the simple mean and weighted mean are very close, because the TSS-based weights of the three selected ET products are very close to each other. In reference to the global synthesis ET product, the component ET products of the synthesis result were selected by global assessment with worldwide flux observation datasets, while the assessment process in this research was implemented locally, which reflected the local performance of each ET product in Northern China.
The most notable contribution of this study is that it was able to prospectively produce a relatively long-term synthesized ET dataset for Northern China with a fine spatial resolution and relatively low uncertainties; additionally, the synthesized ET dataset performed well against the flux EC ET measurements and using water balance method for assessment. This synthesized ET dataset for Northern China provided ET estimations for tested underlying surfaces. Therefore, this synthesized ET dataset has the potential to support regional studies in Northern China over longer temporal periods.
Since different ET products have varying data availabilities, the synthesis ET dataset was built with six individual ET products. Furthermore, the component ET datasets before 2003 have coarse native spatial resolutions and were resampled to 1 km with the nearest-neighbor technique to ensure consistency with the synthesis dataset from 2003 to 2017. Further study therefore needs to focus on harmonizing the synthesized results of the different periods to improve the coherence among pixels. In addition, no flux observation datasets are accessible before 2003, so the method proposed in this research would be inapplicable if the synthesis products were required to cover a longer period. In summary, further study on the synthesis method first needs to consider how to combine ET products with coarse and fine spatial resolutions. Then, assessment methods for ET products that do not require in situ EC ET measurements should be considered, such as the triple collocation (TC) method [96] or the extended triple collocation (ETC) method [97]. Such methods can estimate the uncertainty of each dataset, and the weights of the datasets for synthesis can then be set inversely proportional to the uncertainty value. The possibilities of applying TC or ETC methods to synthesize ET products with higher spatial resolutions should be further explored.
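As an illustration of the TC-based weighting suggested above, the following sketch estimates the random error variance of three collocated ET series with the covariance form of triple collocation and converts the result into weights. It is a hedged example under the standard TC assumptions (the errors of the three products are mutually uncorrelated and uncorrelated with the true signal). The function names are hypothetical, and the use of the error variance as the uncertainty measure is an assumption; neither is part of the method applied in this research.

```python
import numpy as np

def tc_error_variances(x, y, z):
    """Covariance-based triple collocation error variances of x, y and z."""
    c = np.cov(np.vstack([x, y, z]))  # 3 x 3 sample covariance matrix
    var_x = c[0, 0] - c[0, 1] * c[0, 2] / c[1, 2]
    var_y = c[1, 1] - c[0, 1] * c[1, 2] / c[0, 2]
    var_z = c[2, 2] - c[0, 2] * c[1, 2] / c[0, 1]
    return np.array([var_x, var_y, var_z])

def inverse_variance_weights(error_variances):
    """Weights inversely proportional to the estimated error variances."""
    ev = np.asarray(error_variances, dtype=float)
    # Sampling noise can yield non-positive estimates, which cannot be inverted.
    if np.any(ev <= 0.0):
        raise ValueError("non-positive error variance estimate")
    inv = 1.0 / ev
    return inv / inv.sum()
```

Because no reference observations enter the calculation, such a scheme could in principle be applied to periods without flux towers, provided that three sufficiently independent ET products are available for the same pixels and months.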

Conclusions
This study provided an effective method for assessing ET products with in situ measured flux EC ET across different months and various underlying surface types to develop a long-term synthesis ET dataset for Northern China. The comprehensive assessment was carried out under different conditions, namely, over interannual months and across different land cover types, climatic zones and elevation levels, with each of the latter three conditions stratified into three classes. With the implementation of eight assessment indicators (R, d, ME, RME, MAE, RMAE, RMSE, RRMSE), the ET products that performed best were selected to generate the synthesis ET dataset over different time ranges. It was demonstrated that the PML_v2, ETWatch, GLEAM 3.3a, CR_ET and SSEBop ET products showed the best overall performance in the assessments when their spatial resolutions and temporal ranges were also considered. No single ET product is likely to perform best under all assessment indicators in all conditions, so this study built the synthesis ET dataset from 2003 to 2017 with the weighted mean of the selected high-resolution PML_v2, ETWatch and SSEBop datasets, and the synthesis ET dataset from 1982 to 2002 with the weighted mean of the selected GLEAM 3.3a, CR_ET and NTSG datasets, which had a higher agreement with and a lower deviation from the in situ measured flux EC ET under all conditions. The weights were determined from the Taylor skill score calculated with the ET products and the flux EC ET. Moreover, the ensemble ET estimations from 2003 to 2017 over all types of underlying surfaces performed well when compared with the in situ measured flux EC ET, while the ensemble ET estimations from 1982 to 2002 performed well when assessed with the water balance method in Heihe River Basin.
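For reference, the eight assessment indicators can be computed along the following lines. This is a sketch under stated assumptions: the abbreviations are read in their usual sense (R as the Pearson correlation coefficient, d as Willmott's index of agreement, ME, MAE and RMSE as the mean error, mean absolute error and root mean square error), and the relative indicators (RME, RMAE, RRMSE) are assumed to be normalized by the mean of the observations. The function name is hypothetical and the exact definitions used in the assessment may differ.

```python
import numpy as np

def assessment_indicators(product, observed):
    """Eight indicators comparing an ET product with EC ET observations."""
    p = np.asarray(product, dtype=float)
    o = np.asarray(observed, dtype=float)
    diff = p - o
    o_mean = o.mean()
    me = diff.mean()                     # mean error (bias)
    mae = np.abs(diff).mean()            # mean absolute error
    rmse = np.sqrt((diff ** 2).mean())   # root mean square error
    # Willmott index of agreement (assumed meaning of "d")
    d = 1.0 - (diff ** 2).sum() / (
        (np.abs(p - o_mean) + np.abs(o - o_mean)) ** 2
    ).sum()
    return {
        "R": np.corrcoef(p, o)[0, 1],
        "d": d,
        "ME": me,
        "RME": me / o_mean,      # relative to the observed mean (assumption)
        "MAE": mae,
        "RMAE": mae / o_mean,
        "RMSE": rmse,
        "RRMSE": rmse / o_mean,
    }
```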
The assessment results over interannual months and across every land cover type, climatic zone and elevation level provided information on the performance of the ET products under the different conditions present in Northern China. Overall, the ET synthesis product proposed in this study improved the accuracy of ET estimations in Northern China to some extent, which can benefit relevant hydrological, ecological and agricultural research.