Comparative Analysis of Two Machine Learning Algorithms in Predicting Site-Level Net Ecosystem Exchange in Major Biomes

The net ecosystem CO2 exchange (NEE) is a critical parameter for quantifying the contribution of terrestrial ecosystems to ongoing climate change. The accumulation of ecological data calls for more advanced quantitative approaches to assist NEE prediction. In this study, we applied two widely used machine learning algorithms, Random Forest (RF) and Extreme Gradient Boosting (XGBoost), to build models simulating NEE in major biomes based on the FLUXNET dataset. Both models predicted NEE accurately in all biomes, while XGBoost had much higher computational efficiency (6~62 times faster than RF). Among the environmental variables, net solar radiation, soil water content, and soil temperature were the most important, while precipitation and wind speed were the least important, in simulating temporal variations of site-level NEE in both models. The two models performed consistently with each other under extreme climate conditions, but performance varied by condition and biome. Extreme heat and dryness led to much worse model performance in grassland (extreme heat: R2 = 0.66~0.71 vs. normal: R2 = 0.78~0.81; extreme dryness: R2 = 0.14~0.30 vs. normal: R2 = 0.54~0.55), but the impact on forest was smaller (extreme heat: R2 = 0.50~0.78 vs. normal: R2 = 0.59~0.87; extreme dryness: R2 = 0.86~0.90 vs. normal: R2 = 0.81~0.85). Extreme wet conditions did not change model performance in forest ecosystems (R2 changing by −0.03~0.03 compared with normal) but led to a substantial reduction in cropland (R2 decreasing by 0.20~0.27 compared with normal). Extreme cold conditions caused little change in model performance in forest and woody savannas (R2 decreasing by 0.01~0.08 and 0.09 compared with normal, respectively). Our study showed that both models need training samples of >2.5 years at daily timesteps to reach good model performance and >5.4 years of daily samples to reach optimal model performance.
In summary, both RF and XGBoost are applicable machine learning algorithms for predicting ecosystem NEE, with XGBoost preferable to RF in terms of both accuracy and efficiency.


Introduction
The biosphere acts as an important regulator of the global climate system through land-atmosphere exchange of greenhouse gases [1][2][3], of which the net ecosystem CO2 exchange (NEE) is arguably one of the most critical components [4]. The NEE has been

Data Sources
The data used in this study were collected from the FLUXNET dataset [6], derived from the FLUXNET2015 database (https://fluxnet.org/ accessed on 1 September 2020). The FLUXNET dataset includes data from multiple flux networks, including ICOS, AmeriFlux, NEON, AsiaFlux, ChinaFLUX, and TERN-OzFlux. All variables within the dataset underwent quality control with a code package called ONEFlux (Open Network-Enabled Flux processing pipeline, available at https://github.com/AmeriFlux/ONEFlux/ accessed on 5 February 2021). This study selected the SUBSET data product from FLUXNET2015, which contains micrometeorological, energy, and NEE data at hourly, daily, and yearly time steps. Ten micrometeorological variables were used for the NEE prediction (details in Table 1). The daily data at all FLUXNET2015 (Tier 1) sites were screened, and if any of the 11 variables (10 micrometeorological features and NEE) required for modeling had a missing value, the data point was removed. Finally, we obtained 69 sites for this analysis, including 5 sites for evergreen broadleaf forest (EBF), 12 sites for deciduous broadleaf forest (DBF), 17 sites for evergreen needleleaf forest (ENF), 4 sites for mixed forest (MF), 15 sites for grassland (GRA), 5 for croplands (CRO), 2 sites for open shrublands (OSH), 1 site for closed shrublands (CSH), 4 sites for savannas (SAV), and 4 sites for woody savannas (WSA) (Table 2).

Random Forest
Random Forest is an ensemble approach for classification and regression that uses multiple decision trees as base estimators [44]. The decision trees are generated by randomly selecting samples and features using the bagging ensemble method. Since the training sets for the individual decision trees are obtained by bootstrap sampling and the features are selected randomly, the variance of the model prediction is reduced.
At the same time, the random selection of features weakens the correlation between decision trees, making RF less prone to overfitting. Given the ensemble method, RF has higher accuracy than individual decision trees [45]. However, a large number of decision trees leads to longer training times and lower efficiency. In this study, the RF model was built with the Scikit-learn (version 0.23.2) package, available from https://github.com/scikit-learn/scikit-learn accessed on 10 September 2020.
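As a minimal sketch of the setup described above (with synthetic stand-in data, since the FLUXNET features are not reproduced here, and illustrative hyper-parameter values), fitting an RF regressor in Scikit-learn might look like:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the 10 micrometeorological features and an
# NEE-like target; the real study uses FLUXNET daily data.
X, y = make_regression(n_samples=500, n_features=10, noise=0.5,
                       random_state=420)

# Bagging (bootstrap=True) plus random feature selection at each split
# ("max_features") are what reduce variance and decorrelate the trees.
rf = RandomForestRegressor(n_estimators=200, max_features="sqrt",
                           bootstrap=True, random_state=420)
rf.fit(X, y)
```

The `max_features="sqrt"` choice is one common way to randomize the per-split feature subset; the study's actual hyper-parameters are in Table S1.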

XGBoost
XGBoost [43] is an optimized version of the Gradient Boosting algorithm [46]. Boosting is a type of ensemble learning that accomplishes the learning task by building and combining multiple base estimators. The base estimator of XGBoost is the regression tree (CART) [47]. XGBoost uses boosting to continuously correct the fit: each tree is grown on the residuals of the previous tree, and the prediction is obtained by weighting the ensemble output of all regression trees. Like the RF algorithm, XGBoost also supports random row and column sampling of the training set to avoid overfitting. However, XGBoost features a few improvements: it (1) supports linear base learners and reduces generalization error by using Newton's method with a second-order Taylor expansion of the loss; (2) reduces the possibility of overfitting through regularization; and (3) combines multithreading and data compression to make the algorithm as efficient as possible. XGBoost is widely used in data science competitions and is considered an advanced estimator with very high performance in both classification and regression. In this study, the XGBoost model was implemented using the XGBoost package (https://github.com/dmlc/xgboost accessed on 12 October 2020) with the Scikit-learn interface.

Biome-level Simulation
The RF and XGBoost models were implemented and compared to estimate daily NEE in 10 different biomes. To achieve high simulation accuracy, the data needed to be pre-processed before further analysis. We merged the SUBSET daily data of EC sites belonging to the same biome to generate hybrid data for each biome. Because the daily data were generated by aggregating half-hourly data, further sample screening was performed based on the quality mark of each feature. The quality mark ranges between 0 and 1, indicating the percentage of observed and good-quality gap-filled half-hourly samples per day. To control the data quality of the training features, we excluded a sample when any feature's quality mark was less than 0.8. After data cleaning, both the RF and XGBoost models were built in Python (version 3.7.5) with packages including Numpy (version 1.17). For the hybrid data of each biome, we used the "train_test_split" method to split it into training samples (70%) and test samples (30%) with a "random_state" value of 420. The training samples were modeled using the "RandomizedSearchCV" method with 5-fold cross-validation and random search to obtain the hyper-parameters for the model. The hyper-parameters used for the two models are listed in Table S1. Taking computational efficiency into account, the "n_iter" parameter in "RandomizedSearchCV", which controls the number of parameter settings tried during model training, was set to 100 and 200 for RF and XGBoost, respectively. After the hyper-parameters are obtained, the best model can be established. The data processing is summarized in Figure 1.
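The split-and-search workflow described above can be sketched as follows (synthetic stand-in data; the parameter grid is illustrative rather than the actual ranges of Table S1, and "n_iter" is reduced for brevity):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split

X, y = make_regression(n_samples=400, n_features=10, noise=0.5,
                       random_state=420)

# 70/30 split with the fixed random_state used in the study.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=420)

# Random search with 5-fold cross-validation over an illustrative grid.
param_dist = {"n_estimators": [50, 100, 200],
              "max_depth": [None, 5, 10]}
search = RandomizedSearchCV(RandomForestRegressor(random_state=420),
                            param_dist, n_iter=5, cv=5, random_state=420)
search.fit(X_tr, y_tr)
best_model = search.best_estimator_
```

The refit `best_estimator_` is then evaluated on the held-out 30% test split.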

Simulations under Extreme Climate Conditions
To analyze the applicability of the models under extreme climate conditions, we evaluated model performance in simulating NEE under extremely hot, cold, wet, or dry years. The procedure was as follows. First, we selected the EC sites with more than 10 years of observational data. For each selected site, we calculated the standard deviations (SD) of the mean annual temperature (MAT) and annual total precipitation (ATP). A year is considered extremely cold if its MAT is more than two SDs below the multi-year mean temperature, and extremely hot if its MAT is more than two SDs above it; extremely dry and extremely wet years were identified in the same way using ATP. Finally, 5 sites with extreme heat years, 7 sites with extreme cold years, 4 sites with extreme wet years, and 3 sites with extreme dry years were selected (details in Table S2). For each extreme situation, we selected sites with different biomes for evaluation; if multiple sites shared the same biome under the same extreme condition, we randomly selected one of them. To maintain a balance between the numbers of extreme and normal samples, we selected the non-extreme year closest to each extreme year to represent the normal year. The samples of the extreme year and the normal year were each split into training samples (70%; the extreme and normal training samples were merged for training) and validation samples (30%; representing extreme and normal test samples) with a "random_state" value of 420. The NEE predictions for extreme and normal samples were then compared using the same modeling approach as the biome-level simulation (Section 2.3.1). Note that, to maintain the integrity of the data, we did not use the quality mark of each feature for sample exclusion as in Section 2.3.1.
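The two-SD criterion above can be sketched as a small helper (the function name and example values are illustrative, not taken from the study's sites):

```python
import numpy as np

def extreme_years(years, annual_values):
    """Split years into extremely-high / extremely-low groups using the
    two-standard-deviation criterion described above."""
    v = np.asarray(annual_values, dtype=float)
    mean, sd = v.mean(), v.std(ddof=1)
    high = [yr for yr, x in zip(years, v) if x > mean + 2 * sd]
    low = [yr for yr, x in zip(years, v) if x < mean - 2 * sd]
    return high, low

# Example: a MAT series (deg C) with one anomalously hot year.
years = list(range(2000, 2012))
mat = [10.0, 10.1, 9.9, 10.0, 10.2, 9.8,
       10.0, 10.1, 9.9, 10.0, 10.2, 14.0]
hot, cold = extreme_years(years, mat)
print(hot, cold)  # → [2011] []
```

The same helper applied to the ATP series yields extremely wet and extremely dry years.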

Sample Size Sufficiency for Model Estimation
To quantify the effect of the number of training samples on site-level NEE prediction, four EC sites (US-MMS, US-Var, DE-Geb, NL-Loo) with different biomes and more than 12 years of observations were selected. Two approaches were used to assess the robustness of the model predictions with different training sample sizes. The first approach uses the last 2 years of data at each site as the test samples and increases the number of training samples by the annual sample size (BAS) to build the model. The annual sample sizes of the evaluated EC sites are given in Table S3.
The other method increases the total number of samples gradually in fixed steps of 100 (BFS). First, we merged the annual samples for each evaluated EC site. The total samples for each evaluation were randomly selected from the merged data, then split into 70% for training and 30% for testing. The coefficient of determination (R2), root mean square error (RMSE), and mean absolute error (MAE) were used to evaluate the prediction results. The three metrics are defined as:

R2 = 1 − Σᵢ(yᵢ − ŷᵢ)² / Σᵢ(yᵢ − ȳ)²

RMSE = √[ (1/n) Σᵢ(yᵢ − ŷᵢ)² ]

MAE = (1/n) Σᵢ |yᵢ − ŷᵢ|

where yᵢ is the ith observed validation sample, ŷᵢ is the predicted value of the ith validation sample, ȳ denotes the mean of the observed values, and n is the number of validation samples. A larger R2 and smaller RMSE and MAE indicate that the predicted values are closer to the observations. Evaluating the models with these three indices provides criteria for hyper-parameter tuning and model performance comparisons.
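These three metrics can be computed directly with Scikit-learn (the observed and predicted values below are illustrative):

```python
import numpy as np
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             r2_score)

y_obs = np.array([1.0, 2.0, 3.0, 4.0])   # observed validation samples
y_pred = np.array([1.1, 1.9, 3.2, 3.8])  # model predictions

r2 = r2_score(y_obs, y_pred)
rmse = np.sqrt(mean_squared_error(y_obs, y_pred))
mae = mean_absolute_error(y_obs, y_pred)
print(round(r2, 3), round(rmse, 3), round(mae, 3))  # → 0.98 0.158 0.15
```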

Bernaola-Galvan Segmentation Algorithm
The Bernaola-Galvan Segmentation Algorithm (BGSA) was adopted to detect mutations in nonlinear, nonstationary sequences [48]. Its principle is to divide a non-stationary time series into segments with different means such that the difference in means between neighboring segments is maximized. Compared with traditional mutation detection methods, the BGSA has shown better performance in analyzing climate evolution [49,50]. In this study, we used the BGSA to detect mutations in the R2 series derived from the two machine-learning algorithms (Section 2.3.3).
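A single step of the BG segmentation, finding the split point that maximizes the t-statistic between the left and right segment means, can be sketched as follows; the full algorithm applies this recursively with a significance test, which is omitted in this sketch:

```python
import numpy as np

def bg_split(x):
    """Return the index maximizing the t-statistic between the means of
    the left and right segments (one step of BG segmentation)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    best_t, best_i = -np.inf, None
    for i in range(2, n - 1):              # at least 2 points per side
        left, right = x[:i], x[i:]
        # pooled standard error of the difference of the two means
        s = np.sqrt(((len(left) - 1) * left.var(ddof=1)
                     + (len(right) - 1) * right.var(ddof=1)) / (n - 2)
                    * (1.0 / len(left) + 1.0 / len(right)))
        t = abs(left.mean() - right.mean()) / s
        if t > best_t:
            best_t, best_i = t, i
    return best_i
```

On a series with a clear mean shift, the returned index falls at the change point; recursing on each side (keeping only statistically significant splits) yields the full segmentation.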

Feature Analysis
Feature importance is defined as the contribution of each variable to the model, with important variables having a greater impact on the model evaluation results. In this study, we used both impurity-based and permutation-based methods to calculate feature importance for the RF and XGBoost algorithms. RF and XGBoost are both ensemble models with decision trees as base estimators; they differ in how each decision tree is generated. In a single decision tree, the impurity-based importance measure quantifies how much each feature's split points improve the performance measure, describing how useful each feature is in constructing the decision trees of the model. However, when the model overfits, features with little predictive effect may receive high importance. Permutation importance is the decrease in model score when the values of a single feature are randomly shuffled, which complements the impurity-based importance. In this study, the impurity-based importance scores were obtained from the "feature_importances_" attribute of the best trained RF or XGBoost models generated during model development; the feature importance for the two models and the different biomes was output by this process. The permutation-based importance scores were computed with the "permutation_importance" method on the training samples with an "n_repeats" value of 10.
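The two importance measures can be obtained as sketched below (synthetic stand-in data with only two informative features, so the contrast between informative and uninformative variables is visible):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

X, y = make_regression(n_samples=300, n_features=5, n_informative=2,
                       noise=0.5, random_state=420)
rf = RandomForestRegressor(n_estimators=100, random_state=420).fit(X, y)

# Impurity-based: how much each feature's splits improve the trees.
impurity_imp = rf.feature_importances_

# Permutation-based: score drop when one feature's values are shuffled.
perm = permutation_importance(rf, X, y, n_repeats=10, random_state=420)
perm_imp = perm.importances_mean
```

The impurity-based scores sum to 1 by construction, while the permutation scores are absolute drops in the scoring metric, which is why the study notes the two figures are not directly comparable.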

Biome-Level Model Performance
The RF and XGBoost algorithms did reasonably well in simulating NEE at the biome level (Figure 2). For eight biomes (DBF, EBF, ENF, GRA, OSH, CSH, SAV, and WSA), XGBoost predicted slightly better than RF, with larger R2 and smaller RMSE; for the other two biomes (MF and CRO), the predictions of the two models had the same R2 and RMSE. For XGBoost, ENF (R2 = 0.81 and RMSE = 0.77 g C m−2 d−1) had the highest R2 and OSH the smallest (R2 = 0.35 and RMSE = 0.43 g C m−2 d−1). In summary, the forest ecosystems (DBF, EBF, ENF, and MF) had the best predictions (R2 between 0.59 and 0.81), followed by savanna (SAV and WSA; R2 between 0.57 and 0.61), grassland (R2 = 0.55), and cropland (R2 = 0.43). The predictions for the two types of shrubland (OSH and CSH) were highly inconsistent, with the R2 of CSH (0.75) much higher than that of OSH (0.35). From the fitted lines, the predicted values of both models were smaller than the observed values in all biomes. Although the predictions of XGBoost were similar to those of RF, XGBoost was considerably more efficient. The training durations for RF and XGBoost with the training samples of different biomes are shown in Table 3. For each biome, the training duration of each model increased with the number of training samples, as expected. The computational efficiency of XGBoost was 6~62 times higher than that of RF, and the difference increased with the number of samples.

Environmental Conditions
The feature rankings obtained by the impurity-based and permutation-based methods were generally consistent across biomes, with similar patterns in the impurity-based results (Figure 3) and the permutation-based results (Figure S1). Among all biomes, the variables appearing most often in the top three of the importance ranking were NETRAD (eight occurrences), PA (seven occurrences), and SWC (four occurrences) for the RF model, and NETRAD (eight occurrences), PA (five occurrences), and SWC (four occurrences) for the XGBoost model. The variables appearing most often in the last three were P (ten occurrences), WS (five occurrences), and LW_IN (four occurrences) for the RF model, and P (ten occurrences), WS (six occurrences), and LW_IN (five occurrences) for the XGBoost model. Note that because of their different calculation principles, the values reported in Figure 3 are not directly comparable with those in Figure S1.
The two feature importance methods disagreed on LW_IN. For both RF and XGBoost, the permutation-based method showed LW_IN to be a less important variable. In the impurity-based method, however, LW_IN appeared at both extremes depending on the biome: for EBF, CSH, and WSA it was an important variable (top three of the importance ranking), while for DBF, ENF, and OSH it was an insignificant variable (last three).

Model Performance in Simulating NEE under Extreme Climate Conditions
Both RF and XGBoost captured the difference in NEE predictions between normal and extreme climate conditions, with similar R2 and RMSE; however, the dominant variables varied under different extreme conditions and in different biomes. In the following, we use the prediction results of XGBoost as an example. Extreme cold conditions had little impact on forest biomes (DBF, R2 = 0.93 for normal and 0.86 for extreme; ENF, R2 = 0.90 for normal and 0.87 for extreme; EBF, R2 = 0.21 for normal and 0.19 for extreme) and a slightly greater effect on SAV (R2 = 0.53 for normal and 0.44 for extreme). Extreme heat conditions reduced, to different degrees, the evaluation results of four biomes: MF (R2 = 0.81 for normal and 0.76 for extreme), DBF (R2 = 0.87 for normal and 0.78 for extreme), ENF (R2 = 0.59 for normal and 0.51 for extreme), and GRA (R2 = 0.78 for normal and 0.66 for extreme). The negative effect of extreme wetness on CRO (R2 = 0.84 for normal and 0.64 for extreme) was significant, but it showed little or even positive effects on forest biomes: MF (R2 = 0.83 for normal and 0.82 for extreme), ENF (R2 = 0.90 for normal and 0.92 for extreme), and DBF (R2 = 0.86 for normal and 0.83 for extreme). Under extreme dryness, DBF (R2 = 0.85 for normal and 0.90 for extreme) was not affected, but SAV (R2 = 0.73 for normal and 0.43 for extreme) and GRA showed significant decreases compared with the normal condition, the effect on GRA being considerable (R2 = 0.54 for normal and 0.14 for extreme) (Table 4).

Model Performance with Different Training Sample Sizes
Machine learning models depend heavily on their training samples: for data of the same quality, increasing the number of training samples benefits model prediction [37]. Both RF and XGBoost were sensitive to the number of training samples (NTS) when simulating the NEE of different biomes at the site level. The prediction results of the two models were similar under both the BAS (Figure 4a-d) and BFS (Figure 4e-h) methods. With the BAS method, the R2 of the XGBoost predictions for the four sites increased with the NTS and stabilized when the NTS reached about 8 years of daily samples (Figure 4a-d). With the BFS method, once the assessment curve reached final stability, the R2 at each site (US-MMS, R2 = 0.83; US-Var, R2 = 0.64; DE-Geb, R2 = 0.48; NL-Loo, R2 = 0.79) was generally consistent with the evaluation result of the respective biome in Figure 2. In Figure 4e-h, although the prediction at each site eventually reached a steady state with maximum R2 as the NTS increased, the R2 curves varied among biomes. For the forest biomes (Figure 4e,h), R2 reached a steady state quickly, with little fluctuation beforehand. For GRA and CRO (Figure 4f,g), in contrast, R2 reached a steady state more slowly, with large fluctuations beforehand.

Quantitative Analysis of Sample Size in Reaching Feasible Model Performance
To quantitatively determine the effect of sample number on model prediction, the BGSA was used to detect mutations in the R2 curves of XGBoost in Figure 4e-h. In Figure 5, the number of mutation points ranged from two to four across sites (US-MMS, DBF, with 4; NL-Loo, ENF, with 3; US-Var, GRA, with 2; and DE-Geb, CRO, with 3). Except for the NL-Loo site, the first mutation position of the other three sites was at fewer than 500 samples. To the left of the first mutation, each R2 curve fluctuated dramatically and showed a distinct trough. This is likely because when the total sample size is less than 500, the NTS is less than 350 (approximately one complete observation year), so the NTS is not representative enough. Since both the US-MMS (DBF) and NL-Loo (ENF) sites (Figure 5a,b) belong to the forest biome, their mutations are essentially the same, at 1300, 2200-2400, and 4100-4500, respectively, with R2 increasing after each mutation. The x-axis in Figure 5 refers to the total number of samples (70% training, 30% test), so the NTS corresponding to each mutation is 910, 1540-1680 (median about 1600), and 2870-3150 (median about 3000). This means that for DBF and ENF sites, a training sample number of 910 (approximately 2.5 years of daily samples, named YDS hereafter) can stabilize the model, and 3000 (approximately 8.2 YDS) can lead the model to its best performance. Compared with these two sites, the US-Var (GRA) and DE-Geb (CRO) sites (Figure 5c,d) reached the stability mutation at later positions, between 1900 and 2400 (median 2150), and the last mutation at an earlier position, between 2400 and 3200 (median 2800); the corresponding NTS are 1505 (approximately 4.1 YDS) and 1960 (approximately 5.4 YDS), respectively. The same trends were observed in the RF analysis for all four sites (Figure S3).

Comparison with Previous Studies
We compared the two algorithms with previous studies simulating NEE (Table 5). Predictions of XGBoost for the forest biomes in this study were generally better than those of SVR [33]: for XGBoost, the R2 of DBF, EBF, ENF, and MF was 0.78, 0.58, 0.79, and 0.67, respectively, versus 0.78, 0.59, 0.29, and 0.37 for SVR. GRA showed a similar result (R2 of 0.55 for XGBoost and 0.37 for SVR), but SVR predicted CRO better, with an R2 of 0.60 compared with 0.43 for XGBoost. For the forest biomes (DBF, ENF, MF), the XGBoost predictions (R2 between 0.59 and 0.81) were similar to the best predictions of the adaptive neuro-fuzzy inference system (ANFIS), extreme learning machine (ELM), artificial neural network (ANN), and support vector machine (SVM) models, with R2 between 0.59 and 0.80 [39]. Note that the XGBoost training samples are hybrid data from multiple EC sites, while the predictions of Dou et al. [39] are for single EC sites. Similar prediction results were obtained for bamboo forest sites using the RF method (R2 = 0.68) [50]. Jung et al. [36] used the MTE model to predict NEE with R2 = 0.49 from multiple EC sites worldwide (no biome distinction).
Some predictions using the gradient boosting regression (GBR) method were better than XGBoost for an EC site of ENF (R2 = 0.90 for GBR and 0.81 for XGBoost) [37]. This likely arises for two reasons. First, the GBR method used single-site data for training; because of differences in climate and ecological environment among EC sites, single-site predictions will be better than those from the hybrid data of multiple sites. We evaluated the same EC site and time period as Cai et al. [37] with the XGBoost method, and the results showed a slight improvement over the biome-level simulation of ENF, to which the site belongs (R2 = 0.81 and RMSE = 1.049 g C m−2 d−1 for the ENF biome versus R2 = 0.824 and RMSE = 0.802 g C m−2 d−1 at the site level). Second, the GBR model training considered the impact of extreme components (maximums and minimums) of the variables. We therefore believe that, under the same conditions, the prediction of NEE by XGBoost generally meets expectations and is better than most other popular ML methods. It should be noted that all previous studies used multi-source data for model training, such as remote sensing data of vegetation, meteorological data, etc., while we used only meteorological data, the fewest data sources and environmental variables. This shows that a single data source with fewer environmental variables can still predict NEE well. Similar results were reported when predicting NEE using meteorological data alone compared with a combination of meteorological and remote sensing data [38].

Environmental Controls as Estimated by Two ML Algorithms
Radiation, air and soil temperature, relative humidity, vapor pressure deficit, and wind speed have been shown to be the main factors affecting NEE [51]. In this study, the combination of impurity-based and permutation-based methods indicated that NETRAD, PA, SWC, and TS were the most important variables and P and WS the least important when considering all biomes together. Other machine learning studies have reported the same results [37,52]. This is because traditional statistical methods are based on the classical hypothetico-deductive approach, while ML methods build relationships directly from the data and fit highly non-linear relations between input and output. For example, although precipitation is an important resource for plant physiological activity, its effect on carbon fluxes is generally reflected at the interannual scale [53]; since our training data are daily values and daily precipitation is not directly responsible for daily NEE change, this may be the main reason that XGBoost shows inconsistency with traditional statistical methods. In summary, the feature importance from the ML models serves two purposes: it provides further insight into the underlying processes generating NEE, and it helps in the pre-modeling phase with feature engineering optimization, the latter probably being more useful. The two methods corroborated each other and showed good consistency, which increased our confidence in the feature importance results and indicated that the modeling did not obviously overfit in this study. We also took the mixed forest samples as an example, building the model for different seasons and outputting the feature importance for each. As shown in Figure S2, the results of both RF and XGBoost had similar distributions across seasons and the whole year, but seasonal variation of the feature importance does exist.
The importance of shortwave radiation and NETRAD was significantly higher in summer and autumn than in the other seasons, while TS and SWC had relatively low values in autumn. This suggests that a more appropriate split of the samples, such as by season, may yield more detailed information on feature importance and improve model performance.

Effects of Extreme Climatic Conditions on NEE Prediction at Biome Level
Extreme climatic conditions can affect the growth and development of vegetation by altering its physiological processes [54]. This is also reflected in data-driven ML methods, as the input variables under extreme weather affect the model output; however, this influence depends on the type of extreme condition as well as the biome, corresponding to the environmental conditions required by different biomes. The forest and SAV sites used in this study are mostly located at middle and high latitudes and are adapted to low temperatures, so extreme cold has little effect on their prediction results. Both extreme heat and extreme dryness affect vegetation growth in the different biomes, but the effects of extreme dryness are more pronounced. Extreme wetness has little or no effect on forest, but this does not hold for CRO.

Comparison of the Two ML Algorithms
Both RF and XGBoost had good prediction accuracy for NEE, showing that tree-based ML methods have an advantage in CO2 flux prediction. In general, the XGBoost model predicted better than RF in all biomes, with higher R2 and lower RMSE, and offered more efficient computation. For XGBoost, the number of training samples and the training duration show a highly linear relationship (y = 0.0004x − 0.226, R2 = 0.998, p < 0.0001, where x is the number of training samples and y is the training duration in minutes), while RF shows a significant power-law relationship (y = 10−6 x^1.979, R2 = 0.992, p < 0.0001) (Figure S4). That is, XGBoost can achieve better performance than RF at less computational cost. Moreover, the XGBoost algorithm is complex, with many hyper-parameters, and its prediction accuracy may still improve with further hyper-parameter optimization. However, because tree-based ML methods cannot extrapolate beyond the range of the training data, the predicted NEE was smaller in magnitude than the observations. In conclusion, the XGBoost method shows better prediction accuracy and computational efficiency in predicting NEE.
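The extrapolation limitation noted above is easy to demonstrate: a tree-based model can only return values stored in its leaves, so predictions saturate outside the training range. A minimal illustration on synthetic data (not the study's model or data):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(420)
x = rng.uniform(0.0, 10.0, 500).reshape(-1, 1)
y = 2.0 * x.ravel()                      # true relation: y = 2x

rf = RandomForestRegressor(n_estimators=100, random_state=420).fit(x, y)

# Inside the training range the fit is close to the truth; outside it
# saturates near the largest training target (~20) instead of the
# true value 40 at x = 20.
print(rf.predict([[5.0]])[0], rf.predict([[20.0]])[0])
```

This is one plausible mechanism for the systematic underestimation of the largest observed NEE magnitudes noted in the results.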

Conclusions
In this study, we applied two machine learning algorithms, RF and XGBoost, to simulate NEE in major biomes across the globe. We found that XGBoost performed better than RF, with 6-62 times higher computational efficiency. The robustness of the two methods was also tested under different extreme climate conditions and with different numbers of training samples. Both XGBoost and RF captured well the difference in prediction between extreme-condition samples and normal samples. The impact of training sample size on model performance varied by biome: biomes with better model performance (DBF and ENF) reached initial stability with a smaller NTS and continued to improve with increasing NTS, while the opposite held for the GRA and CRO biomes. In general, a minimum of 8 years of daily training data will yield feasible model predictions for the various biomes.
The variables used for model training comprised 10 meteorological variables, slightly fewer than in other studies. Both RF and XGBoost produced feasible results, indicating that variables other than those used in this study contribute only slightly to NEE prediction. Given the strengths of XGBoost, it holds promise for upscaling NEE prediction from the site to the regional scale in the future, e.g., based on grouping of samples under normal and extreme climate conditions. It also offers a potential application for estimating NEE from satellite-derived data, which should be very helpful at regional and global scales.
Supplementary Materials: The following are available online at https://www.mdpi.com/article/10.3390/rs13122242/s1. Table S1: Hyper-parameters used for the models and the initial settings with the RandomizedSearchCV method; Table S2: List of sites with extreme conditions. Acronyms: MAT, mean annual temperature; MMT, multi-year mean temperature; AP, annual precipitation; MMP, multi-year mean precipitation; SD, standard deviation of temperature or precipitation for the listed sites; DBF, deciduous broadleaf forest; EBF, evergreen broadleaf forest; ENF, evergreen needleleaf forest; MF, mixed forest; GRA, grassland; CRO, cropland; SAV, savannas; Table S3: Number of training and testing samples used for quantitative analysis of model performance; Figure