An Evaluation of Eight Machine Learning Regression Algorithms for Forest Aboveground Biomass Estimation from Multiple Satellite Data Products

Abstract: This study provided a comprehensive evaluation of eight machine learning regression algorithms for forest aboveground biomass (AGB) estimation from satellite data based on leaf area index, canopy height, net primary production, and tree cover data, as well as climatic and topographical data. Some of these algorithms, such as extremely randomized trees (ERT), stochastic gradient boosting (SGB), and categorical boosting (CatBoost) regression, have not been commonly used for forest AGB estimation. For each algorithm, the hyperparameters were optimized using grid search with cross-validation, the optimal AGB model was developed using the training dataset (80%), and AGB was predicted on the test dataset (20%). Performance metrics, feature importance, and overestimation and underestimation were considered as indicators for evaluating the performance of an algorithm. To reduce the impacts of the random training-test data split and the sampling method on performance, the above procedures were repeated 50 times for each algorithm under the random sampling, stratified sampling, and separate modeling scenarios. The results showed that the five tree-based ensemble algorithms performed better than the three non-ensemble algorithms (multivariate adaptive regression splines, support vector regression, and multilayer perceptron), and that the CatBoost algorithm outperformed the other algorithms for AGB estimation. Compared with the random sampling scenario, the stratified sampling scenario and separate modeling did not significantly improve the AGB estimates, but modeling AGB for each forest type separately provided stable results in terms of the contributions of the predictor variables to the AGB estimates. All the algorithms underestimated forest AGB when the AGB values were larger than 210 Mg/ha and overestimated it when the AGB values were less than 120 Mg/ha.
This study highlighted the capability of ensemble algorithms to improve AGB estimates and the necessity of improving AGB estimates for high and low AGB levels in future studies.


Introduction
Forest biomass is an essential climate variable that measures the net carbon dioxide exchange between the land surface and the atmosphere [1]. Accurate estimation of forest biomass and its changes can enhance our understanding of the global carbon cycle as well as reduce the uncertainties

Overview of Machine Learning Regression Algorithms
This section presents a brief overview of the machine learning techniques used for the evaluation. Since the objective of this study is to examine the performance of machine learning algorithms for estimating forest AGB, this overview is not a detailed explanation of each algorithm but rather a summary. Both established approaches and algorithms not yet tested for AGB estimation are included.

Linear-Based Learners
Linear regression and its extensions are simple algorithms used for the estimation of land surface variables from remote sensing data. Due to the multicollinearity between predictor variables obtained from remote sensing data, linear regression cannot be directly used to estimate forest biomass [24]. Therefore, some studies have performed stepwise regression to select the optimal independent variables for AGB estimation or used partial least-squares regression, which first reduces the predictors to a small set of uncorrelated components and then performs least-squares regression on these components to estimate forest AGB [21]. However, because of the nonlinear relationships between the predictors and forest biomass, the estimated forest AGB is usually inaccurate [34]. Compared with these linear models, the MARS method can capture nonlinear relationships by creating piecewise-linear models and thus provides higher accuracy for AGB estimates [35,36]. For this reason, the MARS algorithm was selected in this study.
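A MARS model is a weighted sum of hinge basis functions of the form max(0, x − t) and max(0, t − x), which is what allows it to capture the nonlinear relationships described above. The sketch below illustrates this piecewise-linear basis with numpy; the knot location and coefficients are made up for illustration, and the study itself used the py-earth implementation rather than hand-built bases.

```python
import numpy as np

def hinge(x, knot, direction=+1):
    """MARS hinge basis function: max(0, x - knot) or max(0, knot - x)."""
    return np.maximum(0.0, direction * (x - knot))

# Toy predictor (e.g., tree cover in percent) over its range
x = np.linspace(0, 100, 11)

# A MARS-style model: intercept plus two hinge terms sharing a knot at 50,
# giving different slopes on each side of the knot
y_hat = 10.0 + 0.8 * hinge(x, 50, +1) + 0.2 * hinge(x, 50, -1)
```

Fitting a real MARS model amounts to searching for the knots and pruning terms automatically, which py-earth's `Earth` estimator handles internally.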

Kernel-Based Learners
Kernel-based learners are a family of algorithms that use kernel functions to project low-dimensional data to a higher dimension to make them linearly separable. They deal with nonlinear problems while still operating with linear algebra [37]. SVR is a kernel-based algorithm commonly used for retrieving biophysical parameters from remote sensing data [38], and it was proven to have good performance in estimating forest AGB [21,39]. Gaussian processes regression (GPR) is another kernel-based algorithm that establishes the relationships between predictor variables and forest biomass in the same way as SVR [37,40]. GPR is a nonparametric Bayesian approach, and it can provide not only the predicted mean value for the estimation but also the variance [41]. Previous studies have suggested that both SVR and GPR are suitable for estimating forest structure parameters but that GPR is slightly more accurate than SVR [37]. In addition to SVR and GPR, kernel ridge regression (KRR) is also a kernel-based algorithm that combines ridge regression and the kernel trick. KRR learns a model quickly for medium or large datasets. Previous studies used KRR for estimating biophysical parameters, such as leaf chlorophyll content, leaf area index, and fractional vegetation cover, and compared its performance with those of SVR and GPR [41], but KRR has not yet been applied in forest AGB estimation.
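The three kernel methods discussed above can be compared side by side in scikit-learn. The sketch below fits SVR, KRR, and GPR to synthetic data (the hyperparameter values are illustrative, not the tuned values from this study) and shows the per-sample predictive standard deviation that GPR provides in addition to the mean.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.kernel_ridge import KernelRidge
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(200, 1))
y = np.sin(X[:, 0]) * 100 + rng.normal(0, 5, 200)  # synthetic nonlinear response

models = {
    "SVR": SVR(kernel="rbf", C=100.0, gamma=0.5),
    "KRR": KernelRidge(kernel="rbf", alpha=0.1, gamma=0.5),
    "GPR": GaussianProcessRegressor(kernel=RBF(1.0) + WhiteKernel(1.0),
                                    random_state=0),
}
for name, model in models.items():
    model.fit(X, y)

# Unlike SVR and KRR, GPR also returns a predictive standard deviation
mean, std = models["GPR"].predict(X[:5], return_std=True)
```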

Tree-Based Models
Tree-based learners have been widely used in parameter estimation due to their efficiency and reliability [23,42]. To date, two types of tree-based learners have been used for estimation: individual tree-based models (e.g., CART and C4.5) and their extensions using bagging or boosting [20,43,44]. For an individual tree-based model, small changes in the learning sample can cause dramatic changes in the built tree, so the estimated results in previous studies were unstable and inaccurate [45,46]. To improve this situation, most studies adopted bagging and boosting ensemble algorithms. The bagging process in tree-based models applies bootstrap samples generated from the original dataset to train tree models and then aggregates the ensemble to obtain final predictions [47]. Through aggregation, bagging algorithms improve the predictions by decreasing variance [48]. RF is a modified version of the bagging algorithm that builds trees using not only subsamples but also a random subset of predictors to determine each split, a technique that reduces the correlations between different prediction ensembles [49]. Due to its robustness to overfitting and noise in the training dataset, RF can be very effective for estimating parameters such as forest AGB [21,22]. Approximately 30% of the training samples are excluded from each tree in the RF algorithm because of bootstrap resampling, and these out-of-bag samples can be used to compute the out-of-bag prediction error.
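The out-of-bag error mentioned above is exposed directly by scikit-learn's random forest via `oob_score=True`. A small sketch on synthetic data (the predictor count of 13 mirrors this study, but the data and hyperparameter values are made up):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
X = rng.uniform(size=(500, 13))                     # 13 predictors, as in this study
y = 200 * X[:, 0] + 50 * X[:, 1] + rng.normal(0, 10, 500)

# oob_score=True reuses the ~30% of samples left out of each bootstrap
# to score the forest without a separate validation set
rf = RandomForestRegressor(n_estimators=200, max_features=4,
                           oob_score=True, random_state=0)
rf.fit(X, y)
oob_r2 = rf.oob_score_   # R^2 computed on the out-of-bag predictions
```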
The ERT algorithm is another tree-based ensemble algorithm that improves the accuracy of parameter estimation by reducing variance. Different from other tree-based ensemble methods, the ERT algorithm splits nodes by choosing cut points completely at random and uses the whole learning sample rather than a bootstrap replica to grow the trees [48,50]. Since it does not require the determination of optimal discretization thresholds, the ERT algorithm has an advantage in terms of computing time. To date, the ERT algorithm has been utilized to predict terrestrial latent heat fluxes [31], air quality [51], and streamflow [52], but its performance in estimating forest AGB has not yet been explored.
Boosting is a family of sequential ensemble algorithms that converts weak learners to strong learners, thereby eventually improving the prediction accuracy by decreasing bias [53]. Boosting algorithms pay the most attention to the samples with the highest prediction errors and increase their weights in the next iteration. The final prediction results are a weighted combination of predictions across the sequence of learners. The gradient boosting machine (GBM) is an extremely popular boosting algorithm and one of the leading algorithms for winning Kaggle competitions. For a GBM, the base learners are decision trees. Despite being popular in the machine learning community, the GBM and gradient boosted regression tree (GBRT) approaches are rarely applied in forest AGB estimation studies [23,30]. SGB is another implementation of the tree-based gradient boosting approach that builds a base learner from a random subsample drawn from the entire training dataset without replacement at each iteration, and it can thus reduce the risk of overfitting [54,55]. Moisen et al. [56] illustrated the use of SGB in mapping the presence and basal area of 13 species. Filippi et al. [57] adopted SGB to estimate floodplain-forest AGB from hyperspectral Hyperion images and ancillary data. CatBoost is a new version of the GBRT [58]. As the name suggests, CatBoost can address categorical variables. The implementation of ordered boosting is another critical algorithmic advance in CatBoost [59]. Huang et al. [60] evaluated the CatBoost method for the prediction of reference evapotranspiration in humid regions and found that CatBoost could perform better than the RF and SVM models.
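In scikit-learn, setting `subsample` below 1.0 in `GradientBoostingRegressor` gives exactly this stochastic variant, since each tree is then fit on a random fraction of the training data drawn without replacement. A hedged sketch with illustrative hyperparameter values and synthetic data:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(1)
X = rng.uniform(size=(500, 13))
y = 150 * X[:, 0] ** 2 + 40 * X[:, 1] + rng.normal(0, 5, 500)

# subsample < 1.0 turns plain gradient boosting into stochastic gradient
# boosting (SGB): each of the 300 trees sees a random half of the data
sgb = GradientBoostingRegressor(n_estimators=300, learning_rate=0.05,
                                max_depth=3, subsample=0.5, random_state=0)
sgb.fit(X, y)
```

With `subsample=1.0` the same estimator behaves as an ordinary GBRT, which is why the two methods share the hyperparameters described later in Table 2.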
As addressed above, a series of bagging and boosting ensemble algorithms have been proposed to increase the accuracy of estimated parameters, but their applications in AGB prediction are very rare. Therefore, the ensemble algorithms RF, ERT, GBRT, SGB, and CatBoost were chosen for comparison in this study.

Neural Network-Based Learners
Neural networks have attracted great attention in recent years. Inspired by biological neural networks, ANNs have been used to construct nonlinear models between independent variables and dependent variables, emulating the learning of the biological neuron system [61]. The basic model of an ANN consists of one input layer, one or more hidden layers, and one output layer. The feedforward multilayer perceptron (MLP) model has been the most widely used ANN model in various fields since the introduction of the back propagation algorithm and is also an alternative approach for reliable forest biomass prediction [62-65]. The extreme learning machine (ELM) is a novel type of ANN with a single hidden layer, where the hidden nodes are randomly initiated and then fixed without iteratively tuning them [66,67]. Empirical studies suggested that the efficiency and generalization abilities of the ELM are comparable to or even better than those of the SVR algorithm [67,68]. Variations in the ELM were developed to make it more suitable for specific applications [69], but it has not yet been extended for forest AGB estimation.
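A feedforward MLP regressor of the kind described above can be sketched with scikit-learn; the layer sizes and data below are illustrative only (not the configuration tuned in this study), and input scaling is included because MLP training is sensitive to feature magnitudes.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(7)
X = rng.uniform(size=(400, 13))
y = 100 * np.sin(3 * X[:, 0]) + 30 * X[:, 1] + rng.normal(0, 5, 400)

# One input layer, two hidden layers (64 and 32 perceptrons), one output;
# StandardScaler keeps the back-propagation training well conditioned
mlp = make_pipeline(StandardScaler(),
                    MLPRegressor(hidden_layer_sizes=(64, 32),
                                 max_iter=2000, random_state=0))
mlp.fit(X, y)
```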

Reference AGB Data
The reference AGB data are indispensable for forest AGB estimation from satellite datasets. They have often been used to calibrate remote sensing datasets and validate the estimated results. In this study, the reference AGB data were compiled from field AGB data and high-resolution AGB datasets that were originally derived from field and lidar data [70]. The field AGB data were obtained by collecting plot-level AGB measurements made on or after the year 2000 from the published literature and online databases. To increase the number of training samples, high-resolution AGB datasets with spatial resolutions finer than 100 m, which match field AGB well in spatial scale, were also considered. Both the field AGB data and the high-resolution AGB datasets were aggregated to 0.01°. More details about the sources and preprocessing of the plot-level AGB and high-resolution AGB datasets used to generate the reference AGB data can be found in one of our previous papers [71].
Since the relationships between forest AGB and the predictor variables may vary with forest type, we stratified the reference AGB data by forest type using the MODIS (Moderate Resolution Imaging Spectroradiometer) Land Cover Type Product (MCD12Q1, version 6) for 2005 with the International Geosphere-Biosphere Programme (IGBP) legend [72]. A total of 12,376 reference AGB samples at 0.01° resolution were finally selected, including 249 samples located in evergreen needleleaf forests (ENFs), 1600 samples in evergreen broadleaf forests (EBFs), 3689 samples in deciduous broadleaf forests (DBFs), 585 samples in mixed forests (MFs), 4993 samples in woody savannas (WSAs), and 1260 samples in savannas (SAVs) (Figure 1). The AGB samples corresponding to non-forest types according to the MCD12Q1 for 2005 were excluded. The statistics of the reference AGB samples sorted by forest type are shown in Table 1.

Input Data Collection
It was demonstrated that the multisensor synergy of optical, radar, and lidar data could improve the accuracy of AGB estimation [4,73-76]. Previous studies generally incorporated high-level satellite data products that were originally derived from optical, radar, and lidar data, or a combination of them, into the process of estimating AGB based on satellite observations. In this study, the satellite data products included were the Global Land Surface Satellites (GLASS) leaf area index (LAI), canopy height derived from the Geoscience Laser Altimeter System (GLAS) data, MODIS net primary production (NPP), as well as tree cover data derived from Landsat images, which were closely correlated with forest AGB under some conditions [42,77,78]. In addition, climatic and topographical factors could significantly or marginally improve the accuracy of forest AGB estimates and thus served as predictor variables of forest AGB in this study [79-83].

GLASS LAI Data
As part of the GLASS product suite [84], the GLASS LAI product at 8-day and 1-km resolutions was used because it is more accurate and temporally continuous than other LAI products such as MODIS LAI and Geoland2 LAI [85-87]. The 8-day GLASS LAIs from 2001 to 2010 were first reprojected to the WGS 84 geographic coordinate system and then averaged at the monthly scale. The maximum monthly LAI value within a year was taken as the annual maximum LAI. Here, we used the annual maximum LAI for the year 2005 and the interannual variation in LAI, which was represented by the coefficient of variation (CV) of the annual maximum LAIs from 2001 to 2010.
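The annual-maximum and coefficient-of-variation steps can be expressed compactly with numpy. This sketch assumes a hypothetical (year, month, pixel) stack of monthly LAI values rather than the actual GLASS files:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical monthly LAI stack: (10 years 2001-2010, 12 months, 100 pixels)
monthly_lai = rng.uniform(0.5, 6.0, size=(10, 12, 100))

# Annual maximum LAI: the maximum monthly value within each year
annual_max = monthly_lai.max(axis=1)                 # shape (10, 100)

# Interannual variation: coefficient of variation of the annual maxima
lai_cv = annual_max.std(axis=0) / annual_max.mean(axis=0)

# The two predictors used in the study: the 2005 annual maximum
# (year index 4 in a 2001-2010 stack) and the CV above
lai_2005 = annual_max[4]
```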

Global Canopy Height Map
The global canopy height map at 1-km resolution was derived from the GLAS data compiled in 2005 by Simard et al. [88]. Compared with the global height map produced by Lefsky [89], the Simard et al. [88] product was in better agreement with airborne lidar-derived heights [90]. Therefore, the Simard et al. [88] global canopy height map was used to extract the canopy height, which was a predictor of forest AGB. For consistency with other datasets, we resampled the canopy height map to 0.01° using the nearest neighbor method.

MODIS NPP Data
The MODIS annual NPP data are one of the most highly used data sources for studies on the global carbon cycle [91]. The annual MOD17A3HGF (version 6) data at 500-m resolution, provided in a sinusoidal projection, were downloaded from the Land Processes Distributed Active Archive Center (https://lpdaac.usgs.gov/products/mod17a3hgfv006) [92]. Annual NPP data from 2001 to 2010 were converted to the WGS84 geographic coordinate system with the MODIS Reprojection Tool and then aggregated to 0.01°. Like the LAI data, annual NPP data for 2005 and the interannual variation in NPP from 2001 to 2010 were included.

Global Tree Cover Product
The global forest cover map at 30-m spatial resolution for 2000 generated by Hansen et al. [93] was used. To be consistent with the other datasets, the 30-m data were aggregated to 0.01° resolution. The standard deviation of all the 30-m pixels within each 0.01° grid cell was also extracted to describe the spatial variability of tree cover [94].
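Aggregating the 30-m tree cover to a coarser grid while keeping both the per-cell mean and standard deviation can be sketched as a numpy block reduction. The tile below is synthetic, and the aggregation factor of 37 is an approximation (a 0.01° cell spans roughly 37 thirty-meter pixels near the equator):

```python
import numpy as np

def block_aggregate(arr, factor):
    """Aggregate a 2-D grid by `factor` in each dimension,
    returning the per-block mean and standard deviation."""
    h, w = arr.shape
    trimmed = arr[: h - h % factor, : w - w % factor]
    blocks = trimmed.reshape(h // factor, factor, w // factor, factor)
    return blocks.mean(axis=(1, 3)), blocks.std(axis=(1, 3))

# Hypothetical 30-m tree cover tile (percent cover per pixel)
rng = np.random.default_rng(3)
tc_30m = rng.uniform(0, 100, size=(370, 370))

# TC-Mean and TC-Std predictors on the coarse 10 x 10 grid
tc_mean, tc_std = block_aggregate(tc_30m, 37)
```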

Global Multi-Resolution Terrain Elevation Data 2010 (GMTED2010)
The DEM and slope data were extracted from the GMTED2010 product, which covers global land areas from 84°N to 56°S [95]. The GMTED2010 product was mainly constructed from the SRTM data. For regions outside the coverage of the SRTM, 10 other elevation datasets, such as the GTOPO30 dataset, Canadian digital elevation data, and the National Elevation Dataset for the continental United States and Alaska, were used to fill the gaps. The GMTED2010 suite contains raster elevation products at 30, 15, and 7.5 arc-second spatial resolutions. Here, we used the GMTED2010 data at 30 arc-second spatial resolution, whose RMSE of 25 to 42 m with respect to a global set of control points provided by the National Geospatial-Intelligence Agency is lower than the 66 m RMSE of GTOPO30 relative to the same control points.

Climate Data
The WorldClim2 dataset at 1-km spatial resolution provides monthly temperature (minimum, maximum, and average) and precipitation information for global land areas during the period of 1970-2000 and can be downloaded from http://worldclim.org/version2 [96]. The monthly average temperature and precipitation data were aggregated to the annual scale and used in this study.
The Climatic Research Unit (CRU) gridded dataset contains monthly time series of temperature and precipitation data from 1901 to the present day [97]. Monthly temperature and precipitation data from 2001 to 2010 were aggregated to the annual scale, and changes in annual temperature and precipitation were derived by standardizing the 2001-2010 data to the baseline period; these anomalies then served as predictor variables of forest AGB.

AGB Modeling Algorithms
Based on the reference AGB data and their corresponding predictor variables that were extracted from multiple satellite data products and the ancillary climatic and topographical datasets, forest AGB was estimated with eight machine learning algorithms. Thirteen variables were selected as predictor variables, including the LAI, interannual variation in LAI (LAI-CV), canopy height (CH), NPP, interannual variation in NPP (NPP-CV), tree cover (TC-Mean), standard deviation of tree cover within each 0.01° grid (TC-Std), DEM, slope, temperature (Temp), changes in temperature (Temp-Anomaly), precipitation (Prec), and changes in precipitation (Prec-Anomaly). The eight nonparametric regression algorithms included the MARS algorithm, a kernel-based model (SVR), a neural network-based model (MLP), and five tree-based ensemble models including the RF, ERT, GBRT, SGB, and CatBoost approaches. The modeling of forest AGB with the SVR, RF, ERT, GBRT, SGB, and MLP approaches was implemented with the scikit-learn machine learning library for the Python programming language [98]. The MARS and CatBoost algorithms were implemented with the py-earth and catboost libraries, respectively.

Tuning the Hyperparameters for the Machine Learning Algorithms
The hyperparameters of an algorithm affect its performance when estimating forest AGB. It is essential to optimize the hyperparameters of each algorithm before conducting any further evaluation or comparison of these algorithms. The key tuning hyperparameters and the configurations of the tuning parameters for each algorithm are shown in Table 2.
For the MARS algorithm, the hyperparameters are the maximum number of terms to retain in the final model (max_terms) and the maximum degree of the interactions between piecewise linear functions (max_degrees). The radial basis function was selected as the kernel for SVR because it could capture the nonlinear relationships between the predictors and forest AGB. The regularization parameter C and kernel coefficient gamma were tuned for SVR-RBF. The parameter C controls the trade-off between the slack variable penalty (training errors falling outside the ε-insensitive margin) and the width of the margin, while gamma determines how much influence a single training sample exerts. For the five tree-based AGB modeling algorithms, the number of trees to build (n_estimators) must be tuned. The RF algorithm also requires the tuning of the number of features to consider when splitting a node (max_features). For the ERT method, two hyperparameters were optimized for the splitting procedure: the number of features randomly selected at each node (max_features) and the minimum sample size required to split a node (min_samples_split). For the three boosting algorithms, GBRT, SGB, and CatBoost, the learning rate determined the contribution of each regression tree and was therefore tuned. In addition to the number of trees to build and the learning rate, the tuning parameters in the GBRT model included the minimum sample size required to split an internal node (min_samples_split) and the maximum depth of a tree (max_depth), which controls the complexity of each individual tree. Compared with the GBRT algorithm, the fraction of samples (subsample) that were randomly selected to build each tree was added to the set of hyperparameters of SGB. The hyperparameters of the CatBoost algorithm include the learning rate, the depth of the tree (depth), and the maximum number of trees (iterations) that can be built [59].
For the MLP model, the size of the hidden layer (hidden_layer_sizes), which includes both the number of hidden layers and the number of perceptrons contained in each hidden layer, must be tuned. Based on the hyperparameter configurations shown in Table 2, the optimal combination of hyperparameters for each AGB model was determined using grid search with cross-validation. The procedure for tuning the hyperparameters of a given algorithm is shown in Figure 2. The initial data were divided into a training dataset (80%) and a test dataset (20%). Five-fold cross-validation was employed to optimize the hyperparameters: the training data were divided into five folds, with four folds chosen for training the AGB model and the remaining fold held out for validation. Each of the five folds acted once as a validation set and four times as a training set. The root mean square error (RMSE) was calculated on the validation set, and the optimal combination of hyperparameters was the one with the minimum average RMSE over the five validation sets. Once the optimal hyperparameter combination was determined for an AGB modeling algorithm, the corresponding hyperparameters were set for the model. All the training data were then used to train the AGB model, and the held-out 20% of test data were used to evaluate its performance for AGB estimation. To ensure that the comparison results of the different AGB modeling algorithms were stable and not subject to the splitting of the training and test datasets, the procedures for tuning the hyperparameters of an AGB model and evaluating model performance were repeated 50 times, randomly resplitting the training and test sets each time (Figure 2). Repetition counts of 10 to 100 are typically adopted in such studies; considering the running time, the random splitting was repeated 50 times in this study.
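The tune-train-test procedure described above maps directly onto scikit-learn's GridSearchCV. The sketch below uses a random forest with a deliberately small, illustrative grid (not the actual grid from Table 2) and synthetic data:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 13))
y = 180 * X[:, 0] + 60 * X[:, 2] + rng.normal(0, 10, 300)

# 80/20 split, then 5-fold cross-validated grid search on the training
# portion, scored by (negative) RMSE as in the procedure above
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
grid = GridSearchCV(RandomForestRegressor(random_state=0),
                    param_grid={"n_estimators": [100, 200],
                                "max_features": [4, 8]},
                    cv=5, scoring="neg_root_mean_squared_error")
grid.fit(X_train, y_train)

# GridSearchCV refits the best hyperparameter combination on all
# training data; the held-out 20% is used only for final evaluation
best = grid.best_estimator_
test_r2 = best.score(X_test, y_test)
```

Repeating this whole cell with different `random_state` values for the split reproduces the 50-run repetition scheme.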
The performance metrics calculated on the 50 test sets were all stored, and the average of the 50 performance metrics was also calculated to evaluate the performance of each AGB modeling algorithm.

AGB Modeling under Different Scenarios
To fully investigate the performances of eight machine learning regression algorithms for AGB estimation from multiple satellite datasets and ancillary information, the following scenarios were designed.
(1) AGB models were developed based on random sampling of the initial data without stratification of forest types, resulting in a total of 8 models for examination;
(2) AGB models were built based on random sampling of the initial data with stratification of forest types, producing 8 models under the stratified sampling scenario;
(3) AGB models were developed for each forest type separately based on random sampling of the initial data for the corresponding forest type, which led to a total of 48 models for evaluation (8 models for each forest type).
The second scenario was designed to address whether, compared with the random sampling scenario, the stratification of forest types could improve the accuracy of AGB estimation achieved by eight machine learning algorithms. The training dataset under the stratified sampling scenario was the same as the sum of the training datasets of the six AGB models developed separately for each forest type. The comparison results under the second and third scenarios illustrated whether the separate modeling of AGB improved the forest AGB estimates with these machine learning algorithms.
As addressed in Section 3.3.1, each of the 64 AGB models under the above three scenarios was first optimized based on the training data using grid search with cross-validation, then developed with the whole training dataset, which accounted for 80% of the data, and finally evaluated with the remaining 20% of test data. The procedure was repeated 50 times for each AGB model.
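The stratified sampling scenario (scenario 2) can be reproduced with the `stratify` argument of scikit-learn's train_test_split, which keeps each forest type's proportion identical in the training and test sets. The forest-type proportions below are made up for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
n = 1000
forest_type = rng.choice(["ENF", "EBF", "DBF", "MF", "WSA", "SAV"],
                         size=n, p=[0.02, 0.13, 0.30, 0.05, 0.40, 0.10])
X = rng.uniform(size=(n, 13))
y = rng.uniform(0, 300, n)

# 80/20 split stratified by forest type: each type keeps (almost exactly)
# the same share of samples in both the training and test sets
X_tr, X_te, y_tr, y_te, ft_tr, ft_te = train_test_split(
    X, y, forest_type, test_size=0.2, stratify=forest_type, random_state=0)
```

Scenario 3 (separate modeling) instead filters the data to one forest type before splitting, e.g. `X[forest_type == "DBF"]`.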

Performance Evaluation of the AGB Models
The performance of each AGB model was evaluated using four commonly used indicators: the coefficient of determination (R^2), root mean square error (RMSE), relative RMSE (RMSE%), and bias. These metrics can be calculated as follows:

$$R^2 = 1 - \frac{\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right)^2}{\sum_{i=1}^{N}\left(y_i - \bar{y}\right)^2}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right)^2}$$

$$\mathrm{RMSE\%} = \frac{\mathrm{RMSE}}{\bar{y}} \times 100\%$$

$$\mathrm{bias} = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)$$

where $y_i$ is the reference AGB, $\bar{y}$ is the average of the reference AGB values, $\hat{y}_i$ is the estimated AGB, and $N$ is the number of test samples used for the evaluation. High R^2 values were preferred, while lower values of the RMSE, relative RMSE, and bias magnitude indicated better performance by an AGB model.
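The four indicators can be computed in a few lines of numpy. In this sketch, bias is computed as reference minus estimate, so a negative bias indicates overestimation; this sign convention is an assumption chosen to be consistent with how the results are discussed later.

```python
import numpy as np

def evaluate(y_ref, y_pred):
    """Return R^2, RMSE, relative RMSE (%), and bias for AGB estimates."""
    err = y_pred - y_ref
    ss_res = np.sum(err ** 2)
    ss_tot = np.sum((y_ref - y_ref.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    rmse = np.sqrt(np.mean(err ** 2))
    rel_rmse = 100.0 * rmse / y_ref.mean()
    # bias = reference minus estimate: negative values mean overestimation
    bias = np.mean(y_ref - y_pred)
    return r2, rmse, rel_rmse, bias

# Tiny worked example with made-up values (Mg/ha)
y_ref = np.array([100.0, 150.0, 200.0, 250.0])
y_pred = np.array([110.0, 140.0, 210.0, 240.0])
r2, rmse, rel_rmse, bias = evaluate(y_ref, y_pred)
```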

Comparison Analysis of the Algorithms for Modeling AGB
The overall performances of the AGB models under the three scenarios were evaluated in terms of their R 2 , RMSE, bias, and relative RMSE values. For each AGB-modeling algorithm, the performance metrics for 50 runs were calculated. In addition to the performance metrics calculated based on the whole test data for each run, the performances for each forest type were computed using the test data with their corresponding forest types. Only the stratified sampling scenario and separate modeling scenario were considered when we explored how the eight algorithms performed for AGB estimation of each forest type, since these two scenarios used the same training data and test data for AGB modeling and performance evaluation.
Among the eight algorithms, the MARS, RF, ERT, GBRT, SGB, and CatBoost methods provided the coefficients measuring the contributions of the prediction variables in the various AGB models. The base model of the RF, ERT, GBRT, SGB, and CatBoost algorithms was the CART model, and the mean decrease in impurity importance was used to rank the importance of features [99]. In the MARS algorithm, the contribution of each predictor was determined using the generalized cross-validation statistic [100,101]. The feature importance values for the six models were calculated for 50 runs, and the distribution of the feature importance values for each model was represented by the mean, minimum and maximum values.
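The mean-decrease-in-impurity importances mentioned above are exposed as `feature_importances_` on the fitted scikit-learn ensembles. This sketch uses the study's 13 predictor names but synthetic data, so the resulting ranking is illustrative only:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.default_rng(2)
X = rng.uniform(size=(400, 13))
# Synthetic response driven mostly by predictors 2 and 5
y = 120 * X[:, 2] + 80 * X[:, 5] + rng.normal(0, 5, 400)

names = ["LAI", "LAI-CV", "CH", "NPP", "NPP-CV", "TC-Mean", "TC-Std",
         "DEM", "Slope", "Temp", "Temp-Anomaly", "Prec", "Prec-Anomaly"]

# Mean-decrease-in-impurity importances from an ERT ensemble
ert = ExtraTreesRegressor(n_estimators=200, random_state=0).fit(X, y)
importance = dict(zip(names, ert.feature_importances_))
ranked = sorted(importance, key=importance.get, reverse=True)
```

Repeating the fit over the 50 random splits and recording the mean, minimum, and maximum of each importance reproduces the distributions summarized in the study.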
In addition to the overall performance metrics and feature importance values for 50 runs, we compared the performances of the eight AGB modeling algorithms in terms of AGB prediction under three scenarios. Some of the initial data acted as the test set multiple times for 50 runs. We combined the predictions for 50 runs of each algorithm using two approaches. One approach was to simply combine all the predictions (20% of the initial data for each run, with 50 runs) without integrating the predicted values corresponding to the same test samples; this approach generated 123,800 predictions for each algorithm. The other approach was to integrate the predicted values corresponding to the same samples; this approach produced 12,376 predictions for each algorithm. Based on the generated AGB estimates and the corresponding reference AGB, we compared the performances of all eight AGB modeling algorithms. Following this, we calculated the prediction errors (predicted value minus the reference AGB) of each algorithm and explored the relationship between the prediction errors and the reference AGB. Moreover, the AGB predictions were separated into eight bins according to the reference AGB: 0 < reference AGB ≤ 60 Mg/ha, 60 < reference AGB ≤ 90 Mg/ha, 90 < reference AGB ≤ 120 Mg/ha, 120 < reference AGB ≤ 150 Mg/ha, 150 < reference AGB ≤ 180 Mg/ha, 180 < reference AGB ≤ 210 Mg/ha, 210 < reference AGB ≤ 240 Mg/ha, and reference AGB > 240 Mg/ha. The prediction errors within each bin were calculated, and the results from different algorithms were compared to comprehensively explore the underestimation and overestimation of AGB by eight machine learning regression algorithms. Figure 3 shows the distribution of the validated R-squared, RMSE, bias, and relative RMSE results for 50 runs of each regression algorithm under three scenarios. 
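The binning of prediction errors by reference AGB can be sketched with numpy's digitize. The predictions below are synthetic, constructed only to mimic a regression-to-the-mean error pattern (overestimation at low AGB, underestimation at high AGB):

```python
import numpy as np

rng = np.random.default_rng(9)
y_ref = rng.uniform(0, 300, 5000)
# Hypothetical predictions pulled toward the mean
y_pred = 0.6 * y_ref + 0.4 * y_ref.mean() + rng.normal(0, 15, 5000)

# Prediction error: predicted value minus the reference AGB
errors = y_pred - y_ref

# Bin edges from the text: 0-60, 60-90, ..., 210-240, and >240 Mg/ha
inner_edges = [60, 90, 120, 150, 180, 210, 240]
bin_idx = np.digitize(y_ref, inner_edges)   # 0..7 for the eight bins
mean_error_per_bin = [errors[bin_idx == i].mean() for i in range(8)]
```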
The five tree-based ensemble algorithms, RF, ERT, GBRT, CatBoost, and SGB, outperformed the SVR, MLP, and MARS algorithms in terms of the R-squared, RMSE, and relative RMSE values. SVR produced the AGB estimates with the largest bias and significantly underestimated forest AGB (Figure 3c and Table 3). Averaging the performance metrics of the 50 runs for each regression algorithm, we found that the CatBoost algorithm had the overall best performance in estimating forest AGB from multiple satellite data products, with a mean R-squared of 0.71, an RMSE of 46.67 Mg/ha, and a relative RMSE of 26% (Table 3). In contrast, the MARS model achieved the lowest mean R-squared at 0.56, the highest RMSE at 56.69 Mg/ha, and a relative RMSE of 31.58%. (In Figure 3, random sampling corresponds to AGB modeling without stratification of forest types, stratified sampling corresponds to AGB modeling with stratification of forest types, and forest type corresponds to separate AGB models for each forest type.) The comparison results for the eight machine learning regression algorithms did not vary with the training datasets used. In other words, the random splitting of the training and test data did not greatly affect the performance metrics obtained by the AGB modeling algorithms, particularly the tree-based ensemble algorithms. Additionally, the modeling scenarios did not change the comparison results: neither stratified random sampling nor modeling AGB for each forest type separately improved the overall accuracy of the AGB estimates obtained by the eight algorithms (Figure 3).

Performance Metrics for the Optimized AGB Models
Compared with the MLP and MARS algorithms, the performances of the RF, ERT, GBRT, CatBoost, and SGB approaches were more stable and less sensitive to the training datasets and modeling scenarios. The mean biases for 50 runs each of the RF, ERT, GBRT, CatBoost, and SGB approaches were −0.03 Mg/ha, −0.21 Mg/ha, −0.11 Mg/ha, −0.07 Mg/ha, and 0.10 Mg/ha, respectively, and the median biases were −0.15 Mg/ha for the RF, −0.36 Mg/ha for the ERT model, −0.01 Mg/ha for the GBRT, −0.09 Mg/ha for SGB and 0.03 Mg/ha for the CatBoost algorithm under the random sampling scenario (Figure 3 and Table 3). The results suggested that the RF, ERT, GBRT, and SGB models generally overestimated forest AGB, while the CatBoost algorithm slightly underestimated forest AGB.
Comparing the performances of the eight algorithms for estimating the AGB of different forest types under the stratified sampling scenario and separate modeling scenario, we found that modeling AGB for each forest type separately did not improve the accuracy of the AGB estimates but instead resulted in large variations in the bias and relative RMSE values due to the random splitting of the training data and test data for each of the 50 runs (Figure 4). One exception was that the MARS algorithm developed for the mixed forests had a good performance with a corresponding median bias close to zero, while the MARS model built under the stratified sampling scenario showed an overestimation of AGB in mixed forests. Large discrepancies in the bias and relative RMSE values were observed for the different algorithms. However, all the AGB modeling algorithms achieved good performances in estimating the AGB of DBFs, with relative RMSEs ranging from approximately 15 to 25% and biases ranging from −5.0 to 5.0 Mg/ha, but they achieved poor performances in estimating the AGB of ENFs with relative RMSEs larger than 60%. A possible reason for the poor accuracy of the predicted AGBs of ENFs was that many ENF samples had high AGB values, which could not be captured by the satellite data products due to saturation problems (Table 1). For EBFs, all the regression algorithms except SVR underestimated AGB under the stratified sampling scenario while overestimating AGB under the separate modeling scenario. In contrast, the regression algorithms with stratified sampling overestimated the AGB of MFs and WSAs, while the separate AGB models provided underestimated AGB values (Figure 4).
Under either scenario, the MLP, MARS, and SVR models were not optimal for any forest type (Figure 4). Although the five tree-based ensemble algorithms (RF, ERT, GBRT, SGB, and CatBoost) performed better than the MLP, MARS, and SVR models, no single tree-based algorithm was optimal for all forest types (Figure 5). The RF, ERT, GBRT, and SGB models had similar performances when modeling AGB separately for the EBF, DBF, MF, WSA, and SAV types. For ENFs, the ERT algorithm provided relatively better results than the other algorithms, but only under the separate modeling scenario.

Relative Importance of the Predictor Variables in AGB Models
Figure 6 shows that the relative importance values of the 13 predictor variables differed among the AGB modeling algorithms. For the RF, GBRT, and SGB models, CH contributed the most to AGB estimation, although with different feature importance values. TC-Mean was the most important variable for estimating AGB with the MARS algorithm; however, its contribution was strongly affected by the training data used and varied from 10.35 to 59.65%. For the ERT and CatBoost algorithms, TC-Mean also contributed more to the AGB predictions than the other predictor variables, with importance values ranging from 11.58 to 16.74%.
Compared with the MARS, RF, GBRT, and SGB models, the ERT and CatBoost algorithms provided more stable results in terms of the importance of the predictor variables to AGB estimation under both the random sampling and stratified sampling scenarios (Figure 6 and Figure S1). The changes in feature importance values caused by the random splitting of the training and test sets in each of the 50 runs were no more than 3% under the random sampling scenario and no more than 5% under the stratified sampling scenario. Moreover, the orders of the variables when sorted by feature importance were the same under both sampling scenarios for the ERT and CatBoost algorithms, further demonstrating their stable performances for forest AGB estimation. Although the RF and SGB models also provided consistent variable orders under the random and stratified sampling scenarios, they were more sensitive to the training dataset than the ERT and CatBoost models, with changes in feature importance values reaching 12% for the RF model and 18% for the SGB model. Despite the large differences in feature importance values among the six AGB modeling algorithms, they all revealed the limited contribution of the LAI to the AGB estimates. Moreover, when the AGB modeling algorithms were developed for each forest type separately, they consistently indicated that canopy height contributed the most to the AGB estimates of EBFs, while tree cover was the dominant variable for the AGB estimates of the DBFs, MFs, WSAs, and SAVs (Supplementary Material, Figure S2).
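The stability analysis described here, i.e., the per-variable spread of importance values and the consistency of the importance ranking across repeated splits, can be sketched with scikit-learn's `ExtraTreesRegressor` (the ERT algorithm). The data, number of runs, and model settings are invented for illustration:

```python
# Illustrative check of feature-importance stability across repeated random
# splits, using ExtraTreesRegressor as the ERT algorithm. The per-feature
# range of importances over runs corresponds to the "changes in feature
# importance" discussed above; synthetic data replaces the study's predictors.
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 4))
y = 50 * X[:, 0] + 20 * X[:, 1] + rng.normal(scale=5, size=400)

importances = []
for run in range(5):                        # the study used 50 runs
    X_tr, _, y_tr, _ = train_test_split(X, y, test_size=0.2, random_state=run)
    ert = ExtraTreesRegressor(n_estimators=100, random_state=run).fit(X_tr, y_tr)
    importances.append(ert.feature_importances_)

imp = np.array(importances)                 # shape: (runs, features)
spread = imp.max(axis=0) - imp.min(axis=0)  # per-feature importance range
# True if every run sorts the features into the same importance order
rank_stable = (imp.argsort(axis=1) == imp.argsort(axis=1)[0]).all()
```

A small `spread` and a stable ranking across runs are exactly the two stability criteria the ERT and CatBoost algorithms satisfied in this study.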

Figure 6.
Feature importance values for the predictor variables in the eight machine learning regression algorithms under the random sampling scenario. The lines connect the corresponding minimum and maximum feature importance values for the 50 runs of each predictor variable, and the points represent the average of the 50 feature importance values for each predictor variable. CH is the canopy height; Prec is the precipitation; Prec-Anomaly is the change in precipitation; LAI is the leaf area index; LAI-CV is the interannual variation in LAI; NPP is the net primary production; NPP-CV is the interannual variation in NPP; TC-Mean is the mean tree cover within each 0.01° grid cell; TC-Std is the standard deviation of tree cover within each 0.01° grid cell; DEM is the digital elevation model; Temp is the temperature; and Temp-Anomaly is the change in temperature.

Predicted Forest AGB with Eight Regression Algorithms
By simply combining the AGB predictions of the 50 runs of each AGB modeling algorithm, we found that the MARS algorithm provided unreasonable estimates, and the results of the MLP and SVR models were also inferior to those of the RF, ERT, GBRT, SGB, and CatBoost models in terms of forest biomass estimation (Figure S3). The performance of the MARS algorithm improved when the predictions of the 50 runs were aggregated by averaging all the AGB predictions corresponding to the same test sample (Figure 7). For the other AGB modeling algorithms, aggregating the estimated AGB values led to only slight improvements in the accuracy of the corresponding AGB estimates. The comparison results obtained by the different AGB modeling algorithms were consistent under the three designed scenarios. For simplicity, the following results are based on the random sampling scenario.
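The aggregation step, averaging all predictions that belong to the same test sample across runs, can be sketched in a few lines of numpy. The sample count, run count, and prediction values below are invented for illustration:

```python
# Minimal numpy sketch of the aggregation step: across repeated runs, each
# sample is held out only in some runs, so its final prediction is the mean
# over the runs in which it actually appeared in the test set. NaN marks
# runs where a sample was in the training split instead.
import numpy as np

n_samples, n_runs = 6, 4
preds = np.full((n_runs, n_samples), np.nan)  # preds[r, i]: prediction for sample i in run r
preds[0, [0, 1]] = [110.0, 205.0]
preds[1, [1, 2]] = [215.0, 95.0]
preds[2, [0, 3]] = [118.0, 160.0]
preds[3, [4, 5]] = [80.0, 230.0]

# nanmean ignores the runs in which a sample was not held out
aggregated = np.nanmean(preds, axis=0)
# sample 0 -> (110 + 118) / 2 = 114.0; sample 1 -> (205 + 215) / 2 = 210.0
```

Averaging over held-out runs smooths run-to-run noise, which is why it helped the unstable MARS algorithm the most.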
Among the five tree-based ensemble algorithms, the CatBoost algorithm provided the most accurate estimates of forest AGB, with an R² of 0.71, an RMSE of 46.73 Mg/ha, a bias of 0.10 Mg/ha, and a relative RMSE of 26% (Figure S3). When the predicted AGB values corresponding to the same test samples were aggregated for each algorithm, the CatBoost model remained the best estimator, with R², RMSE, bias, and relative RMSE values of 0.72, 45.63 Mg/ha, 0.06 Mg/ha, and 25%, respectively (Figure 7). The SGB algorithm had an accuracy similar to that of the CatBoost algorithm.
Despite the good performances of several algorithms for AGB estimation in terms of R², RMSE, bias, and relative RMSE, the prediction errors induced by all the modeling algorithms were negatively correlated with the reference AGB values, indicating the overestimation of forest AGB at low values and the underestimation of AGB at high values (Figure 8). The aggregation of the predictions slightly increased the correlation coefficients between the prediction error and the reference AGB by approximately 0.04 for the MARS algorithm. For the other AGB modeling algorithms, the correlation coefficients remained unchanged after the aggregation.
The problems of overestimation and underestimation were more evident for the RF, ERT, SGB, and CatBoost algorithms, with correlation coefficients of 0.84 without aggregation (Figure S4). After aggregation, the boosting algorithms SGB and CatBoost had correlation coefficients of 0.85 (Figure 8).
The prediction errors varied with the AGB values (Figure 9). All the models tended to overestimate forest AGB, particularly when the biomass was lower than 120 Mg/ha, and to underestimate it when the biomass was higher than 210 Mg/ha. Among the eight machine learning regression algorithms, the RF, ERT, MARS, SVR, and MLP algorithms produced the most serious underestimations and overestimations of forest AGB, while the three boosting algorithms (CatBoost, GBRT, and SGB) provided less biased AGB estimates. This can be explained by the fact that boosting algorithms were originally designed to reduce the bias of an estimator and thus provided relatively better results than the other algorithms in terms of bias. In summary, the CatBoost algorithm was the most suitable for AGB estimation from multiple satellite data products in terms of performance metrics, feature importance values, and overestimation and underestimation problems. The SGB algorithm provided an accuracy similar to that of CatBoost, and the RF and ERT models were slightly inferior to the CatBoost and SGB models for forest AGB estimation. However, it should be noted that the RF and ERT models have fewer hyperparameters to tune than the other models, which makes them more efficient than the CatBoost and SGB algorithms.
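The binned error analysis underlying Figure 9 (mean prediction error within reference-AGB intervals) can be sketched as follows. The synthetic data are deliberately constructed so that predictions are compressed toward the mean, mimicking the saturation pattern reported above; the bin edges and sample size are arbitrary:

```python
# Sketch of a binned error analysis: group samples by reference AGB and
# compute the mean error (reference - predicted) per bin. A toy "saturating"
# predictor that compresses values toward the mean reproduces the pattern of
# overestimation at low AGB and underestimation at high AGB.
import numpy as np

rng = np.random.default_rng(2)
reference = rng.uniform(0, 300, size=2000)               # synthetic AGB (Mg/ha)
predicted = 0.6 * reference + 0.4 * reference.mean() + rng.normal(scale=15, size=2000)
error = reference - predicted                            # positive = underestimation

bins = np.array([0, 60, 120, 180, 240, 300])
bin_idx = np.digitize(reference, bins) - 1               # which bin each sample falls in
mean_error = np.array([error[bin_idx == b].mean() for b in range(len(bins) - 1)])
# low-AGB bins have negative mean error (overestimation),
# high-AGB bins have positive mean error (underestimation)
```

With real data, plotting `mean_error` against the bin centers gives exactly the kind of diagnostic curve shown in Figure 9.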

Discussion
This study compared the performances of eight machine learning regression algorithms for estimating forest AGB from multiple satellite data products and ancillary information. The results showed that the five tree-based ensemble algorithms, including the RF, ERT, GBRT, SGB, and CatBoost algorithms, had similar accuracies in terms of their estimated AGB and performed better than the remaining three algorithms. This indicates that ensemble algorithms can improve the estimation accuracy, which is consistent with published studies [102,103]. In this study, the base learner of the ensemble algorithms was the regression tree model, and thus the ensembles were homogeneous. Since the diversity of the base learners has a great effect on the performance of an ensemble algorithm, heterogeneous ensemble algorithms (e.g., stacking) that integrate different base learners could further improve the accuracy of the estimated AGB, and this technique will be considered in future studies [104][105][106].
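The heterogeneous (stacking) ensemble suggested here can be sketched with scikit-learn's `StackingRegressor`. The choice of base learners, meta-learner, and data below is purely illustrative; the study itself did not build this model:

```python
# Sketch of a heterogeneous stacking ensemble: diverse base learners (a tree
# ensemble and an SVR) feed their cross-validated predictions to a linear
# meta-learner. Synthetic data stands in for the study's predictors.
import numpy as np
from sklearn.ensemble import StackingRegressor, RandomForestRegressor
from sklearn.svm import SVR
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 5))
y = 100 + 25 * X[:, 0] - 10 * X[:, 2] + rng.normal(scale=8, size=300)

stack = StackingRegressor(
    estimators=[("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
                ("svr", SVR(C=10.0))],
    final_estimator=Ridge(),        # meta-learner combines the base predictions
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
score = stack.fit(X_tr, y_tr).score(X_te, y_te)   # R² on the held-out set
```

Because the meta-learner is trained on out-of-fold base predictions, stacking can exploit complementary error patterns of the tree-based and kernel-based learners.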
Among the eight regression algorithms, the RF and SVR models have often been used for forest AGB estimation, and their good performances have been demonstrated in published studies [7,24,[107][108][109][110]. In this study, some recently developed algorithms, including some not previously applied to AGB prediction (e.g., the CatBoost and ERT algorithms), were evaluated. The comparison results showed that the CatBoost algorithm outperformed the other algorithms, including the RF and SVR models, and that the SGB algorithm also performed well for forest AGB prediction. In terms of feature importance, the ERT and CatBoost algorithms provided stable results. These results suggest that the recently developed algorithms performed well, and their use should be encouraged when estimating land surface parameters from remote sensing data in future studies [19]. However, it should be noted that the processes of optimizing the hyperparameters of each algorithm and comparing different algorithms to choose the best one were time-consuming. Since model ensembles are becoming widespread and are improving the accuracies of estimates in many fields, integrating an AGB model with different hyperparameter configurations or integrating multiple AGB models, instead of using a model selection procedure, should be fully considered in future studies [111,112].
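The per-algorithm hyperparameter tuning described in this study (grid search with cross-validation) can be sketched for one algorithm with scikit-learn's `GridSearchCV`. The grid values below are placeholders, not the configurations actually searched:

```python
# Sketch of hyperparameter optimization via grid search with cross-validation,
# shown for gradient boosting. Setting subsample < 1.0 turns GBRT into
# stochastic gradient boosting (SGB). Data and grid values are illustrative.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 4))
y = 40 * X[:, 0] + rng.normal(scale=5, size=200)

grid = {"n_estimators": [50, 100],
        "max_depth": [2, 3],
        "subsample": [0.8, 1.0]}        # < 1.0 gives stochastic boosting
search = GridSearchCV(GradientBoostingRegressor(random_state=0), grid, cv=3)
search.fit(X, y)
best_params = search.best_params_       # configuration with the best CV score
```

Repeating such a search for each of the eight algorithms, and then for each of the 50 runs and three scenarios, is what makes the model selection procedure so time-consuming.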
The CatBoost algorithm performed well for AGB estimation and could be further employed to generate a global forest biomass map, since the reference AGB was collected globally in this study. Validation results showed that the forest AGB estimated using the CatBoost algorithm had an R² of 0.72, an RMSE of 45.63 Mg/ha, and a relative RMSE of 25%. In another study, a global forest biomass map was generated by integrating gridded biomass datasets with the error removal and simple averaging algorithm; cross-validation using reference AGB data showed that the fused global forest AGB map had an R² of 0.61, an RMSE of 53.68 Mg/ha, and a relative RMSE of 30.28% [71]. Hu et al. [8] used RF to estimate forest biomass from GLAS data, optical imagery, climate data, and topographic data on a global scale; the R² and RMSE between their predicted results and the validation plots were 0.56 and 87.53 Mg/ha, respectively. Yang et al. [23] used the GBRT algorithm to generate a global forest AGB map for 2005 with an R² of 0.90 and an RMSE of 35.87 Mg/ha. This validation result was somewhat overoptimistic because some of the reference data were compiled from satellite-derived biomass datasets with spatial resolutions of 500 m and 1 km. Previous studies suggested that the global forest biomass map provided by Yang et al. [23] was close to that of Hu et al. [8] when validated with field reference AGB or evaluated in terms of spatial distribution [11,71]. Therefore, compared with previous studies, this study provides a feasible way to produce accurate forest AGB maps on a global scale.
Additionally, our study highlighted that the prediction errors induced by the eight regression algorithms had strong negative correlations with forest AGB, indicating the overestimation of AGB at low values and the underestimation of AGB at high values. Gao et al. [113] also noticed these overestimation and underestimation problems when they used Landsat TM (Thematic Mapper) and PALSAR (Phased Array type L-band Synthetic Aperture Radar) data to estimate AGB in subtropical forests; they found that the RF algorithm was not suitable for AGB prediction when the AGB values were less than 40 Mg/ha or greater than 160 Mg/ha. In this study, the eight algorithms provided their best estimates when the AGB values ranged from 150 to 180 Mg/ha, but they significantly underestimated AGB when the values were larger than 210 Mg/ha and overestimated AGB when the values were less than 120 Mg/ha. Zhao et al. [114] examined the data saturation problem in Landsat imagery for different vegetation types and found that the most accurate results were obtained when the AGB was 80-120 Mg/ha; in relatively low AGB (less than 40 Mg/ha) and high AGB (greater than 140 Mg/ha) regions, the problems of overestimation and underestimation were evident. Compared with these published studies, we observed a higher saturation value of 210 Mg/ha, which could be partly attributed to the datasets used in this study. The canopy height map was originally derived from the GLAS data, which do not suffer from the low saturation values typical of optical and radar data in the retrieval of land surface parameters; thus, saturation occurred at higher AGB values. To improve the accuracy of AGB estimation, more attention should be paid to the AGB estimates at high and low values, and using data from different sources as well as multistage modeling might improve AGB estimation and mapping in the future [115][116][117].
A total of 13 predictor variables, including the LAI, interannual variation in LAI, canopy height, NPP, interannual variation in NPP, tree cover, standard deviation of tree cover within each 0.01° grid cell, DEM, slope, temperature, changes in temperature, precipitation, and changes in precipitation, were used in all the models. Although the feature importance results suggested that some variables (e.g., LAI and slope) were only weakly related to forest AGB in the majority of the models, they were not excluded in this study, since published studies have indicated the underlying mechanisms linking these variables to forest AGB estimated from remotely sensed data, as well as their contributions to forest AGB estimation, particularly at lower AGB values [11].
Previous studies have suggested that the stratification of forest types could marginally or significantly improve the accuracy of AGB estimates [118]. However, in this study, the stratified sampling scenario did not provide more accurate estimates of forest AGB than the random sampling scenario without stratification of forest types. One possible reason is that the choice of stratification variable is critical for improving AGB estimates, and forest type may not be an appropriate one [119]. Additionally, modeling AGB for each forest type separately did not improve the accuracy of the estimates either, indicating that accounting for the effects of different forest types is not straightforward for accurate AGB estimation. However, it should be noted that although modeling AGB for each forest type separately did not significantly improve the AGB estimates, it provided stable results in terms of the contributions of the predictor variables to the AGB estimates; this could help us understand the relative importance of variables in forest AGB estimation and the mechanisms of AGB dynamics in the future [120].
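The stratified sampling scenario, a train/test split that preserves the proportion of each forest type, can be sketched with scikit-learn's `stratify` option. The forest-type labels and sample counts are invented for illustration:

```python
# Sketch of the stratified sampling scenario: the 80/20 split preserves the
# share of each forest type (ENF, EBF, DBF, MF, WSA, SAV) in both the
# training and test sets. All data here are synthetic placeholders.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
forest_type = rng.choice(["ENF", "EBF", "DBF", "MF", "WSA", "SAV"], size=600)
X = rng.normal(size=(600, 3))
y = rng.uniform(20, 300, size=600)      # synthetic AGB values (Mg/ha)

X_tr, X_te, y_tr, y_te, ft_tr, ft_te = train_test_split(
    X, y, forest_type, test_size=0.2, stratify=forest_type, random_state=0
)
# each forest type now occupies (almost) the same fraction of the test set
# as it does in the full dataset
```

Under the separate modeling scenario, by contrast, the split and the model fit would be repeated within each `forest_type` subset individually.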

Conclusions
In this study, we evaluated eight machine learning regression algorithms for estimating forest AGB from multiple satellite data products and ancillary information. The results showed that the forest AGB estimated with the tree-based ensemble algorithms, including the RF, ERT, GBRT, SGB, and CatBoost algorithms, had mean R² values for the 50 runs ranging from 0.69 to 0.71, RMSEs ranging from 46.67 to 47.95 Mg/ha, biases ranging from −0.21 to 0.10 Mg/ha, and relative RMSEs ranging from 26.00 to 26.72%, and was more accurate than that estimated with the MARS, SVR, and MLP algorithms, with mean R² values ranging from 0.56 to 0.66, RMSEs ranging from 50.34 to 56.69 Mg/ha, biases ranging from −0.02 to 1.55 Mg/ha, and relative RMSEs ranging from 28.05 to 31.58%. Neither the stratification of forest types nor the modeling of AGB for each forest type separately resulted in more accurate AGB estimates than those obtained under the random sampling scenario. All eight AGB modeling algorithms underestimated forest AGB when the AGB values were larger than 210 Mg/ha and overestimated it when the values were less than 120 Mg/ha. Further efforts should be made to improve the AGB estimates at low and high AGB values.
Among the eight algorithms, the CatBoost model exhibited the best performance, with an R² of 0.72, an RMSE of 45.63 Mg/ha, a bias of 0.06 Mg/ha, and a relative RMSE of 25% when the predicted AGB values for the 50 runs were aggregated, and it also provided stable results in terms of the feature importance of the predictor variables. Furthermore, it resulted in less serious underestimation and overestimation of AGB than the other models under the three scenarios. Therefore, the CatBoost algorithm is recommended for AGB estimation from multiple satellite data products.
However, it should be noted that the processes of hyperparameter optimization for each algorithm and model comparison for selecting the best one were time-consuming. In future studies, ensemble methods that integrate versions of an algorithm with different hyperparameter configurations or that combine diverse algorithms, instead of using a model selection procedure, will be considered to further improve the AGB estimates.
Supplementary Materials: The following are available online at http://www.mdpi.com/2072-4292/12/24/4015/s1, Figure S1: feature importance values for the predictor variables in the eight machine learning regression algorithms under the stratified sampling scenario, Figure S2: feature importance values for the predictor variables in the eight machine learning regression algorithms under the separate modeling scenario, Figure S3: density scatter plot of the predicted AGB values generated by simply combining the results of the MARS, SVR, RF, ERT, GBRT, SGB, CatBoost, and MLP algorithms for 50 runs under the random sampling scenario, Figure S4

Conflicts of Interest:
The authors declare no conflict of interest.