Combination of Feature Selection and CatBoost for Prediction: The First Application to the Estimation of Aboveground Biomass

: Increasing numbers of explanatory variables tend to result in information redundancy and “dimensional disaster” in the quantitative remote sensing of forest aboveground biomass (AGB). Feature selection of model factors is an effective method for improving the accuracy of AGB estimates. Machine learning algorithms are also widely used in AGB estimation, although little research has addressed the use of the categorical boosting algorithm (CatBoost) for AGB estimation. Both feature selection and regression for AGB estimation models are typically performed with the same machine learning algorithm, but there is no evidence to suggest that this is the best method. Therefore, the present study focuses on evaluating the performance of the CatBoost algorithm for AGB estimation and comparing the performance of different combinations of feature selection methods and machine learning algorithms. AGB estimation models of four forest types were developed based on Landsat OLI data using three feature selection methods (recursive feature elimination (RFE), variable selection using random forests (VSURF), and least absolute shrinkage and selection operator (LASSO)) and three machine learning algorithms (random forest regression (RFR), extreme gradient boosting (XGBoost), and categorical boosting (CatBoost)). Feature selection had a signiﬁcant inﬂuence on AGB estimation. RFE preserved the most informative features for AGB estimation and was superior to VSURF and LASSO. In addition, CatBoost improved the accuracy of the AGB estimation models compared with RFR and XGBoost. AGB estimation models using RFE for feature selection and CatBoost as the regression algorithm achieved the highest accuracy, with root mean square errors (RMSEs) of 26.54 Mg/ha for coniferous forest, 24.67 Mg/ha for broad-leaved forest, 22.62 Mg/ha for mixed forests, and 25.77 Mg/ha for all forests. The combination of RFE and CatBoost had better performance than the VSURF–RFR combination in which random forests were used for both feature selection and regression, indicating that feature selection and regression performed by a single machine learning algorithm may not always ensure optimal AGB estimation. It is promising to extending the application of new machine learning algorithms and feature selection methods to improve the accuracy of AGB estimates


Introduction
Forest aboveground biomass (AGB) is the material basis with which forest ecosystems perform their ecological functions and is an important indicator of forest carbon sequestration capacity [1]. The accurate estimation of AGB could provide a significant insight in the study of global carbon cycling and climate changes [2].
Traditional methods for AGB measurement, primarily field surveys, have a number of disadvantages: high labor intensity, high costs of manpower and material resources, long survey periods, and disturbance of the ecological environment [3]. As a result of its extensive coverage and superior repeatability, remote sensing technology is increasingly used for AGB estimation at regional scales. Various methods have been investigated to improve the accuracy of biomass estimation by remote sensing, including (1) larger samples [4], (2) combinations of multi-source remote sensing data [5], (3) feature selection [6,7], and (4) modeling algorithms. However, the high cost of field investigation limits the acquisition of larger sample datasets [8,9]. Remote sensing data such as hyperspectral and airborne lidar data are difficult to obtain due to high costs and regional limitations [10,11]. Previous studies have shown that a saturation phenomenon occurs in AGB estimation when using band reflectance or vegetation indices derived from multispectral data [12][13][14][15]. However, multispectral data, such as Landsat OLI data, are still widely used in AGB estimation at the regional scale and have the advantages of low cost and accessibility [14,16,17].
Feature selection methods are often used for predictive models that are based on high dimensional data [18]. A variety of potential variables can be obtained from remote sensing data, including multispectral reflectance, vegetation index derived from spectral data, texture measures, and others [19][20][21]. However, high dimensionality tends to result in information redundancy and "dimensional disaster" [7]. To reduce the risk of overfitting, feature selection can be performed before constructing a model in order to extract the most important and useful information from the original dataset [22]. Li et al. evaluated stepwise regression and the variable importance-based method for the selection of an optimal subset of variables for AGB estimation using multiple stepwise regression and machine learning methods [23]. They demonstrated the potential utility of feature selection for AGB estimation, especially for machine learning algorithms. Yu et al. focused on variable selection and the model stability of a subtropical forest biomass estimation model [24]. They discussed key aspects of eight variable selection methods and pointed out that most methods differed markedly in their performance on individual indices. Therefore, further investigation of feature selection for AGB estimation is required to gather additional evidence and reach reasonable conclusions.
The accuracy of AGB predictions is also affected by the regression algorithm [25]. The linear regression model is one of the most popular methods, including conventional multiple linear stepwise regression and principal component linear regression, which has a significant advantage in the prediction of forest biomass [26]. However, linear regression cannot fully capture the complex relationships between explanatory variables and AGB, accounting for its low prediction accuracy. Machine learning algorithms such as back propagation neural network, K-nearest neighbor, support vector machine (SVM), and random forests (RF) have been used to improve AGB estimation [27][28][29]. Zhao et al. investigated four machine learning algorithms, including classification regression tree (CART), artificial neural network (ANN), SVM, and RF, to estimate the parameters of a Robinia pseudoacacia L. plantation on the Loess Plateau [12]. They found that RF had the highest accuracy and the smallest error among all forest parameter estimates. Sikdar et al. evaluated two machine learning methods (RF and SVM) for biomass estimation of Sporobolus virginicus (L.) Kunth and found that the RF model (R 2 = 0.72, RMSE = 0.166 kg/m 2 ) outperformed the SVM model (R 2 = 0.66, RMSE = 0.200 kg/m 2 ) [7].
Ensemble learning is a branch of machine learning in which learning tasks are completed by building and combining multiple learners [30,31]. It can integrate decision trees and neural networks or use decision trees or neural networks alone, and it is mainly divided into Bagging and Boosting algorithms. Bagging renders the model a high generalization by reducing the variance. RF is a popular technique in Bagging [32]. Boosting uses a set of machine learning algorithms to convert weak learners to strong learners to increase the accuracy of the model. Boosting includes adaptive boosting (AdaBoost), gradient boosting decision tree (GBDT), extreme gradient boosting (XGBoost), light gradient boosting machine (LightGBM), and categorical boosting (CatBoost). GBDT and AdaBoost are commonly used algorithms in Boosting. XGBoost and CatBoost, which are improvements of GBDT, have shown potential in fields such as biology and medicine [33][34][35][36]. Although some studies have attempted to use these methods for modeling forest properties from remote sensing data, the CatBoost method has been introduced very recently [13,37]. However, Pham et al. found that extreme gradient boosting regression-genetic algorithm (XGBR-GA) outperformed CatBoost and RFR in estimating mangrove AGB [13]. Therefore, our study is devoted to improving the performance of CatBoost in forest AGB estimation by the coupling with the feature selection.
Machine learning algorithms such as RF have their own built-in feature selection functions that can be used for simultaneous feature selection and regression [38]. However, there is little evidence to indicate that a greater accuracy of AGB estimation is achieved when feature selection and regression are performed with the same algorithm. Therefore, it is worthwhile to explore the performance of AGB estimation when different feature selection methods and machine learning algorithms are combined. This study explores the following two questions: (1) What are the effects of combining different feature selection methods with different regression algorithms? (2) Does the CatBoost algorithm improve the accuracy of AGB estimation? This study focused on the estimation of AGB using the ninth National Forest Continuous Inventory (NFCI) data from Jilin Province, China and Landsat Operational Land Imager (Landsat OLI) data.
The goals of this study are as follows: (1) Compare and analyze AGB prediction models of different forest types by combining three feature selection methods (recursive feature elimination (RFE), variable selection using random forests (VSURF) and least absolute shrinkage and selection operator (LASSO)) and three regression algorithms (random forest regression (RFR), XGBoost, and CatBoost); (2) Evaluate the accuracy of the CatBoost algorithm for AGB estimation and compare it with the RFR and XGBoost algorithms; (3) Identify the best input variables for AGB estimation by feature selection.

Study Area
The study was conducted across all of Jilin province ( Figure 2). It is characterized by hilly topography in the east and plains in the central region, and its elevation ranges from 279 to 2691 m [39]. Jilin has a temperate continental monsoon climate, with a more humid climate to the southeast and a semiarid climate in the northwest. The annual mean temperature is about 5.0 • C, with a maximum temperature of 23 • C in July and a minimum temperature of −20 • C in December. The seasonal distribution of precipitation is uneven: annual precipitation ranges from 400 to 600 mm, and 80% falls in the rainy season from June to September [40]. Natural forests form the bulk of the AGB studied in Jilin, and plantations are less common. The coniferous species are dominated by Pinus koraiensis Sieb. et. Zuccarini., followed by Picea asperata Mast. and Pinus sylvestris var. mongolica Litv. [41]. The major broadleaved species include Quercus mongolica Fischer ex Ledebour, Betula platyphylla Suk., Juglans mandshurica Maxim., Populus ussuriensis Kom., and Tilia amurensis Rupr [42].

Field Data Collection and Preprocessing
This study made use of the ninth NFCI data of Jinlin Province collected in 2014. Square sampling plots with an area of 0.06 hectares were arranged at 4 km × 8 km grid intersection points along a system of vertical and horizontal coordinates. Tree heights, diameter at breast height (DBH) of all trees greater than 5 cm, canopy density, slope, aspect, and slope position were measured. Plot types, land types, age group, and dominant tree species were also recorded. Plots that were not on forested land (cropland, water, urban land, and bare land) and those covered by clouds in the remote sensing images were excluded. The final dataset included 2716 plots ( Figure 2).

Study Area
The study was conducted across all of Jilin province (19.1 × 104 km 2 , 40°52′ N-46°18′ N, 121°38′ E-131°19′ E) in the central part of northeast China ( Figure 2). It is characterized by hilly topography in the east and plains in the central region, and its elevation ranges from 279 to 2691 m [39]. Jilin has a temperate continental monsoon climate, with a more humid climate to the southeast and a semiarid climate in the northwest. The annual mean temperature is about 5.0 °C, with a maximum temperature of 23 °C in July and a minimum temperature of −20 °C in December. The seasonal distribution of precipitation is uneven: annual precipitation ranges from 400 to 600 mm, and 80% falls in the rainy season from June to September [40]. Natural forests form the bulk of the AGB studied in Jilin, and plantations are less common. The coniferous species are dominated by Pinus koraiensis Sieb. et. Zuccarini., followed by Picea asperata Mast. and Pinus sylvestris var. mongolica Litv. [41]. The major broadleaved species include Quercus mongolica Fischer ex Ledebour, Betula platyphylla Suk., Juglans mandshurica Maxim., Populus ussuriensis Kom., and Tilia amurensis Rupr [42]. A continuous function was used to calculate the plot AGB, which was converted to per hectare biomass (Mg/ha) [43,44]. The regression equation is as follows: where B is biomass per unit area (Mg/ha); V that is the stock volume per unit area (m 3 /ha), was calculated according to the standing volume table method [45], and a and b are parameters whose specific values are given in Table 1.
According to the technical regulations for the continuous forest inventory of China, the plots were classified into three types (coniferous forest, broad-leaved forest, and mixed forest) based on the standing volume of tree species (

Field Data Collection and Preprocessing
This study made use of the ninth NFCI data of Jinlin Province collected in 2014. Square sampling plots with an area of 0.06 hectares were arranged at 4 km × 8 km grid intersection points along a system of vertical and horizontal coordinates. Tree heights, diameter at breast height (DBH) of all trees greater than 5 cm, canopy density, slope, aspect, and slope position were measured. Plot types, land types, age group, and dominant tree species were also recorded. Plots that were not on forested land (cropland, water, urban land, and bare land) and those covered by clouds in the remote sensing images were excluded. The final dataset included 2716 plots ( Figure 2).
A continuous function was used to calculate the plot AGB, which was converted to per hectare biomass (Mg/ha) [43,44]. The regression equation is as follows: =a +b B V (1) where B is biomass per unit area (Mg/ha); V that is the stock volume per unit area (m 3 /ha), was calculated according to the standing volume table method [45], and a and b are parameters whose specific values are given in Table 1.
According to the technical regulations for the continuous forest inventory of China, the plots were classified into three types (coniferous forest, broad-leaved forest, and mixed forest) based on the standing volume of tree species (    Table 2. Classification standard of forest types [23].

Remote Sensing Data Collection and Preprocessing
The remote sensing data used in this study were acquired from the Landsat OLI satellite and included one band in the panchromatic mode and eight bands in the multispectral mode. Landsat OLI images with cloud cover less than 5% were acquired on the GEE platform (https://earthengine.google.com/, accessed on 15 May 2020) from June to September of 2014. Data preprocessing steps were performed on the GEE platform, including radiation correction, terrain correction, cloud removal, image splicing, and atmospheric apparent reflectance conversion [46]. A total of 18 Landsat OLI images were obtained, covering the entire study area. Band 1-coastal and band 9-Cirrus were excluded, as Band 1 and Band 9 are used as indicators of coastal zone observation and cloud detection, respectively. Although the images were resampled to a pixel size of 25.82 m, the same size as the inventory plot, there are certain positional offsets between the boundary of the sample site and the remote sensing image. To reduce the impact of these positional offsets, a buffer zone with a radius of 25.82 m was established around each plot. Then, the mean pixels of each plot center buffer are extracted as the sample plot value.

Predictive Variables
A total of 53 predictive variables were extracted for this study (Table 4). Six spectral variables were derived from the average surface reflectance of the six multispectral bands. Eleven vegetation indices that have been widely used in previous forest studies were calculated [47]. Texture measures were extracted using the gray level co-occurrence matrix method (GLCM) [48], and eight GLCM measures with four different window sizes (3 × 3, 5 × 5, 7 × 7, and 9 × 9) were calculated from the panchromatic band. Elevation, slope, and aspect were selected as terrain factors and were derived from a digital elevation model (DEM) with a spatial resolution of 30 m, which were downloaded from the United States Geological Survey (USGS) Earth Explorer (http://earthexplorer.usgs.gov/, accessed on 30 June 2019). Canopy density from field surveys was also used as a predictor variable, that is, the ratio of canopy projection area to woodland area.

Feature Selection Methods
Feature selection is one of the critical steps in machine learning and has two main functions: (1) reducing the number of features and dimensions, thereby enhancing model generalizability and reducing overfitting, and (2) enhancing the understanding between features and eigenvalues. RFE, VSURF, and LASSO were applied to the original dataset to determine the appropriate variable number.

Recursive Feature Elimination
RFE is a well-established wrapper method to find the optimal feature subset [49]; it repeatedly constructs models and finally selects the best predicted feature set. The best or worst features are removed on the basis of coefficients, and the remaining features are used to build the next model until all features have been used. Finally, the features are sorted according to their order of elimination. To determine the best number of features, a 10-fold cross-validation resampling method was added to RFE. The 'rfe ()' function in the 'caret' package (version: 6.0-84) was used to implement RFE on R 3.5.3, with 'Repeatedcv' for method, '5' for repeats, and 'random forests (rfFuncs)' for functions.

Variable Selection Using Random Forests
VSURF is a wrapper-based algorithm that uses random forests as the base classifier [50]. Feature variables are first ranked based on a variable importance measure, and low-scoring features are eliminated to reduce the number of features and improve model accuracy [51]. In the final step, a ranked list of only the most important features is produced.

Least Absolute Shrinkage and Selection Operator
LASSO was introduced by Robert Tibshirani [52]. On the basis of linear regression, the algorithm constrains the absolute value of the model regression coefficient to a specific threshold and minimizes the sum of squares of model residuals by increasing the normal form function. By optimizing the objective function, variables whose correlation is less than the threshold value are compressed to 0 and eliminated, leaving only the remaining preferred variables. The 'cv.glmnet' function of the 'glmnet' package (version: 2.0-18) of R is used for LASSO feature selection. 'cv.glmnet' uses a 10-fold cross-validation method for verification, which is more suitable for feature selection with high-dimensional data.

Random Forest Regression
RFR is an ensemble learning technique proposed by Breiman in 2001 [53]. This multiple decision tree algorithm is based on classification and regression trees and is controlled by the number of decision trees (Ntree) and the number of nodes (Mtry) [54]. The complexity of the model is directly proportional to Ntree. Therefore, if prediction accuracy is similar, a smaller Ntree is better than a larger Ntree. Mtry controls the degree of randomness introduction, and its value is generally one-third of the number of input variables in the regression model [55,56]. The optimal variables are selected through successive modeling calculations. RFR has the advantages of processing high-dimensional data, avoiding overfitting, and ranking the importance of features.

Extreme Gradient Boosting
XGBoost is a boosting algorithm proposed by Chen et al. in 2016 based on the GBDT and RF approaches [57]. Compared with GBDT, XGBoost has improvements in multithreaded processing, the classifier, and the optimization function [58]. It also has the following advantages [59,60]: (1) The algorithm controls the complexity of the tree and then reduces overfitting by adding a regularization term to the objective function. (2) A column sampling technique is employed to prevent overfitting, similar to the random forest algorithm.
(3) The second-order Taylor expression of the objective function is used to make the definition of the objective function simpler and more precise when finding the optimal solution.

Categorical Boosting
CatBoost was developed by Dorogush et al. in 2018 [61] and is an improved GBDT toolkit similar to XGBoost. CatBoost solves the problems of gradient bias and prediction shift [33]. It has several advantages [34,62]: (1) An innovative algorithm is embedded to automatically treat categorical features as numerical characteristics. (2) It uses a combination of category features that take advantage of the connections between features, greatly enriching feature dimensions. (3) A perfectly symmetrical tree model is adopted to reduce overfitting and improve the accuracy and generalizability of the algorithm.

Evaluation of AGB Estimation Accuracy
First, data were preprocessed before modeling. Any data point that is more than 3 standard deviations is an outlier. At the same time, we also carefully removed outliers based on experience. To improve the convergence speed and model accuracy, the data was min-max normalized so that the resulting values were mapped to between 0 and 1 and formula is as Equation (2). Next, 10-fold cross-validation was used to optimize the parameters for all algorithms (RFR, XGBoost, and CatBoost) based on the training dataset (75% of the data). Finally, the calibrated model from 10-fold cross-validation was tested once again with the independent validation dataset (25% of the data) to estimate the root mean square error (RMSE), coefficient of determination (R 2 ), relative RMSE (RMSE%), and bias of the models. The formulas for these statistical parameters are as follows: where x i is the data before normalization, x j is the normalized data, x min is the minimum value of sample data, and x max is the maximum value of sample data.ŷ i is the predicted value, y i is the observed value, y is the mean of the observed values, and N is the number of observations.

Determining the Optimal Number of Variables
Different forest types showed different RMSE trends as the input variable number changed in the RFE feature selection method (Figure 3). The number of selected features differed among the different forest types. For coniferous forest, the optimal number of features was five, whereas 10, 21, and 25 features were selected for the broadleaf forest, the mixed forest, and all forests.

Variable Importance Measures
We calculated Pearson correlation coefficients between the predictor variables and AGB and found that the T9Mea texture measure had the highest correlation coefficient (−0.61). The correlation coefficients between most texture measures and AGB were 0.2-0.5. Band4 (red) had the highest correlation with AGB among the spectral variables, with a value of −0.57. RVI64 had the highest correlation with AGB among the vegetation indexes, with a value of −0.54. The correlation coefficients of elevation, slope, and canopy density with AGB were 0.56, 0.46, and 0.24, respectively.
The AGB models constructed by combining different feature selection methods (RFE, VSURF, LASSO) and machine learning algorithms (RFR, XGBoost, CatBoost) had different orders of feature importance. Figure 4 shows the top 10 selected prediction features based on feature importance for the different forest types obtained with the RFE feature selection method and the three machine learning algorithms. Spectral variables, vegetation indexes, texture measures, terrain factors, and the forest factor were included in most of the AGB models for all forests (All), although the specific selected variables differed. This result indicated that multiple features, rather than a single feature, influenced AGB estimation. For VSURF, the OOB error values decreased continuously as the variable number increased from 0 to 20 but changed only slightly as the variable number increased from 20 to 52. The variation trend for LASSO was similar to that of VSURF. After comprehensive consideration, 20 was a suitable variable number that maintained most of the information from the original dataset, and VSURF and LASSO selected the top 20 features with the highest scores.

Variable Importance Measures
We calculated Pearson correlation coefficients between the predictor variables and AGB and found that the T9Mea texture measure had the highest correlation coefficient (−0.61). The correlation coefficients between most texture measures and AGB were 0.2-0.5. Band4 (red) had the highest correlation with AGB among the spectral variables, with a value of −0.57. RVI64 had the highest correlation with AGB among the vegetation indexes, with a value of −0.54. The correlation coefficients of elevation, slope, and canopy density with AGB were 0.56, 0.46, and 0.24, respectively.
The AGB models constructed by combining different feature selection methods (RFE, VSURF, LASSO) and machine learning algorithms (RFR, XGBoost, CatBoost) had different orders of feature importance. Figure 4 shows the top 10 selected prediction features based on feature importance for the different forest types obtained with the RFE feature selection method and the three machine learning algorithms. Spectral variables, vegetation indexes, texture measures, terrain factors, and the forest factor were included in most of the AGB models for all forests (All), although the specific selected variables differed. This result indicated that multiple features, rather than a single feature, influenced AGB estimation. Five features were selected in the coniferous forest model: T9Mea, B3-Green, Elevation, B7-SWIR2, and Canopy. T9Mea, Elevation, and Canopy were also frequently selected for the AGB models of broadleaf forest, indicating that these features have significant roles in AGB estimation. Canopy and elevation were the two most important features in all mixed forest models. In AGB models of all forests, the three most important variables were Canopy, T7Mea, and Elevation for the RFE-RFR model; T9Mea, Canopy, and Elevation for the RFE-XGBoost model; and Canopy, Elevation, and T7Mea for the RFE-CatBoost model. Canopy and Elevation were included in all AGB models, indicating that both features contained sufficient information to enhance the performance of AGB estimation models.
Textures based on 7 × 7 and 9 × 9 window sizes were frequently involved in the models, indicating that these two window sizes were more appropriate for AGB estimation. Window size is an important factor that affects the accuracy of the AGB estimation model. Small windows are more sensitive to the pixel difference between canopy and shadow ratio, whereas large windows cannot extract sufficient texture information due to the excessive smoothness phenomenon of texture parameters. Mean texture was frequently involved in the model, indicating that Mean contributed more than other texture measures. Five features were selected in the coniferous forest model: T9Mea, B3-Green, Elevation, B7-SWIR2, and Canopy. T9Mea, Elevation, and Canopy were also frequently selected for the AGB models of broadleaf forest, indicating that these features have significant roles in AGB estimation. Canopy and elevation were the two most important features in all mixed forest models. In AGB models of all forests, the three most important variables were Canopy, T7Mea, and Elevation for the RFE-RFR model; T9Mea, Canopy, and Elevation for the RFE-XGBoost model; and Canopy, Elevation, and T7Mea for the RFE-CatBoost model. Canopy and Elevation were included in all AGB models, indicating that both features contained sufficient information to enhance the performance of AGB estimation models.
Textures based on 7 × 7 and 9 × 9 window sizes were frequently involved in the models, indicating that these two window sizes were more appropriate for AGB estimation. Window size is an important factor that affects the accuracy of the AGB estimation model. Small windows are more sensitive to the pixel difference between canopy and shadow ratio, whereas large windows cannot extract sufficient texture information due to the excessive smoothness phenomenon of texture parameters. Mean texture was frequently involved in the model, indicating that Mean contributed more than other texture measures. Figure A1 presents the rankings of the top 10 AGB model variables obtained when LASSO was combined with the machine learning algorithms, and Figure A2 presents the results obtained when VSURF was used. Similar to RFE, all types of features (spectral variables, vegetation indexes, texture measures, terrain factors, and the forest factor) were involved in the AGB models when LASSO and VSURF were used for feature selection. However, the specific variables selected were different. Spectral variables, vegetation indexes, and texture measures played different roles in the AGB models of broadleaf, mixed, and coniferous forests. The spectral and vegetation index variables could be used to distinguish broadleaf and mixed forests, owing to their diversity of tree species and multiple canopy layers. By contrast, because the coniferous forest was dominated by a small number of species, its AGB could be explained well by texture rather than spectral information. It was interesting to note that canopy density contributed significantly to the models for most forest types, especially in broadleaf and mixed. This likely reflects the fact that forest canopy structure was complex and diverse.

Comparison of Model Performance
The key tuning hyperparameters and the configurations of the tuning parameters for 36 AGB models are shown in Table A1. Three machine learning algorithms, RFR, XGBoost, and CatBoost, were used to fit the modeling dataset based on the RFE feature selection method, and the results are presented in Table 5. Comprehensive consideration based on the accuracy indicators (R 2 , RMSE, RMSE%, and Bias), CatBoost was the best machine learning algorithm. For the coniferous forest, the CatBoost algorithm produced more accurate estimates (with RMSE of 26.54 Mg/ha) than the RFR (with RMSE of 26.63 Mg/ha) and XGBoost (with RMSE of 28.81 Mg/ha) algorithms. CatBoost also achieved the lowest RMSE values for different forest types except for mixed forest compared with the other algorithms. Interestingly, sometimes the Bias of the RFR is lower than CatBoost. One possible explanation is that the positive and negative predicted values cancel each other out in the RFR algorithm. The ensemble learning algorithms (XGBoost and CatBoost) have shown potential in fields such as biology and medicine [13,[33][34][35][36]. Here, we found that CatBoost turns out to be a promising method for modeling forest AGB using remote sensing data.  Table 6 presents the estimate accuracy of AGB models in which LASSO was combined with the machine learning algorithms, and Table 7 presents the results obtained using VSURF. For all forest types, models that used RFE-based selection clearly performed better than models that used VSURF or LASSO. For example, for broadleaf, the RMSE of VSURF-selected models (24.77, 25.62, and 25.08 Mg/ha) and LASSO-selected models (25.11, 26.16, and 25.54 Mg/ha) were much larger than those of RFE-selected models (24.67, 25.55, and 24.74 Mg/ha). These results indicate that the RFE-selected dataset retained more useful information from the original dataset, thereby explaining the differences in AGB estimation. The best AGB model predictions were obtained when RFE was combined with the CatBoost algorithm. Moreover, the RFE-CatBoost model (RFE used for feature selection and CatBoost used for regression) performed well in prediction accuracy compared to the VSURF-RFR model (in which random forests were used for both feature selection and regression) for all forest types. This result indicates that the typical method of simultaneous feature selection and regression with the same machine learning algorithm may not always ensure an optimal estimate. The combination of different feature selection and regression algorithms was helpful in improving the accuracy of the AGB model.

Evaluation of AGB Estimation
The scatter plots shown in Figure 5 illustrate the relationships between predicted and observed AGB for different forest types using the RFR, XGBoost, and CatBoost algorithms with the RFE variable selection method. The CatBoost algorithm was superior to the RFR and XGBoost algorithms. Within a given algorithm, the accuracy of the broadleaf model is the highest, followed by the all and mixed models, the coniferous model being the least accurate.

Discussion
Feature selection methods and machine learning algorithms were effective tools for improving the accuracy of AGB estimates. To predict the AGB of four forest types, 36 models were derived using three feature selection methods (RFE, VSURF, and LASSO) and three machine learning algorithms (RFR, XGBoost, and CatBoost) based on NFCI data. CatBoost performed well in the prediction of all forest types compared to XGBoost and RFR. AGB models were most accurate when RFE was used for feature selection, and this method worked better than VSURF or LASSO. The combination of RFE feature selection with the CatBoost algorithm effectively improved the accuracy of AGB estimates. Since feature selection can reduce data dimensionality, minimize data storage space, and improve model interpretability, it has been employed to improve the performance of predictive models. RFE showed the smallest RMSE, followed by VSURF and LASSO. The first five features selected for the RFE-RFR-Broadleaf model were Elevation, Canopy, T9Mea, RVI64, and B3-green. This result demonstrated that the feature selection method effectively selected different types of features, including spectral variables, vegetation indexes, texture measures, terrain factors, and the forest factor, confirming its applicability and effectiveness for AGB estimation from multispectral data. In this study, there is a high correlation between the Mean texture and AGB. It seems likely that these results are in fact due to the reflection of the regularity of the image texture in the Mean texture. The image texture rules of different tree species are quite different; thus, the average image texture can better reflect the tree species diversity in the study area. The Mean texture contains the information on the species of the tree and is beneficial to the diversity prediction of the forest tree species, which is consistent with the finding of Li et al. [63]. Canopy density was proved to have an effect on the accuracy of the AGB estimation. There are many types of trees in the study area, and the variance in the growth status of those tree species leads to the change in the canopy density. A high canopy density can be attributed to the lush of the forest and a high AGB. In a study conducted by Li et al. [63], AGB and

Discussion
Feature selection methods and machine learning algorithms were effective tools for improving the accuracy of AGB estimates. To predict the AGB of four forest types, 36 models were derived using three feature selection methods (RFE, VSURF, and LASSO) and three machine learning algorithms (RFR, XGBoost, and CatBoost) based on NFCI data. CatBoost performed well in the prediction of all forest types compared to XGBoost and RFR. AGB models were most accurate when RFE was used for feature selection, and this method worked better than VSURF or LASSO. The combination of RFE feature selection with the CatBoost algorithm effectively improved the accuracy of AGB estimates. Since feature selection can reduce data dimensionality, minimize data storage space, and improve model interpretability, it has been employed to improve the performance of predictive models. RFE showed the smallest RMSE, followed by VSURF and LASSO. The first five features selected for the RFE-RFR-Broadleaf model were Elevation, Canopy, T9Mea, RVI64, and B3-green. This result demonstrated that the feature selection method effectively selected different types of features, including spectral variables, vegetation indexes, texture measures, terrain factors, and the forest factor, confirming its applicability and effectiveness for AGB estimation from multispectral data. In this study, there is a high correlation between the Mean texture and AGB. It seems likely that these results are in fact due to the reflection of the regularity of the image texture in the Mean texture. The image texture rules of different tree species are quite different; thus, the average image texture can better reflect the tree species diversity in the study area. The Mean texture contains the information on the species of the tree and is beneficial to the diversity prediction of the forest tree species, which is consistent with the finding of Li et al. [63]. Canopy density was proved to have an effect on the accuracy of the AGB estimation. There are many types of trees in the study area, and the variance in the growth status of those tree species leads to the change in the canopy density. A high canopy density can be attributed to the lush of the forest and a high AGB. In a study conducted by Li et al. [63], AGB and LAI were estimated based on digital aerial photograph data in northeast China. When the results were analyzed, it was found that the canopy density extracted from photogrammetric point cloud is the most strongly related to the estimated LAI. However, the underlying mechanisms and relationships should be studied in more detail in the future. The results of the present study agree to a large extent with those of previous studies [6,7,13]. The feature selection capability of RFE has been reported in recent studies of halophyte biomass estimation based on WorldView-2 multispectral data (Rasel et al. [7].) and forest parameter prediction using optical data (Dan Li et al. [63]). The VSURF feature selection method implements a feature selection function by measuring the importance of features [18]. Specifically, it produces a set of feature scores ranked from largest to smallest, rather than an optimal feature combination. A feature set is the final result of RFE, which more fully conveys the information from the original dataset. In addition, the stability of RFE was enhanced owing to the use of "random forests" as the underlying model for RFE during iteration.
The application of machine learning algorithms to optical data processing in quantitative remote sensing is developing rapidly. However, there have been few applications of new ensemble learning algorithms to AGB estimation. Here, the CatBoost algorithm was shown to be superior to XGBoost, but it was only slightly superior to RFR. The RFR algorithm, a parallel integrated algorithm, is not sensitive to predictive variables that contain noise, as the decision tree is independent [64,65]. The decision tree of XGBoost is generated based on the previous tree and improves the processing of regularized learning targets to avoid overfitting [60]. In contrast to XGBoost, CatBoost can avoid overfitting by adding an algorithm to compute leaf nodes when choosing a tree structure [62]. Consequently, Cat-Boost not only performed well on the validation set but also had the lowest degree of overfitting on the validation set. For example, the RFE-ALL-RFR (R 2 = 0.96, RMSE = 10.50 Mg/ha) model outperformed RFE-ALL-CatBoost (R 2 = 0.81, RMSE = 21.76 Mg/ha) in the training dataset, but RFE-ALL-CatBoost (R 2 = 0.73, RMSE=25.77 Mg/ha) turned out to be better than RFE-ALL-RFR (R 2 = 0.72, RMSE = 26.13 Mg/ha) in the validation dataset. CatBoost also had a faster run time than XGBoost, thus improving the efficiency of the operation. In addition, CatBoost works faster in large samples. In the coniferous (n = 358) and mixed (n = 247), RFR was slightly faster than CatBoost. However, in broadleaf (n = 2111) and ALL (n = 2716), CatBoost runs much faster than RFR (Tables 5-7). This finding is in line with the results reported by Zhang et al., which show that the CatBoost algorithm outperformed the RFR and GBRT algorithms for the AGB estimation [37]. Li et al. focused on the potential for machine learning algorithms to improve AGB estimation accuracy [23]. Interestingly, our results showed only a slight difference in prediction performance between XGBoost and RFR, whereas Li et al. reported better results for XGBoost. The different feature selection method used in our study may explain the differences in our results.
The AGB prediction models based on a combination of different feature selection methods and machine learning algorithms were superior to those in which one machine learning algorithm was used for both feature selection and regression. For example, the RFE-CatBoost-Coniferous model (RMSE = 26.54 Mg/ha) outperformed the VSURF-RFR-Coniferous model (RMSE = 27.44 Mg/ha). Previous studies on LAI estimation also reported that using the same machine learning algorithm for both feature selection and regression was not always the optimal solution. It is interesting to note that Chen et al. found the VSURF-RFR model to be the best, although in this case, the study object and the method of comparison were different [38]. Overall, our experiment indicates that using the same machine learning algorithm for feature selection and regression does not improve AGB estimation accuracy. There are two likely causes for such a result. One possible explanation for this might be that the scores are adopted by the RF algorithm as the criterion for feature selection. However, most of the scores only represent the correlation between one particular input variable and the AGB. Therefore, it does not guarantee that the combination of variables with high scores would be the optimal choice for AGB estimation. Additionally, the feature selection process was conducted twice duo to the employment of the combination of the feature selection method and the regression algorithm.
Here, the combination of CatBoost algorithm and feature selection methods was applied to AGB prediction for the first time and achieved the best performance. In this regard, the combination of different feature selection methods and regression algorithms can be considered as a powerful tool in AGB prediction. This tool could be extended to other fields of forest ecology, for example, predicting other biophysical parameters (leaf area index, forest canopy coverage rate, and effective canopy coverage rate), soil nutrient or soil heavy metal content using optical and hyperspectral remote sensing data. To some readers, machine learning methods (especially neural networks) seem like a black box. The three machine learning algorithms (RFR, XGBoost, and CatBoost) selected in our study are easier to understand in the steps of modeling and interpretation. With the advent of the era of big data, more data sources and larger amounts of data are available in the forest ecology field. Machine learning approaches can overcome the disadvantages of traditional approaches in dealing with big data.
Several previous studies have emphasized the effects of ecological conditions, canopy closure, and soil background on AGB estimation [4,17,26,66]. Here, the CatBoost algorithm made a significant improvement to the AGB model. However, the problems of low value overestimation and high value underestimation that have been reported in previous studies have not been completely solved [4,26,67]. All the algorithms underestimated the AGB observed values when the AGB values were larger than about 200 Mg/ha ( Figure 5). On the one hand, the factors affecting AGB estimation are complex. In our study, original spectral variables, vegetation index, texture features and topographic factors are included, but there are also some potentially important variables affecting AGB prediction, such as climate factors, etc. On the other hand, the Landsat OLI data itself has a number of limitations, including a low resolution of 30 m and the obvious mixed pixel phenomenon [32,68]. For plots with low AGB values, the reflectance of spectral bands is affected by ground vegetation and soil. For plots with high AGB values, the multispectral sensors become saturated [47]. Further research is needed to improve AGB estimation with feature selection methods and machine learning algorithms based on hyperspectral, high-resolution optical, and lidar data.

Conclusions
We used three feature selection methods (RFE, VSURF, and LASSO) and three machine learning algorithms (RFR, XGBoost, and CatBoost) to develop AGB estimation models for different forest types. The CatBoost algorithm was applied to AGB prediction from optical remote sensing data for the first time.
The following conclusions can be drawn: (1) Feature selection has a significant influence on the predictive performance of the AGB models. The RFE algorithm is one of the most appropriate feature selection methods for AGB estimation from optical remote sensing data. (2) The CatBoost algorithm better than the XGBoost and RFR algorithms and has great potential for AGB prediction. (3) Using the same machine learning algorithm for feature selection and regression is not always the best approach for AGB estimation.   Acknowledgments: Thanks to the anonymous reviewers for their constructive and valuable comments, and the editors for their assistance in refining this article.

Conflicts of Interest:
The authors declare that there is no conflict of interest regarding the publication of this paper.
Appendix A Table A1. Optimal hyperparameter using a grid search with 10-fold CV for the different AGB models.  Figure A1. Variable importance ranking for the different forest types by combining feature selection method (LASSO) and three machine learning algorithms (RFR, XGBoost, and CatBoost). Figure A1. Variable importance ranking for the different forest types by combining feature selection method (LASSO) and three machine learning algorithms (RFR, XGBoost, and CatBoost). Figure A2. Variable importance ranking for the different forest types by combining feature selection method (VSURF) and three machine learning algorithms (RFR, XGBoost, and CatBoost).