Application of Convolutional Neural Network on Lei Bamboo Above-Ground-Biomass (AGB) Estimation Using Worldview-2

Abstract: Above-ground biomass (AGB) directly relates to the productivity of forests. Precise AGB mapping for regional forests based on very high resolution (VHR) imagery is widely needed for evaluating productivity. However, the diversity of variables and algorithms and the difficulties inherent in high resolution optical imagery make it complex. In this paper, we explored the potential of convolutional neural networks (CNNs), a state-of-the-art algorithm widely used for its high-level representation but rarely applied to AGB estimation. Four experiments were carried out to compare the performance of CNNs and other state-of-the-art machine learning (ML) algorithms: (1) performance of CNN using bands; (2) performance of Random Forest (RF), support vector regression (SVR), and artificial neural network (ANN) on bands and vegetation indices (VIs); (3) performance of RF, SVR, and ANN on gray-level co-occurrence matrix (GLCM) and exploratory spatial data analysis (ESDA) textures; and (4) performance of RF, SVR, and ANN on all combined data and ESDA + VIs. CNN reached satisfactory results (R2 = 0.943) even with limited input variables (i.e., only bands). In comparison, RF and SVR with elaborately designed data obtained slightly better accuracy than CNN. For example, RF based on GLCM textures reached an R2 of 0.979 and RF based on all combined data reached a close R2 of 0.974. However, the results of ANN were much worse (best R2 of 0.885).


Introduction
The above-ground biomass (AGB) of forests is related to the productivity of forests, the carbon cycle, and habitation in terrestrial ecosystems [1,2]. Evaluation of biomass is crucial for monitoring and estimating forest quality [3,4]. Moreover, accurate estimation of biomass could reduce the uncertainty in monitoring global climate changes [5][6][7][8][9]. There are several methods used to evaluate the AGB of forests. Directly calculating biomass by cutting down and weighing whole trees is accurate but destructive and laborious. The objectives of this study were to: (1) explore the potential of CNNs for VHR imagery AGB estimation, (2) test state-of-the-art ML algorithms and compare them with CNN, and (3) test the performance of widely applied variables, including VIs, GLCM textures, and exploratory spatial data analysis (ESDA) textures.

Study Area and Field Survey
The study area is located in Taihuyuan, Zhejiang province, South China (Figure 1). The area has a monsoon-type climate, warm and humid, with sufficient sunshine, abundant rainfall, and four distinctive seasons. The average annual temperature is 16.4 °C and the total annual precipitation is 1628.6 mm. Vegetation is mainly composed of broadleaf forests, coniferous forests (mainly Cunninghamia lanceolata and Pinus massoniana Lamb), and lei bamboo (Phyllostachys praecox C. D. Chu et C. S. Chao 'Prevernalis') forests. Lei bamboos are important economic crops that cover large areas inside the study area, and residents pay a lot of attention to their intensive cultivation.
The field survey (Figure 1) was carried out in July 2019 and included 45 plots. For each plot, a 5 m × 5 m square (25 m2) was placed to cover a homogeneous area of lei bamboos, inside which the number of trees and the average DBH were measured. A Unistrong mobile GIS was used to determine the precise location of the center of every plot. The Global Positioning System (GPS) accuracy is 2-5 m based on one satellite and improves to 1-3 m with the Satellite-Based Augmentation System (SBAS). The points were further used to extract features from the satellite imagery. Allometric models are critical in preparing samples for RS-based biomass estimates. In this study, we took the allometric model constructed in the same area for lei bamboo forests [44,45]. The AGB (kg) of a plot is calculated by:

AGB = N × 0.1939 D^1.5654 (1)

where N indicates the number of trees and D is the average DBH.
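As a quick sanity check, the allometric model of Equation (1) can be evaluated directly. The sketch below assumes D is in centimeters and the per-area density is obtained by dividing by the 25 m2 plot size, as described in the text; the example plot (50 culms, average DBH 4 cm) is hypothetical.

```python
def plot_agb_kg(n_trees, mean_dbh_cm):
    """Allometric model for lei bamboo, Equation (1): AGB = N * 0.1939 * D^1.5654."""
    return n_trees * 0.1939 * mean_dbh_cm ** 1.5654

def plot_agb_density(n_trees, mean_dbh_cm, plot_area_m2=25.0):
    """Per-area AGB (kg/m^2), as used for the 25 m^2 field plots."""
    return plot_agb_kg(n_trees, mean_dbh_cm) / plot_area_m2

# Hypothetical plot: 50 culms with an average DBH of 4 cm
agb = plot_agb_kg(50, 4.0)
density = plot_agb_density(50, 4.0)
```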


Satellite Data
Since its launch in October 2009, WorldView-2 has been in orbit acquiring data of high spatial as well as high spectral resolution, providing a reliable source for monitoring environmental changes [46]. In this study, the satellite data were obtained in July 2016. The image contains 4 bands: Blue (B, 450-510 nm), Green (G, 510-581 nm), Red (R, 630-690 nm), and Near-infrared (NIR, 770-895 nm). The spatial resolution is 1.2 m. The image preprocessing, including radiometric correction and atmospheric correction, was performed using ENVI 5.3. The radiance image was atmospherically corrected and transformed into canopy reflectance using the Fast Line-of-Sight Atmospheric Analysis of Spectral Hypercubes (FLAASH) algorithm.

Variables Extraction
In this section, descriptions of the three chosen variable types are given; the detailed formulas are shown in Table 1. Forty-five points represented the precise locations of the 45 plots. For every point, a 5 × 5 square of pixels around the center point was obtained and every pixel was regarded as a separate sample; the samples therefore included 1125 pixels in total. However, one caveat must be pointed out: given that the spatial resolution of our satellite data is 1.2 m while the plot size is 5 m, the obtained pixel windows cover 6 × 6 m2, slightly larger than the real plots. Besides, some studies abandoned all partly included pixels and computed the mean value of the pixels from each plot. The design of this experiment is explained in Section 4.3. The samples were divided into training and testing samples with a ratio of 3:1; the pixels belonging to one plot were assigned entirely to either the training or the testing set.
VIs, most of which are computed directly from bands, can reveal the growth status of vegetation. In our study, the VIs included the Difference Vegetation Index (DVI), Normalized Difference Vegetation Index (NDVI), Normalized Difference Water Index (NDWI), and Ratio Vegetation Index (RVI) [33]. The NDWI was included mainly for classification purposes.
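The four VIs are simple band-wise arithmetic. The sketch below assumes the standard formulations (DVI = NIR − R, NDVI = (NIR − R)/(NIR + R), RVI = NIR/R, and the McFeeters NDWI = (G − NIR)/(G + NIR)); the paper's Table 1 gives the exact formulas used, so this is only an illustrative implementation.

```python
import numpy as np

def vegetation_indices(g, r, nir):
    """Compute the four VIs used in this study from reflectance arrays.
    Assumes the common formulations; see the paper's Table 1 for the exact ones."""
    g, r, nir = (np.asarray(a, dtype=float) for a in (g, r, nir))
    dvi = nir - r                      # Difference Vegetation Index
    ndvi = (nir - r) / (nir + r)       # Normalized Difference Vegetation Index
    ndwi = (g - nir) / (g + nir)       # Normalized Difference Water Index (McFeeters)
    rvi = nir / r                      # Ratio Vegetation Index
    return dvi, ndvi, ndwi, rvi

# One example reflectance triple (green, red, NIR)
dvi, ndvi, ndwi, rvi = vegetation_indices([0.10], [0.05], [0.45])
```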
Exploratory spatial data analysis (ESDA) represents a series of techniques used to statistically analyze spatial data and mine knowledge of features' spatial structure and correlation [47]. Information provided by ESDA is valuable for classification and spatial analysis [48,49]. However, few studies have taken ESDA into AGB estimation. In this study, we took three indicators of ESDA into consideration: Moran's I, Geary's C, and the G statistic [50][51][52][53]. Thus, for each band there are 3 ESDA textures, giving 12 ESDA textures for each ESDA window size.
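To make the ESDA indicators concrete, here is a minimal sketch of global Moran's I over a pixel window, using rook (4-neighbour) contiguity with binary weights; Geary's C and the G statistic can be computed over the same neighbour structure. The window data below are synthetic, and the paper does not specify the weighting scheme, so the rook contiguity is an assumption.

```python
import numpy as np

def morans_i(window):
    """Global Moran's I over a square window using rook (4-neighbour) contiguity.
    Values near +1 indicate spatial clustering, near -1 dispersion, near 0 randomness."""
    x = np.asarray(window, dtype=float)
    z = x - x.mean()                # deviations from the window mean
    num = 0.0                       # sum of w_ij * z_i * z_j over neighbour pairs
    w_sum = 0.0                     # total weight W
    rows, cols = x.shape
    for i in range(rows):
        for j in range(cols):
            for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ni, nj = i + di, j + dj
                if 0 <= ni < rows and 0 <= nj < cols:
                    num += z[i, j] * z[ni, nj]
                    w_sum += 1.0
    n = x.size
    return (n / w_sum) * num / (z ** 2).sum()

# A smoothly varying window is strongly positively autocorrelated ...
smooth = np.arange(25, dtype=float).reshape(5, 5)
# ... while a checkerboard is strongly negatively autocorrelated
checker = np.indices((5, 5)).sum(axis=0) % 2
```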
The GLCM [54] is a tabulation of how often different combinations of pixel brightness values (grey levels) occur in an image. GLCM textures are based on differences between pairs of pixels in a spatially defined relationship and consider all pixel pairs within the neighborhood; more information about the GLCM can be found in [55,56]. GLCM textures have been proven to be vastly informative for high resolution AGB estimation [19,25,26]. Though widely used, the window sizes of GLCM textures deserve attention; in particular, GLCM textures based on VHR imagery are much more informative than those based on high or middle resolution imagery. In our study, GLCM window sizes were set from 3 to 53 to explore the influence of spatial scales. For each window size, there were 4 bands, 4 directions, and 8 kinds of GLCM textures: Mean (MEA), Variance (VAR), Homogeneity (HOM), Contrast (CON), Dissimilarity (DIS), Entropy (ENT), Angular Second Moment (ASM), and Correlation (COR). Therefore, we got 128 GLCM textures for each window size.
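The tabulation described above can be sketched directly: count co-occurring grey-level pairs for one offset (direction), normalize to a joint probability, and reduce it to texture measures. The sketch below is a minimal illustration covering five of the eight measures; the grey-level count (8) and quantization scheme are assumptions, not the study's settings.

```python
import numpy as np

def glcm_textures(img, levels=8, offset=(0, 1)):
    """Normalized GLCM for one offset, plus five of the eight texture
    measures used in this study (MEA, VAR, HOM, CON, ENT)."""
    img = np.asarray(img)
    # Quantize to `levels` grey levels, as is usual before building a GLCM
    q = np.floor(img.astype(float) / (img.max() + 1) * levels).astype(int)
    glcm = np.zeros((levels, levels), dtype=float)
    dr, dc = offset
    rows, cols = q.shape
    for r in range(rows):
        for c in range(cols):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                glcm[q[r, c], q[nr, nc]] += 1   # count the co-occurring pair
    glcm /= glcm.sum()                          # joint probability P(i, j)
    i, j = np.indices(glcm.shape)
    mea = (i * glcm).sum()
    var = ((i - mea) ** 2 * glcm).sum()
    hom = (glcm / (1.0 + (i - j) ** 2)).sum()
    con = ((i - j) ** 2 * glcm).sum()
    ent = -(glcm[glcm > 0] * np.log(glcm[glcm > 0])).sum()
    return {"MEA": mea, "VAR": var, "HOM": hom, "CON": con, "ENT": ent}

# Alternating bright/dark stripes: maximal contrast, minimal homogeneity
tex = glcm_textures(np.tile([0, 255], (8, 4)))
```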

Machine Learning Algorithms
The relationships between biomass and RS variables are usually nonlinear, so using nonparametric models to estimate biomass in a data-driven manner is prevalent. However, it is hard to determine which ML algorithm is best without experiments, and comparing different algorithms helps reveal the relationships between biomass and RS variables [13]. Four ML models, convolutional neural networks (CNNs), random forest (RF), support vector regression (SVR), and artificial neural networks (ANNs), were applied to estimate AGB. Unless otherwise stated, every algorithm was run 50 times to reduce the uncertainty arising from the algorithms and sample splits.

CNN
Convolutional neural networks (CNNs) in our study served as spatial information extractors. The formula of convolution [40] is given by:

v^(l,j)_(x,y) = f( Σ_m Σ_(h=0..H_i−1) Σ_(w=0..W_i−1) k^(l,j,m)_(h,w) · v^((l−1),m)_(x+h, y+w) + b^(l,j) )

where v^(l,j)_(x,y) is the value at position (x, y) of the jth feature map in the lth layer, k^(l,j,m)_(h,w) is the value at the position (h, w) of the kernel connected to the mth feature map in the (l − 1)th layer, H_i and W_i are the height and width of the kernel, respectively, b^(l,j) is the bias of the jth feature map in the lth layer, and f(x) is an activation function.
The derivatives of the loss function with respect to the parameters are computed based on the chain rule along the computational graph during backward propagation. We denote the gradient of a trainable parameter w with respect to the loss as grad_w. Based on grad_w, we applied the Adam [60] algorithm as the optimizer, by which the learning rate changes adaptively during the training process. The formulas of the Adam algorithm are given by:

M1(t) = β1 · M1(t−1) + (1 − β1) · grad_w
M2(t) = β2 · M2(t−1) + (1 − β2) · grad_w^2
b1 = M1(t) / (1 − β1^t), b2 = M2(t) / (1 − β2^t)
w ← w − lr · b1 / (√b2 + ε)

where M1 and M2 are the first and second momentum, while b1 and b2 are the amendments to M1 and M2 that avoid a 'big step' at the beginning of training, and t is the number of training epochs. Other hyperparameters must be defined by empirical studies: β1 and β2 are the decay rates of the momentum, usually set as 0.9 and 0.999, respectively; lr is the learning rate, set as 3e-4; ε is a small number used to avoid division by zero. In our study, the CNN was designed with a simple structure so that more repeats could be conducted to reduce the uncertainty of samples and to test the inputs' window sizes. For every pixel, the biomass is estimated based on the spatial and spectral properties of proximal pixels [26], as shown in Figure 2, in other words, sliding windows. The CNN used included 8 convolutional layers and 2 linear fully connected layers at the end. The Adam algorithm was used for optimization with a learning rate of 3e-4. The loss function was MSE (mean squared error), the same as for ANN. The activation function was ReLU: f(x) = max(x, 0). The parameter settings of the CNN were the same over different inputs in our study.
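The Adam update described above can be sketched in a few lines. The implementation below follows the momentum and bias-correction scheme in the text (β1 = 0.9, β2 = 0.999, lr = 3e-4) and is demonstrated on a simple quadratic loss; the toy objective is an illustration, not the study's training setup.

```python
import numpy as np

def adam_step(w, grad_w, m1, m2, t, lr=3e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update, following the momentum/bias-correction formulas in the text."""
    m1 = beta1 * m1 + (1 - beta1) * grad_w          # first momentum M1
    m2 = beta2 * m2 + (1 - beta2) * grad_w ** 2     # second momentum M2
    b1 = m1 / (1 - beta1 ** t)                      # bias-corrected M1
    b2 = m2 / (1 - beta2 ** t)                      # bias-corrected M2
    w = w - lr * b1 / (np.sqrt(b2) + eps)           # parameter update
    return w, m1, m2

# Minimize the toy loss f(w) = w^2 (gradient 2w) starting from w = 5
w, m1, m2 = 5.0, 0.0, 0.0
for t in range(1, 20001):
    w, m1, m2 = adam_step(w, 2 * w, m1, m2, t)
```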


RF
RF [61] is a state-of-the-art ensemble algorithm, widely applied for classification and regression for its robustness and understandable feature-selection procedures [12,23]. RF also provides other appealing statistical properties, such as useful internal estimates of error, correlation, and variable importance [20]. RF constructs numerous regression trees and determines the final result by averaging all of them. RF is built up by these steps: (1) bootstrap is used to randomly select one sample of the training data with replacement; (2) step 1 is repeated until the number of selected samples equals the size of the training data. This procedure is called bagging. Some samples may be picked more than once and some not at all; around 37% of the training data are not selected in a bagging round. (3) A Classification and Regression Tree (CART) is built upon the data selected in (2) without pruning, with several variables (n = mtry) randomly selected to determine the best split of each node based on the Gini impurity of all the variables. (4) Numerous trees (n = ntree) are put together and vote (average, in regression) for the final outputs. In this study, the RF algorithm was carried out based on the R [62] add-in package randomForest [63].
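The "around 37%" out-of-bag figure in steps (1)-(2) comes from (1 − 1/n)^n → 1/e ≈ 0.368, and is easy to verify with a simulated bagging round:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000                                   # training-set size
# One bagging round: draw n samples with replacement (steps 1-2 above)
picked = rng.integers(0, n, size=n)
oob_fraction = 1.0 - np.unique(picked).size / n
# The expected out-of-bag fraction is (1 - 1/n)^n, which tends to 1/e ~ 0.368
```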
The classification map was obtained using an RF classifier. Bands, VIs, and the 8 kinds of GLCM textures for the 4 bands with window sizes of 3, 5, 7, 9, 11, and 13 were included. In order to reduce the dimensionality of the samples in the classification procedure, the mean values of the 4 directions (0°, 45°, 90°, 135°) of the GLCM were calculated. Therefore, 4 bands, 4 VIs, and 192 GLCM textures (4 bands × 8 kinds × 6 window sizes) gave 200 variables in total. The samples were labeled based on expert experience; 4000 pixels were randomly selected as samples for the RF classifier and were divided into training and testing samples with a ratio of 3:1. The optimal value of mtry in the classification experiment was obtained by traversing all possible values, and ntree was set to 2000.
The AGB estimate was based on the 1125 pixels we obtained. Variables were extracted and RF regression was used to construct models. AGB samples were randomly divided into training and testing samples with a ratio of 3:1 at each repeat. Due to the extensive computation in our study and the insensitivity of RF to parameter settings, we took 1/3 of the number of variables as the value of mtry, as recommended by previous studies [63,64]. The ntree values of all repeats were set to 2000. Detailed parameters are listed in the Results section.
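The study ran RF regression in R's randomForest; an equivalent setup (ntree = 2000, mtry = one third of the variables, 3:1 train/test split) can be sketched with scikit-learn. The data below are synthetic stand-ins, not the study's samples.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(400, 9))                  # synthetic stand-in for 9 texture variables
y = X[:, 0] * 2 + np.sin(X[:, 1]) + rng.normal(scale=0.1, size=400)

# 3:1 train/test split, as in the study
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
rf = RandomForestRegressor(
    n_estimators=2000,      # ntree = 2000, as in the study
    max_features=1 / 3,     # mtry = one third of the variables
    random_state=0,
)
rf.fit(X_tr, y_tr)
r2 = rf.score(X_te, y_te)   # coefficient of determination on the held-out quarter
```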

SVR
Support vector machine (SVM) [14] is a supervised non-parametric statistical learning technique; therefore, no assumption is made about the underlying data distribution or the dimensions of the input space [38,65]. SVMs are presented with a set of labeled data instances, and the SVM training algorithm aims to find a hyperplane that separates the input data into a discrete predefined number of classes in a fashion consistent with the training examples. SVM is also appealing for regression, where it is denoted SVR. Many excellent works apply SVR to RS tasks, such as retrieval of oceanic chlorophyll concentration [66] and biomass estimation [14].
In our study, the SVR algorithm was carried out based on R [62] with an RBF kernel. The optimization of SVR depends mostly on C and ε, where C balances the minimization of errors against the regularization term, and ε defines a loss-free tube around the calibration data inside which deviations are not penalized [39]. The SVR algorithm is sensitive to parameter settings; therefore, for every feature set, C and ε were tested over a range of values to determine the values that maximized the accuracies. Detailed parameters are listed in the Results section.
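The C/ε search described above can be sketched as a cross-validated grid search over an RBF-kernel SVR. The study used R, so the scikit-learn pipeline below is only an illustration, on synthetic data, with an arbitrary coarse grid:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(300, 4))          # synthetic stand-in features
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.05, size=300)

# RBF-kernel SVR; search C and epsilon over a coarse grid, as in the text
grid = GridSearchCV(
    make_pipeline(StandardScaler(), SVR(kernel="rbf")),
    param_grid={"svr__C": [0.1, 1, 10, 100], "svr__epsilon": [0.01, 0.1, 0.5]},
    cv=5,
    scoring="r2",
)
grid.fit(X, y)
best_r2 = grid.best_score_    # best cross-validated R2 over the grid
```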

ANN
ANNs have been developed for decades. A standard ANN has one input layer, one output layer, and numerous hidden layers, and uses activation functions such as Sigmoid, ReLU, and Tanh for non-linearization. Recently, new computing technology and the development of artificial intelligence have made deeper and larger neural networks possible. For example, the dropout [67] algorithm can dramatically and efficiently reduce the risk of overfitting.
In our study, dropout and batch normalization [68] were used to improve the models. The ANN used included 9 hidden layers. The Adam algorithm was used for optimization [60] with a learning rate of 3e-4. MSE (mean squared error) was applied as the loss function. The activation function was ReLU: f(x) = max(x, 0). The parameter settings of the ANN were the same over different inputs in our study.
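Dropout itself is a small mechanism: during training, each unit is zeroed with probability p and the survivors are rescaled so that the expected activation is unchanged. The sketch below shows the common inverted-dropout form; whether the study used this exact variant is not stated, so it is an assumption.

```python
import numpy as np

def inverted_dropout(activations, p_drop, rng):
    """Inverted dropout: zero a fraction p_drop of units and rescale the rest
    by 1/(1 - p_drop) so the expected activation is unchanged at test time."""
    mask = rng.random(activations.shape) >= p_drop   # True = keep the unit
    return activations * mask / (1.0 - p_drop)

rng = np.random.default_rng(0)
acts = np.ones((100_000,))
dropped = inverted_dropout(acts, 0.5, rng)           # about half are zeroed
```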

Experiment Design
The flowchart of steps is shown in Figure 3. The experiments were designed in four parts: (1) a classification map was produced using RF; (2) CNN was used to estimate AGB with bands as inputs; (3) RF, SVR, and ANN were individually applied to the widely used variables; and (4) according to the performance of all these methods, the relatively better methods were selected for mapping AGB with the land-use cover map. Three indicators were used to evaluate the performance of all the AGB estimations: the coefficient of determination (R2), root mean squared error (RMSE), and relative RMSE (RMSEr).
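The three indicators can be computed in a few lines. The sketch below uses the standard definitions; the paper does not spell out its RMSEr denominator, so dividing RMSE by the mean observed AGB is an assumption.

```python
import numpy as np

def evaluate(y_true, y_pred):
    """The three accuracy indicators used in the experiments: R2, RMSE, RMSEr."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    resid = y_true - y_pred
    rmse = np.sqrt((resid ** 2).mean())
    rmser = rmse / y_true.mean()                 # relative RMSE (assumed: RMSE / mean observed)
    ss_res = (resid ** 2).sum()
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()
    r2 = 1.0 - ss_res / ss_tot                   # coefficient of determination
    return r2, rmse, rmser

r2, rmse, rmser = evaluate([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
```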

AGB Samples
Using the allometric model, the AGB of each plot was computed and then divided by the plot size (25 m2). The basic statistics of the field-measured plant density, DBH, and AGB are shown in Table 2. The vegetation of the study area is complex; in order to map the AGB of lei bamboos and provide the spatial distribution of major local vegetation, a classification map is a requisite. Using the RF classifier described in Section 2.4.2, the classification results were produced and are shown in Figure 4 and Table 3. Though bamboo forests cover most of the study area, they are scattered and mixed with other land-cover types. Bamboo forests are intensively raised at the center of the study area, coniferous forests mainly grow on mountains, and broadleaf forests are mainly distributed in the north of the study area.


AGB Estimation Based on CNN
In this section, CNN was tested as an automatic extractor of spatial information in comparison with the other variables. Window sizes from 5 to 53 with a step of 4 pixels were tested. For every training/testing sample set, the CNN was trained for 500 epochs. The results of CNN are reported in Figures 5 and 6. The best results were found in CNN_13 with RMSE = 0.274 Kg/m2, RMSEr = 23.1%, and R2 = 0.943. The next was CNN_17 with RMSE = 0.276 Kg/m2, RMSEr = 23.5%, and R2 = 0.940. Figure 6 shows test results for different training epochs and window sizes. For reference, the RF classification map reached an Overall Accuracy (OA) of 0.957 and a kappa coefficient of 0.952 (Table 3).


AGB Estimation Based on Other ML Algorithms
In this section, RF, ANN, and SVR are applied to all the RS variables we obtained. To simplify the description in the following sections, the parameter settings and denotations of all the scenarios are listed in Table 4, where x is a number indicating the window size of the texture variables.


Bands and VIs
The AGB estimation results using bands and VIs + bands are reported in this section. The Bands scenario used the visible bands (R, G, B) and the NIR band as inputs. The VIs scenario used the 4 kinds of VIs (DVI, NDVI, NDWI, RVI) and the 4 raw bands as inputs. Detailed results are reported in Figure 7, where box plots represent the distribution of the 50 repeats. The results of Bands and VIs are not satisfactory: R2 of all the ML algorithms is lower than 0.3 and RMSE is higher than 1.0 Kg/m2. The differences between the 1/4 quantile and 3/4 quantile of each box are around 0.1, which means there is not much variation when different subsets of samples are used.

For further and simplified comparison, the averaged results of the 50 repeats are reported in Table 5 to represent the performance of Bands and VIs + Bands. The best performing model using bands was ANN_Bands with RMSE = 1.041 Kg/m2, RMSEr = 87.2%, and R2 = 0.238. The best performing model using VIs was ANN_VIs with RMSE = 1.032 Kg/m2, RMSEr = 86.7%, and R2 = 0.248. Both SVR and RF performed worse than ANN (Table 5).

Detailed results for the texture variables are reported in Figure 8, where box plots represent the distribution of the 50 repeats. The GLCM textures of different window sizes were used independently. When GLCM textures are used, RF and SVR reach much higher accuracy than ANN and are also more stable, especially when the window sizes are larger than 19 pixels. The R2 values of RF are much more stable than those of SVR when window sizes are large. The window sizes of the GLCM textures have a great impact on the results. The averaged results of the 50 repeats were computed and the highest values of all window sizes are reported in Table 6 to represent the performance of GLCM and ESDA.
In the GLCM scenarios, the best result was obtained in GLCM_RF_37 with RMSE = 0.169 Kg/m2, RMSEr = 14.2%, and R2 = 0.979. The next was GLCM_SVR_35 with RMSE = 0.203 Kg/m2, RMSEr = 17.0%, and R2 = 0.970, close to GLCM_RF_37. In the ESDA scenarios, the best result was obtained in ESDA_RF_51 with RMSE = 0.610 Kg/m2, RMSEr = 51.4%, and R2 = 0.735. The next was ESDA_ANN_41 with RMSE = 0.708 Kg/m2, RMSEr = 59.4%, and R2 = 0.646.

Detailed results for the combined data are reported in Figure 9, where box plots represent the distribution of the 50 repeats. When all combined data are used, the results of ANN outperform RF and SVR at the first several window sizes, but are much lower than RF and SVR when larger window sizes are used. The R2 values of RF and SVR increase from around 0.5 to around 0.9 as the window sizes are enlarged from 3 to 19 pixels, while the improvement of ANN is limited (from around 0.7 to around 0.8). In the ANN_ESDA scenario, the window sizes of ESDA have little influence on the accuracy of the AGB estimate. The incorporation of VIs with ESDA reached similar accuracy compared to using ESDA alone.
The averaged results of the 50 repeats were computed and the highest values of all window sizes are reported in Table 7 to represent the performance of all combined data and ESDA + VIs. The best model using all combined data was All_RF_51 with RMSE = 0.193 Kg/m2, RMSEr = 16.2%, and R2 = 0.974. The best model using ESDA + VIs was ESDA + VIs_ANN_41 with RMSE = 0.574 Kg/m2, RMSEr = 48.2%, and R2 = 0.765.

AGB Mapping
After comparison among all the variables that we took into consideration, GLCM textures were chosen for AGB mapping and the window sizes for each algorithm were selected based on Table 6. Thus, the models used for AGB mapping included GLCM_RF_37, GLCM_SVR_35, and GLCM_ANN_27. The CNN model was selected as the best and most stable model according to Figure 5; CNN_13 was chosen for AGB mapping. For each scene, each algorithm was run for 10 repeats and generated 10 AGB maps, from which the mean values (Mean) and standard deviation (Std) were computed. The results are shown in Figure 10. The Std of RF and SVR was much lower than that of ANN and CNN.

AGB Estimation Using Different Algorithms
In our study, four algorithms, RF, SVR, ANN, and CNN, were used for AGB estimation. Though some parameter settings used in this study were less strict than in recent studies that focus on the comparison of different models, the discussion and comparison of different models could provide valuable information about the models and, more importantly, reveal the information behind the variables by reducing the uncertainty caused by the models.
It is widely acknowledged that both CNN and ANN are sensitive to structure and parameter settings and are computation-intensive. In particular, ANN is hard to cope with, though it has been in application for decades [69,70]. Even with recently developed techniques, such as dropout, deep layers, and more nodes in hidden layers, ANN was still uncertain and time-consuming. Among the four algorithms used in our study, only CNN took no affiliated information; only bands were used. CNN reached very satisfactory results, as shown in Figures 5 and 6. However, the uncertainty in structure choice and input window size is hard to fully understand. We investigated the uncertainty that existed mainly in the training process and the window size. It was obvious that some of the CNNs showed low accuracy (Figure 5). This may be due to the gradient-driven training getting stuck in a local optimum; however, this rarely happened. When the window sizes were larger than nine, the accuracy decreased as the window sizes grew. The CNN, as aforementioned, had a simpler structure in our study than mainstream structures, which are complex and deep. The results of the CNN showed that certain input window sizes (CNNs with window sizes of five and seven were much better and more stable than others) were more promising. Our hypothesis for these results is: (1) when the window sizes are too small, there is not enough information for AGB estimation; (2) when the window sizes are too large, there is too much information, so that the CNN overweights unimportant information, leading to overfitting. Thus, the optimal scale for CNN may relate to the number of samples, land cover, spatial and spectral resolution, and sensor type. It is not entirely fair to compare CNN with SVR, RF, and ANN, because we did not input more data into CNN; however, the results showed that CNN was comparable with the other ML algorithms, even with bands only.
In conclusion, deep learning models can become powerful tools for precision forest biomass monitoring if properly adjusted and tuned. Figure 10 directly reveals the stability of the algorithms. The only difference among the four algorithms in Figure 10 was that the inputs were randomly selected from the sample set. RF and SVR showed much better stability than ANN and CNN, which suggests that ANN and CNN are much more sensitive to the inputs even when the inputs come from the same dataset. Considering the widely acknowledged fact, discussed above, that DL methods are sensitive to structures and hyper-parameter settings, DL algorithms need improvements in stability. As mentioned above, we generated 25 pixels from each 5 × 5 m plot. To test the consistency of this splitting of samples, the mean value of the 25 pixels of each plot was computed and is shown in Figure 11. The results were obtained from the AGB maps as explained, namely GLCM_RF_37, GLCM_SVR_35, GLCM_ANN_27, and CNN_13. Figure 11 shows that splitting pixels from one plot is feasible for AGB mapping. In particular, CNN showed great consistency within one model, although the variation between different CNN models was large (Figure 10).
SVMs are particularly appealing in the remote sensing field due to their ability to handle small training data sets successfully, often producing higher classification accuracy than traditional methods [38]. The results of our study are consistent with many previous works suggesting that SVR outperforms neural networks and traditional regression algorithms on most occasions, even the best ANN model of the time. The problem is that SVR is sensitive to parameter initialization in terms of C, ε, and the choice of kernel. In other words, the result of SVR depends entirely on the regularization of the model parameters [71]. RF, on the other hand, is robust and insensitive to parameter settings. Investigations have shown that whether the optimal parameters are found affects accuracy by about 10% or less [61]. Moreover, the understandable procedures of decision-making and variable selection make RF a hot spot in many fields. Thus, the results of RF are generally more informative and deserve more attention. In our study, RF performed much worse when using bands and VIs (with R 2 of 0.076 and 0.049, respectively) and showed slightly higher accuracy than SVR in the GLCM and all-combined scenarios.
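The sensitivity of SVR to C, ε, and the kernel described above is usually handled by a cross-validated grid search. A minimal sketch with scikit-learn follows; the synthetic data and parameter grid are illustrative stand-ins, not the settings used in this study.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the plot-level samples: 100 samples, 8 band values
rng = np.random.default_rng(0)
X = rng.random((100, 8))
y = X @ rng.random(8) + 0.1 * rng.standard_normal(100)  # fake AGB response

# Feature scaling matters for SVR; search C, epsilon, and the kernel jointly
pipe = make_pipeline(StandardScaler(), SVR())
grid = GridSearchCV(
    pipe,
    param_grid={
        "svr__C": [0.1, 1, 10, 100],
        "svr__epsilon": [0.01, 0.1, 0.5],
        "svr__kernel": ["rbf", "linear"],
    },
    cv=5,
    scoring="r2",
)
grid.fit(X, y)
best = grid.best_params_
```

Wrapping the scaler and the regressor in one pipeline keeps the scaling inside each cross-validation fold, which avoids leaking test-fold statistics into training.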

AGB Estimation Using Different Variables
Bands (RMSE = 1.041 Kg/m 2 , RMSEr = 87.2%, and R 2 = 0.238) and VIs + bands (RMSE = 1.032 Kg/m 2 , RMSEr = 86.7%, and R 2 = 0.248) showed poor accuracy compared to the other variables and combined variables. The improvement achieved by SVR was the highest. The literature shows different attitudes toward the idea that VIs correlate significantly with AGB for VHSR; some suggest that bands and VIs do not work well as indicators of AGB based on VHSR. In our study, however, we must acknowledge that the VIs were not studied as comprehensively as in some other excellent works [22,23,26]: we only used VIs in their widely used forms and did not consider all possible band combinations.
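The "widely used forms" of VIs mentioned above are simple band ratios. As a sketch, two common ones can be computed directly from the red and NIR bands; the toy reflectance values below are illustrative, not measurements from this study.

```python
import numpy as np

def ndvi(nir, red, eps=1e-9):
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red)."""
    return (nir - red) / (nir + red + eps)

def ratio_vi(nir, red, eps=1e-9):
    """Simple Ratio vegetation index: NIR / Red."""
    return nir / (red + eps)

# Toy reflectance bands (values in [0, 1])
red = np.array([[0.10, 0.20], [0.15, 0.05]])
nir = np.array([[0.60, 0.40], [0.45, 0.50]])

vi1 = ndvi(nir, red)
vi2 = ratio_vi(nir, red)
```

The small `eps` guards against division by zero over water or shadow pixels where both bands can approach zero.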
In our study, GLCM textures of different window sizes were used independently. A very promising result was obtained using GLCM_RF_37 (RMSE = 0.169 Kg/m 2 , RMSEr = 14.2%, and R 2 = 0.979), which is consistent with many studies reporting GLCM to be excellent. For example, Nichol et al. [1] report R 2 = 0.954 and RMSE = 30.10 using GLCM and two optical VHSR images. Moreover, GLCM textures are not useful exclusively for optical data: Kuplich et al. [25] report that incorporating GLCM textures improved the adjusted coefficient of determination (R a 2 ) from 0.74 to 0.82 based on Synthetic Aperture Radar (SAR) data. The difference in our study is that GLCM textures with large window sizes were used. A single GLCM window size, unlike in previous studies that combined several smaller window sizes such as 3 × 3, 5 × 5, and 7 × 7, could reach excellent AGB estimation results. In our study, small-window GLCM textures did not perform well with any of the algorithms. The influence of GLCM window size in our study shows an increasing tendency [46] regardless of the algorithm, although this tendency became unstable when the size exceeded 43. Some studies suggest that a larger window size can result in lower R 2 [19], and some indicate that large windows may not extract texture information efficiently due to over-smoothing of the textural variations, although most studies did not conduct comprehensive experiments on this point [1]. For instance, Janet et al. [46] tested GLCM textures with window sizes of 15 to 23, and Zhang et al. [19] tested sizes from three to nine. Therefore, large-window GLCM textures should be regarded as potentially useful, since our range of window sizes was not sufficient to conclude that large windows lead to worse results. ESDA textures are informative, though limited; in our study, adding ESDA did make some improvements.
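To make the window-size discussion concrete, the sketch below computes one GLCM statistic (contrast) for a single window from first principles in NumPy. It is a simplified illustration, assuming one co-occurrence direction and 8 gray levels, not the texture-extraction code used in this study.

```python
import numpy as np

def glcm_contrast(window, levels=8, offset=(0, 1)):
    """GLCM contrast for one image window (one direction, symmetric matrix).

    window : 2-D grayscale array (e.g. a 37 x 37 pixel window)
    levels : number of gray levels after quantization
    offset : pixel offset defining the co-occurrence direction
    """
    # Quantize the window to integer levels in [0, levels - 1]
    w = window.astype(float)
    q = np.floor((w - w.min()) / (w.ptp() + 1e-12) * levels).astype(int)
    q = np.clip(q, 0, levels - 1)

    # Count co-occurring level pairs along the given offset
    dr, dc = offset
    a = q[: q.shape[0] - dr, : q.shape[1] - dc]   # reference pixels
    b = q[dr:, dc:]                               # neighbour pixels
    glcm = np.zeros((levels, levels))
    np.add.at(glcm, (a.ravel(), b.ravel()), 1)
    glcm += glcm.T                                # make the matrix symmetric
    glcm /= glcm.sum()                            # normalize to probabilities

    # Contrast = sum over (i, j) of P(i, j) * (i - j)^2
    i, j = np.indices((levels, levels))
    return np.sum(glcm * (i - j) ** 2)

flat = np.full((37, 37), 5.0)    # homogeneous window -> contrast 0
noisy = np.random.rand(37, 37)   # heterogeneous window -> contrast > 0
```

Sliding such a window (e.g. 37 × 37, as in GLCM_RF_37) over the image and evaluating several statistics per position yields the texture layers used as regression inputs.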
For example, ESDA_RF, ESDA_SVR, and ESDA_ANN were 0.659, 0.3, and 0.408 higher than Bands_RF, Bands_SVR, and Bands_ANN in terms of R 2 , and ESDA+VIs_RF, ESDA+VIs_SVR, and ESDA+VIs_ANN were 0.663, 0.539, and 0.517 higher than VIs_RF, VIs_SVR, and VIs_ANN. Although we used the best ESDA results in this comparison, the differences between the ESDA window sizes were not very large (Figures 8 and 9). All combined data made some improvement over GLCM when the window sizes were small. Incorporating all these variables quickly expands the dimension of the inputs; however, a great advantage of ML algorithms is that no distributional assumption is required, which made this feasible. The problem is that the all-combined data made little or no improvement on average, or when the window sizes were large. We speculate that this is because GLCM textures are much more informative than the other variables in our study. When the window sizes were small, however, ESDA performed better than GLCM, which led to an improvement at small window sizes.
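ESDA textures are local spatial-association statistics computed in a moving window. As one common example (chosen here for illustration; the specific indicators used in this study are described in the methods), a local Moran's I with rook contiguity can be sketched as:

```python
import numpy as np

def local_morans_i(grid):
    """Local Moran's I on a 2-D grid with rook (4-neighbour) contiguity.

    Positive values flag pixels similar to their neighbours (local clusters);
    negative values flag local outliers.
    """
    z = grid - grid.mean()
    s2 = (z ** 2).mean()

    # Sum of the four rook neighbours; edges padded with the mean (z = 0)
    zp = np.pad(z, 1, mode="constant")
    lag = zp[:-2, 1:-1] + zp[2:, 1:-1] + zp[1:-1, :-2] + zp[1:-1, 2:]
    return z * lag / (s2 + 1e-12)

# A grid with a high-value cluster in the top-left corner
g = np.zeros((6, 6))
g[:2, :2] = 10.0
li = local_morans_i(g)
```

Pixels inside the cluster score positive (similar neighbours), while the pixel bordering the cluster scores negative (a local outlier), which is the kind of spatial structure ESDA layers feed into the regression.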

Experimental Settings
The accuracies obtained in this paper are very high, so it is necessary to discuss extrapolation to other landscapes and data. Worldview-2 and other very high-resolution images have long been considered a useful data source for precision agriculture. However, designing an experiment is difficult due to the variation of data attributes, land-cover types, etc. There are many ways to relate ground-measured AGB to RS variables. Ling et al. [72] placed 18 × 18 m 2 plots; inside every plot, they measured the AGB of 3-5 subplots with a size of 1 × 1 m 2 and extracted 1 × 1 pixel from Worldview-2 images (1.8 m) according to the coordinates of the subplots. Adam et al. [73] placed 84 plots of 20 × 20 m 2 , then extracted 8 × 8 pixels from a Worldview-2 image (2 m) around the center coordinate of each plot, covering a 16 × 16 m 2 area, and used the average values of the 64 pixels. Sibanda et al. [74] set 54 plots measuring 13.7 × 18.3 m 2 and randomly extracted 20 pixels from each plot to represent its spectral characteristics, obtaining 1080 pixels in this way. Our sampling method is mostly based on Sibanda et al. [74], and we used the method of Adam et al. [73] for validation.
The number of plots and the sizes of plots and subplots are a tradeoff between accuracy and efficiency. According to Justice et al. [75], in order to relate ground-measured data to RS variables, plot sizes should satisfy: minimum plot area = (pixel diameter × (1 + 2 × geometric accuracy)) 2 . However, a large plot size may introduce bias in both biomass and spectral bands. Brogaard et al. [76] suggest that the number of plots should exceed 30. The field survey for AGB is costly in most cases, especially when no allometric model is available. Moreover, land-cover types have to be taken into consideration. The land cover of our study area is complex: bamboo forests grow close to roads, villages, and other vegetation, and other crops and non-forest land types have to be masked out precisely. Therefore, we set a relatively small plot size and number of plots covering the center of the study area. More work on plot sizes and numbers is needed regarding the application of very high-resolution images.
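The plot-size rule of Justice et al. [75] is simple to evaluate numerically. The sketch below uses illustrative values (a 2 m pixel and a geometric accuracy of 0.5 pixel), not the figures adopted in this study.

```python
def minimum_plot_area(pixel_diameter_m, geometric_accuracy_pixels):
    """Plot-size rule of Justice et al. [75]:
    minimum plot area = (pixel diameter * (1 + 2 * geometric accuracy))^2
    """
    side = pixel_diameter_m * (1 + 2 * geometric_accuracy_pixels)
    return side ** 2

# Illustrative values only: a 2 m multispectral pixel with a geometric
# accuracy of 0.5 pixel gives a minimum plot side of 4 m, i.e. 16 m^2.
area = minimum_plot_area(2.0, 0.5)
```

The quadratic dependence on geometric accuracy is why precise co-registration allows much smaller, cheaper field plots.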
Algorithms and RS variables have been discussed and compared. Land-cover types and data types could affect the applicability of these methods and variables; for example, an NIR band may not be available in some UAV data. According to our study, GLCM and ESDA have their own advantages, and we suggest combining large-window GLCM with small-window ESDA to improve the results. However, how large the GLCM window should be remains an open question, and the application of CNN is worthwhile for this reason: CNN imposes no requirements on the data or land types and can be useful taking only bands as input. However, CNN is computationally intensive.

Conclusions
In this study, we applied state-of-the-art ML for AGB mapping using Worldview-2 optical imagery. Algorithms including CNN, RF, SVM, and ANN, and variables including GLCM, VIs, and ESDA were compared. To further the comparison, large ranges of window sizes for GLCM, ESDA, and CNN were tested. The conclusions on the impact of algorithms and variables are as follows:
1.
CNNs reached a satisfactory result (R 2 = 0.943) with limited inputs. Clearly, CNN can be used for AGB estimation and reach satisfactory results if properly adjusted. The main drawback is the high variance among different trained CNNs. However, some aspects of CNN are still worth investigating: for example, data augmentation methods such as rotation and flipping, and the CNN structures themselves. Adding additional variables into CNN might make further improvements.

2.
SVR and RF reached very promising results, were computationally inexpensive, and were easy to apply, with few requirements on data amount and distribution. In our study, GLCM_RF_37 was the best combination, with an R 2 of 0.979, followed closely by All_RF_51 with an R 2 of 0.974. ANN was not as strong, but the rapid development of DL may bring unforeseeable changes in the future. The performance of algorithms and variables may be case-specific, but it provides a reference for algorithm and variable selection.

3.
GLCM textures were without doubt first class, showing superiority with all the algorithms we used. However, the choice of GLCM texture scale has long been poorly motivated; for VHSR in particular, the impact of scale is deeper and more complicated. Our study showed that a GLCM texture with a single, properly chosen window size can be useful, especially a large one. ESDA textures can be incorporated into AGB estimation, but enlarging the ESDA window size made little improvement.