Comparative Analysis of Modeling Algorithms for Forest Aboveground Biomass Estimation in a Subtropical Region

Remote sensing–based forest aboveground biomass (AGB) estimation has been extensively explored in the past three decades, but how to effectively combine different sensor data and modeling algorithms is still poorly understood. This research conducted a comparative analysis of different datasets (e.g., Landsat Thematic Mapper (TM), ALOS PALSAR L-band data, and their combinations) and modeling algorithms (e.g., artificial neural network (ANN), support vector regression (SVR), Random Forest (RF), k-nearest neighbor (kNN), and linear regression (LR)) for AGB estimation in a subtropical region under non-stratification and stratification of forest types. The results show the following: (1) Landsat TM imagery provides more accurate AGB estimates (root mean squared error (RMSE) values of 27.7–29.3 Mg/ha) than ALOS PALSAR (RMSE values of 30.3–33.7 Mg/ha). The combination of TM and PALSAR data has similar performance for ANN and SVR, worse performance for RF and kNN, and slightly improved performance for LR. (2) Overestimation for small AGB values and underestimation for large AGB values are major problems when using the optical (e.g., Landsat) or radar (e.g., ALOS PALSAR) data. (3) LR is still an important tool for AGB modeling, especially for the AGB range of 40–120 Mg/ha. Machine learning algorithms have limited effects on improving AGB estimation overall, but ANN can improve AGB modeling when AGB values are greater than 120 Mg/ha. (4) Forest type and AGB range are important factors that influence AGB modeling performance. (5) Stratification based on forest types improved AGB estimation, especially when AGB was greater than 160 Mg/ha, using the LR approach. This research provides new insight for remote sensing-based AGB modeling for the subtropical forest ecosystem through a comprehensive analysis of different source data, modeling algorithms, and forest types.
It is critical to develop an optimal AGB modeling procedure, including the collection of a sufficient number of sample plots, extraction of suitable variables and modeling algorithms, and evaluation of the AGB estimates.


Introduction
Forest biomass is one of the important variables needed to quantify the structure and function of forest ecosystems [1]. A large number of studies using different technologies such as field measurements, remote sensing, and process-based ecosystem models have been conducted to estimate forest biomass at local, regional, and global scales [2]. Remotely sensed imagery can characterize land surface features over large areas and has been extensively used for estimating forest biomass, especially aboveground biomass (AGB). However, data limitations (e.g., data saturation for optical and radar data, and limited spectral and spatial resolutions) and the complex relationships between AGB and spectral variables make AGB estimation inaccurate, especially when AGB values are higher than 150 Mg/ha or lower than 40 Mg/ha [3].
In remote sensing-based biomass modeling, the collection of a sufficient number of high-quality sample plots, the selection of suitable variables, and the selection of modeling algorithms are three critical steps [2]. Sample plots provide fundamental data for accurately estimating AGB. Once sample plot data are collected, research mainly focuses on the identification of suitable variables and modeling algorithms. Many previous studies have examined the importance of selecting suitable variables (e.g., spectral bands, vegetation indices, texture measures, and subpixel features) in improving AGB estimation [4–6]. Stepwise regression analysis is often used for the identification of variables for AGB modeling [2]. Previous research has indicated that the combination of spectral responses and texture variables can improve AGB estimation compared with using a single kind of imagery alone, especially in tropical and subtropical regions with complex forest stand structure and tree species composition [3,4]. Texture measures are especially valuable for sites with complex forest stand structures [4]. Another common approach to identify variables for AGB modeling is Random Forest (RF) because it can rank the importance of variables [7–10]. Stepwise regression is simple and easy to use, but it identifies only those variables that have a linear relationship with AGB, whereas the variables selected using RF can have nonlinear relationships with AGB.
In AGB modeling, in addition to the selection of variables, another critical step is to identify an appropriate algorithm to establish AGB estimation models. Lu et al. [2] summarized the major AGB modeling algorithms, covering regression and machine learning algorithms. Regression has been widely used for AGB estimation [3,4,6]. Because of its simplicity and relatively good performance, this method is often used in AGB estimation for different ecosystems such as tropical, subtropical, and temperate forests [3,4,6,9]. However, linear regression (LR) methods are based on linear relationships of AGB with predictors or independent variables and thus may not provide satisfactory results because of the complex relationships between remote sensing variables and AGB [1,3,11]. In this case, nonparametric and machine learning algorithms, such as artificial neural network (ANN), support vector regression (SVR), k-nearest neighbor (kNN), and RF, can deal with nonlinear relationships. Thus, these algorithms have gained increasing attention in the past decade [2,9,12–14] and have been widely employed in forest AGB estimation [15–17]. For example, when AGB is nonlinearly related to remote sensing variables, LR models cannot provide reliable AGB estimation; instead, machine learning algorithms such as ANN can establish nonlinear relationships and provide more accurate AGB estimates [18,19], especially when ancillary variables such as topographic factors (e.g., elevation, slope, aspect) are used [20]. ANN has the advantages of distributed parallel processing and nonlinear and adaptive learning and may outperform LR models in AGB estimation [13], but it is often criticized for its black-box character, which makes the variables difficult to interpret [2], and for difficulties with training convergence and function approximation [19]. Compared to traditional LR, SVR has become an important approach for AGB modeling in the past decade [21,22] because of its capability in dealing with a relatively small number of sample plots. By selecting a kernel function that projects the data from a low-dimensional space to a high-dimensional space, SVR solves the problem of linear inseparability in sample plots, reduces the structural risk, and gains a greater ability to process nonlinear data. In addition, SVR can generalize well when dealing with high-dimensional data [23,24].
The RF algorithm is an extension of the traditional decision tree method that combines multiple decision trees [8]. It has the advantages of handling a large amount of data with high speed and efficiency and of ranking the importance of variables [7,25,26]. Previous research has shown that RF can lead to higher estimation accuracies and smaller errors than LR [21,24,27,28]. Another commonly used method for AGB estimation is kNN [14,29–35]. This method calculates the weighted average of forest AGB values of the k nearest sample plots based on spectral distances between the plot locations and the estimated pixel using Euclidean or Mahalanobis distance. The more similar the sample plots are to the estimated pixel, the greater the weights. Without the assumptions of normal distribution and linear relationship, kNN can be utilized to estimate both continuous and discrete forest variables. Moreover, the predictors used in kNN can be any remote sensing and environmental variables, including spectral bands, vegetation indices, soil properties, and topographic features [33,34]. Previous research showed that the kNN method may produce poor AGB estimates at the pixel scale but can improve estimation at a coarser scale [14,36,37]. Considerable research on kNN has been conducted to improve the accuracy of estimating forest variables by optimizing the selection of k and using different distances and weighting methods [32–35].
The advantages of using nonparametric and machine learning algorithms for AGB estimation have been recognized, but their estimation accuracy depends heavily on the optimization of the parameters used in the relevant algorithms and on the representativeness of the samples. For example, ANN requires a large number of sample plots, and a small number of sample plots may lead to poor predictions [19]. SVR has been found to perform poorly when its parameters are not well tuned, because there is no standard approach to optimize the model parameters [38]. The RF algorithm might have poor generalization and large estimation errors when the number of variables with low correlation to AGB in the model increases [25]. For a specific study area and a set of remote sensing data, it is unclear which algorithm may produce the most accurate results. It is therefore necessary to conduct a comparative analysis of different modeling algorithms for AGB estimation.
Zhao et al. [6] used LR and explored the role of using Landsat TM images, ALOS PALSAR data, and their combination, plus data fusion, to improve AGB estimation performance with and without classification of forest types. The authors found that the combination of TM and PALSAR data as extra bands increased the estimation accuracy of AGB, and the stratification of vegetation types also improved AGB estimation performance, but LR led to significant overestimation and underestimation for the small and large AGB values (e.g., less than 40 Mg/ha and greater than 160 Mg/ha), respectively. It is unknown whether these conclusions hold when other AGB modeling methods are used. The following questions also remain to be answered: (1) Which modeling algorithm should be used for different datasets, that is, can machine learning algorithms provide better AGB estimation performance than LR? (2) Can stratification of forest types improve AGB estimation performance when machine learning and nonparametric algorithms are used? Comprehensive comparative analyses of modeling algorithms based on different datasets under non-stratification and stratification of forest types are rarely reported. Therefore, the objective of this research is to understand which AGB modeling algorithms are appropriate for the subtropical region by using Landsat TM and ALOS PALSAR L-band data and by conducting a comparative analysis of different algorithms (e.g., RF, ANN, SVR, kNN, LR) under non-stratification and stratification of forest types. In addition, we attempt to examine which modeling method can reduce the overestimation and underestimation that often take place for small and large AGB values, respectively, when LR is used. Through this comprehensive comparison, we can better understand the AGB modeling mechanisms by employing suitable remote sensing variables and modeling algorithms under non-stratification and stratification of forest types.

Study Area
Zhejiang province has complex topographic conditions with mountainous and hilly areas in the southwest and plain and basin areas in the northeast. Mountainous and hilly areas account for approximately 70% of the area, the plain and basin areas for 23%, and water for 7%. The mountains with elevations greater than 1000 m are mainly located in southwestern Zhejiang, the hilly areas are in the central part, and basins of different sizes are dispersed in different regions (Figure 1). This province has a subtropical moist monsoon climate with average annual temperatures of 15-18 °C and average annual precipitation of 980-2000 mm. This province has forested areas of 6680 kha with forest coverage of approximately 60%. The dominant forest types include coniferous forests (pines and firs as the dominant trees with nearly uniform forest stand structure), broadleaf forests (dominated by evergreen broadleaf trees with complex forest canopy layers), mixed needle and broadleaf forests (two or more dominant tree species, usually with a pine overstory, broadleaf middle layer, and shrub lower layer), and bamboo forests (usually pure bamboo species).

Materials and Methods
The framework of conducting a comparative analysis of AGB modeling approaches is illustrated in Figure 2. The major steps include (1) data preparation of different sources (i.e., geometric registration between Landsat, ALOS PALSAR, and digital elevation model (DEM); atmospheric and topographic corrections); (2) selection of the variables from Landsat, ALOS PALSAR, and DEM data; (3) development of AGB estimation models using different algorithms (i.e., LR, RF, ANN, SVR, and kNN) based on stratification and non-stratification; and (4) comparison and evaluation of the AGB modeling results.

Data Collection and Preprocessing
The datasets used in this research are summarized in Table 1. All these data were registered in the Universal Transverse Mercator coordinate system (zone 50 north). For Landsat TM imagery, the dark object subtraction approach was used to convert digital numbers of pixels to surface reflectance [39,40] and the C-correction approach was used to conduct topographic correction [41]. The global digital elevation model (GDEM) data with 30 m spatial resolution from the United States Geological Survey were used in the topographic correction. For ALOS PALSAR data, we downloaded the 2010 global mosaic data with a cell size of 25 m. The PALSAR data were co-registered to Landsat imagery. During the image-to-image registration, the PALSAR data were resampled from a 25 m to a 30 m cell size using the nearest neighbor algorithm so that both TM and PALSAR data had the same cell size. Because of the speckle problem, the enhanced Lee filtering approach with a window size of 3 × 3 pixels was used to reduce the speckles inherent in the PALSAR data [42]. The forest (coniferous, broadleaf, mixed, and bamboo) classification image, which was developed from the 2010 Landsat image using a hybrid approach [3], was directly used in this research. A detailed description of this classified image is provided in [3].
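The topographic correction step can be illustrated with a short sketch. The source cites [41] without reproducing the equations, so the code below assumes the commonly used form of the C-correction, in which reflectance is regressed on the cosine of the local solar incidence angle (reflectance = m cos i + b) and c = b/m. All function names and the sample values are hypothetical and serve only as an illustration.

```python
import math

def cos_incidence(slope_deg, aspect_deg, sun_zenith_deg, sun_azimuth_deg):
    """Cosine of the local solar incidence angle for one pixel."""
    s = math.radians(slope_deg)
    z = math.radians(sun_zenith_deg)
    d = math.radians(sun_azimuth_deg - aspect_deg)
    return math.cos(s) * math.cos(z) + math.sin(s) * math.sin(z) * math.cos(d)

def fit_c(reflectance, cos_i):
    """Least-squares fit of reflectance = m * cos_i + b; returns c = b / m."""
    n = len(reflectance)
    mean_x = sum(cos_i) / n
    mean_y = sum(reflectance) / n
    sxx = sum((x - mean_x) ** 2 for x in cos_i)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(cos_i, reflectance))
    m = sxy / sxx
    b = mean_y - m * mean_x
    return b / m

def c_correct(reflectance, cos_i, sun_zenith_deg, c):
    """Assumed C-correction form: rho * (cos(zenith) + c) / (cos(i) + c)."""
    cos_z = math.cos(math.radians(sun_zenith_deg))
    return reflectance * (cos_z + c) / (cos_i + c)

# Hypothetical sample pixels for one band (values chosen for illustration only).
sample_cos_i = [0.5, 0.7, 0.9]
sample_reflectance = [0.10, 0.12, 0.14]
c_band = fit_c(sample_reflectance, sample_cos_i)

# On flat terrain the incidence angle equals the solar zenith angle,
# so the correction leaves the reflectance unchanged.
flat = cos_incidence(0.0, 0.0, 30.0, 135.0)
corrected_flat = c_correct(0.2, flat, 30.0, c_band)
```

In practice c would be fitted separately for each spectral band, since the regression slope and intercept differ between bands.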

Table 1. Datasets used in this research.
ALOS PALSAR L-band data: The FBD (fine beam double polarization, HH/HV) L-band 1.5 product with 25-m cell size was downloaded from the global mosaic data with a time interval of 1 year. This downloaded image was produced using the 2010 PALSAR images. The cell size of 25 m was resampled to a 30-m cell size during the PALSAR-to-TM image registration.
ASTER GDEM data: Global digital elevation model (GDEM) data with a 30-m spatial resolution were downloaded from the United States Geological Survey website.
Forest classification image: The forest types in this study area included pine and fir (coniferous forest), broadleaf forest, mixed forest, and bamboo. The forest distribution map was developed from a Landsat TM image using a hybrid approach [3] with an overall classification accuracy of 78%. Forest classes had user's accuracy between 71 and 87% and producer's accuracy between 72 and 87%. More details are provided in [3].
Field measurements: A total of 664 sample plots covering coniferous, broadleaf, mixed, and bamboo forests were inventoried in 2010 and 2011 [3].
Note: HH, horizontally transmitted and received; HV, horizontally transmitted and vertically received.
The field inventory work was conducted at county level. Based on the forest distribution map at the subcompartment scale, sample plots were systematically allocated on this map according to the number of subcompartments, sampling interval, and proportion of samples. The sampling ratio depends on the number of subcompartments; for example, the ratios were set to 3%, 2.5%, 2%, 1.5%, and 1% for fewer than 1000, 1000-2000, 2000-4000, 4000-6000, and more than 6000 subcompartments, respectively. Taking 901 subcompartments as an example, the number of sample plots is 30; thus, the sampled subcompartments are coded 1, 31, 61, . . ., and 901. Figure 3 illustrates the concept of the sampling approach used in this research. After a subcompartment was selected, one plot of 20 m by 20 m and three nested subplots of 2 m by 2 m were allocated near the central area of this subcompartment, representing the forest type. Sample plots were inventoried in 2010 and 2011. Within each plot, diameters at breast height (DBH) of all trees greater than 5 cm were measured [43]. The three nested subplots were used to measure trees and shrubs with DBH less than 5 cm and grass cover [43]. Tree AGB was calculated using the allometric equations provided by [44]. In remote sensing-based AGB modeling research, the representativeness of sample plots is critical for accurate AGB estimation. To make sure that all plots used for analysis were representative of the forests, we overlaid all sample plots on the Landsat color composite to visually examine the geometric accuracy. A total of 664 sample plots were collected and used in this research. The statistics of the sample plot data based on different forest types are summarized in Table 2. The AGB values of all sample plots ranged from 25.7 Mg/ha to 180.7 Mg/ha with an average AGB of 95.9 Mg/ha. The sample plots were randomly divided into two groups: a dataset of 498 plots (75%) for model development and a dataset of 166 plots (25%) for validation of AGB estimates under non-stratification. Because we would examine the role of stratification of forest types in improving AGB modeling performance, we needed a sufficient number of validation samples for each forest type. Thus, 30% of the total samples for each forest type were used for validation. The numbers of modeling and validation samples are also summarized in Table 2.
Note: Under non-stratification conditions, test samples were randomly selected from the total sample population at a proportion of 25%. Under stratification of forest types, test samples were randomly selected from the total sample population corresponding to each forest type at a proportion of 30%.
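The two splitting schemes described above can be sketched as follows. This is a minimal illustration using Python's standard library; the function names, the random seed, and the per-type plot counts in the stratified example are hypothetical (the actual counts are given in Table 2).

```python
import random

def split_plots(plot_ids, test_fraction, seed=0):
    """Randomly split plot identifiers into (modeling, test) lists."""
    rng = random.Random(seed)
    shuffled = list(plot_ids)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def split_by_forest_type(plots_by_type, test_fraction, seed=0):
    """Apply the random split separately within each forest type."""
    return {
        forest_type: split_plots(ids, test_fraction, seed)
        for forest_type, ids in plots_by_type.items()
    }

# Non-stratification: 664 plots, 25% held out for validation.
modeling, test = split_plots(range(664), 0.25)

# Stratification: 30% held out per forest type (plot counts are hypothetical).
example_types = {"coniferous": list(range(10)), "bamboo": list(range(100, 120))}
stratified = split_by_forest_type(example_types, 0.30)
```

With 664 plots and a 25% test fraction, the split reproduces the 498 modeling plots and 166 validation plots reported in the text.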

Extraction and Selection of Variables from Landsat TM and ALOS PALSAR Data
The remote sensing variables, including spectral responses (vegetation indices and transformed images in addition to spectral bands), spatial features (textures, segments), and subpixel features (fraction images using spectral mixture analysis), can be used for AGB modeling [2]. Previous research has indicated that vegetation indices and transformed images have similar performance to the spectral bands [45], but the incorporation of spatial features into spectral responses improved AGB modeling performance [3,4]. Therefore, the following variables were extracted in this research: (1) Landsat spectral bands (five spectral bands, excluding the blue band due to serious atmospheric impacts on this band); (2) textural images using gray-level co-occurrence matrix (GLCM) measures (e.g., mean, variance, second moment, dissimilarity, homogeneity, contrast, entropy, correlation) with window sizes of 3 × 3, 5 × 5, 7 × 7, and 9 × 9 pixels based on the five Landsat spectral bands; (3) ALOS PALSAR L-band horizontally transmitted and received (HH) and horizontally transmitted and vertically received (HV) data; (4) textural images using the same GLCM measures and window sizes based on HH and HV imagery. Three datasets (Landsat TM-based variables, ALOS PALSAR-based variables, and their combination) were utilized.
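To make the GLCM texture measures concrete, the following sketch computes four of them (contrast, homogeneity, second moment, and entropy) for a single image window using their standard definitions. The source does not state the pixel offset, the number of gray levels, or whether a symmetric matrix was used, so the horizontal one-pixel offset and the symmetric counting below are assumptions, and the function name is hypothetical.

```python
import math
from collections import Counter

def glcm_features(window, dx=1, dy=0):
    """Texture measures from a symmetric, normalized GLCM of one window.

    window: 2-D list of integer gray levels; (dx, dy): neighbor offset.
    """
    rows, cols = len(window), len(window[0])
    counts = Counter()
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                a, b = window[r][c], window[r2][c2]
                counts[(a, b)] += 1
                counts[(b, a)] += 1  # count both directions (symmetric GLCM)
    total = sum(counts.values())
    feats = {"contrast": 0.0, "homogeneity": 0.0,
             "second_moment": 0.0, "entropy": 0.0}
    for (i, j), n in counts.items():
        p = n / total  # co-occurrence probability of gray levels i and j
        feats["contrast"] += p * (i - j) ** 2
        feats["homogeneity"] += p / (1 + (i - j) ** 2)
        feats["second_moment"] += p * p
        feats["entropy"] -= p * math.log(p)
    return feats

# Two uniform rows: horizontal neighbors are always identical.
smooth = glcm_features([[0, 0], [1, 1]])
# Alternating columns: horizontal neighbors always differ by one gray level.
striped = glcm_features([[0, 1], [0, 1]])
```

A textural image is obtained by sliding such a window (3 × 3 to 9 × 9 pixels in this research) across a band and assigning each measure to the window's central pixel.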
The selection of suitable variables is one of the critical steps in the AGB modeling procedure [2]. Because a large number of potential variables are available, but not all variables are needed in AGB modeling, it is necessary to use proper approaches to identify the optimal variables for AGB modeling, depending on the kind of modeling algorithm, such as linear regression and nonparametric algorithms [2]. AGB may have linear, nonlinear, or no relationships with remote sensing variables. Because of the high correlations among some explanatory variables, it is critical to identify the variables that have a high correlation with AGB but weak correlations among themselves [2]. Thus, correlation analysis was used to examine the relationships between AGB and remote sensing variables and to remove the variables without significant correlation. The correlation analysis was also used to examine the correlation coefficients between the remote sensing variables and to remove the variables that have very high coefficients (e.g., >0.9) but relatively low standard deviations. The stepwise regression method was then used to identify the variables for AGB modeling because it introduces the variables into the model one by one and tests their significance. When a previously included explanatory variable is no longer significant because of the introduction of a new explanatory variable, it is deleted to ensure that only significant variables are included in the regression equation before each new variable is added. This process is repeated until no significant variables can be added to and no insignificant variables can be removed from the model. Thus, the final variables are all significant and are unlikely to have serious multicollinearity. During the stepwise regression analysis, the F-statistical test was used to decide whether one variable was included or not, based on the F test level of 0.1 and significance test level of 0.05.
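The correlation-based screening that precedes stepwise regression can be sketched as below. This is an illustration only: the source removes variables "without significant correlation" with AGB, whereas the sketch substitutes a simple absolute-correlation threshold (`min_abs_r_agb`, a hypothetical parameter) for a formal significance test. The variable names and values are invented.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def sample_std(a):
    """Sample standard deviation."""
    mean_a = sum(a) / len(a)
    return math.sqrt(sum((x - mean_a) ** 2 for x in a) / (len(a) - 1))

def screen_variables(variables, agb, min_abs_r_agb=0.2, max_abs_r_pair=0.9):
    """Keep variables related to AGB; thin out highly inter-correlated pairs."""
    # Step 1: drop variables that are only weakly correlated with AGB.
    kept = [name for name, values in variables.items()
            if abs(pearson(values, agb)) >= min_abs_r_agb]
    # Step 2: in each highly correlated pair, drop the lower-variability one.
    dropped = set()
    for i, first in enumerate(kept):
        for second in kept[i + 1:]:
            if first in dropped or second in dropped:
                continue
            r = pearson(variables[first], variables[second])
            if abs(r) > max_abs_r_pair:
                lower = (first if sample_std(variables[first])
                         < sample_std(variables[second]) else second)
                dropped.add(lower)
    return [name for name in kept if name not in dropped]

# Hypothetical plot-level values for three candidate variables.
agb_values = [1.0, 2.0, 3.0, 4.0, 5.0]
candidates = {
    "var_a": [1.0, 2.0, 3.0, 4.0, 5.0],     # related to AGB, low spread
    "var_b": [2.0, 4.0, 6.0, 8.0, 10.0],    # related to AGB, higher spread
    "var_c": [1.0, -1.0, 1.0, -1.0, 1.0],   # unrelated to AGB
}
selected = screen_variables(candidates, agb_values)
```

The surviving variables would then be passed to the stepwise regression described above.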
The LR approach assumes linear relationships between explanatory variables and a response variable, specifically remote sensing-derived predictors and AGB. However, in reality, this assumption is not always met, especially when multisource data including different kinds of remote sensing variables (e.g., spectral and spatial features, radar) and DEM-derived data are used. The nonparametric and machine learning algorithms such as RF, ANN, SVR, and kNN can handle complex nonlinear relationships and are especially useful when multisource data are used [2,9,22,27,46]. However, except for RF, most nonparametric algorithms cannot provide a variable selection method and are not able to identify the optimal variables for AGB modeling. Thus, the variables selected by RF were also used in SVR, ANN, and kNN for AGB modeling.
The principle of RF is that k bootstrap samples are drawn from the training set, a decision tree is built on each of the k samples, and each tree produces a prediction for a given sample, which yields k predicted values. Finally, the k predicted values are averaged to obtain the final prediction for each sample [7,15,25]. The RF algorithm has three parameters: the number of decision trees (ntree), the number of variables randomly sampled as candidates at each split (mtry), and the number of repetitions in the calculation of importance (nperm). The nperm is often assigned a default value of 5. In this research, mtry and ntree were tuned by repeatedly changing their settings and comparing the root mean square errors (RMSEs) of the test samples. In the RF approach, a backward feature elimination method is often used to eliminate relatively less important variables from all variables and to keep the most important variables after many iterations. A variable was removed at each iteration, and the parameters of RF were optimized. When the RMSE reached the minimum, the most appropriate mtry and ntree were obtained. This process led to an importance rank of the independent variables. Based on the rank, the smallest number of variables producing the most accurate estimates of AGB was determined.
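The backward feature elimination loop can be sketched independently of any particular RF implementation. In the sketch below, `evaluate` is a hypothetical callable that stands in for fitting and tuning an RF model on a subset of variables and returning its test RMSE together with the variable importances; the toy scoring function and the variable names are invented so that the loop can be run on its own.

```python
def backward_elimination(features, evaluate):
    """Drop the least important variable each round; keep the best subset.

    evaluate(subset) must return (rmse, {variable: importance}).
    """
    current = list(features)
    best_rmse, best_subset = float("inf"), list(current)
    while current:
        rmse, importance = evaluate(current)
        if rmse < best_rmse:
            best_rmse, best_subset = rmse, list(current)
        if len(current) == 1:
            break
        weakest = min(current, key=lambda name: importance[name])
        current.remove(weakest)
    return best_subset, best_rmse

# Toy stand-in for "fit RF, tune ntree/mtry, return test RMSE and importances".
TOY_IMPORTANCE = {"tm_b7": 3.0, "tm_b5_mean": 2.0, "hh_var": 0.1, "hv_ent": 0.2}
INFORMATIVE = {"tm_b7", "tm_b5_mean"}

def toy_evaluate(subset):
    gain = sum(TOY_IMPORTANCE[name] for name in subset if name in INFORMATIVE)
    noise_penalty = 0.5 * sum(1 for name in subset if name not in INFORMATIVE)
    rmse = 30.0 - gain + noise_penalty
    return rmse, {name: TOY_IMPORTANCE[name] for name in subset}

subset, rmse_at_best = backward_elimination(list(TOY_IMPORTANCE), toy_evaluate)
```

In this toy case the two uninformative variables are removed first, and the subset with the lowest RMSE is returned even though elimination continues down to a single variable.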

Biomass Modeling Algorithms
Different algorithms including LR, RF, ANN, SVR, and kNN were used for AGB modeling in this study. LR can be expressed as model (1):
$$y = a_0 + a_1 x_1 + a_2 x_2 + \cdots + a_n x_n + \varepsilon \quad (1)$$
where $a_0$ is a constant, $x_1, x_2, \ldots, x_n$ are the explanatory variables, $a_1, a_2, \ldots, a_n$ are the regression coefficients associated with the corresponding variables, $y$ is the value of the plot's AGB, $n$ is the number of explanatory variables, and $\varepsilon$ is the error term.
The advantages of RF (the ability to use discrete or continuous datasets, low sensitivity to noise, high efficiency with large datasets, no need for an a priori probability distribution, flexibility, and robustness) have made it an important tool in land cover classification and AGB estimation in the past decade [47–52]. The optimal RF model obtained during the RF-based variable selection described above was used to map AGB for the entire study area.
SVR is based on the Vapnik-Chervonenkis (VC) dimension theory and structural risk minimization, seeking the best compromise between the complexity of the model (the learning accuracy for specific training samples) and learning ability (the ability to identify any sample without error) in order to achieve the best generalization ability based on a limited number of samples. In a high-dimensional feature space, the "curse of dimensionality" is prone to occur; therefore, the calculation in the high-dimensional feature space has to be transformed, which requires a kernel function to replace the inner product [21–23]. The choice of the kernel function is a core problem in SVR research. At present, there is no general way to construct a suitable kernel function for a specific problem. The commonly used kernel functions are the linear, polynomial, radial basis function, and sigmoid kernels. In SVR, one critical step is to optimize three parameters: the kernel function, SVR type, and penalty parameters [38]. In this research, we used a grid-search approach to determine the best penalty parameters and varied the kernel function and SVR type by repeatedly changing their settings. Similar to RF, the RMSEs of test samples were compared and the optimal parameters were determined when the RMSE reached the minimum.
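The parameter search described above follows a simple pattern: evaluate every combination in a grid and keep the one with the lowest test RMSE. The sketch below shows that pattern only. The parameter names (`cost`, `kernel`, `svr_type`), the grid values, and the toy scoring function are hypothetical; in a real application `test_rmse` would train an SVR model with the given settings and return the RMSE of its predictions for the test plots.

```python
import itertools
import math

def grid_search(param_grid, test_rmse):
    """Return (best_params, best_rmse) over all combinations in param_grid."""
    names = sorted(param_grid)
    best_params, best_score = None, float("inf")
    for values in itertools.product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = test_rmse(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical search space for illustration.
grid = {
    "cost": [0.1, 1, 10, 100],
    "kernel": ["linear", "polynomial", "rbf", "sigmoid"],
    "svr_type": ["epsilon", "nu"],
}

def toy_test_rmse(params):
    # Invented error surface with a single minimum; not a real SVR fit.
    score = abs(math.log10(params["cost"]) - 1.0)
    score += 0.0 if params["kernel"] == "rbf" else 1.0
    score += 0.0 if params["svr_type"] == "epsilon" else 0.5
    return score

best_params, best_score = grid_search(grid, toy_test_rmse)
```

The same pattern applies to the tuning of ntree and mtry for RF, the number of neurons and transfer function for ANN, and k for kNN.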
ANN (a back-propagation neural network here) is also used for AGB estimation [18]. ANN modeling is divided into two phases: the learning phase and the prediction phase. The learning phase is the process of finding the rule that links the input variables to the output variable. The weight matrix is modified through training so that the output value is kept close to the target value. In this process, the weights and thresholds of the network are determined. From the structural point of view, the network weights and thresholds are analogous to the coefficients and constants in a model, and the learning process is analogous to the process of solving for the coefficients and constants of the model. Two important parameters, the number of neurons and the transfer function, can greatly affect the accuracy of modeling prediction. In this research, numbers of neurons from 3 to 50 and 13 transfer functions were examined. The optimal parameters were determined when the RMSE reached the minimum.
The kNN algorithm is a typical nonparametric algorithm, which is widely used in forestry survey and forest parameter estimation [14,30,32–35,46,53]. kNN is based on the similarity between the observation plots and predicted pixels using a single variable or multiple variables. In terms of AGB estimation, the Mahalanobis distances between the estimated pixels and the sample plots were calculated using the factors that had significant effects on AGB:
$$D(x, y) = \sqrt{(x - y)^{T} \Sigma^{-1} (x - y)}$$
where $x = (x_1, x_2, \ldots, x_p)$ and $y = (y_1, y_2, \ldots, y_p)$ represent p-dimension variable vectors in the spectral space between two AGB sample plots. In addition,
$$\Sigma = \operatorname{cov}(x, y) = E\left[(x_i - \bar{x})(y_i - \bar{y})\right]$$
where $i = 1, 2, \ldots, p$; $\bar{x}$ and $\bar{y}$ represent the mean values of the x and y variables; $\Sigma$ is the covariance matrix; $T$ represents matrix transpose; and $D(x, y)$ is the Mahalanobis distance between two sample plots at p-dimension variable vectors x and y. The closer the sample plot is to the target pixel, the greater the weight of the sample plot. The estimate of AGB for each pixel was created by calculating a weighted mean of AGB values from the k nearest plots based on their inverse distances. In this research, we examined different values of k from 1 to 30, depending on different scenarios. When the prediction accuracy reached the maximum, the optimal parameter was selected.
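The kNN estimate described above can be sketched as follows. The sketch assumes that the inverse covariance matrix has already been computed from the sample plots and is passed in as a nested list; the function names, the handling of zero distances, and the example plots are hypothetical.

```python
import math

def mahalanobis(x, y, inv_cov):
    """Mahalanobis distance between vectors x and y given an inverse covariance."""
    diff = [a - b for a, b in zip(x, y)]
    p = len(diff)
    quad = sum(diff[i] * inv_cov[i][j] * diff[j]
               for i in range(p) for j in range(p))
    return math.sqrt(quad)

def knn_agb(pixel, plots, inv_cov, k):
    """Inverse-distance weighted mean AGB of the k nearest sample plots.

    plots: list of (variable_vector, agb) tuples.
    """
    ranked = sorted((mahalanobis(pixel, vector, inv_cov), agb)
                    for vector, agb in plots)
    nearest = ranked[:k]
    # A plot at zero distance would get infinite weight; return its AGB directly.
    for distance, agb in nearest:
        if distance == 0.0:
            return agb
    weights = [1.0 / distance for distance, _ in nearest]
    weighted_sum = sum(w * agb for w, (_, agb) in zip(weights, nearest))
    return weighted_sum / sum(weights)

# Hypothetical plots with two predictor variables each and AGB in Mg/ha.
example_plots = [((0.0, 0.0), 40.0), ((2.0, 0.0), 100.0), ((10.0, 10.0), 180.0)]
identity = [[1.0, 0.0], [0.0, 1.0]]  # with this matrix the distance is Euclidean
estimate = knn_agb((0.5, 0.0), example_plots, identity, k=2)
```

With the identity matrix the two nearest plots lie at distances 0.5 and 1.5, so their weights are 2 and 2/3 and the estimate is 55 Mg/ha; repeating the calculation for k from 1 to 30 and comparing test RMSEs gives the tuning procedure described in the text.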

AGB Modeling Based on Different Scenarios
Various scenarios were designed based on three datasets, five modeling algorithms, and stratification and non-stratification of forest types: (1) AGB models were developed based on three data sources with five modeling algorithms without stratification of forest type, which led to a total of 15 scenarios being examined; (2) AGB models were developed for each of four forest types based on three datasets and five modeling algorithms, which resulted in a total of 60 scenarios being examined.
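The scenario counts follow directly from the Cartesian product of the design factors, as the short sketch below illustrates. The dataset and algorithm labels are taken from the text; the variable names are illustrative only.

```python
from itertools import product

DATASETS = ["Landsat TM", "ALOS PALSAR", "TM + PALSAR"]
ALGORITHMS = ["ANN", "SVR", "RF", "kNN", "LR"]
FOREST_TYPES = ["coniferous", "broadleaf", "mixed", "bamboo"]

# (1) Non-stratification: every dataset paired with every algorithm.
non_stratified = list(product(DATASETS, ALGORITHMS))

# (2) Stratification: the same pairs repeated for each forest type.
stratified = list(product(FOREST_TYPES, DATASETS, ALGORITHMS))

print(len(non_stratified), len(stratified))  # 15 60
```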

Evaluation of AGB Estimates
In AGB modeling research, the RMSE and relative RMSE (RMSEr) are often used to assess the prediction performance [2,9]:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}$$
$$\mathrm{RMSEr} = \frac{\mathrm{RMSE}}{\bar{y}} \times 100\%$$
where $\hat{y}_i$ and $y_i$ are the predicted and reference AGB at sample plot i, and $\bar{y}$ is the mean AGB of the n test sample plots, depending on the non-stratification (all test samples) or stratification of forest types (the samples for each forest type, respectively). In general, smaller RMSE or RMSEr values indicate better model estimation performance. Scatterplots showing the relationships between AGB estimates and reference data were also used to evaluate the model performance. Considering the number of sample plots for evaluation, 25% of the sample plots were randomly selected as test samples for the non-stratification scenarios. For the stratification of forest types, however, we needed to make sure that a sufficient number of sample plots within each forest type were available for the evaluation of each scenario. Therefore, 30% of the sample plots were randomly selected from the sample population corresponding to each forest type for evaluation of AGB estimates. The number of test samples under non-stratification and stratification of forest types is provided in Table 2.
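As a minimal illustration of the two accuracy measures, the sketch below computes RMSE as the square root of the mean squared difference between predicted and reference AGB, and RMSEr as RMSE divided by the mean reference AGB, expressed as a percentage. The function names and the three-plot example are assumptions for illustration; the values are synthetic.

```python
import math

def rmse(predicted, observed):
    """Root mean squared error between predicted and reference AGB (Mg/ha)."""
    n = len(observed)
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / n)

def relative_rmse(predicted, observed):
    """RMSE as a percentage of the mean reference AGB of the test plots."""
    mean_observed = sum(observed) / len(observed)
    return 100.0 * rmse(predicted, observed) / mean_observed

# Synthetic test plots (Mg/ha), for illustration only.
observed = [50.0, 100.0, 150.0]
predicted = [60.0, 100.0, 140.0]

plot_rmse = rmse(predicted, observed)            # about 8.16 Mg/ha
plot_rmser = relative_rmse(predicted, observed)  # about 8.16 %
```

Because RMSEr is normalized by the mean reference AGB, the same absolute error yields a much larger RMSEr for a group of plots with low AGB than for a group with high AGB.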

Comparative Analysis of AGB Modeling Results under Non-Stratification Scenarios
The established models based on non-stratification scenarios are summarized in Tables 3 and 4, indicating the important role of textural images. For Landsat TM imagery, spectral band 7 (SWIR2) played the most important role. Overall, the models based on ALOS PALSAR data contained more variables than those based on Landsat data alone, but Landsat imagery performed better than ALOS PALSAR data using either LR or RF. The combination of Landsat TM and ALOS PALSAR data improved the modeling performance for LR but not for RF. The variables selected using RF (Table 4) were also used for the AGB models using ANN, SVR, and kNN.
The statistics (e.g., maximum, minimum, mean, and standard deviation) of the AGB estimates from the AGB prediction maps (Table 5) clearly indicate that PALSAR produced higher mean values than Landsat or the combination of Landsat and PALSAR data for all modeling algorithms except RF. However, the inverse was observed for the standard deviation values, implying that PALSAR data alone could not effectively predict the AGB distribution when the AGB value was very high or very low. Table 5 indicates that the predicted mean values using all these models were smaller than the mean from the sample plots, implying overall underestimation by these models. The much larger minimum values and smaller maximum values from the RF and kNN models, compared to those from ANN, SVR, and LR, imply that the RF and kNN modeling algorithms were not able to properly predict AGB when the values of AGB were small or large.
The performance of the predictions can be explained with the scatterplots showing the relationships between the AGB estimates and reference data (Figure 4). Figure 4 indicates that the overestimation and underestimation problems were obvious for all the prediction results, no matter which datasets and modeling algorithms were used. These problems were especially severe for the PALSAR-based predictions. For Landsat imagery, the residuals were relatively small when AGB was within 50–130 Mg/ha, but for ALOS PALSAR data this range was narrower, only about 90–120 Mg/ha. The combination of Landsat TM and ALOS PALSAR did not improve the residuals (Figure 4). The RMSE and RMSEr results (Table 6) quantitatively confirmed this situation.
The above analysis was based on the overall performance of different data sources and modeling algorithms but cannot provide detailed information on how different forest types and AGB ranges affected the AGB estimation under non-stratification scenarios. Table 7 summarizes the RMSE and RMSEr results for different scenarios. These results indicate that, in terms of RMSE, mixed forests and the forests with AGB greater than 160 Mg/ha had the highest RMSE values for different datasets. However, because of the different average AGB values for the forest types and AGB ranges, bamboo forests and the forests with AGB less than 40 Mg/ha had the largest values of RMSEr. Although ANN provided the best overall estimation results for different datasets, LR provided the best estimation for broadleaf and bamboo forests and for the forests with an AGB range of 40–120 Mg/ha when Landsat data were used. For the ALOS PALSAR data, SVR provided the best estimation when the AGB range was 40–160 Mg/ha. Table 7 indicates that no dataset or modeling algorithm is optimal for all forest types or for all AGB ranges, implying the necessity to develop AGB estimation models according to specific forest types.
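The breakdown of errors by AGB range can be sketched as follows. The bin edges (40, 120, and 160 Mg/ha) are taken from the ranges mentioned in the text; the existence of a separate 120–160 bin, the handling of boundary values, the function name, and all data values are assumptions for illustration. The synthetic predictions are deliberately too high for small AGB and too low for large AGB, mimicking the pattern described above.

```python
import math

# (lower bound inclusive, upper bound exclusive, label); edges assumed from the text.
AGB_BINS = [(0, 40, "<40"), (40, 120, "40-120"), (120, 160, "120-160"),
            (160, math.inf, ">160")]

def errors_by_agb_range(predicted, observed, bins=AGB_BINS):
    """Return {label: (RMSE, RMSEr)} computed separately within each AGB range."""
    results = {}
    for low, high, label in bins:
        pairs = [(p, o) for p, o in zip(predicted, observed) if low <= o < high]
        if not pairs:
            continue  # no test plots fall in this range
        n = len(pairs)
        rmse = math.sqrt(sum((p - o) ** 2 for p, o in pairs) / n)
        mean_observed = sum(o for _, o in pairs) / n
        results[label] = (rmse, 100.0 * rmse / mean_observed)
    return results

# Synthetic reference and predicted AGB (Mg/ha), not data from the study.
observed = [20.0, 30.0, 80.0, 100.0, 140.0, 200.0, 220.0]
predicted = [40.0, 50.0, 85.0, 95.0, 130.0, 160.0, 170.0]

by_range = errors_by_agb_range(predicted, observed)
highest_rmse = max(by_range, key=lambda label: by_range[label][0])
highest_rmser = max(by_range, key=lambda label: by_range[label][1])
```

In this toy case the highest RMSE falls in the range above 160 Mg/ha while the highest RMSEr falls in the range below 40 Mg/ha, because RMSEr divides by a small mean AGB in the lowest bin.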

Comparative Analysis of AGB Modeling Results Based on Stratification of Forest Types
The AGB models derived using five algorithms and three datasets for four forest types were compared (Tables 8 and 9). The texture measures mean, correlation, or both were involved in almost all AGB models, implying that the textures contributed significantly to improving the predictions of AGB. When Landsat TM images alone or their combination with PALSAR data were utilized, the TM spectral bands 5 and 7 and textures were frequently included in the AGB models, implying the significant roles of spectral variables in AGB estimation modeling. In the coniferous forest AGB models, the relevant Landsat TM spectral variables were often selected, meaning that these variables had a potential influence on the predictions of AGB, mainly because the coniferous forests in this study area consisted mainly of pine and Chinese fir plantations characterized by simple forest canopy structures. This conclusion is similar to those from tropical forest regions of the Amazon, where spectral responses were more important than textures for the forest sites with relatively simple forest stand structure [4,54]. On the other hand, the texture measures were involved in the models of other forest types, including broadleaf forests and mixed forests that had multiple canopy layers and complex canopy structures. This implies that the textures could account for the complicated forest structures [4]. If PALSAR data were used alone, the PALSAR-derived textures were added into almost all the models, indicating they had great potential to improve the predictions of AGB models. In the AGB models of broadleaf forests built using the PALSAR data and LR, only one texture was selected, which indicates that the PALSAR data were not appropriate for developing AGB models of the broadleaf forests.
The AGB predictions of the aforementioned models were assessed based on the values of RMSE and RMSEr according to forest types and AGB ranges (Table 10). Compared with those from the models without stratification of forest types (Table 7), most of the RMSE and RMSEr values from the models with stratification of forest types were considerably reduced (Table 10) for three datasets and five methods, and this was especially true for the combination of Landsat TM and PALSAR data. For example, the overall RMSE and RMSEr values decreased from 30.3 Mg/ha and 31.5% to 26.1 Mg/ha and 27.1%, respectively, for the RF method. For the AGB models of mixed forests, the values of RMSE decreased from a range of 31.7–35.6 Mg/ha to a range of 24.7–26.5 Mg/ha, implying a statistically significant reduction of RMSE due to the stratification of forest types. For the models of bamboo forests, all the methods except LR generated significantly smaller values of RMSE and RMSEr after the stratification of forest types than those without the stratification. The reduction of errors due to the stratification was also noticed in the AGB models of coniferous forests but not in the models of broadleaf forests.
Comparison of the results between Tables 7 and 10 indicates that stratification of forest types was effective in reducing RMSE or RMSEr in different AGB ranges using these modeling algorithms. For both non-stratification and stratification situations, a similar conclusion is that AGB estimation had the lowest RMSE or RMSEr when AGB was in the range of 40–120 Mg/ha, had the highest RMSE values when AGB was greater than 160 Mg/ha, and had the highest RMSEr values when AGB was less than 40 Mg/ha. The stratification was especially valuable when AGB was greater than 160 Mg/ha. For example, for the Landsat TM image, RMSE values were reduced from 61 Mg/ha (see Table 7) to 53.4–53.9 Mg/ha (see Table 10) when kNN or LR was used; and for ALOS PALSAR data, the RMSE values were reduced from 67.8–78.6 Mg/ha to 58.5–64.3 Mg/ha for different modeling algorithms. For the combined Landsat TM and ALOS PALSAR data, the stratification reduced RMSE from 58.6 Mg/ha (Table 7) to 41.5 Mg/ha (Table 10) using LR when AGB was greater than 160 Mg/ha. This research implies that stratification is especially valuable for forest sites having high AGB amounts.
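To put the reported reductions on a common scale, the short sketch below converts before and after RMSE values quoted in the text into percentage reductions. The function name and dictionary labels are illustrative; only the numeric pairs come from the text.

```python
def percent_reduction(before, after):
    """Relative reduction, in percent, from `before` to `after`."""
    return 100.0 * (before - after) / before

# (RMSE without stratification, RMSE with stratification) in Mg/ha, as quoted above.
reported = {
    "RF, overall": (30.3, 26.1),
    "LR, TM + PALSAR, AGB > 160 Mg/ha": (58.6, 41.5),
}

reductions = {name: percent_reduction(b, a) for name, (b, a) in reported.items()}
for name, value in reductions.items():
    print(f"{name}: {value:.1f}% lower RMSE after stratification")
```

Under this calculation the overall RF error drops by roughly 14%, whereas the LR error for AGB above 160 Mg/ha drops by roughly 29%, which is consistent with the statement that stratification helps most at high AGB.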

Selection of Suitable Variables for AGB Modeling
Many remote sensing variables, such as spectral bands, vegetation indices, textures, and image transformations (e.g., principal component analysis, tasseled cap, minimum noise fraction) [2], can be used as potential predictors for AGB modeling. The variables can be extracted at pixel, subpixel, and object levels, with pixel-level variables being the most commonly used. However, only a limited number of the remote sensing variables are useful because of their high correlations. For example, the near-infrared band, the normalized difference vegetation index, and greenness from the tasseled cap transform carry similar vegetation information, so including all of them cannot improve AGB modeling performance [45]. In reality, the roles of spectral bands and textures in AGB modeling depend on the complexity of forest structures [2,4]. For example, this research indicated that the SWIR bands (e.g., Landsat TM spectral bands 5 and 7) play a more important role than the visible and near-infrared bands. This conclusion is similar to that from previous studies on the moist tropical forests in the Amazon [4,45] and the Mediterranean forests [55]. The more important role of SWIR than near-infrared and visible spectral bands in AGB modeling may be due to the fact that SWIR is more sensitive to the moisture and shade components inherent in the forest stand structure and that atmospheric conditions have less impact on SWIR spectral signatures than on those of shorter-wavelength (e.g., near-infrared and visible) spectral bands. In particular, for the complex forest types in tropical and subtropical regions, SWIR is more valuable in AGB modeling than shorter-wavelength spectral bands because the near-infrared band shows wide spectral variation and the visible bands are less sensitive to forest spectral signatures [3,4]. The higher data saturation values of AGB in SWIR than in visible and near-infrared spectral bands further confirm the more important role of SWIR in AGB modeling [3].
This research also implied that textures are another group of important variables for AGB modeling; in particular, textures may play more important roles than spectral bands in forest sites with complex forest stand structures. This was also confirmed in the Brazilian tropical forests [54] and subtropical forests [3,6]. However, it is critical to identify an optimal combination of textural images [5]. Because textural images and pixel-level spectral responses capture different features, combinations of these variables have proven helpful in improving predictions of AGB [3,4], and this research confirmed this conclusion no matter which modeling algorithm (LR or machine learning) was used. This research also indicates that AGB modeling requires multiple variables from remote sensing data, no matter which sensor data and which modeling algorithms were used. This implies that a single variable alone cannot effectively capture the complexity of forest stand structure and, thus, cannot provide satisfactory AGB estimation performance.
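One simple way to see why highly correlated variables add little is to screen candidate predictors by their pairwise Pearson correlation. The sketch below is an illustration only and is not the variable selection procedure used in this research: the greedy rule, the 0.9 threshold, the function names, and the reflectance values are all assumptions. NDVI is computed with its standard definition, (NIR − Red) / (NIR + Red).

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def drop_correlated(variables, threshold=0.9):
    """Greedy screen: keep a variable only if |r| with every kept variable is below threshold."""
    kept = []
    for name, values in variables.items():
        if all(abs(pearson(values, variables[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Synthetic surface reflectance for five plots, for illustration only.
red = [0.08, 0.07, 0.06, 0.05, 0.04]
nir = [0.30, 0.35, 0.40, 0.45, 0.50]
swir2 = [0.12, 0.20, 0.10, 0.18, 0.15]
ndvi = [(n - r) / (n + r) for n, r in zip(nir, red)]

candidates = {"nir": nir, "ndvi": ndvi, "swir2": swir2}
selected = drop_correlated(candidates)
```

In this toy example NDVI is dropped because it is almost perfectly correlated with the near-infrared band, while the SWIR band is retained because it carries different information.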
This research indicated that ALOS PALSAR data had poorer performance in AGB estimation than Landsat TM data no matter which modeling algorithm was used, and that the combination of TM and PALSAR either did not improve AGB estimation or had only limited effects. A similar conclusion was also obtained in tropical forests [56–58]; for example, Hame et al. found that ALOS AVNIR data had better performance in AGB estimation than ALOS PALSAR data, and that a combination of both data could not improve AGB estimation in the tropical forests in Laos [56]. This conclusion seems to contradict our initial hypothesis that synthetic aperture radar (SAR) L-band data can provide the vertical structure of the forest and thus should have better performance than the optical sensor data. This is because SAR data represent the roughness of the forest canopy, and AGB is not directly related to the forest surface roughness, resulting in poor AGB modeling performance [6]. The poorer performance of SAR data than of optical sensor data was also confirmed in forest classification in the moist tropical regions in the Amazon [57,58]. Because Landsat TM and ALOS PALSAR data characterize forest structure differently, we expected that the combination of both data might improve AGB modeling performance. However, this research indicated that the combination, as either extra bands or data fusion, has limited effects in improving AGB modeling. The possible reasons may be that (1) the relatively coarse spatial resolution of ALOS-1 PALSAR data (25-m spatial resolution in this research) cannot effectively capture the forest stand structures of different forest types and (2) the data saturation in both Landsat and PALSAR data impedes the improvement of estimation [3,6]. Compared to the ALOS-1 PALSAR data used in this research, the current ALOS-2 PALSAR data with higher spatial resolution may provide richer forest stand structure information and may improve AGB modeling performance.

Selection of Modeling Algorithms
LR is commonly used for AGB modeling [2], and this research indicated its good performance, especially when AGB falls within 40–120 Mg/ha. This research also showed that not all machine learning and nonparametric algorithms led to more accurate AGB estimates than LR, although they have some advantages in data selection. The time needed to optimize the parameters of these models was much longer than that required by LR. In addition, their requirements for sample plots (e.g., number of sample plots and the data range) were much stricter than those of LR; this is especially true for RF and kNN, which limited their ability to predict very low or very high AGB values, as shown in Table 5. However, when AGB values are very large or very small, machine learning and nonparametric algorithms can indeed improve AGB modeling performance. For example, this research indicated that when AGB values were greater than 120 Mg/ha, especially greater than 160 Mg/ha, ANN based on Landsat TM imagery improved AGB estimation by 3.7–6.3% of RMSEr compared with LR. When AGB values were less than 40 Mg/ha, RMSEr was reduced from 96.8% using LR to 82.2% using kNN. This implies that proper selection of modeling algorithms is also valuable in improving AGB modeling, especially for the AGB values that are very low or very high.
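The sensitivity of kNN (and, similarly, RF) to the data range of the sample plots follows from how these algorithms predict: the estimate is an average of AGB values from training plots, so it cannot exceed the largest or fall below the smallest value in the sample. The sketch below contrasts this with a fitted line, which can extrapolate. It uses a single synthetic predictor and synthetic AGB values; the function names are hypothetical and nothing here reproduces the models in this research.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

def knn_mean(x, xs, ys, k):
    """Inverse-distance weighted mean of the k nearest training responses."""
    nearest = sorted(zip((abs(x - xi) for xi in xs), ys))[:k]
    if nearest[0][0] == 0:
        return nearest[0][1]
    weights = [1.0 / d for d, _ in nearest]
    return sum(w * y for w, (_, y) in zip(weights, nearest)) / sum(weights)

# Synthetic training plots: one predictor and AGB (Mg/ha) up to 160 only.
predictor = [1.0, 2.0, 3.0, 4.0, 5.0]
agb = [40.0, 70.0, 100.0, 130.0, 160.0]

slope, intercept = fit_line(predictor, agb)
new_x = 7.0  # a pixel outside the range covered by the plots
lr_estimate = slope * new_x + intercept      # extrapolates beyond 160 Mg/ha
knn_estimate = knn_mean(new_x, predictor, agb, k=3)  # capped by the training maximum
```

This is consistent with the larger minimum and smaller maximum predictions reported for RF and kNN, and it underlines the need for sample plots that cover the full AGB range when these algorithms are used.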

Potential Solutions to Improve AGB Estimation
This research indicated that no single algorithm was optimal for all forest types, and that bamboo forests were an important contributor to poor estimation accuracy. Table 7 implies that different remote sensing data and modeling algorithms should be used for different forest types to improve AGB estimation. That is, it is valuable to develop specific AGB models for various forest types. However, this approach poses a major challenge in AGB modeling because of the number of sample plots required and the difficulty of transferring AGB models across a large area. Table 9 confirms the value in improving AGB modeling using the stratification of forest types. This conclusion is similar to our previous finding that stratification of forest types was an effective approach to improve AGB estimation [3]. This research also indicated that AGB range was another factor influencing the AGB estimation. Stratification based on AGB range was especially valuable for improving the predictions of AGB with very high or very low values [9]. However, the challenge of using stratification is the requirement of a sufficient number of sample plots, which is often difficult to meet considering the cost and intensive labor, especially when the AGB estimation is based on historical data [2].
Previous research and this case study indicate that small or large AGB values are the major factors influencing AGB modeling performance. For the forest sites with small AGB values, the complex composition of land cover types, such as bare soils, grass, and/or shrubs, is the major reason. Spectral unmixing may be one solution [54]. For the forest sites with large values of AGB, the data saturation in optical sensors such as Landsat imagery and in radar such as ALOS PALSAR [6] is the major reason for considerable underestimation of AGB. To date, there are no effective approaches to solve the data saturation problem, except the improvement of radiometric resolution and the use of stereo images and lidar data [2]. In recent years, airborne lidar and space-borne stereo images have become easily available, as well as other high spatial resolution images such as Worldview, Pleiades, and Quickbird. Therefore, more research should be directed toward integrating multiple remote sensing data for improving AGB estimation.
The representativeness of sample plots is fundamental for AGB modeling. The uncertainty of sample plot data could come from the sampling approach, plot size, the allometric equations for AGB calculation, and the measurements of tree attributes during fieldwork. Sample plots that do not have good representativeness need to be removed. Compared to previous work by Zhao et al. [3,6], this research reduced the RMSEr based on LR for the forest sites with AGB less than 40 Mg/ha from 137% to 96.8% by refining the sample plots. The overall RMSEr was also reduced from 32% in Zhao et al. [6] to 30.5% in this research. If a sufficient number of sample plots are collected for the forest sites with high or low AGB values, development of AGB estimation models based on AGB ranges may considerably improve AGB estimation accuracy.

Conclusions
This research selected a subtropical forest region in Zhejiang Province, China, as a case study to explore AGB estimation through a comparative analysis of different modeling algorithms (i.e., RF, ANN, SVR, kNN, and LR) based on Landsat TM, ALOS PALSAR, and their combination under stratification and non-stratification of forest types. The results indicate the following: (1) Landsat TM imagery provided more accurate estimates of AGB than ALOS PALSAR, and the combination of TM and PALSAR had limited effects on improving AGB estimation. (2) Overestimation for small AGB values and underestimation for large AGB values were major problems when using the optical (e.g., Landsat) and radar (e.g., ALOS PALSAR) data. (3) LR was still an important tool for AGB modeling, especially for the AGB range of 40–120 Mg/ha; machine learning and nonparametric algorithms had limited effects on improving AGB estimation; however, ANN was relatively the best model in this study. (4) Overall, RF and kNN were not suitable for AGB prediction for the forest sites with relatively low (e.g., less than 40 Mg/ha) or high (e.g., greater than 160 Mg/ha) AGB values. (5) Forest types and AGB ranges were important factors influencing AGB modeling performance; stratification based on both forest types and AGB ranges might provide great potential for improving AGB estimation if a sufficient number of sample plots were available. (6) More research should focus on improving AGB estimation for forest sites with relatively low (e.g., less than 40 Mg/ha) and high (e.g., greater than 120 Mg/ha) AGB values. However, Landsat or ALOS PALSAR data alone cannot solve this problem. As multiple source data such as lidar, stereo images, SAR, optical sensors, and ancillary data (e.g., DEM, climate, soil) become more easily available, more research should focus on the effective integration of different source data for developing AGB estimation models under stratification based on both forest type and AGB range.

Figure 1. (a) Zhejiang Province in Eastern China; (b) elevations of the study area; and (c) forest classification map for the study area based on the 2010 Landsat 5 Thematic Mapper and spatial locations of sample plots.

Figure 2. Strategy of identifying suitable biomass modeling approaches through a comparative analysis of modeling algorithms using different source data under non-stratification and stratification of forest types (Note: HH, horizontally transmitted and received; HV, horizontally transmitted and vertically received; RF, Random Forest; ANN, artificial neural networks; SVR, support vector regression; kNN, k-nearest neighbor). The major steps include (1) data preparation of different sources (i.e., geometric registration between Landsat, ALOS PALSAR, and digital elevation model (DEM); atmospheric and topographic corrections); (2) selection of the variables from Landsat, ALOS PALSAR, and DEM data; (3) development of AGB estimation models using different algorithms (i.e., LR, RF, ANN, SVR, and kNN) based on stratification and non-stratification; and (4) comparison and evaluation of the AGB modeling results.

Figure 3. Concept for showing the allocation of sample plots and collection of field survey data.

Figure 4. The relationships between AGB estimates from different models and plot reference values.

Table 1. Datasets used in this research.

Table 2. Statistics of sample plot data used in this research.
Columns: No. of Total Samples; AGB Range (Mg/ha); Mean (Mg/ha); Standard Deviation; No. of Training Samples; No. of Test Samples.
Stratification based on forest type: Coniferous; 329; 32.1–180.1; 101.

Table 3. The regression models for AGB estimation based on the non-stratification scenario.

Table 4. The identified variables for AGB estimation modeling using the RF approach based on the non-stratification scenario.
bi, spectral band i; Tbiwjxx, a texture image developed using the texture measure xx (xx can be such texture measures as ME (mean), VA (variance), HO (homogeneity), CON (contrast), DI (dissimilarity), EN (entropy), SM (second moment), COR (correlation)) on spectral band i (bi) with a window size of j × j pixels (wj); HH (horizontally transmitted and received) and HV (horizontally transmitted and vertically received) represent two polarization options of the PALSAR image.

Table 5. Summary of statistical results of the predicted AGB images.

The RMSE and RMSEr results (Table 6) quantitatively confirmed that Landsat TM produced more accurate AGB estimates than ALOS PALSAR data no matter which modeling algorithms were used. The combination of Landsat and ALOS PALSAR data provided slightly better performance for LR, similar performance for SVR and ANN, and worse performance for RF and kNN. Overall, Landsat TM imagery had smaller RMSE values of 27.7–29.3 Mg/ha compared with 30.3–33.7 Mg/ha for ALOS PALSAR.

Table 6. Summary of accuracy assessment results based on different scenarios.

Table 7. Summary of RMSE (Mg/ha) and RMSEr (%) results from different scenarios under non-stratification conditions.

Table 8. Linear regression models for AGB estimation based on different data sources under stratification of forest type.

Table 9. The identified variables using the RF approach based on different data sources under stratification of forest types. BLF, broadleaf forest; CFF, coniferous forest; BBF, bamboo forest; Sbi, spectral band i; Tbiwjxx, a texture image that was developed using the texture measure xx (xx can be such texture measures as ME (mean), VA (variance), HO (homogeneity), CON (contrast), DI (dissimilarity), EN (entropy), SM (second moment), COR (correlation)) on spectral band i (bi) with a window size of j × j pixels (wj); HH (horizontally transmitted and received) and HV (horizontally transmitted and vertically received) represent two polarization options of the PALSAR image.

Table 10. Summary of RMSE (Mg/ha) and RMSEr (%) results from different scenarios under stratification of forest types.