Mapping Forest Ecosystem Biomass Density for Xiangjiang River Basin by Combining Plot and Remote Sensing Data and Comparing Spatial Extrapolation Methods

The distribution of forest biomass in a river basin usually has obvious spatial heterogeneity in relation to the locations of the upper and lower reaches of the basin. In the subtropical region of China, a large amount of forest biomass, comprising diverse forest types, plays an important role in maintaining the balance of the regional carbon cycle. However, accurately estimating forest ecosystem aboveground biomass density (AGB) and mapping its spatial variability at a scale of river basin remains a great challenge. In this study, we attempted to map the current AGB in the Xiangjiang River Basin in central southern China. Three approaches, including a multivariate linear regression (MLR) model, a logistic regression (LR) model, and an improved k-nearest neighbors (kNN) algorithm, were compared to generate accurate estimates and their spatial distribution of forest ecosystem AGB in the basin. Forest inventory data from 782 field plots across the basin and remote sensing images from Landsat 5 in the same period were combined. A stepwise regression method was utilized to select significant spectral variables and a leave-one-out cross-validation (LOOCV) technique was employed to compare their predictions and assess the methods. Results demonstrated the high spatial heterogeneity in the distribution of AGB across the basin. Moreover, the improved kNN algorithm with 10 nearest neighbors showed stronger ability of spatial interpolation than other two models, and provided greater potential of accurately generating population and spatially explicit predictions of forest ecosystem AGB in the complicated basin.


Introduction
Forest biomass is one of the important variables for the quantitative study of structures and functions of forest ecosystems [1].In the context of the Kyoto Protocol, forest biomass density is one the key parameters for evaluating the potential of forest carbon sinks and studying global climate change in terms of offsetting greenhouse gas emissions [2].However, accurately estimating and mapping forest ecosystem biomass density at large scales such as regionally, nationally and globally, is very challenging for the study of forest carbon sinks [3].
Traditional techniques based on field measurements are accurate, but time-consuming, labor-intensive and destructive to forest ecosystems [4,5].In fact, the methods work for small scales only.Huxley [6] mathematically proposed an idea of using relative growth rates of biomass components that was then widely utilized by researchers to estimate biomass values of tree components (stem, branch and leaves) [7][8][9][10][11][12][13]. Subsequent approaches, such as the method proposed in Intergovernmental Panel on Climate Change (IPCC) [14], that is, using biomass conversion or expansion factors [15,16], were proposed to estimate forest ecosystem biomass density at large scales.However, these models proved to be poor in transferability [17].Remote sensing technology has been widely applied for mapping forest ecosystem biomass density for regions, countries, and even the whole world due to its characteristics of rapidly, dynamically and repeatedly acquiring images of large areas [3,[18][19][20].The data used for remote sensing based prediction of forest ecosystem biomass include optical images [21,22], radar images [23,24], and LiDAR data [25][26][27][28].However, optical remote sensing has serious obstacles such as cloud cover and saturation of reflectance due to high biomass and complexity of multi-layer forests [29][30][31][32].A universal and accurate estimation model for forest ecosystem biomass has not been found [33].
There are three kinds of methods for forest ecosystem biomass estimation using remote sensing based empirical models (including parametric and non-parametric models), process models and simulation algorithms [33,34].Wang et al. [34,35] proposed image-aided co-simulation algorithms in which spatial correlations of interest variables are taken into account.Both process models and simulation algorithms are advanced, but complicated and computation intensive, and thus difficult to use for generating spatially explicit estimates of forest ecosystem biomass density at large scales [34,35].Moreover, the empirical models have been most widely used, but their prediction ability is often strictly limited by the data used.In addition, the regression models require normal distribution of variables and sometimes produce negative values [34,35].Wang et al. [36] proposed a classification method for hierarchical multinomial logistic regression of hyperspectral images to improve the accuracy of regression modeling.In contrast, non-parametric empirical models are often more effective in mapping forest ecosystem biomass density due to the use of more flexible algorithms, no assumptions to be made about the forms of the data and distributions of variables, and no necessary inputs of complex parameters.These methods include k-nearest neighbors (kNN) method [24,37,38], artificial neural network (ANN) [39,40], random forest [41][42][43][44], support vector machine (SVM) [45,46], and maximum entropy [47][48][49].Among the non-parametric models, kNN has become popular in recent years mainly due to its ability of spatial interpolation for complicated forest ecosystems in terms of topographic features and forest canopy structures [37,38,50,51].
Although various methods have been developed and used for mapping biomass density of forest ecosystems using different remote sensing data, there have been few reports that indicate which method can lead to most accurate population estimates and their spatial distributions at watershed scales [52].The aim of this study was to demonstrate a novel and cost-effective mapping method of above-ground biomass (AGB) for forest ecosystems in the Xiangjiang River basin based on a combination of existing forest inventory sample plot data and free Landsat images.This was achieved by improving k-nearest neighbors (kNN) algorithm using correlations between spectral variables and forest ecosystem AGB to calculate weighted spectral distances of unknown pixels to each sample plot.This method was called correlation-weighted kNN (CW-kNN) and its results with different k nearest neighbors were compared with those from a widely used multivariate linear regression (MLR) model, a logistic regression (LR) model, and a general kNN algorithm (g-kNN) using a leave-one-out cross-validation (LOOCV) technique.It has to be pointed out that, in this study, the abbreviation AGB actually implied aboveground biomass density of forest ecosystems.

Study Area
The Xiangjiang River is the largest tributary of the Dongting Lake river system in the Yangtze River Basin.The Xiangjiang River Basin is located mainly in Hunan Province (110 • 31 -114 • 15 E, 24 • 31 -29 • 52 N), China (Figure 1).The basin has complicated topographic features, and is surrounded by mountains and hills in the eastern, southern, and western parts of the basin and dominated by flat terrains in the central and northern parts.The basin has a humid mid-subtropical monsoon climate with distinct seasonality [53,54].The mean annual temperature is approximately 17.5 • C with a mean annual precipitation of 1400 mm.The annual precipitation distribution is uneven with rainfall mainly in spring and summer.The annual average evaporation is about 1200 mm, with a frost-free period of 270-311 days.The main soil types include red soil, paddy soil, purple soil, yellow soil, lime (rock) soil, and yellow brown soil [55].The typical vegetation in the basin is subtropical evergreen broad-leaved forests with a forest cover percentage of 54.4%.The basin has been subject to population increase and anthropogenic disturbances and it is necessary to know the spatial distribution and variability of the forest ecosystem AGB, and its dynamic change and trend in the basin.

Study Area
The Xiangjiang River is the largest tributary of the Dongting Lake river system in the Yangtze River Basin.The Xiangjiang River Basin is located mainly in Hunan Province (110°31′-114°15′E, 24°31′-29°52′N), China (Figure 1).The basin has complicated topographic features, and is surrounded by mountains and hills in the eastern, southern, and western parts of the basin and dominated by flat terrains in the central and northern parts.The basin has a humid mid-subtropical monsoon climate with distinct seasonality [53,54].The mean annual temperature is approximately 17.5 °C with a mean annual precipitation of 1400 mm.The annual precipitation distribution is uneven with rainfall mainly in spring and summer.The annual average evaporation is about 1200 mm, with a frost-free period of 270-311 days.The main soil types include red soil, paddy soil, purple soil, yellow soil, lime (rock) soil, and yellow brown soil [55].The typical vegetation in the basin is subtropical evergreen broad-leaved forests with a forest cover percentage of 54.4%.The basin has been subject to population increase and anthropogenic disturbances and it is necessary to know the spatial distribution and variability of the forest ecosystem AGB, and its dynamic change and trend in the basin.

Forest Inventory Data
The forest inventory database of the Xiangjiang River Basin was obtained and provided information on the area size, spatial distribution and condition of forests.The information consists of detailed coordinates, canopy cover percentage, average shrub height and cover percentage, vegetation height, herbaceous cover percentage, stand types, dominant tree species, average

Forest Inventory Data
The forest inventory database of the Xiangjiang River Basin was obtained and provided information on the area size, spatial distribution and condition of forests.The information consists of detailed coordinates, canopy cover percentage, average shrub height and cover percentage, vegetation height, herbaceous cover percentage, stand types, dominant tree species, average diameter at breast height, average stand height, number of non-timber forest trees, number of trees surrounding villages, houses, roads, and water bodies, number of bamboo trees, and number of miscellaneous bamboo stalks.The plot-level data were collected from a total of 782 fixed or permanent plots across the Basin in the summer of 2009, with a sampling interval of 4 km × 8 km and a square plot area of 0.067 ha (25.82 m × 25.82 m).
By using the regression models reported by Li et al. [56], plot-level biomass was estimated according to tree species groups, and the biomass of non-timber forests was estimated with average ground diameters.The biomass of shrubs and herbs was obtained by using the models proposed by Fan et al. [57] and estimating biomass of individual plants with regression equations based on height of shrubs and herbs.The total biomass of shrubs and herbs was obtained by summing the values of individual plant biomass and converting the values per unit.The values of biomass for the sample plots of mixed coniferous forests, mixed broad-leaved forests, and mixed coniferous and broad-leaved forests, were estimated by using tree species biomass and corresponding mixed percentage.The average AGB for all forest inventory field plots was 64.53 Mg/ha with a standard deviation of 46.80 Mg/ha and a confidence interval of 61.25 Mg/ha to 67.81 Mg/ha at a significant level of 0.05.The spatial distribution of the AGB estimates for the forest inventory sampled plots is shown in Figure 2. Relatively, the plot AGB biomass had larger values in northeastern, eastern, southern and southwestern mountainous and hilly areas, and smaller values in the central and northern flat areas.In this study, the plot AGB values were used as reference data.
diameter at breast height, average stand height, number of non-timber forest trees, number of trees surrounding villages, houses, roads, and water bodies, number of bamboo trees, and number of miscellaneous bamboo stalks.The plot-level data were collected from a total of 782 fixed or permanent plots across the Basin in the summer of 2009, with a sampling interval of 4 km × 8 km and a square plot area of 0.067 ha (25.82 m × 25.82 m).
By using the regression models reported by Li et al. [56], plot-level biomass was estimated according to tree species groups, and the biomass of non-timber forests was estimated with average ground diameters.The biomass of shrubs and herbs was obtained by using the models proposed by Fan et al. [57] and estimating biomass of individual plants with regression equations based on height of shrubs and herbs.The total biomass of shrubs and herbs was obtained by summing the values of individual plant biomass and converting the values per unit.The values of biomass for the sample plots of mixed coniferous forests, mixed broad-leaved forests, and mixed coniferous and broad-leaved forests, were estimated by using tree species biomass and corresponding mixed percentage.The average AGB for all forest inventory field plots was 64.53 Mg/ha with a standard deviation of 46.80 Mg/ha and a confidence interval of 61.25 Mg/ha to 67.81 Mg/ha at a significant level of 0.05.The spatial distribution of the AGB estimates for the forest inventory sampled plots is shown in Figure 2. Relatively, the plot AGB biomass had larger values in northeastern, eastern, southern and southwestern mountainous and hilly areas, and smaller values in the central and northern flat areas.In this study, the plot AGB values were used as reference data.

Remote Sensing Data
The remote sensing images used in this study were obtained from Landsat 5 Thematic Mapper (TM) because the Landsat 5 data completely covered the study area and the acquired images were consistent with the field plot data collected in the summer of 2009.The images had six bands consisting of bands 1-5 and band 7 at a spatial resolution of 30 m. Seven adjacent and cloud-free images from the summer of 2009 were selected with path 123 and rows 40-43 and path 124 and rows 41-43.The images were susceptible to interference and distortion due to the sensor response characteristics and atmospheric absorption and scattering as well as other random factors.The images were enhanced by radiometric, atmospheric and geometric correction, as well as ortho rectification and then mosaicked.The pixel digital numbers of the images were first converted to

Remote Sensing Data
The remote sensing images used in this study were obtained from Landsat 5 Thematic Mapper (TM) because the Landsat 5 data completely covered the study area and the acquired images were consistent with the field plot data collected in the summer of 2009.The images had six bands consisting of bands 1-5 and band 7 at a spatial resolution of 30 m. Seven adjacent and cloud-free images from the summer of 2009 were selected with path 123 and rows 40-43 and path 124 and rows 41-43.The images were susceptible to interference and distortion due to the sensor response characteristics and atmospheric absorption and scattering as well as other random factors.The images were enhanced by radiometric, atmospheric and geometric correction, as well as ortho rectification and then mosaicked.The pixel digital numbers of the images were first converted to the values of radiance at the sensor's aperture and further to the values at-satellite reflectance using the parameters of Landsat 5 and solar zenith angles.The spectral values at-satellite reflectance were then converted to the reflectance values at ground surface.Moreover, the effects of slope, aspect and shade on the images were eliminated by conducting a topographic correction using Minnaert model.Finally, all the images were geo-referenced to the Universal Transverse Mercator (UTM) projection and coordinate system using a first-order affine transformation and a root mean square error (RMSE) of less than one pixel was yielded.Figure 3 shows a false color composite image by combining Landsat TM band 3, band 5 and band 4 as blue, green and red, respectively.
Remote Sens. 2017, 9, 241 5 of 22 the values of radiance at the sensor's aperture and further to the values at-satellite reflectance using the parameters of Landsat 5 and solar zenith angles.The spectral values at-satellite reflectance were then converted to the reflectance values at ground surface.Moreover, the effects of slope, aspect and shade on the images were eliminated by conducting a topographic correction using Minnaert model.Finally, all the images were geo-referenced to the Universal Transverse Mercator (UTM) projection and coordinate system using a first-order affine transformation and a root mean square error (RMSE) of less than one pixel was yielded.Figure 3 shows a false color composite image by combining Landsat TM band 3, band 5 and band 4 as blue, green and red, respectively.

Extraction and Selection of Spectral Variables
In this study, a total of 105 spectral variables were extracted, including normalized difference vegetation index (NDVI), infrared index II, spectral vegetation index (SVI), four soil-adjusted vegetation indices (SAVIl, l = 0.1, 0.25, 0.3 and 0.5), modified soil-adjusted vegetation index (MSAVI), modified normalized difference vegetation index (MNDVI), difference vegetation index (DVI), transformed vegetation index (TVI), reduced simple ratio (RSR), atmospherically resistant vegetation index (ARVI), visible atmospherically resistant index (VARI), enhanced vegetation index (EVI), thirty-two-band ratio indices and sixty-three-band ratio indices.The spectral variables were selected to capture the characteristics and canopy structures of complicated forest ecosystems and to reduce the effects of slope and aspects [58].The remote sensing variables as the independent variables of models might be significantly correlated with each other and the correlations would lead to the duplication of information and interfere the performance of the models.Thus, the spectral variables were first screened using significant coefficients of their correlations with the dependent variable-the plot AGB from the forest inventory database (FID) at the significance level of 0.05.A stepwise regression method with a variance inflation factor (VIF) ≥10 was utilized to diagnose the possible interference of the correlations among the spectral variables; that is, multicollinearity.The used VIF value was determined based on a rule of thumb, literature and examination of data [59,60].

Extraction and Selection of Spectral Variables
In this study, a total of 105 spectral variables were extracted, including normalized difference vegetation index (NDVI), infrared index II, spectral vegetation index (SVI), four soil-adjusted vegetation indices (SAVI l , l = 0.1, 0.25, 0.3 and 0.5), modified soil-adjusted vegetation index (MSAVI), modified normalized difference vegetation index (MNDVI), difference vegetation index (DVI), transformed vegetation index (TVI), reduced simple ratio (RSR), atmospherically resistant vegetation index (ARVI), visible atmospherically resistant index (VARI), enhanced vegetation index (EVI), thirty-two-band ratio indices and sixty-three-band ratio indices.The spectral variables were selected to capture the characteristics and canopy structures of complicated forest ecosystems and to reduce the effects of slope and aspects [58].The remote sensing variables as the independent variables of models might be significantly correlated with each other and the correlations would lead to the duplication of information and interfere the performance of the models.Thus, the spectral variables were first screened using significant coefficients of their correlations with the dependent variable-the plot AGB from the forest inventory database (FID) at the significance level of 0.05.A stepwise regression method with a variance inflation factor (VIF) ≥10 was utilized to diagnose the possible interference of the correlations among the spectral variables; that is, multicollinearity.The used VIF value was determined based on a rule of thumb, literature and examination of data [59,60].

Multivariate Linear Regression (MLR) and Logistic Regression (LR) Model
In this study, both multivariate linear regression (MLR) model and logistic regression (LR) model were used to account for the relationship of forest ecosystem AGB with the spectral variables selected by stepwise regression analysis [61,62].MLR is the most widely used method, but many studies have shown that the MLR has several shortcomings such as leading to negative and extremely large estimates.In this study, it was employed mainly for the purpose of comparison.LR is a method mainly used for analysis of binary dependent variables and probability prediction in which the predictions range from zero to one [63,64].LR model can be expressed in its simplest form as proposed by Atkinson and Massari [63], and Devkota et al. [65].In order to use the LR model to account for the relationship of the selected spectral variables with forest ecosystem AGB, range normalization of the biomass density data was implemented [60].In the present study, the biomass data were transformed into a dimensionless quantity using the method of range normalization [59], which was in accordance with the dependent variable being probability distribution of the logistic model.The LR model was selected and compared with the MLR model, on one hand, to test whether the relationships of the forest ecosystem AGB with spectral variables were linear or non-linear, and on the other hand, to validate if the LR model performed better because of its non-linearity and positive estimates to be created than the MLR model in prediction of the forest ecosystem AGB.

kNN Algorithms
k-nearest neighbors (kNN) is one of non-parametric statistical methods and can be used to simultaneously estimate multiple forest ecosystem variables using the same underlying field dataset [37,38,66,67].In this study, the estimation was conducted in a spectral space based on a spectral distance-weighted algorithm in which the estimate of AGB of pixel p was calculated [37,38,50,67] from Equation (1): where k was the number of plots closest to pixel p in the spectral space, AGB j was the AGB value for the jth plot, and w pj was the weight for the jth nearest plot of pixel p.The plot weights w pj were calculated from Equation (2): The spectral distance metric between pixel p and the pixel corresponding to the jth plot, d pj , was calculated from Equations ( 3) and (4): (3) where d pj was the spectral distance between pixel p and the pixel corresponding to the jth plot, l was the lth remote sensing variable, m was the number of remote sensing variables, x pl was the value of remote sensing variable l for pixel p, x jl was the value of remote sensing variable l for pixel j, v l was the weight for remote sensing variable l on spectral distance, and r l was the correlation coefficient between remote sensing variable l and the AGB of the plot.
In this study, the kNN method was selected mainly because several studies have proven that it is appropriate for estimating and mapping forest ecosystem AGB at a large scale because of its non-parametric characteristics, that is, normal distributions of interest variables are not required [24,37,38,50,67].Moreover, the kNN was improved by using the correlations between the spectral variables and plot AGB to calculate a weighted spectral distance of each unknown pixel to each sample plot.The improved kNN was called correlation-weighted kNN (CW-kNN).Compared with the general kNN algorithm (g-kNN) without the correlation-based weighting, CW-kNN takes into account the importance of spectral variables for mapping forest ecosystem AGB and provides the potential to more accurately account for the difference between one unknown pixel and one sample plot in the spectral space and thus in canopy structure and biomass density of forest ecosystems.In this study, both g-kNN and CW-kNN with four k values (3, 5, 7 and 10 nearest plots) were compared for mapping forest ecosystem AGB.Based on previous studies, using more than 10 neighbors did not produce a greater estimation accuracy [37,38,50,67,68] and, therefore, the maximum k value of 10 was selected.

Leave-One-Out Cross-Validation (LOOCV) and Model Evaluation
The LOOCV technique [69] was applied to evaluate the accuracies of three kinds of methods and corresponding AGB maps.The LOOCV followed the algorithm as reported by Ji et al. [70]: one single plot was withheld as a validation sample and the remaining plots were used to train the models.This step was repeated until each plot was used once as a validation sample.The AGB values of all field plots were then compared with their estimates using the "leave-one-out" training samples.The LOOCV technique has the advantage of providing an unbiased estimation of the prediction error [71].
Moreover, in this study four indices including the coefficient of determination (R 2 ), RMSE, the mean and variance values ( ∧ µ map ,Var map ) of prediction maps, were employed to quantify the errors and to clarify which method performed most accurately in terms of plot and pixel levels.The uncertainty was quantified at the 95% confidence interval.
∧ µ map and Var map were calculated using model-assisted regression estimators [51] from Equations ( 5) and ( 6): ) where ∧ µ map was the mean value of the predicted results at the pixel level in the study area, that is, a AGB map; N was the total number of pixels in the study area; n was the total number of sample plots used for the predictions; ∧ AGB j was the predicted value of each pixel; ε i was the residual between the referenced and predicted value from the same plot i; Var map was the variance of the predicted results for the study area; and p was the number of variables used for modeling, including constant variables.

Independent Variables of Models
A total of 82 spectral variables were significantly correlated with the plot AGB and the absolute values of the correlation coefficients ranged from 0.211 to 0.706.The coefficients of correlation for the five most correlated spectral variables were −0.706, −0.698, 0.697, 0.683, and 0.681, corresponding to SR 314 : band 3/(band 1 + band 4); SR 324 : band 3/(band 2 + band 4); MNDVI: ((band 4 − band 3)/(band 4 + band 3))(1 − (band 5 − band 5 min )/(band 5 max − band 5 min )); SR 436 : band 4/(band 3 + band 7); and NDVI: (band 4 − band 3)/(band 4 + band 3), respectively.It was found that many of the spectral variables were highly correlated with each other.In this study, the stepwise regression analysis with a VIF value of 10 eliminated most of the correlated spectral variables and led to five spectral variables selected, including NDVI; SR 23 : band 2/band 3; SR 415 : band 4/(band 1 + band 5); SR 546 : band 5/(band 4 + band 7); and SR 625 : band 7/(band 2 + band 5) with the correlation coefficients of 0.618, 0.389, 0.646, −0.282, and −0.401, respectively.This implied that the stepwise regression analysis successfully excluded the spectral variables that contained similar information.For example, except for NDVI, other four most significant spectral variables were not selected because of their high correlations with NDVI.The five selected spectral variables were used as the independent variables in the multivariate linear regression model, logistic regression model, and kNN algorithms with plot AGB as the dependent variable.

Multivariate Linear Regression (MLR) Modeling
The five selected remote sensing variables were used as independent variables to fit a MLR model to the AGB data from the forest inventory dataset as follows:  (7) Based on the results of validation, there was a strong correlation between the plot predicted and referenced biomass density values (Figure 4a).The predicted average biomass density of the sample plots by the MLR method was almost the same as the average FID biomass.The mean ∧ µ map of the AGB map for the basin was slightly underestimated compared to the average biomass density value of the sample plots (Table 1).The underestimation mainly took place for the plots with AGB values greater than 100 Mg/ha and smaller 20 Mg/ha (Figure 4a).The distribution of the residuals as estimates of the error variance was in the shape of a horn (Figure 5a).∧ µ map and Var map are the mean estimate and its variance of a forest AGB map using model-assisted regression estimators from the MLR and LR models, and both the g-kNN without correlation based weighting and the CW-kNN with correlation based weighting.The percentages of the sample plot AGB values and the corresponding estimates falling in the biomass density intervals for the AGB map from the MLR are listed in Table 2. Compared to those of the sample plot AGB values, there were much smaller percentages of estimates falling in the biomass density intervals less than 20 Mg/ha and larger than 100 Mg/ha, indicating the underestimations happened in the intervals.The percentages of estimates falling in the biomass density intervals of 20-40 Mg/ha, 40-60 Mg/ha, 60-80 Mg/ha, and 80-100 Mg/ha were much greater than those of the sample plot AGB values, impling that overestimations took place in the intervals.In addition, the percentage for the biomass density interval less than 0 Mg/ha showed that the MLR method generated negative predictions at many places.The percentages of the sample plot AGB values and the corresponding estimates falling in the biomass density intervals for the AGB map from the MLR are listed in Table 2. Compared to those of the sample plot AGB values, there were much smaller percentages of estimates falling in the biomass density intervals less than 20 Mg/ha and larger than 100 Mg/ha, indicating the underestimations happened in the intervals.The percentages of estimates falling in the biomass density intervals of 20-40 Mg/ha, 40-60 Mg/ha, 60-80 Mg/ha, and 80-100 Mg/ha were much greater than those of the sample plot AGB values, impling that overestimations took place in the intervals.In addition, the percentage for the biomass density interval less than 0 Mg/ha showed that the MLR method generated negative predictions at many places.

Logistic Regression (LR) Modeling
The LR method led to following AGB estimation model: The validation results showed that the coefficient of determination R 2 between the predictions and the plot biomass density values was statistically significant at the level of 0.05 (Table 1 and Figure 4b).The mean AGB prediction at the plot level was close to the sample plot average, but the mean map μ ∧ of the AGB map for the basin was smaller (Table 1).Compared with those from the MLR model (Figure 5a), the residuals of predictions from the LR model were distributed more evenly and randomly (Figure 5b).The LR model also led to underestimations for the plots with biomass density values smaller than 20 Mg/ha and larger than 120 Mg/ha (Figures 4b and 5b), but the underestimations were slightly improved compared to those from the MLR model (Figures 4a  and 5a).
Unlike the MLR model, the LR method did not lead to negative predictions of biomass density (Table 2).Compared to those of the sample plot AGB values, there were smaller percentages of estimates falling in the biomass density intervals less than 20 Mg/ha and larger than 120 Mg/ha.This indicated that the LR method resulted in underestimations in the intervals, but the underestimations were improved compared to those by the MLR method.Moreover, the

Logistic Regression (LR) Modeling
The LR method led to following AGB estimation model: AGB = 217.603e (4.1681+10.9561NDVI−1.3100SR 23 −3.9319SR 415 −5.5985SR 546 −5.9164SR 625 ) 1 + e (4.1681+10.9561NDVI−1.3100SR 23 −3.9319SR 415 −5.5985SR 546 −5.9164SR 625 ) (8) The validation results showed that the coefficient of determination R 2 between the predictions and the plot biomass density values was statistically significant at the level of 0.05 (Table 1 and Figure 4b).The mean AGB prediction at the plot level was close to the sample plot average, but the mean ∧ µ map of the AGB map for the basin was smaller (Table 1).Compared with those from the MLR model (Figure 5a), the residuals of predictions from the LR model were distributed more evenly and randomly (Figure 5b).The LR model also led to underestimations for the plots with biomass density values smaller than 20 Mg/ha and larger than 120 Mg/ha (Figures 4b and 5b), but the underestimations were slightly improved compared to those from the MLR model (Figures 4a and 5a).
Unlike the MLR model, the LR method did not lead to negative predictions of biomass density (Table 2).Compared to those of the sample plot AGB values, there were smaller percentages of estimates falling in the biomass density intervals less than 20 Mg/ha and larger than 120 Mg/ha.This indicated that the LR method resulted in underestimations in the intervals, but the underestimations were improved compared to those by the MLR method.Moreover, the percentages of estimates falling in the biomass density intervals of 40-60 Mg/ha, 60-80 Mg/ha, and 80-100 Mg/ha were slightly larger than those of the sample plot AGB values, implying that overestimations took place in the intervals, but were not very serious.The great overestimations happened mainly in the interval of 20-40 Mg/ha.

kNN Modeling
The mean predictions of kNN methods derived AGB for the FID-plots had a small range, but all fell in the confidence interval of the sample plot data (Table 1).For both kNN methods, the coefficient of determination R 2 between the plot predicted and referenced biomass density values increased gradually with the increasing number of k-nearest neighbors (Table 1 and Figure A1).Moreover, the larger the k value, the smaller the values of RMSE.For the same k values, the values of RMSE by CW-kNN were similar to those by g-kNN.Compared to those from the MLR and LR models, the RMSE values from both g-kNN and CW-kNN were slightly larger when k = 3 and 5 nearest neighbors and did not significantly differ when k = 7 and 10 nearest neighbors (Table 1 and Figure 4).The predicted ∧ µ map of the AGB maps from the kNN methods slightly fluctuated and the corresponding variance Var map slightly decreased with the increasing values of k (Table 1).The map mean estimates were smaller than that of the sample plot data and fell out of the confidence interval, but very close to the lower bound, implying slight underestimations.
Similar to the LR method, the kNN algorithms did not result in negative estimates (Table 2).For both g-kNN and CW-kNN algorithms, the percentages of the map estimates falling in the biomass density intervals of 1-20 Mg/ha, 40-60 Mg/ha and 60-80 Mg/ha were very close to those of the sample plot AGB values.Compared to those of the sample plot AGB values, there were smaller percentages of the estimates in the biomass density intervals larger than 100 Mg/ha, while greater percentage existed in the interval of 80-100 Mg/ha.However, overall both g-kNN and CW-kNN algorithms greatly improved the underestimations and overestimations of AGB for the intervals compared to the MLR and LR models.Both g-kNN and CW-kNN algorithms led to similar characteristics of plot and pixel estimates.
The distributions of residuals looked horn-shaped for both kNN methods and for all k values (Figure A2).With an increase of k values, the residual distribution tended to be random, that is, the residuals fluctuating around the line of zero except for one plot with an extremely residual (Figure A2).Compared to those from the MLR and LR models, both g-kNN and CW-kNN algorithms improved the distributions of residuals (Figure 5).

Spatial Distribution of AGB
In Figure 6a,b, the spatial distributions of predicted AGB values for the basin by the MLR and LR models looked different from those by both g-kNN and CW-kNN algorithms with k = 10 nearest neighbors in Figure 6c,d, although the locations of the areas where large and small estimates existed were similar.The greater biomass estimates were distributed mainly in the northeastern and the upper reaches of the Xiangjiang River Basin, including the eastern, southern and southwestern parts.The smaller biomass estimates were allocated in the middle and lower reaches, that is, central, northwestern and northern parts of the basin.Among the methods, the LR model led to the highest spatial variability of predicted AGB values, especially in the northeast parts (Figure 6b), and then the MLR model (Figure 6a).Both g-kNN and CW-kNN algorithms with k = 10 resulted in similar spatial distributions of the estimates (Figure 6c,d) with lower spatial variability than that from the regression models.For the kNN algorithms, slightly higher spatial variability of the predicted values was derived by using smaller k values (Figure A3).In addition, the MLR model resulted in negative predictions in the middle and lower reaches (central, northwestern and northern parts) of the basin.

Rationality of Spectral Variable Selection
Although Landsat TM images have been widely employed for AGB estimation of forest ecosystems [4,52,72-78], extracting and selecting spectral variables to accurately derive spatial distribution of AGB is still challenging mainly due to the saturation of spectral reflectance and the presence of mixed pixels [4, 58,79].The image data and spectral bands from different sensors have their own characteristics in reflecting land surfaces [4].Some vegetation indices such as simple ratios of spectral bands and NDVI obtained from Landsat data have been demonstrated to be useful predictors of biomass density in forest ecosystems [80-82].However, not all vegetation indices are significantly correlated with AGB of forest ecosystems [4].In this work, the independent variables that contributed to significantly improving the statistical fit of models to data and reducing the sum of squared errors were first selected from a total of 82 Landsat TM image derived spectral variables that were significantly correlated with plot AGB.The spectral variables were utilized in all three kinds of methods and thus the differences among the estimates should be attributed to the properties of

Rationality of Spectral Variable Selection
Although Landsat TM images have been widely employed for AGB estimation of forest ecosystems [4,52,[72][73][74][75][76][77][78], extracting and selecting spectral variables to accurately derive spatial distribution of AGB is still challenging mainly due to the saturation of spectral reflectance and the presence of mixed pixels [4, 58,79].The image data and spectral bands from different sensors have their own characteristics in reflecting land surfaces [4].Some vegetation indices such as simple ratios of spectral bands and NDVI obtained from Landsat data have been demonstrated to be useful predictors of biomass density in forest ecosystems [80][81][82].However, not all vegetation indices are significantly correlated with AGB of forest ecosystems [4].In this work, the independent variables that contributed to significantly improving the statistical fit of models to data and reducing the sum of squared errors were first selected from a total of 82 Landsat TM image derived spectral variables that were significantly correlated with plot AGB.The spectral variables were utilized in all three kinds of methods and thus the differences among the estimates should be attributed to the properties of the algorithms.The selected spectral variables matching the ground survey of the sample plots in time were site-specific and could not be generalized for the use of other years or areas.

Comparison of Different Methods for Biomass Estimation
The selection of appropriate spatial extrapolation methods plays a central role in mapping biomass density of forest ecosystems [3,4,43].In this work, three kinds of spatial modeling approaches, including MLR model, LR model, and kNN algorithms with and without spectral variable-AGB correlation based weighting (CW-kNN and g-kNN), were compared to yield the spatial estimates of AGB in the study area.The results showed that the smallest RMSE was obtained by the MLR, then the g-kNN and CW-kNN algorithms with k = 10.However, the differences of RMSE values were not significant.Moreover, the MLR resulted in negative estimates at many places (5.79% of the pixels).The distributions of the residuals and the scatter graph of predicted vs. referenced AGB values from the MLR showed a "horn" shape and underestimations existing for the sample plots and areas that had the AGB values smaller than 20 Mg/ha and larger than 100 Mg/ha.These implied that the relationship of the selected spectral variables with the forest ecosystem AGB was not linear [76].We tested and verified the nonlinear relationships of the five selected spectral variables with forest ecosystem AGB.The distributions of the residuals and the underestimations were improved slightly by the non-linear LR model and greatly by the non-parametric methods CW-kNN and g-kNN with k = 10.
In addition, for the sample plots and areas with AGB values smaller than 20 Mg/ha, the underestimations could be caused by the impacts of spectral reflectance from soils and bare lands within young forest stands on the values of the selected spectral variables.On the other hand, the underestimations for the sample plots and areas with AGB values larger than 100 Mg/ha for the MLR model or 120 Mg/ha for the LR model were also due to reflectance saturation of multi-layer canopy and high biomass forest stands [79].More importantly, this was mainly because both regression models were global methods that modeled and used global trends of AGB to generate estimates of local areas or pixels and in contrast, both CW-kNN and g-kNN were local methods that modeled and utilized local variability of AGB to create the estimates and thus improved the underestimations.
It had to be pointed out that both g-kNN and CW-kNN algorithms led to a great residual for one sample plot located in a shadow slope close to the south border of the basin with elevation of 1120 m and covered by a high dense and mixed forest of deciduous trees and shrub.This suggested that, although a topographic correction was conducted in this study, the effects of shadows still existed.If a more advanced method for topographic correction can be developed, the accuracy of AGB estimation can be further increased.In addition, in this study the kNN algorithms were investigated only using k values not larger than 10, mainly because based on previous reports using a k value larger than 10 would often smooth the spatial distribution of AGB estimates and not increase the accuracy of estimation [37,38,50,67,68].

Spatial Distribution of AGB in the Xiangjiang River Basin
Compared with those from the MLR and kNN algorithms, the spatial distribution of AGB estimates using the LR model had the higher spatial variability, especially in the northeast part of the basin (Figure 6b), because of its non-linear characteristics.However, all the predicted maps (Figure 6) were characterized by larger estimates of AGB distributed in the northeastern parts of the basin and the upper reaches (eastern, southern and southwestern parts) and smaller predictions of AGB allocated in the middle reaches (central areas) and the lower reaches (northwestern and northern parts).The spatial characteristics and patterns of AGB estimates were consistent with those of the sample plot biomass density values and reasonable because the northeastern (Liuyang of Changsha and Liling of Zhuzhou), eastern (Youxian and Chalin county of Zhuzhou), southern (Yongzhou) and southwest (Guilin of Guangxi) parts of the basin were mountainous and hilly areas and dominated by various high dense forests, and the central (Hengyang), northwestern (Xiangtan) and northern (Changsha) parts were flat areas and dominated by cities and villages.
The spatial patterns of AGB predictions were supported by the results of the report from Jiao et al. [83,84], i.e., the estimates of AGB were relatively greater in the areas of Yongzhou and Chengzhou (the upper reaches of the basin) and smaller in the areas of Hengyang and Xiangtan cities (the middle and lower reaches of the basin).The similar spatial patterns of carbon storage for Cunninghamia lanceolata were characterized by Huang and Tong [85].The AGB values of Pinus massoniana forests tended to decrease from the southwestern and southern parts to the northern parts of Hunan Province [86].Xiangtan City had the lowest forest cover percentage and AGB value, followed by Hengyang City due to the implementation of the Hunan Agricultural Development Strategy in the mid-1990s, resulting in a decrease in the forested land in the areas of the basin [87,88].However, in recent years net primary production (NPP) has been increasing in the middle and downstream reaches of the basin due to the programs of the "returning farmland to forests" and "forest protection", based on the data of four forest inventories from 1983 to 2009 with the area of broad-leaved forests increased by four times [87,88].

Comparison with Previous Biomass Estimations
The average AGB estimates of the sample plots for the forest ecosystems in the Xiangjiang River Basin in 2009 varied from 63.84 Mg/ha to 64.52 Mg/ha, very close to the referenced value (64.53 Mg/ha) of the plots measured in the same year.The corresponding average estimates of the obtained maps ranged from 59.32 Mg/ha to 60.31 Mg/ha, slightly underestimated compared to the referenced value.Based on previous studies [83,89], the average AGB estimates of vegetation ecosystems (including forests, shrubs and grasslands) in the subtropical regions of China had a great range of 31.76Mg/ha to 74.78 Mg/ha.In Hunan province, the forest average AGB values obtained from the sample plots of the 4th and 8th national forest inventories in 1990 and 2009 respectively were 31.76Mg/ha [83] and 27.56 Mg/ha.For the Xiangjiang River Basin, the forest average AGB values obtained from the sample plots of national forest inventories in 1999, 2004 and 2009 respectively were 32.55 Mg/ha, 19.5 Mg/ha and 47.25 Mg/ha.This implied that the AGB values of forests in the Xiangjiang River Basin were larger than those of the whole Hunan in the same years mainly because the upper reaches of the Xiangjiang River was dominated by various protected forests.However, the value of 47.25 Mg/ha in 2009 did not include biomass of shrubs and grass within the forests.This study dealt with the forest ecosystems of the Xiangjiang River Basin consisting of forests, and shrubs and grass within the forests and thus, the obtained average AGB estimates in this study were greater than that of the forests from the sample plots of the 8th national forest inventory in 2009.The forest AGB of the Xiangjiang River Basin varied greatly over time from 1999 to 2004 and 2009, which was mainly caused by Human activities including city sprawling and urbanization, returning farmland to forests and forest protection.

Uncertainty Analysis of Forest Biomass
The estimates of AGB for the forest ecosystems of the Xiangjiang River basin are associated with uncertainties.The uncertainties may come from the images used and measurement errors of field observations for the tree variables involved in allometric equations, including tree height and diameter at breast height (DBH) [4, [90][91][92].In the present work, five remote sensing variables were selected and they were all significantly correlated with the sample plot AGB at the significance level of 0.01.Using the same five remote sensing variables three kinds of approaches were employed to produce the AGB estimates.The estimates of the sample plots from the methods were similar to the FID-based AGB values, suggesting that the five spectral variables contributed to statistically significantly improving the fit of the models to the data and reducing the sum of square errors.In this study, the forest inventory dataset collected in 2009 served as a reference, and allowed us to test and compare the remote sensing based methods.The dataset contained sampling and measurement errors of tree height and DBH that led to uncertainty of AGB estimates for the forest ecosystems of the Xiangjiang River basin.However, this study did not analyze the effects of sampling and measurement errors on the accuracy of estimation because of lacking the corresponding information of the tree variables.
Moreover, the coefficients of allometric equations are site-specific [8] and not generalized.Applying the allometric equations to other study areas will cause uncertainties of forest ecosystem AGB estimates [11,[93][94][95].In this study, several allometric equations were used to estimate the values of AGB of trees by species; however, the uncertainty from the used allometric equations was unknown.In addition, Tang et al. [54] demonstrated that selecting spatial extrapolation methods is also vital for estimation and mapping of forest ecosystem AGB.In this study, three kinds of methods were compared using the same dataset and spectral variables.Based on the average estimates of the AGB maps at the pixel level, both CW-kNN and g-kNN had more robust ability of spatial extrapolation than the MLR and LR models.

Conclusions
The objective of this study was to develop a novel and cost-effective mapping method of forest ecosystem AGB for the Xiangjiang River Basin by combining an existing forest inventory database and free Landsat images and by improving kNN algorithm and comparing it with other three spatial extrapolation methods.It was found that: (1) spectral variables NDVI, SR 23 , SR 415 , SR 546 , and SR 625 had significant contributions to improving the models' fit to the data and reducing the residuals of predictions; (2) all the spatial extrapolation methods led to reasonable and similar spatial patterns of forest ecosystem AGB estimates, but different spatial variability, and both the original g-kNN and the improved CW-kNN with 10 nearest plots resulted in slightly smaller RMSE than the LR model; (3) there was no significant difference between the g-kNN and the improved CW-kNN in terms of estimation accuracy at both plot and pixel levels; (4) compared to the MLR and LR models, both the g-kNN and CW-kNN algorithms improved the distributions of residuals of predictions and showed stronger ability of spatial interpolation; and (5) although the MLR model created the smallest RMSE, it led to negative estimates and was not appropriate for mapping the biomass of the forest ecosystems.Thus, this study suggested that the CW-kNN or g-kNN algorithm provided greater potential for generating accurate plot level estimates and spatially explicit predictions of AGB for the forest ecosystems in the complicated basin by combining free Landsat TM images and existing plot data, and can be applied to other areas at similar scales.

Figure 1 .
Figure 1.The study area: (a) the Xiangjiang River Basin in cyan with the river in blue; and (b) the basin location in China marked in red.

Figure 1 .
Figure 1.The study area: (a) the Xiangjiang River Basin in cyan with the river in blue; and (b) the basin location in China marked in red.

Figure 2 .
Figure 2. Spatial distribution of sampling plots and corresponding plot AGB values across the Xiangjiang River Basin.

Figure 2 .
Figure 2. Spatial distribution of sampling plots and corresponding plot AGB values across the Xiangjiang River Basin.

Figure 3 .
Figure 3. Landsat 5 TM composite image covering the study area after pre-processing.

Figure 3 .
Figure 3. Landsat 5 TM composite image covering the study area after pre-processing.

Table 1 .
The accuracy assessments of forest aboveground biomass density (AGB) estimates from a multivariate linear regression (MLR) model, a logistic regression (LR) model, and k-nearest neighbors (kNN) algorithms in the basin based on the leave-one-out cross-validation (LOOCV).The mean AGB means the average estimates for the forest inventory sample plots; and R 2 and RMSE are the coefficient of determination and root mean square error between the estimated and referenced values of the sample plots, respectively.

Figure 6 .
Figure 6.Spatial distributions, that is, maps of above-ground biomass density (AGB) predictions for the basin by: (a) the multivariate linear regression (MLR); (b) the logistic regression (LR); (c) the g-kNN with k = 10; and (d) the CW-kNN with k = 10.

Figure 6 .
Figure 6.Spatial distributions, that is, maps of above-ground biomass density (AGB) predictions for the basin by: (a) the multivariate linear regression (MLR); (b) the logistic regression (LR); (c) the g-kNN with k = 10; and (d) the CW-kNN with k = 10.

Table 2 .
Statistical percentages of forest aboveground biomass density (AGB) estimates falling in the biomass intervals for forest ecosystem AGB maps using MLR, LR, g-kNN, and CW-kNN modeling.The reference AGB was obtained from a forest inventory dataset (FID column).

Table 2 .
Statistical percentages of forest aboveground biomass density (AGB) estimates falling in the biomass intervals for forest ecosystem AGB maps using MLR, LR, g-kNN, and CW-kNN modeling.The reference AGB was obtained from a forest inventory dataset (FID column).