Evaluating k -Nearest Neighbor ( k NN) Imputation Models for Species-Level Aboveground Forest Biomass Mapping in Northeast China

: Quantifying spatially explicit or pixel-level aboveground forest biomass (AFB) across large regions is critical for measuring forest carbon sequestration capacity, assessing forest carbon balance, and revealing changes in the structure and function of forest ecosystems. When AFB is measured at the species level using widely available remote sensing data, regional changes in forest composition can readily be monitored. In this study, wall-to-wall maps of species-level AFB were generated for forests in Northeast China by integrating forest inventory data with Moderate Resolution Imaging Spectroradiometer (MODIS) images and environmental variables through applying the optimal k -nearest neighbor ( k NN) imputation model. By comparing the prediction accuracy of 630 k NN models, we found that the models with random forest (RF) as the distance metric showed the highest accuracy. Compared to the use of single-month MODIS data for September, there was no appreciable improvement for the estimation accuracy of species-level AFB by using multi-month MODIS data. When k > 7, the accuracy improvement of the RF-based k NN models using the single MODIS predictors for September was essentially negligible. Therefore, the k NN model using the RF distance metric, single-month (September) MODIS predictors and k = 7 was the optimal model to impute the species-level AFB for entire Northeast China. Our imputation results showed that average AFB of all species over Northeast China was 101.98 Mg / ha around 2000. Among 17 widespread species, larch was most dominant, with the largest AFB (20.88 Mg / ha), followed by white birch (13.84 Mg / ha). Amur corktree and willow had low AFB (0.91 and 0.96 Mg / ha, respectively). Environmental variables (e.g., climate and topography) had strong relationships with species-level AFB. By integrating forest inventory data and remote sensing data with complete spatial coverage using the optimal k NN model, we successfully mapped the AFB distribution of the 17 tree species over Northeast China. We also evaluated the accuracy of AFB at di ﬀ erent spatial scales. The AFB estimation accuracy signiﬁcantly improved from stand level up to the ecotype level, indicating that the AFB maps generated from this study are more suitable to apply to forest ecosystem models (e.g., LINKAGES) which require species-level attributes at the ecotype scale.

the stand to the ecotype level; (iii) analyze the spatial patterns of species-level aboveground forest biomass across different ecoregions in Northeast China; and (iv) investigate the relationship between environmental variables (e.g., climate, topography, soil) and species-level aboveground forest biomass.

Study Area
Northeast China (38 • 42 -53 • 35 N, 115 • 32 -135 • 09 E) includes China's largest natural forest area, storing nearly half of the total forest biomass in China [23]. The region includes Heilongjiang, Jilin, and Liaoning provinces, and the eastern part of the Inner Mongolia Autonomous Region, covering over 1.24 million km 2 . According to natural conditions (e.g., temperature, humidity, topography, elevation), Northeast China can be divided into seven major ecoregions [24] environmental variables (e.g., climate, topography, soil) and species-level aboveground forest biomass.

Study Area
Northeast China (38°42′-53°35′N, 115°32′-135°09′E) includes China's largest natural forest area, storing nearly half of the total forest biomass in China [23]. The region includes Heilongjiang, Jilin, and Liaoning provinces, and the eastern part of the Inner Mongolia Autonomous Region, covering over 1.24 million km 2 . According to natural conditions (e.g., temperature, humidity, topography, elevation), Northeast China can be divided into seven major ecoregions [24]

Forest Inventory Data
We obtained data for 25,000 forest stand polygons (average stand size: 20.6 ha) in Northeast China surveyed in the early 2000s from the National Forestry and Grassland Data Center (http: //www.cfsdc.org/). The geometric center sites of forest stand polygons are shown in Figure 1. They were mainly distributed in three areas that are representative of forests in Northeast China, the Greater Khingan Mountains, the Lesser Khingan Mountains and the Changbai Mountains [23]. The forest inventory data spanned a variety of forest types (e.g., cold-temperate conifer mixed forests, temperate conifer forests, broadleaf mixed forests, and warm-temperate deciduous broadleaf mixed forests) with different age classes, and include all major tree species in Northeast China. Therefore, the collected forest inventory data in this study effectively represent the species composition and structure of Northeast China. Each forest stand polygon is a contiguous area ranging from a few to tens of hectares that contains a relatively homogeneous tree community and is normally managed as a single unit [25]. The statistics of each forest stand polygon mainly include mean diameter at breast height, stand height, stand age, stand volume, and volume proportion by species. Based on the species-specific biomass-volume relationships [26], we transformed the species-level volume of each forest stand polygon into species-level AFB. In the kNN imputation models, we selected the AFB for the 17 dominant tree species of each stand polygon as the response variables. Additionally, we derived aboveground forest biomass measurements for 143 sample plots spread across the areas where inventory data were sparse ( Figure 1) from literatures [27,28]. These added plots were used to validate our imputed total AFB for areas where forest inventory data were not available.

MODIS Data
MODIS data have wide, complete spatial coverage and a relatively high temporal resolution [7]. We developed mathematical relationships between the rich spectral information in MODIS and species-level AFB from field data [23]. We then applied the kNN imputation method to integrate spatially continuous MODIS data and forest inventory data to derive wall-to-wall species-level AFB over the entirety of Northeast China. Since most of the forest stand polygons were obtained around 2000, the MODIS data (MOD09Q1: b1-b2, 250 m and MOD09A1: b3-b7, 500 m) in 2000 were selected as basic data to derive predictor variables for further AFB mapping. MOD09A1 data were resampled to 250 m using nearest neighbor interpolation to fit the spatial resolution of MOD09Q1 data. To investigate the influences of multi-temporal MODIS data on biomass prediction accuracy, we obtained monthly MOD09Q1 and MOD09A1 data. We extracted the monthly reflectance value of MOD09Q1 and MOD09A1 data by the Maximum Value Composite (MVC) method and average method, respectively. The difference of monthly reflectance extracted by the MVC method was small and the discrimination was not enough to reflect the rich temporal information of monthly MODIS data. The reflectance value of each month extracted by the average method was significantly different, and thus, we selected the monthly average reflectance value (May-October) of MOD09Q1 and MOD09A1 data as predictor variables. Because Northeast China was largely covered by snow or frost during January-April and November-December [23], spectral bands of these two periods were not applied to the kNN models. Several spectral indices (May-October) correlated with vegetation characteristics were also used as predictor variables, which were computed by using the monthly spectral bands (Table 1). Additionally, the 2000 MODIS Vegetation Continuous Fields (VCF) product MOD44B (250 m) was used to extract the forest areas, and pixels with tree cover greater than 10% are defined as forest areas [29]. The 2000 MODIS land cover product MCD12Q1 (500 m) was also resampled to 250 m using nearest neighbor interpolation and used to distinguish different forest types.

Environmental Data
To reduce uncertainties due to environmental heterogeneity, environmental data related to species-level AFB were selected as auxiliary predictors (Table 1). Topographic data (e.g., slope and cosine of aspect) were derived from a 90 m digital elevation model (DEM) provided by the Shuttle Radar Topography Mission (SRTM) (http://srtm.csi.cgiar.org/) by using ArcGIS 10.3 software. As primary climatic variables, monthly mean temperature and cumulative precipitation data (250 m; 1982-2015) were interpolated from the 103 meteorology stations in Northeast China (https://data.cma.cn/) by using ANUSPLIN 4.3 software, which applied a thin-plate spline function, with the resampled SRTM DEM (250 m) as the covariate [41]. The SRTM DEM was resampled from 90 m to 250 m using bilinear interpolation.
Monthly potential evapotranspiration and radiation data from 1982 to 2015 with a spatial resolution of 0.1 • × 0.1 • were derived from the China Meteorological Forcing Dataset (http://westdc.westgis.ac.cn). Soil data (1 km resolution) were derived from the Harmonized World Soil Data Base Version 1.2 [42]. In order to preserve the original information of MOD09Q1 data as much as possible, we resampled evapotranspiration data, radiation data, and soil data to the same resolution (250 m) as MODO9Q1 data using nearest neighbor interpolation, and resampled the topographic data (DEM, slope, cosine of aspect) to 250 m resolution using bilinear interpolation.

Optimizing kNN Models and Species-Level Biomass Imputation
Detailed introduction of the kNN method and its parameters was described in McRoberts et al. [12]. The application of the kNN method entails identifying the k nearest reference samples in the feature space defined by predictor variables for each target unit. Values of each response variable within these k nearest samples are then averaged and assigned to the target unit. Formally, the nearest neighbors prediction, y i , for the i-th target element is calculated as follows [12]: where {y ij ; j = 1, 2, . . . , k} is the set of response variable observations for the k reference elements that are nearest to the i-th target element in feature space given a specific distance metric, and w ij is the weight assigned to the j-th nearest neighbor with k j=1 w ij = 1. The w ij is defined as follows: where d ij is the distance in feature space between the j-th nearest neighbor and the i-th target calculated using a given distance metric.
In the process of optimizing kNN models, forest stand polygon was used as the unit of observation. Predictor variables of each stand polygon were calculated as the mean value of the raster pixels with >50% stand cover. The 25,000 stand polygons containing both response variables and predictor variables were split into training and test sets using the 7:3 split ratio. Before building models, redundant predictors were removed using forward stepwise canonical correspondence analysis (CCA) to select significant variables (P < 0.01) [10]. Then, based on six popular distance metrics (RF, GNN, MSN, Euclidean, Mahalanobis, and msnPP), 15 k values (1-15) and seven sets of selected predictor variables, we built 630 kNN models [22]. The process of optimizing the kNN models was executed through kNN prediction analysis based on the training-test sets using yaImpute package in R [43]. For each kNN model, we calculated the multivariate goodness of fit criterion T and the generalized root mean squared distance (GRMSD) as optimization criteria. T is used to measure the quality of multivariate fitting, and larger T values mean better model performance [12]. GRMSD is used to measure the degree of similarity of the structures between the predictive and observed values, and larger GRMSD values mean worse model performance [43]. The multivariate goodness of fit criterion T [12] is defined as: where y represents one of the 17 tree species' AFB, Y is the number of the response variables, w y is the weight of the yth species' AFB, which is the percent AFB of the yth species against the total AFB based on the observed value. T 2 y is the fractional amount of variance in response variable y explained by the kNN prediction. GRMSD in this study represented the generalized root mean square distance between observed and predictive values in an orthogonal multivariate space defined by AFB values of the 17 species. The optimization exercise was replicated 30 times, yielding the mean values of T and GRMSD. Then, we generated the curves of T vs. k and GRMSD vs. k based on the combinations of different distance metrics and MODIS predictors. For a given combination of distance metric and MODIS predictors, when the T value reached 0.95 of the maximal T value on the curve, the corresponding k value was considered to be optimal [16]. This selected optimal k value was further inspected using GRMSD curves (the k value when GRMSD approximately equalled 1.05 of the minimum GRMSD value on the curve).
After the optimal distance metric, MODIS predictor variables and k value were selected, we applied the optimal kNN model to impute species-level AFB for all the forest pixels with a tree cover >10%. All the processes were implemented in the R software supported by "yaImpute" package ( Figure 2).
Remote Sens. 2019, 11, x FOR PEER REVIEW 7 of 20 of different distance metrics and MODIS predictors. For a given combination of distance metric and MODIS predictors, when the T value reached 0.95 of the maximal T value on the curve, the corresponding k value was considered to be optimal [16]. This selected optimal k value was further inspected using GRMSD curves (the k value when GRMSD approximately equalled 1.05 of the minimum GRMSD value on the curve).
After the optimal distance metric, MODIS predictor variables and k value were selected, we applied the optimal kNN model to impute species-level AFB for all the forest pixels with a tree cover >10%. All the processes were implemented in the R software supported by "yaImpute" package ( Figure 2).

Accuracy Assessment
Accuracy assessment of the estimated AFB over Northeast China was performed at the stand and the ecotype level. Ecotype was defined as the unique combination of each landform with forest type in different ecoregions. The seven main ecoregions of Northeast China represented different temperature and moisture conditions. Nine landforms (1: Ridge, 2: Upper slope, 3: Sunny slope, 4: Semi-sunny slope, 5: Semi-shady slope, 6: Shady slope, 7: Flat slope, 8: Lower slope, 9: Bottomland) were classified from DEM using Topographic Position Index [44]. Five forest types (1: evergreen

Accuracy Assessment
Accuracy assessment of the estimated AFB over Northeast China was performed at the stand and the ecotype level. Ecotype was defined as the unique combination of each landform with forest type in different ecoregions. The seven main ecoregions of Northeast China represented different temperature and moisture conditions. Nine landforms (1: Ridge, 2: Upper slope, 3: Sunny slope, 4: Semi-sunny slope, 5: Semi-shady slope, 6: Shady slope, 7: Flat slope, 8: Lower slope, 9: Bottomland) were classified from DEM using Topographic Position Index [44]. Five forest types (1: evergreen coniferous, 2: evergreen broad-leaved, 3: deciduous coniferous, 4: deciduous broad-leaved, 5: mixed forest) were retrieved from the MCD12Q1 product. For each of the seven ecoregions, 45 ecotypes were generated by combing the nine landforms and five forest types. A total of 315 ecotypes was produced throughout Northeast China.
The estimated AFB of the 30% test dataset at the stand level was calculated by averaging the AFB values of all pixels with over 50% of their area located in each forest stand. Then both the observed and estimated species-level AFB of the test dataset were averaged to the ecotype level. At stand and ecotype levels, we calculated the Pearson correlation (R 2 ), root mean square error (RMSE: Mg/ha), empirical cumulative distribution functions (ECDFs) and the Kolmogorov-Smirnov (KS) statistic [45] between the estimated and observed species-level AFB. R 2 and RMSE provide the overall assessment of the estimation accuracy. The ECDFs and KS statistic can quantify the discrepancy in the distributions of the estimated and observed species-level AFB. KS statistic is defined as the maximum distance between observed and estimated ECDFs, without assuming the distribution of the data and independent of the scale changes [46]. We also calculated the R 2 , RMSE, ECDFs, and KS statistic between the observed and estimated total AFB for the collected 143 sample plots.

Performance of Different kNN Models
RF-based kNN models showed the best performance with largest T values and smallest GRMSD values for most of the combinations of k value and MODIS predictor variables, followed closely by MSN-and GNN-based kNN models. The msnPP-based kNN models obviously showed the worst performance with smallest T values and largest GRMSD values (Figures 3 and 4). Compared with single-month predictor variables (September), use of multi-month MODIS predictors only slightly improved the accuracy of RF-based kNN models, with the mean value of T (average across all k values) increased by 2% and the mean value of GRMSD reduced by 0.8%. Although with increasing k, the models performed better, the computational intensity also increases greatly with the increase of k. When k = 7, the T value of the RF-based kNN model using the MODIS predictors of September reached 0.95 of the maximal T value on the curve, and the corresponding GRMSD value equalled 1.046 of the minimum GRMSD value on the curve, meeting the selection criteria for the optimal k value described in the methods. Additionally, the difference in T and GRMSD values was essentially negligible when k > 7. Therefore, the kNN model using the RF distance metric, single-month (September) MODIS predictors and k = 7 was the optimal model, and we selected this model to impute the species-level AFB over Northeast China.

Species-Level AFB Estimation in Northeast China
The total AFB across the 17 species was mapped by summing the species-level AFB within each pixel. The average AFB of all the 17 species for entire Northeast China (excluding non-forest areas) was approximately 101.98 Mg/ha, and the standard deviation was 56.91 Mg/ha. Overall, the imputed total forest aboveground biomass in Northeast China was mainly distributed at ecoregion GKM, LKM and CM, and the average AFB in ecoregion CM was higher than that in ecoregion LKM and

Species-Level AFB Estimation in Northeast China
The total AFB across the 17 species was mapped by summing the species-level AFB within each pixel. The average AFB of all the 17 species for entire Northeast China (excluding non-forest areas) was approximately 101.98 Mg/ha, and the standard deviation was 56.91 Mg/ha. Overall, the imputed total forest aboveground biomass in Northeast China was mainly distributed at ecoregion GKM, LKM and CM, and the average AFB in ecoregion CM was higher than that in ecoregion LKM and GKM ( Figure 5).
Ecoregion SNP, LHP, and SJP were all located in the plain area with the proportion of forest less than 5%, and the average AFB of these three ecoregions was low ( Figure 5).   Larch had the largest AFB (20.88 Mg/ha), which was calculated by summing the biomass of all three larch species, indicating larch was the most dominant tree in Northeast China, followed by white birch (13.84 Mg/ha), although white birch had a broader distribution range than larch (Figures 5 and 6). Mongolian oak and aspen were also widely distributed, with an average AFB of 11.23 Mg/ha and 10.87 Mg/ha, respectively. However, at the northern edge of ecoregion GKM, Mongolian oak was rare. Basswood, mono maple, elm, walnut, ash, and Korean pine were all mainly distributed in the southern LKM and CM, and their AFB decreased sequentially. Black birch (2.69 Mg/ha) was imputed mainly in the middle of ecoregion GKM, the north of LKM, and the north of SJP. Whereas Scots pine (4.88 Mg/ha) was mainly imputed in the northern GKM, CM, and SJP. Spruce    The accuracy of the estimated total AFB improved substantially from the stand level to the ecotype level, the value of R 2 increased from 0.63 to 0.92, and the value of RMSE decreased from 35.49 to 8.27 Mg/ha (Figure 7a,c). Although the value of the KS distance (0.12 vs. 0.05) was slightly higher at the ecotype level than that at the stand level, a higher P value (0.79 vs. 0) for KS distance at the ecotype level revealed that the estimated and observed AFB ECDFs became more similar (Figure 7b  (b) cumulative distribution functions of the observed and the estimated total AFB at the stand level; (c) scatter plots between the observed and the estimated total AFB at the ecotype level; (d) cumulative distribution functions of the observed and the estimated total AFB at the ecotype level; (e) scatter plots between the observed and the estimated total AFB at the sample plots; (f) cumulative distribution functions of the observed and the estimated total AFB at the sample plots. Note that the dotted line is the 1:1 line.

Relationship between Environmental Variables and Species-Level AFB
Environmental variables such as climate and topography variables had strong relationships with species-level AFB (Figure 8). For example, precipitation, elevation, and climate moisture index was positively correlated with the AFB of fir, spruce, ribbed birch, and Korean pine with strong correlation, but had a negative correlation with aspen, Scots pine and black birch. Temperature, radiation and slope were positively correlated with the AFB of mono maple, basswood, ash, elm, Amur corktree, walnut and Mongolian oak, but negatively correlated with larch and white birch. Mean annual climate variables (PRE, ACMI, TEM, and RAD) had stronger effects on the species-level AFB than the mean growing season variables (GPRE, GACMI, GTEM, and GRAD). The X and Y At the stand level, compared with the other species, the estimated species-level AFB for larch, fir, and willow had relatively higher accuracy (R 2 = 0.56, 0.54, 0.51, respectively) ( Figure S1) [47]. The KS statistic metrics indicated the ECDFs between the estimated and observed species-level AFB at the stand level for all the 17 species were significantly different with low P value (P = 0; Figure S2). The estimated AFB accuracy of most species (R 2 = 0.77-0.98) except for willow and Scots pine (R 2 = 0.55, 0.62, respectively) was significantly improved from the stand level to the ecotype level ( Figure S3). At the ecotype level, the ECDFs between the observed and estimated species-level AFB became similar for all species with smaller KS distance (0.04-0.37) and higher P value (P ≈ 1; Figure S4).

Relationship between Environmental Variables and Species-Level AFB
Environmental variables such as climate and topography variables had strong relationships with species-level AFB (Figure 8). For example, precipitation, elevation, and climate moisture index was positively correlated with the AFB of fir, spruce, ribbed birch, and Korean pine with strong correlation, but had a negative correlation with aspen, Scots pine and black birch. Temperature, radiation and slope were positively correlated with the AFB of mono maple, basswood, ash, elm, Amur corktree, walnut and Mongolian oak, but negatively correlated with larch and white birch. Mean annual climate variables (PRE, ACMI, TEM, and RAD) had stronger effects on the species-level AFB than the mean growing season variables (GPRE, GACMI, GTEM, and GRAD). The X and Y coordinates also strongly associated with species-level AFB, especially for willow. Compared with other environmental variables, soil variables contributed less to the species-level AFB. coordinates also strongly associated with species-level AFB, especially for willow. Compared with other environmental variables, soil variables contributed less to the species-level AFB.

Selection of Optimal Distance Metric, k value and MODIS Imagery
RF-based kNN models produced the best results overall in our study. This may be because the RF distance metric is sufficient to process highly dimensional data and highly correlated predictor variables [17]. The performance of the MSN-based kNN model was similar to that of the RF-based kNN model, but its execution efficiency was higher. Therefore, it is worthwhile to consider using an MSN distance metric for large-scale biomass imputation rather than an RF distance metric to reduce the computational intensity. As the k value increased, the changing trend of T and GRMSD values in our study was the same as the results of Beaudoin et al. [16] and Zhang et al. [22]. Selection of optimal k value was the balance between covariance structure and imputation accuracy [12] and varied with different mapping applications. Different from the selection (k = 6) of Zhang et al. [22], in our study, k = 7 was selected, which could maintain the stable covariance structure of species composition. Zhang et al. [22] selected the single-month MODIS imagery for June as predictors. However, our results indicated that use of single-month MODIS imagery for September had the highest accuracy among the six selected months. The differences between our results and Zhang et al. [22] could be due to our study area spanning from temperate forests to boreal forests, and September imagery could best capture the differences in spectral reflectance among species that result from leaf

Selection of Optimal Distance Metric, k value and MODIS Imagery
RF-based kNN models produced the best results overall in our study. This may be because the RF distance metric is sufficient to process highly dimensional data and highly correlated predictor variables [17]. The performance of the MSN-based kNN model was similar to that of the RF-based kNN model, but its execution efficiency was higher. Therefore, it is worthwhile to consider using an MSN distance metric for large-scale biomass imputation rather than an RF distance metric to reduce the computational intensity. As the k value increased, the changing trend of T and GRMSD values in our study was the same as the results of Beaudoin et al. [16] and Zhang et al. [22]. Selection of optimal k value was the balance between covariance structure and imputation accuracy [12] and varied with different mapping applications. Different from the selection (k = 6) of Zhang et al. [22], in our study, k = 7 was selected, which could maintain the stable covariance structure of species composition.
Zhang et al. [22] selected the single-month MODIS imagery for June as predictors. However, our results indicated that use of single-month MODIS imagery for September had the highest accuracy among the six selected months. The differences between our results and Zhang et al. [22] could be due to our study area spanning from temperate forests to boreal forests, and September imagery could best capture the differences in spectral reflectance among species that result from leaf senescence [48]. Our results suggest that the optimal kNN model selected by Zhang et al. [22] was not the optimal model for species-level biomass imputation for entire Northeast China, which confirms the necessity of selecting kNN models on a case-by-case basis [13].

Environmental Factors and Species Distribution
Environmental variables showed a major effect on the imputation of species-level AFB ( Figure 8) because they were strongly correlated with the spatial patterns of tree species ( Figure 5). Larch and white birch have high tolerance to low temperature [49], and are therefore distributed most widely in the northern part of our study region. The AFB of larch is commonly higher than white birch because it is a late-successional species compared to white birch [50]. Aspen has more limited distribution than white birch since it requires warmer temperatures and higher soil fertility [50]. Scots pine has a high tolerance of drought and low temperatures [51], and consequently is mainly distributed on sunny slopes and ridges in the northern edge GKM. The distribution of Mongolian oak and black birch in ecoregion GKM and LKM are similar since they have similar ecological niches [50]. Compared to black birch, Mongolian oak has a wider range of adaptation to environmental conditions, thus, Mongolian oak has a wider distribution and the AFB of Mongolian oak is obviously higher than black birch. Spruce and fir grow under cold environments [52], and therefore, they are mainly mapped in areas of relatively high elevation. Korean pine, ribbed birch, mono maple, basswood, ash, elm, Amur corktree, and walnut are most abundant in warmer ecoregions (i.e., LKM and CM), because they require sufficient humidity and heat to survive [53,54].
The spatial distribution of species-level AFB from our imputation results showed species composition similar to each ecoregion delineated by Zheng et al. [24]. Therefore, our imputation results provided support for the division of their ecoregions. Specifically, Zheng et al. [24] defined the ecoregion GKM as a deciduous-coniferous forest region, in which larch was the most dominant coniferous species with a small amount of Scots pine and spruce, and the main broadleaved tree species included white birch, Mongolian oak and aspen. Ecoregions LKM and CM were defined as coniferous and broadleaved mixed forest. Compared to GKM, ecoregions LKM and CM had similar species composition, including more coniferous species (e.g., Korean pine, fir) and broadleaved species (e.g., basswood, mono maple, elm, walnut, ash). The ecoregions SJP, SNP, LHP and HBP were defined as plain or plateau, with few tree species. The imputed species-level AFB also captured the forest regions affected by large disturbances. For example, the forests in the northernmost GKM were severely burned in 1987 [50], and thus, the total aboveground forest biomass in this region was obviously lower than other regions ( Figure 5).

Imputation Accuracy and Limitations
Results from our study indicated that the total forest biomass of all species in the ecoregion CM was the largest, followed by that in ecoregion LKM and GKM, which were consistent with the biomass distribution of previous studies also conducted in Northeast China [23,55]. Results in our study showed the average AFB and total AFB carbon stock (multiplying by a standard factor of 0.5 to convert forest biomass to forest carbon stock) [56] in Northeast China was 101.98 Mg/ha and 2.51 Pg C, respectively. These results were higher than those estimated by Zhang et al. [23] in Northeast China (average AFB: 93.02 Mg/ha; total AFB carbon stock: 1.55 Pg C). The higher total AFB carbon stock in our study was most likely due to the definition of the forest as pixels with tree cover over 10%, lower than the 30% threshold defined by Zhang et al. [23], resulting in larger forest areas, and therefore an increase of the forest carbon stock [2]. In contrast, the total AFB carbon stock from our results was consistent with the value of 2.571 ± 0.075 Pg C estimated by Zhang et al. [55] for Northeast China, which also defined forest by a 10% threshold.
At the stand level, our imputation accuracy of total AFB (R 2 = 0.63) was comparable or more accurate to the results of published studies (R 2 = 0.43-0.69) [16,23,57], that integrated field plots and MODIS data for biomass mapping in Northeast China and Canada. Species-level AFB accuracy (R 2 = 0.14-0.56) was more accurate than the results by Zhang et al. [22] (R 2 = 0.01-0.47). These comparisons indicated that our estimation results had a relatively reliable accuracy and the optimal model developed for the whole Northeast China in this study could effectively identify species-level biomass in multiple constituent ecoregions. Some species (e.g., Amur corktree and walnut) had lower accuracy at the stand level, which may be due to their limited sample distribution. Each ecotype contained relatively homogeneous temperature, humidity, soil, topography, and forest type. Within each ecotype, the tree species heterogeneity was reduced and the composition and structure of the species were specific to ecotype [58]. Thus, at the ecotype level, the imputation accuracy of both total AFB and species-level AFB had significant improvement except for willow and Scots pine. The lesser accuracy improvement for willow and Scots pine might be because their inventory data were relatively concentrated and distributed on fewer ecotypes. One solution to reduce the variances of inventory data distribution is to obtain sufficient field sample data evenly distributed across the ecotypes and within each ecotype. In order to obtain sufficient field sample data, more field inventories across different ecotypes and integrating lidar-based metrics and field sample data may be necessary. The lidar footprints distribute evenly and widely, and thus, the derived forest attributes in these footprints could be better used to impute the species-level AFB [2].
Though the inventory data were abundant in the three well-investigated forest regions and could well represent the forest composition in Northeast China, lack of inventory data in the other forest areas may lead to overestimations or underestimations of the species-level biomass. There are also limitations in the methodology and input parameters (e.g., MODIS variables). For example, the imputation results of the species-level aboveground forest biomass in our study indicated small values were often overestimated and large values were underestimated. This pattern is a typical feature of the kNN imputation method [59] and may also be caused by the spectral saturation of optical MODIS data in the forests with dense canopy cover [16]. Besides the saturation effect, the coarse spatial resolution (250 m) of MODIS data may also affect the estimation accuracy of the biomass, which will increase the possibility for the mismatch between forest stand polygon and pixels, especially along polygon boundaries. Due to the irregular boundaries of forest stand polygons, there would always be a spatial mismatch between the forest stand polygons and MODIS pixels even using remote sensing data with a finer spatial resolution [60]. In our study, when more than one raster pixel falls into one forest stand polygon, we extracted the pixel value of each stand polygon by averaging the values of the raster pixels with >50% stand cover to reduce contingency in the comparison with the stand-level attributes.

Conclusions
This study represents the first effort to map species-level aboveground forest biomass for the entirety of Northeast China. The biomass maps of the 17 species generated from this work are the basic prerequisites for regional-level ecological modelling and assessment in Northeast China. By integrating MODIS multispectral and environmental variables with forest inventory data, the optimal kNN model selected in this study provided a cost-effective means for such an effort. Among the six distance metrics, random forest presented the highest accuracy to impute the species-level aboveground forest biomass. The use of all six-months of MODIS data did not significantly improve the imputation accuracy compared to the use of single-month MODIS data for September. Among the 15 k values (1-15), k = 7 as the input parameter of the kNN model showed the best accuracy. Larch was the most dominant species in Northeast China, followed by white birch. The biomass of willow and Amur corktree was very low due to their limited distribution over the study area. Overall, the aboveground forest biomass in the Greater Khingan Mountains was lower than that in the Lesser Khingan and Changbai Mountains.
Accuracy of the results improved obviously from the stand level up to the ecotype level, therefore our results are more suitable to apply to forest ecosystem models (e.g., LINKAGES) which require species-level attributes at the ecotype scale. Our mapped wall-to-wall species-level biomass can also be used to initialize the forest landscape models (e.g., LANDIS) to simulate changes in tree species composition. The spatial pattern of species-level aboveground forest biomass presented here could also capture the forest regions influenced by disturbance (e.g., fire or harvest). However, this study presented the maps of species-level aboveground forest biomass in Northeast China for only one year (2000). In order to better assess the influence of disturbance and climate change on forests, the temporal variation of species-level aboveground forest biomass should be further studied.
Supplementary Materials: The following are available online at http://www.mdpi.com/2072-4292/11/17/2005/s1, Figure S1: Scatter plot between the observed and the estimated AFB for 17 species separately based on the testing data at the stand level, Figure S2: Cumulative distribution functions of the observed and the estimated AFB for 17 species separately based on the testing data at the stand level, Figure S3: Scatter plot between the observed and the estimated AFB for 17 species separately based on the testing data at the ecotype level, Figure S4: Cumulative distribution functions of the observed and the estimated AFB for 17 species separately based on the testing data at the ecotype level.