Estimating the Growing Stem Volume of Chinese Pine and Larch Plantations based on Fused Optical Data Using an Improved Variable Screening Method and Stacking Algorithm

: Accurately estimating growing stem volume (GSV) is very important for forest resource management. The GSV estimation determination, 0.8127 and 0.6047, and the smallest relative root mean square errors of 17.1% and 20.7% for Chinese pine and larch, respectively. This study provided new insights on how to choose suitable optical images, variable selection methods and optimal modeling algorithms for the GSV estimation of Chinese pine and larch plantations. estimation using the Stacking algorithm.

imagery has five multispectral bands plus two short-wave infrared (SWIR) bands, which can help vegetation observation. Furthermore, Landsat 8 imagery has global coverage and is free of charge [31]. Therefore, fusing GF-2 multispectral bands with Landsat 8 multispectral images will provide the potential to improve the GSV estimation of Chinese pine and larch plantations in northeastern China.
Selecting suitable feature variables from optical images is a key step in developing forest GSV estimation models [45]. Besides raw spectral data, many variables such as various band ratios, vegetation indices, and texture measures can be derived by band calculations and transformations [13,32], which offers the possibility of obtaining a large number of feature variables. However, this also leads to difficulty in selecting the variables that can significantly increase the GSV estimation accuracy. Principal component analysis (PCA) is often used to reduce the information redundancy, as it linearly combines original variables and transforms them into a set of new variables that are not correlated with each other [13,19]. The new variables are not bio-physically meaningful. In addition, Stepwise Regression Analysis (SRA) [28,30], Random Forest (RF) [13,19,46], the sure independence screening procedure based on Pearson correlation (PC-SIS) [47], the sure independence screening procedure based on distance correlation (DC-SIS) [48], and the fast iterative features selection method for k-nearest neighbor (kNN-FIFS) have also been widely used for variables selection in classification or regression modeling [49]. Unlike PCA, these methods keep the original variables and do not generate new variables when reducing the dimensionality of sample data. However, these methods do not consider the self-correlation of the variables or the combined effects of the variables. Due to the interaction and correlation between remote sensing variables, different variable combinations may lead to very different accuracies of GSV estimation [13]. Thus, we should develop a new variable selection method or improve the existing methods to overcome the gaps that currently exist and further investigate the optimal number and combination of feature variables to increase the estimation accuracy.
The estimation algorithms also influence the GSV estimation results [13,[50][51][52]. Ensemble learning algorithms, such as RF, eXtreme Gradient Boosting (XGBoost) and Stacking, often have better performance than traditional regression algorithms such as Multiple Linear Regression (MLR) [53][54][55][56][57][58]. However, the ensemble learning algorithms are generally more sensitive to the characteristics of training samples, including sample sizes and representatives of population features, and thus require more training samples than the traditional parametric algorithms [13,19,45]. In addition, the performance of the algorithms may be site-specific [13,45].
This research aimed to develop an effective method for feature variable screening and identifying suitable optical images and modeling algorithms for improving the GSV estimation of planted forests. We first fused GF-2 multispectral bands with Landsat 8 multispectral images, which led to various datasets or data scenarios along with the original images. We then proposed a feature variable screening and combination optimization procedure based on the distance correlation coefficient and k-nearest neighbor algorithm (DC-FSCK). The proposed DC-FSCK method was compared with two widely used feature variable screening methods (SRA, RF) to select the optimal set of feature variables from the datasets. In Section S3 of the Supplementary Material, we described the SRA and RF methods in detail. Moreover, we validated the selections of the feature variables by comparing the performance of the parametric algorithm MLR and five nonparametric machine learning algorithms, including K-Nearest Neighbor (kNN), Support Vector Regression (SVR), RF, XGBoost and Stacking. The best method was used to map the GSV of Chinese pine and larch plantations in northern China. It was expected that this research could provide new insights on the GSV estimation of the coniferous plantations.

Study Area
The study area, Wangyedian Experimental Forest Farm, is located in the southwest of Harqin, Inner Mongolia of China (Figure 1), with a total area of approximately 500 km 2 . Influenced by the monsoon temperate continental climate, this region has an annual precipitation of 300~500 mm and an annual average temperature of 4.2 • C. The elevation of this region is higher in the southwest and lower in the northeast, with the range of 800 m to 1890 m. There are many middle mountains and low mountains [59,60]. The percentage forest cover was about 93% by the end of 2016, with a total stock volume of 1.527 million m 3 [59]. The plantations occupied most of the forested area in this forest farm, mainly including larch (Larix principis-rupprechtii and Larixolgensis), Chinese pine (Pinus tabuliformis) and Scots pine (Pinus sylvestris) forests [48].
Remote Sens. 2020, 11, x FOR PEER REVIEW 4 of 24 an annual average temperature of 4.2°C. The elevation of this region is higher in the southwest and lower in the northeast, with the range of 800 m to 1890 m. There are many middle mountains and low mountains [59,60]. The percentage forest cover was about 93% by the end of 2016, with a total stock volume of 1.527 million m 3 [59]. The plantations occupied most of the forested area in this forest farm, mainly including larch (Larix principis-rupprechtii and Larixolgensis), Chinese pine (Pinus tabuliformis) and Scots pine (Pinus sylvestris) forests [48].

Framework of This Research
The methodological framework for mapping the GSV of Chinese pine and larch plantations using multiple datasets includes four steps: 1) Data preparation and processing; 2) Feature variable extraction; 3) Variable selection; 4) Model development, evaluation, and application ( Figure 2). Based on GF-2 and Landsat 8 images, we derived various datasets, including the original and fused images, and from which we selected the best sets of feature variables, developed the relationships of GSV from the sample plots with the feature variables for Chinese pine and larch, including the spectral bands, vegetation indices and texture measures. The digital elevation model (DEM) was utilized to conduct the terrain corrections of the images. We did not calculate tree DBH and height from the remote sensing data. Therefore, the framework can be extended to other study areas if and only if there are field plot data ready for use, or high quality LIDAR data are available.

Framework of This Research
The methodological framework for mapping the GSV of Chinese pine and larch plantations using multiple datasets includes four steps: 1) Data preparation and processing; 2) Feature variable extraction; 3) Variable selection; 4) Model development, evaluation, and application ( Figure 2). Based on GF-2 and Landsat 8 images, we derived various datasets, including the original and fused images, and from which we selected the best sets of feature variables, developed the relationships of GSV from the sample plots with the feature variables for Chinese pine and larch, including the spectral bands, vegetation indices and texture measures. The digital elevation model (DEM) was utilized to conduct the terrain corrections of the images. We did not calculate tree DBH and height from the remote sensing data. Therefore, the framework can be extended to other study areas if and only if there are field plot data ready for use, or high quality LIDAR data are available.

Field Plot Data Collection
The fieldwork was carried out in September and October of 2017. Considering the tree age and spatial distribution, we measured 37 Larch plots and 42 Chinese pine plots using a random stratification sampling. The angles and central positions of the plots were measured by a differential Global Positioning System unit with the location measurement accuracy higher than 15 cm. The spatial distribution of all the plots is shown in Figure 1c. Each sample plot had a size of 25 m × 25 m. Each plot fell in one land cover type and contained only one major forest type. All the plots were divided into several age groups, and the age of the trees in each plot was basically the same. All sample plots were away from forest stand boundaries. For each plot, we measured and recorded the information of all standing living trees with DBH larger than 5 cm, including tree DBH, height, crown base height and crown diameters in two perpendicular directions, and topographic factors (e.g., slope, aspect).
Based on the field measurements of tree height and DBH, the GSV of each tree in each sample plot was calculated according to the two-variable tree GSV equations provided by the Wangyedian Experimental Forest Farm (Table 1) for the corresponding tree species. The equations were developed in accordance with the technical regulations on construction of two-variable tree volume table of the People's Republic of China (bz0000008300) (http://www.forestry.gov.cn/portal/xdly/s/5191/content-973771.html). The two-variable tree GSV equations take DBH and height of individual trees as the independent variables and tree GSV as the dependent variable. The total GSV of all trees in each sample plot was considered as the forest GSV at the plot level and then converted to the GSV at the hectare level (i.e., M 3 /ha). The final plot forest GSV statistics of the study area are shown in Table 2.

Satellite Image Collection and Pre-Processing
The GF-2 satellite is a Chinese satellite with optical sensors offering panchromatic images at a 1 m spatial resolution and multispectral (blue, green, red, and near infrared) images at a 4 m spatial resolution. The Landsat 8 operational land imager (OLI) images have eight 30 m spatial resolution multispectral bands, one 15 m spatial resolution panchromatic band, and two 100 m spatial resolution thermal bands. We collected two images of GF-2 dated on 5 September 2017 (http://www.cresda.com/CN/), and one Landsat 8 OLI image with the cloud cover of 0%-5% and dated on 21 September 2017 for the experiment (https://www.usgs.gov/). Landsat 8 OLI has four multispectral bands that are similar to those of GF-2. Moreover, Landsat 8 OLI has other three multispectral bands (band1_Coastal, band6_SWIR1 and band7_SWIR2), which may provide the potential to improve GSV estimation accuracy.
The FLAASH model of ENVI (version 5.3) was used for the atmospheric correction of these images. By radiometric calibration, the digital number (DN) values were converted to surface spectral reflectance. The terrain correction was conducted using a 30 m spatial resolution DEM (http://www.gscloud.cn/) and the SCS + C correction model that is suitable for correcting topographic effects of the imagery for the areas characterized by steep topography and low sun zenith angle [61]. The slope, aspect and elevation of the study area on the pixel level were extracted from DEM.

Data Fusion and Matching
The Gram-Schmidt method [39,48] was employed to fuse the multi-spectral and panchromatic data of GF-2 to generate the images with a spatial resolution of 1 m. Using the Gram-Schmidt method, we also fused each band (band 1 to band 5) of the GF-2 images with all multispectral bands (band 1 to band 7) of the Landsat 8 image to obtain five Landsat-like multispectral images at a spatial resolution of 1 m, including Pan_Landsat, Blue_Landsat, Green_Landsat, Red_Landsat, and Nir_Landsat. In Figure S1 of the Supplementary Material, we show examples of the resulting fused images. The fused Landsat-like images contain more details of the ground objects and richer spectral information due to two SWIR bands involved. The Landsat-like multispectral images were resampled to a 30 m or 25 m spatial resolution. In order to better match the field measured GSV plots with the pixels of the remote sensing images, we extracted feature variables of different spatial resolutions (30 m and 25 m) to explore the most appropriate pixel size for the GSV estimation model. We used the Spatial Analyst tool (Extract MultiValues to Points) of ArcGIS 10.2.2 to extract the spectral and texture measures of each field sample plot by Bilinear Interpolation.

Extraction of Feature Variables
In this research, we extracted the spectral signatures, vegetation indices and texture measures for the forest GSV estimation. The Landsat-like spectral data consist of seven basic bands (Coastal, Blue, Green, Red, Nir, SWIR 1 and SWIR 2). The vegetation indices based on spectral features are summarized in Table 3. Gray Level Co-occurrence Matrix (GLCM) is a common texture analysis method, which captures the comprehensive information of the imagery gray level including the direction, field and change amplitude, and can better reflect the correlation between the gray levels of texture [13]. We extracted the textural images from the Landsat-like images using the GLCM with the step size [1,1] and the window size (3 × 3). Eight texture measures (Mean, Variance, Homogeneity, Contrast, Dissimilarity, Entropy, Second moment, Correlation) were calculated in this research. Table 3. Vegetation indices used in this research.

Vegetation Indices Definitions
Simple two-band ratios

Selection of Optimal Variable Combination
Proper variable combinations can improve the GSV estimation accuracy, but selecting the best variable combinations is challenging [13]. On one hand, there are a large number of features, vegetation indices and texture measures available. On the other hand, the variables may be interrelated, which leads to information redundancy and affects the improvement of estimation accuracy [15,17,18,30]. A novel method that can provide potential solutions for the challenges is needed.

Distance Correlation
Feature selection is usually based on correlation analysis, such as Pearson correlation coefficient, Kendall's τ coefficient, distance correlation (DC) coefficient, and martingale difference correlation [46,47,62,63]. However, these correlation analysis methods cannot accurately describe the relationship between random vectors. For example, when the martingale difference correlation or Pearson correlation is zero, two random vectors may be not independent from each other. But if and only if two random vectors are independent, the DC coefficient between them is zero [64]. Thus, it is better to utilize DC coefficient to account for the correlation between two random vectors.
We used the DC coefficient to measure the correlation between two variables u and v, noted as dcorr (u, v). When dcorr (u, v) = 0, u and v are independent from each other, and the larger the dcorr (u, v), the stronger the DC. Suppose that {(u i , v i ), i = 1, . . . , n} is a random sample from the population (u, v). The estimation of DC between random variables u and v can be defined as follows [64]:

Selection of Suitable Variables Using an Improved Method
In this paper, we developed an improved method, DC-FSCK, to select feature variables. The improved method takes the correlations among feature variables as a new penalty adjustment factor, then removes the variables that decrease the estimation accuracy to optimize the feature variable selection. This is the preliminary variable screening. Then we compared the estimation errors of the GSV using kNN and different combinations of variables, and the one with the smallest error is the optimal variable combination.
We used DC coefficients to measure the correlations between feature variables and the plot GSV, and selected important variables by thresholds. Let y = (Y) T be the response vector (the plot GSV), and x = (X 1 , . . . , X p ) T be the predictor vector (feature variables from the used images). The dimensionality p was much larger than the sample size n. Therefore, it was safe to assume that only a few variables are correlated to the response vector y. We took ω j as the marginal utility to rank X j by importance at the population level. That is, we defined ω j based on a random sample {x i , y i }, We defined the number of important variables as d (d < n), so the reference threshold was d = m [n log(n)] with m being a positive number [46]. The process of feature variable selection by the DC-FSCK consists of the following 7 steps. In Figure S2 of the Supplementary Material, we show the process of selecting feature variables by the DC-FSCK algorithm.
Step 1. Rank the feature variables by the values of DC coefficientω j . Select the feature variable with the largestω j and denote it with X 1 .
Step 2. Select the second important feature variable, and calculate the comprehensive DC coefficient R j between each rest d − 1 variable (X j , j = 1, . . . , d − 1) and X 1 is calculated bŷ where λ is the penalty adjustment factor. In this study, λ is not a fixed constant, but grows with the increase of the number of the selected variables. When the selected variables are very few, the weight of the DC coefficientω j is relatively large. The larger the λ value, the greater the penalty for the correlation between the independent variables. A greaterR j brings less redundant information and higher prediction accuracy. The feature variable with the largestR j , denoted with X 2 , is selected for the GSV estimation. When the third or more important feature variables are selected, the comprehensive DC coefficientR jk (k = 3, . . . , d) between each of the dk + 1 feature variables and the variables X 1 , X 2 , . . . , and X k−1 is calculated. The mean values of all the correlation coefficients are then derived. After multiplying −λ and addingω j , the resultR jk is ascertained.
The feature variable with the largest comprehensive DC coefficient is selected and denoted with X k .
Step 4. Update the candidate feature variable subset Step 5. Develop model kNN_i using the feature variable subset Optimize the parameters of model kNN_i and calculate the optimal RMSE. Repeat this step until i = m.
Step 7. Calculate the DC coefficient between the variables of Best_F. If the DC coefficient is large, and the estimation accuracy is not significantly reduced when the lower ranked variable is deleted, delete the lower ranked variables. Finally, the optimal variable combinations BEST_F is obtained.
Steps 1-3 use the DC method to preliminarily screen the feature variables that are highly correlated with the dependent variable but weakly self-correlated. Steps 4-7 examine the combined effect of variables. Due to the mutual influence of enhancement or inhibition between the variables, the optimal subset of variables is further determined by comparing the estimated results (RMSE) using kNN.

GSV Estimation Models
In this study, we compared two machine learning algorithms (kNN, SVR), three ensemble machine learning algorithms (RF, XGBoost, and Stacking) and the widely used parametric approach, MLR. For details of the methods, please refer to Section S4 of the Supplementary Material. Based on the estimation algorithms, we developed the relationship of the plot GSV with the spectral signatures, vegetation indices and texture measures, and optimized the hyperparameters of the algorithms according to the characteristics of different sample datasets. Finally, we obtained corresponding estimation models for different datasets or data scenarios used to estimate the GSV of Chinese pine and Larch forests.

Evaluation of Modeling Results
In order to analyze the influence of data scenarios, feature variable selection methods and modeling algorithms on GSV estimation, we used three kinds of datasets (GF-2, Landsat 8, and their fusion images), three variable selection methods (SRA, RF, DC-FSCK), and six modeling algorithms (MLR, kNN, SVR, RF, XGBoost, and Stacking) to estimate the GSV and compared the results.
Usually, sample plots are randomly separated into training samples and testing samples [65], but collecting enough samples for modeling and validation is not easy, because of the high cost and limited accessibility. The k-fold cross validation method is useful for both classifications and estimation without extra data required (where, k is the number of sample plots). Therefore, we used the leave-one-out cross-validation for calculating determination coefficient (R 2 ), adjusted R 2 , RMSE and relative RMSE (RMSEr) between the estimated and observed values to assess the models' prediction performance [28,31]. They were calculated by: Greater R 2 and smaller RMSE and RMSEr indicate better modeling and prediction performance. The most accurate estimation model was used to map the GSV of the study area.

Mapping the GSV of Chinese Pine and Larch Plantations
Xie et al. [48] classified the forests of the study area using the ZY-3 data and the SVM technique, and obtained the spatial distributions of Chinese pine and larch plantations. Those data were directly used in this research. The final model for GSV estimation was determined, the GSV value of each pixel was calculated and the GSV maps of the whole study area were obtained.

Data Fusion for Estimating Plantation GSV
By fusing each band of the GF-2 images with the Landsat 8 multispectral image over the study area, we got five Landsat-like images, including Pan_Landsat, Blue_Landsat, Green_Landsat, Red_Landsat, and Nir_Landsat. After extracting the spectral and texture variables, the SRA method was used to select the feature variables and obtain linear regression results. In Table 4 This indicates that the high-resolution multispectral band and medium-resolution multispectral image fusion has better performance of GSV estimation than the high-resolution panchromatic band and medium-resolution multispectral image fusion.

Variables Selection and Estimation Result Comparison
In order to investigate the effects of the GF-2 multispectral and Landsat 8 multispectral fusion data on the estimation accuracy, we further analyzed and compared the estimation errors and saturation levels of the GF-2, Landsat 8, and Red_Landsat data using different variable screening methods and estimation models. The best optical data source, optimal variable screening method and estimation model were then used to estimate the GSV. Table 5 lists the variables selected by SRA, RF and DC-FSCK in three data scenarios. In order to improve the GSV estimation accuracy, we performed variable selection and model development based on tree species stratification. In Table 5, the abbreviations M, V, H, Con, D, E, S, and Cor mean the texture measures, mean, variance, homogeneity, contrast, dissimilarity, entropy, second moment and correlation, respectively. Except a few key variables, the variables selected by DC-FSCK are different from those selected by RF and SRA. For example, from the fused image Red_Landsat for Chinese pine plantations, SRA selected Blue_M, Blue_H, and Coastal _Cor; RF selected Blue_M, Coastal _M, SWIR2_E, Green_S, DVI 34 , Blue_S, SWIR2_M, and NDVI 34 , while the DC-FSCK method selected Blue_M, Green_Dis, Green_S, SWIR2_H, Red_H, SAVI 0.25 , and Coastal _S. The three methods all selected the mean texture measure Blue_M in the blue band, and some texture measures in the Coastal band. The RF and DC-FSCK selected eight and seven variables, respectively, but SRA only led to three variables. In addition, both RF and DC-FSCK selected the relevant texture measures of the SWIR2 and Green bands.

Feature Variables Selected by Three Methods
The difference in the variable selection is mainly due to the different principles based on which the three methods were developed for the purpose. The SRA focuses on examining the linear correlation between the feature variables and GSV. The RF is based on importance ranking of the variables. The DC-FSCK does not only examine the linear and nonlinear relationship between the feature variables and GSV, but also takes into account the self-correlation and combination effects between the variables.

Estimation Results of the Chinese Pine
In Table 6, the estimation results of three variable selection methods and six modeling algorithms are compared in terms of adjusted R 2 and the RMSEr. Overall, the largest adjusted R 2 and smallest RMSEr value of 0.8127 and 17.05% were obtained using a combination of the fused data Red_Landsat, the DC-FSCK variable selection method and the Stacking model for Chinese pine. For the GF-2 image and the Landsat 8 image, the combination of DC-FSCK with the Stacking model and the combination of DC-FSCK with the SVR were the best combinations in terms of performance. The DC-FSCK decreased the estimation RMSEr values by 6.14% to 7.53% and 3.41% to 9.05% compared with those from SRA and RF, respectively. The decreases in RMSEr were statistically significantly different from zero at the significant level of 0.05. The Red_Landsat image based on the Stacking model achieved the RMSEr value of 17.05%, being significantly smaller than those from the GF-2 and Landsat 8 image at the significant level of 0.05. When the RF method was used for both the selection of the feature variables and the GSV estimation, the obtained RMSEr values of 24.93%, 24.75%, and 26.10% for the GF-2, Landsat 8 and Red_Landsat images, respectively, were relatively smaller. Regardless of the datasets and the variable selection methods, the MLR, kNN and SVR led to larger RMSEr values than other modeling algorithms. Table 6. Summary of the Adjusted R 2 and RMSEr for GSV estimation of Chinese Pine and Larch Plantations using three variable selection methods and six estimation models based on three datasets (Note: The numbers in bold indicate the model with the greatest Adjusted R 2 and smallest RMSEr value in the same data scenario).

Estimation Results of Larch
Similar to the results of Chinese pine, overall, the combination of the fusion data Red_Landsat, the DC-FSCK variable selection method and Stacking model led to the greatest adjusted R 2 and smallest RMSEr of 0.6047 and 20.68% for larch. For the GF-2 data, the best estimation result with the RMSEr of 25.72% was achieved by the Stacking model based on the DC-FSCK selected variables. For the Landsat 8 image, the combination of DC-FSCK with kNN resulted in the most accurate estimates with the RMSEr of 23.13%. When the variable screening methods, SRA and RF, were used, the ensemble machine learning algorithms, XGBoost and RF, resulted in smaller RMSEr (26.56% and 27.01%). At the significant level of 0.05, the smallest RMSEr values of the SRA method were significantly larger than those of RF and DC-FSCK. When the DC-FSCK method was used to select the feature variables, in addition to the Stacking model, kNN obtained relatively smaller RMSEr values of 26.47%, 23.13% and 21.70% for the GF-2, Landsat 8, and Red_Landsat images, respectively. Given the datasets and the variable selection methods, the MLR model led to largest RMSEr values.

Residuals and Potential Saturation Levels of GSV Estimation
Generally, the estimation residuals should not be larger than 50% of the sample mean. In Figures 3  and 4, the measured and estimated GSV values were compared for Chinese Pine and Larch Plantations, respectively. For Chinese pine, the Stacking model based on the Red_Landsat image and the DC-FSCK variable selection method has the best fitting trend and the maximum estimate, which potentially indicates the highest saturation level (Figure 3(c3)). For the GF-2 and Landsat 8 images, the Stacking and RF model using the variables selected by the DC-FSCK method also have a good fitting trend and a potentially high saturation level (Figure 3(a3,b3)). Although the scatter graphs (Figure 3(c3 vs. a3 and b3)) look similar, the estimation residual absolute values are considerably different, especially for the sample plots with large GSV values. For example, in Figure 3(a3,b3), there are two and four sample plots with their residual absolute values larger than 128.6 m 3 /ha, respectively. Whereas, in Figure 3(c3), only one sample plot has the residual value close to 128.6 m 3 /ha.
For the larch sample plots, the Red_Landsat image has the best estimation results (Table 6 and Figure 4(c3)), the set of the feature variables selected by DC-FSCK have the potentially highest saturation level, and the Stacking model has the best estimation performance. For the Landsat 8 image (Figure 4(b2,b3)), the sets of the feature variables selected by RF and DC-FSCK also show that most points fit the reference line well, and they have potentially higher saturation levels than that by SRA. The Stacking and RF models have better fitting trends than other models.
In sum, for both Chinese pine and larch sample plots, the fusion image Red_Landsat has richer information than the GF-2 and Landsat 8, and the set of the feature variables selected by DC-FSCK has a potentially higher saturation level than those by SRA and RF. Therefore, the estimation model based on the combination of the Red_Landsat image with the variables selected using the DC-FSCK method has the best fitting trend and the smallest estimation error (Figure 3(c3), Figure 4(c3)).  Table 6. The black diagonal line is the theoretical best fit reference line, and the blue parallel dashed lines are the estimation residual reference lines with a 50% deviation from the sample mean.  Table 6. The black diagonal line is the theoretical best fit reference line, and the blue parallel dashed lines are the estimation residual reference lines with a 50% deviation from the sample mean.  Table 6. The black diagonal line is the theoretical best fit reference line, and the blue parallel dashed lines are the estimation residual reference lines with a 50% deviation from the sample mean.

Mapping the GSV of Chinese Pine and Larch Plantations
As shown in Figure 5, for both Chinese pine and larch, the GSV estimation models using the variables selected by the DC-FSCK method get more pixels with the GSV values greater than 270 m 3 /ha than other models ( Figure 5(a3) vs. (a1,a2) and (b3) vs. (b1,b2)), indicating that the selection of the key feature variables is important in increasing data saturation levels. It also shows that the sets of the feature variables selected by the DC-FSCK method have potentially higher saturation levels and contain more useful information. (1,2,3) are the GSV estimated using three variable screening methods SRA, RF, and DC-FSCK, respectively. Each graph corresponds to the best estimation model with the smallest RMSEr value for each data scenario in Table 6. The black diagonal line is the theoretical best fit reference line, and the blue parallel dashed lines are the estimation residual reference lines with a 50% deviation from the sample mean.

Mapping the GSV of Chinese Pine and Larch Plantations
As shown in Figure 5, for both Chinese pine and larch, the GSV estimation models using the variables selected by the DC-FSCK method get more pixels with the GSV values greater than 270 m 3 /ha than other models ( Figure 5(a3) vs. (a1,a2) and (b3) vs. (b1,b2)), indicating that the selection of the key feature variables is important in increasing data saturation levels. It also shows that the sets of the feature variables selected by the DC-FSCK method have potentially higher saturation levels and contain more useful information. Figure 5. The GSV maps of (a1-a3) Chinese pine and (b1-b3) larch plantations in the study area estimated based on the fusion image Red_Landsat using the variables selected by SRA, RF, and DC-FSCK method, respectively. Each map corresponds to the best estimation model with the smallest RMSEr value for each data scenario in Table 6. Figure 5. The GSV maps of (a1-a3) Chinese pine and (b1-b3) larch plantations in the study area estimated based on the fusion image Red_Landsat using the variables selected by SRA, RF, and DC-FSCK method, respectively. Each map corresponds to the best estimation model with the smallest RMSEr value for each data scenario in Table 6.

The Role of Data Fusion in GSV Estimation
Fusing multispectral and panchromatic images from different optical sensors helps the visual interpretation of vegetation, mainly because it improves spatial resolution [37,40]. This research showed that the fusing data of high-resolution multispectral bands with medium-resolution multispectral images can improve the accuracy of GSV estimation for Chinese pine and larch plantations in northern China. Some studies have proved that the red band performs much better than other bands when using optical images for AGB or GSV estimation [28,31], which is consistent with the finding of this study that the Red_Landsat image fused by the GF-2 red band with the Landsat 8 multispectral data leads to larger R 2 and smaller RMSEr values than other fused images, including the fusion image Pan-Landsat.
The spatial information of the 1 m spatial resolution GF-2 image is much richer than that of the 30 m spatial resolution Landsat 8 image. AS shown in Table 5, for Chinese Pine and Larch plantations, more texture variables are selected from the GF-2 images than the Landsat 8 image. But the GF-2 image has no SWIR bands that have a strong relationship with GSV or AGB [13,31]. The fusion image Red_Landsat has both high spatial resolution and SWIR bands and thus resulted in the best performance in the GSV estimation.

Effective Methods for Improving Spectral Variable Selection and Data Saturation
Different remote sensing feature variables have different saturation and sensitivities to forest GSV estimation. The selected feature variables with low saturation can hardly generate high forest GSV values, so studies should focus on improving the selection of feature variables and increasing the data saturation levels. Although the stratification of forest types and topography can improve the GSV estimation accuracy [13,31], it cannot effectively capture the important and un-correlated feature variables and thus cannot increase the data saturation levels. Therefore, using proper variable selection methods, such as DC-FSCK, is important in increasing data saturation levels. The DC-FSCK can examine the linear and nonlinear relationships between feature variables and GSV and consider the self-correlation and combination effects among the feature variables. The SRA and RF can only select the variables with good linear correlation or importance, but discard the rest variables that might have a high saturation level and contain useful information [19]. Song et al. [19] proposed a spectral variable selection method by integrating Jeffreys-Matusita distance and correlation among feature variables for image classification of wetlands and found that their method offered greater potential than RF. However, their method ignores the combination effects of the feature variables.
This study revealed that based on the fused image Red_Landsat, using the sets of the feature variables selected by the DC-FSCK, RF, and SRA methods, the GSV maximum values of the Chinese pine and larch plantations were about 490 m 3 /ha and 350 m 3 /ha, 410 m 3 /ha and 300 m 3 /ha, 400 m 3 /ha and 320 m 3 /ha, respectively, indicating the corresponding saturation levels. The sets of the spectral variables selected by the DC-FSCK method led to potentially higher saturation level and GSV estimation accuracies than those by RF and SRA. Zhao et al. [31] used the feature variables selected from Landsat Thematic Mapper images by SRA and a spherical model, and obtained the forest biomass saturation values of 159 Mg/ha and 152 Mg/ha for pine (Pinus Massoniana) plantations and broad-leave forests in Zhejiang of Eastern China. Based on the study of Wang et al. [66], the average gravity coefficients of the pine and broad-leave forests in this area were 0.446 g/cm 3 and 0.417 g/cm 3 , respectively. Thus, the authors got GSV saturation values of 356.5 m 3 /ha and 364.5 m 3 /ha for the pine and broad-leave forests. The pine plantations are biologically similar to Chinese pine forests in this study. However, the authors resulted in a much smaller GSV saturation value although the mature plantations of Pinus Massoniana in eastern China often have greater per unit GSV value than mature Chinese pine plantations in northern China. This might be mainly because different optical images and spectral variable selection methods were utilized and the data fusion was ignored. Moreover, Long et al. [10] obtained the saturation values of GSV ranging from 140.05 m 3 /ha to 349.84 m 3 /ha using a saturation-based multivariate method and quad-polarimetric synthetic aperture radar SAR images for Chinese fir plantations in the Hunan area of south central China. Although the Chinese fir plantations are also biologically similar to the Chinese pine plantations, mature Chinese fir plantations in south central China usually have larger per unit GSV values than the Chinese pine plantations in north China. More importantly, microwave images are less affected by weather conditions such as clouds, fogs and moisture, and SAR is characterized by the capacity to penetrate forest canopies. Thus, the quad-polarimetric SAR images should provide greater potential to obtain higher saturation levels than optical images. However, the GSV saturation values of the Chinese fir plantations obtained by Long et al. [10] are much smaller compared with those of the Chinese pine forests in this study. This might be mainly due to the uncertainties that are induced by the complexity of processing SAR images and the lack of the feature variables that accurately capture the characteristics of the data saturation.

Selection of Suitable and Stable Estimation Algorithms
This research compared six estimation models based on different data sources of Chinese pine and larch in northern China ( Table 6). The results show that the performances of these models vary with tree species, images, and the selected feature variables. For example, when the Landsat 8 image was used for Chinese pine GSV estimation, the MLR model led to higher estimation accuracy than the machine learning algorithms using the variables selected by SRA, and the RF model resulted in smaller RMSEr value using the variables selected by RF. However, when the variables selected by DC-FSCK were used, the SVR model performed best. For the Red_Landsat image, when the variables selected by DC-FSCK were utilized, the Stacking model had RMSEr values of 17.05% and 20.68%, which are 39.1% and 36.4% smaller than those of MLR for Chinese pine and larch, respectively. In addition, previous research results show that the estimation method of relearning based on the prediction results of different basic learners has good estimation accuracy [53]. This conclusion confirms the findings of this study on GSV estimation using the Stacking algorithm.

Implication of Methods for Improving GSV Estimation of Chinese Pine and Larch Plantations
Optical images often have low saturation levels and thus have low accuracy when they are used for forest AGB or GSV estimation [5,16]. Tree height or canopy height has higher saturation levels and is an important forest stand attribute, so the variables based on height provide great potential for improving AGB or GSV estimation [25,26,28,35]. However, it is expensive to get sufficient photogrammetric and LiDAR data that can be used to extract tree height. Optical images are easily acquired with large-scale coverages, so increasing the saturation level of optical images is a cost-efficient way for GSV estimation. The results of this study show that the multispectral and multispectral image fusion method can improve the performance of GSV estimation and increase the data saturation level by the DC-FSCK method.
The high estimation accuracy of forest GSV depends on many factors, such as forest structure characteristics, the quality of field plot data and images, ancillary data, methods for variable selection, and models used for estimation. The GSV of forest plots with low canopy density and large gaps between canopies is usually overestimated, because bare soils, shrubs and grass under canopies have great influence on the surface reflectance, resulting in the high uncertainty of the GSV estimation [13]. Contrarily, forest plots with high canopy densities and large GSV values often have GSV underestimation, since the optical sensor data cannot capture the vertical structural features of the forest canopies, thus lead to small saturation values [31].
In addition to using the forest canopy structure features, such as tree height and canopy height, there are other ways to improve the GSV estimation accuracy and saturation levels. First, using fused images obtained by integrating high spatial resolution bands with medium spatial resolution multispectral images. In this study, the GF-2 multispectral bands and Landsat 8 multispectral image were fused to obtain new sets of Landsat-like multispectral images. In future studies, data fusion of other optical sensors' (such as ZY-3 and Sentinel-2, etc.) images should be explored. Secondly, applying a suitable image fusion method. This study used the Gram-Schmidt method to fuse the GF-2 multispectral bands with the Landsat 8 multispectral image. The impact of different fusion methods, such as Wavelet transform, Energy Division transform, and PCA transform on the performance improvement of GSV estimation should be analyzed in future studies. Thirdly, extracting the feature variables that significantly contribute to the estimation improvement of forest GSV, such as texture measures, topographical, phenological and auxiliary data. Current GSV estimation models focus on the use of spectral variables from remote sensing images and only a few reports take phenological features and auxiliary data into account [17,18,33,34]. Future attention should be paid to the consideration of the variables from different data sources.
Moreover, appropriate methods should be further developed to select the significant and un-correlated variables that can lead to high saturation levels and improvement of GSV estimation performance. In this study, DC-FSCK was used to select the feature variables and improved the GSV estimation performance. However, the DC-FSCK method needs to further be refined, such as speeding up the program's running and simplifying computational complexity. In addition, new estimation models that are relatively simple and easily used and less sensitive to sample size should be developed. In this study, the Stacking showed the best performance for the GSV estimation of both Chinese pine and larch plantations (Table 6). However, this method is very complicated and time-consuming. It requires a set of inputs, such as base-learning and meta-learning algorithms, and optimizing hyperparameters, so the results may vary with different initial inputs. The characteristics of the Stacking should be further investigated in future studies. With the development of machine learning, the process of constructing algorithms is getting simpler and requires less interactions from users. Automatic Machines Learning (AutoML) is intended to automatize the entire process of machine learning, including neural network architecture search, learning algorithm selection, hyperparameter optimization, model evaluation and model application, resulting in end-to-end models. AutoML may also be promising to improve the GSV estimation. Finally, in this study, the proposed method that aims to create, extract and select the significant and un-correlated feature variables to improve the saturation levels and estimation accuracy of GSV for Chinese pine and larch was validated in one study area. In order to evaluate the robustness of this method, further verification is needed in various study areas.

Conclusions
This research fused GF-2 multispectral band and Landsat 8 multispectral images to estimate the GSV of Chinese pine and larch plantations in North China. An improved variable selection method DC-FSCK was developed and compared with SRA and RF to screen the feature variables used for modeling. Six models for GSV estimation were compared, which are MLR, kNN, SVR, RF, XGBoost and Stacking. A comparative analysis of the GSV estimation results indicates that (1) fusing the data of high-resolution GF-2 multispectral bands with medium spatial resolution Landsat 8 multispectral image provides more accurate estimates of GSV than those of the high-resolution panchromatic band and medium-resolution multispectral image. The fused image Red_Landsat based on the GF-2 Red band contained the information of both high resolution and SWIR band, and thus performed better than the other fused images (Blue_Landsat, Green_Landsat, and Nir_Landsat) and the original images. The GSV estimation using the Red_Landsat image and the feature variables selected by DC-FSCK had the smallest RMSEr values for both tree species. (2) Different feature variables lead to different data saturation levels. The vegetation indices and texture features extracted from the fused image Red_Landsat showed potentially higher saturation values. Employing a proper variable selection method, such as DC-FSCK, is critical in capturing the canopy structure characteristics of the plantations and thus increasing the data saturation levels. The DC-FSCK examined the nonlinear relationship and combination effects among the spectral variables, so the selected spectral variables resulted in better GSV estimation performance and higher saturation values than those selected by SRA and RF. Based on the Red_Landsat image for Chinese pine, the smallest RMSEr values were 24.85%, 26.10% and 17.05%, respectively, for the variables selected by SRA, RF and DC-FSCK. For larch plantations, the corresponding RMSEr values were 24.11%, 26.75% and 20.68%, respectively.
(3) The Stacking performs better than MLR and other machine learning algorithms for the GSV estimation of Chinese pine and larch plantations. For Chinese pine and larch, the average RMSEr values of all the combinations of three datasets (GF-2, Landsat 8, and Red_Landsat image) with three variable selection methods (SRA, RF and DC-FSCK) by the Stacking algorithm were 24.31% and 25.78%, respectively, smaller than those by MLR and other machine learning algorithms.