Mapping 30 m Fractional Forest Cover over China’s Three-North Region from Landsat-8 Data Using Ensemble Machine Learning Methods

: The accurate monitoring of forest cover and its changes are essential for environmental change research, but current satellite products for forest coverage carry many uncertainties. This study used 30-m Landsat-8 data, and aggregated 1-m GaoFen-2 (GF-2) satellite images to construct the training samples and used multiple machine learning algorithms (MLAs) to estimate the fractional forest cover (FFC) in China’s Three North Region (TNR). In this study, multiple MLAs were merged to construct stacked generalization (SG) models based on the idea of SG, and the performances of the MLAs in the FFC estimation were evaluated. The results of the 10-fold cross-validation showed that all non-linear algorithms had a good performance, with an R 2 value of greater than 0.8 and a root-mean square error (RMSE) of less than 0.05. In the bagging ensemble, the random forest (RF) (R 2 = 0.993, RMSE = 0.020) model performed the best and in the boosting ensemble, the light gradient boosted machine (LGBM) (R 2 = 0.992, RMSE = 0.022) performed the best. Although the evaluation index of the RF is slightly better than that of the LGBM, the independent validation results show that the two models have similar performances. The model evaluation results of the independent datasets showed that, in the SG model, the performance of the SG(LGBM) (R 2 = 0.991, RMSE = 0.034) was better than that of the single or non-ensemble model. Comparing the FFC estimates of our model with those of existing datasets showed that our model exhibited more forest spatial distribution details and higher accuracy in complex landscapes. Overall, in this study, the method of using high-resolution remote sensing (RS) images to extract samples for FFC estimation is feasible. Our results demonstrate the potential of the ensemble MLAs to map the FFC. The research results also show that among many MALs, the RF algorithm is the most suitable algorithm for estimating FFC, which provides a reference for future research.


Introduction
Forests are the ecosystems with the richest biodiversity on land and represent an important factor affecting global carbon, water cycles, biodiversity, land-use change, and climate change [1][2][3][4][5][6][7]. China's Three-North Shelter Forest Program (TNSFP) [8] is the largest artificial afforestation project in the world, covering more than 40% of China's land area and serving as an important ecological barrier in northern China [9,10]. TNSFP not only holds important economic benefits but also plays an important role in preventing wind, fixing sand, regulating water and heat, improving the microclimate, soil, and water conservation, reducing natural disasters, and realizing the sustainable and healthy development of the ecological environment [11]. Owing to the fragility and complexity of the ecological environment in the Three-North Region (TNR), the vegetation ecosystem in this region is estimated parameters, and/or research area [40][41][42], and there is a lack of comparison of multiple algorithms in FFC estimation. To choose a more robust estimation model, it is necessary to verify the performances of the different algorithms in FFC estimation.
Considering the uneven spatial distribution of a forest, a large number of training samples need to be selected and labeled [32]. In FFC research, although the method based on field sampling can be used to obtain more accurate sample data, this method is expensive and can only be used in small research areas [43,44]; UAV RS can be used to obtain a larger range of sample data, but it is limited by the availability and timeliness of the data [20,[45][46][47]. High-resolution satellite remote sensing imagery (HRSRS) is characterized by a high spatial resolution and wide imaging range and has obvious advantages in terms of ground object recognition. HRSRS is widely used to obtain reference datasets in the research involving estimating the FFC [22,23,34,[47][48][49]. For example, Baumann, et al. [23], Godinho et al. [34] used high-resolution Google Earth images to collect reference datasets, while Gessner, et al. [22] used the results of classification and aggregation of IKONOS and QuickBird as a reference dataset. The spatial resolution of the GaoFen-2 (GF-2) satellite data is better than 1 m, and fused GF-2 data can provide a richer texture and geometric information and can make it easier to distinguish between surface forest and non-forest cover [50][51][52]. In this study, GF-2 was selected as the main source of the reference dataset.
In this study, the FFC in the TNR was drawn by using Landsat-8 RS images and was based on the research ideas of MLAs fusion. First, this research combines RS images of different resolutions (Landsat-8 and GaoFen-2) to construct a sample set for estimating the FFC, and then a multi-algorithm integrated machine learning model is constructed based on existing MLAs The performance of the MLAs estimating the FFC is evaluated and compared, and finally, a robust estimation model is established to estimate the FFC of the TNR.

Study Area
The TNR of China includes semi-arid and arid lands in the northeast, north, and northwest of China, where desertification and erosion of soil and water constitute serious problems. The TNSFP was implemented to realize the protection of forest resources and the sustainable development of the ecological environment [12]. Located between 73 • 29 -129 • 50 east longitude and 33 • 30 -50 • 14 north latitude, it includes 551 counties (banners, cities, districts) in 13 provinces (autonomous regions, municipalities): Shaanxi, Gansu, Ningxia, Qinghai, Xinjiang, Shanxi, Hebei, Beijing, Tianjin, Inner Mongolia, Liaoning, Jilin, and Heilongjiang. The total project construction area accounts for 42.4% of China's total land area. The goal of this project plans to increase the forest coverage rate in this area from 5.05% in 1978 to 14.95% in 2050 [53,54].
The TNSFP area exceeds 4000 km from east to west, and there are significant differences in surface environment and vegetation cover. The altitude rises from east to west, varying from less than 100 m to more than 5000 m, including plains, plateaus, mountains, and basins. The annual average precipitation in the region is extremely uneven and gradually decreases from east to west and from south to north, in some areas exceeding 400 mm. However, in the Tengger Desert, Badain Jaran Desert, and Qaidam Basin, the driest areas in China, the annual average precipitation is between 20 and 50 mm [12,30,53,54]. The vegetation distribution of TNR has obvious spatial variability, and different types of forests are mainly distributed in the northeast and north China. The northwest region is dominated by grasslands and deserts, a small part of the basin in this region is also covered by deciduous coniferous forests and mixed forests ( Figure 1) [55].

Data
Landsat images were obtained from the USGS Earth Resources Observation and Science (EROS) Center archive (https://earthexplorer.usgs.gov/, (accessed on 31 May 2020)). Atmospheric correction and conversion to surface reflectance were implemented using the Fast Line-of-sight Atmospheric Analysis of Spectral Hypercube (FLAASH) radiative transfer model in the ENVI environment [57]. In this study, only the visible light band data of Landsat-8 and the image quality assessment (QA) band were needed. Through the MASK of the QA band, the pixels marked as clouds, cloud shadows, water, snow, ice, or filled were excluded, and only clean pixels were kept [58][59][60]. As relationships between FFC and remote-sensing-derived variables are affected by many factors, the selection of suitable variables is a crucial step in FFC estimates. In this study, three different types of RS variables were used as the features for the FFC estimation: The vegetation indices reflect the spectral characteristics of the vegetation-soil system. They are widely used to evaluate the plant growth status, chlorophyll content, canopy structure, and photosynthetic efficiency, and can enhance the electromagnetic behavior of key biophysical parameters of vegetation, such as photosynthetic potential, canopy pigment, and canopy water content [62,63]. Healthy vegetation reflects a lot of light in the near-infrared part of the spectrum and

Data
Landsat images were obtained from the USGS Earth Resources Observation and Science (EROS) Center archive (https://earthexplorer.usgs.gov/, (accessed on 31 May 2020)). Atmospheric correction and conversion to surface reflectance were implemented using the Fast Line-of-sight Atmospheric Analysis of Spectral Hypercube (FLAASH) radiative transfer model in the ENVI environment [57]. In this study, only the visible light band data of Landsat-8 and the image quality assessment (QA) band were needed. Through the MASK of the QA band, the pixels marked as clouds, cloud shadows, water, snow, ice, or filled were excluded, and only clean pixels were kept [58][59][60]. As relationships between FFC and remote-sensing-derived variables are affected by many factors, the selection of suitable variables is a crucial step in FFC estimates. In this study, three different types of RS variables were used as the features for the FFC estimation: least light in the visible part (especially the red part). Therefore, green plants and other objects can be distinguished by their absorption and reflection differences in the red band and the near-infrared band, and the vegetation indices can be used as an effective predictor for estimating forest parameters [49,[64][65][66]. This study selected the vegetation indices shown in Table 1.
GF-2 images were accessed from China Resources Satellite Data and Application Center (CRESDA, http://www.cresda.com/CN, (accessed on 31 May 2020)). The GF-2 sensor index is shown in Table 2. The satellite was equipped with panchromatic and multispectral sensors (PMS), including a 0.8-m resolution panchromatic band and four 3.2-m resolution multispectral bands. Using the fast-line-of-sight atmospheric analysis (FLAASH) model, radiometric calibration and atmospheric correction were performed. The panchromatic and multispectral bands were resampled to 1-m and 4-m resolutions, respectively, and fused into 1-m resolution data [50,68]. We chose images with GF-2 and Landsat-8 imaging times close to each other and under cloud-free conditions. The distribution of data in the study area is shown in Figure 1a. Table 3 shows the imaging time of the Landsat-8 and GF-2 images. The imaging time difference between the two data sources was within one week, which avoided large surface occurrences caused by factors such as seasons and climate. In particular, the spatial distribution of forests did not change significantly in a short period.

Existing FFC Products for Comparison
Two global-scale datasets were used to compare the FFC results in this study: Global Forest Change 2000-2019 (GFC) [28] and Global Land Survey Landsat Tree Canopy Cover Continuous Fields (GLS_TCC) [21]. GFC uses a decision tree algorithm and Landsat data to generate annual global datasets with a resolution of 30 m and a time range from 2000 to 2019. GLS_TCC uses the RT algorithm to estimate the global 30-m resolution continuous fields of tree cover based on Landsat reflectance data and land surface temperature data. This dataset is produced every 5 years, and four datasets are currently available: 2000, 2005, 2010, and 2015.

Methodology
The methodology and experimental setup for this study are outlined in Figure 2. First, preprocess the selected RS images and auxiliary data to construct the dataset required by the machine learning model. Next, select MLAs and determine the optimal parameter combination of the model through parameter tuning. Then, use cross-validation (CV), independent validation, reference image data comparison, and other methods to compare and evaluate these algorithms, and select the algorithm with the best stability and robustness as the final estimation model. Finally, use the final model to estimate the FFC of TNR.  Figure 2. Flowchart of the methodology applied in this study.

Reference Data Sampling and Preprocessing
This subsection mainly describes the process of generating high-quality data for the development of the model. It mainly includes the processing of GF-2 images, the matching and processing of Landsat-8 RS images and reference data, and the construction and processing of the reference FFC dataset: (a) Choose 1-m resolution GF-2 images as reference image data. To extract forest and non-forest information from it, we used the eCognition Developer (Version 9.0) software combined with visual interpretation to realize the classification of GF-2 images (in Figure 3) [80,81]. In the process of visual interpretation, forestland and non-forest land are distinguished and judged according to the color, shape, and texture of the ground objects and the spatial distribution characteristics of the shadow between the ground objects, and then the wrong classification is manually edited and modified.

Reference Data Sampling and Preprocessing
This subsection mainly describes the process of generating high-quality data for the development of the model. It mainly includes the processing of GF-2 images, the matching and processing of Landsat-8 RS images and reference data, and the construction and processing of the reference FFC dataset: (a) Choose 1-m resolution GF-2 images as reference image data. To extract forest and nonforest information from it, we used the eCognition Developer (Version 9.0) software combined with visual interpretation to realize the classification of GF-2 images (in Figure 3) [80,81]. In the process of visual interpretation, forestland and non-forest land are distinguished and judged according to the color, shape, and texture of the ground objects and the spatial distribution characteristics of the shadow between the ground objects, and then the wrong classification is manually edited and modified. Then, extract sparse or independently distributed trees in the image to ensure that the extraction of forestland is correct. Finally, obtain the forest classification result of the reference image. (b) Extract the surface reflectance data of Landsat-8 images in the reference area, and calculate the vegetation index based on the reflectance data. Register GF-2 classification results with Landsat-8 data, and use a 5 × 5 sampling window for Landsat sampling, because a larger sampling window will lose small areas or fragmented forest cover, while a smaller sampling window will increase the spatial position inconsistency between different data sources because of registration errors [23,82]. Calculate the FFC in the corresponding sampling window. (c) Construct a dataset with surface reflectance, vegetation indices, and terrain factors as predictors, and FFC as the response variables. Divide the dataset into a training set, test set, and independent validation set, and then train, verify, and evaluate the model.  (b) Extract the surface reflectance data of Landsat-8 images in the reference area, and calculate the vegetation index based on the reflectance data. Register GF-2 classification results with Landsat-8 data, and use a 5 × 5 sampling window for Landsat sampling, because a larger sampling window will lose small areas or fragmented forest cover, while a smaller sampling window will increase the spatial position inconsistency between different data sources because of registration errors [23,82]. Calculate the FFC in the corresponding sampling window. (c) Construct a dataset with surface reflectance, vegetation indices, and terrain factors as predictors, and FFC as the response variables. Divide the dataset into a training set, test set, and independent validation set, and then train, verify, and evaluate the model.
The performance of MLAs depends on the quality of training samples. Because of noise problems, unbalanced distribution of data samples, and the lack of representativeness of samples, the data obtained according to the above methods cannot be directly used as input for MLAs. We need to preprocess the data, including the detection of outliers, the processing of missing and repeated values, data standardization, and feature selection [83].
Feature selection can reduce the computation time, prevent the model from being too complex and which can lead to overfitting, and facilitate a better understanding of the learning model or data by removing irrelevant and redundant data [84,85]. The RF pro- The performance of MLAs depends on the quality of training samples. Because of noise problems, unbalanced distribution of data samples, and the lack of representativeness of samples, the data obtained according to the above methods cannot be directly used as input for MLAs. We need to preprocess the data, including the detection of outliers, the processing of missing and repeated values, data standardization, and feature selection [83].
Feature selection can reduce the computation time, prevent the model from being too complex and which can lead to overfitting, and facilitate a better understanding of the learning model or data by removing irrelevant and redundant data [84,85]. The RF provides an assessment of the importance of the different feature variables in the prediction process. To evaluate the importance of each feature (e.g., the satellite image band and vegetation index), the RF switches one of the input random variables while keeping the rest constant, and it measures the decrease in the accuracy using the OOB error estimation and the decrease in the Gini index [86,87]. In this study, we used the RF model to analyze the importance of the selected feature variables to the target value and ranked the importance of the features, which is presented in Figure 4. The feature importance ranking results indicate that SAVI and SWIR_2 are the main contributors to the models, and these two features are highly sensitive to soil and vegetation. The topographic information in the DEM makes a high contribution, and the contribution of the image spectrum is ranked in the middle level. Some features have low importance scores (IPVI, NDVI, and DVI), but this does not indicate that these features are not sufficiently related to the FFC. This may be because the top-ranked features have a higher correlation with the FFC, or because there is autocorrelation between the features. Using the RF to rank the importance of the features allows for the selection of the features with higher relevance to the target value. We eliminated the features with importance scores of less than 0.01. We compared the accuracy of the model before and after the feature selection, there is no obvious change, which demonstrates that the eliminated features had no obvious effect on the model. In general, the selection of appropriate spectral indices, topographical information, and other features has a significant impact on the estimation of FFC. Finally, the estimation of the FFC can be interpreted as FFC = f(SAVI, SWIR_2, Blue, DEM, GRVI, ARVI, Green, NIR, Slope, EVI, SWIR_1, RDVI).

Machine Learning Algorithms
MLAs have a wide range of applications in the field of RS. To obtain a model suitable for this research, it was necessary to conduct a comprehensive comparative analysis of each MLA. Table 4 lists the selected MLAs, including the ensemble algorithms (bagging and boosting), MLP, DTR, LR, Ridge, ENe, LASSO, and KNN.
The bagging ensemble algorithm uses a decision tree as the base-learner and employed bagging sampling to extract n samples from the original sample set, with k extractions, to obtain k subsets of data, and finally, k models are trained. For the regression problem, the mean of the k models is taken as the final result. The representative algorithms are the RF [88], and the bagged decision trees [89]. The boosting ensemble algorithm combines multiple weak learners into a strong learner through iteration. The training of each learner depends on the results of the previous learner, and each time the samples that are incorrectly estimated are weighted and the samples that are correctly estimated are down-weighted. Finally, the multiple weak learners are weighted and combined. Its representative algorithms include GBR [34] and Extreme Gradient Boosting (XGBoost) [38]. The feature importance ranking results indicate that SAVI and SWIR_2 are the main contributors to the models, and these two features are highly sensitive to soil and vegetation. The topographic information in the DEM makes a high contribution, and the contribution of the image spectrum is ranked in the middle level. Some features have low importance scores (IPVI, NDVI, and DVI), but this does not indicate that these features are not sufficiently related to the FFC. This may be because the top-ranked features have a higher correlation with the FFC, or because there is autocorrelation between the features. Using the RF to rank the importance of the features allows for the selection of the features with higher relevance to the target value. We eliminated the features with importance scores of less than 0.01. We compared the accuracy of the model before and after the feature selection, there is no obvious change, which demonstrates that the eliminated features had no obvious effect on the model. In general, the selection of appropriate spectral indices, topographical information, and other features has a significant impact on the estimation of FFC. Finally, the estimation of the FFC can be interpreted as FFC = f(SAVI, SWIR_2, Blue, DEM, GRVI, ARVI, Green, NIR, Slope, EVI, SWIR_1, RDVI).

Machine Learning Algorithms
MLAs have a wide range of applications in the field of RS. To obtain a model suitable for this research, it was necessary to conduct a comprehensive comparative analysis of each MLA. Table 4 lists the selected MLAs, including the ensemble algorithms (bagging and boosting), MLP, DTR, LR, Ridge, ENe, LASSO, and KNN.
The bagging ensemble algorithm uses a decision tree as the base-learner and employed bagging sampling to extract n samples from the original sample set, with k extractions, to obtain k subsets of data, and finally, k models are trained. For the regression problem, the mean of the k models is taken as the final result. The representative algorithms are the RF [88], and the bagged decision trees [89]. The boosting ensemble algorithm combines multiple weak learners into a strong learner through iteration. The training of each learner depends on the results of the previous learner, and each time the samples that are incorrectly estimated are weighted and the samples that are correctly estimated are down-weighted. Finally, the multiple weak learners are weighted and combined. Its representative algorithms include GBR [34] and Extreme Gradient Boosting (XGBoost) [38].
Stacked generalization (SG) is a layered ensemble algorithm [90]. Take two layers as an example. The first layer is called the base model, and its input is the original training set, which is trained using a k-fold crossover. The second layer is called the meta-model, and the output of the first layer is used as the training set for the retraining to obtain a complete stacking model [91]. In contrast to bagging and boosting, the base model in the SG method should be accurate and different. They can be bagging-and boosting-type algorithms, or KNN, SVR, and so on, but they have to have both less similarity and better estimation ability. The meta-model uses the estimation results of the base model as the input data for the training. The meta-model can be a single model or an ensemble model [39,92]. The construction process of the SG algorithm is shown in Figure 5. The optimal algorithm among the various algorithms was selected as the base model through model comparison, and different meta-models were chosen to construct the SG model. Stacked generalization (SG) is a layered ensemble algorithm [90]. Take two layers as an example. The first layer is called the base model, and its input is the original training set, which is trained using a k-fold crossover. The second layer is called the meta-model, and the output of the first layer is used as the training set for the retraining to obtain a complete stacking model [91]. In contrast to bagging and boosting, the base model in the SG method should be accurate and different. They can be bagging-and boosting-type algorithms, or KNN, SVR, and so on, but they have to have both less similarity and better estimation ability. The meta-model uses the estimation results of the base model as the input data for the training. The meta-model can be a single model or an ensemble model [39,92]. The construction process of the SG algorithm is shown in Figure 5. The optimal algorithm among the various algorithms was selected as the base model through model comparison, and different meta-models were chosen to construct the SG model. Each algorithm has a different performance when using different parameters, and tuning the parameter is required to obtain the best estimation capability for each model. In the process of model tuning, root mean squared error (RMSE) was used as the evaluation metric of the model, and each set of parameters was evaluated using a grid search and repeated 10-fold CV. Finally, the set of parameters with the smallest RMSE was selected as the best combination of parameters for the model [93]. Table 4 lists the best parameter combinations for each machine learning algorithm. After tuning, we used 80% of the data in the entire dataset to train the model and used the other 20% of the data as the validation set to verify the estimation performance of the selected parameters. (In the experiment, we used scikit-learn to implement the required MLAs and the related processes based on Python 3.7.1 http://scikit-learn.org, (accessed on 31 May 2020)). S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 result1 result2 result10 Pred1 Base Model 1

All Samples
Base_Model N S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 Each algorithm has a different performance when using different parameters, and tuning the parameter is required to obtain the best estimation capability for each model. In the process of model tuning, root mean squared error (RMSE) was used as the evaluation metric of the model, and each set of parameters was evaluated using a grid search and repeated 10-fold CV. Finally, the set of parameters with the smallest RMSE was selected as the best combination of parameters for the model [93]. Table 4 lists the best parameter combinations for each machine learning algorithm. After tuning, we used 80% of the data in the entire dataset to train the model and used the other 20% of the data as the validation set to verify the estimation performance of the selected parameters. (In the experiment, we used scikit-learn to implement the required MLAs and the related processes based on Python 3.7.1 http://scikit-learn.org, (accessed on 31 May 2020)).

Evaluation Approaches
This study used three validation schemes to verify the model and experimental results: model-based cross-validation, validation based on independent samples, and comparative analysis with existing datasets.

1.
10-fold cross-validation: After repeating the 10-fold cross-validation 5 times to ensure that all samples were tested, the RMSE, R 2 , and MAE of each model were calculated, and their mean was calculated. The average value was used as an evaluation index of the algorithm's accuracy [107].

2.
Independent validation: (a) When data were obtained in each reference area, 20% of each set of reference data was regarded as independent sample data (in Figure 6). When the model was trained, this part of the data does not participate in the training of the model. We used the trained model to predict this part of the data, compared the prediction results with the observed values of the sample, and used RMSE, R 2 , and MAE as evaluation indicators to achieve the evaluation of the model's predictive ability [108].
A reference region with a geographical location that differs from that in Section 2.2 is selected as the independent validation area, and this part of the data is used as the input data of the trained model for the prediction. The predicted results are compared with the observed value to verify the generalization ability of the model. This part of the validation data has better independence compared to the training and verification data in (a) have, because it is selected as an independent region and does not participate in the training and parameter optimization of the model.

3.
Product inter-comparison: We used the model for data production, and compared and verified the produced data with the Global Forest Change 2000-2019 (GFC) datasets and GLS_TCC datasets. In this paper, the correlation coefficient (R 2 Equation (1)), mean absolute error (MAE, Equation (2)), and root mean square error (RMSE, Equation (3)) were selected as the evaluation indicators of the model's performance [42,87,109].
In the equations , , and are the original values, the prediction results, and the average value of the observations, respectively, and n is the number of observations.

Comparing the Performance of Different ML Models
The performance and predictive ability of each model in Table 4 were statistically analyzed. The figures and tables in this section show the performance of each model. As summarized in the previous section, the DTR model is one of the most commonly used models and therefore we use the DTR model as a standard model to evaluate the performance of other models.
First, we evaluated all models using a 10-fold CV method repeated five times. The results are presented in Table 5. The linear model was the worst type of all models, and ENe was the worst one of the linear models (R 2 = 0.759, MAE = 0.130, RMSE = 0.194); In the bagging ensemble, the RF model was the best performing model; in boosting ensemble, the LGBM model was the best performing model. The validation results of the RF were slightly better than those of the LGBM. The results of the MLP, DTR, and KNN were not very different. In this paper, the correlation coefficient (R 2 Equation (1)), mean absolute error (MAE, Equation (2)), and root mean square error (RMSE, Equation (3)) were selected as the evaluation indicators of the model's performance [42,87,109]. (1) In the equations y i ,ŷ i , and y i are the original values, the prediction results, and the average value of the observations, respectively, and n is the number of observations.

Comparing the Performance of Different ML Models
The performance and predictive ability of each model in Table 4 were statistically analyzed. The figures and tables in this section show the performance of each model. As summarized in the previous section, the DTR model is one of the most commonly used models and therefore we use the DTR model as a standard model to evaluate the performance of other models.
First, we evaluated all models using a 10-fold CV method repeated five times. The results are presented in Table 5. The linear model was the worst type of all models, and ENe was the worst one of the linear models (R 2 = 0.759, MAE = 0.130, RMSE = 0.194); In the bagging ensemble, the RF model was the best performing model; in boosting ensemble, the LGBM model was the best performing model. The validation results of the RF were slightly better than those of the LGBM. The results of the MLP, DTR, and KNN were not very different. We then used the validation dataset in Section 3.3 (2a) to evaluate the predictive ability of each model. The scatter plots of the predicted and observed values of each model are shown in Figure 7, where the red line represents the regression line of the predicted and observed values. Table 6 provides the statistics of the results in Figure 7. The scatter plot in Figure 7 shows that the scatter plots of ensemble models converged better, especially RF and LGBM. The worst performance was that of the linear model. Their scatter plots are more discrete and exhibit obvious overestimation of low-value areas.  We then used the validation dataset in Section 3.3 (2a) to evaluate the predictive ability of each model. The scatter plots of the predicted and observed values of each model are shown in Figure 7, where the red line represents the regression line of the predicted and observed values. Table 6 provides the statistics of the results in Figure 7. The scatter plot in Figure 7 shows that the scatter plots of ensemble models converged better, especially RF and LGBM. The worst performance was that of the linear model. Their scatter plots are more discrete and exhibit obvious overestimation of low-value areas.   Table 6 shows the estimation results of each model on independent datasets. In the bagging ensemble, the RF model was slightly better than BAG. In the boosting ensemble, the LGBM model performed the best, and its evaluation index was the same as that of the RF model (R 2 = 0.991, MAE = 0.019, RMSE = 0.035). Overall, the integrated models, except   Table 6 shows the estimation results of each model on independent datasets. In the bagging ensemble, the RF model was slightly better than BAG. In the boosting ensemble, the LGBM model performed the best, and its evaluation index was the same as that of the RF model (R 2 = 0.991, MAE = 0.019, RMSE = 0.035). Overall, the integrated models, except for the ADA model, showed better performance than the DTR. Comparing all algorithms, the R 2 of the linear regression algorithm was less than 0.8, the MAE was greater than 0.1, and the RMSE was greater than 0.17, the RMSE of the ENe model was more than 0.2, which was the worst performance among the linear models. It can be considered that linear models do not show the nonlinear relationship between FFC and features, so linear models are not suitable for use in this study. The MLP model was better than the linear models, and the evaluation index differed slightly from the DTR model. The KNN slightly outperformed the DTR model. Thus, the RF model and LGBM model in the ensemble algorithm performed better than other models in this study. The linear model was the worst-performing model, so it will not be mentioned in the discussion of the following experiments.
Considering that when building SG models, base models should have low correlation; that is, the types of base models should be different, and base models should have good performance. Based on the results of Table 6, the RF, DTR, LGBM, KNN, and MLP models were selected as base models. The choice of meta-model also affects the performance of the SG model. This study selected different meta-models to construct the SG algorithm model: SG(LGBM), SG(RF), SG(KNN), SG(DTR), and SG(MLP). (SG(RF) denotes the method using the RF algorithm as the meta-model). Figure 8 shows the validation results of SG models in an independent dataset. The scatter plot of the SG(DTR) model is more discrete. For the SG model, although the same base models were selected, owing to the differences in meta-models, the performance of each SG model was different. According to the statistics in Table 7, the SG(LGBM) model performed the best (R 2 = 0.991, MAE = 0.018, RMSE = 0.034). The RMSE of SG(MLP) was 0.001 higher than that of SG(LGBM). Thus, the two had the same performance. The worst performer was SG(DTR) (R 2 = 0.983, MAE = 0.025, RMSE = 0.048), which also indicates that the meta-model plays an important role in the performance of the entire model. According to the statistics in Tables 6 and 7, SG(LGBM) is the best performer among all models.
which was the worst performance among the linear models. It can be considered that linear models do not show the nonlinear relationship between FFC and features, so linear models are not suitable for use in this study. The MLP model was better than the linear models, and the evaluation index differed slightly from the DTR model. The KNN slightly outperformed the DTR model. Thus, the RF model and LGBM model in the ensemble algorithm performed better than other models in this study. The linear model was the worst-performing model, so it will not be mentioned in the discussion of the following experiments.
Considering that when building SG models, base models should have low correlation; that is, the types of base models should be different, and base models should have good performance. Based on the results of Table 6, the RF, DTR, LGBM, KNN, and MLP models were selected as base models. The choice of meta-model also affects the performance of the SG model. This study selected different meta-models to construct the SG algorithm model: SG(LGBM), SG(RF), SG(KNN), SG(DTR), and SG(MLP). (SG(RF) denotes the method using the RF algorithm as the meta-model). Figure 8 shows the validation results of SG models in an independent dataset. The scatter plot of the SG(DTR) model is more discrete. For the SG model, although the same base models were selected, owing to the differences in meta-models, the performance of each SG model was different. According to the statistics in Table 7, the SG(LGBM) model performed the best (R 2 = 0.991, MAE = 0.018, RMSE = 0.034). The RMSE of SG(MLP) was 0.001 higher than that of SG(LGBM). Thus, the two had the same performance. The worst performer was SG(DTR) (R 2 = 0.983, MAE = 0.025, RMSE = 0.048), which also indicates that the meta-model plays an important role in the performance of the entire model. According to the statistics in Tables 6 and 7, SG(LGBM) is the best performer among all models.

Model Validation Based on Independent Validation Area
We used the validation data described in Section 3.3 (2b) to further verify the base model and the SG model. The validation data were taken from an independent area that does not participate in any process of the above models, so the data had better independence. The distribution of forests in these independent validation areas is quite different and was used to verify the performance of different models under different forest coverage conditions.
The scatter plots in Figure 9 are the estimated results of each model in the independent validation area. Table 8 presents the statistics of the estimation accuracy of the model

Model Validation Based on Independent Validation Area
We used the validation data described in Section 3.3 (2b) to further verify the base model and the SG model. The validation data were taken from an independent area that does not participate in any process of the above models, so the data had better independence. The distribution of forests in these independent validation areas is quite different and was used to verify the performance of different models under different forest coverage conditions.
The scatter plots in Figure 9 are the estimated results of each model in the independent validation area. Table 8 presents the statistics of the estimation accuracy of the model in the three validation areas. In area 1, the models with the highest R 2 were LGBM and SG(LGBM) (R 2 = 0.911). The MLP model had the best MAE (0.038) and RMSE (0.055). In the stacking algorithm, SG(MLP) had the best MAE (0.047) and RMSE (0.060). It can be considered that SG(MLP) was the best stacking model in this area. Although the R 2 of MLP was lower than LGBM, it had better MAE and RMSE, so MLP was the best base model in this area, but also the best model compared to all models. The data distribution in area 2 was opposite to that in area 1, and the data were mainly distributed in high value. RF was the best base model, SG(MLP) was the best-stacked model. In area 3, the SG(LGBM) model had the best R 2 , MAE, and RMSE (0.984, 0.034, 0.046). In the base model, the RF model has the best R 2 , MAE, and RMSE (0.983, 0.036, 0.048). The findings from the above analysis indicate that the SG(LGBM) model is an ensemble model with better performance, and the RF model is a base model with better overall performance.  We divided the data of all validation areas into five layers according to the interval of 0-0.2, 0.2-0.4, 0.4-0.6, 0.6-0.8, and 0.8-1. Based on the findings above, firstly, the RF model is the best base model and the SG(LGBM) is the best-integrated model. Considering We divided the data of all validation areas into five layers according to the interval of 0-0.2, 0.2-0.4, 0.4-0.6, 0.6-0.8, and 0.8-1. Based on the findings above, firstly, the RF model is the best base model and the SG(LGBM) is the best-integrated model. Considering that the SG(MLP) model and the SG(LGBM) model are less different. The DTR model, as one of the most used models, can be used as a standard model, so the DTR, RF, SG(LGBM), and SG(MLP) models were chosen for comparative analysis. Table 9 shows the estimation results for the four models. The scatterplot in Figure 10 shows the performance of the RF model for estimating the stratified data. Figure 10a-e shows the estimation results for the stratified data for different intervals and Figure 10f shows the performance of the model for the whole data. the R 2 for the stratified data is very limited due to the scattering pattern along the short distances. Figure 11a,c,e are histograms of the evaluation indicators of the three models at different levels. When the coverage was less than 0.2, the highest R 2 was the SG(MLP) model (R 2 = 0.634), and the DTR model performed the worst; when the coverage is 0.2-0.4, SG(LGBM) performed best (R 2 = 0.411, MAE = 0.061, RMSE = 0.076); when the coverage was 0.6-1, SG(MLP) was the best model. The overall result elicited the same conclusion: The estimation result of the integrated algorithm in the independent validation area is significantly improved compared with the base model, especially for the non-ensemble model. that the SG(MLP) model and the SG(LGBM) model are less different. The DTR model, as one of the most used models, can be used as a standard model, so the DTR, RF, SG(LGBM), and SG(MLP) models were chosen for comparative analysis. Table 9 shows the estimation results for the four models. The scatterplot in Figure 10 shows the performance of the RF model for estimating the stratified data. Figure 10a-e shows the estimation results for the stratified data for different intervals and Figure 10f shows the performance of the model for the whole data. the R 2 for the stratified data is very limited due to the scattering pattern along the short distances. Figure 11a,c,e are histograms of the evaluation indicators of the three models at different levels. When the coverage was less than 0.2, the highest R 2 was the SG(MLP) model (R 2 = 0.634), and the DTR model performed the worst; when the coverage is 0.2-0.4, SG(LGBM) performed best (R 2 = 0.411, MAE = 0.061, RMSE = 0.076); when the coverage was 0.6-1, SG(MLP) was the best model. The overall result elicited the same conclusion: The estimation result of the integrated algorithm in the independent validation area is significantly improved compared with the base model, especially for the nonensemble model.   erage levels of 0.6-0.8, which would suggest that the models had the worst performance at this coverage condition. This is because when the coverage was 0.6-0.8, the surface heterogeneity was greater than other coverages. Whether in sample acquisition or model prediction, the similarity between different samples in the obviously heterogeneous mixed region was greater, which caused errors in the training or prediction of the model and affected the final estimation ability of the model. Based on the comprehensive performance of all models, SG(LGBM) and SG(MLP) model performance were slightly higher than the RF, and much higher than that of the DTR model, indicating that the stacking method is better than some base models such as the DTR model, but some base model (RF) has comparable results to stacking method. As the model constructed by the SG method is a superposition of multiple base models, it requires longer computing time and computing resources. Tables 6 and 7 list the time required for each model.   Figure 11, all models performed poorly when the coverage was between 0.2 and 0.8, in particular, the MAE and RMSE metrics were greatest for all models at coverage levels of 0.6-0.8, which would suggest that the models had the worst performance at this coverage condition. This is because when the coverage was 0.6-0.8, the surface heterogeneity was greater than other coverages. Whether in sample acquisition or model prediction, the similarity between different samples in the obviously heterogeneous mixed region was greater, which caused errors in the training or prediction of the model and affected the final estimation ability of the model.
Based on the comprehensive performance of all models, SG(LGBM) and SG(MLP) model performance were slightly higher than the RF, and much higher than that of the DTR model, indicating that the stacking method is better than some base models such as the DTR model, but some base model (RF) has comparable results to stacking method. As the model constructed by the SG method is a superposition of multiple base models, it requires longer computing time and computing resources. Tables 6 and 7 list the time required for each model.

Comparison with Existing Products
The performance of the SG algorithm was improved compared to the base model, but the SG algorithm requires a longer calculation time and consumes more calculation resources during its operation. Therefore, based on the consideration of the calculation efficiency and estimation accuracy, in this section, the RF model was selected as the estima-tion model to estimate the FFC, and the estimated FFC was compared and verified with other datasets. Figure 12 shows the distribution of forests in the TNR of different datasets, including the FromGLC2017 dataset (Figure 12a), GLS_TCC (2015) dataset (Figure 12b), GFC dataset (Figure 12c), and the FFC2017 dataset estimated by this model (Figure 12d). From an intuitive point of view, the four figures all show the main distribution regularities of the TNR forests. The forests are mainly distributed in the northeastern region, the central region has a slightly lower distribution, and the northwestern region has the sparsest distribution. However, there are differences in the spatial distribution of forests in these figures. The area covered by forests in Figure 11b,d is more extensive than in Figure 11a,c, especially the forests with a low cover density, and the forest cover in Figure 12c is more consistent with the spatial distribution in Figure 12a.

Comparison with Existing Products
The performance of the SG algorithm was improved compared to the base model, but the SG algorithm requires a longer calculation time and consumes more calculation resources during its operation. Therefore, based on the consideration of the calculation efficiency and estimation accuracy, in this section, the RF model was selected as the estimation model to estimate the FFC, and the estimated FFC was compared and verified with other datasets. Figure 12 shows the distribution of forests in the TNR of different datasets, including the FromGLC2017 dataset (Figure 12a), GLS_TCC (2015) dataset (Figure 12b), GFC dataset (Figure 12c), and the FFC2017 dataset estimated by this model (Figure 12d). From an intuitive point of view, the four figures all show the main distribution regularities of the TNR forests. The forests are mainly distributed in the northeastern region, the central region has a slightly lower distribution, and the northwestern region has the sparsest distribution. However, there are differences in the spatial distribution of forests in these figures. The area covered by forests in Figure 11b,d is more extensive than in Figure 11a,c, especially the forests with a low cover density, and the forest cover in Figure 12c is more consistent with the spatial distribution in Figure 12a. We selected six areas as the validation dataset to compare and analyze the FFC, GFC, and GLS_TCC. Figure 13 shows the comparison results. We selected six areas as the validation dataset to compare and analyze the FFC, GFC, and GLS_TCC. Figure 13 shows the comparison results.    Figure 13 shows the results of the comparison among the datasets, Figure 13a shows the result of aggregating the GF-2 forest distribution maps with a 1-m resolution into 30-m resolution, which we call GF2AG30. The aggregated data are completely independent throughout the experiment and can be used as true data. In the visual comparison, compared with GLS_TCC and GFC, FFC has a better spatial distribution consistency with GF2AG30. In areas (1) and (2), FFC contains more forest distribution details than GFC and GLS_TCC, especially at the junctions of the forest and non-forest areas where the changing trend of the forest density can be seen. Compared with GF2AG30 (Figure 13a) for each area, in the non-forest areas, GLS_TCC is overestimated. For example, the white in Figure 13(1a) indicates the value of 0, which means that there is no forest cover, but the corresponding Figure 13(1c) indicates that there is forest cover. In the forest-covered areas, GFC is underestimated. The distributions of the forests in areas (5) and (6) are discrete, but the results of GFC show that there is basically no forest cover in these two areas. Figure 13e shows a stacked histogram of the frequency distributions of the four datasets for each area, which illustrates the differences in the frequencies and distribution ranges of the datasets. In general, the ranges of the four datasets have obvious differences. First, the value range of GF2AG30, FFC, and GFC is 0-100%, but GLS_TCC basically does not have a pixel value of greater than 80%. In forest-covered areas 1 and 2, the data distributions of GF2AG30, FFC, and GFC are relatively consistent; while in GLS_TCC, the forest coverage in these two areas is concentrated between 50% and 60%. In areas 3 and 4, GLS_TCC is concentrated at about 50%; while GFC is concentrated at about 40% in area 3 and at about 60% in area 4. Compared with GF2AG30, GLS_TCC and GFC are underestimated. The forest distribution of FFC is more consistent with that of GF2AG30 than in GLS_TCC and GFC. The land cover in areas 5 and 6 is dominated by non-forest vegetation, and the forests are scattered. GF2AG30 and FFC show the characteristics of the forest distribution better, while GLS_TCC exhibits overestimation in the low-value areas and underestimation in the high-value areas. In Figure 13f-h are the scatter density plots of the correlations between FFC and GF2AG30, GLS_TCC and GF2AG30, and GFC and GF2AG30, respectively. Overall, FFC has the highest correlation with GF2AG30, followed by GLS_TCC and GF2AG30, and GF2AG30 has the worst correlation with GFC.

Algorithmic Factors
In this study, we used a combination of Landsat and GF-2 RS images to evaluate the applicability, stability, and generalization ability of several MLAs at estimating the FFC. In this study, a variety of SG models were constructed, and the results show that the SG models with different meta-models have different performances, among which the SG(LGBM) and SG(MLP) models achieved the best-estimated results and the estimation accuracy was improved compared to the base model (DTR). Since the SG model is based on the fusion of multiple MLAs, its computational efficiency is related to the selection of the base model and meta-model. The more base-model or more complex meta-model is selected, the more computational resources are required and the longer the computation time. Some studies have demonstrated the advantages of integrated models over base models [33,110], but Wen and Hughes [39] concluded that although integrated models have high predictive power, some of their metrics are inferior to the RF models. The SG model developed in this paper is significantly better than a base model such as the DTR, but it has a limited accuracy improvement compared to the integrated type of RF. To obtain the obvious advantages of the SG method, further research and discussion of the applicability of the SG model and the selection of base models are needed.

Time Factors
Relevant studies have demonstrated that using continuous fractions instead of classification classes can enhance the assessment of the forest status and its changes [20,21,28]. In addition, the quality of the samples is related to the performance of the model and the accuracy of the parameter estimation, while the temporal consistency between the reference image and the image to be estimated affects the quality of the sample. In this study, Landsat and GF-2 remote sensing images with small differences in imaging time were selected to ensure the consistency of the extracted information. However, the differences in the revisit periods of the two satellite sensors and the influence of the weather factors, and the amount of time-consistent image data were small. The study area contains a variety of forest types, and different types of forests behave differently on remote sensing images with seasonal changes, especially deciduous broadleaf forests that change significantly with the season [20]. The RS data selected in this study were mostly obtained in the forest growing season, which may limit the time range of the model's usage due to the lack of samples from other seasons [18,20]. Therefore, in future research, it will be necessary to consider the combined use of multi-source data to increase the representativeness of the sample and to explore the effects of different forest types and tree species on the estimation of forest coverage under seasonal changes.

Sample Representativeness Factor
One of the key challenges in estimating forest attributes such as FFC using medium spatial resolution satellite imagery is the difficulty in overcoming background soil reflectance, especially in low-density sparse forests [57]. The entire study area contains a large amount of non-forest vegetation and low-density woodlands, which are prone to higher errors if the information is not sufficiently representative [18,20]. We increased the proportion of low-value samples in the entire dataset to improve the representativeness of the samples. However, the inclusion of too much low-value data tends to cause an unbalanced distribution of samples and leads to more errors in the model estimation. In this study, a 5 × 5 sampling window was used to reduce the effect of the spatial misregistration between the Landsat and GF-2 data. It is easy to generate samples with errors in sparse woodlands or forest boundaries, and such errors have a negative impact on the training and application of the model. Therefore, to ensure the representativeness of the data, our future work will pay more attention to the extraction and analysis of low-density forest information and improve the accuracy of sample extraction.

Differences between Data Products
The comparison between the FFC, GFC, and GLS_TCC datasets shows that there are large differences between them, and the differences are due to several reasons: Source data differences: The GLS_TCC dataset is an annual composite. For example, the 2005 GLS_TCC is a composite of Landsat-5 and Landsat-7 images from 2003 to 2008 [21,111]. Similarly, the GFC is an annual synthesized data product. In contrast, the validation data in this paper are time-determined images, which can be considered to be instantaneous data from the image at the moment of imaging. Inconsistency in the timing of the data may be a reason for the discrepancy between the different products.
Differences in sample data: Both the GLS_TCC and GFC refer to the MOD44B_VCF data product when selecting the samples and constructing the models, and inevitably, they inherit the uncertainty that exists in the product itself [21]. The use of the 250 m resolution MOD44BVCF limits the ability to acquire canopy cover values below 30 m [21,112], while the reference data of the FFC is a higher resolution remote sensing image with a stronger ability to acquire fine canopy cover data.
Algorithm differences: The GLS_TCC uses the standard regression tree (RT) algorithm, and the GFC uses the decision regression tree (RT) method. In this study, several algorithms were compared and it was concluded that the DT algorithm is computationally efficient, but its estimation accuracy and ability to prevent overfitting are not as good as those of the ensemble method. In this study, ensemble learning methods such as the RF were chosen. Although the computational efficiency is not as good as that of the DT, which has a better model generalization ability and estimation performance [83,113].

Conclusions
This study aimed to estimate the 30-m resolution FFC in the TNR of China. Landsat RS images were used to obtain band information and the vegetation index, GF-2 RS images were used as reference image data to obtain forest coverage information in the study area combined with auxiliary topographic factors, and machine learning methods were used to construct a method for estimating forest coverage in China's TNR model. The results of the cross-validation and independent validation indicate that simple linear models cannot accurately represent the relationships between features and the FFC. The performance of ensemble models was better than that of other types of models, and RF (R 2 = 0.991, MAE = 0.019, RMSE = 0.035) and LGBM (R 2 = 0.991, MAE = 0.019, RMSE = 0.035) were the best of bagging ensemble models and boosting ensemble models, respectively. In this study, five models (RF, LGBM, MLP, DTR, KNN) with lower correlations but better performances were selected as the base models, and multiple SG algorithms were constructed using different meta-models. The results show that the selection of the meta-model is one factor that affects the performance of the SG algorithms. Moreover, the best SG model SG(LGBM) (R 2 = 0.991, MAE = 0.018, RMSE = 0.034) had a better performance than the DTR (R 2 = 0.980, MAE = 0.026, RMSE = 0.053). The results of stratification statistics indicate that when the coverage is below 0.2 and the coverages are above 0.8. The model has better performance, but the estimation ability of the model decreases between the coverages of 0.2 and 0.8. There are two possible reasons for this: (1) the number of samples in the range of 0.2 to 0.8 is small during the model training and model validation, so the model's estimation ability in this range is weak; (2) Because the forest is not completely closed within the 0.2-0.8 cover range, the surface coverage is complex and the heterogeneity is high, which also indicates that further research is needed for areas with high surface heterogeneity. Through the comparative analysis of the dataset in the independent validation area, the GLS_TCC data were underestimated in the validation area, and the GFC data were underestimated in the high-value area, and through comparison with the forest aggregation map of the validation area, the forest distribution of the FFC had better consistency with the reference data and showed more detailed forest distribution trends. In addition, we also need to implement a more accurate validation of the estimated FFC and compare it with previous studies and data, as well as a more detailed study of the FFC for regions with a high surface heterogeneity.
Author Contributions: S.L. and X.L. designed the study; X.L. and B.L. collected the data, processed the satellite images, and performed the experiments; X.L. conducted the analysis and drafted the manuscript; and S.L., H.M. and T.H. revised the manuscript. All authors have read and agreed to the published version of the manuscript.