Multispectral Remote Sensing Data Are E ﬀ ective and Robust in Mapping Regional Forest Soil Organic Carbon Stocks in a Northeast Forest Region in China

: Accurately mapping the spatial distribution information of soil organic carbon (SOC) stocks is a key premise for soil resource management and environment protection. Rapid development of satellite remote sensing provides a great opportunity for monitoring SOC stocks at a large scale. In this study, based on 12 environmental variables of multispectral remote sensing, topography and climate and 236 soil sampling data, three di ﬀ erent boosted regression tree (BRT) models were compared to obtain the most accurate map of SOC stocks covering the forest area of Lvshun District in the Northeast China. Four validation indexes, including mean absolute error (MAE), root mean square error (RMSE), coe ﬃ cient of determination (R 2 ), and Lin’s concordance correlation coe ﬃ cient (LCCC) were calculated to evaluate the performance of the three models. The results showed that the full variable model performed the best, except the model using multispectral remote sensing variables. In the full variable model, the regional SOC stocks are primarily determined by multispectral remote sensing variables, followed by topographic and climatic variables, with the relative importance of variables in the model being 63%, 28%, and 9%, respectively. The average prediction results of full variables model and only multispectral remote sensing variables model were 8.99 and 9.32 kg m − 2 , respectively. Our results indicated that there is a strong dependence of SOC stocks on multispectral remote sensing data when forest ecosystems have dense natural vegetation. Our study suggests that the multispectral remote sensing variables should be used to map SOC stocks of forest ecosystems in our study region.


Introduction
Forest ecosystems are the largest carbon pool in terrestrial ecosystems, which account for nearly half of the global total terrestrial carbon stocks [1]. They also maintain 73% of the global soil carbon pool and 86% of the global vegetation carbon pool, which is of great significance to the global

Sampling Strategy and Experimental Method
Due to the abundant forest area (53.1% of the area) and the limited funding and rough roads, extensive field sampling is impractical. In order to truly and accurately reflect the spatial characteristics of SOC stocks, the purpose sampling method [26] is used to design sampling points in this study. According to the corresponding relationship between soil properties and soil forming environment, the method assumes that the soil formed in a similar environment is similar [27]. The fuzzy c-means clustering method is used to cluster the main environmental factors such as slope gradient, slope aspect, elevation, mean annual temperature, mean annual precipitation and normalized vegetation index. According to the fuzzy membership value of each point in the combination of environmental factors, the representative grid of the corresponding category of environmental combination of each point and the accessibility degree of sample points are determined [28]. Finally, the high representative points are selected as the design sample points in combination with the vegetation type map [27]. Twenty-four clustering results were generated, and 8-10 samples were collected in each category, we obtained a total of 236 points in 2016. In order to determine the bulk density of soil, the undisturbed samples of 100 cm 3 cores were collected in the center of soil layer. The samples were numbered and bagged respectively, and brought back to the laboratory for air drying before sample treatment; the bagged soil sample was dried by natural air, weighed, ground and passed through a nylon screen with a diameter of 2 mm. Soil samples were pretreated and measured in the Key Laboratory of agricultural resources and environment of

Sampling Strategy and Experimental Method
Due to the abundant forest area (53.1% of the area) and the limited funding and rough roads, extensive field sampling is impractical. In order to truly and accurately reflect the spatial characteristics of SOC stocks, the purpose sampling method [26] is used to design sampling points in this study. According to the corresponding relationship between soil properties and soil forming environment, the method assumes that the soil formed in a similar environment is similar [27]. The fuzzy c-means clustering method is used to cluster the main environmental factors such as slope gradient, slope aspect, elevation, mean annual temperature, mean annual precipitation and normalized vegetation index. According to the fuzzy membership value of each point in the combination of environmental factors, the representative grid of the corresponding category of environmental combination of each point and the accessibility degree of sample points are determined [28]. Finally, the high representative points are selected as the design sample points in combination with the vegetation type map [27]. Twenty-four clustering results were generated, and 8-10 samples were collected in each category, we obtained a total of 236 points in 2016. In order to determine the bulk density of soil, the undisturbed samples of 100 cm 3 cores were collected in the center of soil layer. The samples were numbered and bagged respectively, and brought back to the laboratory for air drying before sample treatment; the bagged soil sample was dried by natural air, weighed, ground and passed through a nylon screen with a diameter of 2 mm. Soil samples were pretreated and measured in the Key Laboratory of agricultural resources and environment of Liaoning Province to determine SOC (Potassium dichromate sulfuric acid digestion method). Cores were dried at 105°C for 48 h for soil bulk density measurement.

Environmental Data
In this study, 12 environmental variables representing multispectral remote sensing, topography and climate were selected. Since the data were obtained from different departments, we resampled them to raster data with a 30 m spatial resolution in ArcGIS 10.2. High precision environment variable is one of the essential conditions for accurate prediction of SOC stocks. The environment variable with high spatial resolution can provide more detailed attribute information, but the 30 m of spatial resolution is enough to reflect the distribution characteristics of SOC stocks in the area.

Multispectral Remote Sensing Variables
L1T level Landsat-8 imageries were acquired from one image (Path-row: 120-33) downloaded from the United States Geological Survey (USGS) covering the spatial domain of study area from July to September 2016. In order to ensure image quality and avoid the interference of cloud on model prediction, the cloudiness value error of the downloaded image data is controlled within less 10%. The Environment for Visualizing Images (ENVI) Fast Line-of-sight Atmospheric Analysis of Spectral Hypercubes (FLAASH) atmospheric correction method was used to correct the multispectral remote sensing data. We used a bilinear interpolation method to resample these data to raster format with a 30m spatial resolution in ArcGIS 10.2. The bilinear interpolation method takes the distance from the sampling point to the surrounding 4 neighborhood pixels to calculate the grid value. The grid value of the pixel is obtained by interpolation in Y direction (or X direction) and then in X direction (or Y direction). The boundary cleaning and main filtering tools were used to generalize the edges of the regions in the grid. According to the value of the neighborhood in each position, the edge was smoothed by expanding and contracting the boundary, or increasing or reducing the area [29]. The selected multispectral remote sensing variables include green band (B GREEN ) (0.52-0.60 µm), red band (B RED ) (0.63-0.69µm), near-infrared band (B NIR ) (0.77-0.90 µm), normalized difference vegetation Index (NDVI), and soil adjusted vegetation index (SAVI). Three selected bands were attributed to the fact that they could represent the growth and biomass of vegetation, and the images of ground objects were rich, distinct, and well-organized, and were used for vegetation classification and water body recognition [16]. NDVI is widely used in the detection of vegetation growth state, and the closer its value is to 1, the better the vegetation growth. Huete [30] proposed soil adjusted vegetation index (SAVI) based on NDVI and a large number of observation data to reduce the impact of soil background. Huete's research [30] was conducted for the grassland and cotton fields, when L is 0.5, SAVI had a better effect on eliminating soil reflectance. NDVI and SAVI were calculated as follows: where B NIR and B RED were expressed as near-infrared band and red band; L is a parameter with the change of vegetation density, the value range is 0-1, the value for this study is set to 0.5. terrain variables are selected in this study. Based on SRTM DEM data, four terrain variables are derived by using Arc GIS 10.2 software, which are elevation, slope gradient, slope aspect, and plan curvature. Topographic wetness index (TWI) is extracted by system for automated geoscientific analysis (SAGA) GIS [31] software.

Climatic Variables
Climate data were obtained from China Meteorological Science Data Sharing Service System (http:// cdc.cma.gov.cn/), including MAP and MAT. Through collecting the MAT and MAP of 673 meteorological stations in China in 30 years (from 1980 to 2020 calculated by monthly average value), the raster data with 1 km resolution is generated by Kriging interpolation. Finally, the raster data were resampled to generate a resolution of 30 m by using the nearest neighbor principle, and the altitude data were used to correct it in ArcGIS 10.2.

Boosted Regression Tree
In this study, the BRT model proposed by Friedman et al. [16] was selected to predict the topsoil SOC stocks of a forest ecosystem in Northeast China. The BRT method had two technical components: boosting technology and regression tree [15]. Regression tree (RT) uses a set of predictive variables to analyze response variables, and applies binary segmentation to fit the simple model to each result part [17]. Boosting technology is to iterate multiple regression trees to make common decisions [7]. When the square error loss function is used, each regression tree drew conclusions and provided residuals of all previous trees, and a current residuals regression tree is obtained by fitting. Regression tree is the accumulation of regression trees generated in the whole iterative process [16]. The BRT model is more accurate and fast through gradient lifting technology and multiple data optimization.
In this study, the BRT model was constructed by "gem" package [32] in R language environment. In BRT model, four parameters required users to set: learning rate (LR), tree complexity (TC), bag fraction (BF), and tree number (TN) [17]. LR represents the contribution of each tree to the growth of the model [16]. TC represents the scale of the control tree and the interaction between variables [8]. BF sets the observation proportion for selecting variables [16]. NT represents the scale of the tree based on the combination of LR and TC [17]. We tested different combinations of LR, TC, BF, and NT. The optimal parameter combination was the one that provided the minimum predictive deviation. Finally, we determined the optimal parameter settings LR, TC, BF, and NT as 0.025, 12, 0.75, and 1200, respectively, so that the model demonstrated the best performance.

Model Evaluation
In order to evaluate the prediction performance of the three constructed BRT models on the spatial distribution of SOC stocks, we choose a common test method, namely the 10-fold cross validation technology. The dataset was divided into ten parts, nine of which were used as training data and one as test data in turn. In order to evaluate the performance of the BRT model, four indexes are calculated, including error (MAE), root mean square error (RMSE), coefficient of determination (R 2 ) and Lin's concordance correlation coefficient (LCCC) [33], in R software 3.2.2. Their specific formulae are as follows: where x i and y, represent the predicted value and measured value at point i; x and y are the mean of the predicted and measured values; ∂ x and ∂ y represent the variance of the predicted and measured values; n is the number of sampling sits; r is the correlation coefficient between the predicted value and the measured value.

Descriptive Statistics
The statistical results between SOC stocks and environmental variables at each sampling site are shown in Table 1. The average value of SOC stocks is 9.07 kg m −2 , and the coefficient of skewness is 0.37 kg m −2 , indicating that SOC presents a slightly skewed distribution. In addition, since SOC stocks are not evenly distributed on both sides of the average values, we Ln-transformed the SOC data s for later modeling and simulation. The SD of the sampling point is 3.95 kg m −2 , and it is lower than the average value. In order to explore the relationship between Log-SOC stocks and environmental variables, we calculated the Pearson correlation coefficient among them, as shown in Table 2. NDVI (r = 0.71), SAVI (r = 0.63), elevation (r = 0.70), slope gradient (r = 0.68), and MAP (r = 0.28) were positively correlated with SOC stocks. Correspondingly, SOC stocks was negatively correlated with B GREEN (r = −0.57), B RED (r = −0.33), B NIR (r = −0.61), TWI (r = −0.59), and MAT (r = −0. 35). Surprisingly, the correlation between multispectral remote sensing variables and SOC stocks was significant in this study.

Model Performance
We compared the spatial prediction performance of three BRT models with different variable combinations to predict the topsoil SOC stocks in Lushun, where is mainly controlled by forest ecosystems. The accuracy verification results are shown in Table 3. Among all models, the full variable model presents the best prediction performance, followed by the only multispectral remote sensing data model, and the finally topographic and climatic variables model, which could explain the spatial variation of 65%, 61%, and 53% SOC stocks in the region, respectively. The results showed that, using the multispectral remote sensing variable, the prediction model (MAE = 0.11, RMSE = 0.13, R 2 = 0.61, and LUCC = 0.81) could catch up with the full-variable prediction model (MAE = 0.06, RMSE = 0.07, R 2 = 0.65, and LUCC = 0.87). Model A and Model C were significantly better than model B (no multispectral remote sensing data). The density distribution map of the observed and predicted values of the sampling sites further confirms this finding (Figure 2). Model A and Model C showed a similar distribution trend, while Model C showed a steeper distribution pattern.
In order to further evaluate the performance of the model, we calculated the absolute residuals of Model A and Model B at all sampling sites. We found that only a small part of the sampling sites showed a decreasing trend and showed a small variation (Figure 3). In addition, we subtracted the final prediction results of Model A and Model C using raster calculation module in ArcGIS 10.1, and formed the prediction difference diagram of the two methods. The average and SD values of the difference results were 0.32 kg m −2 and 4.67 kg m −2 respectively, which further showed that the difference between Model A and Model C in the prediction of SOC stocks was not obvious in the region.    To further evaluate the prediction potential of multispectral remote sensing variables, we calculated the difference between the spatial mapping results of full variable model and only multispectral remote sensing variables mode ( Figure 4). As shown in Figure 4, only some areas showed a downward trend, and the differences between them were slight. In addition, we also calculated the standard deviation (SD) of a hundred iterations of the three models to further evaluate the performance and uncertainty. From Figure 5, it could be seen that all models produce lower SD. The mean SD of Model A, Model B and Model C were 0.17, 0.36, and 0.23 kg m −2 , respectively. To further evaluate the prediction potential of multispectral remote sensing variables, we calculated the difference between the spatial mapping results of full variable model and only multispectral remote sensing variables mode ( Figure 4). As shown in Figure 4, only some areas showed a downward trend, and the differences between them were slight. In addition, we also calculated the standard deviation (SD) of a hundred iterations of the three models to further evaluate the performance and uncertainty. From Figure 5, it could be seen that all models produce lower SD. The mean SD of Model A, Model B and Model C were 0.17, 0.36, and 0.23 kg m −2 , respectively.

Relative Importance of Environment Variables
Model A (full environment variable) was iterated 100 times and the mean relative importance (RI) of each environment variable was calculated. For the convenience of analysis, the relative importance of each environment variable was reduced to 100%. From Figure 6a, we found that NDVI, SAVI, B3, elevation, TWI and MAP were the main environmental variables (accounting for 75% of the total relative importance) that affect the spatial variation of SOC stocks in this region. In addition, multispectral remote sensing variables played an important role in Model A (63%), followed by topographic variables (28%) and climatic variables (9%). In addition, in Model C with only multispectral variables, B GREEN (31%) showed the highest relative importance, followed by NDVI (27%), SAVI (22%), B NIR (11%), and B RED (9%) (Figure 6b).

Spatial Distribution of SOC Stocks
Different from a single linear model, the results of each iteration with the BRT model were different. Therefore, we selected the three BRT models by iterating 100 times and calculating their average values as the final prediction results. Figure 7

Relative Importance of Environment Variables
Model A (full environment variable) was iterated 100 times and the mean relative importance (RI) of each environment variable was calculated. For the convenience of analysis, the relative importance of each environment variable was reduced to 100%. From Figure 6a, we found that NDVI, SAVI, B3, elevation, TWI and MAP were the main environmental variables (accounting for 75% of the total relative importance) that affect the spatial variation of SOC stocks in this region. In addition, multispectral remote sensing variables played an important role in Model A (63%), followed by topographic variables (28%) and climatic variables (9%). In addition, in Model C with only multispectral variables, BGREEN (31%) showed the highest relative importance, followed by NDVI (27%), SAVI (22%), BNIR (11%), and BRED (9%) (Figure 6b).

Spatial Distribution of SOC Stocks
Different from a single linear model, the results of each iteration with the BRT model were different. Therefore, we selected the three BRT models by iterating 100 times and calculating their average values as the final prediction results. Figure 7 showed the final prediction results of the three models. The average prediction results of Model A, Model B, and Model C were 8.99, 10.22, and 9.32 kg m −2 , respectively.

Effect of Multispectral Remote Sensing Variables on SOC Stocks
Multispectral remote sensing variables were the powerful and effectual variables in the full variables model, which showed that multispectral remote sensing data can predict forest topsoil SOC stocks, which is consistent with many previous research results [23,24,[34][35][36][37][38]. In the Bale Mountains, Ethiopia, Yimer et al. [24] concluded that vegetation was the main source of SOC stocks and affected its distribution, and there was a significant positive correlation between vegetation and SOC stocks, which has been certified in this study (Table 2). BGREEN and NDVI were efficient variables to characterize vegetation density and biomass, so they were important predictors in predicting of SOC stocks [17]. SAVI is very sensitive to the change of soil background [30]. Many studies showed that there were some potential applications of multispectral remote sensing data in SOC stocks mapping with better natural vegetation coverage areas [7,17,[34][35][36][37][38][39]. BGREEN is a better predictor than NDVI and SAVI in the full variable model, although many scholars considered NDVI was a more effective and powerful predictor [35,Error! Reference source not found.]. Therefore, BGREEN should be recommended as an influential factor in the future studies, especially in the areas with dense vegetation coverage. In addition, the Pearson correlation coefficients of BGREEN, NDVI, and SAVI with SOC stocks are −0.57, 0.71, and 0.63, respectively, which indirectly proved that they had potential  Terrain related variables are good predictors of SOC stocks in complex terrain change areas [8,17]. As one of the five soil-forming factors [40], topographic variables as a common variable in digital soil mapping (DSM) were selected as the main prediction factor in this study. In the process of SOC stocks prediction, the relative importance of terrain related variable was lower than that of multi spectral remote sensing related variable, which was consistent with previous research results [7,8,17,34]. Yang et al. [17] considered that, in the area with abundant vegetation, the influence of

Effect of Multispectral Remote Sensing Variables on SOC Stocks
Multispectral remote sensing variables were the powerful and effectual variables in the full variables model, which showed that multispectral remote sensing data can predict forest topsoil SOC stocks, which is consistent with many previous research results [23,24,[34][35][36][37][38]. In the Bale Mountains, Ethiopia, Yimer et al. [24] concluded that vegetation was the main source of SOC stocks and affected its distribution, and there was a significant positive correlation between vegetation and SOC stocks, which has been certified in this study (Table 2). B GREEN and NDVI were efficient variables to characterize vegetation density and biomass, so they were important predictors in predicting of SOC stocks [17]. SAVI is very sensitive to the change of soil background [30]. Many studies showed that there were some potential applications of multispectral remote sensing data in SOC stocks mapping with better natural vegetation coverage areas [7,17,[34][35][36][37][38][39]. B GREEN is a better predictor than NDVI and SAVI in the full variable model, although many scholars considered NDVI was a more effective and powerful predictor [35,37]. Therefore, B GREEN should be recommended as an influential factor in the future studies, especially in the areas with dense vegetation coverage. In addition, the Pearson correlation coefficients of B GREEN , NDVI, and SAVI with SOC stocks are −0.57, 0.71, and 0.63, respectively, which indirectly proved that they had potential value in mapping of SOC stocks. In Liaoning Province, China, Qi et al. [7] pointed out that SAVI was a highly effective and powerful predictor in the prediction of topsoil SOC in forest ecosystem. Yang et al. [17] found B GREEN and NDVI were the most important remote sensing variables in the prediction of topsoil SOC in the Tibetan Plateau. Gomez et al. [38] conducted multispectral remote sensing data were effective to deduce the spatial characteristics information of SOC stocks. Kim and Grunwald [23] also proved that multispectral remote sensing related variables could represent the spatial variability and quantity of soil carbon. In Tanzania, Winowiecki et al. [39] found that the spatial distribution pattern of SOC stocks was closely related to vegetation types and coverage. However, we would like to point out that, due to low-density sampling, the sampling in the study may not be representative. The results might overemphasize the importance of some variables in the model and increased the uncertainty of model prediction.
Terrain related variables are good predictors of SOC stocks in complex terrain change areas [8,17]. As one of the five soil-forming factors [40], topographic variables as a common variable in digital soil mapping (DSM) were selected as the main prediction factor in this study. In the process of SOC stocks prediction, the relative importance of terrain related variable was lower than that of multi spectral remote sensing related variable, which was consistent with previous research results [7,8,17,34]. Yang et al. [17] considered that, in the area with abundant vegetation, the influence of terrain elements on SOC stocks was integrated by remote sensing variables. Elevation had the most important influence on the spatial distribution of SOC stocks among all topographic variables. It was observed that the terrain was highly variable in the region, and the elevation could adjust the role of local microclimate [38]. Our research confirmed that the climate characteristics affected by elevation were significantly correlated with MAP (0.35) and MAT (−0. 28) As the main climatic factors, MAP and MAT were widely used in the spatial simulation of SOC stocks [8,36]. MAP was the fourth most important predictor in the full variable BRT model ( Figure 3) and played a more important role than MAT in predicting SOC stocks. In the mountain ecosystem, the hydrothermal condition affected the decomposition and accumulation of organic matter, indirectly impacting the spatial distribution characteristics of SOC stocks in those regions [17]. At the micro-level, the increase of forest productivity means a large amount of input of soil organic matter [37]. However, low temperature will slow down the decomposition rate of SOC. Although SOC stocks had a good correlation with MAP and MAT (Table 2), they were generated by data interpolation of 673 meteorological stations in China. Its spatial resolution was 1000 m, which is too coarse to characterize the spatial distribution of SOC stocks. In contrast, high spatial resolution remote sensing data could improve the spatial distribution characteristics of SOC stocks in the prediction area, because the data contain more detailed, higher resolution vegetation coverage information.
Our research concluded that the multispectral remote sensing data are effective to map SOC stocks in high-density vegetation coverage area. Many studies have a similar conclusion regarding the major controls of SOC distribution [7,[34][35][36][37]. For instance, in the Qilian Mountains of China, Yang et al. [17] used three bands of Landsat-5 TM (B GREEN , B RED , and B NIR ) and random forest to predict SOC and obtain higher prediction accuracy in the natural field (R 2 = 0.72). Similarly, we used three multispectral remote sensing bands (B GREEN , B RED , and B NIR ), NDVI and SAVI constructed a BRT model to predict the spatial distribution of SOC stocks in this densely covered natural vegetation with a high prediction accuracy.

Uncertainty in Current Research
The uncertainty of the three BRT model predictions was shown in Figure 5. These results revealed that the BRT model had excellent performance in predicting the spatial distribution of SOC stocks. However, there were still some uncertainties in this study. First, this study was divided into different groups to collect and analyze samples, so there might be sampling errors and experimental errors. Second, all environment variables were resampled to 30 m spatial resolution in ArcGIS 10.2 using a bilinear interpolation method, which will cause some method errors. Third, the accuracy and scale of the environment data obtained from different platforms were different, which resulted in subsequent modeling errors. Finally, this study was limited to the estimation of SOC in the topsoil (0-30 cm) forest, which will lead to the underestimation of SOC stocks in the region. In the forest ecosystem, the SOC stocks were generally stored in the deeper soil.
In addition, despite the success of using multispectral remote sensing data in this study, there were still some aspects worthy of special attention. These include: (1) affected by the terrain and clouds, the high-altitude areas were prone to produce shadows in the process of segmentation, so these reflectivity errors were large in satellite imageries data; (2) terrain variables have proved to be a powerful predictor of the spatial distribution of SOC stocks, especially in the forest ecosystem with obvious terrain change and dense vegetation coverage. Therefore, we recommend that the terrain related variables shall be added to the remote sensing prediction model for future forest organic carbon stock mapping.
Although the environmental conditions, sampling strategies, and sampling methods in this study were different from previous studies, the performance of the BRT model prediction was comparable to those studies. In Denmark, Adhikari et al. [41] used a regression kriging model and 18 environmental variables as predictors to model the vertical distribution of SOC content, and which only explained 23%-43% of SOC content variability. Were et al. [42] used the RF model to predict the SOC stocks and concluded that the RF model could explain 52% of the spatial changes of SOC in an Afromontane landscape. In Liaoning Province of China, Wang et al. [43] used a BRT model and nine environmental variables to predict the spatial distribution of SOC stocks of topsoil forests and could explain 58% of the spatial change of SOC stocks.

Conclusions
By comparing three BRT models, the best prediction model of topsoil SOC stocks in Liaoning coastal forest ecosystem was obtained. We found that using multispectral remote sensing variables alone, the model was not inferior to the full variable model, suggesting that using the remote sensing data is efficient and robust for predicting topsoil SOC stocks. The addition of climatic and topographic variables could improve the performance of the prediction model to some extent. In the selection of environmental variables, the use of multispectral remote sensing variables had a high practical significance in the spatial simulation of topsoil SOC stocks. Our study suggests that, in the future mapping of topsoil SOC stocks, especially in the forest ecosystem area with dense vegetation, appropriate multispectral remote sensing variables should be selected. Our accurate mapping of SOC stocks shall have great significance for studying the carbon cycle, soil fertility maintenance, and the ecological environment protection of forest ecosystems in our study region.