A Novel Spatial Simulation Method for Mapping the Urban Forest Carbon Density in Southern China by the Google Earth Engine

Urban forest is an important component of terrestrial ecosystems and is highly related to global climate change. However, because of complex city landscapes, deriving the spatial distribution of urban forest carbon density and conducting accuracy assessments are difficult. This study proposes a novel spatial simulation method, optimized geographically weighted logarithm regression (OGWLR), using Landsat 8 data acquired by the Google Earth Engine (GEE) and field survey data to map the forest carbon density of Shenzhen city in southern China. To verify the effectiveness of the novel method, multiple linear regression (MLR), k-nearest neighbors (kNN), random forest (RF) and geographically weighted regression (GWR) models were established for comparison. The results showed that OGWLR achieved the highest coefficient of determination (R2 = 0.54) and the lowest root mean square error (RMSE = 13.28 Mg/ha) among all estimation models. In addition, OGWLR achieved a more consistent spatial distribution of carbon density with the actual situation. The carbon density of the forests in the study area was large in the central and western regions and coastal areas and small in the building and road areas. Therefore, this method can provide a new reference for urban forest carbon density estimation and mapping.


Introduction
Urban forests can not only improve the aesthetic quality of cities but also play an important role in regulating the local climate [1,2]. They help buffer buildings against heat and wind, thus helping reduce the energy demands for heating and cooling while also absorbing the heat emitted from buildings [3]. Furthermore, urban trees are crucial for sequestering carbon and stabilizing soils while filtering dust and other harmful particulates emitted from vehicles and construction sites [4].
To explore the impact of urban vegetation on the regional carbon cycle, it is necessary to accurately estimate the carbon storage of urban vegetation [5,6]. Due to the complexity of landscape types, including buildings, roads, trees, parking lots, water bodies, forests and other green areas, it is undoubtedly challenging to estimate the carbon storage of urban vegetation [7][8][9][10]. Traditionally, plot surveys, biomass conversion factors, process models, the vorticity correlation and covariance method, and remote sensing information models have widely been used for forest carbon storage monitoring [11,12]. However,

Study Area
Shenzhen city (113 • 43 -114 • 38 E, 22 • 24 -22 • 52 N) is located in the south-central coastal areas of Guangdong Province, China ( Figure 1). It has a subtropical maritime climate with an annual average temperature of 22.4 • C and an annual rainfall of 1933.3 mm. The rainy season is from April to September. The average annual sunshine time is 2120.5 h. The vegetation coverage of the city consists of forest lands, shrublands, grasslands and green roofs. The total area covered by forests and green spaces is 79,700 ha, and approximately 4.88 million trees grow along roads and streets. Deciduous forest stands, mixed coniferous and broad-leaved stands and shrubs were dominant among the forest land cover types. The main conifer species include Masson pine (Pinus massoniana Lamb.), black pine (Pinus thunbergii Parl.) and araucaria (Araucaria cunninghamii Sweet). The main deciduous broadleaved tree species are Schefflera (Schefflera octophylla (Lour.) Harms), Delonix regia, Albizia julibrissin Durazz., and Kapok (Bombax ceiba L.). The evergreen broad-leaved species include Acacia confuse (Acacia confusa Merr.), Bauhinia (Bauhinia blakeana Dunn), Camphor tree (Cinnamomum camphora (L.) Presl), Ficus microcarpa (Ficus concinna Miq.), and Eucalyptus (Eucalyptus robusta Smith). The main economic forest tree species are Litchi chinensis (Litchi chinensis Sonn.) and Euphoria longan (Dimocarpus longan Lour).

Satellite Image Acquisition and Processing
In this study, we used Landsat 8-OLI surface reflectance images acquired during the period of June-November 2014. In addition, we used DEM images from two SRTM satellites with a 30-m pixel size. Data acquisition and preprocessing were performed using the GEE API platform, which is based on the JavaScript programming language. We developed a programming script to create all available Landsat imagery collections for the selected time period. An additional advantage was the possibility of use in GEE Landsat 8 images that were already radiometrically corrected and converted to surface reflectance [36,37]. To minimize the cloud problem, we applied the filter to the collection for the purpose of identifying the imagery with less than 10% cloud cover and for avoiding bad collections. After that, using the quality band, we applied a cloud masking function over the images to extract only pixels free from clouds and cloud shadows. Additionally, we collated key information from metadata for each image in the collection (time, solar zenith angle, solar azimuth, etc.). In total, 50 Landsat 8 OLI images were selected and processed in this study. Using the DEM obtained from SRTM data, we calculated the slope and aspect values for every pixel of the study area. Their spatial resolution was 30 m, which corresponded to the Landsat 8 OLI images. Remote Sens. 2021, 13, x FOR PEER REVIEW 4 Figure 1. Location of the study area and the sample plot distribution.

Satellite Image Acquisition and Processing
In this study, we used Landsat 8-OLI surface reflectance images acquired during period of June-November 2014. In addition, we used DEM images from two SRTM sa lites with a 30-m pixel size. Data acquisition and preprocessing were performed using GEE API platform, which is based on the JavaScript programming language. We de oped a programming script to create all available Landsat imagery collections for the lected time period. An additional advantage was the possibility of use in GEE Lands images that were already radiometrically corrected and converted to surface reflecta [36,37]. To minimize the cloud problem, we applied the filter to the collection for the p pose of identifying the imagery with less than 10% cloud cover and for avoiding bad lections. After that, using the quality band, we applied a cloud masking function over images to extract only pixels free from clouds and cloud shadows. Additionally, we lated key information from metadata for each image in the collection (time, solar ze angle, solar azimuth, etc.). In total, 50 Landsat 8 OLI images were selected and proces in this study. Using the DEM obtained from SRTM data, we calculated the slope and pect values for every pixel of the study area. Their spatial resolution was 30 m, wh corresponded to the Landsat 8 OLI images. Having these gathered data, for the terrain correction, we performed an illumination correction algorithm as a function in the GEE API, which is based on an empirical rotation model [36]. The results showed that the influence of the shady slope and shadow was greatly eliminated (Figure 2). After applying this function to all images in the collection, we finally reduced the imagery to a composite of median pixel values. The median provided the finest balance between oversaturated and lower pixel values. Having these gathered data, for the terrain correction, we performed an illumination correction algorithm as a function in the GEE API, which is based on an empirical rotation model [36]. The results showed that the influence of the shady slope and shadow was greatly eliminated (Figure 2). After applying this function to all images in the collection, we finally reduced the imagery to a composite of median pixel values. The median provided the finest balance between oversaturated and lower pixel values.
Then, the image was mosaicked and cropped according to the Shenzhen administrative boundary vector layer.

Measurement of Field Data
In Shenzhen, 234 sample plots were obtained on the principle of stratified random sampling. Using ArcGIS 10.2 software, the total study area was divided into 5 types ac- Then, the image was mosaicked and cropped according to the Shenzhen administrative boundary vector layer.

Measurement of Field Data
In Shenzhen, 234 sample plots were obtained on the principle of stratified random sampling. Using ArcGIS 10.2 software, the total study area was divided into 5 types according to the overall distribution of land types in Shenzhen: forest, urban area, water, grassland, and bare land. The sample plot sizes of the forest, shrub, and grass cover types were 25.82 × 25.82 m, 2 × 2 m, and 1 × 1 m, respectively. The diameter at breast height, height, width of the crown, and height of the branches of each tree in every forest sample plot were recorded. For the shrub sample plot, the species scientific name, ground diameter, total height, and width of the crown of each individual were measured. For the herb samples, the species names, the number of plants (the number of clusters), the average height of the herbs, and the degree of coverage were recorded. To calculate the carbon density per area unit of the plot, the main species in the field survey plot were assigned. Then, the carbon density of each plot (Mg/ha) was calculated using the binary volume table for the main tree species, including soft and hardwood broad-leaved species in Guangdong. All measurements and calculations were performed according to the "Guidelines on carbon accounting and monitoring for afforestation project" [51]. Thus, the total biomass of each plot was calculated. After obtaining the forest biomass (Mg/ha), the forest carbon in each plot was calculated according to the dominant species (group) carbon content table. The carbon sink of shrubs and herbs was obtained based on the model of shrub/herb biomass and height [52] and was then converted to carbon density according to [51].
In total, 234 plots were selected for mapping the carbon density of the urban forests and green spaces in Shenzhen. The statistics in Table 1 show that the mean carbon density among samples that contain carbon was 25.44 Mg/ha and the standard deviation was 20.75 Mg/ha. The obtained urban vegetation carbon density had a large variation coefficient of 81.75%.

Variables Calculation for Carbon Density Mapping
To choose the most important factors for carbon density modeling, in addition to the original band values of Landsat 8 images, we also used various vegetation indices (NDVI, ARVI, SAVI, TVI, EVI, etc.), the ratio vegetation index of Band-2, the ratio vegetation index of Band-3, and other band ratios [50,53]. In addition, the mean, variance, entropy, secondorder matrix, contrast, uniformity, correlation, dissimilarity of each band, and a total of 56 texture features through the texture window of 3 × 3 were considered as potential factors for carbon density modeling. The terrain factors, including the elevation, slope and aspect, and the first four components after the principal component analysis were also extracted. Thus, a total of 273 factors were considered in the modeling. We used ENVI 5.3 software to generate all these variables, and the calculation formulas of the spectral variables are shown in Table 2.

Models for Carbon Density Mapping
MLR, kNN, RF and GWR were frequently used to simulate the relationship between the forest carbon density and spectral variables. MLR needs a significant linear relationship between dependent variables and independent variables, which can measure the impact of multiple variables on prediction. The combination of multiple independent variables Remote Sens. 2021, 13, 2792 6 of 15 is more effective than using only one independent variable to predict carbon density [55]. The general expression of multiple linear regression equation is shown in Equation (1).
where Y denotes the carbon density (Mg/ha) and X denotes the independent variables. The kNN is established and used to estimate vegetation carbon density. The kNN does not need samples to be specifically distributed, so it is flexible and widely used in vegetation parameter estimation and carbon density mapping [13]. The nearest neighbor k, distance measurement and prediction rules were the factors that strongly affect the estimation results [42]. The k samples nearest to the sample plot to be predicted by spectral distance were searched by the kNN algorithm. The sample plot to be predicted was determined by the specific prediction rules and the attributes of the selected samples [13]. In Shenzhen, we used the Euclidean distance and reciprocal weighting of distance method to predict carbon density (Equation (2)). The initial range of the k value was set as 1-50, and the k with the minimum root mean square error (RMSE) was determined.
where CD p denotes the predicted carbon density at pixel p, y j denotes the carbon density observation corresponding to the jth sample plot, d pj is the spectral distance from pixel p to the jth sample plot, and k denotes the optimal number of sample plots. In this study, the kNN model finally obtained the lowest RMSE when k = 23 ( Figure 3).

∑ 1
where denotes the predicted carbon density at pixel p, denotes the carbon density observation corresponding to the jth sample plot, is the spectral distance from pixel p to the jth sample plot, and k denotes the optimal number of sample plots.
In this study, the kNN model finally obtained the lowest RMSE when k = 23 ( Figure  3). RF can quickly build a large number of regression trees for prediction. Strong robustness, easy to understand, and insensitive to noise data make it widely used in land change detection and forest parameter estimation [42]. Ntrees which is the number of decision trees and Mtry are important parameters that affect the random forest estimation. Ntrees is set as 500 by default, and Mtry can be set from 1 to the number of variables. In addition, GWR will also be established for comparison of carbon density estimation.

The OGWLR Model for Carbon Density Estimation
The GWR model combined with logarithmic regression (GWLR) can be used to explore the spatial heterogeneity, which avoids the shortcomings of local variation when deriving spatially invariant regression coefficients [49]. The data should be normalized by Equation (3) before GWR: RF can quickly build a large number of regression trees for prediction. Strong robustness, easy to understand, and insensitive to noise data make it widely used in land change detection and forest parameter estimation [42]. Ntrees which is the number of decision trees and Mtry are important parameters that affect the random forest estimation. Ntrees is set as 500 by default, and Mtry can be set from 1 to the number of variables. In addition, GWR will also be established for comparison of carbon density estimation.

The OGWLR Model for Carbon Density Estimation
The GWR model combined with logarithmic regression (GWLR) can be used to explore the spatial heterogeneity, which avoids the shortcomings of local variation when deriving spatially invariant regression coefficients [49]. The data should be normalized by Equation (3) before GWR: Logarithmic transformation to the normalized data must be performed, and a regression model with ln[y (1 − y )] as the dependent variable must be established. The GWLR model can be written as follows: where (u, v) is the location of the target plot, α 0 (u, v) and α k (u, v) are the regression coefficients of spatial variation, x k is the k-th independent variable, and ε(u, v) is the estimation error obeying the normal distribution. The OGWLR model was developed from the GWLR model, and it selected the optimal sampling size by determining the variation in the sample variance and obtained the carbon density prediction value of each sample on the basis of the spatial autocorrelation of the data. First, it determined the maximum spatial autocorrelation distance of all sample points, which was the extraction radius of the model, by statistical analysis of the sample data. Then, a local regression model was established for the pixels inside the extraction circle. Suppose there are N sample points in the circle. We calculate the variance from the sample points with a minimum of 2 to obtain the variance distribution map. When the variance tends to be balanced, the number is the optimal sampling size. In this case, each cell has an optimal sampling size. Finally, the developed model analyzes each cell separately to obtain the final result.

Model Accuracy Assessment
The leave-one-out cross-validation method (LOOCV) was adopted for the accuracy assessment [56]. This approach is widely used for validation when the field-observed samples are small. This method involves using one field-observed value as a validation sample and the remaining field-observed values as the training samples. We assume N samples and N-1 samples were randomly selected to train the data each time. The average of these N results was used to assess the performance of the model. This process is repeated in all possible combinations to cut the original sample into a validation sample of one field observed value and a training sample [50]. The R 2 and RMSE were also used to evaluate the superiority of the models [57]: where y i is the measured carbon density,ŷ i is the estimated carbon density, and n is the sample size. The smaller the RMSE is, the higher the model estimation accuracy is. The larger the determination coefficient (R 2 ) is, the closer the estimate is to the real value.

Correlation Analysis and Selection of Variables
The sample carbon density data after normalization and logarithmic transformation was taken as the dependent variable. In R, the Pearson's correlation coefficient matrix of the modeling factors and the carbon density was calculated by the corr.test() function of the psycho package [58]. At the significance level of 0.01, 120 variables were significantly associated with carbon density, and the factors with an absolute value of the correlation coefficient greater than 0.600 (p < 0.01) are listed in Table 3. The factor with the highest correlation coefficient of 0.639 was SR536. The stepwise regression method was used to screen out the variable combination for modeling. The variance inflation factor (VIF) was used for testing and eliminating collinearity between variables. Finally, the variable combination significantly related to carbon density and scarce collinearity was used for subsequent modeling and estimation. In Shenzhen, B4, SR536, TVI and SR32 were determined as variables for modeling and estimation.

Result Comparison of the Estimation Models
The MLR, kNN, RF, GWR and OGWLR models were developed using the observed carbon density combined with the Landsat 8 images. The estimated results represented in Table 4 show that there was no significant difference between the estimation accuracy of Remote Sens. 2021, 13, 2792 9 of 15 the kNN and RF. Compared with kNN and RF, the RMSE obtained from GWR decreased by 6.5% and 7.1%, respectively. Additionally, considering the spatial heterogeneity, the accuracy of GWR is slightly improved compared with MLR. However, the OGWLR model achieved the best estimation performance, attaining the highest coefficient of determination (R 2 = 0.54) and the lowest RMSE of 13.28 Mg/ha. The results implied that the OGWLR method achieved the best estimation effect, mostly remaining statistically significant (Table 5). Although not statistically significant, RMSE implemented by OGWLR is still 0.14 Mg/ha lower than GWR. The standard deviation of the estimated values obtained by the MLR, kNN, RF and GWR were similar, while the standard deviation of the estimated values obtained by OGWLR was significantly increased. In addition, compared with MLR, kNN and RF, the RMSE acquired by the OGWLR model decreased by 2.4%, 7.5% and 8.0%, respectively. The scatter plots in Figure 4 show the fitting between the observed carbon density and the carbon density predicted by the MLR, kNN, RF, GWR and OGWLR models. The fitting effects of the MLR, kNN, RF, and GWR were similar, with overestimated and underestimated values. However, the predicted values generated by the OGWLR model achieved the best fitting effect with the measured carbon density. The residuals from the OGWLR model were basically distributed and tended to be randomly distributed.
The vegetation carbon density was high within the forest areas in the central and southeastern regions of the study area ( Figure 5), with values more than 40 Mg/ha. In urban residential areas and waters, the vegetation carbon density was low. The total carbon storage calculated by MLR, kNN, RF, GWR and OGWLR was 319.6 × 10 4 Mg, 547.6 × 10 4 Mg, 567.2 × 10 4 Mg, 850.8 × 10 4 Mg and 1098.5 × 10 4 Mg.
The spatial distribution of carbon density plotted by the OGWLR model was the closest to the actual distribution of carbon density in Shenzhen, and on the basis of this research, it was the optimal model for the study area.
The OGWLR model uses the same dependent and independent variables as the MLR, kNN, RF and GWR models. The obtained optimal sample size of each pixel can clearly reflect the influence of variance variation on GWR. The scatter plot (Figure 4e) and carbon density map (Figure 5e) showed that the residuals were distributed randomly, and the frequencies of overestimation and underestimation were less than those of the previous four models. The correlation coefficients between the maps acquired by MLR, kNN, RF, GWR and OGWLR are 0.79, 0.90, 0.98 and 0.99 (p < 0.001), respectively, indicating that there is almost no difference between pixels. The improvement of the R 2 caused by OGWLR was not significant statistically, while the spatial distribution of carbon density is the most consistent with the actual situation.
The scatter plots in Figure 4 show the fitting between the observed carbon density and the carbon density predicted by the MLR, kNN, RF, GWR and OGWLR models. The fitting effects of the MLR, kNN, RF, and GWR were similar, with overestimated and underestimated values. However, the predicted values generated by the OGWLR model achieved the best fitting effect with the measured carbon density. The residuals from the OGWLR model were basically distributed and tended to be randomly distributed. The spatial distribution of carbon density plotted by the OGWLR model was the closest to the actual distribution of carbon density in Shenzhen, and on the basis of this research, it was the optimal model for the study area. The OGWLR model uses the same dependent and independent variables as the MLR, kNN, RF and GWR models. The obtained optimal sample size of each pixel can clearly reflect the influence of variance variation on GWR. The scatter plot (Figure 4e) and carbon density map (Figure 5e) showed that the residuals were distributed randomly, and the frequencies of overestimation and underestimation were less than those of the previous four models. The correlation coefficients between the maps acquired by MLR, kNN, RF, GWR and OGWLR are 0.79, 0.90, 0.98 and 0.99 (p < 0.001), respectively, indicating that there is almost no difference between pixels. The improvement of the R 2 caused by OG-WLR was not significant statistically, while the spatial distribution of carbon density is the most consistent with the actual situation.

The OGWLR Model
The GWR method has been widely used in forestry research, especially for the estimation of forest parameters and the classification of land cover types [46,47]. The GWR method can detect the spatial non-stationarity of the model and regression coefficients, but it causes anomalies when it is used for estimating the regression coefficient because

The OGWLR Model
The GWR method has been widely used in forestry research, especially for the estimation of forest parameters and the classification of land cover types [46,47]. The GWR method can detect the spatial non-stationarity of the model and regression coefficients, but it causes anomalies when it is used for estimating the regression coefficient because of the limitation of the least squares criterion [48]. Propastin et al. [59] improved the GWR method by adding angles to geographically weighted kernel functions to reduce the levels and elevation errors. He adopted this method to estimate the forest biomass in tropical rainforests and obtained an estimation accuracy much higher than that of the traditional GWR model. Zhao et al. [60] reported that climate and topography were the most important factors affecting the growth and distribution of vegetation and proved that the estimation accuracy of GWR was better than that of a multiple linear regression model for LAI estimation.
In this study, the OGWLR model was proposed to explore the spatial distribution of vegetation carbon density. The OGWLR is a local spatial interpolation model developed from the GWR, and it obtained the optimal modeling sample size for each pixel; additionally, carbon density mapping in Shenzhen showed more accurate results than those from the GWLR model. In some studies, the accuracy of forest carbon density estimation using medium-resolution remote sensing images was not satisfactory [18,19,61,62]. The OGWLR method achieved an RMSE of 13.28 Mg/ha using the Landsat 8 images in this study and could be satisfactory at the regional scale. Due to the complexity of urban forest ecosystem, it is difficult to obtain accurate carbon density distribution. The main purpose of this study is to obtain the spatial distribution and trend of carbon density in Shenzhen, and the OGWLR is more consistent with the actual distribution than the traditional methods. Sun et al. [13] estimated the carbon density of the study area by the kNN method and obtained a minimum RMSE value of 18.37 Mg/ha. The determination coefficient acquired by OGWLR model is similar to that of [13] in Shenzhen, but the RMSE is lower.
The low accuracy may be due to the increased sampling size and the spatial variability of the pixels. Based on our results, OGWLR may contribute to improved accuracy. The RMSE achieved by OGWLR was 1.04% smaller than that of GWR, demonstrating the optimal sample size, and could slightly increase the accuracy of the local GWR model. However, compared with GWR, OGWLR has obtained a more reasonable spatial distribution of carbon density, which greatly reduces the overestimation and underestimate of pixel scale.

Uncertainty, Limitations, and Prospects
This study demonstrated the potential of the OGWLR model for improving the urban vegetation carbon density mapping. However, the estimation error of the vegetation carbon density estimated by OGWLR still has room for improvement. This could be explained by the complex urban landscape and large variation coefficients of the urban vegetation carbon density (81.57%) [63]. A large number of the mixed pixels in Landsat images also contributed to the large relative errors. Therefore, reducing the mixed pixels may improve the urban vegetation carbon density estimation accuracy. Furthermore, in some sample plots, OGWLR obtained a low carbon density and small optimal sampling sizes, which could be caused by insufficient samples for modeling.
The measurement of field sampling data is crucial to obtain the spatial distribution of carbon density of urban vegetation. However, due to the nonnegligible existence of carbon distribution in roadside trees and greenbelts, it is necessary to consider the sample distribution in these areas in order to obtain more reasonable measurement results that can represent the carbon density level in the study area. In addition, shrubs and herbs in the understory of arbor forest are important components of forest carbon storage. In this study, we have considered these factors to reduce the uncertainty caused by the actual sample measurement [64]. Furthermore, the phenomenon of spectral saturation in high vegetation coverage area cannot be ignored. LiDAR can obtain accurate forest information of vertical structure, which has the potential to reduce the saturation and uncertainty of urban forest carbon density estimation [65]. Increasing the number of sample plots of different vegetation types, fusing LiDAR data with optical remote sensing images to extract more accurate vegetation information and considering the spatial heterogeneity of vegetation carbon density distribution may be effective ways to improve the accuracy of urban vegetation carbon density estimation [66].

Conclusions
Accurate mapping of the carbon density of urban forest vegetation is critical for climate change monitoring on regional and global scales. Traditional methods often fail to accurately predict the carbon density in a region. In this study, we proposed an OGWLR method for regional carbon density estimation. It was adopted to map the carbon density of urban forest vegetation in Shenzhen. The comparisons between the established methods showed that OGWLR realized the highest prediction accuracy, and its RMSE values were 2.4%, 7.5%, 8.0% and 1.0% smaller than those of MLR, kNN, RF and GWR, respectively. The OGWLR also obtained a more reasonable spatial distribution of carbon density through the local optimal sample size, which clearly shows the impact of variance changes on the local