Modeling Forest Aboveground Carbon Density in the Brazilian Amazon with Integration of MODIS and Airborne LiDAR Data

: Timely updates of carbon stock distribution are needed to better understand the impacts of deforestation and degradation on forest carbon stock dynamics. This research aimed to explore an approach for estimating aboveground carbon density (ACD) in the Brazilian Amazon through integration of MODIS (moderate resolution imaging spectroradiometer) and a limited number of light detection and ranging (Lidar) data samples using linear regression (LR) and random forest (RF) algorithms, respectively. Airborne LiDAR data at 23 sites across the Brazilian Amazon were collected and used to calculate ACD. The ACD estimation model, which was developed by Longo et al. in the same study area, was used to map ACD distribution in the 23 sites. The LR and RF methods were used to develop ACD models, in which the samples extracted from LiDAR-estimated ACD were used as dependent variables and MODIS-derived variables were used as independent variables. The evaluation of modeling results indicated that ACD can be successfully estimated with a coe ﬃ cient of determination of 0.67 and root mean square error of 4.18 kg C / m 2 using RF based on spectral indices. The mixed pixel problem in MODIS data is a major factor in ACD overestimation, while cloud contamination and data saturation are major factors in ACD underestimation. These uncertainties in ACD estimation using MODIS data make it di ﬃ cult to examine annual ACD dynamics of degradation and growth, however this method can be used to examine the deforestation-induced ACD loss. (B1—Red; (B2—NIR; (B3—Blue; (B4—Green; all


Introduction
Forests, which cover approximately 30% of the Earth's land surface, produce about 75% of the terrestrial gross primary production and contain 80% of total plant biomass [1], thereby playing important roles in the global carbon cycle and global climate changes. Among different types of forests, tropical forests store about half of all forest carbon in the world and play particularly critical roles in atmospheric carbon sequestration [2]. The Brazilian Amazon has the largest rainforest area, but also has significant deforestation rates. In 2014, the World Wildlife Fund reported alarming deforestation rates in the Amazon region during 2001-2012 [3]. Today, 20% of the Amazon forest is gone. Extensive cattle ranching, large-scale intensive agriculture (mainly soybeans), oil and natural gas exploration, emissions. This information will help with regional carbon budgeting and efforts to reduce emissions from deforestation and forest degradation (REDD). Considering mismatches between sample plot sizes and MODIS pixel sizes resulting in difficulties in ACD modeling, the availability of LiDAR data across the Brazilian Amazon makes it possible to extract a sufficient number of samples that can match the MODIS pixel size. Therefore, the objective of this research was to explore the potential use of MODIS and a limited number of LiDAR data samples to map forest ACD in the Brazilian Amazon in order to better understand its spatial pattern. Another objective was to briefly examine the potential factors that could result in ACD estimation uncertainties. The main aims of this research are (1) to better understand the ability of using MODIS data for large-area ACD mapping in tropical forest regions, (2) to better understand the impacts of data saturation and cloud contamination in MODIS data on ACD mapping performance, and (3) to understand whether ACD dynamics caused by forest disturbance can be successfully detected.

Study Area
The Brazilian Legal Amazon Region, which consists of the States of Roraima, Amapá, Amazonas, Pará, Acre, Rondônia, Mato Grosso, Tocantins, and part of Maranhão, was selected as our study area ( Figure 1). It covers about 5.2 million km 2 and occupies 59% of Brazil's land area [57]. The region includes three biomes: Amazon, encompassing the largest contiguous area of tropical forests on Earth; Cerrado, the Brazilian savanna, which occupies portions of Mato Grosso and Maranhão, and all of Tocantins; and Pantanal, a seasonally flooded savanna in southwestern Mato Grosso [58]. The climate is characterized by year-round warm temperatures (averaging 25 • C, with less than 2 • C annual variation and 5-10 • C diurnal variation), with a rainy season between November and March and a dry season between May and September. April and October are transition months [59]. However, such an enormous region has different temperature and precipitation regimes, with the northern portion showing a significant heterogeneity in terms of the distribution of seasonal precipitation and southern and eastern portions showing a dry season of five months. Amazonian soils are divided into oxisols (45.5%), ultisols (29.4%), entisols (14.9%), alfisols (4.1%), inceptisols (3.3%), spodosols (2.2%), mollisols (0.8%), and vertisols (0.1%) [60]. Large-scale processes of colonization, deforestation, and forest degradation in the Brazilian Amazon began in the early 1970s, with settlement projects and construction of roads connecting the region with the southern and northeastern portions of the country. Since 1970, deforestation has resulted in more than 0.7 million km 2 of forest loss [61]. There is a significant uncertainty about the amount and spatial distribution of forest aboveground biomass in the Brazilian Amazon, but it is generally above 300 Mg/ha in its central to western portions. Transitional forests in the southern Large-scale processes of colonization, deforestation, and forest degradation in the Brazilian Amazon began in the early 1970s, with settlement projects and construction of roads connecting the region with the southern and northeastern portions of the country. Since 1970, deforestation has resulted in more than 0.7 million km 2 of forest loss [61]. There is a significant uncertainty about the amount and spatial distribution of forest aboveground biomass in the Brazilian Amazon, but it is generally above 300 Mg/ha in its central to western portions. Transitional forests in the southern portions of the region have biomass amounts ranging from 100 to 200 Mg/ha [48]. Forest degradation has affected the amount of AGB in Amazonian forests, which calls for cutting edge studies on this topic [5].

Datasets Used in This Research
In our research, we used two kinds of remotely sensed data, namely airborne LiDAR and MODIS data, as well as ancillary data (annual land cover maps, 2011-2017) (see Table 1). Airborne LiDAR data from 23 sites in five states-Acre, Rondônia, Mato Grosso, Pará, and Amazonas-were acquired for 2011-2014 and downloaded from the repository at Brazilian Agricultural Research Corporation [62]. Among the 23 sites, 18 were the same as those used by Longo et al. [5]. The spatial locations of all sites are illustrated in Figure 1. The airborne LiDAR data for 2011-2014 were acquired at an average of 850-900 m above ground by Geoid Laser Mapping, Ltd., using the ALTM (Airborne Laser Terrain Mapper) Orion M-200. We established a minimum return density of more than four points per square meter for data collection. A detailed description of the LiDAR systems used for surveys can be found in the paper by Longo et al. [5]. The downloaded point cloud data of each site were processed to generate a digital terrain model (DTM) and digital surface model (DSM) with a cell size of 1 × 1 m. The CHM data were obtained by subtracting DTM from DSM. The MODIS spectral reflectance from nadir bidirectional distribution function (BRDF) adjusted reflectance (NBAR) data at a spatial resolution of 500 m (MCD43A4) was used to map the ACD distribution of the entire Legal Amazon region. Seven BRDF-corrected spectral bands-red (B1-Red; 620-670 nm), near-infrared (B2-NIR; 841-876 nm), blue (B3-Blue; 459-479 nm), green (B4-Green; 545-565 nm), middle infrared (B5-MIR; 1230-1250 nm), shortwave infrared 1 (B6-SWIR1; 1628-1652 nm), and shortwave infrared 2 (B7-SWIR2; 2105-2155 nm)-covering the study area between 2011 and 2017 were processed using the Google Earth Engine (GEE) platform. To minimize the impacts of clouds on MODIS imagery, a dataset of annually composited cloud-free images was produced by averaging the values from all images from a certain year for the pixels without clouds; that is, only the high-quality pixels in all seven bands were extracted.
The annual land cover maps were downloaded from the Brazilian Annual Land Use and Land Cover Mapping Project [63]. The land cover datasets at 30 × 30 m were developed from Landsat images using extensive machine learning algorithms through the GEE platform. The original land cover classification system consists of 21 land cover types under five broad categories: forest, non-forest natural formation, farming, non-vegetated, and water. The complete description of the project can be found at MapBiomas [63]. The land cover classification image was recoded as forest and others, then aggregated into a new dataset, where the pixel size changed from 30 × 30 m to 500 × 500 m using the majority algorithm, meaning the resampled annul forest image has the same cell size as MODIS data.

Strategy of This Research
The framework for mapping the ACD distribution is illustrated in Figure 2, including the following steps: (1) calculation of ACD for the LiDAR sites based on LiDAR metrics generated from LiDAR point data; (2) preparation of samples for modeling and validation, which are from the LiDAR-based ACD; (3) extraction and selection of variables from MODIS data; (4) establishment of ACD models based on linear regression (LR) and RF approaches; (5) prediction of ACD for entire study area using the developed model by incorporating the resampled forest image and evaluation of the prediction results.

Calculation of Aboveground Carbon Density for the 23 Sites Using LiDAR Data
Longo et al. [5] developed a general ACD model based on airborne LiDAR metrics and ACD samples from 18 study sites, for which LiDAR metrics and ACD were calculated at a plot size of 50 × 50 m. The ACD model explained 70% of the variance across forest types. In our research, we applied this model to predict ACD distributions in the 23 sites. The ACD model is re-expressed as follows: where ℎ is the average height; ℎ is the kurtosis of the height distribution; ℎ5 and ℎ10 are the 5th and 10th percentile heights, respectively; ℎ is the interquartile range (h75-h25); and ℎ100 is the maximum height. All of these variables were calculated based on the plot size of 50 × 50 m. The predicted ACD spatial distributions for 23 sites are illustrated in Figure 3, indicating considerably different ACD ranges among these sites.

Calculation of Aboveground Carbon Density for the 23 Sites Using LiDAR Data
Longo et al. [5] developed a general ACD model based on airborne LiDAR metrics and ACD samples from 18 study sites, for which LiDAR metrics and ACD were calculated at a plot size of 50 × 50 m. The ACD model explained 70% of the variance across forest types. In our research, we applied this model to predict ACD distributions in the 23 sites. The ACD model is re-expressed as follows: where h m is the average height; k h is the kurtosis of the height distribution; h 5 and h 10 are the 5th and 10th percentile heights, respectively; h IQ is the interquartile range (h 75 -h 25 ); and h 100 is the maximum height. All of these variables were calculated based on the plot size of 50 × 50 m. The predicted ACD spatial distributions for 23 sites are illustrated in Figure 3, indicating considerably different ACD ranges among these sites. The LiDAR-derived ACD images with a cell size of 50 × 50 m for all sites were aggregated to a cell size of 500 × 500 m using an averaging algorithm to match the MODIS data. The new ACD data were overlaid on the MODIS data for sample plot collection. A systematic sampling approach at one pixel interval (500 m) was used to collect initial samples for each site. These samples were refined by overlaying them on the forest map. The samples located at the borders of LiDAR sites and the non-forest areas were removed so that all selected samples for ACD modeling and validation were located in the forest regions. Finally, a total of 368 samples were collected. As shown in Figure 3, the numbers of samples from 23 sites are considerably different because of the sizes and shapes of LiDAR sites and the forest distributions. The statistical results for these samples are summarized in Table 2. Based on the statistical results, the ACD values are in the range of 0.3 to 31.6 kg C/m 2 , with a mean value of 19.1 kg C/m 2 and standard deviation of 7.5 kg C/m 2 . However, the ACD ranges have considerably different values among the 23 sites. Some LiDAR sites, such as SFX_A01_2012, have ACD values of less than 10 kg C/m 2 , while other sites, such as CAU_A01_2014, have ACD values as high as over 30 kg C/m 2 in some pixels. These different ACD ranges provide reasonably good distribution of the ACD samples. These samples were randomly separated into two groups: 60% of the samples were used for modeling and 40% for validation. The LiDAR-derived ACD images with a cell size of 50 × 50 m for all sites were aggregated to a cell size of 500 × 500 m using an averaging algorithm to match the MODIS data. The new ACD data were overlaid on the MODIS data for sample plot collection. A systematic sampling approach at one

MODIS Potential Variable Predictors of ACD
One key step in the ACD modeling procedure is to extract suitable variables from MODIS data. Spectral bands and spectral indices are the most commonly used predictors [7]. Different types of spectral indices can be used as potential predictors. Most of them are from visible and NIR bands, such as the ratio vegetation index (RVI), NDVI, soil-adjusted vegetation index (SAVI), modified soil-adjusted vegetation index (MSAVI), and optimized soil-adjusted vegetation index (OSAVI). The SWIR spectral band is one of the most important variables in predicting AGB [64][65][66], and spectral indices containing SWIR have stronger relationships with AGB [67,68] under complex forest structures. Thus, some spectral indices derived from SWIR were also calculated. Table 3 lists the spectral indices used in this research and their calculation formulae. A total of 22 predictor variables were explored, including seven MODIS BRDF spectral bands and 15 spectral derivations. Table 3. A summary of equations used for calculation of spectral indices.

Spectral Index Equation Reference(s)
Note: Blue, Green, Red, NIR, MIR, SWIR1, and SWIR2 are spectral bands of MODIS data. MDij represents the ith and jth spectral bands of the MODIS data.

Identification of Key Variables and Development of Aboveground Carbon Density Estimation Models
The relationships between ACD and MODIS-derived variables were examined using Pearson's correlation analysis to understand the potential variables that can be used for ACD modeling. Because the selection of key variables is a critical step in modeling [6], we used stepwise regression and the RF approach to identify key variables. Stepwise regression may be the most commonly used approach to identify key variables for biomass estimation modeling, assuming linear relationships exist between biomass and independent variables [7]. However, in reality, this assumption may not be true. RF is an effective tool to rank the importance of independent variables; thus, we can identify key variables for developing estimation models without the linear relationship assumption. After determination of key variables, we used LR and RF to develop ACD estimation models.
The LR model can be expressed as: Remote Sens. 2020, 12, 3330 where y is the ACD sample data from LiDAR-estimated ACD data; x 1 , x 2 , . . . , x n are prediction variables derived from MODIS data (spectral bands and spectral indices); b 0 is a constant; b 1 , b 2 , . . . , b n are regression coefficients; ε is the error item; and n is the number of variables. Although different machine learning algorithms, such as artificial neural network (ANN), support vector machine (SVM), and random forest (RF) [7] algorithms, can be used for ACD modeling, RF is one of the machine learning algorithms that can effectively handle the high-dimensional variables [80]. Compared to ANN and SVM, RF has advantages in dealing with noisy data in training datasets, using either discrete or continuous datasets, and requiring much less time for large dataset processing and algorithm parameter optimization [7]. Therefore, this research used the RF algorithm to develop the ACD estimation model. Many publications have described the RF theory [81,82], so we do not need to explain it in detail here. The "randomForest" package in R software was used in this research. Two critical parameters-number of decision trees (ntree) and number of variables used in each node (mtry)-needed to be optimized. The optimized ntree and mtry values were determined when the smallest root mean squared error (RMSE) was reached through iterative processes of adjusting mtry and ntree parameters based on training data. In general, mtry was selected as one-third of the number of variables, while ntree was within the range of 100-5000. Since RF can provide rank the variables by importance but does not describe the relationships between the variables, Pearson's correlation analysis can be used to examine those relationships to remove some variables that have high correlations to one another. If the correlation coefficients between two variables had 0.85 or higher, one was dropped and RF was conducted again. The procedure was repeated until the minimum number of variables was achieved, while the RMSE remained stable. The RF algorithm based on the finally selected variables was used to predict ACD for the entire study area for 2011-2017. In total, 60% of the samples (see Table 2) were randomly selected to develop the ACD estimation models.

Evaluation of the Modeling Results
Considering the advantages of spectral indices over spectral bands in reducing environmental impacts, two data scenarios-spectral indices alone and a combination of spectral bands and indices-were examined for ACD modeling using LR and RF approaches, respectively. The modeling results from different data sources and modeling approaches were then quantitatively evaluated using the coefficient of determination (R 2 ), RMSE, and relative RMSE (RMSEr) [7,83]. This research used all validation samples from the sites for all years between 2011 and 2014 for overall assessment; meanwhile, an evaluation was also conducted for individual prediction results from 2012, 2013, and 2014 using the available validation samples. In total, 40% of the samples were randomly selected for use as validation samples.

Impacts of Deforestation on Aboveground Carbon Dynamics
After developing the annual ACD distribution maps between 2011 and 2017 using the best ACD estimation model, we examined the annual ACD change and analyzed the impacts of forest disturbance on ACD dynamics. Considering the uncertainty in estimations, we mainly focused on the analysis of whether ACD change due to deforestation can be detected and how different deforestation rates can be reflected in the ACD change. The deforestation rate here is defined as the ratio of the number of deforested pixels to the total number of pixels within the window size of 17 × 17 pixels in Landsat-derived forest maps, corresponding to one cell in MODIS data. Therefore, based on the annual land use and land cover dataset developed from Landsat data for 2011-2017, we recoded these data as forest and non-forest and selected typical regions in which to identify deforestation sites with various rates. The deforested areas were linked to annual ACD estimation results to analyze the impacts of various deforestation rates on ACD estimates. The procedure to conduct this analysis is illustrated in Figure 4. Remote Sens. 2020, 12, x FOR PEER REVIEW 12 of 27

Analysis of the Relationships between Aboveground Carbon Density and MODIS-Derived Variables
The correlation coefficients between ACD and MODIS variables (Table 4) indicate that individual spectral bands have significantly negative correlations with ACD, of which SWIR2 has the highest r value of −0.74, followed by red and SWIR1 bands, with r values of −0.65 and −0.63, respectively. The selected spectral indices also have significant correlation with ACD, either negative or positive. MD75 has the highest r value of −0.68, followed by NDII7 and albedo, with r values of 0.66 and −0.63, respectively. This result indicates that SWIR2 and spectral indices containing SWIR2 have higher correlations with ACD than other spectral bands or spectral indices, which is similar to results from previous studies based on Landsat imagery in the Brazilian Amazon [67].

Analysis of the Relationships between Aboveground Carbon Density and MODIS-Derived Variables
The correlation coefficients between ACD and MODIS variables (Table 4) indicate that individual spectral bands have significantly negative correlations with ACD, of which SWIR2 has the highest r value of −0.74, followed by red and SWIR1 bands, with r values of −0.65 and −0.63, respectively. The selected spectral indices also have significant correlation with ACD, either negative or positive. MD75 has the highest r value of −0.68, followed by NDII7 and albedo, with r values of 0.66 and −0.63, respectively. This result indicates that SWIR2 and spectral indices containing SWIR2 have higher correlations with ACD than other spectral bands or spectral indices, which is similar to results from previous studies based on Landsat imagery in the Brazilian Amazon [67].

Analysis of Aboveground Carbon Density Estimation Models
The best models using LR and RF based on spectral indices alone and combinations of spectral indices and spectral bands (Table 5) indicate the important roles of the red spectral band and the spectral indices containing SWIR2 in both LR and RF models. Considering R 2 values, two data sources have similar performance, but the variables are slightly different, especially for the LR models. When spectral bands and indices were used together in the LR model, red and SWIR2 bands had more important roles than the spectral index based on beta values.

Comparative Analysis of Aboveground Carbon Density Prediction Results
Evaluation of the modeling results (Table 6) indicates that the combination of spectral indices and bands slightly improved the overall ACD estimation when LR was used but did not when using RF. Overall, the RF-based algorithm using spectral indices provided the best performance, with an R 2 value of 0.67 and RMSE of 4.18 kg C/m 2 . When the developed models were used for ACD prediction for 2011-2017, the accuracy assessment results based on individual years from 2012 to 2014 indicate that the RF-based approach, using either spectral indices alone or a combination of spectral indices and bands, provides slightly better estimation results for 2012 and 2013 than the LR-based models do. However, the estimation performances were similar for LR and RF in the 2014 estimations. The scatterplots and residuals (see Figure 5) show the similar performances of the LR and RF models based on either spectral indices or a combination of spectral indices and bands; that is, ACD estimates and reference data have linear relationships (Figure 5a1-d1). Overestimation and underestimation, which are obvious for each model, are different in the LR and RF modeling results (Figure 5a2-d2). For example, the overestimation is slightly better in RF-based modeling results compared to the LR-based estimation; this is especially obvious when the ACD is between 10 and 15 kg C/m 2 . The comparison of the two modeling results indicates that both methods have different advantages. For example, the LR is better in reducing the underestimation problem, especially when the ACD is greater than 25 kg C/m 2 , while RF is better in reducing overestimation. Figure 5 shows that the RF model has relatively less prediction ability than the LR model when the ACD is high, for example greater than 25 kg C/m 2 .
This analysis is based only on the overall performances of different modeling results with two data scenarios and two algorithms, and cannot determine the performances at different ACD ranges. As summarized in Table 7, the ACD modeling results show the highest RMSEr values of 116.8-158.7% when the ACD is less than 10 kg C/m 2 , implying poor performances of these models for ACD prediction within that range. In contrast, the ACD predictions have the lowest RMSE values of 2.57-2.89 kg C/m 2 and lowest RMSEr values of 11.3-12.7% when the ACD is 20-25 kg C/m 2 , implying the good performance of each model in this range. Although RF based on spectral indices alone provided the best modeling performance, for ACD values greater than 25 kg C/m 2 , the LR model based on spectral indices alone provided better estimation than RF-based models, implying different effects of datasets and modeling algorithms on ACD estimations and a relationship between ACD ranges and performance.
Remote Sens. 2020, 12, x FOR PEER REVIEW 15 of 27 Figure 5. The relationships between forest aboveground carbon density (ACD) estimates and reference data: (a1,b1) linear regression models based on spectral indices alone and a combination of spectral indices and bands, respectively; (c1,d1) random forest models based on spectral indices and a combination of spectral indices and bands, respectively; (a2-d2) residuals of ACD estimates against reference data corresponding to each estimation model.
This analysis is based only on the overall performances of different modeling results with two data scenarios and two algorithms, and cannot determine the performances at different ACD ranges. As summarized in Table 7, the ACD modeling results show the highest RMSEr values of 116.8-158.7% Figure 5. The relationships between forest aboveground carbon density (ACD) estimates and reference data: (a1,b1) linear regression models based on spectral indices alone and a combination of spectral indices and bands, respectively; (c1,d1) random forest models based on spectral indices and a combination of spectral indices and bands, respectively; (a2-d2) residuals of ACD estimates against reference data corresponding to each estimation model.

Spatial Distribution of Predicted Aboveground Carbon Density
Since the RF model based on spectral indices performed best, this model was used to predict the ACD distribution values for the entire study area in 2011 and 2017. As illustrated in Figure 6, all predicted ACD images have similar spatial patterns; that is, high ACD pixels were mainly distributed in the northern, western, and central parts, which were mainly over 20 kg C/m 2 , while low ACD pixels (<10 kg C/m 2 ) were located in the southern and eastern parts due to deforestation. Comparison of the spatial patterns of ACD distributions indicates that the ACD values of many forested pixels in the northern area might be underestimated due to the cloud contamination problem, involving mixed pixels consisting of forest and clouds.

Aboveground Carbon Change Caused by Deforestation
In the Amazon basin, it is very important to know how much ACD could be lost due to deforestation, in addition to understanding whether the MODIS-derived ACD estimates can respond to different deforestation rates. As shown in Table 8, the ACD value dropped rapidly when the deforestation rate reached 44.9% (e.g., plot b) or higher (e.g., 56.3% in plot d and 58.2% in plot e). However, for low deforestation rates (e.g., 33.6% in plot f and 38.7% in plot e), the ACD values did not change significantly; that is, the amount of ACD change was less than the RMSE value. This situation implies that when the deforestation area is close to or more than one-half of the cell size of the MODIS pixel, such deforestation can be reflected by the significant drop over two successive years in the ACD estimates; however, when the deforestation size is small relative to a MODIS pixel, the ACD dynamics are not sensitive to such changes. As shown in plot f, the annual ACD estimates in 2012 and 2013 were 19.0 kg and 19.7 kg C/m 2 , respectively, although a small amount of deforestation (5.9%) occurred in this period; a similar case was found in plot d, where a deforestation rate of 33.6% occurred between 2011 and 2012, however the ACD only dropped by 0.2 kg C/m 2 from 7.8 to 7.6 kg C/m 2 . Comprehensive analysis of Tables 7 and 8 and the annual ACD maps ( Figure 6) indicates that the annual ACD change cannot effectively respond to forest disturbance or growth due to the annual ACD change being less than the RMSE amount. For instance, when the ACD is less than 10 kg C/m 2 , the RMSE can be as high as 5.86 kg C/m 2 , or when the ACD is greater than 25 kg C/m 2 , the RMSE can be 5.10 kg C/m 2 ( Table 7). On the other hand, high deforestation rates indeed resulted in significant changes in ACD estimates, implying that the ACD estimates using MODIS data can be used to evaluate the ACD loss due to deforestation when the deforested area reaches one-half of the pixel size of MODIS data, i.e., 250 × 250 m.
Remote Sens. 2020, 12, x FOR PEER REVIEW 17 of 27 Figure 6. Spatial distribution of aboveground carbon density (ACD) estimation using the random forest approach for 2011-2017.

Aboveground Carbon Change Caused by Deforestation
In the Amazon basin, it is very important to know how much ACD could be lost due to deforestation, in addition to understanding whether the MODIS-derived ACD estimates can respond to different deforestation rates. As shown in Table 8, the ACD value dropped rapidly when the deforestation rate reached 44.9% (e.g., plot b) or higher (e.g., 56.3% in plot d and 58.2% in plot e).

Discussion
Various factors could cause uncertainties in the estimation of biomass or carbon stocks [23,49]. Such factors could be the ACD reference data calculated from field measurements using allometric equations; the big regional differences in mean wood density among different tree species; various disturbances such as wind blow-downs, droughts, and deforestation that influence forest species composition and structures; the variables used for ACD modeling; and the relevant algorithms [23,84]. However, some factors such as different wood density and disturbance are often ignored in large-scale estimation of the carbon budget [85,86] because of difficulties in quantifying their impacts on modeling effects. In this section, we mainly discuss the uncertainties caused by data saturation, mixed pixels, cloud contamination, overfitting problems, and selection of data sources and modeling algorithms. We also provide a brief discussion of the advantages and limitations of this study.

Overestimation and Underestimation Problems
In AGB estimation, overestimation and underestimation are common when using Landsat images; that is, overestimation when AGB is relatively small (e.g., less than 40 Mg/ha) and underestimation when AGB is relatively high (e.g., 150 Mg/ha) [87][88][89]. This research also confirmed that ACD estimation has similar problems to MODIS data-overestimation when ACD is less than 10 kg C/m 2 and underestimation when ACD is greater than 25 kg C/m 2 , as shown in Figure 5. The saturation problem in MODIS data is also obvious, as shown in Figure 7, resulting in considerable ACD underestimation. Additionally, the coarse spatial resolution in MODIS data produced a serious mixed pixel problem, especially in the transition areas between forest and deforestation areas, resulting in considerable ACD overestimation. This situation implies the challenge of ACD estimation using MODIS data alone and indicates the necessity to incorporate other data sources in ACD modeling.
In addition to the data saturation and mixed pixel problems, the selection of a suitable modeling algorithm is also needed to reduce overestimation and underestimation problems. The machine learning algorithms often have an overfitting problem, resulting in high estimation uncertainty when the AGB is relatively high (e.g., greater than 120 Mg/ha) or low (e.g., less than 40 Mg/ha), as shown in subtropical AGB modeling using Landsat imagery [87,88]. This research on ACD estimation in a tropical region using MODIS data also confirmed this problem. The overfitting problem is common and cannot be completely eliminated, which is especially serious when the training sample is not large enough. When few samples are used, SVM may provide better modeling results than RF, but when a large number of samples are available, RF can reduce the overfitting problem through boosting and bagging technologies compared to other machine learning algorithms [7,88]. Therefore, it is important to increase the number of training samples. Airborne LiDAR data, especially when an unmanned vehicle is used, provide an important source of quickly and copiously collected samples for model development and validation. As shown in Figure 5, RF has a more serious underestimation problem than LR when ACD is relatively high, but has less of an overestimation problem than LR when ACD is relatively small, implying that modeling performance corresponds to ACD ranges. This situation indicates the need to conduct decision-level fusion based on the estimation results from different modeling approaches to improve estimation accuracy [90].
Remote Sens. 2020, 12, x FOR PEER REVIEW  19 of 27 mixed pixel problem, especially in the transition areas between forest and deforestation areas, resulting in considerable ACD overestimation. This situation implies the challenge of ACD estimation using MODIS data alone and indicates the necessity to incorporate other data sources in ACD modeling. In addition to the data saturation and mixed pixel problems, the selection of a suitable modeling algorithm is also needed to reduce overestimation and underestimation problems. The machine learning algorithms often have an overfitting problem, resulting in high estimation uncertainty when the AGB is relatively high (e.g., greater than 120 Mg/ha) or low (e.g., less than 40 Mg/ha), as shown in subtropical AGB modeling using Landsat imagery [87,88]. This research on ACD estimation in a tropical region using MODIS data also confirmed this problem. The overfitting problem is common and cannot be completely eliminated, which is especially serious when the training sample is not large enough. When few samples are used, SVM may provide better modeling results than RF, but when a large number of samples are available, RF can reduce the overfitting problem through boosting and bagging technologies compared to other machine learning algorithms [7,88]. Therefore, it is important to increase the number of training samples. Airborne LiDAR data, especially when an unmanned vehicle is used, provide an important source of quickly and copiously collected samples for model development and validation. As shown in Figure 5, RF has a more serious underestimation problem than LR when ACD is relatively high, but has less of an overestimation problem than LR when ACD is relatively small, implying that modeling performance corresponds to ACD ranges. This situation indicates the need to conduct decision-level fusion based on the estimation results from different modeling approaches to improve estimation accuracy [90].

Impacts of Cloud Contamination on Modeling Performance
The variation of spectral signatures caused by environmental conditions can considerably influence modeling results, and thus lead to high estimation uncertainties when the model is applied to different years. As illustrated in Figure 8a, visible bands and MIR are seriously influenced by the cloud contamination problem; that is, the cloud-contaminated pixels have higher spectral signature

Impacts of Cloud Contamination on Modeling Performance
The variation of spectral signatures caused by environmental conditions can considerably influence modeling results, and thus lead to high estimation uncertainties when the model is applied to different years. As illustrated in Figure 8a, visible bands and MIR are seriously influenced by the cloud contamination problem; that is, the cloud-contaminated pixels have higher spectral signature values in MODIS visible bands than cloud-free pixels, while the inverse is true for MIR. The NIR, SWIR1, and SWIR2 bands are less influenced by the cloud problem. Because of the different impacts of cloud contamination on spectral signatures, the spectral indices are also influenced to various degrees, as shown in Figure 8b. This problem causes serious ACD underestimation, as shown in Figure 9. The pixels without effects from clouds have similar ACD annual estimations, implying the robustness of this RF-based model for ACD prediction; however, ACD was considerably underestimated in the pixels with cloud impacts. The ACD spatial distribution in Figure 6 confirmed this problem-forests in the north and northeast regions were prone to underestimation in different years due to frequent clouds, even using the yearly MODIS composite images. The cloud problem makes ACD estimation difficult when using optical sensor data, especially in the Amazon basin, which has frequent clouds all year.
robustness of this RF-based model for ACD prediction; however, ACD was considerably underestimated in the pixels with cloud impacts. The ACD spatial distribution in Figure 6 confirmed this problem-forests in the north and northeast regions were prone to underestimation in different years due to frequent clouds, even using the yearly MODIS composite images. The cloud problem makes ACD estimation difficult when using optical sensor data, especially in the Amazon basin, which has frequent clouds all year.  In order to reduce the cloud contamination problem in the MODIS data, the use of satellite radar data such as ALOS PALSAR (Advanced Land Observing Satellite/Phased Array type L-band Synthetic Aperture Radar) L-band and Radarsat C-band data may be an alternative [91,92], however data saturation in these images cannot improve the ACD estimation performance [93]. The long-  robustness of this RF-based model for ACD prediction; however, ACD was considerably underestimated in the pixels with cloud impacts. The ACD spatial distribution in Figure 6 confirmed this problem-forests in the north and northeast regions were prone to underestimation in different years due to frequent clouds, even using the yearly MODIS composite images. The cloud problem makes ACD estimation difficult when using optical sensor data, especially in the Amazon basin, which has frequent clouds all year.  In order to reduce the cloud contamination problem in the MODIS data, the use of satellite radar data such as ALOS PALSAR (Advanced Land Observing Satellite/Phased Array type L-band Synthetic Aperture Radar) L-band and Radarsat C-band data may be an alternative [91,92], however data saturation in these images cannot improve the ACD estimation performance [93]. The long-  In order to reduce the cloud contamination problem in the MODIS data, the use of satellite radar data such as ALOS PALSAR (Advanced Land Observing Satellite/Phased Array type L-band Synthetic Aperture Radar) L-band and Radarsat C-band data may be an alternative [91,92], however data saturation in these images cannot improve the ACD estimation performance [93]. The long-wavelength P-band radar data can reduce the data saturation problem and may be a promising data source for ACD estimation. However, only limited studies have explored the use of simulated P-band data [94]. More research is needed to examine ACD estimation when satellite radar P-band data are available in the near future.

Data Sources and Uncertainties
Various types of spaceborne and airborne sensor data are available, but most belong to optical sensor data. Landsat may be the most common data source for land cover mapping or biomass modeling [7]. However, in the Amazon region, cloud cover is a big problem, resulting in difficulty in collecting cloud-free optical sensor data because of the low revisit period of the Landsat satellite.
Another problem with using Landsat images in a large area such as the Amazon basin is that Landsat images from different dates have differing spectral reflectance values due to the impacts of atmospheric conditions, thus resulting in difficulty developing a biomass or ACD model without sufficient sample plots for each scene of a Landsat image [6]. Use of synthetic aperture radar (SAR) data can avoid the atmospheric-induced image acquisition problem, but at present there are no SAR data for public use at no cost, except the ALOS-1 PALSAR data with 25 m spatial resolution between 2006 and 2011. The daily available MODIS multispectral data can reduce the impacts of cloud problems, and the relatively coarse spatial resolution of MODIS is suitable for large-area studies. Therefore, many previous studies used MODIS data to map biomass distribution in large areas [40][41][42][43].
Some previous studies explored the incorporation of ancillary data, such as digital elevation models and climate data, into MODIS to develop ACD models [42,46], however the poor relationships between ACD and ancillary data, the data quality, and the spatial resolutions among different data sources did not significantly improve ACD modeling performance. In particular, improper incorporation of climate data into remotely sensed data may introduce extra modeling uncertainty due to its very coarse spatial resolution and the data quality problem itself. Special caution should be taken in the selection of proper variables from multiple data sources and relevant modeling algorithms. As spaceborne LiDAR data such as ICESat-2 ATLAS and GEDI have improved quality, incorporation of these data into optical sensor data (e.g., Landsat, MODIS) may further improve ACD modeling performance [33][34][35]95]. More research is needed to explore the integration of multiple data sources and advanced modeling approaches such as deep learning [89,96].
Previous studies have already indicated that the quality of samples used for ACD or biomass modeling is the most important factor inducing high estimation uncertainty [23,84,96]. Most of the time, we cannot do much to improve the quality of reference data, except that we can carefully conduct field measurements of the parameters used for ACD calculation and select proper allometric equations to calculate ACD for specific trees. Therefore, the critical step is to identify proper variables from multiple data sources and modeling approaches to develop the ACD estimation model [7,89]. Through analysis of major factors influencing ACD modeling uncertainty, we can better design an ACD modeling procedure to optimize each step [7]. Therefore, more research is needed to identify major factors influencing ACD estimation uncertainty, although it will be a challenge to quantitatively specify the various contributors.

Implication and Limitation of MODIS-Based ACD Modeling
This research indicates that the RF-based model can estimate ACD annually when we use the selected MODIS-derived variables and that the spatial distribution patterns look reasonable ( Figure 6). Additionally, the MODIS data have advantages in ACD modeling over Landsat images in tropical regions considering the cloud problem and data availability. However, the high uncertainty when ACD is relatively high or low make it difficult to conduct dynamic ACD analysis annually, because the ACD estimation uncertainty may be higher than the real ACD change caused by disturbance or growth between two years, as confirmed in Table 8.
Since time series of Landsat and MODIS data are available, some previous research has explored the possibility of incorporating time series features into forest biomass and ACD dynamics [10,97,98]. However, rarely have studies successfully implemented ACD dynamics, considering the current technological status and limitations of optical sensor data [99]. The key is to considerably reduce the estimation uncertainties so that the ACD estimation error is less than the annual ACD change caused by growth or disturbance.

Conclusions
This research explored the integration of MODIS and LiDAR data for mapping of the ACD spatial distribution in the Brazilian Amazon. A comparative analysis of ACD modeling results using LR and RF based on spectral indices alone and on the combination of spectral indices and spectral bands Remote Sens. 2020, 12, 3330 20 of 25 indicates that overall the RF model based on spectral indices alone provided the best ACD estimation performance, with an R 2 of 0.67 and RMSE of 4.18 kg C/m 2 . However, both LR and RF have different advantages and disadvantages; RF has better estimation performance than LR when the ACD is relatively small, such as less than 15 kg C/m 2 , while LR has slightly better extrapolation ability than RF, especially when the ACD is greater than 25 kg C/m 2 . Cloud contamination and data saturation are the major factors resulting in ACD underestimation, while the mixed pixel problem due to coarse spatial resolution is the major factor resulting in overestimation. Overfitting is another factor resulting in ACD over-or underestimation problems. More research is needed to reduce these problems by incorporating other data sources, such as canopy height features from the ICESat-2 ATLAS and GEDI data and ancillary data such as DEM and climate information, as well as by increasing the number of samples for modeling, especially the number of samples for very small or very high ACD values. Use of LiDAR data, especially by extensively using unmanned aerial vehicles in the near future, is an effective way to considerably increase the number of samples for modeling and validation. Use of satellite radar P-band data in the near future will solve the cloud problem in the Amazon Basin and may play an important role in improving ACD estimation.