Mapping Forest Aboveground Biomass Using Multisource Remotely Sensed Data

: The majority of the aboveground biomass on the Earth’s land surface is stored in forests. Thus, forest biomass plays a critical role in the global carbon cycle. Yet accurate estimate of forest aboveground biomass (FAGB) remains elusive. This study proposed a new conceptual model to map FAGB using remotely sensed data from multiple sensors. The conceptual model, which provides guidance for selecting remotely sensed data, is based on the principle of estimating FAGB on the ground using allometry, which needs species, diameter at breast height (DBH), and tree height as inputs. Based on the conceptual model, we used multiseasonal Landsat images to provide information about species composition for the forests in the study area, LiDAR data for canopy height, and the image texture and image texture ratio at two spatial resolutions for tree crown size, which is related to DBH. Moreover, we added RaDAR data to provide canopy volume information to the model. All the data layers were fed to a Random Forest (RF) regression model. The study was carried out in eastern North Carolina. We used biomass from the USFS Forest Inventory and Analysis plots to train and test the model performance. The best model achieved an R 2 of 0.625 with a root mean squared error (RMSE) of 18.8 Mg/ha (47.6%) with the “out-of-bag” samples at 30 × 30 m spatial resolution. The top ﬁve most important variables include the 95th, 85th, 75th, and 50th percentile heights of the LiDAR points and their standard deviations of 85th heights. Numerous features from multiseasonal Sentinel-1 C-Band SAR, multiseasonal Landsat 8 imagery along with image texture features from very high-resolution imagery were selected. But the importance of the height metrics dwarfed all other variables. More tests of the conceptual model in places with a broader range of biomass and more diverse species composition are needed to evaluate the importance of other input variables. considered at each split within a tree, was tested to produce the accuracy. By testing every possible value of ‘mtry’ for the highest model accuracy, we ﬁnd that the model leads to the most accurate prediction when ‘mtry’ is set to ﬁve.


Importance of Forests in Global Carbon Cycle
Forests provide essential ecosystem goods and services upon which human welfare depends. Removing CO 2 from the atmosphere through photosynthesis and storing the carbon as organic matter are among the most critical services forest ecosystems provide. CO 2 in the atmosphere is the major greenhouse gas that causes global warming [1]. The current concentration of CO 2 in the atmosphere is the highest in the past 800,000 years [2]. Since the Industrial Revolution, the extra CO 2 released into the atmosphere has contributed to 2/3 of the extra energy that all greenhouse gases have trapped [3]. The majority of the increased CO 2 in the atmosphere comes from fossil fuel burning, while land-use change, primarily deforestation, also contributed a significant amount of CO 2 to the atmosphere [4,5]. Reducing emissions from deforestation and forest degradation, plus sustainable forest management, conservation, and enhancement of forest carbon stocks (REDD+), has been recognized as a major mechanism for global warming mitigation (http://un-redd.org, accessed on 5 February 2022). REDD+ projects are primarily implemented in developing countries in the tropical region, with funding donated from developed countries. However, deforestation and forest degradation can happen anywhere in the world as a result of both natural and anthropogenic disturbances. In contrast, forest growth could significantly offset the carbon emissions from fossil fuel burning. Removing CO 2 from the atmosphere by forests is one of the most economically efficient nature-based solutions for global warming mitigation [6]. In fact, the vast majority of the Earth's aboveground carbon is stored in forests as biomass. Therefore, forests play a critical role in the global carbon cycle. According to the most recent accounting, we still cannot balance the global carbon budget [7]. One of the critical limitations in balancing the global carbon budget is the lack of accurate information about forest biomass and its dynamics over time. Without accurate forest biomass information, we do not know the amount of carbohydrates produced in photosynthesis and accumulated through time in the forests. Consequently, we do not know how much carbon is released from the forest ecosystem due to natural and anthropogenic disturbances, such as deforestation from timber harvesting or shifting agriculture, wildfires, forest destruction from hurricanes, mass mortality from droughts [8][9][10] or insect infestations [11]. Therefore, knowledge about where forest biomass is located and how it is changing with time is vital for global carbon cycle science and, consequently, global climate change in the future.

Biomass Estimation on the Ground
Forest biomass is the total dry weight of all live parts of every tree in a unit area, including the parts both aboveground (leaves, branches, and stems) and belowground (fine and coarse roots) [12]. Forest Aboveground Biomass (FAGB) is the total forest biomass minus the belowground components, which have to be estimated through excavation [13]. FAGB is relatively easy to obtain and thus is often sought after instead of the total biomass that includes both above-and belowground components. The most reliable approach to estimate biomass on the ground is through destructive sampling, in which standard trees are cut, and their biomass is estimated from drying samples of the fresh components (e.g., leaves, branches, and stems). Then a species-specific allometric relationship of biomass is derived based on the tree height and diameter at breast height (DBH). The allometry is then applied to all the standing trees to estimate their biomass by species within a sampling plot. Summarizing the biomass from all trees within the sampling plot provides an areabased biomass estimation [12]. Tremendous efforts had been dedicated to developing the species-specific allometric relationships for forest biomass [14,15]. These allometric relationships can be re-used in the place where they were developed. Despite the availability of allometric equations for biomass for nearly all the major species in the U.S. and possibly most countries in the world, they cannot be directly used over large areas as we cannot obtain DBH, height, and species continuously in space for each tree. Although the existing allometric relationships for biomass cannot be used with remote sensing data, the variables used for estimating biomass with allometry should guide the selection of remotely sensed features for accurate biomass estimation, i.e., we need information related to species, DBH, and height to estimate forest biomass accurately over space and time. Many forest biomass products exist [16][17][18], but they are not produced based on such a principle. Consequently, accurate estimation of forest biomass over large areas remains elusive [19].

Challenges in Estimating Forest Aboveground Biomass with Remote Sensing
Remotely sensed data from nearly all types of sensors have been used to estimate FAGB because remotely sensed data can be used to derive a biomass surface. Optical remotely sensed data can be used to effectively estimate the gross primary production of the terrestrial ecosystem [20][21][22]. Gross primary production is the total carbon flux from the atmosphere to the terrestrial ecosystem through photosynthesis. A large portion of gross primary production is consumed by vegetation via autotrophic respiration [23]. The balance of gross primary production after autotrophic respiration is the net primary production, which shows up in plant growth. Not all the net primary production is accumulated through time due to litterfall and mortality [24]. Disturbances from both natural and anthropogenic origins also significantly alter FAGB almost instantaneously. Thus, the dynamics of FAGB are jointly controlled by forest growth and disturbances [25]. Forest biomass dynamics through time are critical information for future global carbon cycle prediction.
Nearly all the optical remotely sensed data currently available had been used for biomass mapping [19]. Spectral reflectance, transformed spectral reflectance (e.g., principal components, Tasseled Cap components, and vegetation indices), and the spatial information (e.g., image texture) have been used as independent variables for biomass estimation [26]. Song [12] identified three primary approaches used to estimate biomass from optical images, including k-nearest neighbor imputation, multiple regression, and machine learning algorithms. More recently, Pflugmacher et al. [27], took advantage of the long-term record of Landsat imagery and developed an algorithm to map FAGB using disturbance and recovery history.
Although tremendous efforts have been dedicated to mapping FAGB with remotely sensed data, the margin of error remains too big to help close the global carbon budget gap. The major challenge in mapping FAGB is that there is no direct remotely sensed biomass signal, unlike the leaf area index. We can only estimate biomass through remotely sensed signals that are correlated to biomass. For example, the leaf area index was used to estimate biomass [28]. However, these signals only work when the biomass is low. Forest leaf area index reaches its asymptote early in its successional stage [29], while the biomass of a forest continues to increase for centuries [30]. Remotely sensed signals that are correlated to forest aboveground biomass, such as vegetation indices or surface reflectance in a particular wavelength, saturate when biomass reaches a threshold of 100-200 mg/ha. Signal saturation from optical and RaDAR sensors is the primary reason why remote sensing-based approaches mapping FAGB do not work well when the FAGB is beyond that threshold from these sensors [19,[31][32][33].
Similarly, extensive research had been conducted on mapping FAGB with RaDAR data, recently reviewed by [19]. There are consistent findings that the backscatter intensity from the longer wavelength (L-or P-band) is better correlated with FAGB than that from the shorter wavelength (C-band) because the longer wavelength RaDAR waves penetrate deeper into the canopy [34][35][36][37], and the cross-polarization (HV or VH) backscatters are more sensitive to biomass than the co-polarization returns (HH or VV) [38,39]. However, a major obstacle for mapping biomass with RaDAR imagery remains-signal saturation, i.e., the remotely sensed signal no longer changes after the aboveground biomass reaches 100-200 Mg/ha [19,32].
LiDAR is a relatively new revolutionary technology that provides information about the vertical structure of vegetation [40]. LiDAR data can be obtained from two types of sensors, discrete-return LiDAR and full-return (also known as waveform) LiDAR [41]. Discrete-return LiDAR provides one or a small number of major signal peaks from the objects, while full-return LiDAR records the returns of the photons continuously along the laser illumination path. Waveform LiDAR provides more detailed information on vegetation structure compared with discrete LiDAR. Most of the LiDAR data currently available were collected by airborne sensors [42][43][44], although data from ICESat were used to map FAGB [45]. Hyde et al. [46] compared LiDAR and RaDAR data in mapping forest biomass, and they found that the performance of LiDAR far exceeds that of RaDAR. Although forest height derived from LiDAR data does not suffer from signal saturation problem, no universal relationship between height and biomass exists due to variation of species [42]. Therefore, variables related to DBH and species from other sensors are needed for accurate biomass mapping [47].
Although many aboveground biomass products had been produced, such as the National Biomass and Carbon Dataset 2000 for the U.S. [16,48] and tropical biomass [45,49,50], these biomass maps are usually made as a snapshot of biomass at a particular time or with coarse spatial resolution. Moreover, there is significant room for accuracy improvement. For example, the U.S. aboveground biomass produced by Blackard et al. [16] only has a correlation coefficient of 0.31 with the FIA biomass in the southern region of the U.S. As a result, there are large discrepancies between biomass maps [49]. Moreover, none of the existing biomass maps can be reproduced in a timely manner with sufficient spatial detail to monitor the dynamics of FAGB change as a result of natural and anthropogenic disturbances. The objective of this study is to develop a new algorithm to map FAGB using remotely sensed data from multiple sensors that are related to forest height, DBH, and species, including LiDAR for canopy height, multispectral for vegetation composition, RaDAR for canopy structure, and texture from high-resolution optical images. Image texture from high-resolution optical images is highly correlated to tree crown size, which in turn relates to DBH [24,51]. These remotely sensed data capture information about height, DBH, and species composition for the forests. The new algorithm has the potential to overcome the limitations in existing FAGB mapping algorithms, enhancing the accuracy of FAGB mapping.

Study Area
The study area is located in eastern North Carolina (NC), USA, spanning approximately 23,000 km 2 covered by a Landsat scene of WGS path = 15, row = 36 ( Figure 1). This region belongs to the Southern Coastal Plain, with most vegetation being classified as coniferous forest, dominated by Loblolly pine (Pinus taeda). The majority of counties within this region are over 70% timberland by area, making it an important forest resource. Being most severely hit by Hurricane Florence in 2018, we selected this region to pave the way for an eventual estimation of the impact of the Hurricane on forest biomass. We currently focus on pre-hurricane forest biomass estimation as the post hurricane LiDAR is not yet available.

Forest Inventory and Analysis Data
We used FAGB derived from U.S. Forest Service Forest Inventory and Analysis (FIA) data. The FIA program collects, analyzes, and reports the status of American Forests for all 50 states of the U.S. and its territories and possessions (http://fia.fs.fed.us, accessed on 5 February 2022). The FIA program designed a national hexagon grid covering 50 states with each hexagon spanning 5937 acres (~2404 ha) in area and contain a randomly located plot [52]. Each FIA plot consists of four 7.3 m radius subplots, with one located in the plot center, and the centers of the other three subplots are 120 ft (36 m) away from the center plot distributed 120 degrees from each other [53]. Because our LiDAR data were collected in 2014, plots that were sampled in 2013, 2014, and 2015 within the study area were used as the reference for model development and evaluation. The sum of the aboveground biomass for all trees within a sampling plot with DBH greater than 5 inches (i.e., 12.7 cm) makes up the total FAGB for the plot. We selected the FIA sampling plots in this study that fall into the forest land cover based on the USGS National Land Cover Dataset [54] with aboveground forest biomass carbon greater than zero and the 95th percentile height greater than ten feet (~3 m). After applying these exclusion criteria, the remaining 227 plots in the study area are available for model training and testing. To ensure the privacy of private landowners and to protect plots from vandalism, the locations of the plots within the publicly available FIA database are fuzzed and swapped. To overcover the confidentiality of the accurate FIA plot locations, our USFS collaborators helped extract the multisource remotely sensed data for plot locations using the accurate plot coordinates and then provided us a table that contains the plot attributes (including biomass) and the extracted remotely sensed data without the accurate plot coordinates to the rest of the research team. This essentially provided this study with access to the data with their precise locations without anyone outside of USFS accessing the confidential information [51].

Forest Inventory and Analysis Data
We used FAGB derived from U.S. Forest Service Forest Inventory and Analysis (FIA) data. The FIA program collects, analyzes, and reports the status of American Forests for all 50 states of the U.S. and its territories and possessions (http://fia.fs.fed.us, accessed on 5 February 2022). The FIA program designed a national hexagon grid covering 50 states with each hexagon spanning 5937 acres (~2404 ha) in area and contain a randomly located plot [52]. Each FIA plot consists of four 7.3 m radius subplots, with one located in the plot center, and the centers of the other three subplots are 120 ft (36 m) away from the center plot distributed 120 degrees from each other [53]. Because our LiDAR data were collected in 2014, plots that were sampled in 2013, 2014, and 2015 within the study area were used as the reference for model development and evaluation. The sum of the aboveground biomass for all trees within a sampling plot with DBH greater than 5 inches (i.e., 12.7 cm) makes up the total FAGB for the plot. We selected the FIA sampling plots in this study that fall into the forest land cover based on the USGS National Land Cover Dataset [54] with aboveground forest biomass carbon greater than zero and the 95th percentile height greater than ten feet (~3 m). After applying these exclusion criteria, the remaining 227 plots in the study area are available for model training and testing.
To ensure the privacy of private landowners and to protect plots from vandalism, the locations of the plots within the publicly available FIA database are fuzzed and swapped. To overcover the confidentiality of the accurate FIA plot locations, our USFS collaborators helped extract the multisource remotely sensed data for plot locations using the accurate plot coordinates and then provided us a table that contains the plot attributes (including biomass) and the extracted remotely sensed data without the accurate plot coordinates to the rest of the research team. This essentially provided this study with access to the data

LiDAR Data
We used airborne LiDAR to provide canopy height information. The statewide discrete return LiDAR data used in this study were collected as part of a joint project by the NC Risk Management Office, NC Department of Transportation, and other state collaborators (available for download at https://sdd.nc.gov/, accessed on 5 February 2022). The LiDAR data were collected in the Spring of 2014 in the leaf-off condition with no snow on the ground. The data were collected using a combination of airborne Leica ALS70HP-II and Optech Pegasus HA500 sensors at a resolution of 2 points per square meter. The NC Risk Management Office provided the LiDAR data in the LAS version 1.3 standard format with associated 5-foot resolution DEMs in a 5000-foot by 5000-foot tiling scheme or~1.5 × 1.5 km.
Using the 5-foot (~1.5 m) DEMs, the LiDAR data were first normalized to the ground, i.e., the height measurements in the LiDAR data are heights from the ground. The normalized LiDAR data were then used to calculate a set of height metrics, each representing height at 25, 50, 75, 85, and 95% of points within the 30 m spatial resolution. These height metrics represent the vertical canopy structure, which are used as variables associated with FAGB. The percentiles used for analysis were chosen to characterize the upper canopy containing the majority of biomass and to eliminate extreme height values due to sensor errors and undesirable interactions.

RaDAR Data
The European Space Agency's (ESA) Sentinel-1 mission comprises two polar-orbiting satellites with a relative azimuth angle of 180 • from each other, Sentinel-1A and Sentinel-1B, each of which is equipped with the Synthetic Aperture Radar (SAR) instrument for collecting surface backscatter in the C-band in multiple modes and in dual-polarization (VV + VH, HH + HV). The use of the C-band allows the acquisition of imagery day and night and in all weather conditions. Two Sentinel-1 scenes were acquired for this study, one dated 3 March 2015, representing winter leaf-off conditions, and one dated 18 August 2015, representing summer leaf-on conditions. The Sentinel-1 scenes were retrieved as the Level-1 High Resolution (roughly 10-m) Ground Range Detected (GRD) Interferometric Wide (IW) Swath mode data product in the VV and VH polarizations from ESA's Copernicus Open Access Hub (https://scihub.copernicus.eu/, accessed on 5 February 2022).
A standard Sentinel-1 GRD preprocessing workflow was applied to process the images acquired for this study. All Sentinel-1 preprocessing was conducted in the ESA's own Sentinel Applications Platform (SNAP). Accurate orbit information was applied, thermal noise was removed, digital pixel values were converted into radiometrically calibrated SAR backscatter, a Lee-Sigma speckle filter was applied, range-doppler terrain correction was performed, and the unitless backscatter was converted to backscatter coefficient in decibels using a logarithmic transformation. Border noise removal performed through SNAP does not adequately work on Sentinel-1 images captured before March 2018. As a result, border noise was manually cropped from the images where needed.

Multispectral Data
Landsat 8 Analysis Ready Data (ARD) data, retrieved from the USGS Earth Explorer data portal (https://earthexplorer.usgs.gov/, accessed on 5 February 2022), were used in this study. We used two images representing the leaf-on and leaf-off conditions similar to the RaDAR data. These images capture species composition information for image classification [55,56]. The leaf-off condition image was acquired on 6 February 2015, and the leaf-on condition image was on 30 June 2015. Both images were almost cloud-free, but any clouded or nonclear pixels were masked out using the pixel quality assessment band. The bands used for this analysis include Band 2 (Visible Blue), Band 3 (Visible Green), Band 4 (Visible Red), Band 5 (Near-Infrared), Band 6 (Shortwave Infrared 1), and Band 7 (Shortwave Infrared 2). We first conducted the Tasseled Cap (TC) transformation for each image and used the brightness, greenness, and wetness components in the biomass model, instead of the six original bands. Three vegetation indices were further derived, including Normalized Difference Vegetation Index (NDVI, Equation (1)), Enhanced Vegetation Index (EVI, Equation (2)), and Structural Index (SI, Equation (3)) [57]. These vegetation indices, along with the TC brightness, greenness, and wetness, represent the data input from the optical sensor.

Very High-Resolution Optical Imagery
The Very-High-Resolution (VHR) imagery for this study was derived from the National Agriculture Imagery Program (NAIP) under the USDA Farm Service Agency (https://www. fsa.usda.gov/programs-and-services/aerial-photography/, accessed on 5 February 2022). The VHR imagery provided image texture for this study. We used the 4-band visible-infrared aerial imagery collected in the Spring of 2014 with a 1-m spatial resolution to cover the study area. All NAIP imagery was acquired as compressed county mosaics from the USDA's geospatial data gateway (https://datagateway.nrcs.usda.gov/GDGHome_ DirectDownLoad.aspx, accessed on 5 February 2022).
We first conducted a principal component transformation of the NAIP imagery and selected the first principal component, which represents a brightness feature that contains information from all bands. The first principal component contains more information in the image than any of the original bands or other principal components. Second, the first principal component image was further resampled to 2 m and 3 m spatial resolutions, and the local texture was calculated for the 1, 2 and 3 m spatial resolution first principal component images. Third, the local variance calculated at 1, 2 and 3 m spatial resolutions were downscaled to 30 m spatial resolution with simple averaging. Fourth, we calculated the ratio (i.e., Equation (4)) of degraded image texture at 30 m spatial resolution with initial image texture derived at the 2 m spatial resolution to that at the 3 m spatial resolution because this ratio correlates with mean stand crown diameter, which in turn strongly relates to DBH [51,58].
where T 2×2 is the image texture calculated at 2 m spatial resolution and downscaled to 30 m spatial resolution with simple average; T 3×3 is the image texture calculated at 3 m spatial resolution and downscaled to 30 m spatial resolution with simple average. R 2/3 is the ratio of T 2×2 to T 3×3 . Eventually, we derived two layers of spatial data both at 30 m spatial resolution as the model inputs, i.e., the image texture initially calculated at the 1 m spatial resolution, and the ratio of image variance at 2 m to that of 3 m.

Biomass Model with Random Forest
We developed our biomass model based on Random Forest (RF), a machine learning algorithm that can be used for both classification and regression [59]. A "tree" in the "forest" is established from a bootstrapped sample from the reference data with the remaining reference data left "out-of-bag", which are used to derive the relative importance of each of the input features, making it easy to compare predictor variables and determine the most useful set for regression. RF is not as susceptible to overfitting as an individual tree [59,60], and it is especially effective in capturing nonlinearity and interactions among the predictive variables [61]. In addition, RF models are relatively robust to noise in training data. For this implementation of random forest, variable importance is measured as the percent increase in mean squared error that results in the exclusion of the given variable. The general workflow is shown in Figure 2.
We tested many RF models, each with a different set of independent variable combinations. Each model run produces a degree of fitness, i.e., R 2 , and the root mean squared error (RMSE) based on the "out-of-bag" samples. We selected the model that predicts the aboveground biomass with the highest R 2 and smallest RMSE. Finally, we evaluated the selected RF model with a predicted FAGB map and the biomass from all FIA plots within the study area.  [59,60], and it is especially effective in capturing nonlinearity and interactions among t predictive variables [61]. In addition, RF models are relatively robust to noise in traini data. For this implementation of random forest, variable importance is measured as t percent increase in mean squared error that results in the exclusion of the given variab The general workflow is shown in Figure 2. We tested many RF models, each with a different set of independent variable comb nations. Each model run produces a degree of fitness, i.e., R 2 , and the root mean squar error (RMSE) based on the "out-of-bag" samples. We selected the model that predicts t aboveground biomass with the highest R 2 and smallest RMSE. Finally, we evaluated t selected RF model with a predicted FAGB map and the biomass from all FIA plots with the study area. Table 1 lists all remote sensing features derived from LiDAR, RaDAR (Sentinelmultispectral (Landsat 8), and very high-resolution (NAIP) data that were used for bi mass mapping with RF. Due to the large number of feature variables, we implemented automatic feature selection with KnowGRRF, which is an R package developed by Gu and Liu [62] for statistical computation, to identify the most important remotely sens predictors for biomass estimation. In order to use KnowGRRF, we first need to generate regularization coefficient for each variable by running the RF to map FAGB with all va ables included. The regularization coefficient for each variable is calculated using Equ tion (5) based on the variable importance value from the initial RF run.
where Cj is the regularization coefficient for the jth variable; Ij is the importance value jth variable generated by the initial RF model, and Imax is the maximum importance val from the initial RF model. After we generated the regularization coefficient for each va able, we used KnowGRRF to eliminate the least important variable following a stepwi model based on the Akaike Information Criterion (AIC). Due to the randomness of boo strapping in RF, there will be some variations in the outputs from different model run To produce a robust feature selection, the KnowGRRF was run 100 times. Based on t  Table 1 lists all remote sensing features derived from LiDAR, RaDAR (Sentinel-1), multispectral (Landsat 8), and very high-resolution (NAIP) data that were used for biomass mapping with RF. Due to the large number of feature variables, we implemented an automatic feature selection with KnowGRRF, which is an R package developed by Guan and Liu [62] for statistical computation, to identify the most important remotely sensed predictors for biomass estimation. In order to use KnowGRRF, we first need to generate a regularization coefficient for each variable by running the RF to map FAGB with all variables included. The regularization coefficient for each variable is calculated using Equation (5) based on the variable importance value from the initial RF run.
where C j is the regularization coefficient for the jth variable; I j is the importance value of jth variable generated by the initial RF model, and I max is the maximum importance value from the initial RF model. After we generated the regularization coefficient for each variable, we used KnowGRRF to eliminate the least important variable following a stepwise model based on the Akaike Information Criterion (AIC). Due to the randomness of bootstrapping in RF, there will be some variations in the outputs from different model runs. To produce a robust feature selection, the KnowGRRF was run 100 times. Based on the frequency of feature selection in the 100 runs, the variables were added sequentially from the most frequently selected ones and gradually adding the less frequently selected variables from the 100 model runs, and the final set of features was selected based on the smallest AIC values [62]. We also tested the effect of the sample size on the robustness of the model output. We tested the final model with two-thirds, half, and one-third of the FIA biomass data. We ran the model 50 times for each subset of the samples, and each time we randomly selected the desired subset of plot samples as input to the model. We analyzed the out-of-bag R 2 and RMSE from these model outputs to evaluate the model performance. In addition to automatic feature selection using KnowGRRF for the final biomass mapping model, we tested the performance of various subsets of the remotely sensed data with RF. Sub-datasets included data from each sensor, i.e., LiDAR, RaDAR, multispectral, and very-high resolution aerial imagery, as well as the combination of the RaDAR and multispectral data split by the season they were acquired. The overall performance of each model was evaluated based on selected features that maximized accuracy while reducing the overall number of inputs. The number of trees contained in each RF model was set to 300, sufficiently high to allow all input rows to be used multiple times in model development and to produce a stable output.

Results
The automatic feature selection R package, KnowGRRF, selected 18 variables from Table 1 based on AIC. Figure 3 shows the importance order of these 18 variables. The top five most important variables are all from the lidar height metrics, including ZQ95, ZQ85, ZQ75, ZQ50, and SD_ZQ85. In addition, ZQ25 and SD_ZQ50 from the LiDAR sensor also made it to the model, although they were not as important. For multispectral remotely sensed features from Landsat 8, the selected features include G_summer, SI_winter, EVI_winter, NDVI_summer, and B_summer. Thus, the multiseasonal optical images are valuable for biomass mapping. For features from the RaDAR sensor, VH_winter, VV_winter, SD_VV_winter, and SD_VH_summer were selected into the model. Moreover, VH_winter is the most important variable not from the LiDAR sensor in the model. It is understandable why VH_winter, not VH_summer, was selected because the short wavelength C-band from Sentinel-1 cannot penetrate much into the canopy in the summer when the forest canopies are dense. Variables from the VHR image include T 1×1 and SD_T 1×1 , which are related to tree crown size. Therefore, all four types of data made it into the final model, indicating the validity of our conceptual model.  Figure 3 shows the importance order of these 18 variables. five most important variables are all from the lidar height metrics, including ZQ9 ZQ75, ZQ50, and SD_ZQ85. In addition, ZQ25 and SD_ZQ50 from the LiDAR sen made it to the model, although they were not as important. For multispectral r sensed features from Landsat 8, the selected features include G_summer, SI EVI_winter, NDVI_summer, and B_summer. Thus, the multiseasonal optical im valuable for biomass mapping. For features from the RaDAR sensor, VH_winter, V ter, SD_VV_winter, and SD_VH_summer were selected into the model. M VH_winter is the most important variable not from the LiDAR sensor in the mo understandable why VH_winter, not VH_summer, was selected because the sho length C-band from Sentinel-1 cannot penetrate much into the canopy in the when the forest canopies are dense. Variables from the VHR image include SD_T1×1, which are related to tree crown size. Therefore, all four types of data mad the final model, indicating the validity of our conceptual model.  The variable importance values provided by RF (Figure 3) indicate the top four most important variables are the ZQ95, ZQ85, ZQ75, and ZQ50 canopy height of total LiDAR returns, followed by SD_ZQ85 of LiDAR points. Among the height metrics, the height of the 95th percentile LiDAR points is the most important height metric. The subsequent highly important feature after the height metrics is RaDAR backscatter intensity from the VH polarization in the wintertime. Although VV_winter and SD_VV_winter also enter the model, they are not important as VH_winter. Summer backscatter intensity do not enter the model, but only SD_VH_summer is selected, ranking last in its importance among the variables selected. The features from the Landsat 8 imagery include G_summer, SI_winter, EVI_winter, NDVI_summer, and B_summer in order of importance. Variables selected from NAIP include T 1×1 and SD_T 1×1 with relatively low importance ranking. Surprisingly R 2/3 do not enter the model. This may be because of the relatively low aboveground forest biomass (Figure 4) with relatively small trees. Thus, the spatial information is not very helpful in mapping the biomass. For this model, the parameter 'mtry', the number of variables considered at each split within a tree, was tested to produce the highest accuracy. By testing every possible value of 'mtry' for the highest model accuracy, we find that the model leads to the most accurate prediction when 'mtry' is set to five.
among the variables selected. The features from the Landsat 8 imagery include G mer, SI_winter, EVI_winter, NDVI_summer, and B_summer in order of importan iables selected from NAIP include T1×1 and SD_T1×1 with relatively low importanc ing. Surprisingly R2/3 do not enter the model. This may be because of the relativ aboveground forest biomass (Figure 4) with relatively small trees. Thus, the spatia mation is not very helpful in mapping the biomass. For this model, the parameter the number of variables considered at each split within a tree, was tested to prod highest accuracy. By testing every possible value of 'mtry' for the highest model ac we find that the model leads to the most accurate prediction when 'mtry' is set to Figure 4. The scatter plot of predicted forest aboveground biomass with that derived from F The selected model yields an R 2 of 0.625 with an RMSE of 18.8 Mg/ha (47.6% on the "out-of-bag" samples. When using the selected model to produce the bioma and validated with the biomass from all the FIA plots, the R 2 increased to 0.64 w regression RMSE = 10.8 Mg/ha (Figure 4), which is smaller than the RMSE based "out-of-bag" samples. The R 2 for the whole sample was slightly higher than the bag sample because each plot would be "seen" by some of the trees in the RF, inc the R 2 . The out-of-bag samples are not seen by any of the "trees" in the training p thus, it is a more rigorous validation. Figure 4 shows that the predicted biomass FIA biomass distribute well along the 1:1 line. There is a slight overestimation of b for plots with low biomass and an underestimation for plots with high biomass.
To understand the contribution of each sensor to biomass mapping, we teste spective predictive power with the data ( Table 2). When all variables are includ model performance is not as good as the parsimonious model after eliminating important variables. Among all the different types of data, the performance of f The selected model yields an R 2 of 0.625 with an RMSE of 18.8 Mg/ha (47.6%) based on the "out-of-bag" samples. When using the selected model to produce the biomass map and validated with the biomass from all the FIA plots, the R 2 increased to 0.64 with the regression RMSE = 10.8 Mg/ha (Figure 4), which is smaller than the RMSE based on the "out-of-bag" samples. The R 2 for the whole sample was slightly higher than the out-of-bag sample because each plot would be "seen" by some of the trees in the RF, increasing the R 2 . The out-of-bag samples are not seen by any of the "trees" in the training process; thus, it is a more rigorous validation. Figure 4 shows that the predicted biomass and the FIA biomass distribute well along the 1:1 line. There is a slight overestimation of biomass for plots with low biomass and an underestimation for plots with high biomass.
To understand the contribution of each sensor to biomass mapping, we tested its respective predictive power with the data ( Table 2). When all variables are included, the model performance is not as good as the parsimonious model after eliminating the less important variables. Among all the different types of data, the performance of features from the LiDAR sensor dwarfed all other sensors. It is not surprising that the top five most important variables are all from the LiDAR sensor. The performances of multiseasonal features from RaDAR and multispectral sensors are similar, with an R 2 of 0.065. It is interesting to notice that the Tasseled Cap transformation components have much higher R 2 . Vegetation indices alone performed much poorer than the Tasseled Cap components. The effects of sample size with our final model on mapping FAGB is shown in Table 3, which shows basic statistics of the 50 model runs for each subset of the samples. As expected, the models with the smaller subsamples have a lower R 2 , larger variation in R 2 among different model runs, and consequently produced a larger out-of-bag RMSE and larger variation in RMSE. As demonstrated in Table 3, the model is very robust to the variation in sample sizes. The out-of-bag R 2 decreased only 0.044 using only one-third of the whole sample, and the RMSE increased by 1.0 Mg/ha. Therefore, we can safely say that our model results are robust. Using the selected RF model, we produced the forest aboveground biomass map as shown in Figure 5, on which we masked out nonforest areas based on the 2016 land-cover map from the National Land Cover Dataset [54]. The area is dominated by agricultural land, and we do not see a large expanse of land with high biomass. We can see that the highest biomass is distributed in the riparian zones, partly because these areas are strictly protected from logging and partly because the abundance of nutrients and water that favorably supports forest growth over the areas that are further away from the riparian zones.

Discussion
This study used remotely sensed data from multiple sensors, including Landsat 8, airborne LiDAR, Sentinel-1 RaDAR, and very high-resolution optical imagery from the National Agriculture Imagery Program. Multiseasonal Landsat 8 imagery was intended to capture the species variation in space; LiDAR height metrics to describe vertical canopy structure; Sentinel-1 RaDAR backscatter in the C-band in dual-polarization (VV and VH) for both winter and summer seasons to capture the canopy volume; the image texture from very high-resolution optical imagery to provide texture measures of forest canopies as well as the ratio of the image texture of 2 m to that of 3 m spatial resolution. However, the importance ranking information provided by the RF model showed the height metrics dominate all other variables. Moreover, it is not the height at a single energy level, but the entire height profile. This finding is different from a previous study that found the height of the median energy, i.e., the height where the 50th percentile of LiDAR data points from the ground, was the most important variable [42]. The forests in the study area here are primarily Loblolly pine stands with relatively simple canopy structures, making ZQ95 the most important c anopy height metric in biomass mapping, but heights at other energy levels are also important for biomass mapping. The height metrics alone capture 59.5% of the variance in biomass, compared to 62.5% from the full model. The importance of canopy height in biomass mapping is consistent with findings in the literature [41,42,[63][64][65].
The performance of our model is equivalent to other studies using RF with multiple sources of data mapping biomass [63,66]. Despite the robust model performance, there remains some systemic bias in the biomass estimation, i.e., a slight overestimation at the lower end and an underestimation at the higher end of biomass. Such a bias pattern is a common issue for remotely sensed biomass estimation in other studies [50,66].

Discussion
This study used remotely sensed data from multiple sensors, including Landsat 8, airborne LiDAR, Sentinel-1 RaDAR, and very high-resolution optical imagery from the National Agriculture Imagery Program. Multiseasonal Landsat 8 imagery was intended to capture the species variation in space; LiDAR height metrics to describe vertical canopy structure; Sentinel-1 RaDAR backscatter in the C-band in dual-polarization (VV and VH) for both winter and summer seasons to capture the canopy volume; the image texture from very high-resolution optical imagery to provide texture measures of forest canopies as well as the ratio of the image texture of 2 m to that of 3 m spatial resolution. However, the importance ranking information provided by the RF model showed the height metrics dominate all other variables. Moreover, it is not the height at a single energy level, but the entire height profile. This finding is different from a previous study that found the height of the median energy, i.e., the height where the 50th percentile of LiDAR data points from the ground, was the most important variable [42]. The forests in the study area here are primarily Loblolly pine stands with relatively simple canopy structures, making ZQ95 the most important canopy height metric in biomass mapping, but heights at other energy levels are also important for biomass mapping. The height metrics alone capture 59.5% of the variance in biomass, compared to 62.5% from the full model. The importance of canopy height in biomass mapping is consistent with findings in the literature [41,42,[63][64][65]. The performance of our model is equivalent to other studies using RF with multiple sources of data mapping biomass [63,66]. Despite the robust model performance, there remains some systemic bias in the biomass estimation, i.e., a slight overestimation at the lower end and an underestimation at the higher end of biomass. Such a bias pattern is a common issue for remotely sensed biomass estimation in other studies [50,66].
Unlike other land surface biophysical parameters, such as leaf area or canopy height, that can be measured directly from remotely sensed signals, there is no direct remotely sensed signal for biomass [32]. Remote sensing only measures biomass proxies that are imperfectly related to biomass, such as canopy height and species composition. The leaf area index in the canopy had been used to measure biomass fairly effectively [28]. However, the leaf area index of forest canopies saturates within a couple of decades, whereas forest biomass can continue to increase for centuries [24]. Among the biomass proxy variables measured remotely, the relationship of vegetation canopy height does not saturate with FAGB. Therefore, canopy height from LiDAR sensors has proved to be the most important variable for biomass mapping [67]. However, the height-biomass relationship is speciesdependent, as is a well-known fact in allometry [14,15,68]. Although reports used optical sensors alone that produced reliable biomass maps, these studies are generally based on coarse resolution remotely sensed data [50,69,70]. The relative success of using the optical sensors in mapping forest biomass over a large area using coarse resolution imagery is primarily driven by the vegetation cover effect. Such an approach may not apply to higher spatial resolution applications. Many localized studies found that optical sensors alone cannot accurately map biomass beyond 150 Mg/ha [33], and the Root Mean Squared Error (RMSE) for biomass mapping can be as high as 50% or higher [28,71]. As a result, many studies used remotely sensed data from multiple sensors.
Tremendous efforts were also dedicated to mapping forest aboveground biomass using RaDAR data because RaDAR signals can penetrate clouds, which are a constant impediment to biomass mapping using optical images, particularly in the tropics. An early attempt to map forest aboveground biomass with RaDAR was conducted with NASA/JPL SAR data [35,36], and they found that RaDAR backscatter intensity of P-band best correlated with biomass, and the relationship decreases with increasing frequencies. Moreover, they found that the cross-polarized backscatter intensity best explains the forest biomass variation [36,39]. However, RaDAR signals saturate with biomass at about 200 Mg/ha for P-band, 100 Mg/ha for L-band, and C-band backscatter is much less sensitive to forest aboveground biomass variation. Soil and vegetation moisture can have a stronger influence on high frequency (X or C-band) backscatter RaDAR signals than FAGB [37,72]. These findings were later confirmed by Luckman et al. [73]. Recently, Liao et al. [74] compared coherence magnitude, interferometric phase, and backscatter signals of P-band PolInSAR from TropiSAR to map forest aboveground biomass and found the volume backscatter from the forest canopy best predicts tropical FAGB. Two promising RaDAR instruments, BIOMASS with a full polarimetric P-band SAR to be launched by the European Space Agency in 2023 and NISAR with L-and S-band SAR also to be launched in 2023 by a joint U.S.-India effort [75], will bring new momentum in mapping FAGB with RaDAR data in the near future.
Given the complexity of remotely sensed signal interactions with land surface conditions, it seems no single sensor can provide data that can reliably map FAGB. Most recent efforts in mapping forest aboveground biomass almost all engage with remotely sensed data from multiple sensors. Blackard et al. [16] produced the nationwide forest biomass for the U.S. using MODIS remote sensing data and data products as well as topographic, climatic variables, and other ancillary data. However, this biomass map tends to overestimate low biomass and underestimate high biomass. Animi and Sumantyo [76] found that biomass estimation accuracy based on a multilayer perceptron neural network model was significantly better when using both RaDAR and optical data than either alone. Huang et al. [66] and Cutler et al. [77] found that RaDAR image texture significantly improved biomass mapping with Landsat TM data. Image texture for both optical and RaDAR sensors, SD_VV_winter, SD_VH_Summer, and T 1×1 _ were selected into our final model using an automatic feature selection algorithm, KnowGRRF, based on AIC in this study. Similarly, synergistic use of optical and LiDAR data also improves the accuracy of biomass mapping [78][79][80]. More recently, Brovkina et al. [81] used airborne hyperspectral and LiDAR data to map FAGB in central Europe and found that the biomass maps estimated using both data simultaneously were much more accurate than using either datum alone. Andersen et al. [82] and Babcock et al. [67] found that stratifying LiDAR data based on land cover derived from optical sensors improved the biomass mapping accuracy in interior Alaska, USA. There is strong evidence in the literature that using remotely sensed data from multiple sensors enhances the accuracy of biomass mapped. This paper proposes to use data from multiseasonal optical, multiseasonal RaDAR, LiDAR, and very high-resolution data to map FAGB. These data provide information related to forest species, canopy volume, canopy height, and DBH, which are the key data needed to estimate individual tree biomass on the ground. Therefore, the use of data from multiple sensors that provide information on these factors should be the theoretical basis for mapping FAGB with remote sensing. Its potential has not been fully demonstrated in this study. More studies are needed to test this mapping algorithm in areas with a higher biomass density and more heterogeneity species variation because the variables accounting for the effects of tree crown size and species variation did not fully realize their potential in this study.

Conclusions
This study proposes a conceptual model to map aboveground forest biomass using remotely sensed data. The new conceptual model posits that we need remotely sensed data to provide information about species composition, canopy height, and diameter at breast height. We test the conceptual model by fusing optical medium resolution data from Landsat, very high-resolution images from USDA NAIP, airborne LiDAR, and RaDAR images from Sentinel-1 to map aboveground forest biomass. Our final model is able to explain 62.5% of the biomass variation and the RMSE of the model is 18.5 Mg/ha (47.6%) calculated from out-of-bag samples. We find that the LiDAR height metrics are the most important variables. We need the height profile for the 95th, 85th, 75th, 50th, and 25th percentile canopy heights of the LiDAR points in mapping the biomass, rather than a single height metric. The importance of height metrics dwarfs all other variables, although the inclusion of Landsat, very high-resolution imagery, and the RaDAR data from Sentinel-1 only marginally improves the performance of the model. The lack of importance from multiseasonal optical imagery in aboveground forest biomass mapping in this study is likely due to the dominance of evergreen pine forests in the region. The lack of significant contribution from Sentinel-1 C-band SAR backscatter is consistent with the literature because C-band has a limited capacity to penetrate the canopy. The limited contribution from the image texture from USDA NAIP imagery deserves further investigation because biomass for these FIA plots is relatively low. In addition, we find that our model is robust because its performance is not sensitive to minor changes in training sample size. More studies are needed to further test the conceptual model for aboveground biomass mapping in areas with a broader biomass range and a more diverse species composition.