Evaluating Site-Specific and Generic Spatial Models of Aboveground Forest Biomass Based on Landsat Time-Series and LiDAR Strip Samples in the Eastern USA

Large-area assessment of aboveground tree biomass (AGB) to inform regional or national forest monitoring programs can be efficiently carried out by combining remotely sensed data and field sample measurements through a generic statistical model, in contrast to site-specific models. We integrated forest inventory plot data with spatial predictors from Landsat time-series imagery and LiDAR strip samples at four sites across the eastern USA—Minnesota (MN), Maine (ME), Pennsylvania-New Jersey (PANJ) and South Carolina (SC)—in statistical modeling frameworks to analyze the performance of generic (all sites combined) versus site-specific models. The major objective was to evaluate the prediction accuracy of generic and site-specific models when applied to particular sites. Pixel-level polynomial model fitting was applied to the time-series of near-anniversary date Landsat variables to obtain projected metrics in the target year 2014 for which LiDAR strip samples were available. Two forms of models based on ordinary least-squares multiple linear regressions (MLR) and the random forest (RF) machine learning approach were developed for each site and for the pooled (i.e., generic) reference data frame. The models were evaluated using national forest inventory (NFI) data for the USA. We observed stronger fit statistics with the MLR than with RF for both the site-specific and the generic models. The proportions of variances explained (adjusted R2) with the site-specific models were 0.86, 0.78, 0.82 and 0.92 for ME, MN, PANJ and SC, respectively while the generic model had adjusted R2 = 0.85. A test of statistical equivalence of observed and predicted AGB for the NFI locations did not reveal equivalence with any of the models, possibly due to the different resolutions of the observed and predicted data. In contrast, predictions by the generic and site-specific models were equivalent. We conclude that a generic model provides accuracies comparable to the site-specific models for large-area AGB assessment across our study sites in the eastern USA.


Introduction
Forest ecosystems store the majority (~80%) of terrestrial biomass and carbon [1], as well as contribute significantly to global carbon emissions (~33% over the past 150 years) through land use and land cover changes [2][3][4]. Sustainable land use and forestry practices focused on enhancing carbon storage in live woody biomass have been recognized as effective means to reduce the effects of emissions and mitigate the effects of climate change, because forests are estimated to absorb about 25% of annual anthropogenic carbon-dioxide emissions [5]. The United Nations Framework Conventions on Climate Change (UNFCCC) has endorsed programs for reducing emissions from deforestation and forest degradation (REDD+), and mandated that member countries periodically report forest carbon estimates via national greenhouse gas inventories (NGHGI). The REDD+ initiatives and NGHGI will benefit from wall-to-wall (spatial) inventories of aboveground forest biomass (AGB) at regional and/or national scales. Indeed, estimation and monitoring of AGB over a large space and time are crucial to inform strategic and operational-scale forest management plans. It can be hypothesized that a large-area spatial inventory of AGB can be implemented more efficiently with a generic model for diverse forest types, as site-specific data or models may not be available for individual forest types in any area of interest.
Spatially explicit (wall-to-wall) inventories of AGB for regional or national scales can be made cost-effective by integrating inventories of field sample plots with freely available remotely sensed predictors. The cutting-edge light detection and ranging (LiDAR) technology has been extensively used in forestry practices including AGB mapping, because these active sensors are highly sensitive to structural attributes and provide accurate three-dimensional coordinates of the pulse returns from intercepting canopy and terrain elements [6,7]. In addition, the freely available archive of Landsat data, especially multispectral imagery by the Thematic Mapper (TM) and successor sensors in use since 1984, have been integral in portraying the spatio-temporal distribution of AGB. More recently, the Sentinel-2 land monitoring satellite mission of the European Space Agency is publicly providing imagery of improved spectral and spatial resolutions that also supports Landsat data continuity. Landsat surface reflectance bands and derived indices have long been used to produce spatio-temporal biomass maps and to differentiate forest structure and composition over landscapes [8,9]. Although the passive Landsat optical data are less sensitive to complex forest conditions [10], their continuous data collection for large scenes since 1972, moderate spatial resolution (30-m × 30-m) and synoptic coverage characteristics can improve cost-efficiency in forest inventories across large spatial and temporal domains [11,12]. Furthermore, improved accuracy of AGB prediction can be attained when Landsat data are combined with LiDAR-derived fine resolution metrics, because Landsat spectral data can more accurately predict species composition [13,14]. LiDAR data acquired via discrete or full waveform instruments, with small or large footprints and scanning or profiling characteristics, have been extensively used in past studies for AGB mapping [14,15].
AGB estimation using LiDAR depends on canopy characteristics such as tree branching and foliage structure, because these elements intercept laser pulses in various canopy height layers. Consequently, the accuracy of LiDAR-based AGB models can vary with species and forest types, as well as the strength of relationships among tree diameter at breast height (DBH), total height, and crown dimensions. Fortunately, highly accurate absolute elevations of canopy elements and overall three-dimensional canopy structures can be estimated from LiDAR data. The general procedure is to separate the LiDAR point cloud data into ground and non-ground returns, create a digital terrain model (DTM) from the ground returns, and then subtract the DTM from the elevations of canopy returns to obtain actual return heights [16]. The height information of LiDAR returns is ultimately used to develop area-level metrics that can be used to predict statistical distributions of canopy elevations, strata density, and cover. LiDAR data should be binned to regular grid cell sizes approximately matching field sample plot sizes to enable wall-to-wall mapping of the target response attribute [17,18].
For remotely sensed AGB mapping, formulation of a relationship (model) between a sample of field inventory data and co-located spatial predictors is crucial. The AGB of sample plots may be obtained using localized species-specific or generalized allometric models dependent on commonly measured tree variables such as DBH and total height. Because localized allometric models are not available for most forest ecosystems around the world, generalized models are commonly applied to the tree list of local forest inventory data with the tree AGB estimates being aggregated to the plot level. For example, the tree AGB models developed by Jenkins et al. [19] for 10 species groups in the continental USA are widely used in North America, despite large biogeoclimatic gradients across the continent. The concerted efforts in recent years to create legacy tree databases based on past studies across a large region (e.g., Falster et al. [20]; www.legacytreedata.org) have also facilitated formulation of generic models. However, a drawback of these generalized models is that they ignore the effects of local site characteristics on tree growth [21,22]. Further, the use of DBH alone in an allometric model assumes that the relationship between tree diameter and height is invariant, which is unrealistic [23]. With the widespread use of generalized tree allometric models in design-based frameworks for estimating AGB at multiple spatial scales, the scope for generalized (or generic) remote sensing-based models in large-area inventories can be emphasized. This is because of the increasing availability of consistently collected remotely sensed data in the public domain (e.g., Landsat, and satellite based LiDAR), which generally reduces costs and improves accuracy compared to approaches based on field data alone.
Many LiDAR-based AGB estimation applications have focused on local models for two basic reasons: (i) the lack of a consistent LiDAR dataset over a large range of diverse forest types, and (ii) concerns for ecological variations (i.e., biogeoclimatic conditions) over large geographic areas (Wulder et al., 2012). However, with pending spaceborne LiDAR such as the Global Ecosystem Dynamics Investigation (GEDI) and other missions, which are expected to provide consistent data across regions, there is potential for pooling data from different forest types to construct a generic model for enhancing the efficiency of LiDAR-based inventories. Such efficiency is achieved in the sense that LiDAR processing will require derivation of only a subset of predictors at the spatial resolution that matches the size of field inventory plots compared to a varying suite of predictors for site-specific models. While the validity of a generic model is improved by including a large number of data points from across the representative ecological sites [24,25], extension of the geographic domain complicates model selection and parameterization with the increasing complexity of the reference training frame [26][27][28][29]. Nevertheless, regional biomass assessments based on remotely sensed data could benefit from a generic model, because the need for numerous site-specific models and the aggregation of local estimates for large-area totals would be avoided. Performance of a remote sensing-based generic model relative to site-specific models for AGB estimation remains largely unknown. A thorough analysis of prediction errors with generic and site models would inform users as generic models do not accommodate the effects of differences in local growing conditions. Empirical parametric or non-parametric relationships between sample plot AGB as the response and LiDAR and Landsat derived variables as the predictors can be constructed and subsequently applied in the same population at any area where only the remotely sensed data are available. However, the form of the models and importance rankings of predictors can vary with forest types [30]. In any case, formulation of a generic model requires selection of an optimal set of predictors [29]. Careful selection of predictors is important in parametric regression, since a large number of strongly correlated variables can produce overfitting (i.e., unstable estimates of parameters) of models that poorly reflect the general trend in data and adversely affect predictions accuracy outside of the training data [31]. Non-parametric modeling techniques, such as random forest, are free from many statistical assumptions, and have the ability to rank importance of predictors and estimate mean square error through internal cross-validation [32][33][34]. Random forest techniques have been frequently used in AGB modeling by integrating samples from field inventories, remote sensing, and other auxiliary data [35]. For large-area estimation in a parametric mode, model-assisted approaches to inference are largely preferred because (i) the reference (training) data frame relies on a sample of population units selected randomly from a consistent probability sampling design, and (ii) the relationship between sampling unit (plot) observations and co-located auxiliary variables is exploited to achieve a reduction in error variance (compared to what is possible without auxiliary data) in the estimate of a population parameter [36,37]. Simple or stratified random sampling designs can be employed to acquire the sample and infer the population mean or aggregate by expressing the sampling error in the form of, for example, standard error of estimates [38].
The Carbon Monitoring System (CMS) program of NASA in collaboration with the Forest Inventory and Analysis (FIA) program of the U.S. Department of Agriculture Forest Service aims to assess forest carbon sinks and fluxes in the USA. CMS has emphasized large-area (continent-scale) assessment for strategic purposes using Landsat time-series data, and small-area assessment for operational plans using fine resolution remotely sensed data such as LiDAR. The FIA program has focused on methods that provide consistent inventory estimates across time and space. The broad objective of this study is to integrate field sampling, LiDAR sampling, and Landsat time-series data to formulate parametric and non-parametric models for plot-level AGB prediction within the context of CMS. Here, we investigate the utility and performance of a remote sensing-based generic model and site-specific models for estimating plot-level AGB across a wide range of forest ecosystems in five states across the eastern USA. The specific objectives are threefold: (1) to compare the effectiveness of LiDAR and Landsat variables for estimating plot-level AGB using ordinary multiple linear and random forest based models at four study sites (i.e., Landsat scenes) across the eastern USA; (2) to evaluate the accuracy of site-specific and generic AGB models across the four study sites; and (3) to validate the performance models using FIA data.

Study Area
The study area was located in the eastern USA and covered four Landsat scenes corresponding to the WRS-2 paths/rows of 12/28, 27/27, 14/32 and 16/37 ( Figure 1). Scene 14/32 covers parts of Pennsylvania and New Jersey (PANJ) while scenes 12/28, 27/27 and 16/37 lie completely within Maine (ME), Minnesota (MN) and South Carolina (SC), respectively. The scenes are spread over diverse landscapes with substantial differences in bioclimatic characteristics ( Table 1). The terrain elevations (based on LiDAR strip samples) were found to range from 0 to 674, 151 to 561, 98 to 1138 and 0 to 114 m above sea level in PANJ, MN, ME, and SC, respectively.
The study area in ME lies in a transition zone between boreal forests to the north and eastern deciduous forests to the south, and forest composition is divided nearly equally between hardwoods and softwoods with spruce/fir, maple/beech/yellow birch, white/red/jack pine, and aspen/white birch as the dominant forest types [39]. Differences in harvesting intensity across the landscape have impacted stand-size classes across different areas of ME. Laurentian mixed forests characterize the MN site that lies in the transition zone between the Canadian boreal forests to the north and the broadleaf deciduous forests to the south [40]. Aspen, pine, oak, elm/ash/cottonwood, and spruce/fir are the dominant forest types in MN where nearly two-thirds of the growing stock are hardwoods, and the rest are softwoods. Mixed-oak, northern hardwoods and loblolly/pitch/shortleaf pine forest-type groups dominate the PANJ site where hardwoods are largely distributed to the northern areas and softwoods are concentrated primarily in the south [41,42]. Loblolly-shortleaf pine, oak-hickory and oak-gum-cypress are the dominant forest-type groups in SC where a large part of the forest is young and naturally regenerated [43].

LiDAR Strip Samples and Processing
LiDAR strip samples were acquired in summer (28 June-3 September) 2014 at the four study sites using a Leica Reigl 680i airborne laser scanner onboard twin engine aircrafts (Cessna 310, Piper Aztec and Piper Navajo) flown at an average altitude of 731.5 m above the terrain level. The LiDAR strips covered about 2590, 3056, 4454 and 3571 square kilometers in 24, 24, 35 and 29 flight lines for ME, MN, PANJ, and SC, respectively. The average width of the strips was about 1.2 km. The multi-return LiDAR system captured up to six returns per pulse and generated point cloud data with densities of four to six pulses per square meter and average precision of 5 cm for elevations. The LiDAR datasets were processed to produce 55 grid metrics (Table A1) at 30-m spatial resolutions, following McGaughey [16]. These metrics represented the descriptive statistics such as elevation range and percentiles based on the height information of all LiDAR returns falling within each 30-m × 30-m grid cell. The heights of aboveground pulse returns were also summarized to obtain the grid metrics for canopy cover and vertical strata density as defined in Table A1 and Deo et al. [17].

Landsat Time-Series Data and Pixel Level Curve Fitting
Near-anniversary date time-series Landsat surface reflectance high-level data products, managed in the USGS climate data records, were obtained for each of the sites from the Earth Resources Observation and Science (EROS) Center Science Processing Architecture (ESPA) on-demand interface (http://espa.cr.usgs.gov/). The time-series data were acquired by Landsat-5 Thematic Mapper (TM) and Landsat-8 Operational Land Imager (OLI) sensors at the peak of the growing seasons (July-September) from 1984 to 2015 ( Table 2). The high-level products included surface reflectance of individual bands (visible and infrared), some derived spectral indices, and cloud masks for each image (https://landsat.usgs.gov/cdr-ecv) at 30-m spatial resolution. The image data were terrain corrected (L1T) and radiometrically preprocessed at the source (USGS EROS [44] ), and we enhanced the input quality by selecting only images having less than 10% cloud cover. The peak growing season and annual quality of imagery provided data with consistent landscape conditions and phenology within a site due to similar solar geometry. Any narrow cloud and shadow patches in some of the images were masked out and substituted with clear imagery acquired in adjacent years. The collection consisted of 16, 18, 18 and 17 images for ME, MN, PANJ, and SC sites, respectively, where data gaps as long as three years were present. The data products are ready-to-use in terms of predicting forest attributes.
We considered 12 spatial predictors derived from the Landsat data for predicting plot-level AGB: shortwave infrared (SWIR) surface reflectance, normalized difference vegetation index (NDVI), normalized burn ratio (NBR), integrated forest z-score (IFZ), tasseled cap brightness (TCB), greenness (TCG) and wetness (TCW) indices, tasseled cap angle (TCA), disturbance index (DI), enhanced vegetation index (EVI), normalized difference moisture index (NDMI), and soil adjusted vegetation index (SAVI). The predictors were either SWIR band surface reflectance, or ratio of bands, or any combination of surface reflectance bands. The SWIR band, NDVI, NBR, NDMI, SAVI, and EVI were obtained directly from the USGS climate data records via the ESPA online interface, while IFZ, TCA, and DI were derived as described in Huang et al. [45], Pflugmacher et al. [11] and Healey et al. [46], respectively. The coefficients for the derivation of tasseled cap indices were adopted from Crist et al. [47] for Landsat-5 data and Baig et al. [48] for Landsat-8 data. The derivations of IFZ and DI metrics required average and standard deviation values of forested samples of the imagery. Hence, several homogeneous polygons (>20) were digitized manually over dense forest areas of each image in ArcGIS. The dense forest areas were identified using the dense-dark vegetation raster available for each Landsat image in the surface reflectance high-level data products of the USGS [44]. The descriptions of individual predictors are given in Deo et al. [8] and USGS EROS [49]. These predictors were considered to simplify model and minimize error, although published works have reported varying sensitivity of the same predictors in different forest types [31,50,51]. Pixel-level polynomial (third-degree) models were fit to the time-series Landsat metrics (12 predictors at each of the four sites) to characterize spectral trends over time and to obtain fitted (projected) metrics for any target year that did not have cloud-free Landsat data. A previous study revealed that AGB models dependent on the polynomial-projected Landsat variables produce accuracies comparable to the models using actual Landsat metrics [8]. The pixel-level temporal model fittings have the potential to minimize noise due to atmospheric factors, such as cloud and shadows. The temporal model fit was performed using R statistical software packages [52], including 'raster' and 'rgdal', which generated pixel-level parameter estimates of model fit and coefficients of determination (R 2 ). The model parameters (as rasters) were used to obtain projected metrics for the Landsat variables in the year 2014 when LiDAR was also flown. The projected Landsat variables were used in AGB modeling, because cloud-free imagery was not available in 2014 for all sites, and the projected variables are most likely to provide consistency in the predictive power over time. However, other approaches to obtaining projected predictors from a time-series of surface reflectance imagery have also been published by others (e.g., [45,53,54]).

Field Data for Model Training
Field data for model training were collected in summer 2015 and consisted of approximately 50 specialized forest inventory plots within the LiDAR strips of each site (Table 3). Although there is a one year gap between the field and LiDAR data, the plots were located in undisturbed forest conditions, so there was minimal difference in forest structure and composition in the two datasets. A total of 197 plots, hereafter called Carbon Monitoring System (CMS) plots, across the four sites (48 in ME, 50 in MN, 49 in PANJ and 50 in SC) were established in forest lands and were distributed across 15 LiDAR-based structural strata (3 cover classes × 5 height classes). The plots were fixed-area circular (16.1 m radius), and closely matched the 30-m pixel size of the Landsat data and LiDAR grid metrics. The coordinates of plot centers were estimated using a Javad Triumph-2 survey-grade global positioning system and differential correction post-processing that resulted in an average horizontal error of less than a meter. The field measurements included records of species, DBH, and total heights of trees with DBH of at least 12.7 cm within the plots, following Forest Inventory and Analysis (FIA) measurement protocols [55]. The standard FIA algorithms were used to predict AGBs for individual sample trees [56] which were aggregated to obtain plot-level estimates.  30 in PANJ, and 60 in SC. The FIA plots were selected such that each represents a single condition and per hectare basal-area and basal area weighted height is above zero.

Field Data for Model Validation
Inventory data collected on FIA plots within the LiDAR strips measured in 2014 (i.e., the year of LiDAR acquisition) were used for model validation. The FIA program has a national network of permanent plots established using a sampling design that produce equal probability random samples [57]. In addition, FIA plots located within the strips and measured in the previous and subsequent years of LiDAR acquisition were added to the validation dataset because there were few FIA plots inventoried within strips in the same year of acquisition. Thus, FIA plots measured in 2013, 2014 and 2015 were used for the validation of site-specific models in ME and MN, while the plots measured in 2012, 2013 and 2014 were used for the validation of site-specific models in PANJ and SC, per the availability of measurements in the FIA database. The number of FIA plots used for model validation was 86 in ME, 102 in MN, 42 in PANJ, and 75 in SC, such that each plot has at least one tree (i.e., plot-level basal area > 0). Actual coordinates for these plots were obtained from the Northern Research Station of FIA in St. Paul, Minnesota.
The FIA plot configuration consists of a cluster of four subplots with a central and three peripherals at 36.58-m (120-ft) horizontal distance and 0 • , 120 • and 240 • azimuths from the central subplot. Each subplot has a 7.32-m (24-ft) radius in which all trees with DBH of at least 12.7 cm are measured for various attributes, including species, DBH, and total height. The tree-size measurements were used with species-specific allometric models to predict tree AGB [56] which were then summed to obtain plot-level estimates ( Table 3). The FIA plots are distributed over all land-use classes and ownership, and the sampling intensity on the ground was up to three plots per 2400 ha; NJ portion of the area had intensity of only one plot per 2400 ha.

Modeling and Validation
The predicted Landsat variables and actual LiDAR grid metrics of 2014 were intersected with actual CMS and FIA plot locations in a GIS environment. In the spatial joining of response and predictor variables, we attached the digital value of co-located pixels of predictors to the CMS data, while the mean spectral values from each 3 × 3 pixel-window were attached to the FIA plot data. This was performed because the CMS plots matched the spatial resolution of predictors but small FIA subplots are distributed over a square block of nine 30-m by 30-m pixels [57]. For this reason, the FIA plots were not used for model training, because using plots of smaller size than the spatial resolution of predictors has negative effects on model prediction accuracy [58]. Hence, a site-specific training data frame, based on the CMS field measurements, was developed for each site and a generalized training data frame was produced by pooling over all the CMS data points from the four sites. Both training data frames had 67 predictors (12 Landsat fitted and 55 LiDAR grid metrics, Table A1).
We applied ordinary least squares multiple linear regression (MLR) and the k-nearest neighbors technique using an RF-based distance metric (RF-kNN) to each of the training data frames to construct site-specific and generic AGB models. In the RF-kNN method available in the 'yaImpute' package of R, a set of training points with known values of response and predictors are used to predict a response at any target point (where only spatial predictors are known) as a weighted average of k-nearest neighbors from the reference set where neighbors are identified via the RF algorithm [33]. The spectral similarity of the predictors at the target and reference points provides the basis for a nearness or proximity measure. In the process, RF creates an ensemble of multiple regression tree models (for continuous variables), each constructed from a bootstrap subsample of the reference frame, and the nearness is estimated as the proportion of all trees in the ensemble that associates a target point with a particular subset of k points in the reference set. RF-kNN can accommodate a large number of predictors, rank their importance in the models, and generate a measure of model accuracy through internal cross-validation. The RF-kNN model generates mean square error (MSE) as the average value of the errors met with the out-of-bag data corresponding to approximately 33% of the plots that are withheld from the bootstrap subsample when constructing the tree models. Our previous studies [8,59] have revealed that using about 2000 trees and three neighbors (i.e., the value of k) in the RF-kNN model provides optimal results; so, we used these numbers and other parameters were set to the default values in the R package [33].
Although RF is considered robust to the spatial auto-correlation of predictors, multi-collinear variables were identified and removed before fitting the models. First, the variables were pruned using a multivariate variable screening process based upon QR-matrix decomposition [60,61]. Subsequently, an RF-based model selection procedure [61,62] was applied to obtain a parsimonious set of model predictor variables for each training data frame. The RF model selection procedure uses a percentage increase in model MSE to select a suite of the least number of predictor variables that explains the greatest proportion of variation in AGB (e.g., [29,63]. The set of predictors identified in the RF model selection procedure were further subjected to both forward and backwards elimination methods of stepwise regression [59,64] to remove predictors that did not contribute statistically significantly to improving the fit of the model to the data. The optimal sets of predictors thus selected for each site and for the pooled (generic) dataset were used separately in multiple-linear regression (MLR) and the RF-kNN frameworks to construct AGB models. In addition, the same set of predictor variables was used for both MLR and RF-kNN modeling and the accuracies were compared with respect to adjusted R 2 and root mean square error (RMSE) statistics; the RF-kNN only provides pseudo-R 2 and MSE. The pseudo-R 2 is the percent variance explained that measures how well out-of-bag predictions explain the target variance of the training set [34]. The square root of MSE estimate from the RF was taken to obtain RMSE. The importance ranking of predictors based on RF is shown in Figure 2.
The models were developed for two scenarios, one using both LiDAR and Landsat predictors and the other using only Landsat predictors. Because LiDAR acquisition is still a big constraint for a large-area spatio-temporal assessment of AGB, and Landsat data are freely available (since 1970s), we wanted to see how the combined variables from LiDAR and Landsat sensors improve model accuracy compared to when only LiDAR or Landsat data are used. The models with minimum RMSE (as the fit statistic) for both the scenarios were applied to each site to obtain AGB maps that were intersected with the FIA plots within the LiDAR strips to evaluate the AGB predictions from the generic and site-specific models with the field observations, with emphasis on mean deviation (bias) and the standard error of the mean. The accuracy of predictions at the FIA plot locations were calculated in terms of mean deviation, RMSE and Pearson's correlation coefficient (r). The model assisted regression estimators [38], especially standard errors of estimates, were used to evaluate the confidence interval for the large-area estimates. In addition, equivalence tests [65,66] for the predictions based on generic and site-specific models at the FIA plot locations were conducted. The equivalence testing uses two one-sided tests and relies on a subjective choice of a region in which differences between predicted and observed values are considered negligible. A region of equivalence, for example, set to ±25% of the standard deviation around the mean infers that predicted and observed values do not differ if the absolute value of the mean of the differences is less than 25% of the standard deviation.

Results
The simple multiple linear regression analysis dependent on both LiDAR and Landsat (LiD.LS) variables produced more accurate models in terms of fit statistics than the RF-based models when the same optimal set of predictors was used in both the approaches ( Table 4). The site-specific MLR models produced adjusted R 2 ≥ 0.78, and the generic model from the pooled reference frame explained a comparable amount of variance (adj. R 2 = 0.85). In fact, the generic model had larger adjusted R 2 than the site-specific models for MN and PANJ, although the numbers of selected predictors and type of metrics were different among the models. The largest and smallest proportions of explained variance with the LiD.LS dependent model were obtained for SC and MN, respectively. As expected, the RMSE for the site-specific models followed the average and standard deviation trends of AGB distributions in the sample plots (e.g., the PANJ site displayed the largest mean and standard deviation in the training data and had largest RMSE; Tables 3 and 4). The models constructed using only Landsat (LS) variables resulted in a large reduction in the proportion of variance explained relative to the LiD.LS models. The proportion of explained variance for the generic LiD.LS model was more than twice of the proportion for the generic LS model. The LS model with minimum RMSE and maximum adjusted R 2 was obtained in ME (adj. R 2 = 0.61), and the opposite was found in MN (adj. R 2 = 0.26). One notable observation was that the Landsat metrics were least important for the LiD.LS models and only a single predictor was found to contribute statistically significantly in one of the site-specific models, as well as in the generic model (i.e., SWIR band in ME and TCB in the generic, Figure 2).

Results
The simple multiple linear regression analysis dependent on both LiDAR and Landsat (LiD.LS) variables produced more accurate models in terms of fit statistics than the RF-based models when the same optimal set of predictors was used in both the approaches ( Table 4). The site-specific MLR models produced adjusted R 2 ≥ 0.78, and the generic model from the pooled reference frame explained a comparable amount of variance (adj. R 2 = 0.85). In fact, the generic model had larger adjusted R 2 than the site-specific models for MN and PANJ, although the numbers of selected predictors and type of metrics were different among the models. The largest and smallest proportions of explained variance with the LiD.LS dependent model were obtained for SC and MN, respectively. As expected, the RMSE for the site-specific models followed the average and standard deviation trends of AGB distributions in the sample plots (e.g., the PANJ site displayed the largest mean and standard deviation in the training data and had largest RMSE; Tables 3 and 4). The models constructed using only Landsat (LS) variables resulted in a large reduction in the proportion of variance explained relative to the LiD.LS models. The proportion of explained variance for the generic LiD.LS model was more than twice of the proportion for the generic LS model. The LS model with minimum RMSE and maximum adjusted R 2 was obtained in ME (adj. R 2 = 0.61), and the opposite was found in MN (adj. R 2 = 0.26). One notable observation was that the Landsat metrics were least important for the LiD.LS models and only a single predictor was found to contribute statistically significantly in one of the site-specific models, as well as in the generic model (i.e., SWIR band in ME and TCB in the generic, Figure 2). Coefficients of all the predictors in the models have significant p-values at α = 0.05 in multiple linear regressions (MLR). The predictors were selected by applying the QR-decomposition algorithm for pruning the collinear variables followed by the random forest (RF) model selection approach (that uses standardized importance values of predictors and model improvement ratio) and both direction elimination methods in stepwise regression. Pseudo-R 2 = % explained variance in the RF models. RMSE = root mean square error.
The importance rankings and the number of predictor variables making statistically significant contributions for the generic and site-specific LiD.LS models were not consistent across the sites ( Figure 2). However, the cover metrics and density of LiDAR returns in vertical height strata were important for all the models. Interestingly, the proportions of LiDAR returns in the first five vertical strata (sum of which is a proxy for canopy volume) along with the Landsat-derived TCB index were found to make significant contributions for the generic LiD.LS model. When the six predictors selected for the generic LiD.LS model were used to construct models for the individual sites in the RF framework, the fit statistics (i.e., pseudo-R 2 and RMSE) degraded somewhat with respect to the site-specific models produced using the optimal set of LiDAR and Landsat predictors (Tables 4 and 5). However, the same six predictors in simple MLR analyses provided similar or improved fit statistics compared to the site-specific models based on the optimal set of LiD.LS predictors. When scatterplots of CMS plot-level AGB against some selected LiDAR metrics such as cover (CovDBH) and average elevation (ElevAv) were made (figures not shown), large correlations were observed. The Pearson's correlation coefficients (r) between AGB and ElevAv were 0.81, 0.70, 0.79 and 0.88, and between AGB and CovDBH were 0.77, 0.70, 0.76 and 0.63 in ME, MN, PANJ, and SC, respectively. However, Landsat variables such as the SWIR band reflectance provided r = −0.64, −0.21, −0.18 and −0.54, and TCB provided r = −0.63, −0.19, 0.04 and −0.53 for ME, MN, PANJ, and SC, respectively. The negative correlations here imply that with increasing biomass, the SWIR band reflectance and TCB of canopy diminishes (e.g., [9]).
Compared to RF models, the MLR models resulted in larger R 2 and smaller RMSE at each site (Table 4). Hence, the site-specific and generic MLR models were used to develop AGB maps for the LiDAR coverage, and the predictions were evaluated with the observations at FIA plot locations ( Table 6). The maps dependent on LiD.LS are not shown because of narrow LiDAR strips; only the maps dependent on LS models are shown in Figures A1 and A2. The generic model compared to site-specific models resulted in similar or larger correlations with the field observations over the FIA plots at the four sites ( Table 6). The mean deviations (bias) of site-specific models in MN and SC were less than in the generic model; otherwise the generic model had smaller mean deviation at the other two sites (i.e., ME and PANJ). In terms of RMSE, the generic model produced smaller error than the site-specific models for ME and PANJ, while the results were opposite for MN and SC. Equivalence tests of the observed AGB against the predicted AGB either by generic or site-specific models at FIA plot locations did not reveal equivalence at 25% region of similarity and 5% level of significance (α = 0.05). When AGB was predicted at each FIA plot location by the generic and site-specific models, the two sets of predictions did not differ at the 25% region of similarity and α = 0.05 ( Figure 3). The correlation coefficients of AGB predictions by the generic and site-specific models were 0.95, 0.93, 0.97, and 0.96 for ME, MN, PANJ and SC, respectively. The training data frame of the generic LiD.LS model (i.e., 197 CMS plots with the six predictors as in Table 5)  The negative correlations here imply that with increasing biomass, the SWIR band reflectance and TCB of canopy diminishes (e.g., [9]). Compared to RF models, the MLR models resulted in larger R 2 and smaller RMSE at each site (Table 4). Hence, the site-specific and generic MLR models were used to develop AGB maps for the LiDAR coverage, and the predictions were evaluated with the observations at FIA plot locations ( Table 6). The maps dependent on LiD.LS are not shown because of narrow LiDAR strips; only the maps dependent on LS models are shown in Figures A1 and A2. The generic model compared to site-specific models resulted in similar or larger correlations with the field observations over the FIA plots at the four sites ( Table 6). The mean deviations (bias) of site-specific models in MN and SC were less than in the generic model; otherwise the generic model had smaller mean deviation at the other two sites (i.e., ME and PANJ). In terms of RMSE, the generic model produced smaller error than the site-specific models for ME and PANJ, while the results were opposite for MN and SC. Equivalence tests of the observed AGB against the predicted AGB either by generic or site-specific models at FIA plot locations did not reveal equivalence at 25% region of similarity and 5% level of significance (α = 0.05). When AGB was predicted at each FIA plot location by the generic and site-specific models, the two sets of predictions did not differ at the 25% region of similarity and α = 0.05 (Figure 3). The correlation coefficients of AGB predictions by the generic and site-specific models were 0.95, 0.93, 0.97, and 0.96 for ME, MN, PANJ and SC, respectively. The training data frame of the generic LiD.LS model (i.e., 197 CMS plots with the six predictors as in Table 5) when subjected to an iterative analysis that trained a model using 177 plots and applied to the remaining 20 plots, produced r = 0.91, bias = −0.13 Mg·ha −1 and RMSE = 27.40 Mg·ha −1 after 1000 iterations.  . Equivalence plot of aboveground biomass (AGB) predictions by the generic and site-specific models at FIA plot locations. The black line represents the line of best fit, the dashed gray lines represent the 25% region of similarity for slope, and the shaded gray polygon represents the 25% region of similarity for intercept. The black and white vertical bars represent 95% confidence intervals for slope and intercept, respectively.  (44.99) † Bias is calculated as predicted minus observed aboveground biomass and the numbers in parentheses represent relative bias (%). ‡ Numbers in parentheses represent relative root mean square error (RMSE, %).
The standard errors (SE) of estimates of the population means, an important statistic for model inference in the form of a confidence interval to the mean for large area estimation [38], using the site-specific models were comparable to the SE obtained when applying the generic model to the individual sites (Tables 7 and 8). The prediction estimates at the CMS plot locations revealed that the SE values of the site-specific models are similar but smaller than the SE values for the generic model (Table 7). In contrast, the prediction estimates at the FIA plot locations revealed that the generic model produce smaller SE values than site-specific models for all sites ( Table 8). The estimated bias (mean deviation of the residuals) with each site-specific model was zero when the models were applied to the CMS plot locations (i.e., the same data used to construct the models); but the site-specific models when applied to the FIA plot locations produced positive estimated bias for all sites. The generic model also produced positive estimated bias for CMS and FIA plot locations of individual sites. The estimated biases of generic model predictions at the FIA plots were 4.8%, 41.2%, 12.8% and 41.1% of the mean of sample observations for ME, MN, PANJ and SC, respectively. The biases with site-specific models were 18.5%, 22.8%, 13.7% and 35.1% of the mean of sample observations for ME, MN, PANJ and SC, respectively. The differences of model-assisted regression estimates (MARE) of population mean using the generic and site-specific models at the FIA plots were 4.7%, 8.2%, 14.6% and 6.6% of the mean of sample observations for ME, MN, PANJ and SC, respectively.

Discussion
Previous studies have shown strong relationships between LiDAR metrics and coincident in situ measurements, and most of the models are representative of small test sites with little variation in biophysical environments. This study also revealed the strength of LiDAR dependent AGB models that are applicable to regional assessment. We did not attempt to produce forest type or strata-specific generic models (because of limited field plot data), although published works have shown better performance of such models. For example, Latifi et al. [28] classified field data into broadleaved, coniferous and mixed forest types, and found better performance of a strata-specific model compared to a model from a pooled reference data frame. White et al. [67] evaluated various models based on LiDAR data acquired in both leaf-off and leaf-on canopy conditions in coniferous, deciduous and mixed forest types in the Rocky Mountains of Alberta, Canada. They found that pooling leaf-on and leaf-off data for a selected forest type did not adversely impact model accuracy (as relative RMSE and bias of pooled models differed by <2% compared to either leaf-on or leaf-off models). They also noted poor accuracy of the generic model built by pooling all data regardless of forest types and canopy conditions at the time of LiDAR acquisitions; RMSE of the generic model was twice as large as that of forest type-specific models. White et al. [67] and Villikka et al. [68] found that mixing models and data types (i.e., applying leaf-off models to leaf-on data, and vice versa) resulted in large error and bias in model outcomes. Considering these observations, we did not apply site-specific models to the other sites (e.g., using developed models in ME to predict AGB in SC), but found generic and site-specific models to perform similarly.
Comparable to this study, some large-area studies employing LiDAR strip datasets do exist, but comprehensive analyses of site-specific and generic AGB models are lacking. For example, Lefsky et al. [69] used large-footprint spaceborne profiling LiDAR data to predict maximum canopy height in three different ecosystems: tropical broadleaf, temperate broadleaf and temperate conifer forests. They found that the generic model explained 48% of the variability while site-specific models explained 59% to 68% of the variability. Nelson et al. [70] used airborne profiling LiDAR in the entire state of Delaware, USA, and developed separate AGB prediction models for deciduous, conifer and mixed forest types, along with a generic model across all strata. Their generic model explained higher variance (R 2 = 0.66) compared to the forest type-specific models where R 2 ranged from 0.28 for the mixed forest to 0.44 for the conifer forest. Maltamo et al. [71] used LiDAR strip samples (200 m wide and 1500 km long) combined with geo-climatic variables and major species information for AGB modeling in mountain forests in Norway. However, their study categorized only three forest types: spruce-, pine-and deciduous-dominated. They found that plots across all species can be pooled for modeling and inventory of spruce forests, but site-specific models are optimal in areas dominated by pine or deciduous species. They also noticed that geo-climatic variables, serving as proxies for growing conditions, are less influential compared to species information when combined with LiDAR metrics in AGB modeling (addition of geo-climatic variables improved RMSE by <2% and species information improved RMSE by >19%, with respect to the model based on only LiDAR metrics). In Maltamo et al. [71], the RMSE was 47.8% for the generic model and 33.1 to 47.3% for the forest-specific models. In the absence of species information in our study, the MLR generic model explained 85% of the variance compared to a range of 78% to 92% in the site-specific models dependent on both LiDAR and Landsat predictors. However, our generic model produced AGB estimates that are equivalent to the estimates of site-specific models at 25% region of similarity ( Figure 3). It should be noted that the equivalence is based on a subjective choice of a region of similarity, and users should decide on how large a difference is acceptable to conclude that the generic provide predictions equivalent to the predictions of the site-specific model.
Laser pulses have varying levels of interactions in the canopy of broadleaf and conifer trees due to the differences in crown shape. Hence, it is logical to assume that species-specific models would perform better. If we assume that Landsat data capture more species information than LiDAR data, then it is logical to state that models including both LiDAR-and Landsat-derived predictors have high accuracies. However, only one Landsat metric was shown to be useful at one site (i.e., ME). Maltamo et al. [71] explained that species-and location-wise variations in crown biomass are largely accounted for by LiDAR, because the echo distribution is directly dependent on the three-dimensional crown structure. In the leaf-on canopy conditions for our LiDAR datasets, we expected accurate estimates of canopy heights, as pulses are more likely intercepted in upper canopy levels. Particularly in deciduous forests, leaf-on data can be expected to provide better models [67]. However, White et al. [67] and Naesset [72] found that leaf-off LiDAR data correlate better to field observed AGB than leaf-on data, so we recommend further study for sites with leaf-off data to reevaluate the performance of generic model. Considering the limitations of remotely sensed data (e.g., limited spatial coverage and incapability for species mapping with LiDAR, and signal saturation of radar or passive optical imagery in high biomass areas), several studies have underscored the need for data fusion from different sources to accurately estimate AGB over a large area (e.g., [30,[73][74][75]).
The importance rankings of predictors in the site-specific models varied with sites. This was also revealed from the pixel-level polynomial model fitting to the time-series of observed metrics of Landsat data. The polynomial fit statistics of the same metrics at different sites varied significantly. For example, the fitting to the SWIR surface reflectance produced R 2 values ≥0.40 in 71% of the pixels at ME, but only in 19% of pixels at the PANJ site (Table 9). It may be inferred that the low average R 2 for a predictor metric at a given site is due to larger sensitivity to disturbances. This is because the same set of Landsat imagery was used to obtain several predictors at a given site and when pixel-level polynomial were fit, some of the metrics resulted in large and others small R 2 . Table 9. Proportion of total pixels that attained R 2 in the range of 40-100% with the pixel-level polynomial curve fittings at four sites in the eastern USA. LiDAR metrics among the selected variables in the LiD.LS model at each site are much more influential than Landsat variables. Canopy height, canopy density, and cover metrics derived from LiDAR data are commonly used and the variables selected in this study match the key variables reported elsewhere in forest biomass inventories [76]. Lefsky et al. [69] found that maximum canopy height alone explained about 73% of the variance in AGB. Maltamo et al. [71] found that LiDAR metrics related to canopy density were useful in the prediction of AGB in spruce-dominated forests. Nelson et al. [70] found that average height, quadratic mean height, and canopy cover accounted for the most of the variation in AGB. Lim and Treitz [77] concluded that laser-based quantiles of canopy heights are useful in predicting forest biomass.

Landsat
The reduced accuracy estimates, such as the percent variance explained (pseudo-R 2 ) and RMSE, of RF-models compared to the MLR models are justified. The MLR relies on all observations in the training data to minimize the residuals of prediction, while RF performs using bootstrap samples. The RF prediction is the average of the predictions by numerous tree models, each built from a bootstrap sample of the training data, applied to out-of-bag data for a target observation [34]. An estimate of error rate in the RF algorithm is obtained from the difference of out-of-bag predictions and corresponding observations. Since RF predictions are out-of-bag cross validated, the variance of predictions is generally larger in RF compared to MLR [78]. This implies that pseudo-R 2 in RF would be less than the R 2 in MLR, when the same set of predictors is used in both approaches. Because the CMS data were representative of all forest types in the study sites and satisfied the assumptions of parametric regression (e.g., normal distribution of plot-level AGB), the MLR models obtained after screening of collinear and insignificant predictors are reliable for application in operational forestry settings. In addition, MLR models are easy to apply using GIS platforms to quickly produce maps, whereas implementation of RF models can be time-intensive for large geographical areas. The better performances of MLR than RF are also observed in other studies (e.g., [78,79]).
Spatial and temporal match of field sampling and remote sensing data significantly influence model accuracy [80,81]. Intuitively, very accurate models need to be site-specific, species-specific and scale-dependent. This implies that models are first formulated by integrating field and remotely sensed data of closely matching resolutions, and subsequently extended spatially in the same coverage area using spatial predictors at the same resolutions [82]. Non-equivalence of all model predictions and FIA field observations can be attributed to spatial and temporal mismatch of FIA plot data as well as inaccuracy in the precise coordinates of FIA plot locations. The FIA plot validation data contained measurements that occurred between 2012 and 2015, while the remote sensing predictors were for the year 2014. The spatial resolution of predictors in the study was 30-m while each FIA plot with four subplots is spread over a block of a 3 × 3 pixel window. A single FIA subplot is 168 m 2 which is less than one-fifth of the area of 30-m pixel (900 m 2 ). It is understandable that the smaller size of FIA subplots resulted in a poor match with pixel-level predictions because of the mixed spectral signals of 30-m pixels. Furthermore, reduced precision in the coordinates of FIA plot centers poorly links in situ data with corresponding remote sensing variables. Published studies have shown that larger circular plots (10-25 m radius) improve model fit and prediction accuracy of forest inventories, because a higher degree of spatial overlap between ground data and remotely sensed metrics are attained for any given GPS error [17,82,83]. Furthermore, edge-effects (i.e., counting of trees along plot boundary as in or out) are reduced with larger plot sizes [58]. One more source of error in our models was the 12.7 cm minimum DBH for sample tree measurements in the CMS plots. This DBH minimum restricted field measurements of smaller trees while LiDAR metrics captured those in the reference data. Published studies have also indicated that the selection of predictors and model types plays a critical role in biomass modeling and different methods may be necessary for diverse forest conditions [29,84,85]. In the remote sensing community, there is differing opinion regarding the replication potential of a statistical model and the predictor metrics across different forest types. One contribution of this study is the finding that data from the different forest types across the eastern USA can be integrated with large-scale LiDAR acquisitions (provided the sensors have similar configuration) or Landsat data with little impact on model performance compared to site-specific models. There is minimal penalty in the application of LiD.LS generic versus site-specific models, if the stratification is ignored due to a small number of field plots (for both model training and validation) in a regional assessment.

Conclusions
Within the context of remote sensing-assisted NGHGIs, this study highlights the potential use of a generic model for estimation of forest AGB across diverse forest types. The overall conclusion is that the generic model provided accuracy comparable to site-specific models in which local forest types and growing conditions were pooled. The generic model using both LiDAR and Landsat variables explained more than twice the variance of the generic model using only Landsat variables. In conditions where the samples of field plots by forest or species types are insufficient for model training, it is useful to design a generic model that renders a minor penalty with respect to site-specific models, if the objective is to reduce costs for application over large geographic areas. Some published studies have shown that leaf-off data produce better models than leaf-on data for AGB mapping; however, we recommend future studies that apply leaf-off data to examine a similar modeling framework as conducted in this study. Figure A1. Aboveground biomass distribution maps in the study areas in Maine (ME) and Minnesota (MN), based on the site-specific (left column) and generic models (right column) dependent only on Landsat derived predictors.

LiDAR Metrics Description DenStra1
Total returns in the vertical height interval of 1 to 4 m divided by 3 DenStra2 Total returns in the vertical height interval of 4 to 8 m divided by 4 DenStra3 Total returns in the vertical height interval of 8 to 16 m divided by 8 DenStra4 Total returns in the vertical height interval of 16 to 32 m divided by 16 DenStra5 Total returns in the vertical height interval of 32 to 64 m divided by 32 Stratum0 Total return proportion in the vertical height interval 0 to 1 m Stratum1 Total return proportion in the vertical height interval 1 to 4 m Stratum2 Total return proportion in the vertical height interval 4 to 8 m Stratum3 Total return proportion in the vertical height interval 8 to 16 m Stratum4 Total return proportion in the vertical height interval 16 to 32 m Stratum5 Total return proportion in the vertical height interval 32 to 64 m CovMean Percentage of all returns above mean (all returns above mean × 100/total count of all returns) CovMode Percentage of all returns above mode per pixel (all returns above mode × 100/total count of all returns) CovDBH Percentage of all returns above DBH per pixel (all returns above DBH × 100/total count of all returns) ElevAAD Average absolute deviation of elevations of all returns above DBH ElevAv Average of elevations of all returns above DBH CRR Canopy relief ratio ((mean − min)/(max − min)) of elevations of all returns Figure A2. Aboveground biomass distribution maps in the study areas in Pennsylvania-New Jersey (PANJ) and South Carolina (SC), based on the site-specific (left column) and generic models (right column) dependent only on Landsat derived predictors. Table A1. Description of 54 different LiDAR metrics used in the study.

DenStra1
Total returns in the vertical height interval of 1 to 4 m divided by 3 DenStra2 Total returns in the vertical height interval of 4 to 8 m divided by 4 DenStra3 Total returns in the vertical height interval of 8 to 16 m divided by 8 DenStra4 Total returns in the vertical height interval of 16 to 32 m divided by 16 DenStra5 Total returns in the vertical height interval of 32 to 64 m divided by 32 Stratum0 Total return proportion in the vertical height interval 0 to 1 m Stratum1 Total return proportion in the vertical height interval 1 to 4 m Stratum2 Total return proportion in the vertical height interval 4 to 8 m Stratum3 Total return proportion in the vertical height interval 8 to 16 m Stratum4 Total return proportion in the vertical height interval 16 to 32 m Stratum5 Total return proportion in the vertical height interval 32 to 64 m CovMean Percentage of all returns above mean (all returns above mean × 100/total count of all returns) CovMode Percentage of all returns above mode per pixel (all returns above mode × 100/total count of all returns) CovDBH Percentage of all returns above DBH per pixel (all returns above DBH × 100/total count of all returns) ElevAAD Average absolute deviation of elevations of all returns above DBH ElevAv Average of elevations of all returns above DBH CRR Canopy relief ratio ((mean − min)/(max − min)) of elevations of all returns ElevCM Cubic mean of elevations of all returns ElevCV Coefficient of variation of elevations of all returns above DBH Table A1. Cont.

ElevIQ
Interquartile range of elevations of all returns above DBH ElevKurt Kurtosis of elevations of all returns above DBH ElevL1 First L-moment of elevations of all returns above DBH ElevL2 Second L-moment of elevations of all returns above DBH ElevL3 Third L-moment of elevations of all returns above DBH ElevL4 Fourth L-moment of elevations of all returns above DBH ElevLCV L-moment coefficient of variation of elevations of all returns above DBH ElevLkurt L-moment kurtosis of elevations of all returns above DBH ElevLskew L-moment skewness of elevations of all returns above DBH EMADmed Median of the absolute deviations from the overall median of elevations EMADmod Mode of the absolute deviations from the overall mode of elevations ElevMax Maximum of elevations of all returns above DBH ElevMin Minimum of elevations of all returns above DBH ElevMod Mode of elevations of all returns above DBH