Forest Structural Estimates Derived Using a Practical, Open-Source Lidar-Processing Workﬂow

: Lidar data is increasingly available over large spatial extents and can also be combined with satellite imagery to provide detailed vegetation structural metrics. To fully realize the beneﬁts of lidar data, practical and scalable processing workﬂows are needed. In this study, we used the lidR R software package, a custom forest metrics function in R, and a distributed cloud computing environment to process 11 TB of airborne lidar data covering ~22,900 km 2 into 28 height, cover, and density metrics. We combined these lidar outputs with ﬁeld plot data to model basal area, trees per acre, and quadratic mean diameter. We compared lidar-only models with models informed by spectral imagery only, and lidar and spectral imagery together. We found that lidar models outperformed spectral imagery models for all three metrics, and combination models performed slightly better than lidar models in two of the three metrics. One lidar variable, the relative density of low midstory canopy, was selected in all lidar and combination models, demonstrating the importance of midstory forest structure in the study area. In general, this open-source lidar-processing workﬂow provides a practical, scalable option for estimating structure over large, forested landscapes. The methodology and systems used for this study offered us the capability to process large quantities of lidar data into useful forest structure metrics in compressed timeframes.


Introduction
Light detection and ranging (lidar) is an active remote-sensing technique that uses laser light ranging to record the distance from the sensor to the surface of the object it strikes. Discrete-return lidar systems are increasingly used in forestry and natural resource fields [1]. Lidar data are increasingly accessible over larger spatial extents. For example, the USGS 3D Elevation Program is acquiring complete lidar coverage over the United States while providing open access through data download portals [2]. Simultaneously, collection platforms and sensor technology have become cheaper and more compact, which allows for new platforms to procure lidar data from drones and terrestrial scanning devices. Consequently, this has resulted in greater ability to collect lidar quickly at finer spatial resolution. This increase in availability and abundance of lidar data provides opportunities for land managers to create and use fine spatial resolution and large spatial extent vegetation information to inform and monitor their management activities. However, exploiting this increasing volume of data presents major processing challenges that require the use of open-source, flexible software; and the development of practical implementation workflows [3].
When combined with field plots, lidar data can be used to estimate forest stand density, basal area, vegetation biomass [4][5][6][7], and understory canopy structure [8], which gives lidar data a distinct advantage over passive spectral sensors that cannot observe structure under dense-canopied forests. Some recent studies have shown that fine-resolution remotely

Forestry Plot Data
The 246 forested field plots used in this study were established between December 2017 and March 2018. These plots were designed to be related to remotely sensed data following recommendations from Hogland et al., 2020 [27]. The plots were 36 m square plots containing four 9 m diameter nonoverlapping subplots. The area of the subplots, where observations were taken, represented 78.5% of the total area of the plot. Subplot observation data consisted of tree species, tree condition, and diameter at breast height (DBH) for each tree within the subplots, as well as vegetation cover for the entire subplot and subplot tree counts. These plots covered 27 unique natural communities as defined by the Florida Natural Areas Inventory [28], of which the most common were pine plantations, wet and mesic flatwoods, and sandhills.
Forest structural metrics produced in this study were trees per hectare (TPH), basal area, and quadratic mean diameter (QMD) for living trees. TPH was the number of tree stems in a hectare area; basal area was the cross-sectional area of trees at breast height with the units of meters squared per hectare (m 2 /ha). Within each subplot, individual tree's DBH were measured in square inches; this measurement was used to calculate the basal area in square feet using Equation (1): The basal area for each tree was converted to square meters and then summed to the subplots; an expansion factor of 9.8 was used to convert from square meters per subplot to meters square per hectare. QMD is a commonly used measure of central tendency in forestry, and was calculated using (Equation (2)): where Di is the diameter at breast height in centimeters of the ith tree and n is the number of trees in the plot. Summary statistics from our forested field plots are shown in Table 1.

Forestry Plot Data
The 246 forested field plots used in this study were established between December 2017 and March 2018. These plots were designed to be related to remotely sensed data following recommendations from Hogland et al., 2020 [27]. The plots were 36 m square plots containing four 9 m diameter nonoverlapping subplots. The area of the subplots, where observations were taken, represented 78.5% of the total area of the plot. Subplot observation data consisted of tree species, tree condition, and diameter at breast height (DBH) for each tree within the subplots, as well as vegetation cover for the entire subplot and subplot tree counts. These plots covered 27 unique natural communities as defined by the Florida Natural Areas Inventory [28], of which the most common were pine plantations, wet and mesic flatwoods, and sandhills.
Forest structural metrics produced in this study were trees per hectare (TPH), basal area, and quadratic mean diameter (QMD) for living trees. TPH was the number of tree stems in a hectare area; basal area was the cross-sectional area of trees at breast height with the units of meters squared per hectare (m 2 /ha). Within each subplot, individual tree's DBH were measured in square inches; this measurement was used to calculate the basal area in square feet using Equation (1): The basal area for each tree was converted to square meters and then summed to the subplots; an expansion factor of 9.8 was used to convert from square meters per subplot to meters square per hectare. QMD is a commonly used measure of central tendency in forestry, and was calculated using (Equation (2)): where D i is the diameter at breast height in centimeters of the ith tree and n is the number of trees in the plot. Summary statistics from our forested field plots are shown in Table 1.

Lidar Data
Lidar data for this study were obtained from the United States Geological Survey 3D Elevation (3DEP) Program [2]. Lidar was delivered as four collections of classified LAS version 1.4 files. We refer to these collections as Leon, Block 2, Block 3, and Choctawhatchee ( Figure 1). The Leon lidar collection consisted of 876 (5000 ft × 5000 ft) tiles covering all of Leon County. Leon data had a nominal pulse spacing (NPS) of 0.35 m and was acquired between 5 February 2018 and 25 April 2018 [29]. Blocks 2 and 3 consisted of 6845 and 6598 (1 km × 1 km) tiles, respectively, and covered the western and southern counties surrounding Leon County (Figure 1). Block 2 and Block 3 lidar were collected in early 2018 and had an NPS of 0.7 m [30]. The Choctawhatchee data were acquired in early 2017, and had an NPS of 0.7 m. The Choctawhatchee lidar collection consisted of 3893 (1 km × 1 km) tiles [31].

Relative Density Canopy Cover Function
Analysis was conducted using the R statistical software [32], and the R package lidR (version 3.1.2) [22,33]. This software package is free, open source, and available for multiple operating systems. In addition, lidR is readily customizable and provides many advanced functions for working with lidar data for forestry applications. For correlation analysis and figure production, the following R packages were utilized: PerformanceAnalytics [34], corrplot [35], and Hmisc [36].
The relative density canopy cover function (RDCC) is a custom forest metric function we built using tools provided in the lidR package [22,33]. This function provided specific forest metric outputs (Table 2) identified as useful for quantifying tree volume, density, and canopy cover in previous work [5,11]. Additionally, cover and density metrics specific to multiple forest canopy layers (shrub, low midstory, high midstory, and upper canopy) were produced using the RDCC function, as these have been identified as important for multiple forest management objectives [8,11,12]. Table 2. List of the relative density canopy cover (RDCC) function output bands, variable names, and descriptions of the output variables. These variables were forest metrics derived from LAS point cloud input files.  We applied the RDCC function to our four lidar datasets in R version 4.0.1 using lidR's "LAScatalog processing engine" [33], which is a tool to manage large numbers of lidar point cloud files. LidR's LAScatalog processing engine handles buffering, merging, and parallel processing required to efficiently process large LAS collections [33]. The RDCC function also includes vertical and horizontal unit checks of the LAS data, and automatically converts units to meters from feet, as was the case for Leon County and Choctawhatchee LAS files. The adjustable output raster horizontal grid cell size was set to 5 m. The RDCC function's multiband TIF raster output was reprojected to UTM Zone 16N using a separate R function.
The RDCC function took LAS or LAZ file inputs and outputted 2D multiband raster .tiff files containing 28 variable bands (Table 2) at a user-defined cell size. The first four variable band outputs from the RDCC function were general point cloud summary metrics useful for QA/QC and the generation of the proceeding 24 variable bands. Band 1 was the total number of returns in a cell, and was used as the denominator in the relative density variable bands. Band 3 was the number of first returns in a cell, and was used in the canopy cover calculations. Band 2 was the number of ground returns, and band 4 was the ground elevation. These bands defined ground points in one of two ways. If point clouds were classified following the American Society for Photogrammetry and Remote Sensing LAS classification format 1.4 (or previous versions) [37], RDCC defined ground points as any points of code 2 (ground), 7 (low points), 9 (water), or 11 (road surface). If there was at least one ground point in a cell, the ground elevation (variable band 4) was defined as the mean elevation of ground points. If there were no classified ground points within a cell, then ground elevation was defined as the mean elevation of the lower 5% of all points. Ground elevation was then used to normalize the height of remaining cell points (i.e., height above ground) to derive the canopy metrics shown in Table 2.
This memory-resident method of normalizing point clouds is atypical, but was used in this study to address several issues that arose when processing large LAS datasets. The first issue with normalizing point cloud data using traditional LAS functions, even those included in the lidR package, is that they roughly double the size of LAS datasets by writing new normalized LAS files to disk. This process would have increased data storage requirements from~10.7 TB to~21 TB, and significantly increased processing time. In addition, processing time can be compounded due to error handling in edge cases. This method avoided those issues; however, ground point and normalization routines currently used in the RDCC function should only be deployed with fine spatial resolution cell sizes and in areas without significant topography. This study area did not have severe topographical changes, as can be seen in Table 1. When LAS file height normalization was run prior to processing the RDCC function had to be modified to account for this prior normalization. The R code for the RDCC function is provided in Appendix A.
RDCC output bands 5 and 6 provided means and standard deviations, respectively, for all relative heights within the cell. Bands 7 through 13 were the relative height bands at set height percentile increments. Relative height was the height above ground elevation Remote Sens. 2021, 13, 4763 6 of 27 at which the indicated percentage of returns was reached. Bands 14 to 28 used set height breaks that were meant to reflect the canopy layers of interest in the study area based off [11]. These layers were the shrub layer at a 0.6096 m to 3.048 m height, the high shrub or low midstory layer at 3.048 m to 6.096 m, the high midstory layer of 6.096 m to 14.935 m, and the upper canopy layer with returns greater than 14.935 m. Bands 14 to 16 were relative density within the shrub, low, and high midstory layers. Bands 17 to 20 were relative density greater than the lower range of each canopy height break and lower than the upper range of each canopy height break. The equation for relative density is shown as Equation (3): where RD is relative density, a is all returns in a cell, LH is the lower height break, and UH is the upper height break. Bands 21 to 24 were canopy cover greater than the lower range of each canopy height break and lower than the upper range of each canopy height break. The equation for canopy cover is shown as Equation (4): where CC is canopy cover, f is first returns in a cell, LH is the lower height break, and UH is the upper height break. The final four bands, 25 to 28, were the mean of all relative heights greater than the lower range of each of the four canopy height breaks.

Cloud Processing
This study used a Microsoft Azure cloud computing environment for data processing as part of an AI for Earth grant. A Linux virtual machine running Ubuntu server 20.04 LTS with 32 GB RAM was deployed in our Azure cloud environment and loaded with R Studio software (version 1.4) package. Our custom lidR function RDCC was run in parallel on this virtual machine using 30 logical cores. Four lidar datasets, totaling 18,212 LAS files and~10.77 terabytes (TB), were processed separately. The lidar data, as LAS 1.4 files, were stored in blob storage in our Azure environment and were accessed from our Linux virtual machine; raster outputs were also saved back to this blob storage container, and links to these output folders were shared directly with project partners.

Forest Metric Models and Raster Outputs
Model development and raster analysis were performed using R statistical software (R Core Team, 2021), ESRI geographic information system (ArcGIS), and the ArcGIS plug-in RMRS Raster Utility toolbar [38]. RDCC output rasters of lidar data were transformed into predictive rasters with 5 m cell resolution, which represented the values at that point as if it was the center of a 40 m × 40 m forest plot. This was done using the tools provided in the RMRS Raster Utility toolbar, specifically the 'Extract Bands', 'Focal Analysis' using the mean and standard deviation and an 8 × 8 multiplier, and 'Create Composite' tools [38]. These tools calculated the plot-level mean and standard deviation of bands 5 through 24 of the RDCC outputs, a total of 40 variables, across the study area. The cell values of this 40-band raster were extracted at the forest plot locations, using the 'Sample Values' tool found in the RMRS Raster Utility toolbar [38]. These values were appended to the forest plot summary tables that were subsequently used in the R statistical software for model development.
Predictor variables were selected from the 40 available variables to create general linear models of BAH, TPH, and QMD following the R script sequential general additive model (GAM) routine detailed in Hogland et al., 2019 [39], using the gaussian family and identity link distribution, alpha of 1, and increases in percent deviance threshold set to 0.005. These conservative thresholds, especially the alpha value, were set to allow for the least number of variables to be removed from the model at this step. The second variable selection step was to create a general linear model using all the variables selected in the first step and then to manually perform a reverse stepwise selection using the significance of each variable. This step was used to remove non-significant variables while maximizing the models' overall adjusted r 2 value. Once this step was complete, further variable selection was performed by comparing models with different variables using their Akaike information criterion (AIC) [40], root mean square error (RMSE), and mean absolute error (MAE) scores, with the emphasis on selecting the most parsimonious model for each forest metric.
For the imagery-only models and the combined imagery and lidar models, we used pre-Hurricane Michael Sentinel-2 data from previous work in this study area [41]. This imagery data consisted of the blue, green, red, and near-infrared spectral bands with a spatial resolution of 10 m, and was a combination of images taken in two phenological time periods, winter and spring, with individual scene acquisition dates in 2017 and 2018. Sentinel-2-only models of BAH that were presented in [41] were used directly for comparison here, while Sentinel-2 models of TPH and QMD were created in R using the spectral data values of our field plots from the same study. These models had variables designated following the same forward stepwise GAM selection, and backwards manual selection processes as the RDCC (lidar-only) models, the exception being that combination models required at least one predictor variable from both lidar and Sentinel-2 imageryderived data. Figure 2 displays our general workflow.
Remote Sens. 2021, 13, x FOR PEER REVIEW 7 of 28 of each variable. This step was used to remove non-significant variables while maximizing the models' overall adjusted r 2 value. Once this step was complete, further variable selection was performed by comparing models with different variables using their Akaike information criterion (AIC) [40], root mean square error (RMSE), and mean absolute error (MAE) scores, with the emphasis on selecting the most parsimonious model for each forest metric.
For the imagery-only models and the combined imagery and lidar models, we used pre-Hurricane Michael Sentinel-2 data from previous work in this study area [41]. This imagery data consisted of the blue, green, red, and near-infrared spectral bands with a spatial resolution of 10 m, and was a combination of images taken in two phenological time periods, winter and spring, with individual scene acquisition dates in 2017 and 2018. Sentinel-2-only models of BAH that were presented in [41] were used directly for comparison here, while Sentinel-2 models of TPH and QMD were created in R using the spectral data values of our field plots from the same study. These models had variables designated following the same forward stepwise GAM selection, and backwards manual selection processes as the RDCC (lidar-only) models, the exception being that combination models required at least one predictor variable from both lidar and Sentinel-2 imagery-derived data. Figure 2 displays our general workflow. The secondary raster processing took the rasters produced from the RDCC function and turned them into forest metric rasters. This portion of the overall workflow can be conducted using any geographic or spatial software, and uses complimentary independent data such as the spectral variables we used in this study.

Results
The RDCC function output resulted in a multiband raster dataset of 18,212 ~1 km × 1 km files (1524 m × 1524 m for Leon) of 5 m spatial-resolution-summarized lidar variables. This process took 28.5 h to complete for the entire ~22,900 km 2 study area. These multiband rasters provided useful outputs without further processing. However, a linear artifact was noted in the raster outputs from Blocks 2 and 3 in the 50% percentile relative height band and shrub relative density bands. This linear gap of no data correlated to the edge of flight lines and was generally as wide as the raster's 5 m cell size, but extended over the edge of the entire flight line. We determined that this midstory gap was due to an extreme scan angle at the scan overlap juncture. We attempted to fix this artifact by using all points, and not excluding the points categorized as withheld by the original contractor and using the scan angle of the returns to individually exclude points in the point cloud by their scan angle. By doing this, we were able to remove the linear gaps in our rasters; however, this also created different artifacts in our other RDCC bands, and therefore this method was not used.
Combined models and lidar-only models showed very similar results (Table 3), while the Sentinel-2 imagery-only models had the highest measure of errors and lowest adjusted r 2 across all response variables when compared with the models that used lidar data. Combination models had the highest adjusted r 2 values and lowest AIC scores, except for the TPH models and the MAE for QMD models, where the RDCC models performed slightly better. The Sentinel-2 basal area model used here varied slightly from the one presented The secondary raster processing took the rasters produced from the RDCC function and turned them into forest metric rasters. This portion of the overall workflow can be conducted using any geographic or spatial software, and uses complimentary independent data such as the spectral variables we used in this study.

Results
The RDCC function output resulted in a multiband raster dataset of 18,212~1 km × 1 km files (1524 m × 1524 m for Leon) of 5 m spatial-resolution-summarized lidar variables. This process took 28.5 h to complete for the entire~22,900 km 2 study area. These multiband rasters provided useful outputs without further processing. However, a linear artifact was noted in the raster outputs from Blocks 2 and 3 in the 50% percentile relative height band and shrub relative density bands. This linear gap of no data correlated to the edge of flight lines and was generally as wide as the raster's 5 m cell size, but extended over the edge of the entire flight line. We determined that this midstory gap was due to an extreme scan angle at the scan overlap juncture. We attempted to fix this artifact by using all points, and not excluding the points categorized as withheld by the original contractor and using the scan angle of the returns to individually exclude points in the point cloud by their scan angle. By doing this, we were able to remove the linear gaps in our rasters; however, this also created different artifacts in our other RDCC bands, and therefore this method was not used.
Combined models and lidar-only models showed very similar results (Table 3), while the Sentinel-2 imagery-only models had the highest measure of errors and lowest adjusted r 2 across all response variables when compared with the models that used lidar data. Combination models had the highest adjusted r 2 values and lowest AIC scores, except for the TPH models and the MAE for QMD models, where the RDCC models performed slightly better. The Sentinel-2 basal area model used here varied slightly from the one presented in [41], as that study used a hurdle modeling approach to exclude field plots with zero basal area (n = 2), while this study included the zero basal area plots. Table 3. Model results for basal area per hectare (BAH), trees per hectare (TPH), and quadratic mean diameter (QMD). Variables were either from only lidar (RDCC), only satellite spectral imagery (Sentinel-2), or a combination of both (Combo); v is the number of variables used in the model, RMSE is the root mean squared error, MAE is the mean absolute error, and AIC is the Akaike information criterion. Best model results for each response variable are in bold. The full list of the variables selected for use in these general linear models is shown in Appendix B. The most selected variable was the relative density 3.048 m to 6.096 m variable (RD_10to20ft), followed by the canopy cover greater than 3.048 m variable (CC_gt10ft). RD_10to20ft was selected in every single model when it was available, while the CC_gt10ft was selected in all the RDCC and combo BAH and TPH models. These two variables were moderately correlated, but still provided complimentary information about the forest structure. The correlation chart of the four variables used in the BAH RDCC model is presented in Figure 3, and shows the variables' distribution in the diagonal boxes, as well as their bivariate correlation scatterplots, correlation scores, and significance. in [41], as that study used a hurdle modeling approach to exclude field plots with zero basal area (n = 2), while this study included the zero basal area plots. The full list of the variables selected for use in these general linear models is shown in Appendix B. The most selected variable was the relative density 3.048 m to 6.096 m variable (RD_10to20ft), followed by the canopy cover greater than 3.048 m variable (CC_gt10ft). RD_10to20ft was selected in every single model when it was available, while the CC_gt10ft was selected in all the RDCC and combo BAH and TPH models. These two variables were moderately correlated, but still provided complimentary information about the forest structure. The correlation chart of the four variables used in the BAH RDCC model is presented in Figure 3, and shows the variables' distribution in the diagonal boxes, as well as their bivariate correlation scatterplots, correlation scores, and significance.  As seen in Figure 3, RD_10to20ft was only slightly correlated with one of the other two variables selected in the model besides CC_gt10ft, while CC_gt10ft was strongly correlated with both. The correlation plot of the BAH combo model, which included the Sentinel-2-derived PCA variables and is shown in Figure 4, demonstrates that the RD_10to20ft variable was also not significantly correlated with the Sentinel-2 variables selected in that model. This pattern continued across all models and variables (see Figures A1-A6), suggesting that RD_10to20ft was capturing unique information about forest structure that other variables were not, including the Sentinel-2 imagery-derived variables. CC_gt10ft contained much of the same information that the other variables provided, leading to these two combined variables capturing much of the information provided in all RDCC variables as they related to basal area, trees per area, and quadratic mean diameter.

Response
Remote Sens. 2021, 13, x FOR PEER REVIEW 9 of 28 As seen in Figure 3, RD_10to20ft was only slightly correlated with one of the other two variables selected in the model besides CC_gt10ft, while CC_gt10ft was strongly correlated with both. The correlation plot of the BAH combo model, which included the Sentinel-2-derived PCA variables and is shown in Figure 4, demonstrates that the RD_10to20ft variable was also not significantly correlated with the Sentinel-2 variables selected in that model. This pattern continued across all models and variables (see Figures  A1-A6), suggesting that RD_10to20ft was capturing unique information about forest structure that other variables were not, including the Sentinel-2 imagery-derived variables. CC_gt10ft contained much of the same information that the other variables provided, leading to these two combined variables capturing much of the information provided in all RDCC variables as they related to basal area, trees per area, and quadratic mean diameter. Using the RMRS Raster Utility Arc plugin [38], the RDCC models from Table 3 (also see Figures A7-A9) were used to produce raster surface outputs of BAH, TPH, and QMD. These raster outputs were shared with forestry professionals in the study area to assess their practical and qualitative value, and were given a positive assessment. Sharing was facilitated by providing weblinks to cloud-stored raster data given to practitioners. Figure  5 shows the BAH raster that was produced from our RDCC model and distributed over the entire ~22,900 km 2 study area. Using the RMRS Raster Utility Arc plugin [38], the RDCC models from Table 3 (also see Figures A7-A9) were used to produce raster surface outputs of BAH, TPH, and QMD. These raster outputs were shared with forestry professionals in the study area to assess their practical and qualitative value, and were given a positive assessment. Sharing was facilitated by providing weblinks to cloud-stored raster data given to practitioners. Figure 5 shows the BAH raster that was produced from our RDCC model and distributed over the entire~22,900 km 2 study area. Remote Sens. 2021, 13, x FOR PEER REVIEW 10 of 28

Discussion
In this study, we used the lidR R package and our lidR user-extension function, referred to as RDCC (Appendix A), to rapidly process a very large (~11 TB) lidar point cloud dataset into multiband rasters of height, relative density, and canopy cover forest structure metrics. This process was relatively user friendly; and used free, open-source, and scalable software, lowering the barriers for using lidar data at regional scales. It was also

Discussion
In this study, we used the lidR R package and our lidR user-extension function, referred to as RDCC (Appendix A), to rapidly process a very large (~11 TB) lidar point cloud dataset into multiband rasters of height, relative density, and canopy cover forest structure metrics. This process was relatively user friendly; and used free, open-source, and scalable software, lowering the barriers for using lidar data at regional scales. It was also a robust and transferable process that worked with lidar datasets with different units, without classified ground points, and even from different lidar platforms. This transferability allowed us to rapidly create products for forest managers without first needing to classify the point cloud, which could be very important in time-sensitive scenarios such as during postdisaster forest-damage assessments.
The free, open-source, and scalable design allows this workflow to be used on local computers, a hybrid cloud computing environment, or a fully cloud-based environment using any modern operating system. We chose to use a distributed cloud computing environment because of the advantages such a setup could provide, specifically scalability and processing efficiency [42]. The lidR package's parallel-processing capabilities allowed this workflow to be easily scalable [33]. Scalability in the cloud environment refers to the ability to easily increase the number of our virtual machine's logical cores, the number of virtual machines, and increases in data storage without ever having to buy or maintain hardware [42]. For example, by reading lidar point cloud data directly from a blob storage container (and writing raster products back to this container), we were able to reduce storage costs compared to adding large, high-speed solid-state drives, and still gained the processing benefit of collocating our data with computational resources. However, we did not have to rely heavily on the scalability of the cloud environment to process our data efficiently. The R studio, lidR software package, and our RDCC function were efficient enough at processing large lidar datasets that we did not encounter a processing bottleneck when using our moderately equipped VM, which had the processing power equivalent to our physical workstations. To process much larger extents, or much higher density lidar datasets, such as state or regional scale projects, additional scaling of the cloud environment would be required for timely processing.
As a general comparison, this study processed lidar data covering 22,900 km 2 into 28 band 5 m cell GeoTIFFs in 28.5 h, while the equivalent processing software 'Laserchicken' reported processing a 30,000 km 2 lidar dataset into 22-band, 10 m cell GeoTIFFs over 96 h [3]. This is not a direct comparison, as 'Laserchicken' calculates several process-intensive lidar features, such as eigenvalues, that our RDCC function does not. However, it demonstrated that this workflow performed well compared to other lidar-processing studies.
Our forest metric model results also compared favorably with previous studies.  [9]. These studies had higher lidar point densities and covered smaller study areas with a smaller range of forest types than our study, yet our best model results for BAH and QMD had similar or better r 2 and RMSE values (adj. r 2 = 0.80, adj. r 2 = 0.58; and RMSE = 7.59, RMSE = 5.428, respectively).
In this study, we showed that lidar-derived canopy metrics produced better models of forest structure when compared with Sentinel-2-only models, though this is not always that case, as [9] demonstrated. Additionally, when we combined imagery and lidar variables in our forest metric models, we created marginally better estimates for two out of our three forest metrics. However, our model comparison results were possibly influenced more by spatial resolution then data type, as the Sentinel-2 imagery data had a 10 m cell size, meaning only 16 cells were within each field plot. The lidar-data-derived raster's cell size was 5 m, allowing for 64 cells for each field plot, and therefore contained more textural information. Our correlation matrix of variables (Figures 3 and 4) did show that the lidar data contained important midstory variables that the spectral imagery alone did not. The degree of difference in our model results due to spatial resolution and observations between lidar and spectral imagery is yet to be determined.
We chose to use easily interpretable general linear models for our forest metric models. General additive models (GAMs) were also explored to create the BAH, TPH, and QMD model estimates, but were not found to add significant predictive power over the general linear models. The interpretability of general linear models, over more 'black box' models such as RF, allowed us to easily identify the selection of the low midstory relative-density variable in all RDCC and combination forest metric models, and some studies have found that linear models outperformed RF when modeling forest metrics from lidar data [43]. Our variable correlation results (Figures 3, 4 and A1-A6) indicated that the relative density of low midstory was a variable that captured unique data in our study area that was important for estimating several forest metrics. We are not aware of any other studies that have shown the importance of a lidar-derived midstory metric for estimating a broader forestry metric such as BAH, TPH, or QMD. We also discovered a linear artifact in some datasets that affected points at the edges of flight lines related to this important low midstory metric. This was the result of high scan angles at the flight line overlaps. High scan angles were shown to impact forest metrics estimates in at least one other study [46]. Future airborne lidar contracts should address this issue by incorporating stricter limits on allowable scan angles, which will help avoid gaps in the midstory canopy at flight line overlaps.
This study focused on estimating forest metrics using an area-based approach. A potential next step would be to develop similar workflows focused on individual tree observations from high-resolution point clouds that currently face the same processing bottleneck that we addressed in this study. Previous studies have already shown the value of lidar data from drone platforms, such as Dalla Corte et al., 2020, who estimated tree diameter and height from drone-based lidar using R software packages [47]. Processing these individual tree metrics from lidar datasets is becoming more accessible using free and open-source R software packages [48]. To process large-extent lidar datasets using an individual tree approach efficiently will require the creation of workflows such as that presented in this study.

Conclusions
While lidar data is increasingly available, analysis of lidar data for forest structure is not yet ubiquitous. This can lead to natural resource managers being data rich, but information poor. In this study, we presented a practical, scalable, and free workflow to help derive useful information from large lidar datasets. We used free, open-source, and user-extendable software in a cloud-based environment to efficiently process~11 TB of lidar point cloud data covering 22,900 km 2 of a diverse landscape containing at least 27 natural communities. Our lidar processing workflow outputted 28-band, 5 m cell spatial resolution height normalized GeoTIFFs, containing relative height, relative density, and canopy cover data. These outputs were used to develop general linear models of BAH, TPH, and QMD, which outperformed Sentinel-2 imagery-based models in our study area, and performed as well or better than previous studies using higher-density point clouds in much smaller, less structurally diverse landscapes. This ability to quickly process landscape scale lidar data, with user-created functions, into forest metrics has only recently been possible, as analysis software such as the lidR [22] package have become available. We overcame the processing bottlenecks that large lidar datasets present by using software designed for parallel processing (lidR) coupled with a distributed cloud computing environment. In creating our models of BAH, TPH, and QMD, we discovered the importance of our relative density of low midstory lidar variable (RD_10to20ft). This variable was shown to contain information that was not captured in other lidar and Sentinel-2 variables.

Appendix B
The following is the full list of the predictive variables and their descriptions used in the final general linear models of basal area per hectare, trees per acre, and quadratic mean diameter. For details on production and interpretation of Sentinel-2 principal component analysis (PCA) variables, see St. Peter et al., 2020. Variables beginning with the 'Sent' came from Sentinel-2 imagery; all other variables were derived from the Lidar RDCC function described in this paper.