Over the past two decades, airborne Light Detection and Ranging (LiDAR) has become an invaluable tool for remotely quantifying the three-dimensional structure of the forest canopy. This has enabled scientists to estimate forest attributes over large areas that were traditionally measurable only through intensive field campaigns. Attributes predicted with LiDAR include tree count, height, species, stem volume, and aboveground biomass [1
]. Such enhanced forest inventories (EFIs) have proven useful for research and applications in forest and wildlife ecology, forest carbon cycling, and sustainable forest management [4].
The typical methodology used for developing EFIs is called the Area Based
(AB) approach. The AB approach uses predictive modelling to associate plot-based field measurements with explanatory variables derived from a LiDAR dataset covering the same forest area. These models are then applied to estimate forest attributes for new areas without field measurements [7
]. Numerous statistical techniques have been used to develop AB models; among the most common are linear mixed modelling and random forest imputation [8].
Model explanatory variables derived from LiDAR data can take several forms, but most constitute either measurements of point height or of point distribution along the vertical stratum [1
]. Height metrics are typically summary statistics such as the mean, maximum, and percentile heights of points within each grid cell. Distribution metrics quantify the proportion of points found above certain height thresholds. Since their introduction in the early 2000s, these traditional height metrics (THMs) have proven to be effective predictors for modeling forest attributes [11
]. Common software packages currently used for LiDAR EFI modeling, including FUSION [13
] and rLiDAR [14
], are based on extracting THMs from point clouds.
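As a concrete illustration of the metrics described above, the height and distribution statistics for a single grid cell can be computed in a few lines. This is a minimal sketch using hypothetical point heights, not tied to any particular software package:

```python
import numpy as np

# Hypothetical return heights (m) for LiDAR points falling in one grid cell.
heights = np.array([0.2, 1.5, 3.8, 7.9, 12.4, 14.1, 15.3, 16.0, 17.2, 18.5])

# Height metrics: summary statistics of point heights within the cell.
mean_h = heights.mean()
max_h = heights.max()
p95_h = np.percentile(heights, 95)

# Distribution metric: proportion of points above a 2 m threshold,
# a common cutoff for separating canopy returns from ground/understory.
frac_above_2m = (heights > 2.0).mean()

print(mean_h, max_h, p95_h, frac_above_2m)
```

A full THM pipeline would compute dozens of such statistics per cell, which is the source of the multicollinearity concerns discussed below.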
Despite the well-established effectiveness of THMs, these metrics come with several drawbacks. Many THMs are highly correlated with one another, and so care must be taken during model development to avoid issues of multicollinearity [15
]. Furthermore, THMs generated from one LiDAR data set are sometimes not stable when applied to another due to variation in acquisition parameters such as laser penetration, pulse density, and scan angle [16
]. Finally, THMs do not quantify measures of horizontal complexity, so the nuance of distinct tree crown shapes is lost. This may partly account for the difficulty in estimating tree count from THMs reported in previous studies [19].
Like LiDAR modelers, computer vision scientists have struggled to develop meaningful tools for quantifying remotely sensed data. Early computer vision scientists developed a series of metrics for characterizing an image’s content using summary statistics of color histograms, detected edges, and blobs [22
]. This methodology is known as feature engineering, and bears some resemblance to LiDAR’s THMs. Feature engineering, however, has since largely been discarded by the computer vision field in favor of deep learning, which can learn to identify important spatial features in a dataset on its own [23
Deep learning is a form of machine learning, and refers primarily to artificial neural networks of sufficient complexity to interpret raw data without the need for human-derived explanatory variables [23
]. This differs from simpler neural networks (such as perceptrons), which impute attributes using features such as THMs and have been used to estimate forest attributes in remote sensing for many years [24
]. Convolutional neural networks (CNNs), such as those employed here, are a subdivision of deep learning and are distinct in that they interpret spatial data by scanning it using a series of trainable moving windows. The values of these windows are initially randomized. During training, the CNN uses an optimizer to tune these values to identify useful features and objects for estimating the response variable. This is accomplished without user input.
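The moving-window mechanics can be made concrete in plain NumPy. The snippet below scans a single randomly initialized 3 × 3 × 3 window across a toy voxel grid; this is a sketch of the convolution operation itself, not a trainable network (in practice a deep learning framework supplies this operation and an optimizer tunes the window weights):

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy voxel grid, e.g. an 8x8x8 block of occupancy values from a point cloud.
voxels = rng.random((8, 8, 8))

# One 3x3x3 moving window; in a CNN, these randomly initialized weights
# are tuned by an optimizer during training.
kernel = rng.standard_normal((3, 3, 3))

def conv3d_valid(grid, k):
    """Slide the window over the grid ('valid' cross-correlation)."""
    d, h, w = k.shape
    out = np.empty((grid.shape[0] - d + 1,
                    grid.shape[1] - h + 1,
                    grid.shape[2] - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for l in range(out.shape[2]):
                out[i, j, l] = np.sum(grid[i:i+d, j:j+h, l:l+w] * k)
    return out

# Applying a ReLU activation yields one feature map of the layer.
feature_map = np.maximum(conv3d_valid(voxels, kernel), 0.0)
print(feature_map.shape)  # (6, 6, 6)
```

A convolutional layer stacks many such windows, each learning to respond to a different spatial pattern.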
The first CNN was developed in 1995 to classify images of hand-written digits [27
]. However, the technique was largely underrepresented in scientific literature for the following decade, partially owing to computational and software constraints [23
]. This changed in 2012, when a deep CNN (consisting of many layered convolutions) won the ImageNet image classification competition by a wide margin [23
]. Since then, CNNs of increasing complexity and depth have consistently outperformed models based upon feature extraction for computer vision [29
]. Other variants, such as inception and residual models, have also been developed to extract greater numbers of meaningful features by using windows of varying sizes and preserving useful input data as it passes through the model [33
Though CNNs have mostly been developed for computer vision with two-dimensional images, researchers in other fields are beginning to apply these algorithms to novel, three-dimensional problems. A few studies have begun using deep learning to measure and analyze forest attributes. For example, Guan et al. [34
] used a segmentation technique to isolate tree crowns, and then used a neural network to classify species based on point distribution. In another study, Ghamisi et al. [35
] applied a 2D CNN to estimate forest attributes from rasterized LiDAR and hyperspectral data. A 2D CNN is designed to scan two-dimensional images, and is only capable of identifying spatial features along two axes. A 3D CNN uses three-dimensional windows to scan volumetric space and identify spatial features in the X, Y, and Z axes. In this context, a 2D CNN is only capable of scanning a height map derived from a LiDAR point cloud, whereas a 3D CNN is capable of scanning the entire cloud for 3D features such as tree crowns, stems, or branches.
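The difference between the two inputs can be made concrete with a toy example: rasterizing a handful of hypothetical points into a height map discards vertical structure, while voxelizing preserves it (the coordinates and bin sizes below are purely illustrative):

```python
import numpy as np

# Hypothetical points (x, y, z) in metres within a 2 m x 2 m area, 1 m cells.
pts = np.array([[0.2, 0.3, 1.0],
                [0.4, 0.6, 9.5],   # a high return over the same cell
                [1.5, 0.5, 4.0],
                [1.2, 1.8, 0.5]])

cell = 1.0
ij = (pts[:, :2] // cell).astype(int)        # 2D cell index per point

# 2D input: a height map keeps a single value (here max z) per cell,
# discarding everything beneath the canopy surface.
height_map = np.zeros((2, 2))
for (i, j), z in zip(ij, pts[:, 2]):
    height_map[i, j] = max(height_map[i, j], z)

# 3D input: voxel occupancy preserves the vertical distribution of returns.
vbin = 1.0
k = (pts[:, 2] // vbin).astype(int)          # vertical voxel index per point
voxels = np.zeros((2, 2, 10))
for (i, j), kk in zip(ij, k):
    voxels[i, j, kk] += 1

print(height_map[0, 0])      # 9.5 -> the low return at z = 1.0 is lost
print(voxels[0, 0, [1, 9]])  # both returns survive in the voxel column
```

A 2D CNN sees only the first array; a 3D CNN can scan the second for features such as crowns, stems, or branches.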
Presently, 3D CNNs have not been used to interpret LiDAR data for forest inventory; however, there are examples of their use in other fields. One study implemented a 3D CNN for use in airborne LiDAR to identify helicopter landing zones in real time [36
]. Others have used them in conjunction with terrestrial LiDAR to map obstacles for autonomous cars [37
]. One common application has been to identify malignancies using 3D medical scans [39
]. Several studies have also used 3D CNNs for household object detection [42].
The goal of this study is to adapt common 2D CNN implementations to scan the 3D volumetric pixels, or voxels, derived from a LiDAR point cloud to estimate aboveground biomass, tree count, and the percent of needleleaf stems. Several CNN architectures were adapted, beginning with the least complex (LeNet) [27
] and working towards deeper, more contemporary architectures (Inception V3) [45
]. We compare these architectures against one another, noting how performance varies with model complexity and type. The best performing CNNs were then tested against random forest and linear mixed models trained using THMs.
We first compared results in terms of the RMSE and bias of aboveground biomass predictions for each of the five CNN architectures relative to the models trained using THMs (Figure 1
). Four of the five architectures we tested outperformed both the linear mixed model with traditional height metrics (LMM-THM) and the random forest model with traditional height metrics (RF-THM) in terms of RMSE. Only the Inception-V3 CNN exhibited less absolute bias than the RF-THM model and absolute bias equal to that of the LMM-THM.
The best performing CNN architecture was Inception-V3, which was also the most complex, with the greatest number of layers and over 20 million trainable parameters. Inception-V3 had an RMSE of 48.1 Mg/ha, representing an 11% decrease in error from 54.1 Mg/ha with the RF-THM model, and a 16.5% decrease from the 57.6 Mg/ha RMSE of the LMM-THM model. The absolute bias of Inception-V3 was lower than that of the RF-THM model and equal to that of the LMM-THM model, at 1.3 Mg/ha. This was 68% lower than the next best performing CNN architecture, GoogleNet, with a bias of 4 Mg/ha. The architectures using inception layers (GoogleNet and Inception-V3) were the top two performers in terms of both RMSE and bias.
Of the five CNN architectures, only AlexNet performed worse in terms of RMSE than the LMM-THM and RF-THM models. AlexNet’s RMSE and bias were 59.7 and −3.9 Mg/ha, respectively. The next-worst performing CNN was ResNet-50, with an RMSE and bias of 53.8 and 4.2 Mg/ha, respectively. Models using residual layers have garnered considerable popularity in recent years; however, these results indicate that they may not be as effective at quantifying LiDAR data, or at the very least require more fine-tuning. Despite its relative simplicity, LeNet performed slightly better than ResNet-50 in terms of error and equally in terms of bias (53.3 Mg/ha and 4.2 Mg/ha, respectively).
Based on these results for modeling biomass, Inception-V3 was taken to be the best architecture for interpreting LiDAR data, and Inception-V3 models were developed for the other two forest attributes: tree count and percent of needleleaf trees. Results of those models are shown in Table 3
. Each of the Inception-V3 CNNs outperformed the LMM-THM and RF-THM models in terms of RMSE; however, the improvement in accuracy was less pronounced for the tree count and percent needleleaf models than for biomass estimation. Random forest models consistently had a lower RMSE and higher coefficient of determination (pseudo R2
) than the linear mixed models, while the linear mixed models had slightly lower absolute bias than random forest.
The CNN model predicting tree count achieved an RMSE of 2.78 trees, 6% less error than the RF-THM model and 10% less than the LMM-THM. However, the CNN model’s bias was double that of the RF-THM, at 0.2 and 0.1 trees, respectively. The LMM-THM exhibited almost no overall bias in estimating tree count (0.03 trees). With a 10 × 10 m cell size, these biases represent 3–20 trees/ha, which is relatively low for all three models (0.5–2.8%) considering that the mean value across these plots was 714 trees/ha.
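The conversion from per-plot to per-hectare bias is simple arithmetic; a quick check using the RF-THM and CNN per-plot biases, with the plot size and mean stem density reported above:

```python
# A 10 m x 10 m cell covers 0.01 ha, so per-plot figures scale by 100x.
plot_area_ha = 10 * 10 / 10_000
mean_density = 714  # trees/ha, mean observed across the field plots

biases = {"RF-THM": 0.1, "CNN": 0.2}  # trees per plot
for model, b in biases.items():
    per_ha = b / plot_area_ha
    pct = 100 * per_ha / mean_density
    print(f"{model}: {per_ha:.0f} trees/ha ({pct:.1f}% of mean density)")
```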
The CNN model predicting percent needleleaf had a RMSE of 18.7%, which is 2% less than the RF-THM model’s RMSE of 19.1% and 22% less than the LMM-THM model’s RMSE of 24.1%. The percent needleleaf CNN also had 60% less bias than the RF-THM and 50% less absolute bias than the LMM-THM, with a bias of 0.2%. Once again it should be noted that the overall biases of all three percent needleleaf models are only several tenths of a percent, and thus are negligible in any practical context.
In order to determine whether biases were consistent across plots, predicted versus observed values were plotted for each model type/forest attribute combination in Figure 2
. In estimating biomass, the LMM-THM appeared to slightly over-predict in the highest biomass plots, while the CNN appeared to slightly under-predict in those same plots (Figure 2
A,C). It should be noted, however, that few plots fell at these extremes, and that loess regression is susceptible to outliers at the tail ends of the data.
In predicting tree count, the LMM-THM and RF-THM both appeared to underestimate the number of trees in plots with high tree counts (Figure 2
D,E). This trend appears less substantial in the CNN model, despite its slightly greater overall bias relative to both THM-based models. In predicting percent needleleaf, the LMM-THM appeared to over-predict in plots with more needleleaf stems (Figure 2
G). Error was also greatest for this model in plots with a mix of species. The RF-THM and CNN models both tended to slightly over-predict percent needleleaf in plots with few needleleaf trees, and to under-predict in plots with a higher percentage of needleleaf trees. Overall, however, there appeared to be no obvious trend in observed vs. fitted biases with model type.
Our results indicate that 3D CNNs can be used to develop an area-based forest inventory with less error, and often less bias, than models based upon traditional height metrics. Model performance varied with the specifics of the CNN architecture: those CNNs that made use of inception layers (GoogleNet and Inception-V3) outperformed those that did not. We also note that the deeper inception-based CNN, Inception-V3, outperformed the shallower inception-based GoogleNet. The CNN employing residual layers (ResNet-50) performed relatively poorly. We also evaluated ResNet-35 and ResNet-101 models, and obtained poorer and similar results, respectively. However, the top-performing CNNs for image recognition presently make use of layers that combine residual and inception elements [45
]; it is possible that such a model may outperform those tested here.
In general, the RF-THM models outperformed the LMM-THM models. Random forest models resulted in lower RMSE and greater explained variance in terms of pseudo-R2
. The linear mixed models often had slightly lower absolute bias than the random forest models, but this benefit was at times negated by greater bias in plots representing extremes. These findings match those of others who have likewise concluded that random forest performs equal to or better than linear modelling for LiDAR inventories [8
]. It should also be noted that although the THMs we tested here are the most popular means of measuring LiDAR for forest attribute estimation, other studies have extracted different features from LiDAR and accompanying spectral data that may perform differently [62].
Performance of the CNNs relative to the THM-based models varied by the forest attribute being predicted, though in every instance the CNNs produced a lower RMSE. In modelling biomass and percent needleleaf, the CNNs performed equal to or better than the THM-based models in terms of bias. The tree count CNN resulted in slightly more bias than both THM-based models; in practical terms, however, this bias was minimal, and both THM-based models appeared to underestimate tree count to a greater extent in plots with larger numbers of trees.
We note that the greatest performance gains of CNNs over the THM-based models in terms of RMSE and explained variance were achieved when estimating biomass. It should also be noted, however, that we spent a greater amount of time and effort modelling biomass, as it was the attribute used to decide upon a model architecture. Neural networks offer modelers a great number of user-specified hyperparameters and architecture decisions, and we believe that model performance was at least somewhat related to the amount of time spent manually fine-tuning these during training. This amounts to more or less a game of trial and error, made more difficult by the lengthy time required to train these models. Some studies have automated this process through random searches or more sophisticated means of optimization [65].
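A random hyperparameter search of this kind is straightforward to sketch. The search space and scoring function below are hypothetical stand-ins; a real run would train and validate a CNN for each sampled configuration:

```python
import random

random.seed(42)

# Hypothetical search space for CNN training hyperparameters.
space = {
    "learning_rate": [1e-2, 1e-3, 1e-4],
    "batch_size":    [8, 16, 32],
    "dropout":       [0.0, 0.25, 0.5],
}

def sample(space):
    """Draw one random configuration from the search space."""
    return {k: random.choice(v) for k, v in space.items()}

def train_and_score(cfg):
    # Stand-in for a full training run returning validation RMSE (Mg/ha);
    # a real implementation would train the CNN with cfg and evaluate it.
    return 50 + random.random() * 10

best_cfg, best_rmse = None, float("inf")
for _ in range(20):                       # 20 random trials
    cfg = sample(space)
    rmse = train_and_score(cfg)
    if rmse < best_rmse:
        best_cfg, best_rmse = cfg, rmse

print(best_cfg, round(best_rmse, 1))
```

Random search is attractive here precisely because each trial is expensive: it parallelizes trivially and requires no assumptions about the shape of the loss surface.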
Many in the literature have noted that CNNs require a very large quantity of training data to achieve optimal performance [28
]. This will likely relegate the use of CNNs to instances in which multiple forest inventories are combined, as was the case here, or in which a large national forest inventory dataset is available. We did, however, demonstrate that the number of plots used to train these models can be artificially inflated through the collection of coincident remote sensing data. This mirrors the results of other studies [67
], in which dataset size was increased via transformations and the inclusion of many images rapidly taken of the same object. Several studies have also achieved good results by generating artificial training data [69
], and we believe that it may be possible to do the same with LiDAR. For example, artificial plots could be generated by combining pieces of other plots, individual tree crowns, or crowns derived from allometry or forest modeling [71].
In terms of computational performance, we made no formal effort to assess the time required to train each model in this study. We did note, however, that training the more complex models (such as Inception-V3 and ResNet-50) could take one or more days using GPU acceleration with an NVIDIA Tesla K80. As settling upon the optimal model architecture and hyperparameters will likely take many attempts, the modelling process may take days or weeks. A process called “warm starting” may offer future modelers a potential solution by allowing them to reuse some or all parameters of a pre-trained 3D CNN (including those developed here), reducing the amount of training time and the number of field plots required [72
] (see Supplementary Materials). We would also note that once the models were trained, predictions could be made very rapidly at a landscape level, perhaps offsetting the initial cost in training time.
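The warm-starting idea can be sketched with plain dictionaries of NumPy arrays standing in for a real framework’s parameter store (the layer names and shapes below are hypothetical): copy every pretrained tensor whose name and shape match the new model, and leave the rest, such as a regression head for a different attribute, to be trained from scratch.

```python
import numpy as np

# Pretrained parameters from an existing 3D CNN (hypothetical shapes).
pretrained = {
    "conv1.weight": np.ones((16, 1, 3, 3, 3)),
    "conv2.weight": np.ones((32, 16, 3, 3, 3)),
    "head.weight":  np.ones((1, 128)),      # biomass regression head
}

# A new model for a different attribute: same backbone, different head.
new_model = {
    "conv1.weight": np.zeros((16, 1, 3, 3, 3)),
    "conv2.weight": np.zeros((32, 16, 3, 3, 3)),
    "head.weight":  np.zeros((1, 64)),      # incompatible shape, not reused
}

def warm_start(model, donor):
    """Copy every donor tensor whose name and shape match the model."""
    copied = []
    for name, w in donor.items():
        if name in model and model[name].shape == w.shape:
            model[name] = w.copy()
            copied.append(name)
    return copied

reused = warm_start(new_model, pretrained)
print(reused)  # backbone layers reused; the mismatched head stays untouched
```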
Critics of deep learning have justifiably noted that the results of a CNN may be more difficult to explain given that these models lack human-intuited covariates. This, combined with the size of the models, makes it nearly impossible to trace the route of data through the model and justify the result. It is possible, however, to visualize the features identified by early layers of the model to gain an understanding of the shapes and patterns the model uses to make predictions. These visualizations are often referred to as feature maps [73
]. The raw values of a feature map have little biological relevance, although the relative values score features according to their utility in model predictions. Higher relative values on a feature map highlight features that are more likely to be retained by the model following an activation function. Figure 3
illustrates this with example feature maps generated over a plot. We note that some convolution layers appear to be identifying edges (Figure 3
D), while others appear to be identifying surfaces and possibly branches (Figure 3).
In addition to area-based forest inventories, we believe that CNNs may also be able to address the issue of individual tree segmentation. A wide array of algorithms has been put forward to segment the crowns of individual trees from a LiDAR point cloud for the purpose of developing a tree-list inventory [75
]. Concurrently, CNNs have been enormously effective at segmenting objects from photographic and video imagery [78
]. Most CNN-based segmentation algorithms work by identifying potential bounding boxes of objects and then analyzing the interior of those bounding boxes to assess their validity. We believe that a similar algorithm could be adapted to identify the 3D bounding boxes of individual trees. Another CNN-based segmentation method known as semantic segmentation seeks to isolate individual pixels (or voxels) that represent a desired object [80
]. We believe that this method could be of use in classifying objects or terrain in LiDAR point clouds.
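Scoring such 3D bounding-box proposals typically relies on an intersection-over-union criterion; a minimal sketch for axis-aligned boxes follows (the crown coordinates are hypothetical):

```python
def iou_3d(a, b):
    """Intersection-over-union of two axis-aligned 3D boxes,
    each given as (xmin, ymin, zmin, xmax, ymax, zmax)."""
    ix = max(0.0, min(a[3], b[3]) - max(a[0], b[0]))  # overlap along x
    iy = max(0.0, min(a[4], b[4]) - max(a[1], b[1]))  # overlap along y
    iz = max(0.0, min(a[5], b[5]) - max(a[2], b[2]))  # overlap along z
    inter = ix * iy * iz
    vol = lambda box: (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])
    union = vol(a) + vol(b) - inter
    return inter / union if union > 0 else 0.0

# Two hypothetical candidate crown boxes (metres).
crown_a = (0, 0, 5, 4, 4, 15)
crown_b = (2, 2, 5, 6, 6, 15)
print(iou_3d(crown_a, crown_b))
```

In a detection pipeline, proposals whose IoU with a higher-scoring box exceeds a threshold would be suppressed, leaving one box per tree.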
Another intriguing potential use for CNNs is in the development of pseudo-LiDAR point clouds. A number of studies have demonstrated the ability of CNNs to essentially be trained in reverse, producing images from other images or data [81
]. These types of CNNs are known as deconvolutional, inverse-graphics, or transposed-convolution networks. For example, a CNN could be trained to produce voxelized point clouds over a forested area from either previous LiDAR acquisitions or a standard forest inventory. These could then be used to inform modelers as to which features their neural networks are making use of, to test ecological hypotheses, to aid in visualization, and perhaps even to project LiDAR point clouds forward in time.
We have demonstrated that deep-learning methods using CNNs to interpret LiDAR data sets can improve upon traditional methods for area-based predictions of forest attributes. However, these improvements come with some drawbacks. Large amounts of training data, time, and effort are required for any modeling application that uses deep learning. Ultimately, it falls upon modelers to use their own best judgment to decide whether improvements in model performance are worth the effort involved in successfully training these models. That said, given our success and the widespread adoption of deep learning in related fields, it is reasonable to assume that deep learning will play a large role in remote sensing in the future.