Predicting Stem Total and Assortment Volumes in an Industrial Pinus taeda L. Forest Plantation Using Airborne Laser Scanning Data and Random Forest

Improvements in the management of pine plantations result in multiple industrial and environmental benefits. Remote sensing techniques can dramatically increase the efficiency of plantation management by reducing or replacing time-consuming field sampling. We tested the utility and accuracy of combining field and airborne lidar data with Random Forest, a supervised machine learning algorithm, to estimate stem total and assortment (commercial and pulpwood) volumes in an industrial Pinus taeda L. forest plantation in southern Brazil. Random Forest was populated using field and lidar-derived forest metrics from 50 sample plots with trees ranging from three to nine years old. We found that a model defined as a function of only two metrics (height of the top of the canopy and the skewness of the vertical distribution of lidar points) has a very strong and unbiased predictive power. We found that predictions of total, commercial, and pulp volume, respectively, showed an adjusted R2 equal to 0.98, 0.98 and 0.96, with unbiased predictions of −0.17%, −0.12% and −0.23%, and Root Mean Square Error (RMSE) values of 7.83%, 7.71% and 8.63%. Our methodology makes use of commercially available airborne lidar and widely used mathematical tools to provide solutions for increasing the industry efficiency in monitoring and managing wood volume.


Introduction
The area of planted forests worldwide has been steadily growing, with an estimated 6.95% of total global forested area being plantations in 2010 [1]. Tropical regions may be experiencing particularly rapid rates of plantation expansion [2]. For example, the area of pine plantations in Brazil has dramatically risen in the last few decades to increase pulp and paper production. Currently~20% of the total reforested area of Brazil is comprised of pine forest plantations [2]. to maximize economic return. Therefore, the development of robust frameworks for modeling and mapping stem total and assortment volumes at plot and stand levels is still needed to increase the efficiency in monitoring and managing wood and pulp productions in forest plantations. Moreover, efficient frameworks also play important role in helping lidar technology move from research to operational modes, especially in industrial forest plantation settings where lidar applications are relatively new. The aims of this study were to: (i) present a robust and efficient framework for modeling, predicting and mapping stem total volume (Vt), saw logs (in this study mentioned as commercial) volume (Vc) and pulpwood volume (Vp) in a P. taeda plantation in southern Brazil using airborne lidar data; (ii) evaluate the use of the RF machine learning algorithm for modeling stem total and assortment volumes; and (iii) generate maps representing the spatial distribution of Vt, Vc and Vp in differently aged plantations of P. taeda. This investigation was based on the hypothesis that lidar technology and Random Forest analysis can facilitate accurate and precise inferences of forest volumes in P. taeda plantations in southern Brazil.

Study Area Description
The study area consisted of P. taeda stands located within the Telêmaco Borba municipality in the state of Paraná, southern of Brazil ( Figure 1). Trees were planted using a 3.0 × 2.0 m or 2.5 × 2.5 m grid configuration, resulting in an average tree density of 1667 or 2000 trees ha −1 , respectively. The climate of the region is characterized as warm and temperate [28], with annual average precipitation of approximately 1378 mm and an annual average temperature of 18.4 • C. The P. taeda stands are situated on a plateau where the topography is relatively flat. The plantations are managed by Klabin S.A., a pulp and paper company. to maximize economic return. Therefore, the development of robust frameworks for modeling and mapping stem total and assortment volumes at plot and stand levels is still needed to increase the efficiency in monitoring and managing wood and pulp productions in forest plantations. Moreover, efficient frameworks also play important role in helping lidar technology move from research to operational modes, especially in industrial forest plantation settings where lidar applications are relatively new. The aims of this study were to: (i) present a robust and efficient framework for modeling, predicting and mapping stem total volume (Vt), saw logs (in this study mentioned as commercial) volume (Vc) and pulpwood volume (Vp) in a P. taeda plantation in southern Brazil using airborne lidar data; (ii) evaluate the use of the RF machine learning algorithm for modeling stem total and assortment volumes; and (iii) generate maps representing the spatial distribution of Vt, Vc and Vp in differently aged plantations of P. taeda. This investigation was based on the hypothesis that lidar technology and Random Forest analysis can facilitate accurate and precise inferences of forest volumes in P. taeda plantations in southern Brazil.

Study Area Description
The study area consisted of P. taeda stands located within the Telêmaco Borba municipality in the state of Paraná, southern of Brazil ( Figure 1). Trees were planted using a 3.0 × 2.0 m or 2.5 × 2.5 m grid configuration, resulting in an average tree density of 1667 or 2000 trees ha −1 , respectively. The climate of the region is characterized as warm and temperate [28], with annual average precipitation of approximately 1378 mm and an annual average temperature of 18.4 °C. The P. taeda stands are situated on a plateau where the topography is relatively flat. The plantations are managed by Klabin S.A., a pulp and paper company.

Field Data Collection
A total of 50 rectangular plots, each approximately 600 m 2 (i.e., 20 m × 30 m) were randomly established and measured across 50 stands distributed in four plantations. As such, the sample plots

Field Data Collection
A total of 50 rectangular plots, each approximately 600 m 2 (i.e., 20 m × 30 m) were randomly established and measured across 50 stands distributed in four plantations. As such, the sample plots well represent the study area, and they capture the entire structural variability in these stands with ages ranging from three to nine years old. All plots were geo-referenced with a geodetic GPS with differential correction capability (Trimble Pro-XR, Trimble, Sunnyvale, CA, USA) ensuring a location error lower than 10 cm. In each sample plot, individual trees were measured for dbh (diameter at breast height) at 1.30 m and a random subsample (15%) of trees for tree height (Ht). For those trees in the plots that were not directly measured for Ht, the inventory team of Klabin S.A. predicted heights from hypsometric equations [29], employing dbh as the independent variable, and Ht as the dependent variable, following the model below: where ln(Ht) is the natural logarithm of tree height (m); β 0 and β 1 are the intercept and the slope of the model; dbh is the diameter at breast height (1.30 m) and e is the random error of the model. The coefficients of the hypsometric models are the companies intellectual property and not made available to the public, however, the adjusted coefficient of determination (adj. R 2 ) and standard error of estimate in percentage (SEE%) of the models ranged from 0.96 to 0.98 and 5.1 to 6.5, respectively. The management goal of the P. taeda plantations at Klabin is optimized to produce commercial logs of 2.65 m in length, which are then classified in four timber assortment classes: 18 to 25 cm (Vc 1), 25 to 30 cm (Vc 2), 30 to 40 cm (Vc 3), and diameter ≥ 40 cm (Vc 4). The logs designated for pulpwood are produced with log lengths of 2.40 m and diameters ranging from 8 to 18 cm (Vp), as illustrated in Figure 2. well represent the study area, and they capture the entire structural variability in these stands with ages ranging from three to nine years old. All plots were geo-referenced with a geodetic GPS with differential correction capability (Trimble Pro-XR, Trimble, Sunnyvale, CA, USA) ensuring a location error lower than 10 cm. In each sample plot, individual trees were measured for dbh (diameter at breast height) at 1.30 m and a random subsample (15%) of trees for tree height (Ht). For those trees in the plots that were not directly measured for Ht, the inventory team of Klabin S.A. predicted heights from hypsometric equations [29], employing dbh as the independent variable, and Ht as the dependent variable, following the model below: where ln(Ht) is the natural logarithm of tree height (m); β and β are the intercept and the slope of the model; dbh is the diameter at breast height (1.30 m) and e is the random error of the model. The coefficients of the hypsometric models are the companies intellectual property and not made available to the public, however, the adjusted coefficient of determination (adj. R 2 ) and standard error of estimate in percentage (SEE%) of the models ranged from 0.96 to 0.98 and 5.1 to 6.5, respectively. The management goal of the P. taeda plantations at Klabin is optimized to produce commercial logs of 2.65 m in length, which are then classified in four timber assortment classes: 18 to 25 cm (Vc 1), 25 to 30 cm (Vc 2), 30 to 40 cm (Vc 3), and diameter ≥ 40 cm (Vc 4). The logs designated for pulpwood are produced with log lengths of 2.40 m and diameters ranging from 8 to 18 cm (Vp), as illustrated in Figure 2. In this study, Vt, Vc and Vp for each tree were computed using the fifth-degree polynomial [30] as presented below: In this study, Vt, Vc and Vp for each tree were computed using the fifth-degree polynomial [30] as presented below: where β = parameters to be estimated; d i = stem diameter (cm) at the ith position; dbh = diameter (cm) at breast height (1.30 m); h = total height (m); h i = height (m) at the ith position; and K = π/40,000 is an adjustment factor to estimate volume as m 3 ·ha −1 . The polynomial models were adjusted for classes of dbh, and the coefficients of the models are the companies' intellectual property and not made available to the public; however, the classes of diameter, adj. R 2 and standard error of the estimate (SEE; given in %) for the polynomial models used in this study are presented in Table 1. The total of Vt, Vc and Vp of all individuals were summed at plot-level and scaled to a hectare. The summary of volumes in m 3 ·ha −1 for each class of stand ages is presented in Table 2. SEE (%) is the standard error of the estimate, expressed as a percentage.

Lidar Data Acquisition and Data Processing
Lidar data were obtained by a Harrier 68i sensor (Trimble, Sunnyvale, CA, USA) mounted on a CESSNA 206 aircraft. The characteristics of the lidar data acquisition are listed in Table 3. Lidar data processing consisted of several steps that ingested the lidar point cloud data and provided two major outputs: the digital terrain model (DTM), and the lidar-derived canopy structure metrics. All lidar processing was performed using FUSION/LDV 3.42 software (US Forest Service, Washington, DC, USA) [31]. The original point cloud data were filtered using Kraus and Pfeifer's algorithm [32] and a 1 m resolution DTM was generated from the points classified as ground. Subsequently, the height of the returns was computed by subtracting the elevation of the DTM from each return. Once the heights were normalized, the metrics shown in Table 4 were computed at plot and stand levels, at a grid cell resolution of 25 m, using all lidar returns. Table 4. Lidar-derived canopy height metrics considered as candidate variables for predictive V models [31].

Predictor Variables Selection
In order to derived accurate estimates of stem volumes from lidar, it is essential to select the most significant lidar metrics (predictor variables) for modeling within a parsimonious statistical model framework. Because the number of candidate lidar metrics can be very large (e.g., 30 metrics), in our study we selected the best lidar metrics for modeling stem volumes based on two steps. First, even though highly correlated variables will not cause multi-collinearity issues in RF [20], Pearson's correlation (r) was used to identify highly correlated predictor variables (r > 0.9) as presented in previous studies (e.g., [14,33,34]). If a given group (2 or more) of lidar metrics were highly correlated, we retained only one metric by excluding the others that were most highly correlated with the remaining metrics. Second, we identified the most important metrics based on the Model Improvement Ratio (MIR), a standardized measure of variable importance [35,36]. MIR scores are derived by dividing raw variable important scores (output from RF models) by the maximum variable importance score, so that MIR values range from 0 to 1. MIR scores allow for variable importance comparisons among different RF models. We ran the RF routine (package randomForest [37]) in R [38] 1000 times to compute MIR. In each MIR iteration, we bootstrapped the data by randomly selecting a sample of 50 plots with replacement. RF requires two parameters to be set: (i) mtry, the number of predictor variables performing the data partitioning at each node which in this study was defined by the number of highly uncorrelated preliminary set of lidar metrics and (ii) ntree, the total number of trees to be grown in the model run which was set to 1000 (e.g., [34,39]). Running 1000 iterations of RF produced consistent MIR distributions and avoided unnecessary processing time [39]. To create parsimonious models, we reserved the metrics for final RF models that exhibited the highest mean MIR values.

Random Forest Model Development
The three stem volumes (Vt, Vc and Vp) of interest were predicted at the plot and stand-levels using also the RF package [37] in R [38]. The number of RF trees to grow was set to 1000, and the number of predictor variables performing the data partitioning at each node was set to equal the number of best lidar metrics selected by MIR on Section 2.4 [37]. The accuracy of estimates for each model was evaluated in terms of Adj. R 2 , Root Mean Square Error (RMSE), and Bias (both absolute and relative) by the linear relationship between predicted (output from RF) and observed stem volumes: where n is the number of plots, y i is the observed value for plot i, andŷ i is the predicted value for plot i. Moreover, relative RMSE (%) and biases (%) were calculated by dividing the absolute values (Equations (5) and (6)) by the mean of the observed stem volume. Based on earlier experiences and recommendations from literature [4,5], we defined acceptable model accuracy as a relative RMSE and Bias of <15%. For validation purposes, RF models were embedded in a bootstrap with 500 iterations. In each bootstrap iteration, we drew 50 times with replacement from the 50 available samples. In this procedure, on average 44% of the total of sample (~22 samples) are not drawn. These samples were subsequently used as holdout samples for an independent validation (e.g., [40]). In each bootstrap iteration, Adj. R 2 , absolute and relative RMSE and Bias were computed based on the linear relationship between observed and predicted volumes using the holdout samples. We used also two-sided Kolmogorov-Smirnov (KS) in R [38] and a statistical equivalence test [41] to compare the field-and lidar-based stem volume estimates in each iteration.

Predictive Stem Volumes Maps
Predictive maps of stem volumes at 25 m of spatial resolution were generated based on the RF models containing the best lidar metrics according to MIR analysis. Because we have a large number of stands in this study, stem volumes predictions at the stand level were then presented herein by stand ages of 3-5, 5-7 and 7-9 years. Additionally, maps of coefficient of variation (CV, given in %) values for the stem volume predictions (as obtained from the 500 bootstrap runs) was also produced for each stand (e.g., [40]). Figure 3 provides an overview of the study methodology. . Procedure for predicting stem total and assortment volumes in an industrial P. taeda forest plantation using airborne laser scanning data and random forest.

Predictor Variable Selection
A total of 25 of the 32 lidar metrics showed a very strong correlation (r > 0.9). We retained one of the highly correlated metrics (H99TH), which along with seven other remaining metrics not extremely highly correlated (r ≥ 0.9) were included in the MIR analysis (Table 5). LiDAR metrics that were retained after correlation analysis included HMIN, HCV, HIQ, HSKEW, HKUR, H99TH and COV. Among these, H99TH and HSKEW exhibited the highest mean MIR values (Table 6) and therefore, were used for model development. Although HCV also showed high mean MIR values, its inclusion in the models did not significantly improve model performance.   . Procedure for predicting stem total and assortment volumes in an industrial P. taeda forest plantation using airborne laser scanning data and random forest.

Predictor Variable Selection
A total of 25 of the 32 lidar metrics showed a very strong correlation (r > 0.9). We retained one of the highly correlated metrics (H99TH), which along with seven other remaining metrics not extremely highly correlated (r ≥ 0.9) were included in the MIR analysis (Table 5). LiDAR metrics that were retained after correlation analysis included HMIN, HCV, HIQ, HSKEW, HKUR, H99TH and COV. Among these, H99TH and HSKEW exhibited the highest mean MIR values (Table 6) and therefore, were used for model development. Although HCV also showed high mean MIR values, its inclusion in the models did not significantly improve model performance.

Model Performances
The H99TH and HSKEW that exhibited high MIR values explained more than 80% variations of the stem volumes in Vt, Vc and Vp components with relative RMSE and Bias less than 10% and −0.10%, respectively ( Table 7). The negative values in Bias indicate that the models are slightly underestimating the stem volumes. Predicted stem volumes at plot level did not differ significantly to the observed stem volumes by the KS and equivalence tests (p-values > 0.05). Figure 4 shows the distribution of observed and stem volumes and a good agreement can be observed.

Model Performances
The H99TH and HSKEW that exhibited high MIR values explained more than 80% variations of the stem volumes in Vt, Vc and Vp components with relative RMSE and Bias less than 10% and −0.10%, respectively ( Table 7). The negative values in Bias indicate that the models are slightly underestimating the stem volumes. Predicted stem volumes at plot level did not differ significantly to the observed stem volumes by the KS and equivalence tests (p-values > 0.05). Figure 4 shows the distribution of observed and stem volumes and a good agreement can be observed.  The performance of RF model to predict Vt, Vc and Vp was also summarized in terms of Adj. R 2 , RMSE and Bias for all 500 bootstrap runs (Table 8). Observed and predicted stem volumes in each bootstrap iteration did not differ significantly by the statistical KS and equivalence test (p-values > 0.05) as well. Overall, all models using H99TH and HSKEW performed very well, with relative RMSE and Bias <15% in the bootstrap procedure. The observed and the average of the predicted stem volumes from the 500 bootstrap runs were also compared and according to the KS and equivalence tests those values did not differ significantly (p-values > 0.05) too ( Figure 5). The performance of RF model to predict Vt, Vc and Vp was also summarized in terms of Adj. R 2 , RMSE and Bias for all 500 bootstrap runs (Table 8). Observed and predicted stem volumes in each bootstrap iteration did not differ significantly by the statistical KS and equivalence test (p-values > 0.05) as well. Overall, all models using H99TH and HSKEW performed very well, with relative RMSE and Bias <15% in the bootstrap procedure. The observed and the average of the predicted stem volumes from the 500 bootstrap runs were also compared and according to the KS and equivalence tests those values did not differ significantly (p-values > 0.05) too ( Figure 5).   . The equivalence plot design presented herein is an adaptation of the original equivalence plots presented by [41]. The grey polygon represents the ±25% region of equivalence for the intercept, and the green vertical bar represents a 95% of confidence interval for the intercept. The predicted stem volumes from the RF models are equivalent with reference to the intercept and slope since the green bar is completely within the grey polygon. If the grey polygon is lower than the green vertical bar, the predicted stem volumes are negatively biased; and if it is higher than the green vertical bar, the predicted stem volumes are positively biased. Moreover, the grey dashed line represents the ±25% region of equivalence for the slope, the fit line is within the dotted lines and the black vertical bar is within the gray rectangle, indicating that the pairwise measurements are equivalent. A green bar that is wider than the region outlined by the grey dashed lines indicates highly variable predictions. The white dots are the pairwise measurements, and the solid line is a best-fit linear model for the pairwise measurements. The light grey dashed line represented the relationship 1:1. The horizontal red bars represent the standard deviation of the 500 bootstraping predictions.

Prediction Maps
Box plots of predicted stem values of Vt, Vc and Vp of P. taeda at the stand level are shown in Figure 5. On average, naturally predicted stem volumes tended to be lower at young age ( Figure 6A) and higher at advanced age stands ( Figure 6C). Herein, because it is not convenient to show all the maps for the 50 stands, Figures 7 and 8 is showing as an example the predicted map of stem volumes and CV (%) with spatial resolution of 25 m for only three stands, but with ages ranging from three to nine years old.  N = 50). The equivalence plot design presented herein is an adaptation of the original equivalence plots presented by [41]. The grey polygon represents the ±25% region of equivalence for the intercept, and the green vertical bar represents a 95% of confidence interval for the intercept. The predicted stem volumes from the RF models are equivalent with reference to the intercept and slope since the green bar is completely within the grey polygon. If the grey polygon is lower than the green vertical bar, the predicted stem volumes are negatively biased; and if it is higher than the green vertical bar, the predicted stem volumes are positively biased. Moreover, the grey dashed line represents the ±25% region of equivalence for the slope, the fit line is within the dotted lines and the black vertical bar is within the gray rectangle, indicating that the pairwise measurements are equivalent. A green bar that is wider than the region outlined by the grey dashed lines indicates highly variable predictions. The white dots are the pairwise measurements, and the solid line is a best-fit linear model for the pairwise measurements. The light grey dashed line represented the relationship 1:1. The horizontal red bars represent the standard deviation of the 500 bootstraping predictions.

Prediction Maps
Box plots of predicted stem values of Vt, Vc and Vp of P. taeda at the stand level are shown in Figure 5. On average, naturally predicted stem volumes tended to be lower at young age ( Figure 6A) and higher at advanced age stands ( Figure 6C). Herein, because it is not convenient to show all the maps for the 50 stands, Figures 7 and 8 is showing as an example the predicted map of stem volumes and CV (%) with spatial resolution of 25 m for only three stands, but with ages ranging from three to nine years old.

Discussion
Detailed information on stem total and assortments volumes is required in industrial forest plantations to achieve production efficiency. For instance, incomplete or inaccurate forest information adds to the expense and challenge of forest operations (e.g., [42]). Moreover, improving forest plantation productivity and efficiency are important for reducing harvest pressure on natural forests. To achieve efficiency gains in operational forest management, a wide range of forest inventory attributes are required to be measured accurately at high spatial resolution and landscape to regional extents [34,43]. More detailed inventory information can allow forest owners to make better decisions concerning the timing of timber sales, and allow forest companies to optimize their wood supply chain from forest to factory [44]. In this study, we present a framework for predicting and mapping total, commercial and pulpwood volumes in industrial P. taeda forest plantations using airborne lidar data and RF. While there have been previous studies exploring the use of lidar and non-parametric machine learning algorithm for forest inventory modeling (e.g., [19,34,40,45-48]), no studies yet have demonstrated the potential of lidar and RF combined for predicting and mapping commercial and pulpwood volumes in industrial pine forest plantations.
Stem total and assortment volumes are directly related to the supply of fiber to pulp and paper companies. Herein, the accuracy of lidar for retrieving Vt, Vc and Vp using RF models was clearly demonstrated through achieving a relative RMSE and Bias less than <15% both for modeling and for validation. As we are predicting forest attributes at a homogenous and single layered forest structure, our measures of precision and accuracy were similar to or higher than those who used lidar data for predicting stem volume through a RF framework in other forest types [15,16,42,44,49]. Among prior

Discussion
Detailed information on stem total and assortments volumes is required in industrial forest plantations to achieve production efficiency. For instance, incomplete or inaccurate forest information adds to the expense and challenge of forest operations (e.g., [42]). Moreover, improving forest plantation productivity and efficiency are important for reducing harvest pressure on natural forests. To achieve efficiency gains in operational forest management, a wide range of forest inventory attributes are required to be measured accurately at high spatial resolution and landscape to regional extents [34,43]. More detailed inventory information can allow forest owners to make better decisions concerning the timing of timber sales, and allow forest companies to optimize their wood supply chain from forest to factory [44]. In this study, we present a framework for predicting and mapping total, commercial and pulpwood volumes in industrial P. taeda forest plantations using airborne lidar data and RF. While there have been previous studies exploring the use of lidar and non-parametric machine learning algorithm for forest inventory modeling (e.g., [19,34,40,45-48]), no studies yet have demonstrated the potential of lidar and RF combined for predicting and mapping commercial and pulpwood volumes in industrial pine forest plantations.
Stem total and assortment volumes are directly related to the supply of fiber to pulp and paper companies. Herein, the accuracy of lidar for retrieving Vt, Vc and Vp using RF models was clearly demonstrated through achieving a relative RMSE and Bias less than <15% both for modeling and for validation. As we are predicting forest attributes at a homogenous and single layered forest structure, our measures of precision and accuracy were similar to or higher than those who used lidar data for predicting stem volume through a RF framework in other forest types [15,16,42,44,49]. Among prior studies, RF has generally showed better performance compared to other statistical approaches, such as multiple linear regression, boosting trees regression and support vector regression [50][51][52][53]. Lidar-derived stem total and saw log volumes and their estimation accuracies have previously been reported at the forest stand level (e.g., [15,16,42,54,55]). For instance, in Eastern Finland in a typical Finnish southern boreal managed forest area, two studies used lidar data for estimating species-specific diameter distributions and saw log volumes [15,16]. Two years later, in Southern Wisconsin, USA, lidar data were used for predicting not only saw log volume, but also pulpwood volume [55]; the models produced R 2 of~0.65 for estimating both saw log and pulpwood volumes. While those authors have showed the great potential of lidar in retrieving assortment volumes, this specific application is still relatively novel and further studies, such as presented herein, still need to be carried out.
In this study, we showed that lidar measurements could be used as input data to predict and map stem total and assortment volumes through a RF framework. High levels of accuracy were found when predicting Vt, Vc and Vp volumes across variable stand ages of P. taeda using only H99TH and HSKEW as predictor variables. Lidar-derived H99TH represents the top of the canopy (height at 99th percentile) and HSKEW is a measure of the asymmetry of height distribution, which is associated with the age of the stands because older trees are taller and cause a more positively skewed distribution. Skewness and height percentile variables are logical selections for distinguishing between different volume levels based on distributional shapes and height frequencies [56]. In particular, these variables can explain changes in the volume distribution [5], thus providing a solid justification for inclusion in the predictive model. Our results suggest that models based on variables describing the height of the canopy and the symmetry of the distribution of the returns are capable of predicting stem total and assortment volumes across different tree ages in industrial P. taeda forest plantations. Height percentile lidar metrics, such as H99TH, and height distributional metrics, such as HSKEW, have been shown to be powerful metrics for modeling and predicting forest attributes (e.g., [5][6][7]33,34,48]).
A disadvantage of using the RF framework presented here is that RF models do not extrapolate predictions beyond the trained data, and consequently, as found herein, reduce the variance compared to the observations ( Figure 5). However, an important advantage of non-parametric approaches, such as RF, is that they can model non-linear, complex relationships between the dependent and the independent variables more efficiently than parametric approaches [46]. Furthermore, RF is insensitive to data skew, robust to a high number of variable inputs, and its implementation does not require pre-stratification by forest type [20,34,46]. From an overall statistical perspective, the predicted and observed volumes were equivalent, although our RF model validations showed a systematic tendency to overestimate small values and underestimate high values. The same was found in previous studies (e.g., [40,57]). According to one study [57], a possible cause might be that because the RF model estimates values by averaging the predictions of many decision trees, it might tend to underestimate when the predicted value is close to the maximum value of the training data. Similarly, when the estimated value is close to the minimum value of training data it might tend to overestimate. Other possible causes might be that we have a relatively small number of field plots, especially in the young and older stands.
Traditional forest inventory approaches are not effective in terms of costing and mobility especially in P. taeda forest plantations, where there is a need to monitor annual forest growth and properties are very large. Lidar remote sensing constitutes an important step towards operational wood procurement planning and is of high current interest to forestry organizations. Such technology is of great interest owing to their spatial sampling capabilities within plantations, and have had great reliability in forest inventory work in countries such as Norway, Canada, or the USA (e.g., [6,7,12,58]). Moreover, the application of airborne lidar technology for Brazilian industrial management is relatively new. While some studies have showed that the cost of the forest inventory derived from lidar could be lower than conventional forest inventory [59,60], the cost of lidar data acquisition could still be high to monitor forest growth annually; however, lidar has the ability to provide wall-to-wall, accurate mapping of forest attributes at high spatial resolutions (e.g., Figures 7 and 8).
Traditional forest inventory approaches are based on sampling theory, and forest attributes measured at plot level are then used to infer inventory attributes for an entire stand [5,14]. We showed here that lidar and RF machine learning combined can be a powerful tool for mapping forest attributes in P. taeda forest plantations. In practice, lidar-derived maps of stem total and assortment volumes (Figures 7 and 8) allow the owners to evaluate the production and forest structure variability within stands in a spatially explicit manner, which is not possible in a traditional forest inventory of P. taeda. Also, such maps may allow managers to detect spatial patterns related to tree diseases, fire or forest clearing.
Recently, a study carried out in Eucalyptus spp. forest plantations showed that lidar and RF could be combined to predict and map aboveground carbon at high spatial resolution (5 m), even if the models are calibrated using field plots with area larger than the cell size used for mapping [34]. Therefore, future studies should be also test the ability of lidar and RF to map stem total and assortment volumes even at higher spatial resolution than presented in this study (e.g., Figures 7 and 8). Herein, we demonstrated the potential of combined lidar-derive metrics and RF to predict forest attributes through a lidar-plot based approach framework, however, to get even higher amount of details in P. taeda forest plantations, RF could be also tested in a lidar-individual tree based approach. For instance, RF has been successfully used to impute individual tree height and volume in longleaf pine (Pinus palustris Mill.) forest in Southern USA [61]; therefore, lidar and RF could be also used to predict stem total and assortment volumes at an individual tree level in P. taeda forest plantations, if carefully implemented.

Conclusions
Refining strategies for improving productivity of forest plantations requires accurate and detailed spatial information on forest structure and growing stock volume. In this study, we showed that airborne lidar data metrics can predict total, commercial and pulpwood volumes in a P. taeda forest plantation in Brazil. We found that different stem volumes can be estimated with high levels of accuracy from two lidar-derived variables describing the height and the shape of the vertical distribution of the height. The use of a model based on two variables suggests a higher generalization potential than models based on specific metrics that could result in over-fitting. However, this potential should be tested in other plantations and forested environments. Although airborne lidar data has not been adopted by paper companies operationally, our results show that the method used could be readily applied to support the supply chain of pulp and paper companies in Brazil or elsewhere.