Combined Impact of Sample Size and Modeling Approaches for Predicting Stem Volume in Eucalyptus spp. Forest Plantations Using Field and LiDAR Data

Light Detection and Ranging (LiDAR) remote sensing has been established as one of the most promising tools for large-scale forest monitoring and mapping. Continuous advances in computational techniques, such as machine learning algorithms, have been increasingly improving our capability to model forest attributes accurately and at high spatial and temporal resolution. While there have been previous studies exploring the use of LiDAR and machine learning algorithms for forest inventory modeling, as yet, no studies have demonstrated the combined impact of sample size and different modeling techniques for predicting and mapping stem total volume in industrial Eucalyptus spp. tree plantations. This study aimed to compare the combined effects of parametric and nonparametric modeling methods for estimating volume in Eucalyptus spp. tree plantation using airborne LiDAR data while varying the reference data (sample size). The modeling techniques were compared in terms of root mean square error (RMSE), bias, and R2 with 500 simulations. The best performance was verified for the ordinary least-squares (OLS) method, which was able to provide comparable results to the traditional forest inventory approaches using only 40% (n = 63; ~0.04 plots/ha) of the total field plots, followed by the random forest (RF) algorithm with identical sample size values. This study provides solutions for increasing the industry efficiency in monitoring and managing forest plantation stem volume for the paper and pulp supply chain.

the combined effect of sample size and data modeling (parametric and non-parametric approach) may impact the accuracy of the stem total volume estimation from LiDAR. Although several studies have demonstrated the effectiveness of the area-based approach for Airborne Laser Scanning (ALS)-based estimation of stem volume, the combined impact of different modeling techniques and sample size in Eucalyptus spp. forest plantations remains unexplored.
Accurate forest inventory is of foremost importance to make operational, tactical, and strategic management decisions efficiently. Therefore, to improve plantation management, there is a need to develop and implement more accurate, repeatable, and robust frameworks for modeling and mapping forest attributes at plot and stand levels. Moreover, efficient frameworks also play a key role in helping LiDAR technology move from research to operational modes, especially in industrial forest plantation settings where LiDAR applications are relatively new [31].
In this context, the aim of this study was, through the integration of field-based forest inventory and LiDAR data, to compare the performance of parametric and nonparametric modeling methods in the estimation of stem total volume in industrial Eucalyptus spp. forest plantations while assessing how the combined effect of sample size and different modeling techniques may impact the accuracy of the predictions. In this study, we offer insights and recommendations to forest managers and modelers for enhancing their model selection, data collection, and decision-making strategies and thereby assist them in optimizing cost, energy, labor, and overall efficiency of the forest inventory operations.

Study Area
The study area consisted of three farms located in the municipalities of Pilar do Sul and São Miguel Arcanjo, the southeast region of the state of São Paulo, Brazil ( Figure 1). According to the Köppen classification, the climate of the region is characterized as humid subtropical, with wet and hot summers and dry and cold winters. The mean annual precipitation is~1700 mm, and the mean annual temperature is 18.8 • C [32]. The topography in the selected plantations ranges from mildly hilly to very hilly, with an elevation ranging from 659 m to 1210 m. The soils of the region are predominantly red and yellow-red latosol, all classified as clayey or very clayey.
The farms contained industrial Eucalyptus plantations managed by Suzano S.A., a pulp and paper company located in São Paulo state, Brazil. The plantations were composed of hybrid clones of two Eucalyptus species, Eucalyptus grandis W. Hill ex Maid and Eucalyptus urophylla S.T. Blake, and covered an area of 2067.49 ha. All the trees were planted predominantly in a 3 × 2 m grid configuration, resulting in an average density of 1667 trees/ha. Standage across the farms was variable and ranged from 2 to 6 years. The farms contained industrial Eucalyptus plantations managed by Suzano S.A., a pulp and paper company located in São Paulo state, Brazil. The plantations were composed of hybrid clones of two Eucalyptus species, Eucalyptus grandis W. Hill ex Maid and Eucalyptus urophylla S.T. Blake, and covered an area of 2067.49 ha. All the trees were planted predominantly in a 3 × 2 m grid configuration, resulting in an average density of 1667 trees/ha. Standage across the farms was variable and ranged from 2 to 6 years.

Field Data
This study was based on data collected in a set of temporary and permanent sample plots installed for the purpose of annual forest inventory by the Suzano S.A. company. A total of 158 circular plots of 400 m 2 were established in stands ranging in age from 2.2 to 6 years. In each stand, the plot was randomly established within the stand boundary. Measurements were carried out during the months of April to November of 2013. All the sample plots were georeferenced in the field using a geodetic GPS (Global Positioning System) unit with differential correction capability (Trimble Pro-XR). The projected coordinate system used was UTM SIRGAS 2000, zone 23 S.
In each sample plot, individual trees were measured for diameter at breast height (dbh; cm) at 1.30 m, and a random subsample (15%) of trees for tree heights (Ht; m). Heights of unmeasured trees were estimated using locally adjusted hypsometric models, which use dbh as the predictor of Ht, following the model below: where ln(Ht) = the natural logarithm of tree total height (m), β0 and β1 = the intercept and the slope of the model, dbh = diameter (cm) at breast height (1.30 m), and ε = model's random error. Coefficients of determination (R²) and standard error (SE) were 0.97 and 3.18 m (6.09%).

Field Data
This study was based on data collected in a set of temporary and permanent sample plots installed for the purpose of annual forest inventory by the Suzano S.A. company. A total of 158 circular plots of 400 m 2 were established in stands ranging in age from 2.2 to 6 years. In each stand, the plot was randomly established within the stand boundary. Measurements were carried out during the months of April to November of 2013. All the sample plots were georeferenced in the field using a geodetic GPS (Global Positioning System) unit with differential correction capability (Trimble Pro-XR). The projected coordinate system used was UTM SIRGAS 2000, zone 23 S.
In each sample plot, individual trees were measured for diameter at breast height (dbh; cm) at 1.30 m, and a random subsample (15%) of trees for tree heights (Ht; m). Heights of unmeasured trees were estimated using locally adjusted hypsometric models, which use dbh as the predictor of Ht, following the model below: where ln(Ht) = the natural logarithm of tree total height (m), β 0 and β 1 = the intercept and the slope of the model, dbh = diameter (cm) at breast height (1.30 m), and ε = model's random error. Coefficients of determination (R 2 ) and standard error (SE) were 0.97 and 3.18 m (6.09%). Field measurements were used to estimate stem total volume (V; m 3 ·tree −1 ) by applying the respective diameter and height into the Schumacher-Hall allometric model [33], adjusted for each region, rotation, and genetic material, following the model below: where ln(V) = the natural logarithm of stem total volume (m 3 ), β = model's parameters to be estimated (i = 0, 1, 2), dbh = diameter (cm) at breast height (1.30 m), Ht = total height, and ε = model's random error. All the field measurements and predictions calculations from the hypsometric and allometric models were provided by the inventory team of Suzano S.A. The coefficients of the models are under the company's intellectual property rights and not made available to the public; however, the R 2 and SE of the estimate for the volume models used in this study ranged from 0.96 to 0.98 and 8.3 to 12.7 m 3 ·ha −1 (3.18% and 6.09%), respectively. Each variable of all individuals was summed at plot-level and scaled to a hectare. A summary of plot-level forest attributes, including V (m 3 ·ha −1 ) calculations for each class of stand ages, is presented in Table 1.

LiDAR Data Collection Specifications and Processing
An airborne LiDAR survey was conducted in the study area on 5 December 2013, using a Harrier 68i sensor (Trimble, Sunnyvale, CA, USA) mounted on a CESSNA 206 aircraft. The characteristics of the LiDAR data acquisition are listed in Table 2. LiDAR data processing steps were performed using FUSION/LDV 3.7 software [34], which provided three major outputs: the digital terrain model (DTM), the normalized point cloud, and the LiDAR-derived canopy structure metrics. In order to differentiate between ground and vegetation points, the original LiDAR cloud data were filtered using the classification algorithm proposed in Reference [35]. The ground points were used to generate the 1 m resolution Digital Terrain Models (DTMs). The LiDAR clouds were normalized to heights by subtracting the DTMs elevations from each LiDAR return. Normalized point clouds were subset within the field sample plots of interest, and the canopy metrics were computed at plot using all returns above 1.30 m. We generated only those metrics that have been often used as candidate predictors for forest attribute modeling in other recent studies [14,20,36,37]. Therefore, a total of 26 LiDAR metrics calculated from all returns were considered as a candidate for predicting stem volume ( Table 3). All the LiDAR processing was performed by FUSION/LDV [34].

Modeling Development and Assessment
The modeling approaches evaluated in this study to estimate the statistical relationship between stem volume and LiDAR metrics fall into two different categories: parametric methods (e.g., multiple linear regression) and non-parametric methods (e.g., machine learning regression). Parametric and non-parametric models have been proven to be useful for developing predictions from LiDAR-derived metrics and field-estimated forest structural attributes [20,31,[36][37][38][39] Even though machine learning algorithms are usually not sensible for collinearity, normality, or linearity, in order to obtain a set of predictor variables that could be commonly applied to all the selected modeling methods, we used two variable selection approaches. First, Pearson's correlation (r) analysis was carried out to identify highly correlated metrics and to exclude redundant predictors (r > 0.9) [31,40]. Second, we implemented principal component analysis (PCA) to the most relevant LiDAR-derived candidate metrics to achieve a final set of predictor variables. Using PCA, a subset of variables that explain the majority of variation can be selected from a large set of (possibly highly correlated) predictor variables.
PCA was applied over the selected LiDAR metrics for each of the 158 sample plots. A correlation matrix derived from the LiDAR metrics provided the basis for the eigenvalue and eigenvector calculations and for the subsequent determination of the PC scores. Each score represented a transformed metric from the linear combination of the LiDAR metrics of the sample plots. By analyzing the eigenvectors and the PC score, we established differences in the contribution of each LiDAR metric to the variability in the dataset, as well as the similarity in metrics calculated across the different aged stands [14]. The first five metrics that were most likely to contribute to the model development were identified by inspecting the eigenvectors in each PC. We then used the metrics with the highest loading on the PCs as input variables for every modeling method.
For assessing the effect of modeling approaches for predicting stem total volume in Eucalyptus forest plantation, we used the following modeling approaches: (i) Ordinary least-squares (OLS) multiple regression: The OLS regression algorithm fits a linear model by minimizing the residual sum of squares between the observed values in the training dataset and the predicted values by the linear model [41]. (ii) Random forest (RF) algorithm: RF is a combination of a decision tree with a value of a random independently sampled vector and with the same distribution for all trees in the forest [22]. Based on binary rule-based decisions, the algorithm indicates which particular tree should be used for each specific data input. RF was adjusted using 1000 trees, and one-third of the number of variables to be randomly sampled at each split. (iii) k-nearest neighbors (k-NN) imputation: k-NN methods work by direct substitution (imputation) of measured values from sample locations (references) for locations for which we desire a prediction (targets). In this strategy, key considerations include the distance metric that is used to identify suitable references and the number of references (k) that are used in a single imputation [20].
In this study, we examined k = 1 neighbors for each of the mentioned distance metrics in order to keep the original variation in the data [42]. Many imputation methods can be used for associating target and reference observations. We decided to evaluate six different distance metrics for the k-NN-based approach: raw, Euclidean (k-NN-EUC), Mahalanobis (k-NN-MA), most similar neighbor (k-NN-MSN), independent component analysis (k-NN-ICA), and random forest (k-NN-RF). (iv) Support vector machine (SVM): SVM considers a statistical learning principle to fit a hyperplane that superimposes as much training data as possible. Instead of error minimization, SVM uses structural risk minimization of the distance from training points to the hyperplane [43,44].
To warranty a nonlinear response space, our SVM uses a Radial Base Function for the Kernel function. (v) Artificial neural network (ANN): The ANNs algorithm is inspired by the work of neurons in the human brain [45]. The neural network was set up with two hidden layers: 7 neurons in the first layer (same length of the variables vector) and one neuron in the second layer. The initial weights were set randomly, and the decay parameter was set to 0.1.
For assessing the effect of the sample size within each modeling approach, the models were embedded in a bootstrapping approach with 500 iterations. In each bootstrap iteration, we drew from 10% to 90% the number of observations with replacement from the available samples and validated the model with all observations. In each bootstrap iteration, relative root mean square error (RMSE; Equation (3)), coefficient of determination (R 2 ; Equation (4)), and bias (Equation (5)) were computed based on the linear relationship between observed and predicted volumes.
where y i is the observed value for plot i,ŷ i is the estimated value for plot i, and n is the number of plots. Relative RMSE and bias were calculated by dividing the absolute values by the mean of the observed response parameters. We defined acceptable model precision and accuracy as a relative RMSE and bias of ≤15% to have a model precision and accuracy higher than or equal to the conventional forest inventory standard in fast-growing Eucalyptus plantations in Brazil [31].

Statistical Comparisons
Considering each tested modeling approach, to assess how the combined effect with sample size may impact the accuracy of the predictions, we used the Wilcoxon-Mann-Whitney test to determine if the differences between the methods and sample sizes were statistically significant (at p-value = 0.05). We developed all the statistical analyses in the R statistical package [46]. The RF algorithm was implemented by package randomForest [47], k-NN by yaImpute package [48], in combination with the randomForest package [47], the SVM by the e1071 package [49], and the ANN was implemented by the nnet package [50].

Predictor Variable Selection
A total of 19 of the 26 LiDAR metrics showed a very strong correlation (r > 0.9). To represent the 19 metrics, we retained the H99TH along with six other remaining metrics not highly correlated (r ≤ 0.9) (Table 4). HMEAN, HMODE, HCV, HKUR, H25TH, H99TH, and COV were included in the PCA. Among these, HMEAN, HMODE, HKUR, H99TH, and COV exhibited the highest PC eigenvector loadings (Table 5), which represented the contribution of each LiDAR metric toward the component, and therefore, were used for model development.  The first five PCs accounted for 98.9% of the total variance contained in the selected set of seven LiDAR metrics. PC1, PC2, PC3, PC4, and PC5 accounted for 40.0%, 35.3%, 12.9%, 6.8%, and 3.8% of the total variance, respectively (Figure 2a). PCs 6-7 explained a less than significant percentage (<2.5%) of the remaining variance and were discarded. The first PC captured the canopy height variation and showed positive loadings by height metrics (i.e., HMEAN and H25TH) and negative loading of metrics of HCV and COV. The second PC was mainly influenced by density metrics, and the third PC highlighted canopy cover.

Combined Impact of Sample Size and Data Modeling
The evaluation of the modeling methods' accuracy throughout the sample size was carried out by three performance measures indicators: R 2 , RMSE, and bias ( . Comparisons across the ten prediction methods indicated that OLS and RF outperformed the other tested methods. A relatively stable increase in accuracy and decrease in RMSE were observed along with increasing sample size in all methods, but only the OLS and RF methods were able to meet the acceptable model precision criteria (RMSE and bias of ≤15%) from 30% of the sample size. OLS presented R 2 values ranging from 0.82-0.85 for 30% to 90% of the sample size and demonstrated a more stabilized pattern. In terms of Bias, the variation with respect to increased sample size was very balanced, which shows the robustness of the model. The RF method showed more sensitivity towards the number of samples and presented R 2 values in the range of 0.80-0.91 for 30% to 90% of the sample size. When the sample size was over 50%, the R 2 values of the RF algorithm were higher than that compared to the OLS models.
The SVM algorithm presented similar performance to the RF algorithm, however showing lower values in all parameters evaluated. The algorithm was able to meet the acceptable model precision criteria (RMSE and bias of ≤15%) from 50% of the sample size, presenting R 2 values ranging from 0.80-0.85 for 50% to 90% of the sample size. From the six derivations of the k-NN algorithm tested, the RF-based k-NN approach showed the best results and was able to meet the criteria while using 50% of the sample size, presenting R² values ranging from 0.78-0.86 for 50% to 90%. The poorest performance in this k-NN group was found to be for the k-NN MA algorithm, which presented a relative RMSE of 18.06%, a bias of −0.18%%, and R 2 of 0.67, even when the sample size was 100%. The k-NN ICA algorithm also behaved in a similar fashion. Among all the machine learning algorithms, ANN presented the worst performance in terms of outliers.
The best method and sample size combination (minimum sample size) to provide better R² values and a relatively lower number of outliers was found to be OLS with 40% of the sample size (which accounts for a sample size of n = 63 and ~0.04 plots/ha). The use of only 40% of the full dataset combined with the OLS method was able to provide an average of 0.83 for R 2 and 12.53% and −0.14% for relative RMSE and bias, respectively. No significant improvement for predicting stem volume was found by increasing the sample size from 40% to 50%. The Wilcoxon test comparing RMSE values derived from 40% with the full (100%) dataset showed a p-value > 0.05; hence, 40% and 100% had similar distributions and mean, evidencing no significant difference between them. Whereas, in the case of RF, at a 40% sample size, we obtained an average of 0.78 for R 2 and 13.07% and 0.19% for relative RMSE and bias, respectively.

Combined Impact of Sample Size and Data Modeling
The evaluation of the modeling methods' accuracy throughout the sample size was carried out by three performance measures indicators: R 2 , RMSE, and bias (Figures 3-5). Comparisons across the ten prediction methods indicated that OLS and RF outperformed the other tested methods. A relatively stable increase in accuracy and decrease in RMSE were observed along with increasing sample size in all methods, but only the OLS and RF methods were able to meet the acceptable model precision criteria (RMSE and bias of ≤15%) from 30% of the sample size. OLS presented R 2 values ranging from 0.82-0.85 for 30% to 90% of the sample size and demonstrated a more stabilized pattern. In terms of Bias, the variation with respect to increased sample size was very balanced, which shows the robustness of the model. The RF method showed more sensitivity towards the number of samples and presented R 2 values in the range of 0.80-0.91 for 30% to 90% of the sample size. When the sample size was over 50%, the R 2 values of the RF algorithm were higher than that compared to the OLS models.
Remote Sens. 2020, 12, x FOR PEER REVIEW 10 of 20 Furthermore, we created two generic visual representations to help us distinctly comprehend and compare performance trends of various modeling techniques-in terms of R 2 , RMSE %, and relative bias (Figure 6), in addition to volume predictions derived from reduced sample sizes ( Figure  7). For instance, it is easy to notice from Figure 6a how R 2 of OLS keeps increasing with sample size, how this compares with respect to RF, and at what point their performances are overlapping-in this case at around 20% and 40% sample sizes-and switching of trends which denotes higher sensitivity of RF towards sample sizes. The same applies to other modeling techniques as well as the RMSE %and bias-related line graphs presented in Figure 6 (b and c, respectively). In Figure 7, we present the percentage of times we obtained a p-value > 0.05 for the Wilcoxon test (from a sum total of 500 iterations) for a particular combination of sample size and modeling approach with respect to the full dataset ( Figure 7a) and reference volume (Figure 7b). Herein, if we look at the Figure 7a, we notice that when volume predictions from various combinations of sample size levels (10% to 90%) are compared to the full LiDAR-based dataset, for most of the modeling approaches, adding more than 40% of the sample size does not make any accountable difference-that is, for example, if we take the case of OLS, the high percentage of p-value > 0.05 for 40% of data represents that this amount of sample size gives mean and spatial distribution similar to using 100% of data; on the other hand, in the case of ANN, even at 90% sample size, the percentage of p-value > 0.05 does not reach 100%, denoting its inapplicability. However, these results do not represent precise predictions or might not provide high accuracy as they are based on 500 iterations where sample points are taken randomly and evaluated for calculating p-values and total percentage. Hence, each time we run the model, our observations can be different, and we cannot be sure that the one we obtained at a particular time exactly represents the real scenario. Nonetheless, we can notice how much sample size % in average is needed with respect to a parametric or non-parametric algorithm to reach 100% in terms of pvalues, and this allows us to evaluate the stability of models across sample sizes. Whereas, in Figure  7b, we observe a much smoother trend-that is, above 30% sample size, most of the modeling approaches give 100% in p-values > 0.05. This is because the respective model-based predicted volumes are being compared with reference (inventory-based) volumes which are fixed values.        The SVM algorithm presented similar performance to the RF algorithm, however showing lower values in all parameters evaluated. The algorithm was able to meet the acceptable model precision criteria (RMSE and bias of ≤15%) from 50% of the sample size, presenting R 2 values ranging from 0.80-0.85 for 50% to 90% of the sample size. From the six derivations of the k-NN algorithm tested, the RF-based k-NN approach showed the best results and was able to meet the criteria while using 50% of the sample size, presenting R 2 values ranging from 0.78-0.86 for 50% to 90%. The poorest performance in this k-NN group was found to be for the k-NN MA algorithm, which presented a relative RMSE of 18.06%, a bias of −0.18%%, and R 2 of 0.67, even when the sample size was 100%. The k-NN ICA algorithm also behaved in a similar fashion. Among all the machine learning algorithms, ANN presented the worst performance in terms of outliers.
The best method and sample size combination (minimum sample size) to provide better R 2 values and a relatively lower number of outliers was found to be OLS with 40% of the sample size (which accounts for a sample size of n = 63 and~0.04 plots/ha). The use of only 40% of the full dataset combined with the OLS method was able to provide an average of 0.83 for R 2 and 12.53% and −0.14% for relative RMSE and bias, respectively. No significant improvement for predicting stem volume was found by increasing the sample size from 40% to 50%. The Wilcoxon test comparing RMSE values derived from 40% with the full (100%) dataset showed a p-value > 0.05; hence, 40% and 100% had similar distributions and mean, evidencing no significant difference between them. Whereas, in the case of RF, at a 40% sample size, we obtained an average of 0.78 for R 2 and 13.07% and 0.19% for relative RMSE and bias, respectively. Furthermore, we created two generic visual representations to help us distinctly comprehend and compare performance trends of various modeling techniques-in terms of R 2 , RMSE%, and relative bias ( Figure 6), in addition to volume predictions derived from reduced sample sizes (Figure 7). For instance, it is easy to notice from Figure 6a how R 2 of OLS keeps increasing with sample size, how this compares with respect to RF, and at what point their performances are overlapping-in this case at around 20% and 40% sample sizes-and switching of trends which denotes higher sensitivity of RF towards sample sizes. The same applies to other modeling techniques as well as the RMSE%and bias-related line graphs presented in Figure 6 (b and c, respectively). In Figure 7, we present the percentage of times we obtained a p-value > 0.05 for the Wilcoxon test (from a sum total of 500 iterations) for a particular combination of sample size and modeling approach with respect to the full dataset ( Figure 7a) and reference volume (Figure 7b). Herein, if we look at the Figure 7a, we notice that when volume predictions from various combinations of sample size levels (10% to 90%) are compared to the full LiDAR-based dataset, for most of the modeling approaches, adding more than 40% of the sample size does not make any accountable difference-that is, for example, if we take the case of OLS, the high percentage of p-value > 0.05 for 40% of data represents that this amount of sample size gives mean and spatial distribution similar to using 100% of data; on the other hand, in the case of ANN, even at 90% sample size, the percentage of p-value > 0.05 does not reach 100%, denoting its inapplicability. However, these results do not represent precise predictions or might not provide high accuracy as they are based on 500 iterations where sample points are taken randomly and evaluated for calculating p-values and total percentage. Hence, each time we run the model, our observations can be different, and we cannot be sure that the one we obtained at a particular time exactly represents the real scenario. Nonetheless, we can notice how much sample size % in average is needed with respect to a parametric or non-parametric algorithm to reach 100% in terms of p-values, and this allows us to evaluate the stability of models across sample sizes. Whereas, in Figure 7b, we observe a much smoother trend-that is, above 30% sample size, most of the modeling approaches give 100% in p-values > 0.05. This is because the respective model-based predicted volumes are being compared with reference (inventory-based) volumes which are fixed values.

Discussion
Although LiDAR has shown to be a powerful technology for forest inventory around the world, its application for monitoring Eucalyptus forest plantations in Brazil is relatively new [18,51]. On examination of the trends observed in previous studies, that have employed a wide range of modeling methods for forest attribute estimation and reported results representing varying accuracies, it is clear that appropriate selection of methods is paramount for attaining the best prediction results [20,37,44,52,53]. The novelty of this research is to investigate how the combined influence of sample size and different modeling techniques affect the overall prediction accuracy of forest plantation attributes and demonstrate the potential of reduced sample sizes to generate accurate prediction results.
For reducing model complexity and boosting overall prediction accuracy, it is imperative to select a minimal number of parameters by means of variable selection approaches [14,54]; this task, however, gets more challenging when highly correlated predictors are present. Application of dual variable selection approaches-Pearson's correlation analysis and PCA-proved beneficial in our case and allowed us to shortlist the five major variables: HMEAN, HCV, HMODE, HKUR, and COV, from a total of 26 LiDAR metrics. These five variables, which were used for model development, accounted for 98.9% of the total variation contained in the pre-selected set of LiDAR metrics. Recent

Discussion
Although LiDAR has shown to be a powerful technology for forest inventory around the world, its application for monitoring Eucalyptus forest plantations in Brazil is relatively new [18,51]. On examination of the trends observed in previous studies, that have employed a wide range of modeling methods for forest attribute estimation and reported results representing varying accuracies, it is clear that appropriate selection of methods is paramount for attaining the best prediction results [20,37,44,52,53]. The novelty of this research is to investigate how the combined influence of sample size and different modeling techniques affect the overall prediction accuracy of forest plantation attributes and demonstrate the potential of reduced sample sizes to generate accurate prediction results.
For reducing model complexity and boosting overall prediction accuracy, it is imperative to select a minimal number of parameters by means of variable selection approaches [14,54]; this task, however, gets more challenging when highly correlated predictors are present. Application of dual variable selection approaches-Pearson's correlation analysis and PCA-proved beneficial in our case and allowed us to shortlist the five major variables: HMEAN, HCV, HMODE, HKUR, and COV, from a total of 26 LiDAR metrics. These five variables, which were used for model development, accounted for 98.9% of the total variation contained in the pre-selected set of LiDAR metrics. Recent

Discussion
Although LiDAR has shown to be a powerful technology for forest inventory around the world, its application for monitoring Eucalyptus forest plantations in Brazil is relatively new [18,51]. On examination of the trends observed in previous studies, that have employed a wide range of modeling methods for forest attribute estimation and reported results representing varying accuracies, it is clear that appropriate selection of methods is paramount for attaining the best prediction results [20,37,44,52,53]. The novelty of this research is to investigate how the combined influence of sample size and different modeling techniques affect the overall prediction accuracy of forest plantation attributes and demonstrate the potential of reduced sample sizes to generate accurate prediction results.
For reducing model complexity and boosting overall prediction accuracy, it is imperative to select a minimal number of parameters by means of variable selection approaches [14,54]; this task, however, gets more challenging when highly correlated predictors are present. Application of dual variable selection approaches-Pearson's correlation analysis and PCA-proved beneficial in our case and allowed us to shortlist the five major variables: HMEAN, HCV, HMODE, HKUR, and COV, from a total of 26 LiDAR metrics. These five variables, which were used for model development, accounted for 98.9% of the total variation contained in the pre-selected set of LiDAR metrics. Recent studies done on Eucalyptus plantations which had applied PCA for variable selection, found similar total variance contained in the selected set of LiDAR metrics (97.7%) and showed HCV, H99TH, COV, H01TH, and H05TH as the most important variables for predicting stem volume [14].
There was a significant relationship between field-based volume estimates and LiDAR-derived metrics selected from the PCA analysis. The selected metrics from the PCA analysis were consistent with previous studies which have also observed that mean height had the largest absolute correlation with the first principal component, coefficient variation of height had the largest absolute correlation with the second principal component, and canopy cover had the largest absolute correlation with the third principal component [55]. Models using these three first principal components likely capture the fundamental allometric relationships between volumes and heights, as seen in results from large-footprint data [15], in which mean height, canopy cover, and height variability were found to explain most of the variability in forest physical characteristics. Several previous studies [56,57] have also found that metrics such as HMEAN and HCV have shown to be effective predictors of forest attributes, such as stem volume, height, basal area, and aboveground carbon in Eucalyptus spp. plantations. The biological basis behind these results is due to the ecological and biomechanical links between canopy vertical structure and forest stand structure parameters. From the perspective of tree form and function development, there is usually a connection between the differences in vertical canopy structure and differences in forest volume, both through forest succession and across areas with contrasting environmental conditions [55].
From our results, it was evident that algorithm performance was sensitive to sampling size and the level of influence varied from one algorithm to another. On placing constraints (<15%) for RMSE values, only 4 models-SVM, RF-based k-NN, RF, and OLS-were found to be feasible for making predictions for 50% (or less) of the sample size. In the case of OLS and RF, a sample size greater than 30% fell within the RMSE threshold. For OLS, this might be because of the low level of multicollinearity within the model. Whereas, for RF-based k-NN and SVM, the ideal sample sizes were equal to or above 50%. In terms of bias, we noticed that all the models fell within the maximum set limit, which was 15%. With respect to R 2 values, OLS proved to be the best among the given modeling methods, followed by RF, when minimal sample size was given priority. The range of R 2 values was comparatively stable for OLS: 0.82-0.85 for 30% to 90%; however, RF: 0.80-0.91 for 30% to 90%, reached higher values when the sample size was increased. The increase in R 2 values with increasing sample size was very evident in the case of other non-parametric models as well. However, this pattern was expected, considering the fact that the non-parametric models learn their functional form from the training data [52,58], which means that the higher the sample size, the better their prediction accuracy will be. This dependence on sample size might be the reason several other non-parametric algorithms failed to provide satisfactory results in our case, where field plots considered were limited [20,59].
Even though for OLS, sample size above 30% met the chosen criteria, high levels of outliers were observed in this case. A former study [60] came up with the generalization that average standard deviation tends to increase with a reduction of sample size, which matches our findings very well. However, the sample size on reaching 40% showed a significant reduction in the number of outliers. On further increase to 50% of the sample size, not much difference occurred in the outliers count or the R 2 values. Additionally, by performing the Wilcoxon test (p > 0.05), we confirmed that 40% and 100% were not significantly different in terms of distribution and mean. When the sample size was 30%, RF also gave satisfactory results, even though the R 2 value was slightly lower (0.8) compared to OLS (0.82). Based on our results and core objective-which was to find the minimum sample size required for attribute estimation-we inferred the best combination to be the linear regression (OLS) model with a sample size of 40%, followed by the random forests (RF) method with an identical sample size value.
Since there existed no extensive studies that accounted for the combined influence of modeling methods and sample size, evaluating the accuracy of our model in regard to established and identical workflows was near impossible. Nevertheless, in comparison with studies that have evaluated the influence of sample size and modeling methods on a discrete basis, we noticed our obtained trends and accuracy of the high performing models to be quite comparable with the inferences made by other studies. A recent study [61] investigated the influence of number and size of sample plots, as well the effect of a single selection, on modeling growing stock volume (GSV) of a Scots pine (Pinus sylvestris L.)-dominated forest in Poland, with 900 available study plots, using airborne LiDAR data. Based on their three major findings: (i) influence of number of sample plots on the accuracy of GSV estimation above 400 sample plots was nominal, (ii) number of sample plot size and estimation accuracy revealed an inverse relationship, irrespective of the number of plots considered, and (iii) single selection does not have any impact when plots considered were above 400, the authors concluded that it is possible to reduce the number of ground sample plots by almost one-third and still retain reasonable accuracy and precision levels, even when the sample plot area is relatively small. This was highly evident in our case as well-for a sample size of less than 40%. Caution is necessary to evaluate the accuracy per age (or another sub-population), since we have unbalanced number of samples per group (Table 1). We did not explore in this study the sample size for groups within the population.
Another study [20] compared the performance of seven modeling methods-k-NN-MSN, gradient nearest neighbor imputation, k-NN-RF, Best NN imputation, OLS, spatial linear model, and geographically weighted regression-for predicting five forest attributes, including basal area, stem volume, Lorey's height, quadratic mean diameter, and tree density, from airborne LiDAR metrics in a mixed conifer forest in southwestern Oregon, in the United States. Contrary to our results, in this case, the authors were not able to come up with a single modeling method that always performed superior to the others in the prediction of the forest attributes; nonetheless, OLS and the spatial linear model gave the best results in terms of RMSE values in the maximum number of cases. From the paragraphs above, we can see that OLS has consistently returned similar estimates (and performance) as compared to more advanced methods, being consistently included among the best models. OLS also has an important advantage when considering the facility to find out the explanatory power of independent variables and make comparison to models generated by other studies.
The major takeaway from our study is that with LiDAR data of only 40% of the total field plots, we are able to make accurate predictions, given that the right modeling technique is employed. This, when translated into large-scale area projects, means savings of a huge amount of money and faster processing with high accuracy. With the same amount of time, we can get more things done or maybe even utilize the available budgets for performing surveys at an increased frequency. Future studies can even narrow these results by reducing the intervals in sample size (that is, instead of the 10% used here, perhaps use 5% or even 1%) and repeating the same process.
Results also highlight that multiple modeling methods work well on predictions and depending on the level of data in hand, these methods can be selected. However, it is incumbent on the modelers to keep in mind the limitations of each algorithm before applying them. For example, for applying linear regression models, assumptions of a linear relationship, homoscedasticity, etc., need to be met, and this is not always true in the case of several plantation data. In a lot of cases, since the data is collected from a copious amount of sources and often has data of the same location for multiple dates, a data hierarchy tends to exist, and in this case, a mixed-effects model needs to be used to account for the random effects happening within the models [62][63][64][65][66]. Therefore, a minimum knowledge of the study site and exhaustive data exploratory analysis is recommended before making the method selection. One should also acknowledge the errors associated with field measurements, ALS data acquisition, and data processing steps while interpreting the model results.
Previous studies have reported the minimum sample size required to vary with respect to the attribute and tree species under consideration. For instance, a study undertaken by the authors of Reference [67] observed the accuracy of estimated Picea abies (L.) Karst volumes at the forest stand level to show no decrease until the number of plots was reduced to below 200 (46.4% of the total number of sample plots). Whereas, for the case of other deciduous tree species, the volume estimation accuracy plummeted, with a gradual decrease in the number of sample plots. Also, more often than not, limited field data and/or acquired LiDAR data quality place additional constraints on complementing studies that intend to evaluate the minimum sample size required for estimating the accuracy of forest attributes using LiDAR metrics [61].
Here, we tested the combined influence of only sample size and modeling algorithms, nonetheless, the influence of additional features, such as plot size, LiDAR pulse density, GPS location errors, etc., would also be interesting and helpful to the research community [52,68,69]. Another thing to keep in mind is the cost associated with LiDAR, which makes this approach economically feasible for only large study areas [10,31,70]. It is always a possibility to improve estimation by adopting a proper sampling method. A combination of field data and LiDAR considering a double sampling approach can significantly reduce the estimation error [71].
Updating data over time using LiDAR can be perceived as a hurdle for the same reason. However, if we are willing to adopt a different perspective, that views the potential reduction of fieldwork cost as compensation for ALS data acquisition, then multiplying the ALS data collection frequency can be treated as a reasonable initiative. Data fusion techniques that integrate LiDAR with other more affordable methods such as unmanned aerial vehicle (UAV) remote sensing or other low-cost, available cutting-edge technologies, can be deemed as an interesting strategy having great potential for forest plantation assessment [11,[72][73][74][75].
Translating this framework from the research to the operational arena requires additional work, especially to test its applicability on multiple sites and to verify stability in results, which needs more investment in terms of fieldwork and analysis. Even so, the expected benefits, that come in the form of reduced inventory cost in the forest plantation sector, will be a huge leap for the forest management sector.

Conclusions
The importance of a framework with more robust and accurate techniques that consider auxiliary data in the process of estimating stem total volume is evident. In this study, we evaluated the impacts of different modeling methods and sample size on the accuracy of volume estimates predicted from LiDAR data in a Eucalyptus forest plantation in Brazil. Our results showed that the precision of LiDAR-derived stem total volume estimates was considerably impacted by the prediction method while varying sample sizes. Higher levels of accuracy were obtained by employing a multiple linear regression model, which was able to provide comparable results using only 40% of the total field plots (~0.04 plots/ha), followed by the random forest with an identical sample size value. The precision of the combined impact of sample size and modeling methods was demonstrated through a relative RMSE and bias less than 15%, which is equal to or less than the level of error that is traditionally accepted in a conventional field inventory. The methods used in this study formulate a framework for integrating field and LiDAR data, highlighting the importance of sample size for volume estimates.
The major takeaway from our study indicates that collecting larger field reference data is not necessarily the most effective option for improving the accuracy of volume estimates while dealing with forest plantations, which in general comprise of relatively simple vegetation structures. Thus, this study should be able to assist in the selection of an optimal sample size that minimizes estimation errors, processing time, and plot establishment costs. Future directions for this research include the use of a larger number of datasets that tests additional features (i.e., plot size, LiDAR pulse density, GPS location errors), integrating multi-sensor data fusion approaches (i.e., terrestrial or UAV LiDAR, radar), and estimating forest attributes at an individual tree level. Additionally, the development of further studies to increase our understanding of the statistical modeling methods' role in the volume estimation of this forest type would be able to shed more light on the ideas presented herein. We hope that the findings from our study give more credibility and encouragement for specialists to pursue research in directions that will ultimately result in the development of site-independent LiDAR data-based models for predicting a wide range of forest attributes.