Artificial Neural Networks and Linear Regression Reduce Sample Intensity to Predict the Commercial Volume of Eucalyptus Clones

Equations to predict Eucalyptus timber volume are continuously updated, but most of them cannot be used for certain locations. Thus, equations of similar strata are applied to clonal plantations where trees cannot be felled to fit volumetric models. The objective of this study was to use linear regression and artificial neural networks (ANN) to reduce the number of trees sampled while maintaining the accuracy of predictions of the commercial volume with bark up to 4 cm in diameter at the top (v) of Eucalyptus clones. The two methods were evaluated in two scenarios: (a) regression model fit and ANN training with 80% of the data (533 trees) and per clone group with 80% of the trees in each group; and (b) model fit and ANN training with trees of only one clone group at ages two and three, with sample intensities of six, five, four, three, two, and one tree per diameter class. The real and predicted v averages did not differ at sample intensities from six to two trees per diameter class with either method. The frequency distributions of individuals by volume class predicted by the two methods (regression and ANN) were similar to those of the real values in scenarios (a) and (b) according to the Kolmogorov–Smirnov test (p-value > 0.01). The ANN was more effective for the analysis of the total data, which showed non-linear behavior, without stratification of the sampled environment. The Prodan model also generated accurate estimates and, among the regression models, was the best fit to the data. The volume with bark up to 4 cm in diameter at the top of Eucalyptus clones can be predicted with at least three trees per diameter class with regression (root mean square error in percentage, RMSE = 12.32%), and at least four trees per class with ANN (RMSE = 11.73%).


INTRODUCTION
Eucalyptus plantations are located all over the world [1]. In Brazil, most tree species used to supply the forest industry are of the genus Eucalyptus; they have favorable characteristics for the industry and can adapt to different edaphic and climatic conditions [2,3]. In the Amazon region, forest plantations with Eucalyptus are expanding, especially in pasture areas [4], where volumetric timber stock should be constantly evaluated. Despite the importance of Eucalyptus in Brazil, there is a lack of technical and scientific information on cultures in the Amazon region [5,6]. This is also the case for the availability of volumetric equations.
In forest measurement, allometric equations are important in the quantification of many dendrometric variables, such as the volume and tree biomass, to evaluate productivity and carbon stocks at specific (stands) to general (regions) levels [7][8][9][10]. Studies on the relationships between the independent variables of stem diameter and height to estimate timber production are useful in the management of commercial forest plantations [11]. In Brazil, about 91% of timber is produced from commercial forest plantations [12]. Considering the fact that volume predictions impact the planning and monitoring of these plantations, it is important to find ways to predict this variable with low cost and high accuracy.
Timber volume can be affected by environmental factors and stand characteristics, such as productive capacity, age, genetic material, and environmental variations [13]. Thus, equations to predict timber volume should be appropriate for different locations. However, some commercial and experimental clonal plantations have strata where it is not possible to harvest trees for fitting in volumetric models, making it necessary to apply equations of similar strata, which reduces the predicted accuracy [14].
Therefore, finding an efficient and objective methodology to calculate timber volume [14] is crucial for the design and monitoring of forest inventory and for predicting forest growth and yield [1]. An alternative is to consider that forest companies accumulate large quantities of tree volume data with different characteristics and forms [15]. These data can be used to implement artificial neural network (ANN) projects to predict tree volume accurately, without the need to stratify the data [16]. ANN is important for predicting the volume of trees at a lower cost [17].
In forestry, ANNs have been applied in several studies [18], for instance to model forest growth and dynamics [14], predict the height of individual trees [19], describe diametric distribution [20], predict above-ground biomass [21], project diameter [22], estimate the volume of stems and branches [23], and model survival and mortality [24]. ANNs are parallel distributed computing systems composed of simple mathematical processing units. These units, the artificial neurons, are arranged in one or more layers and interconnected by a large number of connections, in analogy to biological neural networks [20,22].
In this context, this study aims to evaluate, using artificial neural networks and without losing accuracy, the reduction in the number of trees sampled to predict the commercial timber volume of Eucalyptus clones in the Brazilian Amazon, and to compare the results with the classical method of linear regression. The final models should be easy to implement at a lower cost and result in high-precision estimates, thus contributing to the economic planning of timber production.

Data
The research was carried out in clonal Eucalyptus plantations, spaced 3 × 3 m and aged 2–6 years, in the municipalities of Dom Eliseu, Paragominas, and Ulianópolis, mesoregion of Pará State, Brazil (Figure 1). The structure of the Eucalyptus plantations is shown in Figure 2.
The minimum diameter at breast height (dbh) was 4.15 cm and the maximum was 25.90 cm. The trees were grouped into diameter classes of 2 cm amplitude (Table 1). The minimum volume was 0.0018 m³ and the maximum was 0.6822 m³. The real volume with bark up to 4 cm in diameter at the top (v) of felled trees was obtained by the Smalian method [25].
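As an illustration of the Smalian method, the volume of each stem section can be taken as the mean of its two end cross-sectional areas multiplied by the section length, with the section volumes summed up to the 4 cm top diameter. The Python sketch below assumes this standard form of the formula. The function names and the stem measurements are hypothetical and are not data from this study.

```python
import math


def cross_section_area(diameter_cm):
    """Cross-sectional area in m2 for a diameter given in cm."""
    return math.pi * (diameter_cm / 100.0) ** 2 / 4.0


def smalian_volume(diameters_cm, heights_m):
    """Sum of Smalian section volumes: mean of the two end areas times section length."""
    if len(diameters_cm) != len(heights_m) or len(diameters_cm) < 2:
        raise ValueError("need matching lists with at least two measurements")
    volume = 0.0
    for i in range(len(diameters_cm) - 1):
        area_lower = cross_section_area(diameters_cm[i])
        area_upper = cross_section_area(diameters_cm[i + 1])
        length = heights_m[i + 1] - heights_m[i]
        volume += (area_lower + area_upper) / 2.0 * length
    return volume


# Hypothetical stem: diameters with bark (cm) measured at the given heights (m),
# with the last measurement at the 4 cm commercial top diameter.
diameters = [16.0, 14.5, 12.0, 9.0, 6.5, 4.0]
heights = [0.1, 1.3, 5.0, 9.0, 13.0, 16.0]
v = smalian_volume(diameters, heights)
print(f"v = {v:.4f} m3")
```

For a perfect cylinder the function returns the exact cylinder volume, which is a convenient sanity check before applying it to field measurements.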

Clone Cluster Analysis
Cluster analysis was performed to divide the clones into homogeneous groups and allow the comparison between the regression model and ANN in different scenarios. The variables dbh, h, and v (means per clone) were used in the clustering. The Euclidean distance [26] was used as the measure of dissimilarity between groups, and Ward's method [27], an agglomerative hierarchical technique, was used to form the groups, both with the stats package in R version 3.4.1 (R Core Team, Vienna, Austria).
The efficiency of the grouping method was evaluated with the cophenetic correlation coefficient. A horizontal line between branches was plotted on the resulting dendrogram, forming three groups of clones.
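The grouping step can be sketched as follows. This is a minimal Python illustration of agglomerative clustering with the Ward criterion on Euclidean distances, not the R routine used in the study. The z-score standardization of the variables, the function names, and the clone means are assumptions introduced only for the example; the study does not state whether the variables were standardized or which Ward variant was applied.

```python
import statistics


def standardize(rows):
    """Z-score each column so dbh, h, and v weigh comparably (an assumption)."""
    columns = list(zip(*rows))
    means = [statistics.mean(c) for c in columns]
    sds = [statistics.stdev(c) for c in columns]
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]


def ward_clusters(points, n_groups):
    """Merge clusters until n_groups remain, always joining the pair whose union
    gives the smallest increase in within-cluster sum of squares (Ward criterion)."""
    clusters = [{"members": [i], "centroid": list(p)} for i, p in enumerate(points)]
    while len(clusters) > n_groups:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                na = len(clusters[a]["members"])
                nb = len(clusters[b]["members"])
                dist2 = sum(
                    (x - y) ** 2
                    for x, y in zip(clusters[a]["centroid"], clusters[b]["centroid"])
                )
                cost = na * nb / (na + nb) * dist2
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        ca, cb = clusters[a], clusters[b]
        na, nb = len(ca["members"]), len(cb["members"])
        merged = {
            "members": ca["members"] + cb["members"],
            "centroid": [
                (na * x + nb * y) / (na + nb)
                for x, y in zip(ca["centroid"], cb["centroid"])
            ],
        }
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)] + [merged]
    return [sorted(c["members"]) for c in clusters]


# Hypothetical clone means: (dbh in cm, h in m, v in m3)
clone_means = [
    (12.1, 17.5, 0.085), (12.4, 18.0, 0.092),
    (15.8, 22.1, 0.190), (16.1, 22.6, 0.201),
    (19.5, 26.0, 0.340), (19.9, 26.4, 0.355),
]
groups = ward_clusters(standardize(clone_means), n_groups=3)
print(groups)
```

Stopping the merging at three clusters plays the role of the horizontal line drawn on the dendrogram.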

Settings to Select the Best Model
Five statistical models were fitted to the data of the 666 trees to identify the one with the best fit (Table 2).
Table 2. Volumetric models tested to predict volume with bark up to 4 cm in diameter at the top of Eucalyptus clone trees.
The model parameters were estimated by the ordinary least squares (OLS) method with the "lm" function in R. The individual volume predicted by the logarithmic models was multiplied by the Meyer correction factor [31], $F = e^{0.5 \times MS_{res}}$, to correct the logarithmic discrepancy, where F = Meyer correction factor; e = base of natural logarithms; and $MS_{res}$ = mean square of the residuals. The variance inflation factor (VIF) was calculated for Models 2, 4, and 5 to verify the presence of multicollinearity in the predictor variables.
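The back-transformation with the Meyer factor can be illustrated with a short Python sketch. It assumes a logarithmic model of the Schumacher-Hall form, ln(v) = b0 + b1 ln(dbh) + b2 ln(h), which is used here only as an example of a logarithmic volumetric model. The coefficients, the residual mean square, and the function names are hypothetical and are not the fitted values of this study.

```python
import math


def meyer_factor(ms_res):
    """Meyer correction factor F = exp(0.5 * MSres)."""
    return math.exp(0.5 * ms_res)


def predict_volume_log_model(dbh_cm, h_m, b0, b1, b2, ms_res):
    """Back-transform a log-scale prediction and apply the Meyer factor."""
    ln_v = b0 + b1 * math.log(dbh_cm) + b2 * math.log(h_m)
    return math.exp(ln_v) * meyer_factor(ms_res)


# Hypothetical coefficients and residual mean square (log scale)
v_hat = predict_volume_log_model(dbh_cm=15.0, h_m=20.0,
                                 b0=-10.0, b1=1.8, b2=1.1, ms_res=0.01)
print(f"predicted v = {v_hat:.4f} m3")
```

Because the factor is always greater than or equal to one, it raises the back-transformed prediction slightly, which compensates for the underestimation introduced by exponentiating a mean on the log scale.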
The models were compared using the adjusted coefficient of determination (R²adj), residual standard error (Sy.x), variation coefficient (VC%), Student's t-test for the estimated parameters (5% significance), Akaike's Information Criterion (AIC), graphical analysis of percentage relative errors (E%) vs. real volumes, and histogram of frequency of E%.
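The numerical criteria above can be computed from the observed and predicted volumes, as in the Python sketch below. The formulas are the usual textbook definitions and are an assumption, since the exact expressions used in the study are not given in the text. In particular, the AIC is written in its least-squares form, n ln(RSS/n) + 2k, which differs by an additive constant from the value reported by R's AIC function for a fixed sample size. The function name and the data are hypothetical.

```python
import math


def fit_statistics(observed, predicted, n_params):
    """Fit criteria for one model; n_params counts all coefficients, intercept included."""
    n = len(observed)
    mean_obs = sum(observed) / n
    rss = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    tss = sum((o - mean_obs) ** 2 for o in observed)
    r2 = 1.0 - rss / tss
    r2_adj = 1.0 - (1.0 - r2) * (n - 1) / (n - n_params)
    syx = math.sqrt(rss / (n - n_params))       # residual standard error
    vc_pct = 100.0 * syx / mean_obs             # variation coefficient, in %
    aic = n * math.log(rss / n) + 2 * n_params  # least-squares form (assumption)
    return {"r2_adj": r2_adj, "syx": syx, "vc_pct": vc_pct, "aic": aic}


# Hypothetical observed and predicted volumes (m3)
obs = [0.05, 0.11, 0.18, 0.27, 0.39]
pred = [0.06, 0.10, 0.19, 0.26, 0.40]
print(fit_statistics(obs, pred, n_params=3))
```

Higher R²adj and lower Sy.x, VC%, and AIC point to the better model when all candidates are fitted to the same trees.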

ANN Training
The trained ANNs were of the multilayer perceptron (MLP) type. The typical MLP architecture consists of one input layer containing the predictor variables, one or more hidden layers, and one output layer containing the predicted variable. The layers have a high degree of connectivity, and the strength of each connection is determined by the synaptic weights [32]. In the present study, the ANNs were trained with one hidden layer, and the initial weights were randomly generated. The ANNs were obtained with the Neural Networks tool in Statistica 13.2 Trial version (StatSoft Inc., Tulsa, USA).
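The architecture described above can be illustrated with a minimal Python sketch of the forward pass of an MLP with one hidden layer and randomly generated initial weights. It is not the Statistica implementation and does not include training. The weight range, the tanh default, the three scaled inputs, and all names are assumptions made for the example.

```python
import math
import random

# Activation functions available in this sketch (tanh is an assumed extra)
ACTIVATIONS = {
    "identity": lambda z: z,
    "exponential": math.exp,
    "tanh": math.tanh,
}


def init_mlp(n_inputs, n_hidden, seed=0):
    """Create a one-hidden-layer network with random initial weights."""
    rng = random.Random(seed)

    def draw():
        return rng.uniform(-0.5, 0.5)

    return {
        "w_hidden": [[draw() for _ in range(n_inputs)] for _ in range(n_hidden)],
        "b_hidden": [draw() for _ in range(n_hidden)],
        "w_out": [draw() for _ in range(n_hidden)],
        "b_out": draw(),
    }


def forward(net, x, hidden="tanh", output="identity"):
    """Propagate one input vector through the hidden layer to the single output."""
    f_hidden = ACTIVATIONS[hidden]
    f_out = ACTIVATIONS[output]
    hidden_values = [
        f_hidden(sum(w * xi for w, xi in zip(weights, x)) + bias)
        for weights, bias in zip(net["w_hidden"], net["b_hidden"])
    ]
    z = sum(w * h for w, h in zip(net["w_out"], hidden_values)) + net["b_out"]
    return f_out(z)


# Hypothetical inputs scaled to [0, 1]: dbh, h, and age
net = init_mlp(n_inputs=3, n_hidden=5, seed=42)
print(forward(net, [0.45, 0.60, 0.30]))
```

Training would adjust the weights to minimize the prediction error on the training trees; the untrained output printed here has no meaning as a volume.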

Selection of the Best ANN
The ANNs retained in the training stage were applied in the generalization dataset and

Experimental Scenarios
The methods were applied and evaluated in two scenarios: (a) the regression model fit and ANN training with 80% of the data (533 trees), and per clone group, after cluster analysis, with 80% of the trees in each group; and (b) model fit and ANN training with trees of only one clone group at the ages of 2 and 3 years, with six, five, four, three, two, and one tree per diameter class as sample intensities (Figure 3). The datasets for the model fit and ANN training in the two scenarios were randomly selected. The age, in months, was included in the fit of the volumetric model in both scenarios to obtain more accurate predictions.
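The sampling in scenario (b) can be sketched as a random draw of a fixed number of trees from each diameter class. The Python example below assumes classes of 2 cm amplitude starting at 4 cm, which is a guess based on the minimum dbh of 4.15 cm rather than the class limits of Table 1. The tree records, the seed, and the function names are hypothetical.

```python
import random
from collections import defaultdict


def diameter_class(dbh_cm, lower=4.0, width=2.0):
    """Index of the diameter class (0 for 4-6 cm, 1 for 6-8 cm, ...)."""
    return int((dbh_cm - lower) // width)


def sample_per_class(trees, k, seed=0):
    """Randomly select up to k trees from each diameter class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for tree in trees:
        by_class[diameter_class(tree["dbh"])].append(tree)
    sample = []
    for cls in sorted(by_class):
        members = by_class[cls]
        sample.extend(rng.sample(members, min(k, len(members))))
    return sample


# Hypothetical trees with dbh between 4.15 and 25.90 cm
rng = random.Random(1)
trees = [{"id": i, "dbh": round(rng.uniform(4.15, 25.90), 2)} for i in range(200)]
for k in (6, 5, 4, 3, 2, 1):
    print(k, "trees per class ->", len(sample_per_class(trees, k)), "trees sampled")
```

Classes with fewer than k trees contribute all of their trees, so the total sample can be smaller than k times the number of classes.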

Methods Analysis
The analysis of the method predictions (regression and ANN) in both scenarios was based on the statistics cited: ryŷ, RMSE%, bias, and error variance. Scatter plots of the standardized residuals as a function of the predicted v were also produced.
In scenario (b), an analysis of variance (ANOVA) was performed with a 3 × 5 factorial arrangement, with three treatments (regression, ANN, and real) and five sample intensity levels (six to two trees/class), to evaluate the differences between the means of the predicted and real values at each sample intensity.
Before ANOVA, the normality of the data was verified with the Lilliefors test and the homogeneity of variance (homoscedasticity) with the Bartlett's test at the 1% significance level.
The "lillie.test" function of the 'nortest' package was used in the normality test and the "bartlett.test" function of the stats package was used in the homoscedasticity test, both in R.
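The prediction statistics named in this section (ryŷ, RMSE%, bias, and error variance) can be computed as in the Python sketch below. The definitions are common ones and are an assumption: the error is taken as real minus predicted, RMSE% is expressed relative to the mean real volume, and the error variance uses n − 1 in the denominator. The study's own expressions may differ. The function name and the data are hypothetical.

```python
import math


def prediction_statistics(real, pred):
    """Correlation, RMSE%, bias, and error variance for one set of predictions."""
    n = len(real)
    mean_real = sum(real) / n
    mean_pred = sum(pred) / n
    errors = [r - p for r, p in zip(real, pred)]  # sign convention is an assumption
    cov = sum((r - mean_real) * (p - mean_pred) for r, p in zip(real, pred))
    ss_real = sum((r - mean_real) ** 2 for r in real)
    ss_pred = sum((p - mean_pred) ** 2 for p in pred)
    bias = sum(errors) / n
    return {
        "r": cov / math.sqrt(ss_real * ss_pred),
        "rmse_pct": 100.0 * math.sqrt(sum(e * e for e in errors) / n) / mean_real,
        "bias": bias,
        "error_variance": sum((e - bias) ** 2 for e in errors) / (n - 1),
    }


# Hypothetical real and predicted volumes (m3)
real_v = [0.05, 0.11, 0.18, 0.27, 0.39]
pred_v = [0.06, 0.10, 0.19, 0.26, 0.40]
print(prediction_statistics(real_v, pred_v))
```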

Methods Validation
The regression and ANN predictions were tested with the Kolmogorov-Smirnov test at the 5% significance level [20]. The hypotheses tested were as follows: Hn = the frequency distributions of individuals by volume class are equal, and Ha = the frequency distributions of individuals by volume class are different. The statistic was calculated with the "ks.test" function of the stats package in R. The validation trees were those that did not participate in the ANN training and regression fit, which avoided biases in the predictions.
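The logic of the Kolmogorov-Smirnov comparison can be sketched in Python as the largest absolute difference between the empirical cumulative distributions of the real and predicted volumes. The example below is not the "ks.test" routine: it works on individual volumes rather than on frequencies by volume class, and it uses the large-sample critical value instead of an exact p-value. Both choices, the function names, and the data are assumptions made for the illustration.

```python
import math


def ks_two_sample(sample_a, sample_b):
    """Two-sample KS statistic D: maximum gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    n, m = len(a), len(b)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        x = min(a[i], b[j])
        while i < n and a[i] <= x:
            i += 1
        while j < m and b[j] <= x:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d


def ks_critical_value(n, m, alpha):
    """Large-sample critical value of D for significance level alpha."""
    return math.sqrt(-0.5 * math.log(alpha / 2.0)) * math.sqrt((n + m) / (n * m))


# Hypothetical real and predicted volumes (m3) of validation trees
real_v = [0.05, 0.08, 0.11, 0.15, 0.18, 0.22, 0.27, 0.31, 0.39, 0.45]
pred_v = [0.06, 0.08, 0.10, 0.16, 0.19, 0.21, 0.26, 0.33, 0.40, 0.44]
d = ks_two_sample(real_v, pred_v)
d_crit = ks_critical_value(len(real_v), len(pred_v), alpha=0.05)
print("D =", round(d, 3), "critical D =", round(d_crit, 3))
print("distributions differ" if d > d_crit else "no evidence of a difference")
```

A value of D below the critical value means that the hypothesis of equal distributions is not rejected at the chosen significance level.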

Clone Grouping
In the evaluation of the resulting dendrogram, the cophenetic correlation coefficient was  (Table A2).

Best Model Selection
According to the fit statistics, Model 5 (Prodan) presented the highest R²adj value and the smallest Sy.x, VC%, and AIC (Table 3). The frequency distribution of E% was more concentrated around zero for Model 5. The variation in the errors of Model 5 was also smaller than that of the other models; therefore, this model was selected for evaluation with ANN in the later stages (Figure 5). Models 2 and 5 presented VIF > 10, indicating the presence of multicollinearity.

ANN Retained in Scenarios (a) and (b)
The selected ANNs presented a high degree of correlation between the predicted and real v in the generalization, with ryŷ > 0.9860 (Table 4). The exponential activation function was the most frequent and the identity the least frequent in the selected ANNs. The number of neurons in the hidden layer varied from 3 to 25 in scenario (a) and from 1 to 30 in (b). In scenario (a), the number of input variables of ANNs with the general data was higher than in the other approaches of this scenario and, in scenario (b), the number of input variables at the intensity of one tree per diameter class was lower (13–19).
Table 4. Configurations, correlation coefficients between real and predicted values (ryŷ), and weighted value (WV) of the best artificial neural network in the generalization of the approaches in scenarios (a) and (b).

Predictions Assessment in Scenario (a)
The volume predictions of Model 5 and of the best ANN selected for the general data and for each clone group were highly correlated with the real volumes, with ryŷ ≥ 0.9940 (Table 5). The ANN was more efficient in the approach with the general data, with an RMSE smaller than 8%. The error variations of Model 5 in clone groups A and B were smaller than those of the ANN. The accuracy of the v predictions in the upper classes was lower for both methods (Figure 6).

Predictions Assessment in Scenario (b)
As in scenario (a), the ryŷ values in the two methods were relatively high (≥0.9843) (Figure 7).

Variance Analysis for Means in Scenario (b)
In the normality test, values of v with Model 5 and ANN were significant at 1%, except for the intensity of five and four trees per diameter class, respectively (Table A3) (Table A4).

Predictions Validation in Both Scenarios
According to the Kolmogorov-Smirnov test, the predicted v distributions were similar to those of the real values in scenarios (a) and (b) for both methods (Model 5 and ANN), with p-values greater than 1% (Table A5); thus, the hypothesis Hn was not rejected.

DISCUSSION
New methods for predicting timber volume without reducing accuracy are needed to reduce the costs of felling sample trees. ANNs have advantages over linear regression models, such as the ability to learn and generalize, noise tolerance, and the modeling of non-linear relationships between variables [34][35][36][37]. However, the present study showed that linear regression can also outperform ANNs, contrary to other studies [21,38,39]. In addition, regression models are easier to implement and can provide logical estimates under extrapolation [1,40]. Thus, the comparison between the two methods is valid to investigate which is more appropriate for data modeling.
The overestimation of some smaller values of v in the five fitted models (Figure 5) indicated the presence of outliers [41][42][43]; however, this did not significantly influence the predictions of the best model. Model 5 was selected for comparison with the ANN because of its good statistical fit, more compact residual distribution, and lower variability and bias in the predictions. This choice reinforces the importance of testing more than one model per situation, even though the Schumacher-Hall model is often reported as the best model to predict tree volume [44][45][46]. Thus, Model 5 best fitted the data of this study and is the most suitable to predict commercial volume at the regional level. Multicollinearity was expected among the independent variables of Models 2 and 5. Therefore, these models should be used with caution for data from other locations or outside the range of the variables presented in this study.
The high ryŷ of ANN retained in both scenarios was due to their ability to model complex relationships between qualitative and quantitative variables [47][48][49]. The largest neuron number in the hidden layer for the general data was due to the modeling complexity with qualitative variables (region, farm, and groups of clones) [50]. The input number reduction in ANN up to the one tree per class intensity was due to data randomization without the inclusion of some categories of qualitative variables in the training (region and farm). According to Silva et al. [51], obtaining the appropriate architecture of the neural network is optimized by successive attempts that produce satisfactory results. Görgens et al. [15] argue that architecture is directly related to the learning power of the network, and, in more complex data, the demand for neurons and even for layers increases. In the present study, the non-linear behavior of the data required the use of more complex variables, and, therefore, influenced the number of neurons in the input and hidden layers. Thus, it is important to test as many architectures as possible to find the best fit for the data distribution.
In scenario (a), the highest ryŷ value and lowest RMSE% of the ANN with the general data indicate that precision increases with the inclusion of the clone group as a qualitative variable [22,23,52]. In scenario (b), the similarity of the v means in the ANOVA between treatments at intensities of six to two trees per diameter class indicates that the reduction in the number of trees did not significantly decrease the accuracy of the estimates. Reducing the sampling intensity may therefore be feasible for the prediction of v, saving time and cost in the collection of volume data [17]. The increase in error when using one tree per diameter class indicates that this sampling intensity has difficulty representing all the variation in the data. Further studies on volume estimation at different sample intensities can be performed locally or regionally to verify the gains in sampling optimization and in the precision and accuracy of the estimates. In addition, tests of ANN parameterization, the evaluation of other training algorithms such as resilient propagation, backpropagation, and Levenberg-Marquardt, and normalization and equalization tests may bring better results when estimating commercial volume [23].
Concerning the Kolmogorov-Smirnov test, the similarities in the predicted v distribution relative to the real values indicated ANN and Model 5 as the most appropriate methods to predict timber volume of Eucalyptus clones at the tested sample intensities due to the accuracy [20]. The v distribution with the trend of real data, even with their complexity, resulted in a good quality fit for Model 5.

CONCLUSIONS
The ANN and linear regression proved to be efficient in predicting commercial volume in scenarios (a) and (b). The ANN more efficiently predicted data at more general levels (all clone groups). Model 5 (Prodan) also generated predictions with accuracy, and, among the regression models, was the best fit to the data.