Predicting Eucalyptus Diameter at Breast Height and Total Height with UAV-Based Spectral Indices and Machine Learning

Abstract: Machine learning (ML) techniques have gained attention in precision agriculture because they efficiently address multiple applications, such as estimating the growth and yield of trees in forest plantations. Combining ML algorithms with spectral vegetation indices (VIs) from high-spatial-resolution (0.079024 m) multispectral imagery could optimize the prediction of these biometric variables. In this paper, we investigate the performance of ML techniques and VIs acquired with an unmanned aerial vehicle (UAV) to predict the diameter at breast height (DBH) and total height (Ht) of eucalyptus trees. An experimental site with six eucalyptus species was selected, and the Parrot Sequoia sensor was used. Several ML techniques were evaluated: random forest (RF), REPTree (DT), alternating model tree (AT), k-nearest neighbor (KNN), support vector machine (SVM), artificial neural network (ANN), linear regression (LR), and radial basis function (RBF). The performance of each algorithm was verified using the correlation coefficient (r) and the mean absolute error (MAE). We used 34 VIs as numeric input variables to predict DBH and Ht. We also added a categorical input variable identifying the eucalyptus species. RF obtained an overall superior estimation for all the tested configurations. Still, RBF also performed well for predicting DBH, numerically surpassing RF in both r and MAE in some cases. For Ht, the technique that obtained the smallest MAE was SVM, though only in one particular test. We conclude that a combination of ML and VIs extracted from UAV-based imagery is suitable to estimate DBH and Ht in eucalyptus species. The approach presented constitutes an interesting contribution to the inventory and management of planted forests.


Introduction
Modeling the growth and yield of trees is an essential issue in forest management. Brazilian tree plantations are the most productive on a worldwide scale, and eucalyptus trees are the most common species used in reforestation activities [1]. In 2018, the mean annual production of eucalyptus in Brazil was 36 m³ ha⁻¹ year⁻¹, and this number increased to 39 m³ ha⁻¹ year⁻¹ in 2019, with more than 5.7 million hectares planted with this crop in Brazil [1]. There are more than 700 species of Eucalyptus worldwide, and they are used for different applications, such as paper, cellulose, energy generation, and charcoal, among others [1]. For this reason, and as a result of their high expansion rate in many tropical countries [2], eucalyptus trees have attracted attention for their important commercial role in the Brazilian economy, and they are mainly produced in states like Minas Gerais (24%), São Paulo (17%), and Mato Grosso do Sul (16%). On average, the area planted with eucalyptus grew 1.1% per year in the last seven years. Mato Grosso do Sul led this expansion with an average growth of 7.4% per year [1].
In the past, the traditional strategy to estimate growth and yield in trees was based on regression models [2][3][4][5][6][7][8][9][10], which can predict the DBH, Ht, and competition index of trees, among other parameters and metrics that describe growth and yield. Nonetheless, in recent years, more efficient methods based on ML techniques [11][12][13][14] have gained prominence for this task. Classification and regression problems are the core research direction in ML. Lately, research on regression problems, which include many agricultural and environmental problems, has received considerable attention, becoming a research hotspot in the ML field [15].
Many studies [2,14,16–19] show meaningful improvements in the accuracy of estimates when ML models are implemented compared to traditional methods. It is also worth mentioning that several studies [2,12,14,20–23] have adopted ML algorithms to model yield or determine the DBH and Ht of eucalyptus trees in Brazil. However, a common characteristic of these studies is the data source used to estimate these variables. The datasets originate from measurements made directly and annually in the field, resulting in a continuous forest inventory for eucalyptus yield estimation. This is a time-consuming, costly, and highly demanding task, which could be supported by new approaches, such as the use of remote sensing data.
A study by Maire et al. [2] used time series of the Normalized Difference Vegetation Index (NDVI) from MODIS satellite data to monitor the biomass of 15,000 ha of eucalyptus plantations in southern Brazil. The authors estimated the stand age through a time-series analysis, and the volumes and dominant heights of individual forest plots by applying linear (stepwise) and nonlinear RF regression models. The authors concluded that the accuracy of biomass prediction using the RF algorithm was improved by including the NDVI data from the first two years after planting. Another study [13] integrated a group of variables, namely stand age, remote sensing data (multispectral optical imagery from the Landsat 8 Operational Land Imager sensor and radar imagery from the Sentinel-1B satellite), and terrain attributes extracted from a digital elevation model, to map the volume of eucalyptus plantations with the RF algorithm. Among the data combinations evaluated, the model that integrated remote sensing data with stand age variables improved volume estimation significantly.
More recently, the Unmanned Aerial Vehicle (UAV) has become a popular remote sensing platform for agricultural applications, with emphasis on crop monitoring [24][25][26]. Wide market availability, low operating cost, and optimized image acquisition and processing are some characteristics of these remote sensing platforms [24,27,28]. Nevertheless, concerning eucalyptus tree mapping, no study has investigated the performance of ML algorithms for predicting the DBH and Ht using spectral vegetation indices extracted from UAV imagery, which represents a gap in the literature on precision farming applications. In this regard, here we present the performance of several ML algorithms in predicting DBH and Ht for different Eucalyptus species based on spectral indices computed from high-spatial-resolution multispectral imagery acquired by a UAV-embedded remote sensor. Six species of eucalyptus

Study Area
The study area has an altitude of approximately 820 m and is located at Chapadão do Sul, Mato Grosso do Sul, Brazil, in the experimental site of the Federal University of Mato Grosso do Sul (Figure 1). The soil is classified as medium-textured Red Oxisol. According to the Köppen classification, the climate is tropical humid (Aw), with a rainy season from October to April and a dry season between May and September. The experimental plantation was established in January 2014 in randomized blocks with four replicates, with 20 plants inside each plot. The treatments were composed of six species of eucalyptus: E. camaldulensis, E. urophylla, E. saligna, E. grandis, E. urograndis, and Corymbia citriodora. E. grandis has excellent qualities for silviculture, surpassing any other species in increment when the environmental conditions are adequate, which explains its widespread use [29].
However, E. grandis has some restrictions on its ability to regrow after periodic cuts, which makes it inferior to E. saligna in this regard [30]. Corymbia citriodora wood is considered excellent for sawmilling, charcoal production, posts, and railroad ties [31]. E. saligna is a species very close to E. grandis in botanical, ecological, and silvicultural aspects. Under Brazilian conditions, the growth of E. saligna is generally lower than that of E. grandis [32]. Interest in E. urophylla has arisen in Brazil in recent years, after its high resistance to eucalyptus canker caused by Cryphonectria cubensis was proven, and also because of the properties of its wood, which is highly recommended for the production of cellulose [33]. E. camaldulensis is a preferred species for planting in tropical regions subject to periods of drought, due to its greater tolerance to soil water deficit [34].

Data Acquisition and Pre-Processing
The DBH and Ht at stand level were obtained by measuring five trees in each experimental unit (24 sample plots located randomly). To obtain the DBH (cm), a tape was used to measure the circumference at breast height, which was later converted to DBH. The Ht (m) was obtained with the aid of a Haglof hypsometer. Seven measurements were performed, on 1 November 2018, 6 December 2018, 22 January 2019, 29 March 2019, 10 May 2019, 30 October 2019, and 28 November 2019. These measurements covered both seasons (dry and wet), with the aim of generating high variability in the data and, from that, building robust models that can be used at any time of the year. Therefore, 168 samples were used for each model, obtained by multiplying the 24 plots by the seven acquisition dates.
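As a minimal sketch of the two calculations described above, the snippet below converts a tape-measured circumference to DBH, assuming a circular stem cross-section (DBH = C / π), and reproduces the sample count. The function name and the example circumference are hypothetical and are not taken from the field data.

```python
import math


def circumference_to_dbh(circumference_cm: float) -> float:
    """Convert circumference at breast height (cm) to DBH (cm).

    Assumes a circular stem cross-section, so DBH = C / pi.
    """
    if circumference_cm <= 0:
        raise ValueError("circumference must be positive")
    return circumference_cm / math.pi


# Hypothetical tape reading, for illustration only.
example_dbh = circumference_to_dbh(50.0)

# Sample count per model: 24 plots x 7 acquisition dates.
n_plots, n_dates = 24, 7
n_samples = n_plots * n_dates

print(f"DBH = {example_dbh:.2f} cm, samples per model = {n_samples}")
```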
The flights were carried out with the Sensefly eBee RTK fixed-wing UAV equipped with the Parrot Sequoia multispectral sensor (G: 550 nm; R: 660 nm; RE: 735 nm; and NIR: 790 nm). Further details on flight procedures and the acquisition of wavelengths for calculating VIs can be found in [32]. Here we highlight only the main aspects of this processing. The flights were made with 80% lateral and 85% longitudinal image overlap, and the same area was photographed twice using perpendicular flight lines. The increased overlap between images was necessary to obtain a high number of scenes containing the same control points, allowing greater precision in the orthomosaic generation in the Pix4Dmapper software. This was needed because of the plant height: tall stems oscillate in the wind, which, regardless of its speed, interferes with the mosaicking process. The overflights were carried out at 11 a.m., with the sun close to the zenith, to minimize tree shadows, since the multispectral sensor is of the passive type, that is, dependent on solar illumination.
Radiometric calibration was performed for the entire scene, based on calibrated reflective surfaces. Solar irradiance parameters were corrected in the Pix4Dmapper software [35] using the camera's reflectance calibration plate, which is individualized for each device. The plate contains information on the reflectance rates for each wavelength measured by the multispectral sensor, for the entire scene. This procedure is performed in the field immediately before the flight, which is carried out with the e-Motion software. The reflectance of the multispectral images was obtained at the green (550 nm), red (660 nm), red edge (735 nm), and near-infrared (790 nm) wavelengths. The trees were identified in the RGB image of the same scene, which makes it possible to distinguish leaves from soil. From the orthomosaic images, the VIs were extracted in ArcGIS 10.5, taking as a reference a polygon layer created manually from the image by outlining the crown of each tree. Table S1 contains the vegetation indices (VIs) used for the prediction of DBH and Ht. The combination of ML algorithms with the high spatial resolution of the VIs (0.079024 m) may contribute to optimizing the prediction of these biometric variables.
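The full list of 34 VIs is given in Table S1 and is not reproduced here. As an illustration only, the sketch below computes two widely used indices, NDVI = (NIR − R)/(NIR + R) and its red-edge analogue NDRE = (NIR − RE)/(NIR + RE), from the mean band reflectance of the pixels inside one crown polygon. The pixel values, the function names, and the choice of NDRE are assumptions made for this example; the actual extraction was done in ArcGIS.

```python
from statistics import mean


def normalized_difference(a: float, b: float) -> float:
    """Return (a - b) / (a + b), the generic normalized-difference form."""
    total = a + b
    if total == 0:
        raise ValueError("band sum is zero; index is undefined")
    return (a - b) / total


def crown_indices(pixels):
    """Average band reflectance over the crown pixels, then compute NDVI and NDRE.

    `pixels` is a list of dicts holding reflectance (0-1) for the Sequoia
    bands 'green', 'red', 'red_edge', and 'nir'.
    """
    red = mean(p["red"] for p in pixels)
    red_edge = mean(p["red_edge"] for p in pixels)
    nir = mean(p["nir"] for p in pixels)
    return {
        "NDVI": normalized_difference(nir, red),
        "NDRE": normalized_difference(nir, red_edge),
    }


# Hypothetical reflectance values for three pixels inside one crown polygon.
crown = [
    {"green": 0.08, "red": 0.05, "red_edge": 0.25, "nir": 0.45},
    {"green": 0.09, "red": 0.06, "red_edge": 0.27, "nir": 0.48},
    {"green": 0.07, "red": 0.04, "red_edge": 0.23, "nir": 0.42},
]
print(crown_indices(crown))
```

Averaging the bands before computing the index is one possible design choice; computing the index per pixel and then averaging would give slightly different values.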

Statistical Analyses and Machine Learning Approach
To verify the linear relationship between the DBH and Ht variables and the VIs, a correlation network was built. This technique is used to graphically visualize a Pearson correlation matrix between variables. Variables that are positively correlated are linked by green lines, while variables that are negatively correlated are linked by red lines. The thickness of the line is proportional to the magnitude of the correlation, i.e., the closer the correlation is to 1 (or −1), the thicker the line. This analysis was performed with the graph package of the R software (Version 3.6.3, The R Foundation for Statistical Computing, Vienna, Austria). Eight ML algorithms were evaluated with the correlation coefficient (r) and mean absolute error (MAE) metrics in a randomized stratified 10-fold cross-validation with 10 repetitions, giving a total of 100 runs for each model. Aside from using all the VIs (Table S1) as input variables to predict the DBH and Ht, another scenario was tested in which a new categorical variable indicating the species of eucalyptus was included. All eight ML techniques were tested with (Yes) and without (No) species.
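The evaluation protocol can be sketched as follows: the two metrics (Pearson r and MAE) and a generator that yields the 10 × 10 = 100 train/test splits over the 168 samples. This is a simplified illustration in plain Python, not the Weka procedure used in the study; in particular, the folds here are only shuffled, and the stratification step is omitted. All function names and the seed are assumptions.

```python
import math
import random


def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient between observed and predicted values."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    if var_t == 0 or var_p == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_t * var_p)


def mean_absolute_error(y_true, y_pred):
    """Mean of the absolute differences between observed and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def repeated_kfold(n_samples, k=10, repeats=10, seed=0):
    """Yield (train_indices, test_indices) for each fold of each repetition."""
    rng = random.Random(seed)
    for _ in range(repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # new random partition for every repetition
        for fold in range(k):
            test = indices[fold::k]
            held_out = set(test)
            train = [i for i in indices if i not in held_out]
            yield train, test


splits = list(repeated_kfold(n_samples=168))
print(len(splits), "runs per model")
```

In each run, a model would be fitted on the training indices and scored with `pearson_r` and `mean_absolute_error` on the test indices, giving 100 values of each metric per configuration.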
The ML techniques used in this study are displayed in Table 1. Among them is the RF, which is an ensemble-based algorithm capable of producing several classification trees for the same dataset and using a voting scheme among all these learned trees to classify new instances [36,37]. Two other tree-based algorithms were used: the Reduced Error Pruning Decision Tree (DT) and the Alternating Model Tree (AT). DT is an adaptation of the C4.5 classifier that can be used in regression problems, with an additional pruning step based on an error reduction strategy [38,39]. AT is another ensemble-based algorithm that applies a boosting strategy to reduce overfitting [40]. The K-Nearest Neighbours (KNN) algorithm was also used, with k = 5, as previous experiments on similar datasets suggest that higher and lower values of k do not improve the overall performance. KNN is a non-parametric lazy learning approach that interpolates the training instances closest to the input to obtain its regression value [41,42].
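To make the KNN regression step concrete, the sketch below predicts a value as the unweighted mean of the targets of the k = 5 nearest training instances under the Euclidean distance. It is an illustrative plain-Python version, not the Weka implementation used in the experiments; the unweighted mean, the absence of feature scaling, and the toy data are assumptions.

```python
import math


def knn_predict(train_X, train_y, query, k=5):
    """Predict a regression value as the mean target of the k nearest neighbours."""
    if k < 1 or k > len(train_X):
        raise ValueError("k must be between 1 and the number of training instances")
    # Pair each Euclidean distance with its target and sort by distance.
    by_distance = sorted(
        (math.dist(x, query), y) for x, y in zip(train_X, train_y)
    )
    nearest = by_distance[:k]
    return sum(y for _, y in nearest) / k


# Toy data: two hypothetical VI values per sample and a DBH target in cm.
train_X = [[0.60, 0.30], [0.62, 0.31], [0.70, 0.35],
           [0.72, 0.36], [0.80, 0.40], [0.82, 0.41]]
train_y = [12.0, 12.5, 14.0, 14.5, 16.0, 16.5]

print(knn_predict(train_X, train_y, query=[0.71, 0.355], k=5))
```

Because KNN is a lazy learner, there is no training phase: all the work happens at prediction time, when distances to the stored instances are computed.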
The last four ML techniques adopted are the Support Vector Machine (SVM), the Artificial Neural Network (ANN), Linear Regression (LR), and the Radial Basis Function Network (RBF). SVM was tested with the sequential minimal optimization (SMO) strategy and a polynomial kernel [43][44][45]. ANN was tested using Weka's default architecture, which consists of a single hidden layer whose number of neurons equals the number of attributes plus the number of classes, divided by 2 [46]. LR was tested with the Akaike information criterion for attribute selection [47] and, lastly, we tested an RBF that induces a Gaussian basis function network by minimizing the quadratic error with the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm [48]. The experiments were run on an Intel® Core™ i7 CPU with 8 GB of RAM, and all hyperparameters were set according to the Weka (Version 3.9.4, The University of Waikato, Hamilton, New Zealand) default library. Boxplots for all configurations evaluated are presented together with the Scott-Knott [49] test results at a 5% significance level for r and MAE of DBH and Ht.
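The hidden-layer rule quoted above is simple arithmetic and can be written as a one-line function. The sketch assumes integer division and that a single numeric target counts as one class; how the categorical species variable is counted after encoding is not stated in the text, so only the configuration with the 34 VIs is shown. The function name is hypothetical.

```python
def default_hidden_units(n_attributes: int, n_classes: int) -> int:
    """Hidden-layer size as (attributes + classes) / 2, using integer division."""
    if n_attributes < 1 or n_classes < 1:
        raise ValueError("both counts must be at least 1")
    return (n_attributes + n_classes) // 2


# 34 VIs as inputs and one numeric target (DBH or Ht), without species.
print(default_hidden_units(n_attributes=34, n_classes=1))
```

Under these assumptions, the network without the species variable would have 17 hidden neurons.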

Results
The linear correlation between the variables is expressed graphically in Figure 2. Ht and DBH are positively correlated with each other (r = 0.8361). However, these variables have a low correlation with the evaluated VIs (note that the lines connecting the VIs to Ht and DBH are thin). In general, the VIs are highly correlated with each other and can be viewed in two groups.
Figure 4 shows the boxplots for the Ht predictions, for both r and MAE, with (Yes) and without (No) the categorical species variable included in the test datasets, considering 100 runs. A clear pattern is readily apparent: all ML techniques improved their performance, for both the r and MAE metrics and for both DBH and Ht predictions, when the species variable was included (Yes) in the models. These improvements varied among techniques, with KNN and LR having a much higher improvement for r in DBH prediction than RF and DT, for instance.
The number of outliers and the interquartile range (IQR) also differ among techniques; in the MAE of the Ht predictions, RF and KNN are the only ones with no outliers. However, RF presents a lower IQR than KNN. In general, RF seems to have a more stable performance, with fewer outliers and a lower IQR than the other techniques across all configurations. For RF and some other techniques, besides improving the overall performance (higher median for r and lower median for MAE), the introduction of the species information also lowered the IQR and reduced the number of outliers. The boxplot medians for RF, SVM, and RBF seem to be consistently higher (for r) and lower (for MAE) than for the other techniques.
ML models are differentiated by their inductive bias, that is, the set of assumptions in each algorithm that the model uses to restrict the concept space or to select concepts from that space when generalizing from a set of training data [30,50,51]. The inductive bias of each method is as follows: RF works with data subsets, from which it selects features at random to build several decision trees that are then combined [36]; DT, a fast learner based on the C4.5 algorithm, builds a decision tree (discrete output) or a regression tree (continuous output) using information gain or variance, followed by reduced-error pruning and subsequent adjustment [38,39]; AT is a tree-based regression method, built with a variation-of-information criterion, reduced-error pruning, and subsequent adjustment, that exhibits linearity at the leaf nodes [48]; KNN uses the concept of classification by proximity, based on the Euclidean distance [42]; SVM separates classes with wide margins [44]; ANN is a connectionist system based on parallel distributed processing represented by numerical values [46,50]; LR relates the attributes to a linear output by minimizing the sum of squared errors [47]; and RBF is based on fully supervised training with several parameters, one of which penalizes the size of the weights in the output layer, and uses a global sigma to streamline the combined results [48].
Tables 2 and 3 show the results of the Scott-Knott [49] statistical test for DBH and Ht predictions, respectively. RF is the only ML technique in the best-performance group for all configurations, but RBF also presented high performance for DBH predictions, numerically outperforming RF in both r (0.776 against 0.765) and MAE (2.019 against 2.068 cm). For Ht prediction, the lowest MAE was achieved by SVM (MAE = 1.596 m). Again, all mean values improved significantly when the species information was used. In both tables, mean values followed by different letters in the same column differ by the Scott-Knott test at 5% probability.
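The boxplot quantities used in this section (median, IQR, and outliers) can be computed from the 100 metric values of a configuration as sketched below. The sketch assumes the common rule that flags points beyond 1.5 × IQR from the quartiles as outliers, which the text does not specify, and the MAE values are hypothetical.

```python
from statistics import quantiles


def box_summary(values, whisker=1.5):
    """Return the median, IQR, and outliers of a sample (1.5 x IQR rule by default)."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - whisker * iqr, q3 + whisker * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return {"median": median, "iqr": iqr, "outliers": outliers}


# Hypothetical MAE values (m) from a handful of cross-validation runs.
mae_runs = [1.58, 1.61, 1.63, 1.60, 1.65, 1.59, 1.62, 1.64, 1.66, 2.40]
print(box_summary(mae_runs))
```

A technique with a smaller IQR and an empty outlier list across runs is the one described above as more stable.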
The highest correlation coefficient was achieved for Ht prediction using RF (r = 0.793), slightly higher than the correlation coefficient for DBH using RBF (r = 0.776). However, at a 5% significance level, the mean performances of RF, SVM, and RBF were not statistically different for DBH prediction using species information. In the Ht prediction, also using the species categorical variable, LR falls in the same group as RF, SVM, and RBF, with the best performances and no statistically significant difference among them.

Discussion
This research investigated whether VIs obtained with a UAV multispectral sensor can be used to predict DBH and Ht in six species of eucalyptus. The results shown in Figure 2 indicate that all VIs have a low linear relationship with DBH and Ht. Therefore, it is necessary to look for algorithms that can capture nonlinear behavior, such as machine learning, to predict DBH and Ht. In this study, we demonstrated the capability of different ML algorithms to predict biometric variables, like DBH and Ht of eucalyptus trees, based only on spectral indices computed from high-spatial-resolution multispectral imagery acquired by a UAV. A robust analysis was conducted using the correlation coefficient (r) and mean absolute error (MAE) metrics and a randomized stratified 10-fold cross-validation with 10 repetitions. The main spectral VIs were considered as input to the methods. We also considered the eucalyptus species (E. camaldulensis, E. urophylla, E. saligna, E. grandis, E. urograndis, and Corymbia citriodora) as a categorical variable, which significantly improved the results, as can be verified in Figures 3 and 4.
Based on the statistical results, RF, SVM, and RBF algorithm performances are not statistically different for DBH prediction using species information. Regarding Ht prediction, LR, RF, SVM, and RBF provided similar performances. While the comparison between the eight ML methods showed that the RF did not significantly outperform the other methods, it should be highlighted that our results demonstrate that the RF presented a stable performance, in terms of fewer outliers and smaller interquartile range for both predicted variables DBH and Ht. Previous works [28,52] also showed that RF outperformed other ML methods to estimate variables related to agriculture applications.
According to the literature review, there are few studies related to the estimation of the DBH and Ht variables. Herein, we apply ML algorithms to spectral VIs extracted from images captured by UAV. We found studies that use a similar approach for predicting other variables, such as leaf nutrient content in citrus orchards [28] and yield in corn cultivars [52]. While [23] obtained accurate results, with R² greater than 0.95, when applying ANN (artificial neural network) and ANFIS (adaptive neuro-fuzzy inference system) models to estimate DBH and Ht, these models depend on input variables that require in situ measurements, which is a laborious task. Our proposal speeds up the prediction process by eliminating field measurement. Moreover, the previous studies analyzed show that the genetic material did not impact the results achieved, which differs from our findings.
Cut et al. [53] used density point cloud data provided by UAV LiDAR to estimate DBH and Ht. These authors considered only one species of eucalyptus, obtaining correlation coefficients of 0.77 and 0.91 for the DBH and Ht variables, respectively. Compared to the LiDAR-based approach, our method provided accurate results with a low-cost solution. It is important to highlight that six Eucalyptus species were considered in our study. Multispectral imagery acquired from UAV-embedded remote sensors is more commonly used in precision farming applications, mainly because it is cost-effective and quick to produce [24,26]. When conducting forest inventories, the dendrometric variables usually measured directly on delimited plots are DBH and Ht. Subsequently, these variables are used to generate indirect estimates of the variables that express production (i.e., volume and biomass), which are extrapolated to the plantation area. However, this process has high costs, since it demands travel to the planting area, an allocation of plots representative of the entire population, and the measurement itself, which requires time and skilled labor. The main cost of automating this process to replace the human effort is the acquisition of the UAV and the multispectral camera, in addition to the cost of training for image processing. The main limitation of UAV multispectral imagery is the occurrence of wind at the time of flight, which can reduce the accuracy of the process. While the use of multispectral imagery acquired from UAVs is still incipient in the forestry sector when compared to agriculture, studies using this approach have become increasingly common in recent years. Given this scenario, our approach may help ease the forest inventory and rapidly provide relevant information to assist technicians in the estimation of such variables.
Our study explores ML algorithms to process spectral VIs extracted from UAV-based multispectral imagery, aiming to estimate the DBH and Ht of multiple species of eucalyptus. Deep learning-based methods can be assessed in future work; however, the dataset would need to be enlarged, since these models often require larger numbers of samples. Indeed, many relevant and recent works [28,52,54,55] still show the potential of traditional ML methods for estimating agricultural parameters. In this regard, our approach is suitable to predict the DBH and Ht variables with satisfactory performance.

Conclusions
Our study explores ML algorithms to process spectral VIs extracted from UAV-based multispectral imagery, aiming to estimate the DBH and the Ht of multiple species of eucalyptus. We verified a similar performance for the prediction of DBH among the RF, SVM, and RBF algorithms, while the LR, RF, SVM, and RBF methods provided comparable performance on Ht prediction. An important finding is that improvements occurred when the eucalyptus species was considered as a categorical variable in the ML models. To conclude, the developed investigation constitutes a promising approach to contribute to forest inventory management. In future work, we intend to evaluate deep learning-based regression methods as more data become available.
Supplementary Materials: The following are available at www.mdpi.com/xxx/s1, Table S1: Equations used to calculate the vegetation indices implemented in the experiments.