Machine Learning Methods to Estimate Productivity of Harvesters: Mechanized Timber Harvesting in Brazil

The correct capture of information on forest operations carried out in forest plantations can help in the management of mechanized timber harvesting. Proper management must be able to size the resources and tools needed to carry out operations and to support strategic, tactical, and operational planning. To facilitate the decision making of forest managers, this work aimed to analyze the performance of machine learning algorithms in estimating the productivity of timber harvesters. As predictors of productivity, we used the availability of hours of machine use, individual mean volumes of trees, and terrain slopes. The dataset was composed of 144,973 records collected over a period of 28 months. We tested the predictive performance of 24 machine learning algorithms in default mode. In addition, we tested the performance of the blending and stacking ensemble learning methods. We evaluated model fit using the root mean squared error, mean absolute error, mean absolute percentage error, and determination coefficient. After cleaning the initial database, we used only 1.12% of it to build the model. The blending ensemble stood out, with a determination coefficient of 0.71 and a mean absolute percentage error of 15%. Using machine learning algorithms, it was possible to predict the productivity of timber harvesters. Testing a variety of machine learning algorithms with different dynamics helped us reach our goal: maximizing the model's performance through experimentation.


Introduction
Management is part of the routine of forest managers responsible for guiding and implementing mechanized logging operations. The optimization of time and of the biological assets capitalized in planted forests, which are depleted when harvested, affects the success of the operation. Thus, it is necessary to know the variables that influence mechanized timber harvesting, allowing for more effective planning.
Quality indicators, evaluation criteria, and risk analysis techniques enhance the structures that support decision makers. In this context, data collected in forest inventories, stand-level measurements, operational forest management, and the onboard computers of forest machinery support the management procedures for timber harvesting operations [1-6].
The manipulation and reuse of this information support management development itself and help identify opportunities that can be foreseen. However, planning a mechanized timber harvesting operation necessarily requires a robust and reliable quantitative database [7,8].

Dataset
We used structured data from the production and operation of mechanized timber harvesting in Eucalyptus- and Pinus-planted forests carried out by cut-to-length systems with harvesters. The planted forests with Eucalyptus had a spacing of 3.3 m × 1.8 m and mean age of 14 ± 9.87 years. The Pinus forest had a spacing of 3.3 m × 1.8 m and mean age of 22 ± 9.09 years. The wood from these forests was used as raw material for the production of pulp and paper.
The average meteorological conditions in the study region, according to the National Institute of Meteorology [56], were a relative humidity of 69.24%, a wind speed of 4.29 m s⁻¹, and an air temperature of 289.3 K. The operations took place in Brazil, in a region with a slope gradient from 7.32% to 35.06%. The slopes were categorized as gentle (3% to 10%), moderate (10% to 32%), and steep (32% to 56%) relief, according to Speight [57].
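The slope categories above can be expressed as a small lookup. This is a minimal sketch, not the authors' code: the function name `slope_class` is hypothetical, and the handling of the shared boundaries (10% and 32%) is an assumption, since the text does not say which class a boundary value belongs to.

```python
def slope_class(slope_percent):
    """Map a terrain slope (in percent) to the relief classes quoted in the text.

    Assumption: lower bounds are inclusive and upper bounds exclusive,
    except for the last class, which includes 56%.
    """
    if 3 <= slope_percent < 10:
        return "gentle"
    if 10 <= slope_percent < 32:
        return "moderate"
    if 32 <= slope_percent <= 56:
        return "steep"
    # Values outside 3% to 56% are not covered by the three classes used here.
    return None


# The two extremes of the slope gradient reported for the study region.
lowest = slope_class(7.32)
highest = slope_class(35.06)
```

Under these assumptions, the reported slope range spans all three classes, from gentle at the lower extreme to steep at the upper one.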
This research was based on empirical data and silvicultural inputs; therefore, the data were part of the daily records collected in the field by the onboard computers of timber harvesters. Despite considering all records in the initial analysis, we employed a series of compensatory controls from data wrangling.
Across the two daily 10 h shifts, the records of machine availability, individual mean volumes of trees, and terrain slope totaled 144,973 instances over a period of 28 months. These data were categorical, numerical, and ordinal, according to the box plots and distributions of predictor and target variables provided in the Supplementary Material (Figure S1). The datasets were labeled, joined, and manipulated using the R programming language [58].
The actual times spent in activities were recorded in the onboard computers of timber harvesters. We therefore estimated productivity as the ratio between the timber volume extracted by a harvester, in cubic meters, and the effective operation time, in seconds [59,60]. The same operator could operate different brands and models of timber harvesters; however, this variable was not included in the model because of the difficulty of tracking these data in the database. Altogether, the operating records of 21 harvesters were used (Table 1). When implementing machine learning routines in R for management planning and data quality assessment, we followed the procedures for transforming, cleaning, and merging different sources described by Konstantinou and Paton [61]. We built a data wrangling routine in which outliers and potentially correlated variables were removed.
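The productivity ratio described above can be written as a one-line function. The sketch below is illustrative only: the function names and the input values are made up, and the conversion to cubic meters per hour is an added convenience that the text does not mention.

```python
def productivity(volume_m3, effective_time_s):
    """Timber volume (m^3) divided by effective operation time (s)."""
    if effective_time_s <= 0:
        raise ValueError("effective operation time must be positive")
    return volume_m3 / effective_time_s


def per_hour(productivity_m3_per_s):
    """Assumed convenience conversion from m^3 per second to m^3 per hour."""
    return productivity_m3_per_s * 3600


# Made-up example: 36 m^3 extracted in one effective hour of operation.
rate = productivity(36.0, 3600.0)
rate_per_hour = per_hour(rate)
```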
The instances went through the data wrangling process, which was performed in order to properly transform and gather acquired data. Additionally, through the interquartile range, conceptually defined by the Tukey range [62], we removed outliers and, using Spearman correlation, we verified the correlations between attributes (p < 0.05).
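The two checks in this paragraph, Tukey's interquartile-range rule and Spearman's rank correlation, can be sketched with the Python standard library. This is an assumption-laden illustration rather than the authors' R routine: the multiplier `k=1.5`, the inclusive quartile method, and all function names are choices made here, and the p-value test (p < 0.05) is not reproduced.

```python
from math import sqrt
from statistics import quantiles


def tukey_filter(values, k=1.5):
    """Keep values inside [Q1 - k*IQR, Q3 + k*IQR]; k=1.5 is assumed."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / sqrt(sxx * syy)  # undefined if either input is constant


# Made-up toy values: one obvious outlier, and a monotonic relationship.
cleaned = tukey_filter([1, 2, 3, 4, 5, 6, 7, 8, 100])
rho = spearman([1, 2, 3, 4, 5], [2, 4, 8, 16, 32])
```

Because Spearman's coefficient depends only on ranks, any strictly increasing relationship between two attributes yields a coefficient of 1, even when the relationship is not linear.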
Furthermore, data balancing was performed with the synthetic minority oversampling technique (SMOTE). SMOTE was adopted because it is a reference algorithm for solving the class imbalance learning problem [63]. The SMOTE algorithm generates new synthetic examples in the neighborhood of small groups of nearby instances, using the k-nearest neighbor [64]. The function was implemented with the smotefamily package.
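The core step of SMOTE, interpolating between an instance and one of its nearest neighbors, can be sketched as follows. This is not the smotefamily implementation: the function name `smote_like`, its parameters, and the toy points are hypothetical, and the sketch assumes the samples passed in all belong to the group being oversampled.

```python
import random


def smote_like(samples, k=2, n_new=5, seed=0):
    """Create n_new synthetic points by interpolating toward near neighbors.

    samples: list of equal-length numeric tuples (more than k of them).
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(samples))
        base = samples[i]
        others = [s for j, s in enumerate(samples) if j != i]
        # k nearest neighbors of the base point by squared Euclidean distance.
        neighbors = sorted(
            others,
            key=lambda s: sum((a - b) ** 2 for a, b in zip(s, base)),
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment base -> neighbor
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbor))
        )
    return synthetic


# Made-up toy points on the corners of a unit square.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_like(points, k=2, n_new=10)
```

Each synthetic point lies on a line segment between two existing points, so the new examples stay within the region already covered by the original group.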

Different Learning Methods and Algorithm Approaches
Using a single dataset, we compared the predictive performance of 24 machine learning algorithms to estimate the productivity of timber harvesters. These algorithms were based on a decision tree, gradient boosting machine, linear regression, k-nearest neighbors, support vector machine, and artificial neural network.
For determining the best model, we used the following metrics: root mean squared error (RMSE), mean absolute error (MAE), and mean absolute percentage error (MAPE). We used the determination coefficient (R²) as a final performance measure for each method. We adopted a gradient of 5, 10, 15, 20, and 25 for cross-validation, in which the hyperparameters were automatically optimized. Finally, we implemented stacking ensemble and blending ensemble learning methods. We ordered stacking ensemble learning methods in a hierarchical data structure. On each fold set, we applied k-fold cross-validation.
The predictive performance of the models on unseen data was assessed by maximizing the determination coefficient (R²) and minimizing the test RMSE, which was determined from the mean of the random samples generated (n = 80). We implemented supervised regression using the PyCaret library [65] for the Python programming language to automate the machine learning workflow and model development. From the data instances, we assigned 90% (n = 1466) to the training set and 10% (n = 163) to the test set. We tested machine learning algorithms and selected them according to their performance in predicting the productivity of forest machines, using universal statistical metrics for evaluating the performance of models [66], such as root mean squared error (RMSE), mean absolute error (MAE), mean absolute percentage error (MAPE), and determination coefficient (R²).
Thus, we subjected the algorithms to different machine learning methods. First, we verified the performance of the algorithms individually, with hyperparameters in default mode. To improve performance, we adjusted the hyperparameters of the selected algorithms. We combined the validated data and formed the meta-feature set, the test data, and the target set. Again, we combined the sets using new meta-feature sets, creating a new meta-training set. The new target sets formed a new meta-test set. We generated final predictions with the level-one meta-learner, trained on the meta-training set.
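The four fit metrics named above have standard definitions, which the following sketch computes with the Python standard library. The function name and the toy values are made up for illustration; MAPE is returned as a fraction (multiply by 100 for a percentage) and is undefined when an observed value is zero.

```python
from math import sqrt


def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE, MAPE (as a fraction), and R^2 for paired values."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)  # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    return {
        "rmse": sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        "mape": sum(abs(e / t) for e, t in zip(errors, y_true)) / n,
        "r2": 1 - ss_res / ss_tot,
    }


# Made-up observed and predicted values, each prediction off by 0.5.
metrics = regression_metrics([2.0, 4.0, 6.0, 8.0], [2.5, 3.5, 6.5, 7.5])
```

In this toy case the RMSE and MAE coincide because every absolute error is the same, while MAPE weights the errors on small observed values more heavily.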
The combination learning method consisted of combining machine learning algorithms to minimize prediction error rates. For this, we divided the dataset into training and testing, as well as implementing zero-layer algorithms, which generated validation and test sets. We combined respective sets with new meta-training and meta-test sets [67] and generated final predictions by level-one meta-learner, from training with a meta-training set.
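The combination scheme described above can be illustrated with a deliberately small hold-out stacking sketch. Everything here is an assumption made for illustration, not the authors' PyCaret pipeline: the two level-zero learners (a one-predictor least-squares line and a 1-nearest-neighbor regressor), the level-one learner (a single convex weight chosen on the validation predictions), the reading of blending as an equal-weight average of base predictions, and the toy data.

```python
from statistics import mean


def fit_linear(xs, ys):
    """Least-squares line for one predictor; returns a predict function."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    return lambda x: intercept + slope * x


def fit_nearest(xs, ys):
    """1-nearest-neighbor regressor; returns a predict function."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def blend(models, x):
    """Blending, taken here as the equal-weight mean of base predictions."""
    return mean([m(x) for m in models])


def fit_stack(xs, ys, n_valid):
    """Hold-out stacking with two level-zero models and one learned weight."""
    fit_x, fit_y = xs[:-n_valid], ys[:-n_valid]
    val_x, val_y = xs[-n_valid:], ys[-n_valid:]
    models = [fit_linear(fit_x, fit_y), fit_nearest(fit_x, fit_y)]
    # Meta-training set: level-zero predictions on the held-out validation rows.
    meta_train = [[m(x) for m in models] for x in val_x]

    def sse(w):
        return sum(
            (w * p[0] + (1 - w) * p[1] - y) ** 2
            for p, y in zip(meta_train, val_y)
        )

    # Level-one learner: pick the convex weight with the lowest validation error.
    weight = min((i / 10 for i in range(11)), key=sse)
    return lambda x: weight * models[0](x) + (1 - weight) * models[1](x)


# Made-up toy data following y = 2x + 1 exactly.
xs = [float(i) for i in range(10)]
ys = [2 * x + 1 for x in xs]
stacked = fit_stack(xs, ys, n_valid=3)
stacked_prediction = stacked(20.0)
blended_prediction = blend([fit_linear(xs, ys), fit_nearest(xs, ys)], 20.0)
```

On this toy data the level-one learner puts all the weight on the linear model, because the nearest-neighbor model cannot extrapolate beyond the fitted range, while the equal-weight blend is pulled toward the weaker model. In practice the rows would be shuffled before the split, and the meta-learner would usually be a full regression model.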

Dataset Quality
The manipulation of the dataset with daily records of the mechanized timber harvesting operation resulted in a sample of 144,973 instances. However, because it was consolidated from the union of different sources, including manual notes, the quality of the dataset was partially compromised. Thus, we removed duplicate instances and instances with missing information.
With a data wrangling routine, in addition to cleaning, filtering, and transforming data, we examined data quality, excluding outliers and promoting balancing. It is noteworthy that, despite timber harvesters having onboard computers, the data recording process still required manual interactions. Consequently, after this process, only 1.12% of the dataset was used to build the models with machine learning algorithms (Table 2). The attributes selected for model building were individual mean volumes of trees, terrain slope, and availability of hours of machine use. More details about the mean, standard deviation, and median of the dataset from the mechanized timber harvesting operation for the attributes under study are shown in Table 3.
Table 3. Mean, standard deviation, and median of the dataset from the mechanized timber harvesting operation, after the removal of outliers from the three initial attributes.

Different Learning Methods and Algorithm Approaches
First, we analyzed the predictive performance of the 24 algorithms individually, based on model fit metrics. Of the three trained algorithms based on the decision tree, extra trees stood out, with a determination coefficient 0.01 higher than that of random forest and 0.22 higher than that of the decision tree (Table 4). Among the four algorithms based on gradient-boosted machines, the best determination coefficient was obtained by CatBoost Regressor, which was 0.04 higher than that of Gradient Boosting Regressor and 0.20 higher than that of AdaBoost Regressor (Table 5). The availability of algorithms based on linear regression allowed twelve of them to be trained (Table 6). The Automatic Relevance Determination, Kernel Ridge, Linear Regression, Huber Regression, Ridge Regression, and Bayesian Ridge algorithms showed the same determination coefficient, which was 0.02 higher than that of TheilSen Regressor, 0.06 higher than that of Least Angle Regression, 0.13 higher than that of Orthogonal Matching Pursuit, and 0.42 higher than that of Lasso Regression and Elastic Net.
Among the remaining algorithms, which have different dynamics, the best determination coefficient was obtained by the k-neighbors regressor, which was 0.05 higher than that of the multi-layer perceptron regressor, 0.18 higher than that of Random Sample Consensus, and 0.45 higher than that of Support Vector Regression (Table 7).
Table 7. Evaluation metrics of models based on k-nearest neighbor, multi-layer perceptron regressor, random sample consensus, support vector regression, dummy regressor, and passive-aggressive regressor applied to the training set from the mechanized timber harvesting operation.
The applied models with the best determination coefficients were the blending ensemble and the stacking ensemble, followed by the algorithms run in default mode, among which the Extra Trees Regressor stood out (Table 8). When analyzing the metrics on the test set, the blending ensemble model was confirmed as the best predictor of the productivity of timber harvesters (Table 9). In addition, as an assessment of overall model performance, we evaluated the 80 combinations of test set data against the response. The blending ensemble, followed by the stacking ensemble, produced relatively higher average values of R² (Figure 1) and a lower degree of dispersion. Figure 2 illustrates the performance of the main algorithms used in model construction to predict productivity, in relation to observed values.

Although the algorithms used in the construction of the models do not allow interpretability, the density distribution of the predictor variables (individual mean volumes of trees, terrain slope, and machine availability) was determined for each productivity quartile of the test set (Figure 3). When we selected the black box algorithm, the increase in performance compromised the interpretation of the relationships between the predictor variables and the target variable. Complex mathematical functions made inference by technical experts difficult. However, by visualizing the distributions of the predictor variables in each quartile of the response variable, it was possible to infer that higher productivity was associated with greater machine availability and lower slope levels.
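The quartile-based reading of the predictors can be approximated numerically by grouping records by productivity quartile and summarizing each predictor within the group. The sketch below is a simplified stand-in for the density plots: it reports group means only, the function and field names are hypothetical, and the records are made-up toy values rather than data from the study.

```python
from statistics import mean, quantiles


def summarize_by_quartile(rows, target, predictors):
    """Mean of each predictor within each quartile (1-4) of the target."""
    q1, q2, q3 = quantiles([r[target] for r in rows], n=4, method="inclusive")

    def quartile(value):
        if value <= q1:
            return 1
        if value <= q2:
            return 2
        if value <= q3:
            return 3
        return 4

    groups = {}
    for row in rows:
        groups.setdefault(quartile(row[target]), []).append(row)
    return {
        q: {p: mean([r[p] for r in members]) for p in predictors}
        for q, members in sorted(groups.items())
    }


# Made-up toy records in which productivity rises as slope falls.
records = [
    {"productivity": float(i + 1), "slope": 40.0 - 4.0 * i}
    for i in range(8)
]
summary = summarize_by_quartile(records, "productivity", ["slope"])
```

A table of this kind gives a compact check of the direction of each relationship, although it hides the shape of the distributions that the density plots show.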

Discussion
Incorporating machine learning models into forest operations management routines allows managers to infer tactical and operational adjustments, with agility in decision making and accurate prognosis. In mechanized timber harvesting, the scenario dynamics and external influences that impact activities demand this adaptability together with forecasting capacities.
However, conducting and monitoring the performance of mechanized timber harvesting operations using analytical tools such as machine learning is restricted by the quantity and quality of available data. Liski et al. [31], Maktoubian et al. [68], Demirci et al. [69], and Abbasi et al. [70] report that decentralization and lack of data management in forest environments hinder the achievement of significant results.
The harvesters that acted as data sources had embedded technology, with output records of timber cutting and sectioning activities. The lack of interoperability among electronic devices made communication and data transfer vulnerable, compromising the possibility of instant corrections and of noticing deviations in the notes. Furthermore, Buccafurri et al. [71] and Shi et al. [72] point out that the quality of the instances generated is part of a cooperative process, which requires participation and, therefore, the leveling of all those involved. Data management must be aligned with the organization of operations, which makes it the responsibility of forest managers to coordinate these efforts.
The residual volume of data, after execution of the data wrangling processes, was still sufficient to verify the performance of machine learning algorithms in the productivity modeling of mechanized timber harvesting. Of the algorithm groups applied in the modeling, in default mode, the ones that performed best were those based on the decision tree, gradient-boosted machine, and k-nearest neighbor. Therefore, the best individual algorithms were, respectively, extra trees, CatBoost (gradient-boosted), and k-nearest neighbor.
There are many types of decision trees that have the entropy of information as their core. According to An and Zhou [73], in the specific analysis process, the information gain for each attribute is classified and ordered. Among the decision tree algorithms evaluated, the one that presented the best performance was the extremely randomized trees, or extra trees, algorithm. This algorithm was developed by Geurts et al. [74] and uses the same principle as random forests. However, as supported by Ahmad et al. [75], the extra trees may have differentiated themselves by using the entire training dataset to train each regression tree and not just a bootstrap replica.
Of the algorithms based on gradient-boosted machines, the CatBoost Regressor showed the best model fit to the data. This algorithm, developed by Prokhorenkova et al. [76], is an enhancement of gradient boosting, designed to avoid attribute dependency and improve prediction accuracy on small datasets. The third-best performance came from the model based on k-nearest neighbor, a non-parametric algorithm that, according to Ortiz-Bejar et al. [77], stores all known observations and uses them in prediction based on similarity functions.
To enhance the prediction, tests were carried out with the blending ensemble and stacking ensemble learning methods, using combined learning. These ensembles combined the three algorithms that, in default mode, presented the best performance. The predictions obtained by both methods were superior to those obtained by the algorithms in default mode. Jong et al. [78] and Jordan and Mitchell [79] point out that, in general, combined learning methods increase the performance of models built with machine learning.
Associating the use of the blending ensemble with the possibility of pre-determining productivity, based on the attributes of individual mean volumes of trees, terrain slope, and availability of hours of machine use, promotes dynamism in managers' planning, especially in operational planning, which requires quick responses in adverse operating conditions. This corroborates the limitations of the traditional method of estimating productivity through time studies.
In addition, the comparison of the values in the scatter diagrams of the employed models demonstrated the effects of the predictor variables on productivity. In the upper quartiles, under operating conditions with lower slopes and longer harvester availability, productivity increased considerably.
The building of models involving machine learning algorithms, in addition to providing prediction of harvester productivity in the mechanized timber harvesting operation, allowed us to look at the bases that guide strategic decisions of operations in planted forests. This opportunity has shown that, despite quality limitations, suitable data promote knowledge extraction, mainly from attributes not correlated with productivity.

Conclusions
By using adjusted data with machine learning algorithms, it is possible to predict the productivity of timber harvesters.
Among the attributes that compose datasets of mechanized timber harvesting activities, the individual mean volumes of trees, terrain slope, and machine availability are the main factors that impact harvester productivity estimation.
Testing a variety of machine learning algorithms with different dynamics contributed to the development of a machine learning technique that delivered what it proposes, i.e., experimentation and good model performance. Thus, the choice of blending ensemble learning was guided by the comparison of statistical model fit metrics.
Among the blending ensemble, the stacking ensemble, and the algorithms in default mode, the blending ensemble stood out, with a determination coefficient of 0.71 and a mean absolute percentage error of 15%.