Individual Tree Diameter Growth Models of Larch – Spruce – Fir Mixed Forests Based on Machine Learning Algorithms

Individual tree growth models are flexible and commonly used to represent growth dynamics for heterogeneous and structurally complex uneven-aged stands. Besides traditional statistical models, the rapid development of nonparametric and nonlinear machine learning methods, such as random forest (RF), boosted regression tree (BRT), cubist (Cubist) and multivariate adaptive regression splines (MARS), provides a new way for predicting individual tree growth. However, the application of these approaches to individual tree growth modelling is still limited and short of a comparison of their performance. The objectives of this study were to compare and evaluate the performance of the RF, BRT, Cubist and MARS models for modelling the individual tree diameter growth based on tree size, competition, site condition and climate factors for larch–spruce–fir mixed forests in northeast China. Totally, 16,619 observations from long-term sample plots were used. Based on tenfold cross-validation, we found that the RF, BRT and Cubist models had a distinct advantage over the MARS model in predicting individual tree diameter growth. The Cubist model ranked the highest in terms of model performance (RMSEcv [0.1351 cm], MAEcv [0.0972 cm] and Rcv [0.5734]), followed by BRT and RF models, whereas the MARS ranked the lowest (RMSEcv [0.1462 cm], MAEcv [0.1086 cm] and Rcv [0.4993]). Relative importance of predictors determined from the RF and BRT models demonstrated that the competition and tree size were the main drivers to diameter growth, and climate had limited capacity in explaining the variation in tree diameter growth at local scale. In general, the RF, BRT and Cubist models are effective and powerful modelling methods for predicting the individual tree diameter growth.


Introduction
Forest growth models are important tools in providing quantitative and reliable information for forest management decisions making [1,2].These models can be developed with different resolutions including stand, diameter class and individual tree levels [3].Both stand and diameter class growth models are unable to represent the growth dynamics of individual trees and are only applicable to homogeneous and even-aged stands [2].On the other hand, individual tree growth models are flexible and able to represent the growth dynamics of individual trees, which are commonly used to represent growth dynamics for the heterogeneous and structurally complex mixed uneven-aged stands [2,4,5].Tree growth is often expressed as a function of tree size, competition and site condition [5][6][7][8][9].Besides these aforementioned factors, climate factors such as temperature, light and precipitation are not negligible in determining tree growth [10].Climate-sensitive tree growth models have also been developed recently [11][12][13].The climate, site condition and competition interactively influence individual tree growth.Meanwhile, these influences may vary with tree species and size.However, so far information about contributions of different factors to individual tree growth in mixed forests is still insufficient, and it is necessary to identify the dominant factors affecting tree growth with appropriate methods [14].
Traditionally, individual tree diameter growth models related tree growth to stand-and tree-level variables using statistical models such as linear and nonlinear regression models with or without random effects [15][16][17][18].These statistical models are often applied under certain statistical assumptions, such as the data are normally distributed, homoscedastic and independent.However, due to continuous observations, hierarchy and intrinsic variability of forest growth data, the above assumptions are usually difficult to satisfy [4].Moreover, the selection of specific mathematical equation (model form) and convergence is rather difficult when modelling individual tree growth in nonlinear form with multiple variables.Therefore, data-driven methods were also applied and showed their potential such as generalized additive model (GAM) and artificial neural networks (ANN) [4,19,20].
In contrast to traditional statistical models, machine learning (ML) has the ability to model complex and nonlinear interactive relationship between the predictor variables and the response variable without statistical assumptions and predetermined mathematical equations [21].Many state-of-the-art ML algorithms have been developed, including classification and regression trees (CART), random forest (RF), boosted regression tree (BRT), ANN, support vector machine (SVM), cubist (Cubist) and multivariate adaptive regression splines (MARS).These data-driven methods have already been successfully applied to ecology and remote sensing to perform tasks such as soil organic carbon mapping [22], carbon and energy fluxes [23,24], and species distribution [25,26].However, the applications in forest growth and yield prediction are still limited.Few applications of these methods have been found in individual tree growth prediction [4,19].With the increasing number of the ML algorithms, the comparison among these methods becomes more important.A direct comparison of ML algorithms has not yet been attempted in individual tree growth predicting.Unlike ANN and SVM which are often adversely affected by noninformative predictors and high correlation among the predictors, RF, BRT, Cubist and MARS models are naturally resistant to noninformative predictors due to the built-in feature selection mechanisms [27,28].Highly correlated predictors do not drastically affect model performance in RF, BRT, Cubist and MARS [27].In addition, unlike the ANN and SVM models, the RF, BRT, Cubist and MARS models do not need to pre-process predictors (such as centralization and standardization), and can well handle the case of many qualitative predictive variables or qualitative predictive variables with many levels.It is noted that both RF and BRT have the ability to provide quantitative and reliable measurements of relative importance of variables.Thus, the information about contributions of different factors to individual tree growth in mixed forests can be determined using both BRT and RF.
Therefore, the goal of this study is to use machine learning algorithms to predict the individual tree diameter growth with long-term plot observations of semi-natural mixed larch-spruce-fir-broadleaved forests in northeast China.The specific objectives were: (1) to compare and evaluate the performance of the RF, BRT, Cubist and MARS models for tree diameter growth modelling based on tree size, competition, site condition and climate factors; and (2) to identify the contributions of different factors to the individual tree diameter growth.

Study Area
This study was conducted in long-term permanent plots of larch-spruce fir mixed forests located in the Wangqing Forestry Bureau, Jilin Province, northeast China (43 The study area belongs to middle or low hilly area of the Changbai Mountains with elevation ranging between 360-1477 m above sea level.The mean annual precipitation and temperature were 547 mm and 3.9 • C, respectively.

Permanent Plot Data
Plot data used to develop individual tree diameter growth models were obtained from 20 remeasured permanent plots of semi-natural conifer-broadleaf mixed forests.The area of plots ranges in size from 0.075 to 0.25 hm −2 .All plots, which originated from pure larch (Larix olgensis Henry) plantation planted in 1962-1964, had become seminatural conifer-broadleaf mixed forests of larch-spruce-fir.Larch, spruce (Picea jezoensis Carr.) and fir (Abies nephrolepis (Trautv.)Maxim.) were the dominant tree species.The main associated tree species were classified as three tree species (groups): Korean pine (Pinus koraiensis Sieb.et Zucc.), slow growing tree species group (acer mono (Acer mono Maxim.),Manchurian ash (Fraxinus mandshurica Rupr.), tilia amurensis (Tilia amurensis Rupr.), betula costata (Betula costata Trautv.),amur corktree (Phellodendron amurense Rupr.) and medium growing tree species group (birch (Betula platyphylla Suk), elm (Ulmus propinqua Koidz.)).Beginning with the first inventory in the year of 1986, tree-and stand-level characteristics of these permanent plots have been remeasured periodically with an interval of two or three years from 1986 to 2010.Characteristics recorded for each tree include DBH above 5 cm, tree species, and status (live, dead, cut and damage).Site characteristics recorded for each plot include slope, aspect and elevation.Our analyses focused on the plot data within an interval of any five consecutive years in 25 years (1986-2010).In total, we received 16,619 observations.Summary statistics of tree-and stand-level variables were listed in Table 1.The climate data were spatially interpolated using ANUSPLIN software with a spatial resolution of 300 m × 300 m [29].For each plot, 44 candidate climatic variables corresponding to individual tree mean annual DBH increment (∆D) within any five years' interval were calculated including temperature, precipitation, illumination and composite climatic factors generated by their interactions (Table 2).The values of candidate climate variables were the mean of corresponding climatic factors during the five-year interval.

Modelling Methods and Development
Individual tree diameter growth is often modeled as the function of tree size, competition, site condition and climate factors.In our study, the individual tree mean-annual DBH increment (∆D) as an output were modeled using the RF, BRT, Cubist and MARS models respectively with the mentioned variables below as inputs.As climate, site condition and competition interactively influence individual tree growth, these influences may vary with tree species and size.In order to analyze the influences of tree species on ∆D, tree species were treated as a categorical variable in our study.Thus, the model form was expressed as follows: ∆D = f (Size, Competition, Site, Climate, Species). (1) Forests 2019, 10, 187 5 of 20 where f () represents the RF, BRT, Cubist or MARS model; Size represents tree size variables including initial tree DBH at the beginning of five-year interval (D 0 ) and stand average DBH (D g ); Competition represents competition variables (Equations ( 2)-( 5)) including the sum of basal area larger than the subject tree (BAL), the sum of square DBH in trees which are larger than subject tree (DL), the ratio of DBH of subject tree to average DBH (RD), the ratio of DBH of subject tree to the maximal DBH (DDM), the number of trees per hectare (N) and basal area per hectare (BA); Species represents a tree species category including six tree species (groups): larch, spruce, fir, Korean pine, slow growing tree species group and medium growing tree species group; Site represents site condition variables including aspect, slope and altitude; the "Climate" represents climate variables including 44 climatic factors (see Table 2).Totally there were 56 input variables.
where n lager is the number of the trees with the basal area larger than that of the subject tree, D i is the DBH of the i-th tree, D subject is the DBH of the subject tree, D average is the stand average DBH, D maximum is the maximal DBH of the stand, Area is the plot area.
(1) Random Forest Model The RF algorithm was first proposed by Breiman [30].It is a decision tree model of ensemble learning.For regression prediction, the base learner of RF is regression tree.Each regression tree of RF is constructed by drawing bootstrap sample of the original dataset with replacement.When a bootstrap sample is drawn, approximately 36.8% of the original dataset is not included in the sample.The portion of the data not drawn into the sample composes the "out-of-bag" data, used to assess the generalization errors of the selected regression tree, whereas the bootstrap sample is used to train a regression tree.For each split node of the regression tree, a randomized subset of predictors is selected: randomly select k (<P) of the original predictors (P is number of original predictors).The best predictor to split a "parent" split node into two "child" nodes is the predictor among the k predictors leading to the largest decrease in the overall sums of squares error.A regression tree is completed until the terminal nodes contain only individuals belonging to the same classes or until they contain no more than five individuals.Thus, multiple trees are generated by repeating the partition process for a pre-specified number of trees.The RF predictions are the averaged predictions over all trees.The importance score of predictors can be estimated based on the total decrease in residual sum of squares over all nodes for which it was selected as the splitting predictor and averaged over all trees.
The RF modeling was implemented by the package "randomForest" [31] in R software [32].Two tuning parameters require users to make choices: "ntree" and "mtry" [33].The ntree, is the number of trees to grow and governs the total number of independent trees.The mtry is the number of predictive variables randomly sampled as candidates at each split and controls the correlations between trees, and decreasing mtry will decrease the correlations between trees.The default value of ntree is 500.In general, the overall error rate tends towards stability while the ntree is greater than 500 [34].In order to ensure the reliability of the prediction results without affecting the computational efficiency, the value of ntree in our case was set to be 1000.For mtry, the default value is recommended set to be the number of total predictor variables divided by three in the regression context [30].However, the default value of mtry is not always able to obtain the optimal model.It is necessary to tune the parameter.Since there were 56 predictive variables in our case, we tested 56 possible values for mtry: mtry = 1, 2, 3, . . ., 56.A resampling technique-tenfold cross-validation-was implemented to train and evaluate these 56 RF models (ntrees = 1000, mtry = 1, 2, 3, . . ., 56).The R package "caret" [35] and a random seed were used to tune these RF models. (

2) Boosted Regression Tree Model
The BRT is also an ensemble learning method combining regression trees and "gradient boosting" algorithm.Unlike RF, BRT develops multiple regression trees sequentially from the gradient of loss function of the previous regression tree, whereas RF is a simple ensemble of independent trees [33].More specifically, at the first step, a regression tree is fitted to a subset of data for minimizing the loss function (e.g., the squared error for regression).Then the second tree is fitted to the gradient (e.g., residual) of the first tree [36].The two regression trees are combined and the gradient is computed.A new tree is fitted, and so on.The above procedure continues until a pre-specified number of trees is reached.In order to reduce overfitting and improve model prediction performance, each regression tree of BRT is constructed by drawing bootstrap sample of the original dataset without replacement (usually 50% of the original sample) [37].The BRT predictions are the weighted average of predictions over all trees, the weights are proportional to regression tree performances [36].The importance score of a predictor can be constructed for BRT in exactly the same way as it was for RF.
The BRT modeling was implemented by the R package "gbm" [38].Five tuning parameters require users to make choices: "distribution", "bag.fraction","shrinkage", "n.trees" and "interaction.depth".As the tuning parameters are multitudinous and changing any of the parameters can influence the optimal values of the other parameters, tuning BRT model is complex [33].The distribution is the form of loss function.For regression prediction, the distribution is set to be "gaussian" (squared error).The bag.fraction (0 < bag.fraction ≤ 1) is the sampling faction of the training data randomly selected to create each tree.Lower bag.fraction can introduce more randomness into the model and counteract overfitting while also reducing the required computing costs [37].However, increased randomness can lead to higher between-model variability [33,39].Friedman [37] recommended using the bag.fraction of around 0.5.The shrinkage (0 < shrinkage ≤ 1) is the step-size reduction or learning rate.The n.trees, is the total number of trees to grow.The parameter shrinkage affects the optimal n.trees.Decreasing shrinkage increases the value of n.trees required, and a lower shrinkage improves model performance but also increases computing costs (both storage and time) [38].Balancing computing costs and model predictive performance, Ridgeway [38] recommended setting shrinkage at a range from 0.01 to 0.001 with n.trees between 3000 and 10,000.Thus, we consider three possible values for shrinkage (0.01, 0.005, 0.001) and 15 possible values for n.trees (3000, 3500, . . ., 10,000).The interaction.depth, is the maximum depth of variable interactions for each tree, with the default value of 1.However, the parameter interaction.depthaffects the optimal setting for both n.trees and shrinkage.As interaction.depth is increased, shrinkage needs to be decreased if sufficient trees are grown [39].Considering the trade-off between computing costs and model predictive performance, we tried 5 possible values for interaction.depth:1, 3, 5, 7, 9.The tenfold cross-validation was also implemented to train and evaluate these 225 BRT models (distribution = "gaussian", bag.fraction = 0.5, shrinkage = 0.01, 0.005, 0.001, n.trees = 3000, 3500, . . ., 10,000, and interaction.depth= 1, 3, 5, 7, 9).The R package "caret" [35] and a random seed, which is identical to the seed used for RF models, were used to tune BRT models.
(3) Cubist Model The Cubist can be an ensemble learning method combining model tree and "committees" (boosting-like) algorithm.The model tree used in Cubist works in a similar way as the classical regression tree does.However, unlike the regression tree, the terminal nodes of the model tree use linear regression models rather than discrete values, and the splitting criterion of the model tree uses the reduction in standard deviation instead of the reduction in squared error [27].Each model tree of Cubist is affected by the result of the previous model tree.More specifically, subsequent model trees of Cubist are built as follows: if a data point is over-predicted by the prior model tree, then the same value is adjusted downward for the next model tree.Similarly, underpredicted data points are adjusted so that the next model tree will produce a larger prediction in the next iteration.The above procedure continues until a pre-specified number of committee model trees is reached.Unlike BRT, the predictions of Cubist are simple average of the predictions from each model tree instead of the weighted average of predictions over all trees.
The Cubist modeling was implemented by the R package "Cubist" [40].Two tuning parameters require users to make choices: "committees" and "neighbors".The committees is the number of iterative model trees to grow in sequence whose maximal value is limited to 100 [41].The neighbors is the number of neighboring training points which can be used to adjust the final predictions and is an integer with range 0-9 [42].Considering the trade-off between computing costs and model predictive performance, we tried 20 possible values for committees (1, 6, 11, . . ., 96) and 6 possible values for neighbors (0, 1, 3, 5, 7, 9).The tenfold cross-validation was also implemented to train and evaluate these 120 Cubist models (committees = 1, 6, 11, . . ., 96, and neighbors = 0, 1, 3, 5, 7, 9).The R package "caret" [35] and a random seed, which is identical to the seed used for RF models, were used to tune Cubist models.

(4) Multivariate Adaptive Regression Splines Model
The MARS, which was firstly proposed by Friedman [43], can be deemed to be an extension of linear models that automatically captures nonlinearity and interactions between variables.The modeling procedures of MARS are inspired by the recursive partitioning approach such as CART [43,44].However, unlike CART, MARS creates continuous models with continuous derivatives instead of discontinuous branching at tree nodes [28,43].More specifically, MARS creates flexible regression models by using basis function to fit piecewise linear regression models to distinct intervals of the independent variable space [28,45].Both the knots (the break values of the intervals) and the variables to use are selected by using basis functions with exhaustive search.After the initial model is created with the first two basis functions, the model keeps on conducting next exhaustive search to get another set of basis functions that yield the best model fit.The above procedure continues until a stopping point is reached.The model form of MARS can be presented as follows: where, f (x) represents the dependent variable; x represents the independent variables; β is the intercept; B n (x) is the basis function, which may be a single hinge function or a product of two or more hinge functions; β n represents the coefficient of the n-th basis function; N is the number of basis functions of the model.Hinge functions are a key part of MARS models.A hinge function takes the form as follows: max(0, x − c) or max(0, c − x) where c is a constant, also called the knot.
The MARS modeling was implemented by the R package "earth" [46].Two tuning parameters require users to make choices: "degree" and "nprune" [27].The parameter degree is the maximum degree of interaction.The default of degree is 1, which means an additive MARS model without interaction terms is built.Although MARS can build models with two or more degree interactions, Hastie et al. [47] recommended setting an upper bound on the degree of interaction.The lower degree interaction can aid in the interpretation of the final model, whereas the higher degree interaction can lead to occasional instabilities in the model predictions (perhaps an order of magnitude of the true value) [27,47].Thus, we tried 3 possible values for degree: 1, 2, and 3. Parameter nprune is the maximum number of terms (including intercept).The value of nprune is equal or greater than 2 and less than "nk".The nk is the maximum number of model terms before pruning (nk = min (200, max (20, 2 × ncol(x))) + 1, where ncol(x) is the number of predictive variables).Since there were 56 predictive variables in our case, we considered 112 possible values for nprune: nprune = 2, 3, . . ., 113.The tenfold cross-validation was also implemented to train and evaluate these 360 MARS models (degree = 1, 2, 3, and nprune = 2, 3, . . ., 113).The R package "caret" [35] and a random seed, which is identical to the above ML methods, were used.

Model Evaluation
The performance of the RF, BRT, Cubist and MARS models was evaluated using tenfold cross-validation resampling technique.Three measures for model performance were computed for each fold: coefficient of determination (R 2 ), root mean square error (RMSE) and mean absolute error (MAE).Then the 10 resampled validation measurements of model performance were averaged as follows.
Coefficient of determination of cross-validation (R 2 cv ): Root mean square error of cross-validation (RMSE cv ): Mean absolute error of cross-validation (MAE cv ): where, k is the number of fold, and k = 10 in our case; O ij is the i-th observed value of the j-th fold; P ij is the i-th predicted value of the j-th fold; O j is the average observed value of the j-th fold; n j is the number of samples in j-th fold; R 2 j , RMSE j and MAE j are the R 2 , RMSE and MAE of the j-th fold, respectively.

Model Comparison
Once model performance has been quantified through the tuning process described above, the optimal RF, BRT, Cubist and MARS models were determined by selecting parameter settings associated with the numerically best performance estimates (the smallest RMSE cv in our case).Then the comparisons between the four optimal models were made by assessing the performance measures on each fold.As the performances were measured using identically resampled datasets, statistical methods for paired comparisons can be used to detect pairwise differences in the performance among models [27,48].Therefore, paired t-test was applied to verify if the differences in performance among the four optimal models with the significant level of 0.05.
Moreover, the four models in our study could provide estimations of importance score.For limited space of the manuscript and wide use, only the results of RF and BRT models were reported after the pre-examination, which showed similar results.The importance score of a predictor was estimated based on the total decrease in residual sum of squares over all nodes for which it was selected as the Forests 2019, 10, 187 9 of 20 splitting predictor and averaged over all trees.It was normalized to 100% by its relative value (rI M i ), which was expressed as following for the RF and BRT models: where, the I M i and I M q are the importance score of the i-th and q-th predictor, P is the total number of the predictors.

RF
The effect of mtry on RF model performance is presented in Figure 1.Parameter values of mtry at 1 and 9 had the lowest and highest R 2 cv (0.4460 and 0.5565), respectively.For RMSE cv , mtry at 9 and 1 resulted in the lowest and highest values (0.1377 and 0.1751 cm).For MAE cv , the lowest and highest MAE cv (0.0996 and 0.1341 cm) were at 11 and 1, respectively.As mtry increased, R 2 cv initially increased and then stabilized at the values larger than 6, and both RMSE cv and MAE cv initially decreased and then stabilized at values between 6 and 56.Thus, the optimal model of RF (ntree = 1000 and mtry = 9) was determined using the smallest RMSE cv .Where, the and are the importance score of the i-th and q-th predictor, P is the total number of the predictors.

RF
The effect of mtry on RF model performance is presented in Figure 1.Parameter values of mtry at 1 and 9 had the lowest and highest R 2 cv (0.4460 and 0.5565), respectively.For RMSEcv, mtry at 9 and 1 resulted in the lowest and highest values (0.1377 and 0.1751 cm).For MAE cv, the lowest and highest MAEcv (0.0996 and 0.1341 cm) were at 11 and 1, respectively.As mtry increased, R 2 cv initially increased and then stabilized at the values larger than 6, and both RMSEcv and MAEcv initially decreased and then stabilized at values between 6 and 56.Thus, the optimal model of RF (ntree = 1000 and mtry = 9) was determined using the smallest RMSEcv.

BRT
Figure 2 shows the effect of n.trees, shrinkage and interaction.depthon BRT model performance.Parameter values of n.trees, shrinkage and interaction.depthat (3000, 0.001, 1) and (8500, 0.01, 9) resulted in the lowest and highest R 2 cv (0.4171 and 0.5714).For RMSE cv , n.trees, shrinkage and interaction.depthat (7000, 0.01, 9) and (3000, 0.001, 1) had the lowest and highest RMSE cv (0.1353 and 0.1611 cm).For MAE cv , the lowest and highest MAE cv (0.0982 and 0.1225 cm) were at (9500, 0.01, 9) and (3000, 0.001, 1), respectively.In general, as n.trees increased, R 2 cv initially increased and then stabilized at higher number of n.trees, both RMSE cv and MAE cv initially decreased and then stabilized at higher number of n.trees.Overall, shrinkage of 0.01 had better model performance than shrinkage of 0.001 and 0.005.In addition, interaction.depth of 9 had the best model performance.Thus, the optimal model of BRT (distribution = "gaussian", bag.fraction = 0.5, shrinkage = 0.01, n.trees = 7000 and interaction.depth= 9) was determined using the smallest RMSE cv .

Figure 2.
Tuning boosted regression tree (BRT)-the effect of n.trees, shrinkage and interaction.depthon BRT model performance.The tenfold cross-validation was used for validating.The error bar is standard deviation.The parameters, n.trees, shrinkage and interaction.depthare the total number of trees to grow, the step-size reduction or learning rate and the maximum depth of variable interactions for each tree, respectively.

Cubist
The lowest and highest R 2 cv (0.4564 and 0.5734) were at committees values of 1 and 86 and neighbors values of 1 and 9, respectively (Figure 3).For RMSEcv, the lowest and largest values (0.1351 and 0.1650 cm) were at committee values of 31 and 1 and neighbors values of 9 and 1.For MAEcv, the lowest and highest (0.0970 and 0.1077 cm) were at committee values of 31 and 1 and neighbors values of 5 and 1.In generally, as committees increased, R 2 cv initially increased and then stabilized at higher number of committees, and both RMSEcv and MAEcv initially decreased and then stabilized at higher number of committees.Increasing neighbors would increase R 2 cv and decrease both RMSEcv and MAEcv with the exception of the neighbors' value of 0. Thus, the optimal model of Cubist (committees = 31 and neighbors = 9) was determined using the smallest RMSEcv.shrinkage: 0.001 shrinkage: 0.005 shrinkage: 0.01 interaction.depth: 1 interaction.depth:3 interaction.depth:5 interaction.depth:7 interaction.depth:9 3000 4000 5000 6000 7000 8000 9000 10000 3000 4000 5000 6000 7000 8000 9000 100003000 4000 5000 6000 7000 8000 9000 10000 Tuning boosted regression tree (BRT)-the effect of n.trees, shrinkage and interaction.depthon BRT model performance.The tenfold cross-validation was used for validating.The error bar is standard deviation.The parameters, n.trees, shrinkage and interaction.depthare the total number of trees to grow, the step-size reduction or learning rate and the maximum depth of variable interactions for each tree, respectively.

Cubist
The lowest and highest R 2 cv (0.4564 and 0.5734) were at committees values of 1 and 86 and neighbors values of 1 and 9, respectively (Figure 3).For RMSE cv , the lowest and largest values (0.1351 and 0.1650 cm) were at committee values of 31 and 1 and neighbors values of 9 and 1.For MAE cv , the lowest and highest (0.0970 and 0.1077 cm) were at committee values of 31 and 1 and neighbors values of 5 and 1.In generally, as committees increased, R 2 cv initially increased and then stabilized at higher number of committees, and both RMSE cv and MAE cv initially decreased and then stabilized at higher number of committees.Increasing neighbors would increase R 2 cv and decrease both RMSE cv and MAE cv with the exception of the neighbors' value of 0. Thus, the optimal model of Cubist (committees = 31 and neighbors = 9) was determined using the smallest RMSE cv .

MARS
Parameter values of nprune and degree at (2, 3) and (51, 2) resulted in the lowest and highest R 2 cv (0.3062 and 0.4993), respectively (Figure 4).The lowest and highest RMSEcv (0.1462 and 0.1721 cm) were at (39,2) and (2,3).Parameters nprune and degree at (45,3) and (2, 3) had the lowest and highest MAEcv (0.1085 and 0.1278 cm).Generally, as nprune increased, R 2 cv initially increased and then stabilized at values between 25 and 113, both RMSEcv and MAEcv initially decreased and then stabilized at the same value.The degree of 2 had better model performance than degree of 1 and 3. Thus, the optimal model of MARS (nprune = 39 and degree = 2) was determined using the smallest RMSEcv.The parameters, committees and neighbors, are the number of iterative model trees to grow in sequence and the number of neighboring training points which can be used to adjust the optimal predictions, respectively.

MARS
Parameter values of nprune and degree at (2, 3) and (51, 2) resulted in the lowest and highest R 2 cv (0.3062 and 0.4993), respectively (Figure 4).The lowest and highest RMSE cv (0.1462 and 0.1721 cm) were at (39,2) and (2,3).Parameters nprune and degree at (45,3) and (2, 3) had the lowest and highest MAE cv (0.1085 and 0.1278 cm).Generally, as nprune increased, R 2 cv initially increased and then stabilized at values between 25 and 113, both RMSE cv and MAE cv initially decreased and then stabilized at the same value.The degree of 2 had better model performance than degree of 1 and 3. Thus, the optimal model of MARS (nprune = 39 and degree = 2) was determined using the smallest RMSE cv .

Model Evaluation and Comparisons
The predictive performance of four selected optimal models is presented in Table 3.The results of tenfold cross-validation indicated that the RF, BRT and Cubist models significantly outperformed the MARS model in terms of three performance measures (the t-test p value for corresponding pairs <0.05).The highest R 2 , the lowest RMSE and the lowest MAE were observed for Cubist model, followed by the BRT and RF model, while the lowest R 2 , the highest RMSE and the highest MAE were found for the MARS model.

Model Evaluation and Comparisons
The predictive performance of four selected optimal models is presented in Table 3.The results of tenfold cross-validation indicated that the RF, BRT and Cubist models significantly outperformed the MARS model in terms of three performance measures (the t-test p value for corresponding pairs <0.05).The highest R 2 , the lowest RMSE and the lowest MAE were observed for Cubist model, followed by the BRT and RF model, while the lowest R 2 , the highest RMSE and the highest MAE were found for the MARS model.Values in the bracket were standard deviation.Different lowercase letters (a, b, c) indicated significant difference between models at 0.05 level.

Variable Importance Based on the Optimal RF and BRT Models
Relative importance of each variable as determined from the optimal RF and BRT models is presented in Figure 5.
addition, among three site factors, aspect had the highest influence on individual tree diameter growth (relative importance of 3.15%), secondly elevation (0.82%) and slope (0.57%).
Generally, it was shown that the relative importance of predictors as determined from the RF model was basically identical with that from the BRT model: competition and size factors were the most important variables (accumulating relative importance was over 73.00%), followed by climate factors (not more than 18.03%) and site factors (not more than 4.54%).
. Relative importance of each variable as determined from (A) random forest (RF), and (B) boosted regression tree (BRT), which were normalized to 100%.The explanations of stand factors: BAL is the sum of basal area larger than the subject tree, Dg is average DBH, BA is the basal area per hectare, RD is the ratio of DBH of subject tree to average DBH, DL is the sum of square DBH in trees which are larger than subject tree, D0 is the individual tree DBH at the beginning of 5-year interval, DDM is the ratio of DBH of subject tree to the maximal DBH, TreeSpe_Code is the tree species, and N is the number of trees per hectare.The explanations of site factors: Aspect is aspect, Slope is slope and Altitude is altitude.The explanations of climate factors were presented in Table 2.
Generally, it was shown that the relative importance of predictors as determined from the RF model was basically identical with that from the BRT model: competition and size factors were the most important variables (accumulating relative importance was over 73.00%), followed by climate factors (not more than 18.03%) and site factors (not more than 4.54%).

ML Model Development
In the process of ML model evaluation and selection, it is not only required to select suitable learning algorithms, but also to set the corresponding parameters, which is commonly referred to as "parameter tuning" [49].The performance of models can vary drastically depending on the value selected for the parameters.Therefore, parameter tuning was implemented and tested for the four 4 ML algorithms in the study.
In RF modeling, two parameters were required to be tuned: "ntree" and "mtry" [33].The default value of ntree is 500 and the overall error rate tends towards stability when the ntree is greater than 500 [34].Breiman [30] proved that RF is less vulnerable to over-fitting and RF model will not be negatively impacted even though a large ntree is set.Thus, the only limitation of ntree is computing costs (both storage and time) [30].In our study, the value of ntree was set to be 1000 in order to balance model predictive performance and computing costs.The parameter mtry usually performed well at the recommended value (one-third the number of total predictors for regression ) [33].In addition, both Kuhn and Johnson [27] and Liaw and Wiener [31] found that RF was relatively insensitive to the selection of mtry.Our study found that the performance of RF would be adversely affected by the selection of mtry when the mtry was small (model performance increased with the increasing of mtry when mtry < 6).Whereas, mode performance of RF was relatively insensitive to the selection of mtry when mtry was large (model performance stabilized at between 6 and 56 mtry).Recommended default mtry of 18 was almost equivalent to that with mtry of 9 in the study in terms of model performance.Our study supported the conclusion that recommended default mtry for regression tended to perform well [27,33].
In BRT modeling, we tested parameters "shrinkage", "n.trees" and "interaction.depth".The shrinkage controls the contribution of each tree on the optimal predictions [33,47], and contrary results were found including lower and large shrinkage could improve model performance [27,38].In our study, we found that shrinkage of 0.01 had better model performance than those of 0.001 and 0.005.The n.trees controls the total number of iterations.Our results were consistent with Kuhn and Johnson [27] that RMSE cv decreased as n.trees increased and then stabilized at higher values.The interaction.depthparameter controls the number of nodes in a tree [39].Freeman et al. [33] found that model performed better with a fairly large interaction.depth.Our results were in agreement with them.
The maximal value of committees is limited to 100 in Cubist modeling [41].The other parameter neighbors, is an integer with 0-9 range [42].We found that the RMSE cv initially significantly reduced as the committees increased and then stabilized at a higher committees, and a model with neighbors of 0 performed better than the model with neighbors of 1.These findings were consistent with the previous study [27].
In MARS modeling, our study found that degree of 1 in predicting individual tree diameter growth performed better than degree of 1 and 3 in terms of RMSE cv , MAE cv and R 2 cv .Kuhn and Johnson [27] noted that models with degree of 1 in predicting solubility were identical to models with degree of 2 in terms of RMSE cv .The result that RMSE cv initially significantly reduced as the nprune increased and then stabilized at a higher nprune in the literature [27] was also supported by our study.In general, as the RF, Cubist and MARS models had fewer parameters to be set by users and were less sensitive to tuning of these parameters; they were more user-friendly than BRT which had many parameters to be tuned.In contrast to RF, parameters of BRT had stronger influence on model quality [33].

Performance Evaluations of Four Optimal Models
In this study, the four ML methods, including RF, BRT, Cubist and MARS, were developed and compared for modelling individual tree periodical annual DBH increment with the inputs of stand, climate and site factors of larch-spruce-fir mixed forests in China.Unlike traditional statistical models, collinearity test and the filtering of predictors were not performed in our study.All four models (the RF, BRT, Cubist and MARS models) have the ability to model complex and nonlinear interactive relationship between the predictor variables and the response variable without statistical assumptions and predetermined mathematical equations.In addition, these models are naturally resistant to noninformative predictors due to the built-in feature selection mechanisms while the number of predictors is high [27,28].In addition, these models are insensitive to the collinearity between independent variables, and model results are robust to unbalanced data and missing data [27,28,50].Based on the aforementioned advantages, Prasad et al. [28] evaluated the MARS and RF models in reproducing current tree importance values distribution for four tree species without pre-processing of the data.They noted that RF was superior in reproducing current tree importance values distribution.Besides, Yang et al. [22] compared BRT and RF to predict soil organic carbon content based on 12 environment variables, and found that BRT and RF had similar performance.Based on our data, of all models, the RF, BRT and Cubist models had a distinct advantage over MARS in predicting individual tree diameter growth (Table 3).The Cubist model ranked the highest in terms of model performance (RMSE cv [0.1351 cm], MAE cv [0.0972 cm] and R 2 cv [0.5734]), followed by BRT and RF model, whereas the MARS ranked the lowest (RMSE cv [0.1462 cm], MAE cv [0.1086 cm] and R 2 cv [0.4993]).In generally, RF, BRT and Cubist models had better generalization ability and statistical reliability than MARS in our study.The reason is that the RF, BRT and Cubist models are ensemble learning methods, which construct a set of base learners and then produce new predictions by taking a weighted average of their predictions.Ensemble learning methods often obtain better predictive performance than those obtained from any of the constituent base learners alone.In previous study with the same data, Yu et al. [19] used generalized additive model (GAM) to investigate the effects of stand factors and climate factors on individual tree periodical annual DBH increment of larch-spruce-fir mixed forests in China.They found that the RMSE, MAE and R 2 of GAM were 0.1500 cm, 0.1100 cm and 0.5080, respectively.Comparing the models performance between our study and Yu et al. [19], we found that GAM was slightly better than MARS, but was far below RF, BRT and Cubist.However, when these models were used in different research areas, it was noted that there was no single model that would always do better than any other models.Moisen and Frescino [51] found that MARS and GAM had similar performance for predicting forest characteristics.Aertsen et al. [52] found that the GAM model performed better than BRT for the prediction of site index.On the other hand, Leathwick et al. [53] found that BRT performed better than GAM for predicting variation in demersal fish species richness.
Both the ML models and traditional statistical methods have strengths and weaknesses.The traditional statistical methods can give clear model formulas and corresponding parameters through predetermined mathematical equations (model form), and thus they are intuitive and easy to be understood in terms of model form and the relationship between dependent and independent variables.However, these models also have some limitations.They are often applied under certain statistical assumptions, such as the data are normally distributed, homoscedastic and independent.However, due to continuous observations, hierarchy and intrinsic variability of forest growth data, the above assumptions are usually difficult to satisfy.What's more, the selection of specific mathematical equation (model form) and convergence is rather difficult when modelling individual tree growth in nonlinear form and with multiple variables.In contrast to the traditional statistical models, the ML methods have the ability to model complex and nonlinear interactive relationship between the predictor variables and the response variable without the test of statistical assumptions and the selection of predetermined mathematical equations.As a data-driven method, it is considered as "black box" model, however, the relative importance and partial dependence plot produced from RF and BRT also provide a clear explanation as traditional methods.When there were a large number of predictors, ML could help select the most important ones.The combinations between ML and traditional statistical methods may be a better option.

Variable Importance in Predicating Individual Tree DBH Growth
Climate, site condition and competition factors interactively influence individual tree growth.Meanwhile, these influences may vary with tree species and size [14].In this study, nine stand factors, three site condition factors and 44 climate variables were selected as predictors for predicting individual tree diameter growth because these variables were frequently used as key predictors in previous diameter growth research [5,9,19,54,55].Our results showed that stand factors were the most influential factors in BRT and RF models (Figure 5).More specially, both BRT and RF showed that competition including stand-(N and BA) and tree-level competition indices (BAL, DL, RD, DDM) was the primary factor affecting individual tree diameter growth, followed by tree size (including D 0 and D g ) and stand characteristic (including tree species).These findings were consistent with Yu et al. [19] which found that stand factors were the main drivers to diameter growth.Similar results have been found in other research, for example, Jiang et al. [14] found that competition indices were the primary factors affected individual tree diameter growth for trembling aspen (Populus tremuloides michx.),followed by tree size, and that diameter growth of white spruce (Picea glauca (Moench.)Voss) was primarily influenced by tree size, followed by competition.Alam et al. [56] found for white spruce that diameter growth of trees was primary constrained by competition.Our study showed that both tree size and competition were the main drivers to diameter growth.Stand-level competition indices can indicate the average crowding degree of the whole stand or the full utilization degree of the stand to the site.The bigger the competition index of BA and N stands, the smaller average occupancy of each tree to growth space is, so the smaller the average growth of DBH of individual trees [19].Generally speaking, the growth space occupied by individual trees varied with trees sizes, and trees with different sizes bear different competitive pressures.The bigger the competition index of BAL and DL, the larger competition pressure is, so the smaller average DBH increment is.In contrast, the bigger the competition index of RD and DDM, the smaller the competition pressure is, so the bigger average DBH increment is.The D g is a basic index reflecting the average size of a stand.The bigger D g , the smaller average occupancy of each tree to the growth space is, so the smaller the average DBH increment is.The larger D 0 , the greater growth advantage of individual tree is.When D 0 reaches a certain maximum, tree growth gradually slows down.Tree growth is co-influenced by these factors.Although age is one of the most important drivers of tree or stand growth, it was not included as an input in our study since the measurement of tree age for uneven-aged mixed forest was difficult and tree size could be as a proxy in terms of model application.All inputs in the ML methods in the study only explain about 60% variation of tree DBH growth of larch-spruce-fir mixed forests, which showed the complexity of uneven-aged mixed forests growth modelling, and thus more predictors or further algorithms are needed in the future.
In our study, the relative importance of climate factors was less than that of stand factors.This finding was consistent with Yu et al. [19], which showed climate factors had limited capacity in explaining the variation in diameter growth, while stand factors were the main drivers to diameter growth.Similar results have been found in other research [14,56,57], for example, Jiang et al. [14] found for both trembling aspen and white spruce that the climate factors showed less effects on tree growth than that of stand factors.Alam et al. [56] found for white spruce that diameter growth of trees was primary constrained by competition with minor effects from climate factors.Besides, our results noted that the relative importance of site factors was less than that of stand factors and climate factors.We found that aspect was a critical site factor in BRT and RF models, suggesting that aspect exerted a larger influence on individual tree diameter growth than elevation and slope did.
At first sight the relative importance of each variable as determined from BRT seemed to have a much steeper importance slope than that from RF (Figure 5).It is due to the fact that the base learners from BRT are dependent on each other and hence will have correlated structures among base learners.Thus, the same predictor will be selected across the base learners, increasing its contribution to the variable importance [27].

Conclusions
Four ML methods were applied and compared for predicting the individual tree diameter growth of larch-spruce-fir mixed forests in northeast China.Based on tenfold cross-validation resampling, we found that the RF, BRT and Cubist models had a distinct advantage over MARS in predicting individual tree diameter growth.The Cubist model ranked the highest in terms of model performance (RMSE cv [0.1351 cm], MAE cv [0.0972 cm] and R 2 cv [0.5734]), followed by the BRT and RF models, whereas the MARS ranked the lowest (RMSE cv [0.1462 cm], MAE cv [0.1086 cm] and R 2 cv [0.4993]).In generally, the RF, BRT and Cubist models were effective and powerful modelling methods for predicting the individual tree diameter growth.In addition, as the RF and Cubist models had fewer parameters to be set by users and were less sensitive to tuning of these parameters, they were more user-friendly than BRT.
Relative importance of each predictor variable was determined from the optimal RF model and the optimal BRT model, respectively.Our results showed that both the RF and BRT models can be used to reasonably analyze the complex relationship between predictor variables and individual tree diameter growth, and to identify the dominant factors affecting tree growth.The stand factors were the main drivers to diameter growth, while both site factors and climate factors had limited capacity in explaining the variation in diameter growth at local scale.
Although the ML methods were examined in the case of individual tree DBH growth prediction of larch-spruce-fir mixed forests in northeast China, they could be generally applicable to growth and yield predictions of other forest types.

Figure 1 .
Figure 1.Tuning random forest (RF)-the effect of mtry on RF model performance.The tenfold crossvalidation was used for model validating.The error bar is standard deviation.The parameter, mtry, is the number of predictive variables randomly sampled as candidates at each split and controls the correlations between trees.

Figure 1 .
Figure 1.Tuning random forest (RF)-the effect of mtry on RF model performance.The tenfold cross-validation was used for model validating.The error bar is standard deviation.The parameter, mtry, is the number of predictive variables randomly sampled as candidates at each split and controls the correlations between trees.

2 Figure 2 .
Figure 2.Tuning boosted regression tree (BRT)-the effect of n.trees, shrinkage and interaction.depthon BRT model performance.The tenfold cross-validation was used for validating.The error bar is standard deviation.The parameters, n.trees, shrinkage and interaction.depthare the total number of trees to grow, the step-size reduction or learning rate and the maximum depth of variable interactions for each tree, respectively.

Figure 3 .
Figure 3. Tuning Cubist-the effect of neighbors and committees on Cubist model performance.The tenfold cross-validation was used for validating.The error bar is standard deviation.The parameters, committees and neighbors, are the number of iterative model trees to grow in sequence and the number of neighboring training points which can be used to adjust the optimal predictions, respectively.

2 Figure 3 .
Figure 3. Tuning Cubist-the effect of neighbors and committees on Cubist model performance.The tenfold cross-validation was used for validating.The error bar is standard deviation.The parameters, committees and neighbors, are the number of iterative model trees to grow in sequence and the number of neighboring training points which can be used to adjust the optimal predictions, respectively.

Figure 4 .
Figure 4. Tuning multivariate adaptive regression splines (MARS)-the effect of degree and nprune on MARS model performance.The tenfold cross-validation was used for validating.The error bar is standard deviation.The parameters, degree and nprune, are the maximum degree of interaction and the maximum number of terms (including intercept), respectively.

Figure 4 .
Figure 4. Tuning multivariate adaptive regression splines (MARS)-the effect of degree and nprune on MARS model performance.The tenfold cross-validation was used for validating.The error bar is standard deviation.The parameters, degree and nprune, are the maximum degree of interaction and the maximum number of terms (including intercept), respectively.

Table 1 .
Summary statistics of permanent plot data within an interval of any five consecutive years (n = 16,619).

Table 2 .
Description of 44 candidate climatic variables.

Table 3 .
The predictive performance measures of four optimal models based on tenfold crossvalidation.

Table 3 .
The predictive performance measures of four optimal models based on tenfold cross-validation.