Reducing the Uncertainty of Radiata Pine Site Index Maps Using an Spatial Ensemble of Machine Learning Models

: Site Index has been widely used as an age normalised metric in order to account for variation in forest height at a range of spatial scales. Although previous research has used a range of modelling methods to describe the regional variation in Site Index, little research has examined gains that can be achieved through the use of regression kriging or spatial ensemble methods. In this study, an extensive set of environmental surfaces were used as covariates to predict Site Index measurements covering the environmental range of Pinus radiata D. Don plantations in Chile. Using this dataset, the objectives of this research were to (i) compare predictive precision of a range of geostatistical, parametric, and non-parametric models, (ii) determine whether signiﬁcant gains in precision can be attained through use of regression kriging, (iii) evaluate the precision of a spatial ensemble model that utilises predictions from the ﬁve most precise models, through using the model prediction with lowest error for a given pixel, and (iv) produce a map of Site Index across the study area. The ﬁve most precise models were all geostatistical and they included ordinary kriging and four regression kriging models that were based on partial least squares or random forests. A spatial ensemble model that was constructed from these ﬁve models was the most precise of those developed (RMSE = 1.851 m, RMSE% = 6.38%) and it had relatively little bias. Climatic and edaphic variables were the strongest determinants of Site Index and, in particular, variables that are related to soil water balance were well represented within the most precise predictive models. These results highlight the utility of predicting Site Index using a range of approaches, as these can be used to construct a spatial ensemble that may be more precise than predictions from the constituent models.


Introduction
Pinus radiata D. Don (radiata pine) is the predominant plantation species within Chile and there is considerable interest within the forest sector in the accurate prediction of Site Index for this species [1]. P. radiata is the most widely established plantation species within the Southern Hemisphere, and this species constitutes a large proportion of plantations in Chile, New Zealand, and Australia [2]. This species is very responsive to environment and, as a consequence, productivity has been found to range widely across the environments over which it is grown [3,4]. A number of process-based models, such as 3PG [5], CenW [6], and CABALA [7], have been developed to describe how the environment influences growth of plantation species, such as P. radiata (e.g., Kirschbaum and Watt [8]). However, empirical or hybrid models are still the most widely used for predictions of plantation productivity, as these models are simpler to parameterise and can provide more precise estimates of growth than process-based approaches [9]. Stand productivity is modelled by empirical models as a function of stand age, while using non-linear functional forms. Variation in the productivity between stands is accounted for by standardised measurements of productivity at a given age that are used to adjust both the trajectory and the asymptote of predictions of productivity over time [10][11][12]. Site Index, which expresses the height of dominant or co-dominant trees at a reference age [13], has been most widely used to account for this inter-stand variation, as this metric is correlated with productivity [14,15] and the height of dominant trees is relatively invariant to stand density [16][17][18].
Environmental surfaces have been widely used through a range of modelling approaches to develop maps of Site Index for P. radiata [3,19] and many other coniferous tree species [20][21][22][23][24][25]. When compared to direct measurements of Site Index made using plot data, which are typically averaged to the stand level, predictions of Site Index from environmental surfaces open up a range of applications that are not available from traditional inventory. The resulting spatial description of Site Index provides insight into the key environmental drivers of productivity and allows for managers to understand how productivity is likely to vary across the landscape and where the optimal productivity will occur at a range of resolutions from the intra-stand to the regional level [3,19]. In contrast to spatial predictions of Site Index from remotely sensed data, such as LiDAR, [26], surfaces of productivity, which are created from environmental surfaces can also be used to estimate productivity for unplanted areas, providing managers with insight into the potential value of land that they intend to purchase [27].
The use of Site Index surfaces to parameterise empirical growth models incorporates elements of process-based modelling, as Site Index integrates the most important determinants of tree growth, including topography, soil characteristics, and climate [28]. Consequently, spatial predictions of Site Index provide a means of generating stand growth curves that are sensitive to fine and coarser scale landscape level changes in climatic and edaphic conditions [29]. These estimates of stand development allow for managers to spatially optimise the timing of a range of silvicultural operations including thinning and pruning, across their estate [30,31]. The site Index surfaces can also be used as input to models that are used for key management decisions, such as the optimisation of final crop stand density (S opt ) and the development of surfaces showing spatial variation in S opt [32].
Parametric methods that utilise the spatial correlation between the underlying plot data describing Site Index have been less frequently used to develop models and surfaces of Site Index. Amongst these geostatistical methods, ordinary kriging and regression kriging are the most commonly used techniques [3,19]. Because predictions are made by ordinary kriging through interpolating values between measured plots, this method is most precise when plots are located in relatively close proximity [3]. Regression kriging is less reliant on a dense plot network than ordinary kriging, as this method fits an underlying regression model and then geospatially refines these estimates through kriging the model residual variation across the area of interest [3].
The recent emergence of advanced machine learning methods allows for greater utilisation of the increasing amount of information in geospatial surfaces, as these models can often accommodate collinearity between closely correlated environmental variables [49,50]. Despite this advantage, few studies have compared the predictive precision of these methods with more traditional approaches. For forest species located in Belgium and Turkey, Site Index was more precisely predicted while using non-parametric methods than multiple linear regression and, amongst non-parametric methods, artificial neural networks had the highest predictive performance [33]. Comparative studies of model performance undertaken in P. radiata plantations have highlighted the precision of regression kriging and more advanced non-parametric models, but, as with other forest species, have not included a comprehensive comparison of the models. Within New Zealand plantations, regression kriging was found to be marginally more precise than ordinary kriging, which, in turn, was more precise than Partial Least Squares [3,19]. A comparison of seven modelling methods using data that were collected from northwest Spain found the non-parametric Multivariate Adaptive Regression Splines (MARS) to be the most precisely predicted Site Index, which was closely followed by the parametric methods of stepwise regression and PLS [45].
Because each modelling method has its own limitations and advantages [51], an alternative approach for improving the overall model precision is to combine predictions from each model [52,53]. This method, which is known as Ensemble Modelling, is a well known methodology that can improve prediction through integrating knowledge from many sources [53]. Although this technique has been used for the prediction of many soil attributes [53,54] and class prediction studies [52], we are unaware of any studies that use spatial Ensemble Models for the prediction of Site Index.
In Chile, different Site Index curves have been developed for each region and geographic area, although these local predictions are relatively inaccurate and there is little understanding of how the Site Index responds to topography, climatic and edaphic conditions [55]. Given the wide diversity of environmental conditions within the region over which plantations are grown, we assumed that more than one modelling method would be required to best predict Site Index across south-central Chile. Consequently, the objectives of this study were to compare the precision of a wide range of modelling algorithms and determine whether the combination of multiple algorithms (e.g., by the means of spatial ensemble learning) could more precisely predict Site Index than the best performing single modelling algorithm.

Data and Covariates Description
Stand level data describing Site Index of P. radiata were extracted from 20 year stands. The site index for P. radiata is defined as the mean top height at age 20 years old, where mean top height is defined as the mean height of the 100 largest diameter trees [56]. As stands used in this study were planted between 1987-1997 Site Index could be estimated at 20 years of age rather than being projected forward to 20 years as is commonly done [57]. In total, there were 64,190 observations of Site Index available for modelling that were dispersed from Región del Maule (latitude 35 • 14 ) to Región de los Ríos (latitude 40 • 6 ) and covered a Site Index range of 14.2-42 m with a mean of 29.0 m. These observations were randomly split, with 75% used for the fitting dataset, 12.5% for the calibration dataset (used for the ensemble methodology only), and 12.5% for the validation dataset. All three datasets covered a similar geographic range that was representative of the location of P. radiata plantations through Chile ( Figure 1). The site Index and enviromental conditions were very similar between the three data sets, and they are summarized in Table 1.
The 64 environmental factors or covariates, as listed in Appendix B, were extracted for each of the plot locations at a 90 × 90 m resolution from spatial layers. These spatial layers described topography, vegetation index, soil properties, and climate. Topography was characterised from a Digital Elevation Model (DEM) that was created using LiDAR. Automated Geoscientific Analyses (SAGA) was used to extract the topographical variables listed in Appendix B from this DEM. The enhanced vegetation index (EVI) was used in order to characterise the vegetation. The values for EVI were derived from MODIS images collected between 1987-2017 period that were reclassified to 90 × 90 m. Values of EVI describing the mean, range, and standard deviation were extracted from this imagery (Appendix B). The soil morphology was determined from the Chilean Natural Resources Information Center (CIREN). The soil properties were determined by interpolating data from ten thousand soil pits that were distributed across south-central Chile that belong to Arauco. The soil surfaces available from this dataset included soil depth, clay content, nutrient content (C:N ratio, N content) and physical properties (available soil water, bulk density, hydraulic conductivity). Long term monthly air temperature, rainfall, evapotranspiration, and water balance were obtained from CR2 (Center for Climate and resilience research [58]), which is unpublished information, but available for purchase.  In addition to long term averages, mean climatic data that were linked to the time period of each plot were also used within analyses. These covariates were developed for the model prediction while using the climate that was experienced by each forest stand (observations), from the plantation establishment until an age of 20 years, including the effect of rainfall for different stand periods (first, second, and forth initial years post establishment), accumulative cold days, evapotranspiration, and water deficit index per stand. Where these variables were significant based on the relative environmental feature selection (details in Section 2.2), they were used to construct the models. The map development used raster surfaces of these variables while using the average values over the last 30 years for air temperature, water balance, rainfall, and evapotranspiration.

Subsetting of Covariates for the Modelling
A pre-processing method was used in order to extract a subset of relevant features from the available 64 covariates. According to Weston et al. [59], this step plays an important role in improving prediction performance and can reduce overfitting. Two types of redundant features selection were made. The first was for non-parametric models (NPM), which included random forest (RF), support vector machines (SVM), neural network (ANN), eXtreme Gradient Boosting (XGBoost), and Multivariate Adaptive Regression Splines (MARS). The second method was used for parametric models (PM), including multiple linear regression (MLR), partial least squares (PLS), and elastic net (EN). Appendix A provides the description of the pre-processing that each of these models follow. Using the methods detailed in the next subsections, a total of 18 variables were selected for PM, while 20 were selected for NPM. From these selected 'optimal' variables, the top five were identified based on their level of importance ranking, while using the mean reduction in accuracy (MDA) for NPM (further details in Section 2.1) and Pearson correlation coefficient for PM. The models of Site Index were developed while using PM and NPM methods that were based on both the optimal selection and the top five variables. Model development using the reduced number of variables was undertaken in order to produce a greater computational time efficient alternative, interpretable, and parsimonious models, and to explore how variable number affected statistical differences in SI prediction.

Covariate Selection for Non-Parametric Models
Recursive Feature Elimination (RFE) implemented via caret [60] was used to subset variables for the NPM. This method seeks to improve generalization performance through removing the features that have the least effect on training errors [61]. RFE is basically a backward selection of the predictors, which selects features by recursively considering smaller and smaller sub sets of features, and then builds a model while using the remaining attributes and calculates model accuracy with an internal cross-validation [62].
During the reduction of input selection with RFE the importance of each variable is measured based on the MDA. This process assesses how much the prediction accuracy drops by randomly permuting the values of each input variable (one at a time). A higher importance of the input variable under consideration corresponds to larger reductions in the prediction accuracy [63]. The top five covariates were selected based on RFE importance level.

Covariate Selection for Parametric Models
A penalized regression was used in order to reduce the number of predictors for PM as this method has been shown to produce more parsimonious models [64]. A least absolute shrinkage and selection operator (LASSO) was used to penalize the parameter estimates to avoid overfitting. LASSO finds the best variables and coefficients by minimizing the residual sum of squares and adding penalties that are useful for fitting a wide variety of linear models [65]. This technique requires the selection of a tuning parameter λ that determines the amount of shrinkage. This method was followed by a complementary collinearity test while using the variance inflation factor (VIF). VIF can be used in order to identify correlated variables, which can inflate coefficient values, and it is determined from the following formulation [66], where R 2 is the coefficient of determination.
The variables that were identified to be most important by penalized regression were reduced to a set of 18 predictors while using a procedure that sequentially eliminated correlated variables with a VIF that exceeded four [66].
After LASSO and VIF were completed, a ranking of the variables was obtained that was based on the strength of the regression between the variable and Site Index.

Modelling Approach
Five types of modelling methods were used in order to predict Site Index. Each of them used various machine learning and regression algorithm methods, which included: (1) geostatistical models (ordinary kriging [3,19]); (2) parametric models (multiple linear regression, partial least squares [45], elastic net); (3) non-parametric models (random forest [67], support vector machines [62], neural network [33], XGBoost [68] and MARS [45]); (4) hybrid models (random forest-kriging and PLS-kriging [19]); and, (5) a modelling ensemble approach using the best five models from the previous four categories. The strategy for fitting the models is outlined below. The parametric, non-parametric models and ordinary kriging (OK) were fitted to Site Index while using the fitting dataset. Regression kriging was then used to create a range of hybrid models from the fitting dataset, through kriging the residuals of the partial least squares and random forest models. Predictions from all of these models were then made on the calibration dataset. The five most precise models were selected from this process and the difference between the actual and predicted Site Index (residuals) for these five models were determined. The residuals for these five most precise models were kriged using OK. For all pixels, the model with the lowest residual was selected and this combination of predictions from all five models was termed as the model ensemble (Code available on [69]. The final precision of all of the models that were developed, including the ensemble, was determined through predictions that were made on the validation dataset.

Model Description and Fitting Procedure
Appendix A provides abrief explanation of all the parametric and non-parametric models. Each of the PM and NPM machine learning models was fitted while using the caret package within R [60], utilising a five fold cross validation. Hyperparameters for the models were optimized using a grid search that started with a wide search radius and narrowed down to the final values over a series of iterations. Table 2 summarises the final hyperparameters.
Geostatistical models have been widely used for predicting spatially continuous variables that are based only on geospatial locations. Ordinary kriging (OK) is one of the most widely used geostatistical methods [3,19,[70][71][72], in which the value for an unsampled point is estimated based on the weighted average of observed neighbouring points within a given area [72]. The neighborhood was restricted to include only the 100 nearest neighbours. Ordinary Kriging was used in order to predict Site Index from, whereẐ(S 0 ) is the predicted value of an unvisited location S 0 , Z(S 1 ), .., Z(S n ) are the measured values and their location, and λ i are the weights that depend on the spatial autocorrelation of the variable, as defined by the semi-variogram (see below for description). Regression kriging is a hybrid modelling technique that combines model prediction of the dependant variables with ordinary kriging of the model residuals [73,74]. This method was used in order to predict Site Index, at an unsampled site, from, where the driftm refers to the predictions made by the modelling method (as described above) and the residuals from these models,ˆ , are interpolated while using ordinary kriging. In this study, the model drift was estimated while using random forests and partial least squares. Semi-variance is used within ordinary kriging and regression kriging to describe the spatial autocorrelation of measured values between locations. The plot of semi-variance against distance is a semi-variagram [74]. In this study the semi-variogram was fitted using the autofitVariogram from the R package automap, which tests various models (spherical model, exponential model, gaussian model, and Stein's parameterization) and automatically fits the most precise to the data [75]. Table 2. Final hyperparameters used in each model. The term "opt" or "5" is added to specify whether the model was fitted using the top five or the optimum number of covariates.

Model
Hyperparameter Best Tune (

Model Evaluation
Each model was evaluated by comparing the performance of our predicted Site Index with the validation data set, following the procedure that was proposed in Guevara and Olmedo [71]. Using modStats from the openair library from R, the statistics of different models were compared [76]. These statistics included root mean square error (RMSE), Mean Bias (MB), the Pearson correlation coefficient (r), and Index of Agreement based on Willmontt (IOA), which were defined, as follows: whereŷ is the predicted value in i, y is the average of the observed values (analogously for x), y i is the observed value (analogously for x), and n is the number of plots.

Covariate Selection
While using the covariate selection process described above, the number of independant variables used in the modelling was reduced from sixty-four to twenty for NPM and eighteen for PM (Table 3). These were further subsetted to the top five variables that are based on their importance level. The 18 key variables selected for PM were predominantly related to topography, with the remainder being evenly distributed across the climate, soil properties and vegetation categories. Eight of the 20 variables selected for NPM were related to climate with the remainder evenly distributed within the other three categories. The top five variables from the selected covariates mainly related to soil properties for PM and included soil hydraulic conductivity, available soil water, C:N ratio, and soil depth. From NPM, these top five variables were all related to climate and included growing degree days, rainfall, and two variables that accounted for seasonality in rainfall and air temperature.

Creation of the Ensemble Model
Predictions that were made on the calibration data showed that the five most precise models were either geostatistical or hybrid models. Four of these models were developed through regression kriging while using PLS and random forests with either the optimum set of covariates or the top five covariates, while the fifth model used OK. Within the calibration dataset residuals were extracted from these five models and OK was used to spatially interpolate values throughout the study area. An ensemble model was then created from these five models by selecting, from each pixel, the model with the lowest residual ( Figure 2).

Validation of the Models
All of the created models were fitted to the validation dataset and the model statistics are displayed for the top five models (Table 4) and all models (Appendix C). The model ensemble was the least biased (mean bias = −0.0227 m) with the highest precision (RMSE = 1.8505 m, r = 0.8103 and IOA = 0.7228). The five models that were used in order to create the ensemble were also the next most precise, with RMSE ranging from 1.8888-2.0373 m (Table 4) and mean bias ranging from 0.0309-0.5537 m. The number of variables included in the model did not markedly affect model precision, as these five models included regression kriged PLS and RF models that had either five variables or the entire set of covariates (18 for PLS and 20 for RF).
A plot of predicted against actual and residual values for the ensemble showed little apparent bias (Figure 3). This bias was smallest for the high density observations and residuals for these observations largely did not exceed 2 m. However the model did slightly overpredict at low values and underpredict at high values of Site Index, but this bias was generally constrained to outlying points that occurred at low density ( Figure 3).   Figure 4 illustrates the SI prediction of the five most precise models along with the ensemble approach. All five models that were used to produce the ensemble show the highest values of Site Index occurred at a latitude of ca. 36-38 • S within coastal and some parts of the inland areas. With the exception of OK, the lowest Site Index was predicted to occur in northern regions and in eastern regions that are close to the Andes. OK did not predict low values in eastern regions, as there were few plots in this area, but did predict low values in the north, where the plot density was higher. Predictions from the ensemble largely reflected the values from the five constituent models. This model predicted moderate to high values of Site Index in coastal areas and within the central valley with the lowest predicted values occurring in the eastern and northern regions.

Discussion
This study clearly demonstrated the utility of geostatistical methods for predicting the Site Index of P. radiata. While using a novel spatial ensemble approach, the most precise model for each pixel was combined in order to produce an overall model with a more precise prediction that the constituent models. The final model had an RMSE of 1.88 m which compared favourably with previous predictions for both P. radiata and other coniferous species [3,19,33,34]. The variable reduction process highlighted the sensitivity of Site Index to climatic and edaphic factors.
The geostatistical models of ordinary kriging or regression kriging provided the most precise predictions of Site Index among the compared methods. Previous studies have primarily focused on comparisons of precision between the models of Site Index, which do not include a geostatistical component, demonstrating that non-parametric generally outperform parametric models [33,45,46]. Our results extend this research through demonstrating that the addition of a spatial component to both of these model types outperforms models without this component. Gains through regression kriging over the base model were particularly marked for the most precise model, which utilised PLS (r = 0.807 vs 0.705), demonstrating the utility of this approach. Although regression kriging has been widely used for prediction in other domains [67,77,78], with few exceptions very little research has used regression kriging for prediction of productivity indices. As noted by Samuel-Rosa et al. [79], there was generally a consistent but small reduction in precision when the number of variables in the models was reduced from between 18-20 to five.
Regression kriging and ordinary kriging have been found to be the most precise when applied to high density datasets. The accuracy of OK is most influenced by the spatial point pattern (random, aggregated, or regular), sampling density (high or low), autocorrelation, data distribution (normal and skewed), and heterogeneity of the data [80]. The high density of observations in this study favoured the use of ordinary kriging and regression kriging, which is consistent with a model of P. radiata Site Index developed in New Zealand [19]. Regression kriging is less sensitive to the spatial distribution of the sample plots than ordinary kriging, as this method also includes an underlying model that is based on environmental variables. As a result, regression kriging can outperform OK when datasets include a range of plot densities across the area of interest (e.g., [19]).
A novel ensemble approach was used here to spatially combine predictions from the five most precise prediction models. Although ensemble methods have been widely used in other disciplines [52,54,81,82], this method has not previously been used in order to predict Site Index. Most ensemble approaches combine predictions from all models across the entire study area using a range of approaches to weight the individual model predictions [81,83]. Our approach differs in that a single model was used in order to predict Site Index within each pixel, which improved the overall predictive precision over any of the five constituent models.
One of the advantages of our ensemble approach is that this method highlighted regions in which each model performed best, which may be useful if predictions need to be made within a sub-set of the study area. The RF-Kriging method was found to be well suited to northern regions with sparse observations and southern parts of the study area without observations, which highlights the utility of this approach, where the observation density is low. OK had less error in regions where there were both sparse and denser observations. In high altitude eastern areas where the OK model was not selected predictions from this model are likely be higher than actual values as there were not any points to interpolate within this region. In the central part of the study area, as well as southern parts of the Andes, PLS-riging had the lowest prediction error and this region generally included a dense concentration of observations.
The five environmental variables that were the most important determinants of Site Index were all climatic variables for the RF-Kriging model and almost all edaphic variables for the PLS-Kriging model. Growing degree days, accumulated rainfall during years 1 and 4, and variables describing the rainfall and temperature seasonality were the most important climatic determinants of Site Index. Growing degree days has a sound physiological basis as a predictive variable, as P. radiata height extension is strongly regulated by air temperature [84,85] and, consequently, this variable controls the length of the growing season. The sensitivity of Site Index to rainfall has been well established [4,8] and our study clearly demonstrates the importance of adequate rainfall during the years immediately after establishment. The seasonality of air temperature and rainfall were also important regulators of Site Index within Chile, where both of the variables exhibit a marked seasonal variance [86].
Important soil properties included C:N ratio, soil hydraulic conductivity, available soil water, and soil depth. Soil C:N ratio has been found to be a key determinant of conifer productivity [87] and it is a more precise proxy of soil nutrient availability than N as C:N ratio accounts for the positive relationship between carbon content and nitrogen immobilisation [88][89][90][91][92]. Both available soil water and soil depth control the amount of water available to trees which are key attributes within the study area where rainfall is often sparse and highly seasonal [86]. Similarly, soil hydraulic conductivity is also related to water availability and it reflects the soil's ability to transmit water when subjected to a hydraulic gradient, therefore controlling the partitioning of precipitation between surface runoff and groundwater recharge [93].
Although enhanced vegetation index was not included in the top five variables, variables describing the mean and dispersion of EVI consistently featured among the optimum variables for both types of model. Previous research has found EVI to be a useful predictor of forest canopy structure [94]. There is a strong physiological link between this variable and growth rate, as EVI has also been found to be strongly related to leaf area index (LAI) [95,96] and EVI can also be used in order to identify the start of the growing season [97].
Topographic variables that were well represented for prediction of Site Index in both types of models included terrain surface convexity (TSC) and valley depth. These covariates influence local microclimate and soil-forming processes, and they are consequently associated with the soil type [43]. Both TSC and valley depth are also associated with multiple environmental variables, such as water drainage and water availability, as well as the accumulation of clay and other soil particles. Valley depth is also likely to be a proxy for local exposure to the wind. As higher windspeeds result in reduced tree height and increased diameter [98][99][100][101] P. radiata located in deep valleys with little wind exposure has been found to have significantly greater height than trees that are located on ridges or more exposed areas [102,103].
The direct estimation of Site Index from height data that were collected at the index age of 20 years reduced the error from extrapolation. According to Burkhart and Tomé [104], estimates of Site Index using measurements that coincide with 20 years are rare. As a result, most SI studies use equations to extrapolate height to the required index age, which is an approach that is potentially biased [105,106]. More recent methods, such as the generalised algebraic difference approach (GADA), which include polymorphic models, provide a more accurate estimation method, but still include uncertainties in the final prediction [105]. An additional advantage of using an older dataset was that site specific climatic conditions could be estimated over a uniform period of the rotation length from establishment to 20 years of age.

Conclusions
In conclusion, we found geostatistical models of Site Index to outperform a range of parametric and non-parametric models without a geo-spatial component. These five geostatistical models were successfully combined into an ensemble model that was more precise than the constituent models. Climatic and edaphic variables were most strongly related to Site Index, although EVI and many topographic variables were also widely used within the five most precise models. Variables that are related to soil water balance, such as rainfall, soil depth, and water holding capacity, were well represented in the top five models reflecting the importance of water limitations in regulating growth across the study area.
This research highlights the potential improvements that can be gained through the application of sophisticated modelling methods to site productivity modelling, which are likely to be transferrable to other species and countries. Although geostatistical models, such as Ordinary Kriging, are not likely to be as applicable in situations with sparser datasets the regression models used here would be applicable under these circumstances. Future research should more fully utilise LiDAR from existing plantations, as these data can be used to estimate height and Site Index very precisely. In the context of this study, LiDAR can also be used as a supplemental form of plot data, which could prove to be useful when existing plot data are sparse or do not completely cover all environmental conditions. These estimates can be used as input to machine learning and geostatistical models to generate surfaces and predictions of Site Index for unplanted sites.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy.

Acknowledgments:
The authors would like to thank Forestal Arauco, for their role in providing the study site observations, and financial support to this study. Additionally, we want to acknowledge Rodrigo Sobarzo, Fernando Bustamante, Juan Jose Rivera and Eldemar Jaskiu from Forestal Arauco for validating and reviewing the content of this document. Special thanks to Rodrigo Ahumada and Sebastian Fernandez for their support in the publication of this manuscript. Multiple linear regression is a statistical method that predicts the response variable from more than one independent variable. Use of this method assumes that independent variables are not too highly correlated and that residuals from the final model are normally distributed [107].

Conflicts of
Appendix A.1.2. Partial Least Squares Partial least squares (PLS) condenses the most useful information from a large number of predictors to a reduced set of uncorrelated components. These components are then used within a regression to predict the dependant variable [3,108]. The main advantage of this method is that provides a lower risk of chance correlation and higher predictive accuracy than multiple regression particularly when there are a large number of correlated predictors in the dataset [109]. The number of components within the model was optimised.
Appendix A.1.3. Elastic Net Elastic net (EN) is a form of regularised regression. This method incorporates penalities from both lasso and ridge regression that constrain the coefficient size within the model. EN performs automatic variable selection like Lasso, while the penalization from the Ridge term stabilizes the solution paths which improves the prediction accuracy. The model was fitted using the 'glmnet' method, which was used to optimise the two hyperparameters alpha (mixing percentage) and lambda (regularization parameter).

Appendix A.2. Non-Parametric Models
Appendix A.2.1. Random Forest Random Forest (RF) consists of a combination of many binary decision trees built using several bootstrap samples from a supervised machine learning algorithm, which randomly chooses, at each node, a subset of explanatory variables [110]. RF belongs to the family of ensemble methods, in where the final prediction comprises the average predictions from individual trees [111].
Using the caret package [60], a more memory efficient implementation of RF, using the 'ranger' method was applied [112]. For this methodology the following parameters were optimised: (i) the number of randomly selected predictors (mtry), (ii) the minimum node size and (iii) the splitting rule.
Appendix A.2.2. Support Vector Machines Support vector machine (SVM) applies a simple linear technique to the data but in a high-dimensional feature space that is non-linearly related to the input space [113]. The regularized SVM Machine (dual) with Linear Kernel was fitted to the data [60]. The covariates were standardized by pre-processing the predictor data ("center", and "scale"). The loss function and cost parameters were optimised within the model.

Appendix A.2.3. Neural Network
Neural networks or artificial neural networks (ANN) consist of processing units called neurons or nodes, whose functionality is loosely based on the structure and behaviour of the natural neuron. This technique uses mathematical models that learn nonlinear relationship between the data set, and response variable for both prediction and classification [114]. ANN was fitted using "nnet" method [60] and the hidden units and weight decay parameters were optimised.

Appendix A.2.4. Xgboost
The method eXtreme Gradient Boosting (XGBoost) is a scalable implementation of gradient boosting framework developed by Friedman [115]. XGBoost is a supervised machine learning, using decision-tree algorithms. This method builds models from adding individual so called "weak learners" from a gradient descent algorithm over an objective function. XGBoost used the caret package with the "xgbTree" methodology [60]. The hyperparameters that were tuned for this method, included : (i) boosting iterations, (ii) maximum tree depth, (iii) shrinkage, (iv) minimum loss reduction, (v) subsample ratio of columns, (vi) minimum sum of instance weight, and (vii) subsample percentage.
Appendix A.2.5. Multivariate Adaptive Regression Splines Multivariate Adaptive Regression Splines (MARS) is a regression technique that captures the nonlinear relationships in the data by assessing cutpoints (knots), which identifies regions where the relationship between the predictor variable and the response changes [116]. The main advantage of MARS is that it enhances the interpretability of complex interactions between the predictor and the response variables. This model was fitted using the "earth" method in the caret package [60], which tuned the number of terms (the maximum number of knots) and the number of variable interactions. Climate bio7 = Temperature annual range (bio5-bio6) ( • C) biovar. 8 Climate bio8 = Mean temperature of the wettest quarter ( • C) biovar. 9 Climate bio9 = Mean temperature of driest quarter ( • C) biovar. 10 Climate bio10 = Mean temperature of warmest quarter ( • C) biovar. 11 Climate bio11 = Mean temperature of coldest quarter ( • C) biovar. 12 Climate bio12 = Total (annual) precipitation (mm) biovar. 13 Climate bio13 = Precipitation of wettest month (mm) biovar.14 Climate bio14 = Precipitation of driest month (mm) biovar. 15 Climate bio15 = Precipitation seasonality (coefficient of variation) (%) biovar. 16 Climate bio16 = Precipitation of wettest quarter (mm) biovar. 17 Climate bio17 = Precipitation of driest quarter (mm) biovar. 18 Climate bio18 = Precipitation of warmest quarter (mm) biovar. 19 Climate