Integrating Neighborhood Effect and Supervised Machine Learning Techniques to Model and Simulate Forest Insect Outbreaks in British Columbia, Canada

Background and Objectives: Modelling and simulation of forest land cover change due to epidemic insect outbreaks are powerful tools that can be used in planning and preparing strategies for forest management. In this study, we propose an integrative approach to model land cover changes at a provincial level, using as a study case the simulation of the spatiotemporal dynamics of mountain pine beetle (MPB) infestation over the lodgepole pine forest of British Columbia (BC), Canada. This paper aims to simulate land cover change by applying supervised machine learning techniques to maps of MPB-driven deforestation. Materials and Methods: We used a 16-year series (1999–2014) of spatial information on annual mortality of pine trees due to MPB attacks, provided by the BC Ministry of Forests. We used elevation, aspect, slope, ruggedness, and weighted neighborhood of infestation as predictors. We implemented (a) generalized linear regression (GLM) and (b) random forest (RF) algorithms to simulate forest land cover changes due to MPB between 2005 and 2014. To optimize the ability of our models to predict MPB infestation in 2020, a cross-validation procedure was implemented. Results: Simulating infestations from 2008 to 2014, RF algorithms produced less error than GLM. Our simulations for the year 2020 confirmed the predictions from the BC Ministry of Forests by forecasting a slower rate of spread in future MPB infestations in the province. Conclusions: Integrating neighborhood effects as variables in model calibration allows spatiotemporal complexities to be simulated.


Introduction
Disturbances are a critical component of forest dynamics, which shape and substantially affect these key ecosystems [1,2]. Of particular interest, forest insect epidemics can exert severe impacts on ecosystem dynamics through the mortality or growth reduction of millions of trees over widespread areas [3][4][5][6]. Both indigenous and invasive species can disturb natural and managed forest habitats [6][7][8]. In addition to ecological impacts, insect epidemics may have devastating effects on economic (e.g., losses within the forestry industry) [9][10][11] and social (e.g., unemployment due to sawmill closures) development or stability [12,13].
In the province of British Columbia (BC), Canada, an unprecedented insect outbreak of mountain pine beetle (MPB; Dendroctonus ponderosae Hopkins) began in the early 1990s and reached a peak between 2005 and 2006, which facilitated a massive migration of beetles into the province of Alberta. The objectives of this study were to:

1.
Compare the performance of two methodologies, namely binomial regression and random forests, to model the MPB spread between 1999 and 2014.

2.
Evaluate the usefulness of a set of predictor variables, describing the influence of local topography and the state (i.e., infested/non-infested) of neighboring localities, to determine the extent and speed of the MPB infestation.

3.
Simulate possible land cover changes in 2020, due to MPB infestation.

Study Area
The study was conducted in the Canadian province of British Columbia (BC) and covers an area of 944,735 km², extending from 59°59′27″ N, 138°54′19″ W to 48°59′53″ N, 114°2′37″ W (Figure 1). BC is known for its highly diverse mountainous landscape, subject to a diversity of disturbance regimes [39][40][41]. The climatic conditions in the province are largely controlled by the Pacific Ocean to the west, continental air masses in the interior plateaus, and the Rocky Mountains to the east [42]. Seventy percent of the total area is covered by forest, whereas only two percent of the total area is used by humans for habitation, cultivation, etc. Forests in central BC, where lodgepole pine is the main tree species, have been experiencing an epidemic infestation of MPB, due to factors including fire suppression and changing climate [43].

Pine Mortality Dataset
The original source of data for the project is a collection of 16 maps indicating cumulative lodgepole pine mortality caused by MPB attacks on the forests of the province as observed in the period between 1999 and 2014 [20]. Observations were acquired by the BC Ministry of Forests from aerial photographs and LANDSAT satellite images, wherein infested areas are identifiable based on their spectral response and by calculating a Normalized Difference Moisture Index (NDMI), contrasting the near-infrared (NIR) band 4, which is sensitive to the reflectance of leaf chlorophyll content, with the mid-infrared (MIR) band 5, which is sensitive to the absorbance of leaf moisture. These maps are in a raster format with an Albers equal area projection and a cell size of 400 m. The cell values equal 10 times the percentage of infestation in each cell, hence ranging from 0 to 1000; the Ministry of Forests, Lands and Natural Resource Operations of the Canadian province of BC has made this dataset publicly accessible [44]. Existing literature and reports emphasize the importance of infestation levels above which the risk of MPB attack should be considered seriously by forest managers for further investigation [44][45][46].
For the purposes of this study, we applied a threshold to the cumulative infestation maps in order to transform their continuous percentage scale into a binary scale (i.e., infested = 1/not infested = 0). This is a simplifying assumption that enabled us to apply well-known statistical methods to our datasets. The procedure to calculate that threshold value is described in detail in Appendix A.

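The binarization step can be sketched as follows. This is an illustrative Python sketch, not the authors' R workflow; it assumes the raster cells store 10 times the infestation percentage (0–1000), and applies the proportion threshold of 0.1586553 derived in Appendix A.

```python
import numpy as np

# Illustrative sketch: binarize a cumulative infestation raster whose cells
# store 10x the infestation percentage (0-1000), using the proportion
# threshold 0.1586553 from Appendix A.
THRESHOLD_PROPORTION = 0.1586553

def binarize(raster, threshold=THRESHOLD_PROPORTION):
    """Return a 0/1 map: 1 where the infested proportion exceeds the threshold."""
    proportion = np.asarray(raster, dtype=float) / 1000.0  # 0-1000 -> 0-1
    return (proportion > threshold).astype(np.uint8)

# Tiny synthetic 2x3 raster of cell values (10x percent infested)
demo = np.array([[0, 150, 200],
                 [1000, 158, 159]])
out = binarize(demo)
```

In this sketch, a cell with value 159 (proportion 0.159) is marked infested, while 158 (proportion 0.158) is not, since it falls below the threshold.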

Predictor Variables
For modelling purposes, we assumed that the infestation pattern in the near future depended directly on the status of the infestation during past years. For the sake of notation, we denote by t2 the future year for which a prediction is sought; t1 represents the starting date from which the simulated map of MPB infestation is projected; and t1p designates a previous year, for which further explanatory variables, used by the model, must be determined. Throughout the study, t1p < t1 < t2, with t1 − t1p = 1 year and t2 − t1 = 3 years.
We also assumed that the probability of beetle infestation in our thresholded maps depended entirely, for a given pixel, both on the local topography and on the state of the infestation in adjacent pixels. Arguably, local topography may either enhance or stall MPB outbreaks by, e.g., boosting or blocking beetle flights, respectively. It may also determine local climatic and environmental conditions in forests, making them a more or less suitable habitat for beetles to settle and attack. In turn, the infestation status of nearby areas should arguably have a direct influence on the number of beetles that affect a given location. These are distance-dependent variables that must be determined for every individual simulation. That dependence, however, is not known a priori and must be approximated. Bearing these ideas in mind, we set out to select a set of variables that could serve as valid drivers for the MPB infestation.
Our choices of explanatory variables represent the drivers that we fed to the numeric MPB infestation model. These variables are listed in Table 1, together with the corresponding symbols used in Equations (6) and (7). The Time column of Table 1 indicates whether a variable is calculated at t1p or t1 (see text for an explanation). Topography-based explanatory variables were calculated only once, at the beginning of the calibration, because we assumed that local topographic conditions did not change during the time intervals spanned by the simulations:

1.
Elevation: MPB infestation has been observed to take place mostly at low or medium elevations [47]. Elevation is defined as height above sea level per pixel. We used a Digital Elevation Model provided by GeoBC. The original pixel size of 500 m was resampled to 400 m to match the resolution of the MPB infestation map.

2.
Slope: Steeper areas may affect, for example, distances between tree canopies on a hillside [25,48], which, in turn, may make it easier or harder for beetles to fly from one tree to another. Slope was calculated from the elevation map with the "terrain" function of the "raster" R package.

3.
Aspect: The spread of the MPB infestation may benefit from milder temperatures [20,38,49] on south-oriented slopes. For that reason, aspect was calculated from the elevation map as the compass direction of the pixel slope face, using the "terrain" function of the "raster" R package. Aspect was then transformed with sine and cosine functions to remove the discontinuity at 0–2π radians (0°–360°), i.e., the ambiguity between 0° and 360°.

4.
Ruggedness: Adult beetle flight may be faster and/or longer over open ground [50]. To account for this effect, we implemented the Terrain Ruggedness Index (TRI) [51]. The TRI index was calculated from the elevation map with the "terrain" function of the "raster" R package using an 8-pixel window.
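The topographic transforms above can be sketched in a few lines. This is an illustrative Python sketch (the authors used R's "raster" package); it assumes aspect is given in radians, and computes a TRI as the mean absolute elevation difference between a cell and its 8 neighbours, which is one common TRI definition and may differ in detail from the published index.

```python
import numpy as np

def aspect_components(aspect_rad):
    """Map aspect (radians) to continuous sine/cosine components."""
    a = np.asarray(aspect_rad, dtype=float)
    return np.sin(a), np.cos(a)

def tri(dem):
    """Terrain Ruggedness Index on interior cells of a 2-D elevation array:
    mean absolute difference between each cell and its 8 neighbours."""
    dem = np.asarray(dem, dtype=float)
    out = np.zeros_like(dem)
    diffs = []
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            if di == 0 and dj == 0:
                continue  # skip the center cell itself
            diffs.append(np.abs(dem[1 + di:dem.shape[0] - 1 + di,
                                    1 + dj:dem.shape[1] - 1 + dj]
                                - dem[1:-1, 1:-1]))
    out[1:-1, 1:-1] = np.mean(diffs, axis=0)
    return out

flat = np.full((3, 3), 100.0)   # perfectly flat terrain -> TRI of 0
```

On flat terrain every neighbour difference is zero, so the interior TRI is zero; a single raised cell yields a TRI equal to its height above the surrounding plane.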
In contrast, adjacency-type predictors had to be computed at every temporal step of the simulation. As a measure of the dependence of the infestation rate on the state of the neighboring pixels, we computed a weighted sum of the surrounding pixels (containing 0s or 1s) at each location. The basic equation to account for the adjacency effect can be written as:

z_j = Σ_{i : d_i ≤ d_max} w_i p_ij

where z_j stands for the generic adjacency effect at location j. Index i runs over all pixels whose distance d_i from location j is smaller than or equal to the maximum size d_max of the weighting window, w_i is the weight assigned to pixel i, and p_ij is the binary value of pixel i. To account for the unknown true dependency, we used four different weighted sum expressions:
1. Identity weighting (z_iden): all pixels receive equal weights until d_i = d_max.
2. Linear weighting (z_lin): weights decrease linearly until d_i = d_max.
3. Inverse-distance weighting (z_inv): weights decrease as a function of the inverse of distance until d_i = d_max.
4. Squared-inverse-distance weighting (z_squ): weights decrease as a function of the inverse of the squared distance until d_i = d_max.

In parentheses, we have included the corresponding variable names used in Equations (6) and (7) and Table 1. Appendix C demonstrates the aforementioned four weighting functions in a neighborhood of radius 5.
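The weighted neighborhood sums can be sketched as follows. This is an illustrative Python sketch; the exact published weighting forms are in the paper's Appendix C, and this version assumes the center pixel is excluded and weights are zero beyond d_max.

```python
import numpy as np

def kernel(d_max, kind):
    """Weight kernel w_i over a (2*d_max+1)^2 window as a function of distance."""
    ax = np.arange(-d_max, d_max + 1)
    dy, dx = np.meshgrid(ax, ax, indexing="ij")
    d = np.sqrt(dx ** 2 + dy ** 2)
    w = np.zeros_like(d)
    inside = (d > 0) & (d <= d_max)          # exclude center, cap at d_max
    if kind == "iden":                        # identity: equal weights
        w[inside] = 1.0
    elif kind == "lin":                       # linear decay to 0 at d_max
        w[inside] = 1.0 - d[inside] / d_max
    elif kind == "inv":                       # inverse distance
        w[inside] = 1.0 / d[inside]
    elif kind == "squ":                       # squared inverse distance
        w[inside] = 1.0 / d[inside] ** 2
    return w

def adjacency(binary_map, d_max, kind):
    """z_j = sum over neighbors i of w_i * p_ij, for every location j."""
    w = kernel(d_max, kind)
    m = np.asarray(binary_map, dtype=float)
    padded = np.pad(m, d_max)                 # zero state outside the map
    out = np.zeros_like(m)
    for i in range(m.shape[0]):
        for j in range(m.shape[1]):
            window = padded[i:i + 2 * d_max + 1, j:j + 2 * d_max + 1]
            out[i, j] = np.sum(w * window)
    return out
```

For a single infested pixel, the resulting z surface simply reproduces the chosen kernel centered on that pixel, which is a convenient way to visualize the four weightings.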
All of these predictor variables were determined at every infested and non-infested pixel in the t 1p and t 1 maps. Regarding adjacency-type predictors, we included the weighted total number of surrounding infested pixels both at t 1p and at t 1 as adjacency-type predictors. Because we ignored the exact dependence of the infestation rate on these predictors, we chose several weighting procedures to account for this unknown dependency and included all of them in the calibration procedure as independent variables.
We carried out a preliminary exploration of the relationship between infestation probability, represented by the thresholded infestation map specified above, and the set of predictor variables described in the previous paragraphs. Appendix A shows the logit-transformed average infestation as a function of the binned predictors. In general, the infestation rate appears to change smoothly with the predictors: the mean response showed a parabolic dependence on elevation, indicating that the expected infestation rate is proportional to a quadratic function of elevation, and an approximately linear response to slope, aspect, ruggedness, and the log-transformed adjacency measures.
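This exploratory step can be sketched as binning a predictor and logit-transforming the mean infestation per bin. This is an illustrative Python sketch with synthetic data, not the authors' analysis.

```python
import numpy as np

def logit(p, eps=1e-6):
    p = np.clip(p, eps, 1 - eps)              # guard against log(0)
    return np.log(p / (1 - p))

def binned_logit_response(x, y, n_bins=10):
    """Logit of mean binary response y within equal-width bins of predictor x."""
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    idx = np.clip(np.digitize(x, edges) - 1, 0, n_bins - 1)
    means = np.array([y[idx == b].mean() if np.any(idx == b) else np.nan
                      for b in range(n_bins)])
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers, logit(means)

# Synthetic example: response generated from a known logistic relationship
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 5000)
y = (rng.uniform(size=5000) < 1 / (1 + np.exp(-(4 * x - 2)))).astype(int)
centers, lg = binned_logit_response(x, y)
```

If the underlying relationship is logistic in x, the binned logit response is approximately linear in x, which is the kind of pattern checked in Appendix A.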

Approaches to Model and Simulate Land Cover Changes
To model and simulate the changes in the lodgepole pine forest cover within the province of BC during the sixteen-year period of the data set on recorded MPB epidemics, we implemented two algorithms: (1) generalized linear regression (GLM), and (2) random forest (RF).

Generalized Linear Regression (GLM)
Generalized linear regression is a maximum-likelihood regression methodology that computes the parameters of a linear model that maximize an appropriate log-likelihood [52]. It can be applied to cases where the error distribution of the response variable is not Gaussian. The linear model and the dependent variable are related directly via a continuous link function, which relates the expected value of the response variable to a linear combination of the predictors. In turn, the error distribution of the dependent variable determines the choice of distributional model from which a log-likelihood can then be derived. There are several common options for log-likelihood and link functions, and the modeler must choose among them by considering the observed relationships between response and predictor variables. The link function imposes a final expression for the log-likelihood that is often non-linear in the parameters. As a consequence, maximization of the log-likelihood is carried out iteratively with an appropriate optimization scheme (see, e.g., McCullagh and Nelder, 1989).
To account for the binary dependent variable representing an MPB affected/non-affected pixel, we chose a Bernoulli log-likelihood, which corresponds to a binomial log-likelihood with the number of trials equal to one. In turn, we determined the most appropriate dependence between the expected binary response and the explanatory variables by carrying out an exploratory exercise (see Appendix B). Based on those observed relationships, we proposed the following linear logit link function:

logit(µ_j) = β_0 + β_1 e_j + β_2 e_j² + β_3 s_j + β_4 a_j + β_5 r_j + β_6 z_iden,t1 + β_7 z_lin,t1 + β_8 z_inv,t1 + β_9 z_squ,t1 + β_10 z_iden,t1p + β_11 z_lin,t1p + β_12 z_inv,t1p + β_13 z_squ,t1p (7)

For the sake of clarity, the sub-index j (j = 1, …, n, where n is the number of observations) was omitted in the neighborhood covariates z. In these equations, µ_j indicates the expected value of the binary response variable at location j, the sub-indexed βs are the unknown parameters to be calculated, and the predictor variables are shown in Latin letters, following the notation illustrated in Table 1. In our calculations we used the "glm" function of the built-in stats package of the R software [53].
The generalized linear regression scheme described above produced a continuous function, µ, bounded between 0 and 1, such that it represents the probability of infestation. Therefore, at each location j, that probability is given by:

µ_j = 1 / (1 + exp(−(β_0 + β_1 e_j + …)))

where the ellipsis in the exponent indicates the remaining linear terms and, where applicable, their interactions, depending on whether we use Equation (6) or (7). We then used µ to elaborate a predictive binary map pinpointing the location of newly infested pixels. This last step entailed the selection of a valid cut-off value to map the continuous µ variable onto a binary scale. We computed a suitable cut-off for µ by maximizing the fuzzy kappa index between the observed and predicted infestation maps.
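The cut-off selection can be sketched as a scan over candidate thresholds. This is an illustrative Python sketch: the paper maximizes a fuzzy kappa index, whereas for simplicity this version maximizes plain Cohen's kappa on the binary maps.

```python
import numpy as np

def cohens_kappa(obs, pred):
    """Cohen's kappa between two binary maps (plain, not fuzzy, kappa)."""
    obs = np.asarray(obs).ravel()
    pred = np.asarray(pred).ravel()
    po = np.mean(obs == pred)                   # observed agreement
    p1o, p1p = obs.mean(), pred.mean()          # marginal "infested" rates
    pe = p1o * p1p + (1 - p1o) * (1 - p1p)      # chance agreement
    return (po - pe) / (1 - pe) if pe < 1 else 0.0

def best_cutoff(mu, observed, n_cutoffs=1000):
    """Scan cut-offs on the probability map mu; keep the kappa-maximizing one."""
    cutoffs = np.linspace(0, 1, n_cutoffs)
    kappas = [cohens_kappa(observed, mu > c) for c in cutoffs]
    return cutoffs[int(np.argmax(kappas))]
```

With a perfectly separable probability map, any cut-off between the two groups of values attains kappa = 1, and the scan returns the first such cut-off.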

Random Forests (RF)
Random forests (RF) belong to the group of ensemble learning methods. Used for both classification and regression, RF are sets of decision trees, each built on a random subset of the data, whose predictions are combined by majority vote or by averaging the individual tree outputs, depending on the type of model. Throughout the calculations, we used the RF algorithms implemented in the "ranger" package of the R software. As predictor variables, we used the same variables shown in Equation (6). Preliminary results (Appendix E) suggested that it was better to regress than to classify with RF, so we used the "ranger" function of that package in regression mode.
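An analogous regression-mode setup can be sketched in Python. The authors used R's "ranger"; this sketch uses scikit-learn's RandomForestRegressor on synthetic stand-ins for the predictors, regressing the 0/1 infestation outcome to obtain a continuous score that is later thresholded.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-ins for the predictors (elevation, slope, aspect, ...)
rng = np.random.default_rng(42)
X = rng.uniform(size=(2000, 5))
p = 1 / (1 + np.exp(-(3 * X[:, 0] + 2 * X[:, 4] - 2.5)))  # true probability
y = (rng.uniform(size=2000) < p).astype(float)            # binary outcome

# Regression mode: predictions are averages of 0/1 leaves, i.e. scores in [0, 1]
rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(X, y)
scores = rf.predict(X)   # continuous "probability" map to be thresholded
```

Because the targets are 0/1 and each tree prediction is a leaf average, the ensemble output stays within [0, 1] and can be thresholded exactly like the GLM probability map.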

Model Calibration and Validation
To optimize the ability of our models to predict infestation in non-infested pixels, we set up a cross-validation strategy by dividing the datasets into training and test subsets. The training dataset was used to calculate the unknown parameters of the model, whether binomial or RF, as shown above. The test dataset, in turn, was used to gauge the predictive ability of the models and to adjust a relevant parameter. We selected a training set containing 75% of the original dataset, whereas the test set included the remaining 25%.
Figure 2 shows the method of analysis in a flow chart. The flow of the algorithm is downwards and generally to the right. Raw inputs include the geographical variables elevation, slope, and aspect, in addition to three infestation maps. The time interval between the first two maps, at t1p and t1, is one year, and the time step between the last two maps, between t1 and t2, is 3 years. The goal of the algorithm is to predict the binary spread of infestations in the next time step. After reading the inputs, the algorithm is divided into two main parts. In the first part, the inputs are analyzed, and a model is developed and parameterized to simulate the change from the first two images (the t1p and t1 maps) to the third (t2). In the second part, the parameterized model is used to predict the change from the last two given images (the t2p and t2 maps) to the next time step in the future (t3).

In this model, parameters were determined and adjusted in three stages. These stages are shown with gray flowchart shapes in Figure 2. First, the initial threshold for conversion of numeric data into binary (infested/not infested) was identified. The input images were converted to binary assuming a threshold of 0.1586553 (see Appendix A), i.e., infestation proportions above 0.158 were taken as presence and lower values as absence. The second stage was the adjustment of the regression model parameters. To do so, the compiled dataset was divided into training and test datasets.
Using the training dataset, regression parameters were estimated. This preliminary model was then applied to the test dataset in order to assess its ability to predict the changes in the input data. Such assessment involved the third stage of model parameterization: the selection of the final threshold for creation of the binary output. This was done by testing the goodness-of-fit of 1000 images created by different cut-off values. The test involved giving test data to the model and comparing its output with the most recent input image. The cut-off thresholds giving the highest kappa and the highest Youden's J statistic were selected for application in the prediction phase.
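The alternative final-cutoff criterion, Youden's J statistic (sensitivity + specificity − 1), can be sketched in the same scanning style. This is an illustrative Python sketch, not the authors' R implementation.

```python
import numpy as np

def youdens_j(obs, pred):
    """Youden's J = sensitivity + specificity - 1 for two binary maps."""
    obs = np.asarray(obs, bool).ravel()
    pred = np.asarray(pred, bool).ravel()
    tp = np.sum(obs & pred)
    fn = np.sum(obs & ~pred)
    tn = np.sum(~obs & ~pred)
    fp = np.sum(~obs & pred)
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    return sens + spec - 1.0

def best_cutoff_youden(mu, observed, n_cutoffs=1000):
    """Scan candidate cut-offs on the probability map; keep the J-maximizing one."""
    cutoffs = np.linspace(0, 1, n_cutoffs)
    js = [youdens_j(observed, mu > c) for c in cutoffs]
    return cutoffs[int(np.argmax(js))]
```

Because J rewards sensitivity and specificity equally regardless of class prevalence, it typically selects a lower cut-off than kappa on rare-change maps, consistent with the behavior reported later in the paper.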
Model output was compared with the reference data for validation. It should be noted that the similarity of simulation and reference maps of the final year does not provide sufficient information for validating the model [54][55][56]. In fact, if the overall change in the landscape is small, even an erroneous prediction can still be highly similar to the reference. Therefore, model validation should account for the simulated and observed change. This requires considering three maps: a reference map of the beginning time, and reference and simulation maps at the ending time. In this study, the process of change is infestation, which is irreversible. As such, change may only occur in areas that were initially not infested. We analyzed these areas when validating the model.

Results
Validation testing of the model involved simulating changes in the study area from 2008 to 2014 and comparing them with reference observations. Prior to testing, the model was parameterized using data on changes from 2005 to 2008. Then, in two rounds, it simulated the change from 2008 to 2011 and from the predicted 2011 map to 2014. These predictions were made with 3-year time steps. Figure 3 shows model results for three algorithms, namely binomial regression (Equation (6); hereafter GLM1), binomial regression with parabolic elevation (Equation (7); hereafter GLM2), and random forest (RF) with the maximum kappa final cut-off threshold, in addition to the observed change (Appendix D).
Outputs of validation analysis for the three algorithms with the maximum kappa final cut-off threshold are presented in Figure 4. Each map of this figure corresponds to one simulation algorithm and demonstrates a comparison between three images: observed infestations in 2008, simulated infestations in 2014, and observed infestations in 2014. In this figure, correctly simulated changes are identified as "hits"; observed changes that are missing from the simulations are identified as "misses"; simulated changes that were not observed are identified as "false alarms"; and correct simulations of no change are identified as "correct rejections". Because infestation is a one-way process that is not reversible, zones of "prior infestation", that is, zones that were already infested at the beginning of the study period, are not susceptible to future change, and they were excluded from the assessment of model performance.
Table 2 shows hits, misses, and false alarms in numbers of pixels for each of the simulation algorithms. In addition, this table gives the figure of merit, which is the percentage of hits in the sum of hits, misses, and false alarms. The table also shows the overall accuracy of each model, which is the percentage of correct predictions in the study area. Using the model results, which indicate the locations of simulated infestations, and the Ministry's map of pine volume density [58] in the study area, cumulative estimates of pine volume killed were calculated for each algorithm. Table 3 shows these estimates. In comparison, the real percentage of cumulative change based on the Ministry's map of observed infestations for the year 2014 is 49% for the entire province (excluding the Tree Farm License Zone).
Figure 5 shows simulations of changes from 2014 to 2020, where the model was parameterized with observed changes from 2011 to 2014. The time step for all these simulations was 3 years.
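The three-map validation described above can be sketched as follows. This is an illustrative Python sketch: it scores only the pixels that were not infested at the start (since infestation is irreversible), and computes the figure of merit and overall accuracy on that candidate region.

```python
import numpy as np

def validate(obs_start, sim_end, obs_end):
    """Compare initial observed, simulated final, and observed final maps,
    scoring only pixels that were not infested at the start."""
    obs_start, sim_end, obs_end = (np.asarray(a, bool)
                                   for a in (obs_start, sim_end, obs_end))
    candid = ~obs_start                      # initially non-infested pixels
    hits = np.sum(candid & obs_end & sim_end)
    misses = np.sum(candid & obs_end & ~sim_end)
    false_alarms = np.sum(candid & ~obs_end & sim_end)
    correct_rej = np.sum(candid & ~obs_end & ~sim_end)
    fom = 100.0 * hits / (hits + misses + false_alarms)   # figure of merit (%)
    accuracy = 100.0 * (hits + correct_rej) / candid.sum()
    return dict(hits=int(hits), misses=int(misses),
                false_alarms=int(false_alarms),
                correct_rejections=int(correct_rej),
                figure_of_merit=fom, overall_accuracy=accuracy)
```

Excluding the prior-infestation zone matters: counting those pixels as trivially correct would inflate the accuracy of any model, including one that predicts no change at all.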
This is a demonstration of a possible application of the model for future predictions. Model outputs are projections of the infestations in the future, for which no reference data is available yet. As such, it is not possible to calculate the accuracy of these predictions. Rather, based on the results of validation tests described above and summarized in Table 2, we assume that the model is able to produce valid predictions. Aggregate percentages of cumulative pine volume killed are estimated for each simulation algorithm using the Ministry's pine volume density map, and are presented in Table 4.

Model Validation
All of the simulation and observation images show a noticeable area of new infestations north of the center. A closer examination also reveals scattered infestations in other regions. Regarding the larger infested area, we know from prior information that it is the result of the spread of previous infestations from the center of the province towards the north. A visual inspection suggests that, in the two GLM simulations, the southern and northern envelopes of the large infested areas are somewhat similar, as if the models had offset the prior infestations to the north. Considering that the model calculates the adjacency-type predictors in windows of a predefined size, it appears from Figure 4 that the GLM algorithms predicted a large number of infestations within a limited distance.
Comparing the images with respect to the large zone of new infestations north of the center, it appears that the GLM algorithms predicted a larger area for the spread of the insect. Moreover, the two GLM algorithms appear to have predicted a relatively more uniform spread inside the zone of new infestations. In contrast, the reference observation image shows a less homogeneous pattern in the same area. The result of the RF algorithm is also somewhat non-homogeneous in that area; however, identifying correspondence between this image and the reference observation requires further analysis.
Figure 4 clearly shows that the larger part of the observed and simulated change occurred in the area north of the center. It is noticeable that the two GLM algorithms predicted larger numbers of infestations in this area. Some of these predictions of infestation matched the reference observation (hits), but many of them did not (false alarms). False alarms in the results of these two algorithms are visually distinctive. In contrast, the RF algorithm predicted a smaller spread of infestations and missed some infestations that actually occurred.
Table 2 reveals useful information about the performance of the models. Considering that hits indicate correct predictions of change, and that misses and false alarms indicate errors in the prediction of change, the figure of merit is the proportion of correct predictions among all predictions involving change. This table confirms the above visual analyses of the simulation outputs. The GLM algorithms predicted more change (hits + false alarms) than RF. Some of these excess predictions were correct, so these algorithms had more hits than RF; however, many of them were incorrect. The GLM algorithms, especially with the Youden's J final cutoff threshold, produced a large number of false alarms. In fact, although they have more hits and therefore higher figures of merit, this advantage is overshadowed by their error in estimating the quantity of change.
The GLM algorithms, particularly with Youden's J final cutoff threshold, produced more false alarms than misses. This means that they overestimated the quantity of change. In contrast, the table shows that the RF algorithm underestimated the quantity of change, producing more misses than false alarms. With the kappa final cutoff threshold, the number of these errors in the estimation of quantity in RF is comparable with that of the GLM algorithms. With Youden's J final cutoff threshold, however, the underestimation of the quantity of change in the RF algorithm is relatively small.
Prior to producing prediction maps, each algorithm creates a probability map. It then compares that map with the final cutoff threshold; that is, it classifies pixel values above the final cutoff threshold as 'Infested', and others as 'Not infested'. The Youden's J threshold was calculated to be lower than the kappa threshold for each algorithm. Therefore, for the same probability map, the Youden's J cutoff predicts more infestations than the kappa cutoff. On the other hand, the table shows that, regardless of the final cutoff threshold, the GLM algorithms tend to predict more change. This tendency, combined with the lower Youden's J final cutoff threshold, resulted in the large overestimation of quantity seen in the 2nd and 4th rows of the table. One avenue for improving the model in future work could be to reduce its error in the estimated quantity of change.
As seen in Table 3, the random forest algorithm produces a closer estimate of the aggregate quantity of change in the study area. In comparison, the other two algorithms (GLM1 and GLM2) appear to overestimate this quantity. This is also confirmed in Table 2 and Figure 4; it may be noted that these algorithms produce more false alarms than misses. Having seen the model's performance in the validation test, it is easier to understand the model's prediction of future changes.

Model Predictions
Predictions of future infestations are summarized in Table 4. These calculations are based on the Ministry's previous inventory and do not include the effect of harvesting or management decisions. Different simulations estimate that by 2020 between 64 and 70 percent of the pine volume in BC forests will have been killed by MPB attacks. These assessments were made for the entire forested area, or Whole Land Base (WLB), of the province. The Ministry has made estimates of the spread of infestation in the Timber Harvesting Land Base (THLB), which is a smaller subset of the WLB. According to the Ministry, the cumulative percentage of pine volume killed in the THLB could reach 55% by 2020 (BC Ministry of Forests, 2015c). Taking this difference in land base or study area into account, the results of our model can be considered a complement to the Ministry's calculations, particularly for zones outside the THLB. Both studies predict that the spread of infestations in the province will subside in the future.
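The aggregate estimate behind Tables 3 and 4 can be sketched as a simple volume-weighted ratio. This is an illustrative Python sketch with synthetic density values, not the Ministry's inventory data.

```python
import numpy as np

def percent_volume_killed(infested, volume_density):
    """Percentage of total pine volume falling inside simulated-infested cells."""
    infested = np.asarray(infested, bool)
    vol = np.asarray(volume_density, float)
    return 100.0 * vol[infested].sum() / vol.sum()

# Synthetic 2x2 example: per-cell pine volume density and a binary infestation map
density = np.array([[10.0, 0.0], [30.0, 60.0]])   # m^3 of pine per cell
infested = np.array([[1, 0], [0, 1]])
pct = percent_volume_killed(infested, density)
```

Weighting by volume density rather than counting infested pixels matters because infestation of a dense stand removes far more merchantable volume than infestation of a sparse one.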

Limitations of the Study
Applying a threshold to transform the initial continuous infestation maps into binary maps may entail a loss of information because we discard valid knowledge about the shape of the infestation curve, i.e., whether it grows faster or slower until it reaches saturation. However, we have shown (Appendix A) that, in general, infestation curves in all pixels seem to follow a similar sigmoidal pattern. Therefore, our choice of a single threshold to compute all binary maps appears to be a reasonable compromise between simplifying the analysis and retaining as much information relating to the infestation process as possible.
For a process that spreads spatially, the state of the surroundings clearly helps determine the expansion rate. Presumably, definitions of adjacency other than those used in the present study may lead to a better characterization of the infestation. Proximity laws based on insect phenology (e.g., flight time and strength, sensitivity to extreme heat or sunlight) may yield better results than the generic adjacency functions used in this work. Future work may shed some light on this crucial topic.

Conclusions
In this work, we built a spatial change model by integrating neighborhood selection and transition rule identification in one process. Although we did not have detailed information about distance and neighborhood effects of MPB infestations at the beginning of this study, we compensated for this lack of knowledge by calculating several mathematically independent functions of distance as predictors in the model, and allowed the rule identification algorithm to find the transition rule that best described the input data. We validated our model by comparison of its output with the observed data. Validation based on components of figure of merit gave us better insight into the performance of different algorithms applied in the model. In validation analyses, we noted that although the GLM simulations resulted in higher figures of merit, the RF algorithm was better able to estimate the quantity of change. We then used the validated model to predict the upcoming changes in the study area. By integrating neighborhood effects as variables in the parameterization of the model, we were able to simulate spatiotemporal complexities of a forest land change process.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Appendix A. Calculation of Threshold Value on Cumulative Pinus contorta Mortality Data
To calculate the initial threshold value (i.e., the cutoff value applied to the original infestation map), we first convert the original 0-1000 scale of the cumulative data into a 0%-100% scale. At every pixel, the infestation begins near 0% in 1999, then increases quickly and finally reaches a saturation level near 100% (that is, no trees are left in that pixel to attack). When sample curves of the cumulative infestation values are plotted as a function of time (see examples in Figure A1, left panel), they display an approximately sigmoidal behavior, although the onset year differs among pixels.
Figure A1. Three sample curves of cumulative (left panels) and annual (right panels) infestation percentages.

Next, we determine the annual infestation percentage by computing the numerical derivative of the cumulative curves. The resulting curves (Figure A1, right panel) now show a coarse bell-like shape with a given center and width.
If we conjecture that the curves represent unimodal and symmetric distributions, we can calculate, at each pixel i, (a) the centroid c_i as the first statistical moment, and (b) a proxy of the width w_i as the square root of the second moment (i.e., the variance), as follows:

c_i = \frac{\sum_j t_j \, p_{ij}}{\sum_j p_{ij}}  (A1)

and:

w_i = \sqrt{\frac{\sum_j (t_j - c_i)^2 \, p_{ij}}{\sum_j p_{ij}}}  (A2)

where p_{ij} stands for the infestation percentage for pixel i at year t_j (t_1 = 1999, ..., t_16 = 2014). We then use the centroids c_i to re-center the offset of the original cumulative infestation curves. A random selection of 50,000 offset-corrected cumulative curves is shown in Figure A2. This figure shows the resulting re-centered data points, where it is now easier to make out an approximate sigmoidal shape for the cumulative infestation process. Notice that no correction for different widths has been carried out. Next, we compute the mean cumulative infestation in small intervals along the temporal axis. When those mean infestation data points are plotted as a function of time (Figure A3, red dots), the sigmoidal-like shape of the infestation process becomes more conspicuous. Finally, we plot a normal cumulative distribution function (CDF, hereafter I(t)) with standard deviation \sigma = w (Figure A3, blue curve), where:

I(t) = \frac{100}{w\sqrt{2\pi}} \int_{-\infty}^{t} e^{-u^2/(2w^2)} \, du  (A3)

For reference, we have also plotted (in green) the probability density function of the corresponding normal distribution, I'(t), in Figure A3. Clearly, I(t) serves as a good approximation to the observed cumulative infestation curve. In the curves I(t) and I'(t) we can distinguish five phases as we move from left to right:

1. initial negligible or very low infestation that spreads slowly (I(t) and I'(t) are close to zero);
2. a transitional phase in which the infestation starts picking up speed (I(t) still low but I'(t) increases);
3. fast but steady infestation that increases constantly (I(t) increases but I'(t) reaches a maximum);
4. a transitional phase during which the infestation slows down (I(t) increases further, I'(t) decreases);
5. saturation level (I(t) highest, I'(t) very close to zero).

Figure A3. Mean observed and offset-corrected cumulative infestation (red dots); normal cumulative distribution function I(t) (CDF, in blue, scaled to 0-100), whose standard deviation is explained in the text; and its derivative I'(t), i.e., the normal probability density function (in green).
We assume in the present study that a critical moment in the spread of the infestation occurs when, in phase 2 above, the infestation starts picking up speed, thereby accelerating its spread. Consequently, we use the infestation level at which that acceleration is maximum as our threshold value I_{th}, which enables us to categorize the continuous cumulative infestation maps into binary maps I_b such that:

I_b = \begin{cases} 0 \ (\text{not infested}) & I \le I_{th} \\ 1 \ (\text{infested}) & I > I_{th} \end{cases}  (A4)

The acceleration can be calculated as the second derivative of I(t). Therefore, to find its maximum we must compute the third derivative and solve I'''(t) = 0. Assuming that I(t) is well approximated by a normal CDF, we know that, for a centered normal distribution:

I'''(t) = \frac{t^2 - \sigma^2}{\sigma^4} \, I'(t)  (A5)

Setting I'''(t) = 0, we arrive at t = \pm\sigma, excluding the trivial solutions t = \pm\infty. Because t = +\sigma corresponds to phase 4 above, i.e., the decelerating phase, we take t_{th} = -\sigma as the temporal location of the sought-after threshold. Inserting t_{th} into I(t) yields, finally, I(-\sigma) = 0.1586553 (or 15.86553 on our percentage scale). Figure A4 depicts the procedure.

Figure A4. Cumulative distribution function (red) of the normal distribution and its second derivative (blue). The location of the maximum of the second derivative and the corresponding ordinate are indicated with a dashed line.
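The derivation above can be checked numerically. The sketch below (not the original analysis code) builds the standard normal CDF from the error function, differentiates it twice on a fine grid, and confirms that the acceleration peaks near t = -sigma, where the infestation level is approximately 15.87%.

```python
import numpy as np
from math import erf, sqrt

def normal_cdf(t, sigma=1.0):
    """CDF of a centered normal distribution with standard deviation sigma."""
    return 0.5 * (1.0 + erf(t / (sigma * sqrt(2.0))))

sigma = 1.0
t = np.linspace(-5, 5, 100001)
cdf = np.vectorize(normal_cdf)(t)

# Second derivative of the CDF = "acceleration" of the infestation curve
second_deriv = np.gradient(np.gradient(cdf, t), t)
t_max = t[np.argmax(second_deriv)]          # location of maximum acceleration, ~= -sigma

threshold_pct = 100 * normal_cdf(-sigma)    # ~= 15.8655, the threshold I_th in the text
```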

Appendix B. Plots of Average Mortality vs. Predictors
Each predictor variable in Table 1 (see main text) was binned into consecutive intervals, and within each interval we averaged over the corresponding 0 and 1 pixels in the map. This yielded the expected infestation rates, which were then logit-transformed and plotted vs. the centers of the intervals, as shown in Figure A6.
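The binning-and-logit procedure can be sketched as follows. The predictor values, bin edges, and underlying infestation relationship here are synthetic stand-ins, not the paper's data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical predictor (e.g., elevation in metres) and 0/1 infestation pixels
elevation = rng.uniform(500, 2000, 10_000)
p_true = 1 / (1 + np.exp(-(elevation - 1200) / 150))  # assumed relationship
infested = (rng.random(10_000) < p_true).astype(int)

# Bin the predictor into consecutive intervals
bins = np.linspace(500, 2000, 16)
centers = 0.5 * (bins[:-1] + bins[1:])
idx = np.digitize(elevation, bins) - 1

# Average the 0/1 pixels per bin -> expected infestation rate per interval
rates = np.array([infested[idx == k].mean() for k in range(len(centers))])
rates = np.clip(rates, 1e-3, 1 - 1e-3)    # avoid logit(0) / logit(1)

# Logit-transform; these values would be plotted vs. the bin centers
logit = np.log(rates / (1 - rates))
```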

Appendix C. Graphic Representation of the Four Different Neighborhood Types Implemented
The four neighborhood types implemented in this model are (a) no-weight: the neighborhood effect is uniformly distributed; (b) linear: the neighborhood effect decreases linearly with increasing distance; (c) inverse-distance: the neighborhood effect decreases in inverse proportion to distance; and (d) squared-inverse-distance: the neighborhood effect decreases in inverse proportion to the squared distance.
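The four weighting schemes can be sketched as distance-decay kernels over a square window around a focal pixel. The window radius and the sum-to-one normalization below are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def kernel(radius, kind):
    """Build a (2*radius+1)^2 neighborhood weight kernel of the given kind,
    excluding the focal pixel and normalized so the weights sum to 1."""
    y, x = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    d = np.sqrt(x**2 + y**2)
    w = np.zeros_like(d)
    inside = (d > 0) & (d <= radius)          # neighbors within the radius
    if kind == "no-weight":                   # (a) uniform
        w[inside] = 1.0
    elif kind == "linear":                    # (b) linear decay with distance
        w[inside] = 1.0 - d[inside] / radius
    elif kind == "inverse":                   # (c) 1/d decay
        w[inside] = 1.0 / d[inside]
    elif kind == "inverse-squared":           # (d) 1/d^2 decay
        w[inside] = 1.0 / d[inside]**2
    return w / w.sum()

k = kernel(3, "inverse-squared")
# Closer neighbors carry more weight under the distance-decaying kernels.
```

In practice, convolving such a kernel with the binary infestation map of the previous year yields the weighted neighborhood-of-infestation predictor for each pixel.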

Appendix D. Model Parameterization
The results of parameterization of the GLM1, GLM2, and RF models are summarized in this appendix.
The GLM1 algorithm yielded an area under the receiver operating characteristic curve (AUC) of 0.7802905. The maximum kappa and Youden's J values obtained in model parameterization were 0.3489783 and 0.4540825, respectively. Statistics for this model's variables are presented in Table A1.
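The three parameterization statistics reported above can be computed from observed labels and model probabilities as sketched below. The data here are synthetic, so the resulting values will not match the paper's; the point is the procedure: AUC over all thresholds, then a scan of candidate cutoffs for the maxima of kappa and Youden's J (J = sensitivity + specificity - 1).

```python
import numpy as np
from sklearn.metrics import roc_auc_score, cohen_kappa_score

rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, 5000)                                   # observed 0/1 labels
y_prob = np.clip(0.5 * y_true + rng.normal(0.25, 0.2, 5000), 0, 1)  # model probabilities

auc = roc_auc_score(y_true, y_prob)

# Scan candidate cutoffs for the maxima of kappa and Youden's J
cutoffs = np.linspace(0.05, 0.95, 19)
kappas, js = [], []
for c in cutoffs:
    y_pred = (y_prob > c).astype(int)
    kappas.append(cohen_kappa_score(y_true, y_pred))
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    js.append(tp / (tp + fn) + tn / (tn + fp) - 1)

kappa_cutoff = cutoffs[int(np.argmax(kappas))]
youden_cutoff = cutoffs[int(np.argmax(js))]
```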