Comparisons of Statistical Approaches for Modelling Land-Use Change

....................................................................................................................................................... iii Acknowledgements ...................................................................................................................................... iv Table of


Introduction
A variety of approaches are used in land-change science to represent land-use change (e.g., statistical [1]; cellular automata [2]; agent based [3]).Among the methods used to model land-use change, empirical statistical models are often used to detect the effects of drivers [4] of change as well as to simulate potential quantity and pattern outcomes of land-use change [5].To date, many statistical Land 2018, 7, 144 2 of 33 models have been used to model land-use change (e.g., [6,7]), of which logistic regression and linear regression are the most frequent [8].In some cases, statistical models are combined with other methods, for instance, Markov chain approaches are often coupled with logistic regression (e.g., [9]), cellular automata (e.g., [10,11]) and genetic algorithm (e.g., [12]) to model land-use change.
The benefit of using statistical models arises from their ease of implementation and application as well as their ability to provide a general representation of land-use change given scarce time, resources, and data.Furthermore, statistical models have a longer history and frequency of use, and they are more likely to be used in other research projects that do not have a direct collaboration with their creators compared to process-based models [13].The trade-off, however, is that statistical modelling approaches poorly represent the explicit processes associated with human decision-making (e.g., farmers' planting decision on agricultural lands), which is better represented by process-based models such as agent-based models [14].
Despite the utility and widespread use of statistical methods for modelling land-use change, there is a lack of review or assessment of the performance of more than two different statistical methods (or different combinations) with the same dataset at the same location.Interesting exceptions to this lacuna include: a comparison of two variants of logistic regression and survival analysis applied to an urban sprawl model (i.e., a controlled setting), which showed superior performance of survival analysis [15]; and a comparison of input, output, and validation maps from nine different land-change models applied to 12 different locations [7].Results of the nine models were compared against observed change and a null model of persistence from the initial time-step of land-use data.However, the statistical methods behind the presented models, overall model accuracy, or accuracy by land-use type were not discussed.Furthermore, the models were applied to different data and when they were applied to the same location the format of the data were different, which obfuscated their comparison.
Other scientific fields of study have also noted the lack of comparative studies of more than two statistical modelling approaches applied to the same data (e.g., ecology [16]).While some comparisons exist (e.g., species presence/absence [17]; landslide susceptibility [18]), a gap remains in the land-use literature.We seek to contribute to overcoming this gap in the comparative statistical literature on modelling land-use change by comparing the performance of four conceptual approaches (stochastic process, parametric model, non-parametric model, and time series model) to modelling land-use change.These approaches span a range of frequency of application in land-use modelling literature and are commonly operationalized as Markov chain (MC), logistic regression (LR), generalized additive model (GAM), and survival analysis (SA).The four modelling approaches are applied to a common study area using the same data.Their performance is evaluated in terms of their ability to predict overall land-use change as well as change by specific land-use types.Through this analysis we answer the questions: what is the overall accuracy of different statistical methods in representing land-use change, and how well do these methods compare across individual land uses?

Study Area
Our study area is situated in the Region of Waterloo, which comprises 1369 km 2 in Southern Ontario, Canada (Figure 1).The region is composed of three cities (Kitchener, Waterloo, and Cambridge) and four townships (Wellesley, Woolwich, Wilmot, and North Dumfries), within which exists a mixture of residential, commercial, agricultural, and other land-use types.The Region has been experiencing above average population growth and subsequent land-use change due in part to employment opportunities in the high-tech research and development sector, low-cost housing relative to the City of Toronto, its location along highway 401 (the busiest highway in North America [19]), and close proximity to the City of Toronto (the 5th largest North American financial center [20]).medium-density residential and 16 percent of them converted to commercial land use.Other noticeable land-use changes included changes to transportation (9 percent), high-density residential (8 percent) and a number of properties were found in transition and under development (7 percent).The remaining 9 percent of change were attributed to the remaining 6 land-use types.Within these overall land-use changes, the dominant transitions were under development to residential (34.24 percent) and commercial to residential (20.77 percent).

Data
Ownership property parcel was used as the observational unit of land-use change in this study, and the data were acquired from Teranet for the Region of Waterloo for the year 2010.Land-cover-raster data were acquired from the Modelling and Spatial Analysis Lab at the University of Waterloo (Waterloo, ON, Canada) [22], which comprised 8 land-cover types at a resolution of 80 cm.These land-cover data were used to classify property parcel land-use data for the Region of Waterloo, for the years 2006, 2010, and 2015 [22].The land-use classes in these data consist of the following ten land-use types and water: low-density residential, medium-density residential, The population of the Region of Waterloo increased from 507,096 to 535,154 from 2011 to 2016, which was a 5.5 percent increase in population that exceeded the 4.6 percent provincial and the 5 percent national population growth rates recorded in 2016 [21].The Kitchener-Cambridge-Waterloo Census Metropolitan Area had the 4th largest population in Ontario and the 10th largest population in Canada according to the 2016 Census data.Moreover, the population growth rate in the Kitchener-Cambridge-Waterloo census metropolitan area ranked the 4th highest in Ontario in 2016.The fast growth rate for the region has resulted in urban sprawl, which influences the types of land-use transitions in the region.
Between 2010 and 2015, 10,606 of 160,053 (6.63 percent) individual property ownership parcels experienced land-use changes [22] (Figure 1), in which 51 percent of the parcels converted to medium-density residential land use and 16 percent of them converted to commercial land use.Other noticeable land-use changes included changes to transportation (9 percent), high-density residential (8 percent) and a number of properties were found in transition and under development (7 percent).The remaining 9 percent of change were attributed to the remaining six land-use types.Within these overall land-use changes, the dominant transitions were under development to residential (34.24 percent) and commercial to residential (20.77 percent).

Data
Ownership property parcel was used as the observational unit of land-use change in this study, and the data were acquired from Teranet for the Region of Waterloo for the year 2010.Land-cover-raster data were acquired from the Modelling and Spatial Analysis Lab at the University of Waterloo (Waterloo, ON, Canada) [22], which comprised eight land-cover types at a resolution of 80 cm.These land-cover data were used to classify property parcel land-use data for the Region of Waterloo, for the years 2006, 2010, and 2015 [22].The land-use classes in these data consist of the following ten land-use types and water: low-density residential, medium-density residential, high-density residential, commercial, industrial, institution, transportation, protected area and recreation, agriculture, water, and under development.
A variety of drivers of land-use change can be found in the literature, which varies among models focused on specific types of land-use change versus comprehensive land-use models (i.e., those representing all types of land-use change in a study area).Often these variables are grouped by similarity and then individually evaluated.For example, conceptual groupings include: population, affluence, and technology [23]; key policies, technology, markets, demography, endowments, infrastructure, and institutions [24]; and demography, economy and infrastructure, and geomorphology [25].Drivers of land-use change used in this paper were selected among commonly recognized predictors among these groupings, such as mean slope and elevation (e.g., [25,26]) and population density (e.g., [27,28]).
In this paper, drivers of land-use change were conceptualized into geometric variables (e.g., parcel area and area of Dissemination Area), site variables (e.g., slope and elevation), demographic variables (e.g., population density), and spatial variables (e.g., distance from a parcel to the nearest highway ramp).Geometric variables were calculated using ArcGIS.Demographic variables were acquired from the 2011 Census data (Dissemination Areas boundary polygons and demographic data [29]).The Canadian Census is taken every five years and 2011 provided the closest year to 2010.In addition to these drivers, variables were derived from data acquired from the Ontario Ministry of Natural Resources and Forestry (Rivers, Water Bodies, Wooded Areas, and a Digital Elevation Model (10 m resolution)), and the Ontario Ministry of Transportation (Toronto, ON, Canada) (Road Network).A final dataset created by the Modelling and Spatial Analysis Lab represented the access point of the road network to all major 400 series highways in the province of Ontario.In addition to variable creation, variable transformation was performed to unify the units of distance variables from meters to kilometers, units of population density variables from person/m 2 to person/km 2 , units of parcel geometry variables from m 2 to km 2 , and units of elevation from m 2 to km 2 .The measurement transformations ensure the gradients of values of all predictors are on a similar scale.The detailed pre-processing steps of the land-use data are documented in Appendix A, and a full list of variables used in model building can be found in Appendix B.
In addition to the aforementioned data, zoning is typically considered a critical factor that regulates, restricts, and fosters the types and locations of land-use change in a municipality.Thus, a change of land use can be prohibited to occur in an area zoned for a particular land-use type [30].Despite common zoning policies (e.g., by land-use type or minimum lot size), zoning may vary across municipalities and may change over time within a single municipality, which creates a spatially and temporally heterogeneous set of zoning rules.Furthermore, variations in designated land-use zones may occur [31] and occur with different frequencies by municipality.Due to these inconsistencies in the zoning process among municipalities and the lack of any central repository consolidating the Master Plans of the 444 municipalities of Ontario, zoning was excluded as a predictor variable in the presented statistical modelling approaches.Although the presented research uses the case study of the Region of Waterloo, inclusion of the zoning information as a driver in a statistical model to predict future land-use changes would make the model non-transferable, which means that the model is restricted to predict local land-use changes and contribute to local land-use planning area.Our goal is to construct statistical models that can be applied widely across the province of Ontario and more broadly Canada to represent land-use change patterns.

Markov Chain
Markov Chain (MC) is a statistical method that incorporates stochasticity in the process of changes between states.A discrete-time MC requires a countable set of states and events that are mutually exclusive and collectively exhaustive [32].Moreover, uniform length is required between any two time points.In this research, a discrete-time MC method was used to model land-use changes that occurred between 2010 and 2015 at a parcel level, which is considered a first-order discrete-time MC since the status at a given time only depends on the status of the nearest past state.Using recent land-use data (2015) and the past land-use data (2010), an eleven by eleven transition probability matrix that contains only one-step transition probabilities was calculated.The probability matrix was obtained by observing the frequencies of land-use changes between the two years.To test the performance of MC, a random value between 0 and 1 was assigned to each parcel in the test set to represent the land-use transition probability from 2010 to 2015.For each starting state of land use, a probability space of 0 to 1 was divided into eleven non-overlapping segments based on the eleven one-step transition probabilities in their corresponding rows in the transition probability matrix.The segment within which the randomly drawn parcel value lies determines the future land use of the parcel in 2015.Then, the predicted land uses of all parcels in the test dataset were compared to the real land uses of those parcels in 2015 to determine the performance of MC for making prediction of future land uses.

Logistic Regression
Logistic regression (LR) is a type of statistical method used to model categorical variables and is a member of generalized linear models.Its response variable follows a binomial distribution and connects to the linear combination of all covariates though a logit-link function.LR can be used to simulate and predict categorical land-use change outcomes [33].Multiple LR, which contains more than one independent variable, has often been used to model land-cover change (e.g., [34]).Multiple LR was used in the presented research among all other models in the family of LR (e.g., ordinal LR and multinomial LR).When predicting parcels with unknown land-use types, either 0 or 1 was assigned to each parcel based on the estimated probability and a land-use-change threshold.If the land-use change status at a location was determined to be 1, it means the model predicted the parcel converts from one land-use type to a target land-use type; otherwise, it means no change occurred.

Generalized Additive Model
Generalized additive models (GAMs) extend generalized linear models (GLMs) by using a series of smoothing splines to express the non-linear relationship between the expected mean of responses and a set of predictors [35].GAM has been implemented in land-use science in addition to GLM (e.g., [36]).The advantage of GAM over GLM is that it has the ability to represent non-linear relationships, which ensures that more realistic situations can be represented.However, the effect of a non-linear predictor modelled by a smoothing spline cannot be interpreted using one consistent coefficient.Even though sometimes a smoothing spline can be replaced by a simple function (e.g., square and cubic function) across the entire range of variable values when we increase the degree of smoothness of the smoothing spline, the transformed predictor can be very difficult to interpret compared to linear predictors.The presented research used the additive logistic regression model from the family of GAMs.Additive logistic regression was chosen because its response variables (i.e., statuses of land-use changes) are binary.Similar to LR, a value of 0 or 1 was assigned to each parcel to determine the status of predicted land-use change.

Survival Analysis
SA analysis (SA) is used mostly in health and clinical studies to predict the mortality rate or recovery rate of patients (e.g., recovery rate from injury or diseases).SA can handle both time-varying and time invariant variables and can take into account incomplete data.SA models use both the duration of each observation in the experiment and an indicator variable showing the occurrence of the event of interest as response variables.All other potential factors that influence the occurrence of the event are used as covariates to calculate the success or failure rate of the event at the time.
Among many SA techniques, a Cox proportional hazards model [37] was used because it assumes constant effects of covariates on hazards over time, which is consistent with the assumptions of predictor effects in other models (i.e., LR and GAM).In the context of modelling land-use change, the event of interest is the change from one land use to another between two time points and the failure time of a parcel would be the time that a land-use change occurs.Our land-use data include only two points in time 2010 and 2015 giving two observations in the five-year time period.However, multiple measurements of land use for each parcel in the study period are required to determine a survival analysis failure time.Therefore, a time variable was created by randomly generating integers in the range of 1 to 6 to represent the failure time of each parcel within the 2010 (time = 1) and 2015 (time = 6) period.Other values of time represent years between 2010 and 2015 in an ascending order.Discrete time steps were used instead of continuous time to maintain consistency with land-use data and typical annual time steps used in modelling land-use change.

Analysis
As part of our comparative analysis of different statistical methods for representing land-use change, we also evaluated the effects of sample size on model performance.For LR, GAM, and SA, we first created a 'full dataset' for each land-use type comprising all property parcels with the same land-use type in 2015 (Figure 2a).Binary response variables, Y, were created for each parcel in the full dataset to indicate if the state of a parcel underwent a change in land use from 2010 to 2015 or remained unchanged.Eleven full datasets were created corresponding to the eleven defined land uses in 2015.
Land 2018, 7, x FOR PEER REVIEW 6 of 36 assumptions of predictor effects in other models (i.e., LR and GAM).In the context of modelling land-use change, the event of interest is the change from one land use to another between two time points and the failure time of a parcel would be the time that a land-use change occurs.Our land-use data include only two points in time 2010 and 2015 giving two observations in the five-year time period.However, multiple measurements of land use for each parcel in the study period are required to determine a survival analysis failure time.Therefore, a time variable was created by randomly generating integers in the range of 1 to 6 to represent the failure time of each parcel within the 2010 (time = 1) and 2015 (time = 6) period.Other values of time represent years between 2010 and 2015 in an ascending order.Discrete time steps were used instead of continuous time to maintain consistency with land-use data and typical annual time steps used in modelling land-use change.

Analysis
As part of our comparative analysis of different statistical methods for representing land-use change, we also evaluated the effects of sample size on model performance.For LR, GAM, and SA, we first created a 'full dataset' for each land-use type comprising all property parcels with the same land-use type in 2015 (Figure 2a).Binary response variables, Y, were created for each parcel in the full dataset to indicate if the state of a parcel underwent a change in land use from 2010 to 2015 or remained unchanged.Eleven full datasets were created corresponding to the eleven defined land uses in 2015.Each of the eleven full datasets was used to create two new balanced datasets (22 in total), a full and balanced (full-balanced) dataset and a reduced and balanced (reduced-balanced) dataset (Figure 2b).A balanced dataset contains the same number of observations for all existing levels of a categorical independent variable [38], which in this study refers to the dataset containing an equal number of changed and unchanged parcels.One benefit of using a balanced dataset is that it eases the interpretation and use of probabilistic thresholds when predicting change.In this study, when the outcomes of LR, GAM, and SA probabilities of parcel change are greater than or equal to 0.5, the parcel was classified as changed and otherwise was classified as unchanged.Because LR and GAM Each of the eleven full datasets was used to create two new balanced datasets (22 in total), a full and balanced (full-balanced) dataset and a reduced and balanced (reduced-balanced) dataset (Figure 2b).A balanced dataset contains the same number of observations for all existing levels of a categorical independent variable [38], which in this study refers to the dataset containing an equal number of changed and unchanged parcels.One benefit of using a balanced dataset is that it eases the interpretation and use of probabilistic thresholds when predicting change.In this study, when the outcomes of LR, GAM, and SA probabilities of parcel change are greater than or equal to 0.5, the parcel Land 2018, 7, 144 7 of 33 was classified as changed and otherwise was classified as unchanged.Because LR and GAM produce probabilities and SA outputs can be converted to probabilities, by converting its hazard function to a survival function [6], the three statistical approaches can be easily compared.
A full-balanced dataset corresponding to a specific land use (e.g., industrial) was created by acquiring all parcels that changed to that land use (e.g., industrial) and a matching number of parcels that persisted as that land use over the 2010-2015 time period (e.g., unchanged industrial parcels).The number of changed parcels in a reduced-balanced dataset is the same as the number of unchanged parcels and is set to a maximum of 500 to reduce the potential of over-fitting, yet maintain enough data to construct a model with many predictors and to assess the effects of sample size on model performance.When the number of changed parcels is less than 500 for one type of land-use change, the full-and reduced-balanced datasets for this land-use change are the same.The size of full-balanced datasets ranges from 10 to 10,816 points (Table 1).In our data, no parcel had changed to agricultural land-use from 2010 to 2015.Moreover, only five and thirty-six parcels had changed from some land uses to institution and water, respectively.Thus, there are insufficient data for LR, GAM, and SA to model agricultural, institutional, and water land-use changes with many predictors.The full-and reduced-balanced datasets for each land-use type were partitioned into training (70%) and test (30%) datasets using the traditional hold-out method (e.g., [39]) (Figure 2c).Then, a 10-fold conventional cross-validation (CCV) was applied to the training data to produce an averaged cross-validation accuracy for each land-use type and an overall cross-validation accuracy for each statistical method and the associated full-and reduced-balanced datasets (Figure 2d).A second cross-validation procedure was then conducted using 10-fold spatial cross-validation (SCV, [40]) to determine if spatial autocorrelation was present and had an effect on statistical model performance.Spatial autocorrelation is a common problem in land-use modelling, which violates most statistical model assumptions about independent observations.With CCV, parcels are randomly selected for training and test data, while SCV partitions the parcel data using k-means clustering [41] based on their spatial coordinates.Average CCV and SCV accuracy values for each land-use type and each balanced dataset within the training data were reported and used to calculate an overall model accuracy for the four statistical models evaluated (Figure 2d).Predictor coefficients were estimated using each statistical method and only significant predictors (p-values less or equal to 0.1) were retained from the CCV and SCV that produced the highest accuracy (Figure 2f) for final model development.Final models were fit against the training data and final accuracy values were determined against the test data (Figure 2g).
The datasets used in the MC approach involved combining all full-balanced datasets into one MC full-balanced dataset and all reduced-balanced datasets into one MC reduced-balanced dataset.An additional 500 agricultural parcels were added to each of the MC full-balanced and MC reduced-balanced datasets, which were excluded from the LR, GAM, and SA models due to no occurrences of change.From MC full-balanced dataset and reduced-balanced dataset, 70 percent were randomly selected for training models with 10-fold CCV and 10-fold SCV and the rest were used for testing the performance of MC.Using these datasets, two transition probability matrices were created as part of the MC modelling process, one for the MC full-balanced dataset and one for the MC reduced-balanced dataset.The probability matrices contain the probability of a land-use transition from each land-use type in 2010 to another land-use type in 2015.
The performance of the statistical models was compared using the overall cross-validation accuracies for predicting land-use change.The overall cross-validation accuracies of a method were calculated by averaging averaged cross-validation accuracies for changes in LU between 2010 and 2015.Thus, methods that produce the highest overall cross-validation accuracies and the highest averaged final model accuracies by LU type were determined.Furthermore, the distribution of accuracy values across types of LU change was determined.

Implementation of Statistical Approaches
The R statistical software [42] was used to implement the four statistical methods evaluated.Data partitioning for CCV and SCV was done using the creatFolds function from the caret package [43] and partition.kmeansfunction from the sperrorest package [40], respectively.The creation of the transition probability matrices for the MC method used the prop.tablefunction, which was applied to a contingency table of land-use classes.The fitting of CCV and SCV models for GAM was conducted using the train function that adopts the algorithm of GAM from the mgcv package [44][45][46][47][48].The train function with gam from mgcv package fits predictors with default smoothing functions, the thin plate regression splines that are considered robust smoothers regardless of the dimension of basis functions, which cannot be modified.The coxph function in the survival package [49] was used to construct a Cox proportional hazard model [37].Since the coxph function in R can only handle right-censored data, all time measurements are considered precise and only right-censored data (i.e., parcels have not experienced land-use change at the end of study) exist in this study.

Conventional Cross-Validation and Final Models
Results from the 10-fold CCV by method demonstrated that the overall CCV accuracy was highest for GAM, followed by LR, SA, and MC for both the full-and reduced-balanced training datasets (Table 2).GAM achieved 85.17 percent and 82.39 percent overall accuracies for full-and reduced-balanced training datasets, respectively.For full-balanced training datasets, the overall accuracy of GAM is 4.2 percent, 4.26 percent and 38.12 percent higher than the overall accuracies of LR, SA, and MC, respectively.For reduced-balanced training datasets, the overall accuracy of GAM is 2.15 percent, 2.6 percent and 31.71percent higher than the overall accuracies of LR, SA, and MC, respectively.Excluding MC, the increase in overall CCV accuracy due to reducing the sample size from the full-balanced to the reduced-balanced dataset was highest for GAM (2.78 percent).The effects of samples size were more strongly observed among individual land-use changes.For example, the largest difference among the land-use changes occurred for medium-density residential under GAM (6.98 percent).
Despite the higher computational overhead, GAM outperformed the other three modelling approaches in overall accuracy and yielded the highest CCV accuracy for six of eight land-use changes with full-balanced datasets (medium-density residential, high-density residential, commercial, transportation, protected area and recreation, under development) and five of eight land-use changes with reduced-balanced datasets (high-density residential, commercial, transportation, protected area and recreation, under development; Table 2).While LR obtained the highest accuracy for industrial, the benefit was minor (0.1 percent difference) relative to GAM.In contrast, SA had an accuracy 5.1 percent greater than GAM for low-density residential and had an accuracy 2.18 percent higher than GAM for the reduced-balanced medium-density residential land use.The MC approach was able to contribute to the three land-use types (institution, agriculture and water) that could not be modelled by the other methods and performed surprisingly well for agriculture due to the observed low frequency of agricultural land-use changes in the data.Differences in averaged land-use accuracies among methods were greater than differences between overall accuracies (Table 2).Excluding MC, the differences in averaged accuracies across the other three models were lowest for industrial (1.21 percent) and greatest for full-balanced commercial (13.16 percent).These results suggest that model choice is critical to gaining an accurate representation of pattern for some types of land-use change.In addition, our results suggest that greatly increasing the sample size from reduced-balanced to full-balanced for land-use changes of medium-density residential, high-density residential, commercial and under development had little effect on the within land-use accuracy for LR, whereby the difference is less than 1.18 percent for all land-use types except transportation (3.53 percent).The differences between sample sizes for individual types were greater for GAM and SA (Appendix D).
The final GAM, LR, and SA models were those that produced the highest accuracy in the 10-fold CCV process for specific types of land-use change and held significant predictors.The set of final models were evaluated against partitioned test datasets.The ranking of the overall final model accuracies by method was the same as that obtained from CCV (i.e., GAM > LR > SA) with reduced-balanced training data.The overall accuracy of SA was slightly higher than the overall accuracy of LR for final models with full-balanced test data.Since MC does not have a specific form and does not contain any predictor variables, it was excluded from this comparison.
The overall final models' accuracies with full-balanced test datasets are again slightly higher than the overall final models' accuracies with reduced-balanced test datasets.For full-balanced test datasets, the overall accuracy of GAM is 4.37 percent and 4.27 percent higher than the overall accuracies of LR and SA, respectively.For reduced-balanced test datasets, the overall accuracy of GAM is 2.41 percent and 3.13 percent higher than the overall accuracies of LR and SA, respectively.This implies that the advantage of GAM predicting overall land-use changes with full-balanced dataset over reduced-balanced dataset is higher in the final models.Despite the higher accuracy in the final models, the differences between the three models are greater and suggest that the GAM is not only the best overall model, but also that it is more generalizable than LR and SA.
Among the final models GAM had the highest accuracy for six land-use changes (medium-density residential (full-balanced), high-density residential, commercial, industrial, transportation, and under development), SA performed best for two land-use changes (low-density residential and medium-density residential (reduced-balanced)), and LR performed best for two land-use changes (protected area and recreation, and under development (reduced-balanced), Table 3).Again, the differences among land-use change accuracies were much greater than the difference in overall accuracies.The coefficients of significant predictors in final models are listed in Appendix E.

Spatial Cross-Validation and Final Models
To evaluate the effects of spatial autocorrelation on model performance, spatial cross-validation (SCV) was conducted, which produced similar overall rankings to those attained in the CCV for the full-balanced dataset (i.e., GAM > SA > LR > MC).Results of the SCV using the reduced-balanced dataset held slightly different results with LR outperforming SA (i.e., GAM > LR > SA > MC) (Table 4).The overall SCV accuracies were slightly lower than the CCV accuracies (Table 4) and the differences in accuracies among the methods were also slightly reduced.The differences in overall SCV accuracies due to sample size within methods were minor (within 2 percent).Furthermore, the SCV results do not uniformly demonstrate that sample size positively influences overall SCV accuracies, which is likely due to the heterogeneity in the reduced-balanced dataset introduced due to partitioning the training and testing data spatially.By partitioning the training and testing data spatially, models are constructed that have reduced variation in predictor values but the differences among the derived models are greater.The outcome of averaging these Land 2018, 7, 144 11 of 33 differences yields results that are more dependent on the spatial segmentation, which differs from the complete and random mixing of partitions generated from the CCV.The greater mixing in CCV yields more similar models and a clear effect of sample size on model performance.
In contrast, sample size had a substantial effect on some SCV specific land-use change accuracies (e.g., GAM medium-density residential had >6% difference between full-and reduced-balanced).In general, GAM performed better using the full-balanced dataset, producing the highest accuracy models for six land-use changes (medium-density residential, high-density residential, commercial, industrial, transportation, and under development) and only the highest accuracy model for three land-use changes with reduced-balanced training dataset (high-density residential, industrial, and transportation).LR performed best for commercial, protected area and recreation, and under development land-use changes with the reduced-balanced training dataset and SA performed best for low-density residential regardless of sample size, and medium-density residential with reduced-balanced training dataset.Similar to the CCV, using SCV, MC was able to contribute to those land-use changes that could not be modeled using the other approaches (institution, agriculture and water).The SCV results corroborate the CCV results in that model choice is critical to gaining an accurate representation of pattern for some land-use change types.
Among the final models derived from SCV, GAM performed the best overall as well as for the following full-balanced dataset land-use changes: medium-density residential, high-density residential, commercial, transportation and under development (Table 5).LR produced the second highest overall final model accuracies and best models of land-use change for low-density residential, industrial, protected area and recreation, for both full-and reduced-balanced datasets and tied SA for the highest under development accuracy using the reduced-balanced dataset.These results demonstrate that the model accuracies ranked with GAM > LR > SA, which is the same outcome as the SCV reduced-balanced dataset results.Results among specific land-use types and sample sizes confirm the findings made by analyzing results from CCV and CCV-final models that model choice is a more critical factor than sample size is for making predictions of land-use changes.Furthermore, SCV only contributed to alleviating over-fitting by a small amount as demonstrated by the minor reduction in overall model accuracies.We conclude that spatial autocorrelation in our data did not cause substantial over-fitting of our statistical models in terms of their overall accuracies but did have a recognizable impact on individual land-use changes and single methods.The coefficients of significant predictors in final models are listed in Appendix E.

The Combination of Statistical Methods
A theoretical land-use model was constructed by combining methods that produced the highest final model accuracies by land-use type (Table 6).Since SCV-final models generally alleviated over-fitting caused by spatial autocorrelation, final models derived from SCV were selected to form the theoretical method.Moreover, MC derived from CCV was selected since MC only accounted for frequencies of land-use changes.The theoretical land-use model produces an overall accuracy that is 3.98 percent and 4.52 percent higher than the best performing GAM models derived from SCV with full-and reduced-balanced datasets, respectively.When institution, agriculture and water are included (as represented by the MC model) the overall accuracy drops from 86.14 percent to 78.70 percent for full-balanced datasets and from 84.82 percent to 78.95 percent for reduced-balanced datasets; however, it is only through this mixed approach that all types of land-use changes can be modeled.

Discussion
The presented research sought to answer the question: what is the overall accuracy of different statistical methods in representing land-use change.To answer this question we developed, quantified, and compared the accuracy of four different conceptual approaches to modelling land-use change, i.e., parametric model, non-parametric model, time series, and stochastic process as represented by logistic regression (LR), general additive modelling (GAM), survival analysis (SA), and Markov chain (MC), respectively.These approaches were used to generate models of eleven different land-use types and were tested using 2010 and 2015 data for the Region of Waterloo, Ontario, Canada.Our results found that overall model accuracy was highest for GAM followed by SA, LR, and MC using CCV and final model selection held that ranking.When SCV was conducted, LR outperformed SA and MC was not competitive in either case.Given the close performance of GAM, LR, and SA, it is worth noting that the run time for the 10-fold CCV for LR, SA, and MC were less than 20 s (Appendix D).However, performing the 10-fold CCV for GAM took over eighty-five minutes for both full-and reduced-balanced datasets due to the iteration in back-fitting of smoothing functions.
We subsequently sought to answer the question: how do the accuracies differ among the tested statistical methods for specific land-use types?While GAM, again, outperformed the other modelling approaches, some specific land-use types were better modeled using each of LR, SA, and MC.In fact, the differences among methods by land-use type were greater than the overall model accuracies by method.To interrogate this concept further, all model derivations and analyses were repeated with two differently sized datasets, full-balanced and reduced-balanced, and a theoretical model comprising the best performing models of each type was derived.Our results demonstrate that not only is the choice of model by land-use type more important than sample size, but also that a hybrid land-use model comprising the best statistical modelling approaches for each land-use change outperformed the individual statistical approaches.
It is worth noting that while we found accuracies to typically increase with sample size, this was not the case when SCV was used in model derivation and selection.The inclusion of SCV slightly reduced the degree of over-fitting of models using CCV and demonstrated that spatial autocorrelation had little effect on our final model accuracies.It is also worth noting that while MC was not competitive, it was useful in providing representation of land-use changes that had small sample sizes and could not be represented using other methods or in other cases where there is no predictor data.Given that it has been shown to outperform other null models [50] and performs better when there are data over longer time scales [51,52], it should not be completely disregarded.

Modelling One-to-One Land-use Change versus Many-to-One Land-Use Change
LR and GAM that have been used to model land-use change are usually designed to represent one-to-one land-use change (e.g., with only two classes comprising non-urban to urban [15,53]; non-forest to forest and forest to non-forest [54]).SA has been also been used to model one-to-one land-use change using two land-use classes, non-urban to urban [15], and three one-to-one models of land-use changes from agriculture to three types of subdivisions [6].In this way, focusing on modelling one-to-one land-use change can reveal the effects of predictors on a specific transition between land uses providing more insight toward potential causal mechanisms for change.
Modelling specific land-use change transitions (i.e., one-to-one) is not only time-consuming when many land-use classes are involved, but the availability of land-use change samples can also be restrictive.For example, if our eleven land-use classes (n = 11) were modeled based on single transitions (one-to-one; r = 2 land uses; Figure 3a) then we could have (n!/(n − r)!) or 110 models.While some of these transitions would be unlikely (e.g., high-density residential to agriculture), the number of samples of many different individual transitions is small for our study area.In contrast, we represented each unique land use as the endpoint resulting from many possible points of origin (Figure 3b), yielding only eleven final models and overall accuracies greater than 85% for our best final models and theoretically combined model with full-balanced datasets.
Land 2018, 7, x FOR PEER REVIEW 13 of 36 that not only is the choice of model by land-use type more important than sample size, but also that a hybrid land-use model comprising the best statistical modelling approaches for each land-use change outperformed the individual statistical approaches.
It is worth noting that while we found accuracies to typically increase with sample size, this was not the case when SCV was used in model derivation and selection.The inclusion of SCV slightly reduced the degree of over-fitting of models using CCV and demonstrated that spatial autocorrelation had little effect on our final model accuracies.It is also worth noting that while MC was not competitive, it was useful in providing representation of land-use changes that had small sample sizes and could not be represented using other methods or in other cases where there is no predictor data.Given that it has been shown to outperform other null models [50] and performs better when there are data over longer time scales [51,52], it should not be completely disregarded.

Modelling One-to-One Land-use Change versus Many-to-One Land-Use Change
LR and GAM that have been used to model land-use change are usually designed to represent one-to-one land-use change (e.g., non-urban to urban [15,53]; non-forest to forest and forest to non-forest [54]).SA has been used to model one-to-one land-use change from non-urban to urban [15] and land-use changes from farm to three types of subdivisions [6].In this way, focusing on modelling one-to-one land-use change can reveal the effects of predictors on a specific transition between land uses providing more insight toward potential causal mechanisms for change.
Modelling specific land-use change transitions (i.e., one-to-one) is not only time-consuming when many land-use classes are involved, but the availability of land-use change samples can also be restrictive.For example, if our eleven land-use classes ( = 11) were modeled based on single transitions (one-to-one; = 2 land uses; Figure 3a) then we could have ( !/( − )!) or 110 models.While some of these transitions would be unlikely (e.g., high-density residential to agriculture), the number of samples of these individual transitions is small for our study area.In contrast, we represented each unique land use as the endpoint resulting from many possible points of origin (Figure 3b), yielding only eleven final models and overall accuracies greater than 85% for our best final models and theoretically combined model with full-balanced datasets.

Operationalizing the Combined Statistical Model
Prediction of future land uses in an area involves land-use competition, which was not modeled in the presented research.However, the statistical methods were chosen because they produce similar outputs in terms of probability of change to different land-use types.Future research will seek to determine the interactive effects between the tested modelling approaches and evaluate different decision-making algorithms for determining future land use at a given location.Existing models incorporate preferential ordering (e.g., [55]), highest fitness (e.g., [56]), and rescaled roulette-wheel approaches (e.g., [52]) among others.To the best of the authors' knowledge, a comparison among competing land-use models for land-use allocation to a specific location has not been published.The conditions of their success, requirements for application, and lessons learned would not only guide land-use change modelling researchers but would also help to synthesize what is typically a set of disparate case study models that have yet to coordinate the creation of community frameworks, models, or model components [13,57].

Future Directions
The presented comparative analysis focused on four methods that are most frequently used in land-use modelling literature and span four conceptual approaches to statistical analysis (stochastic process, parametric model, non-parametric model, and time series model).A variety of other approaches would be worth investigating, such as mixed effects models (e.g., [58]), especially mixed effects logistic regression and mixed effects additive logistic regression as well as machine learning techniques such as random forest (e.g., [59]) and support vector machine (e.g., [60]).Mixed effects logistic regression and mixed effects additive logistic regression are expected to perform better than LR and additive logistic regression, respectively, since random effects in mixed-effects models could take account of time-varying variables and spatial autocorrelation, but at the cost of increased model complexity and difficulty of interpretation.Machine learning approaches have been applied to classify land cover (e.g., [61][62][63][64]) and are more flexible than traditional statistical methods since they are spared from general assumptions of statistical methods such as linearity, independence, and an underlying distribution of data.Moreover, machine learning can benefit from a large amount of input data, both observations and predictors, and is less affected by multi-collinearity among predictors.We intend to extend and compare the presented research with these additional methods in the future and welcome others to do so as well using our data included with this publication.
Our comparison of statistical methods used traditional cross-validation and accuracy assessments.The inclusion of spatial cross-validation mitigated the effects of spatial autocorrelation and violation of independence among samples, which yielded models with slightly greater error than those derived from traditional cross-validation.These efforts differ from those used in land-use modelling that focus on partitioning quantity and location of change based on generated maps compared to observations as done by [7].We created balanced datasets that comprise locations of change for a specific land use and locations where that type of land use persisted throughout the time frame of our analysis.The tested models were constrained to these data and were not applied against all potential locations as is done in most models of land-use change.Using our approach we did not partition our results by quantity of change and location of change as the quantity remained fixed by the size of our balanced datasets.Future research will assess the maps generated from the tested approaches along with additional approaches such as mixed effects and machine learning as mentioned above.
In making comparisons between the presented model and other models of land-use change, it is worth noting that the presented models used vector parcel data as the minimum mapping unit rather than raster cells.The spatial scale was fixed to the spatial unit at which land-use change decisions are made (i.e., the ownership property parcel).Like raster-based land-use models, our generated maps will experience an increase in accuracy if aggregated to more coarse resolutions, but these have less conceptual meaning given our ownership property parcel approach.While use of property parcels offers alignment with observations that is more realistic than results obtained from raster-based models, modelling future scenarios using our approach faces the challenge of representing the process of parcelization or property parcel subdivision.Efforts in parcelization are underway (e.g., [65]), however, it remains an area in need of continued research.
We have shown the capacity for statistical models to represent land-use change with relatively high levels of accuracy, however, they are limited in their ability to represent the conditions, actors, behaviors, and decision-making processes that drive land-use change.To represent heterogeneity within the environment, actors driving change within the land system, and their interactions with each other and their environment, agent-based modelling provides an alternative to the constraints of statistical modelling.While agent-based modelling is widely used to represent land-use change [66], the approach faces challenges in the time required for model development, its ability to predict future outcomes, and a lack of available data about actor characteristics, behavior, and interactions [67].Furthermore, our understanding of statistical models has history and a level of transparency that makes their communication in publications more feasible and subsequently their adoption and use by other non-creators more likely [13].
As the study of complexity complements reductionism rather than displaces it [68], so too can statistical models form the basis for agent-based approaches and the creation of hybrid models that utilize the best aspects of both approaches can be developed.For example, in the absence of data about actors driving industrial and commercial development, logistic regression was used to create a model rich in the number of land-use types (11) [69,70].Not only did the logistic regression perform well, but few researchers are engaged in modelling the site-selection behavior of industrial and commercial actors using an agent-based modelling approach.A unique and related result from our statistical model comparison is that the land-use change that typically yielded the lowest accuracy across GAM, LR, and SA approaches was low-density residential.This outcome provides justification for the substantial number of agent-based models developed to represent urban sprawl and residential development (e.g., [71][72][73]), which could be argued as the most prominent focus among case study applications of agent-based models to land-use change in developed countries, with farming agents yielding a close second.
Current effort in the land-use modeling community using agent-based modelling approaches involve attempts to advance toward representing land-use change across large spatial extents [74].However, the challenges associated with deriving common agent types (e.g., human function types [14]) and collecting actor characteristic, behavior, and decision-making data are inhibiting these efforts.One approach to simplify the process and speed up advancements in this area is to construct hybrid models that use an agent-based framework and statistical models of land-use change as placeholders for agent behavioral models until novel approaches to data collection or agent behavioral development take place.The incorporation of statistical representations of agent behaviors can (1) produce outcomes repeatedly in a relatively short time period, (2) can be validated, and (3) can be applied across large spatial extents.With a hybrid ABM-statistical approach, one can get a representative model up and running quickly, and can provide a range of insights and findings that can be extended when behavior data become available.Such an approach may also facilitate the representation of human systems with natural systems, which has faced not only technical challenges, but also buy-in from natural scientists and funding agencies [75].

Conclusions
Parametric, non-parametric, time series, and stochastic models of land-use change were created and compared across three dimensions: accuracy (overall and by land-use type), sample size, and spatial independence via conventional versus spatial cross validation.To the best of the authors' knowledge, this is the first study to formally compare the relative performance of these statistical approaches as represented by LR, GAM, SA, and MC models of land-use change at the same location using the same data.Our results indicated that GAM outperformed the other three approaches in both overall accuracy and among most land-use accuracies.However, LR and SA were more accurate for specific land-use types (e.g., low-density residential and industrial) and MC was able to represent changes that could not be modeled by other approaches due to sample size restrictions.Therefore, a land-use model that combines the best models of each land-use type can outperform any one modelling approach and provide a richer representation of land use types.
The performance of the four tested methods and their relative ranking may differ for other regions since they are dependent on the quality and characteristics of the data to which they are applied and the underlying processes of land-use change may differ among regions.While we assessed the fit of statistical models to the outcomes of land-use change processes, a better fit may not improve our explanation and understanding of these processes or the models predictive capacity due to over-fitting.Despite these caveats, the presented comparison, of statistical land-use models, offers guidance to others selecting among statistical approaches for representing land-use change.Furthermore, the effects of sample size on model performance were quantified and were minor relative to model choice for specific types of land-use change, particularly when spatial cross-validation is incorporated into the modelling process.While the incorporation of spatial cross-validation reduced overall model performance, it also reduced over-fitting and illustrated that spatial autocorrelation had little effect on our modelling analysis.
The presented research contributes to a lack of literature on land-use change modelling where the focus is on high-fidelity parcel-level models, which is the scale at which land-use change decisions are made [69].Future research will seek to extend the modelling comparison to other statistical modelling approaches as well as to investigate the outcomes of different scenarios of future land-use and land-cover change.We hope that by providing the details about the results in the paper and supplementary material about the model specifics in appendices that others will compare and contrast their approaches to the work presented here.and 2015 were used as ground truth data toward modelling land-use changes with the proposed statistical methods.
Before creating any variables, DAUIDs (i.e., unique IDs of Das) in the Region of Waterloo in 2010 were assigned to parcels according to the location of parcels in the DA in order to relate DA's information to parcels.Geometries (i.e., perimeter and area) of the parcel polygons and DA polygons were calculated in ArcGIS and attached to the ownership parcel data.Site variables (i.e., mean slope and mean elevation) were created using slope and elevation data.Demographic data from 2011 Census data (e.g., population) associated with the DA in the Region of Waterloo retrieved from Statistics Canada were merged with the parcel data in excel format in R by the common variable DAUID.The parcel data were then imported back to ArcGIS since some spatial variables need to be created using the Polygon Neighbors tool in ArcGIS.The tool created a table that contains IDs of source polygons, IDs of neighbor polygons and records of land-use types for neighbor polygons.The table was then used to calculate the percentage of each land-use type around each source polygon by manually programmed codes in R. The percentage of neighbor land use was merged with the parcel data by unique parcel ID (i.e., ID of source polygon) in R. Furthermore, the parcel data were converted to point data (i.e., centroids of parcels) in order to calculate Euclidean distances from parcel centroids to other features (e.g., highway ramp and commercial parcel) in ArcGIS.During the process of variable creation, some parcels have been removed from the full dataset due to the lack of data for creating drivers associated with the parcels (e.g., a lack of census data in some areas).
Misclassification was found in some rare cases of land-use change during the process of creating status for land-use change (i.e., binary response variables).Since rare cases of land-use change (e.g., from industrial to water) are a relatively small amount of data compared to the total, second round of manual classification was conducted by the author alone to increase the classification accuracy of the data.During this round, SWOOP imageries for all three years (i.e., 2006, 2010, and 2015) were used to classify land-use types that associated with parcels being found with the occurrence of rare cases of land-use change.The manual classification was also done to parcels that found with the occurrence of some other land-use changes that have a relatively small amount of data compared to the total due to the consideration of data accuracy.Meanwhile, some rules of manual land-use classification have been modified based on Smith's work.The complete and modified rules are presented in Appendix C.Moreover, spatial variables were re-created using 2010 as the reference year since the neighborhood parcels may be changed.
Furthermore, some other problems have arisen.As mentioned in Section 2.4 in this paper, each full dataset is consisted of all parcels that had their land use converted to a specific land use in 2015 and parcels that had remained the specific land use during the study period.After conducting an explanatory analysis on full datasets, it has been found that the majority types of land-use change in a full dataset came from the part of parcels that has not been manually verified or modified in the second round.Therefore, it is reasonably to suspect the accuracy of the data.However, the study has moved forward with the current set of data due to the high overall accuracy of computer simulated land-use data and the time constraint of this project.Another finding is that the classification accuracy of computer simulated 2010 land use is about 68 percent for the approximately 6000 parcels that have been verified or modified in the second round of manual classification.This infers that the computer classification approach performed differently for different land-use types.In addition, the season that SWOOP data have been taken is another factor that can cause misclassification of the same land use in different years by computer simulation methods.Parcels that appear to contain a single dwelling for a single family on a large property.These parcels typically appear outside the urban core in suburbs or rural areas.While houses tend to be larger than medium density residential, it is not a requirement for the classification.

Medium-Density Residential
Average sized parcels containing a single dwelling for a single family, which may or may not be attached to adjacent dwellings.This class contains the majority of residential parcels within subdivisions and the urban core.In most parcels, the house and driveway cover most or all of the width of the parcels, with yards in the front and back.
Townhouses are usually classified as medium-density residential.* 3 High-Density Residential Parcels containing buildings with multiple dwellings or units, and therefore multiple families within the parcel.Typically in two forms, apartment or condo buildings, and townhouses where one parcel contains multiple units.Parcels may contain green space and parking lots in addition to the buildings.

Commercial
Parcels containing business where customers visit to obtain products and services, or office buildings which may not receive customers.Larger parcels, such as malls or box stores, will contain large parking lots for customers.These parcels do not contain large outdoor storage areas, although garden and home improvement stores may have some outdoor storage.

Industrial
Parcels which contain a business with an outdoor storage area such as a factory or a car scrapyard.These business typically do not receive customers although there may be parking lots for employees and areas for incoming materials and outgoing products.

Institutional
Manually classified parcels for schools (private and public) and hospitals.Schools and hospitals can appear as a variety of classes but provide different services from these misclassifications (e.g., Commercial or Protected Areas and Recreation).Manually classifying these parcels allows for them to be included in the landscape without large amounts of misclassification.

Transportation
Parcels which represent roads and railways.These parcels often include the boulevard and sidewalks.Highway interchange parcels include all the land which is owned and managed by the managing government.

Protected Areas and Recreation
Areas which have a primary purpose of recreation, such as parks, or protected areas such as forests.Commercial forests and private forests are included in this class as they appear very similar, or even identical to the natural forests.
9 Agriculture Parcels which are primarily used for raw food production.This includes fields for crops and pastures.Some parcels will have barns and/or a farm house, while others may have neither.Parcels may also include a portion which is forested, sometimes referred to as "the back forty".
10 Water Parcels which have a main purpose of outlining waterbodies such as rivers.
Lakes are included when the lake occupies a majority of the parcel.The rest of the parcel may include sections which would otherwise be classified as Protected Areas and Recreation.

Under Development
Properties where construction has not been completed and no residents or business has moved in.These parcels may become many different classes when complete, but the class cannot be guaranteed at the time of the imagery.Depending on the progress of a development project, residential areas and big box stores or shopping complexes may appear similar as the area is represented by only a single parcel.
* Addition for clarification of manual land-use classification used by Smith (2017).

First Class Second Class Problem Solution
Low-Density Residential

Medium-Density Residential
Parcel size is a continuous variable and it is difficult to define the exact separation between the two classes.
In many cases where there is confusion, the house is the same size as the surrounding properties which are either low or medium density and is a similar distance from the road.The parcel in question will usually have its additional size added through its backyard.* Addition for clarification of manual land-use classification used by Smith (2017).
Table A4.Exemptions and special cases in land-use classification.

Airport Commercial
Airports provide services similar to Commercial parcels, where people are constantly visiting the parcel.Visually they are similar as they both include large paved areas such as parking lots and a large building.

Fire station Commercial
Although functionally different from Commercial parcels, they are very similar in the imagery.

Graveyard Protected Areas and Recreation
Graveyards and cemeteries are visually similar to parks, where there are paths for people to walk and grass fields.The only visual difference is that there are pieces of stone (headstones) scattered across the fields and there is no sports equipment.

Water Tower Protected Areas and Recreation
Water towers can be visually similar to parks as they can have large grassy areas surrounding the tower.If the water tower is in a parcel without much grassed area, it may be classified as Commercial instead.

Commercial Forest -Post-Harvest Various
If the harvested forest appears to be converted into agriculture, classify as Agriculture.If it shows signs of urban development, it should be classified as Under Development.If it appears to be replanted and is still being used as a commercial forest, classify as Protected Areas and Recreation.

Catwalk Transportation
The paths between houses, or catwalks, are similar to roads, although a little smaller.A path through a park or green space would not be considered transportation.

Walking paths Protected Areas and Recreation
Walking paths in the area can often be found under large electrical transmission lines.The transmission lines and towers account for a small portion of the parcel, and therefore simply appear as grassy corridors through subdivisions, similar to parks.

Church Commercial
Churches are visibly similar to Commercial parcels because they are a building which has a parking lot and some property.Functionally they are also similar as people will visit a church for a relatively short period of time, similar to a business.

Artifacts N/A
The parcel data is not perfect and has artifacts from either previous versions, or mistakes during creation.Some artifacts have little impact on the data, while others have large impacts.The most frequent example is a single parcel being divided into multiple parcels by the artifacts.

Appendix E
In the following context, a smoothed term refers to a variable that was fit using a smoothing function to represent the non-linear relationship between it and the response variable in GAM.

Figure 1 .
Figure 1.Land-use change from 2010-2015 in the Region of Waterloo, Ontario, Canada.

Figure 1 .
Figure 1.Land-use change from 2010-2015 in the Region of Waterloo, Ontario, Canada.

Figure 2 .
Figure 2. Methodology for logistic regression (LR), generalized additive model (GAM), and survival analysis (SA) statistical models including the development of full-balanced and reduced-balanced training and testing data, model selection, and land-use classification accuracy assessment.CCV: conventional cross validation; SCV: spatial cross-validation.

Figure 2 .
Figure 2. Methodology for logistic regression (LR), generalized additive model (GAM), and survival analysis (SA) statistical models including the development of full-balanced and reduced-balanced training and testing data, model selection, and land-use classification accuracy assessment.CCV: conventional cross-validation; SCV: spatial cross-validation.

Figure 3 .Figure 3 .
Figure 3. (a) The ten possibilities of modelling one-to-one land-use change from agriculture to all other land uses.(b) Modelling many-to-one land-use change from any land use to agriculture.[Notes: LDR: low-density residential; MDR: medium-density residential; HDR: high-density Figure 3. (a) The ten possibilities of modelling one-to-one land-use change from agriculture to all other land uses.(b) Modelling many-to-one land-use change from any land use to agriculture.[Notes: LDR: low-density residential; MDR: medium-density residential; HDR: high-density residential; COM: commercial; IND: industrial; INS: institution; TRA: transportation; REC: protected area and recreation; AGR: agriculture; WAT: water; UND: under development].

Table 1 .
Sample sizes of full-balanced and reduced-balanced datasets by land-use type.
* The samples of agricultural parcels are randomly selected for Markov chain.
Note: Bold values indicate highest accuracy by land-use type.

Table 3 .
The individual and overall accuracy (percentage) for final models derived from 10-fold CCV with full-balanced (FB) and reduced-balanced (RB) test datasets.
Note: Bold values indicate highest accuracy by land-use and land-cover type.
Note: Bold values indicate highest accuracy by land-use type.

Table 5 .
The individual and overall accuracy (percentage) for final models derived from 10-fold SCV with full-balanced (FB) and reduced-balanced (RB) test datasets.
Note: Bold values indicate highest accuracy by land-use type.

Table 6 .
The combination of final models that produces the highest accuracy (percentage) by land-use type with full-balanced (FB) and reduced-balanced (RB) test datasets.
Note: Overall 1 is the overall accuracy of all land-use changes.Overall 2 is the overall accuracy excluding land-use changes of institution, agriculture and water.

Table A2 .
Manual land-use classification of parcels in the Region of Waterloo.
If the backyard visually occupies two thirds of the property, it can be easily called low density, if less, medium density.If the parcel has a backyard smaller than two thirds, but the front yard and house are large, then it can also be classified as low density.If an absolute value of size is needed, 2000 m 2 should be used as the minimum size for Low Density Residential.If there is a completed house with grass on the property it should be considered complete and classified as Medium Density Residential.If the house does not appear complete or there is no grass where there should be, it should be classified as Under Development.

Table A5 .
N/AWhen a parcel is divided by artifacts all segments should be classified as the original type if suitable.If a segment can clearly be classified as another land use type it should be done.For example, if a Low Density Residential parcel is divided into three pieces, two covering the house and one covering a forest at the back of the property, the two on the house should be Low Density Residential and the one on the forest should be Protected Areas and Recreation.In the scenarios where parcels have been created but no development has begun, classify the parcel based on the currently present land use type.If the imagery shows evidence of development, then classify as Under Development.Running time (seconds) of methods with 10-fold CCV by land-use type for full-balanced (FB) and reduced-balanced (RB) training datasets.

Table A6 .
Absolute difference between averaged 10-fold CCV accuracies and overall accuracies of full-balanced and reduced-balanced training datasets in percentage (%).

Table A7 .
Absolute difference between overall accuracies of final models derived from 10-fold CCV with full-balanced and reduced-balanced test datasets in percentage (%).

Table A8 .
Absolute difference between averaged 10-fold SCV accuracies and overall accuracies of full-balanced and reduced-balanced training datasets in percentage (%).

Table A9 .
Absolute difference between overall accuracies of final models derived from 10-fold SCV with full-balanced and reduced-balanced test datasets in percentage (%).

Table A10 .
Coefficients of significant land-use change predictors in final LR derived from CCV with full-balanced test datasets.

Table A11 .
Coefficients of significant land-use change predictors in final LR derived from CCV with reduced-balanced test datasets.

Table A12 .
Coefficients of significant land-use change predictors in final GAM derived from CCV with full-balanced test datasets.The symbol "S" in the table indicates that the predictor is considered significant as a smoothed term.

Table A13 .
Coefficients of significant land-use change predictors in final GAM derived from CCV with reduced-balanced test datasets.

Table A15 .
Coefficients of significant land-use change predictors in final SA derived from CCV with reduced-balanced test datasets.

Table A16 .
Coefficients of significant land-use change predictors in final LR derived from SCV with full-balanced test datasets.

Table A17 .
Coefficients of significant land-use change predictors in final LR derived from SCV with reduced-balanced test datasets.

Table A18 .
Coefficients of significant land-use change predictors in final GAM derived from SCV with full-balanced test datasets.

Table A21 .
Coefficients of significant land-use change predictors in final SA derived from SCV with reduced-balanced test datasets.