Machine Learning Techniques for Modelling Short Term Land-Use Change

: The representation of land use change (LUC) is often achieved by using data-driven methods that include machine learning (ML) techniques. The main objectives of this research study are to implement three ML techniques, Decision Trees (DT), Neural Networks (NN), and Support Vector Machines (SVM) for LUC modeling, in order to compare these three ML techniques and to ﬁnd the appropriate data representation. The ML techniques are applied on the case study of LUC in three municipalities of the City of Belgrade, the Republic of Serbia, using historical geospatial data sets and considering nine land use classes. The ML models were built and assessed using two different time intervals. The information gain ranking technique and the recursive attribute elimination procedure were implemented to ﬁnd the most informative attributes that were related to LUC in the study area. The results indicate that all three ML techniques can be used effectively for short-term forecasting of LUC, but the SVM achieved the highest agreement of predicted changes.


Introduction
Studying, understanding, and modeling the land use change (LUC) process is important and represents one of the key research topic for many disciplines, particularly geography, urban planning, geo-information science, ecology, and land use science [1][2][3].Understanding the spatial patterns of the LUC process can enable planners and policy-makers to manage community development and growth in a sustainable way.LUC is influenced by many driving factors, ranging from socioeconomic conditions, demography, landscape topography, physical infrastructure, and planning constraints and policies.Consequently, modeling the LUC process is a challenging undertaking that has been implemented using various techniques, from logistic and multiple regression [4][5][6], Markov models [7,8], cellular automata [9][10][11], agent-based approaches [12,13], and more recently, machine learning (ML) techniques [14,15].Modeling the LUC process is dependent on availability of various and large data sets including demographic, geospatial, and historical data, and can be expressed as being data-driven.
The main focus of data-driven modeling methods is to find patterns and trends or to induce representative models of underlying processes using past data [16,17].These modeling methods assume stationarity in the relationship between the predictors and land change variables [18].Consequently, it is possible to discover the relationship between process inputs and outputs without the need for detailed understanding of the physical transformation process or transition functions.
The discovered relationships between the inputs and outputs are then used to define a model with only a limited number of assumptions [19].These properties make data-driven models easily transferrable to various end-user contexts when compared with related agent-based or cellular automata models that normally require expert knowledge to define the transition rules.
LUC is a complex process, and hence, ML provides suitable tools to represent the change process.The early ML techniques applied to LUC modeling used artificial neural networks, coupled with cellular automata [37,38] and geographic information system (GIS) [39,40].In addition, Decision Trees [41] and Support Vector Machines were used to guide transition rules for cellular automata models [42,43].Modeling LUC using ML techniques, such as Decision Trees, Support Vector Machines, Neural Network, and Random Forest have been expanding in recent years [15,[44][45][46].The LUC models described in these studies used only one ML technique and considered only two or three land use classes.Tayyebi and Pijanowski [14] used three ML techniques: Neural Network, Classification and Regression Trees, and Multivariate Adaptive Regression Splines for modeling land use changes when considering four land use class: (1) agriculture, (2) urban, (3) forest, and (4) other classes.They concluded that Neural Networks (NN) provided a better accuracy when comparing with the other two techniques.However, they used the same time interval in order to train and test the model.Further, many of those mentioned studies focused on the correlations between spatial indicators and land use classes at the same time point, but they do not extend toward estimating LUC to forecast or how it is often titled in statistical terms to predict the near future.ML frameworks that used to build such models are readily transferable to different urban environments, and, except for data preparation that could be automated, would require the limited involvement of a knowledge expert.Therefore, this research study addresses two main objectives regarding forecasting short-term land use class changes using ML techniques; selecting appropriate data representation and comparison of selected outputs of nine urban land use class models derived by three ML techniques, Decision Trees (DT), Neural Networks (NN) and Support Vector Machines (SVM).
Given that the proposed models use past data to "learn" the unknown relations between predictor variables and nine land use classes, it is a necessary requirement to define how the entities from the real world are described by choosing the most informative attributes (predictors) for the problem at hand.In our case study, the focus in on spatial attributes only, but other information categories, such as socio-economic position, could be easily added to the model building framework.The attribute selection process provides benefits, such as improving model accuracy, reducing noisy inputs and model complexity, and finally reducing the time required for training [47][48][49][50].Informative attributes were selected using the Information Gain ranking technique and the procedure of recursive attribute elimination.
The performance of three commonly used ML classification algorithms (DT, NN, and SVM) were assessed using the Sensitivity (True Positive rate-TP rate ), Fall-out (False Positive rate-FP rate ), Area Under Receiver Operating Characteristic Curve (AUC) [51] measure, and the method of comparison of three maps [52].The rationale for selecting a particular classification algorithm was based on high classification accuracy in many domains (NN, SVM) and easy model interpretability (DT).The forecasting land use models were tested on multiple land use class transitions that characterized different urban growth behaviours within the City of Belgrade, Republic of Serbia.A total of nine land use classes were investigated using data from three selected municipalities in the City of Belgrade for the time interval 2001-2011.The models were built and tested using two different time intervals 2003-2007 and 2007-2011, respectively.In the sections that follow, we formally define the classification problem for LUC forecasting, outline the work-flow developed for designing the models (attribute selection and learning protocol), report on the model comparison results, and provide conclusions from the research work.

Formulating the Problem as a Classification Task
In order to build a predictive short-term LUC model using ML techniques, it is necessary to have series of historical data for the study area in consideration and the assumption that an unknown land use transition function retains similar properties over a short time interval for which the model is generated.Generally, the duration of the short time interval depends on the properties of the study area.For our chosen study area, it takes three to four years.LUC models that are based on ML require the transformation of physical, socio-economic, neighboring, and other related data for each spatial cell unit of the study area into an appropriate data representation.The study area is usually represented as a grid of cells that are georeferenced as raster data layers in a geographic information system (GIS).Each raster cell has a regular shape, usually square, with its accompanying attributes and land use class.Several GIS data layers of the study area at different time points are necessary to build ML models for LUC analysis.The ML technique for modeling LUC can be regarded as a classification problem [53], which is more formally defined as follows: Let C = {c 1 , c 2 , . . ., c k } be the set of k predefined land use classes.Each grid cell at time point t can be represented as a row vector x t = <x t 1 , . . ., x t i , . . ., x t n > with x t i representing the value of the i-th spatial attribute that is assigned to the cell.Further, let y t from C be a land use class of x t at time point t.A function f p : x t → y t+1 applied over each x t from a grid representing the study area, is called a prediction if for each x t the relationship f p (x t ) = y t+1 is true whenever a cell x t changes its land use to the class y t+1 .Values from C are mapped into natural numbers, each representing a particular land use class.The model constraint is that each cell can belong to only one class at any time point t.
Assuming that the transition function f p exists and maps the grid at the time point t to land use classes at time point t + 1, the ML technique attempts to learn the function f p ' that approximates the unknown f p using only the training set in which all of the attribute values at time t and land use classes at time t + 1 are known beforehand.In order to evaluate the predictive power of the approximation, it is necessary to test the model that is built in the previous step with inputs x t+1 and to compare f p '(x t+1 ) with known y t+2 .Hence, model building and validation assumes the availability of data on spatial attributes and land use classes from three different time points [18].

Forecasting Short-Term Land Use Change Based on ML Techniques
The applied methodology for model building and verification consisted of five steps: (1) creating the initial geospatial data set; (2) creating training and test data sets; (3) attribute ranking; (4) model building for each ML technique; and, (5) model validation.
Step 1: Creating an initial input data set assumes the definition of a data representation for each grid cell x t at each time point t as discussed, in Section 2.1.The initial data set I t, t+1 is created for each time interval (t, t + 1) containing x t represented with attributes and y t+1 as a land use class at time point t+1 for each cell in the study area.
Step 2: In order to learn the predictive function f p , a training set of the form {(x t , y t+1 ) i }, i = 1,2, . . .,n, (n is the number of samples in the training set), was constructed from the initial set (I t, t+1 ), where x i t represents a grid cell at time point t and y i t+1 represents its land use class at time point t + 1.
Models that were obtained in the training phase were tested using an independent test set in the form of {(x t+1 , y t+2 ) i }, i = 1,2, . . .,m (m is the number of samples in the test set).The test set was derived from the corresponding I t+1 , t+2 initial set.Note the necessity of using data from three different time points when building (t, t + 1) and testing (t + 1, t + 2) the proposed forecasting models.
Due to the large volume of data and the low percentage of cells that change their land use class over time, the most commonly used random sampling is not appropriate.Therefore, we applied a balanced sampling strategy [54], providing more informative data sets for the supervised learning algorithm and building models that are less affected by unchanged cells or majority classes.The balanced training and test sets were created from the initial data sets in the following manner: (a) all of the cells that change their class were selected together with the same number of cells without LUC, (b) unchanged cells were chosen to be uniformly distributed across the whole area, and (c) the proportion of all the land-use classes remained the same as in the initial sets.Balanced training and test sets were labeled as S t, t+1 (training) and S t+1, t+2 (test).
Step 3: The goal of attribute ranking is to choose a subset of input attributes by eliminating those with small or no predictive influence on the model output.These less important attributes lower the performance of the ML process [48].The Info Gain (IG) ranking method that was used in this research is based on the concept of entropy that is defined in information theory [55].It provides the concept of how much information gained towards a correct classification is obtained if the value of the selected attribute is known.The IG ranking method independently evaluates all of the attributes using the training set and provides their ranking for the classification problem at hand.We applied a recursive attribute elimination method [56] and IG ranking to analyse the impact of the n most informative attributes on model performance.
Step 4: After selecting the appropriate spatial attributes and creating a balanced training and test sets from the initial data, the next step was to build classification models using DT, NN, and SVM.
Step 5: In order to assess and compare various modeling outcomes that were obtained from the different ML techniques, the weighted averages of TP rate , FP rate and AUC measure [57], and for an in-depth analysis of comparison of the three maps method [52], were applied.The overall model performance can be represented by weighted averages of TP rate , FP rate , and AUC (Weighted AUC), in which weights could be selected appropriately (if weights are equal, all of the classes are equally important in model validation).The comparison of the three maps method is based on a three-dimensional table that is constructed by comparing land use classes in three maps: reference (real) map at time t+1, reference map at time t + 2, and simulation map at time t + 2. Pontius et al. [52] explained this concept in detail; generally, the three-dimensional table quantifies simultaneously the observed (real) and simulated transitions among different classes.The method examines two components of agreement and three components of disagreement.The agreement is measured through persistence simulated correctly and change simulated correctly, while the disagreement is observed as a change simulated as persistence, change simulated as change to wrong category, and persistence simulated as change.These quantities provide a clear insight into the forecasting capability of the model, by giving separate measures for the cells that changed their land use over the testing interval and for the ones that remained unchanged.

Machine Learning Techniques-Theoretical Background
The three ML techniques DT, NN, and SVM were used for supervised learning in order to build the LUC models and are briefly described as follows: DT refers to hierarchical models that are used for classification and decision making [58].A tree consists of decision nodes in which an instance is tested against the values of the associated attribute.The instance follows one of the possible paths from the root to a leaf node, where it is classified according to the majority of training examples that were associated with this leaf node.There are many different learning approaches that are used to build the tree from the training data, such as CART [59], ID3 [60], and C4.5 [61].The C4.5 decision-tree classifier was adopted for use in this study since it uses a common strategy for tree induction from the training data [61], where the more informative attributes are located in the upper parts of the tree.The advantage of a DT model is in its transferability to a set of if-then rules that connect input attributes and class decisions, which are easily understandable for domain experts.
NN originate from the late 1950's when Rosenblatt [62] presented the perceptron as the simplest form of the NN, which calculates the weighted sum of its inputs and generate 1 if the sum is greater the predefined threshold, otherwise 0. Since then, various types of neural networks were designed, such as Kohonen network [63], Radial Basis network [64], and Multi-layer Perceptron.In our multi-class problem setting (nine land use classes), the Multi-layer Perceptron (MLP) neural network, as defined by Rumelhart et al. [65] was adopted for use.The MLP is one of the most widely used neural networks [39], in which data paths from input to output units are strictly feed-forward [66].The network model is represented with a set of real weights that are associated to the interconnections between perceptrons, and a learning procedure is applied, where, in each iteration, weights are updated according to a back-propagation algorithm in order to minimize the error between the outputs of the network and the desired values.
The SVM [67,68] is a linear binary classifier.Every n-class problem can be transformed into a sequence of n (one-versus-all) or n(n−1)/2 (one-versus-one) binary classification tasks by using different voting schemes that ultimately lead to a final decision [69].SVM tries to construct the separation hyperplane between the training points of two classes such that the margin of separation is maximal.However, for non-linear cases, it first maps the points into a high dimensional space in which the linear separation is feasible.There exist many mapping or kernel functions from the original input space to the high dimensional feature space.In our study, the Radial Basis Function, also known as the Gaussian kernel [70], was used as it handles both linear and non-linear problem domains.
The Weka 3.6 software (Machine Learning Group at the University of Waikato, Hamilton, New Zealand) [71,72] was used to build all of the models based on DT, NN, and SVM techniques.Weka [72] is an open source software that implements various types of ML algorithms.In this research, J48 was used as a Java implementation of the C4.5 algorithm for learning DT, Sequential Minimal-Optimization algorithm for SVM learning, and a two-layer feed forward net with error back-propagation for NN.In addition, Weka contains data pre-processing modules, including tools for attributes selection, such as IG, used in this research.

Study Area and Data Sets
The City of Belgrade, in the Republic of Serbia, was developed under different cultural, historical, social, and economic influences that have shaped its urban morphology.The chosen study area includes three neighboring municipalities that are characterized by different urban dynamics: Zemun, Novi Beograd, and Surčin (Figure 1).The main study region covered an area of about 19 km × 25 km and was buffered by 100 m on each side to minimize any potential edge effects.
ISPRS Int.J. Geo-Inf.2017, 6, 387 5 of 17 weights are updated according to a back-propagation algorithm in order to minimize the error between the outputs of the network and the desired values.
The SVM [67,68] is a linear binary classifier.Every n-class problem can be transformed into a sequence of n (one-versus-all) or n(n−1)/2 (one-versus-one) binary classification tasks by using different voting schemes that ultimately lead to a final decision [69].SVM tries to construct the separation hyperplane between the training points of two classes such that the margin of separation is maximal.However, for non-linear cases, it first maps the points into a high dimensional space in which the linear separation is feasible.There exist many mapping or kernel functions from the original input space to the high dimensional feature space.In our study, the Radial Basis Function, also known as the Gaussian kernel [70], was used as it handles both linear and non-linear problem domains.
The Weka 3.6 software (Machine Learning Group at the University of Waikato, Hamilton, New Zealand) [71,72] was used to build all of the models based on DT, NN, and SVM techniques.Weka [72] is an open source software that implements various types of ML algorithms.In this research, J48 was used as a Java implementation of the C4.5 algorithm for learning DT, Sequential Minimal-Optimization algorithm for SVM learning, and a two-layer feed forward net with error back-propagation for NN.In addition, Weka contains data pre-processing modules, including tools for attributes selection, such as IG, used in this research.

Study Area and Data Sets
The City of Belgrade, in the Republic of Serbia, was developed under different cultural, historical, social, and economic influences that have shaped its urban morphology.The chosen study area includes three neighboring municipalities that are characterized by different urban dynamics: Zemun, Novi Beograd, and Surčin (Figure 1).The main study region covered an area of about 19 km × 25 km and was buffered by 100 m on each side to minimize any potential edge effects.Supplementary maps representing spatial attributes were created to contain information on accessibility, population density, and spatial neighborhoods using available data that was mostly obtained from the Urban Planning Institute of Belgrade and via consultations with urban planners (Table 1).After the analysis of LUC trends for the considered time interval (2001-2011), the following accessibility maps were created to reflect distances from the: city center, municipality centers, rivers (Danube and Sava), green areas greater than 100 m × 100 m, railways, highways, main roads, and streets of category I and II.The accessibility maps refer to raster maps of Euclidean distance to the closest cell of interest, i.e., to the closest cell of rivers, green area, etc.The study area is relatively flat and so attributes that are related to elevation were not considered.Population counts were obtained for 2011 from the Serbian Census Office and the official estimated population for the years 2003 and 2007 were obtained from the Statistical Office of the Republic of Serbia.Population density was calculated for each grid cell using a dasymetric mapping method [74].In order to emphasize the differences of urban types between three used municipalities (Zemun, New Belgrade and Surčin), an additional categorical attribute was defined containing information on cell location (the municipality of a cell), labeled as x 1 , Municipality code (Table 1 ).The influence of neighboring cells (x 13 and x 14 ) is represented with the two most frequent classes within the nxn cells Moore neighborhood, which makes a link to cellular automata LUC models that utilise this type of information.The size of the neighborhood can be adjusted to fit various resolutions for different modeling areas.The conducted analysis suggested that 7 × 7 cells is the most suitable representation for local spatial neighborhoods, providing the best model performance results [54].The information about the cell's previous land use class clarified whether the past land use class influences future land use apart from the present class (cell's memory info).As seen from the obtained research results, this type of information is important for the proposed models that are related to the study area.Data preparation was performed using multiple software tools.The SAGA GIS environment [75] and ArcGIS [76] were used to create the database and spatial attributes, analyse data, and present the final results.Java programming routines were developed to generate all of the required data sets from input GIS layers, including the most frequent classes in a cell's neighborhood.

Results and Discussion
The proposed models for short-term land use forecasting were built and assessed according to the five steps method, as described in Section 2.3.Since there was only a 4% change in the study area for the 10-year time interval 2001-2011 (Figure 2), two balanced data sets, S 3-7 for training and S 7-11 for testing, were sampled from the corresponding initial sets and the cells were represented according to the Section 2.

Results and Discussion
The proposed models for short-term land use forecasting were built and assessed according to the five steps method, as described in Section 2.   In order to assess the importance of each considered attribute, the IG ranking method was applied on the training set S3-7.The obtained attribute rankings are shown in Table 3.The highest ranked attribute represented the previous land use class (year 2001), while the present land use class was the second ranked attribute with an almost identical IG score.The next ranked was the most frequent land use class in the neighborhood reflecting the importance of the influence of nearby cells.Other informative attributes were related to the proximity of the city center and the distance to the highways.The lowest ranked were related to the attributes that did not change at all, or only by a very small amount, during the observed time interval, such as distances to the river and railway, or the number of inhabitants per cell.The obtained attribute ranking is meaningful for the urban  In order to assess the importance of each considered attribute, the IG ranking method was applied on the training set S 3-7 .The obtained attribute rankings are shown in Table 3.The highest ranked attribute represented the previous land use class (year 2001), while the present land use class was the second ranked attribute with an almost identical IG score.The next ranked was the most frequent land use class in the neighborhood reflecting the importance of the influence of nearby cells.Other informative attributes were related to the proximity of the city center and the distance to the highways.The lowest ranked were related to the attributes that did not change at all, or only by a very small amount, during the observed time interval, such as distances to the river and railway, or the number of inhabitants per cell.The obtained attribute ranking is meaningful for the urban planning purposes, except for the attribute mun_code.The municipality code attribute (only three distinct values) is ranked much lower when compared with an urban expert's point of view, in respect that the study area municipalities are very different in the sense of urban growth.The reason for the lower ranking of that attribute could be that IG is biased towards choosing attributes with a large number of values.In order to examine the effects of selecting the top n informative attributes by IG to the classification performance, a recursive attribute elimination method was applied.The lowest-ranked attribute was removed from the training and test data set in each iteration until all of the attributes have been removed.The obtained performance curves for all three ML techniques are presented in Figure 3.
ISPRS Int.J. Geo-Inf.2017, 6, 387 8 of 17 planning purposes, except for the attribute mun_code.The municipality code attribute (only three distinct values) is ranked much lower when compared with an urban expert's point of view, in respect that the study area municipalities are very different in the sense of urban growth.The reason for the lower ranking of that attribute could be that IG is biased towards choosing attributes with a large number of values.In order to examine the effects of selecting the top n informative attributes by IG to the classification performance, a recursive attribute elimination method was applied.The lowest-ranked attribute was removed from the training and test data set in each iteration until all of the attributes have been removed.The obtained performance curves for all three ML techniques are presented in Figure 3. Generally, the overall model performance suggest that all three ML techniques are capable to effectively model short term land use changes with slight differences.Even with a small number of used attributes, ML based model can derive outcomes, with high values of weighted AUC (Figure 3).By comparing the obtained performance curves, it can be concluded that SVM and NN have a Generally, the overall model performance suggest that all three ML techniques are capable to effectively model short term land use changes with slight differences.Even with a small number of used attributes, ML based model can derive outcomes, with high values of weighted AUC (Figure 3).By comparing the obtained performance curves, it can be concluded that SVM and NN have a greater power of generalization when compared to DT.Values of weighted AUC indicate that the accuracy of the tested ML techniques declined abruptly when using less than five first-ranked attributes.Furthermore, it can be noted that the elimination of certain attributes improved the classification capability as a result of reduced model complexity, while retaining enough information about the phenomenon and the same number of teaching examples.This indicates that the initial selection of attributes that are given in Table 1 was justified for the study area and included real, informative predictors for LUC.Values of weighted average TP rate indicate that the DT method is approximately 5% less efficient in classification of predicted land use class when compared with the two other used techniques.Furthermore, weighted average TP rate values declined sharply for all three ML techniques when using four and less first-ranked attributes by IG, while weighted average FP rate values increased.The recursive attribute elimination should be stopped when the performance measures TP rate /FP rate start to drop/increase significantly.
An in-depth analysis of the results is realized by using the three maps comparison method, with a separate investigation of simulation outcomes of persistent and changed cells (Figure 4).This method allows for more insight into the model's performance and their mutual differences are more evident in comparison to weighted averages AUC, TP rate , and FP rate .The number of changed cells that are simulated correctly and that are simulated as persistent (Figure 4a,b) also indicate that the accuracy of the tested ML techniques declines abruptly when using less than five first-ranked attributes.The agreement between simulated changed and in reality changed cells (Figure 4a) suggests that SVM had a better ability to learn changes in a land use class when compared to NN and DT.DT favored the persistence of the land use class (Figure 4c,e).When comparing the disagreements for the changed cells (Figure 4b,d), it can be concluded that most of them were related to cells that were changed in reality, whereas the model simulated them as persistent.All of the ML techniques were capable of learning the concept of unchanging cells.
Taking into account all of the measures, the best performing models were built using the first nine ranked attributes for DT and SVM, and the first 11 ranked attributes for NN.A detailed per-class insight into the behavior of the best performing models is given in Table 4.
DT is less capable of predicting the Commercial land use class in comparison to the other two techniques.Concerning the Infrastructure class, none of the Infrastructure objects have been built from 2001 to 2011.Therefore, DT and SVM, as opposed to NN, successfully "learned" that this class did not change.The small amount of changes in the Wetland class during the observed time interval was caused by the start of construction of the bridge over the Sava.All of the ML techniques registered those changes during the learning process.When initially reducing from 13 to nine classes, the existing Not built land use class was transformed into the Green area or Agriculture classes that were based on the actual conditions detected on the orthophoto maps.Additionally, the Green area class contains areas that are used for several different purposes (cemeteries, parks, recreation, etc.).Therefore, the ML techniques had difficulties in learning related transition rules.
Figure 5 shows the actual and forecasted LUCs from 2007 to 2011 for the entire study area.Maps were generated by the best performing models trained on the balanced set S 3-7 after the removal of the less informative attributes.Taking into account all of the measures, the best performing models were built using the first nine ranked attributes for DT and SVM, and the first 11 ranked attributes for NN.A detailed per-class insight into the behavior of the best performing models is given in Table 4.
DT is less capable of predicting the Commercial land use class in comparison to the other two techniques.Concerning the Infrastructure class, none of the Infrastructure objects have been built from 2001 to 2011.Therefore, DT and SVM, as opposed to NN, successfully "learned" that this class did not change.The small amount of changes in the Wetland class during the observed time interval was caused by the start of construction of the bridge over the Sava.All of the ML techniques registered those changes during the learning process.When initially reducing from 13 to nine classes, the existing Not built land use class was transformed into the Green area or Agriculture classes that were based on the actual conditions detected on the orthophoto maps.Additionally, the Green area class contains areas that are used for several different purposes (cemeteries, parks, recreation, etc.).Therefore, the ML techniques had difficulties in learning related transition rules.Taking into account all of the measures, the best performing models were built using the first nine ranked attributes for DT and SVM, and the first 11 ranked attributes for NN.A detailed per-class insight into the behavior of the best performing models is given in Table 4.
DT is less capable of predicting the Commercial land use class in comparison to the other two techniques.Concerning the Infrastructure class, none of the Infrastructure objects have been built from 2001 to 2011.Therefore, DT and SVM, as opposed to NN, successfully "learned" that this class did not change.The small amount of changes in the Wetland class during the observed time interval was caused by the start of construction of the bridge over the Sava.All of the ML techniques registered those changes during the learning process.When initially reducing from 13 to nine classes, the existing Not built land use class was transformed into the Green area or Agriculture classes that were based on the actual conditions detected on the orthophoto maps.Additionally, the Green area class contains areas that are used for several different purposes (cemeteries, parks, recreation, etc.).Therefore, the ML techniques had difficulties in learning related transition rules.    1 N-number of changed cells; 2 M-number of persistent cells; 3 NS-number of changed cells simulated as; 4 MS-number of persistent cells simulated as.
Based on the interpretation of the generated maps by urban experts, it was concluded that all three models successfully forecasted the LUCs from 2007 to 2011 in three Belgrade municipalities.The SVM and NN models showed slightly better performance than DT.However, DT has a distinct advantage to be used in cases when experts require insight into the model internals since if-then rules can easily be generated.In addition, DT is faster to train than SVM and NN.However, the calibration process of a DT model is less complex when compared with the two other ML techniques, which require the optimization of the pairs of relevant algorithm parameters.Nevertheless, the SVM and NN have a better capability to provide a good generalization of land use changes.Furthermore, the SVM technique is less prone to overfitting when compared to NN.

Conclusions
This research examined LUC models that were based on ML techniques in an urban environment with nine land use classes.The obtained results demonstarted that ML models are capable to learn the transition rules in a supervised manner by utilizing the spatial data and land use classes at time point t -1 and land use classes at time point t.Assuming the existence of an unknown transition function that retains similar properties over the short period of time, two to four years, the model is capable of forecasting land use classes at time point t + 1 from the spatial data that is describing the area at time point t.The ML-based models are capable to forecast the LUC and could be a valuable decision support tool for municipal services that require information related to land-use when experts are not available.Furthermore, experts can benefit from such ML models in a way to better understand relations that are hidden in the data and to complement other expert knowledge approaches.
A comparison was made between three common ML techniques (DT, NN, SVM) when building LUC models from the historic spatial and land use data.The SVM method was the best to forecast the land use changes, while the DT method was capable to learn the concept of no change better than other two.Finally, the DT method can be perceived as easiest to interpret over techniques such as NN and SVM.
One of the objectives of the research was to find the appropriate data representation for learning.Initial representation included common spatial urban indicators (present land use class, distances to city center, highways, roads, etc.) that were augmented with neighboring information about land use classes and history information about previous land uses at each cell in the grid.A detailed analysis using the IG ranking method was done to find the most informative attributes that are related to LUC in the study area.It was found that the reduced number of attributes produced less complex models with a better predictive performance.

Figure 1 .
Figure 1.Study area: Municipalities Zemun, New Belgrade, and Surčin from Belgrade metropolitan area, Republic of Serbia.The spatial data were obtained from orthophoto images for the years 2001, 2003, 2007, and 2011, with spatial resolutions 0.30 × 0.30 m, 0.30 × 0.30 m, 0.25 × 0.25 m, 0.20 × 0.20 m, respectively, and from actual land use maps in vector format (*.shp files, polygons of land use class) that were provided from the Urban Planning Institute of Belgrade for the years 2001 and 2010.The maps of land use classes for the years 2003 and 2007 were created by digitizing corresponding orthophoto

Figure 1 .
Figure 1.Study area: Municipalities Zemun, New Belgrade, and Surčin from Belgrade metropolitan area, Republic of Serbia.
The previous land use class for the time interval 2003-2007 is land use class at the year 2001 and land use class at year 2003 represents previous land use class for the time interval 2007-2011.

3 .
Since there was only a 4% change in the study area for the 10-year time interval 2001-2011 (Figure 2), two balanced data sets, S3-7 for training and S7-11 for testing, were sampled from the corresponding initial sets and the cells were represented according to the Section 2.3, step 1, using time points 2001 (only previous land use class), 2003, 2007 and 2011 (Table2).

Figure 3 .
Figure 3. Effectiveness of the attribute selection when applied to building models based on weighted average of: (a) Area Under Receiver Operating Characteristic Curve (AUC); (b) True Positive rate (TPrate); (c) False Positive rate (FPrate).

Figure 3 .
Figure 3. Effectiveness of the attribute selection when applied to building models based on weighted average of: (a) Area Under Receiver Operating Characteristic Curve (AUC); (b) True Positive rate (TP rate ); (c) False Positive rate (FP rate ).

Figure 4 .
Figure 4. Effectiveness of the attribute selection when applied to building models based on three maps comparison methods: (a) Number of changed cells simulated correctly; (b) Number of changed cells simulated as persistent; (c) Number of persistent cells simulated correctly; (d) Number of changed cells simulated as change to wrong category; (e) Number of persistent cells simulated as change.

Figure 4 .
Figure 4. Effectiveness of the attribute selection when applied to building models based on three maps comparison methods: (a) Number of changed cells simulated correctly; (b) Number of changed cells simulated as persistent; (c) Number of persistent cells simulated correctly; (d) Number of changed cells simulated as change to wrong category; (e) Number of persistent cells simulated as change.

Figure 4 . 17 Figure 5
Figure 4. Effectiveness of the attribute selection when applied to building models based on three maps comparison methods: (a) Number of changed cells simulated correctly; (b) Number of changed cells simulated as persistent; (c) Number of persistent cells simulated correctly; (d) Number of changed cells simulated as change to wrong category; (e) Number of persistent cells simulated as change.ISPRS Int.J. Geo-Inf.2017, 6, 387 11 of 17 Figure 5 shows the actual and forecasted LUCs from 2007 to 2011 for the entire study area.Maps were generated by the best performing models trained on the balanced set S3-7 after the removal of the less informative attributes.

Figure 5 .
Figure 5. Generated maps for best performing models for year 2011 (a) Actual land-use and forecasts obtained with (b) Decision Trees (DT), (c) Neural Networks (NN), and (d) Support Vector Machines (SVM).

Figure 5 .
Figure 5. Generated maps for best performing models for year 2011 (a) Actual land-use and forecasts obtained with (b) Decision Trees (DT), (c) Neural Networks (NN), and (d) Support Vector Machines (SVM).

Table 1 ,
).The initial data set for each time interval, I 2003,2007 and I 2007,20011 , contains attributes about the municipality code (Table 1, x 1 ), Euclidian distances to specific entities (Table 1, x 2 -x 10 ), population (Table 1, x 11 ), the cell's land use class (Table 1, x 12 ), information about land use classes in the cell's neighborhood (Table 1, x 13 and x 14 ), and information about the cell's previous land use class (history) (x 15

Table 1 .
List of attributes used for modeling land-use change in the study area.

Table 2 .
Description of data sets used in the experiments.

Table 2 .
Description of data sets used in the experiments.

Table 3 .
Top ranked attributes based on the Info Gain (IG) method.

Table 3 .
Top ranked attributes based on the Info Gain (IG) method.

Table 4 .
Effectiveness of the attribute selection when applied to building models based on three maps comparison methods.