www.mdpi.com/journal/ijgi/ Evaluation of Model Validation Techniques in Land Cover Dynamics

This paper applies different methods of map comparison to quantify the characteristics of three different land change models. The land change models used for simulation are termed as “Stochastic Markov (St_Markov)”, “Cellular Automata Markov (CA_Markov)” and “Multi Layer Perceptron Markov (MLP_Markov)” models. Various model validation techniques such as per category method, kappa statistics, components of agreement and disagreement, three map comparison and fuzzy methods have then been applied. A comparative analysis of the validation techniques has also been discussed. In all cases, it is found that “MLP_Markov” gives the best results among the three modeling techniques. Fuzzy set theory is the method that seems best able to distinguish areas of minor spatial errors from major spatial errors. Based on the outcome of this paper, it is recommended that scientists should try to use the Kappa, three map comparison and fuzzy methods for model validation. This paper facilitates communication among land change modelers, because it illustrates the range of results for a variety of model validation techniques and articulates priorities for future research.


Introduction
A typical approach to land-use and land-cover change (LUCC) modeling is to investigate how different variables relate to historic land transitions, and then to use those relationships to build models that project future land transitions [1,2]. In general, spatially explicit models of LUCC begin with a digital map of an initial time and then simulate transitions in order to produce a prediction map for a subsequent time [3]. Upon seeing the prediction results, questions may arise about the accuracy of the base maps, the performance of the model and whether the predicted map represents the real scenario [4]. It is therefore necessary to quantify the map errors and the amount of difference among the maps, and to validate the models used for prediction.
With the growth of high-resolution spatial modeling, geographic information systems (GIS) and remote sensing, the need for map comparison methods is increasing. Good comparison methods are needed to perform calibration and validation of spatial results in a structured manner [5]. The importance of map comparison methods is recognized and is attracting growing interest among researchers [6,7]. In general, maps are compared for a number of reasons: (1) to compare maps generated by models under different scenarios and assumptions, (2) to detect temporal/spatial changes, (3) to calibrate/validate land-use models, (4) to perform uncertainty and sensitivity analyses and (5) to assess map accuracy. In fact, map comparison may be seen as finding a goodness-of-fit measure [8].
There has been tremendous interest in the validation of simulation models that predict changes over time [9,10]. However, there is usually less than perfect agreement between the change predicted by the model and the change observed in the reference maps. This is no surprise, since scientists usually do not anticipate that a model's prediction will be perfect. Furthermore, scientists rarely believe that the data are perfect. Therefore, a natural question is, “What accounts for the most important disagreements between the prediction and the data: (1) error in the prediction map, or (2) error in the reference maps?” [11]. If precise information on accuracy and error structure is available, then there could be a method to incorporate information concerning data quality into measures of model validation [12,13].
Assessing model performance is a continuous challenge for modelers of landscape dynamics. A common approach is historical validation, where a predicted map is compared to an actual map [14]. However, many types of land-use models, such as Markov models, cellular automata, logistic regression models and neural networks, simulate land-use changes starting from an original land-use map. Since most locations do not change their land use over the length of a typical simulation period, the similarity between the simulated land-use map and the actual land-use map will be high for most calibrated models [15]. Therefore, to rigorously assess the accuracy of the simulated land-use map, a meaningful reference level is required [16].
The evaluation of spatial similarities and land use change between two raster maps is traditionally based on pixel-by-pixel comparison techniques. This kind of change detection procedure is called post-classification comparison [17]. A problem with these traditional techniques is that, because they are based on a pixel-by-pixel comparison, they do not necessarily capture the qualitative similarities between the two maps. This problem becomes important when map comparisons (e.g., of actual and predicted land use) are used to evaluate the output of predictive spatial models such as cellular automata based land use models [18]. The lack of appropriate comparison techniques, especially ones that can handle qualitative comparisons of complex land use maps for the purpose of evaluating model output, is currently a major problem in the area of predictive simulation modeling [19].
Recently, numerous map comparison methods have been proposed that take into account the spatial relation between cells, as opposed to simple cell-by-cell overlap [20]. These new methods consider, for example, proximity [21], the presence of recognizable structures, i.e., features [22], moving windows [23] or wavelet decomposition [24]. Others have evaluated model performance based on metrics summarizing the whole landscape [25,26]. In this way, different methods have been introduced and new software packages are being developed for map comparison and for the validation of models that predict LUCC from a map of an initial time to a map of a subsequent time [2]. This paper addresses these issues and illustrates some methods through a case study from Khulna, Bangladesh to validate the predicted maps. The main objective of this paper is to find out whether the simulation is giving any abrupt result or not and to compare the different model validation techniques. Therefore, in this paper, we discuss the advantages and disadvantages of some commonly used map comparison techniques for assessing the agreement between the simulated maps and the actual land-cover maps.

Study Area
The proposed study area is Khulna City Corporation (KCC) and its surrounding impact areas (Figures 1 and 2). Geographically, Khulna lies at 22°49'N and 89°34'E. Its mean elevation is seven feet above Mean Sea Level. Khulna is a linear-shaped city [27]. Within the KCC core area, there are roughly 11,280 acres of land. Nearly 10% of this land is not yet in urban use, which means that about 1,100 acres of land are available within KCC for future urban growth [27].

Remote Sensing Data
To prepare the base maps, Landsat satellite images (1989, 1999 and 2009) have been collected from the official website of the US Geological Survey (USGS). Landsat Path 137 and Row 44 cover the whole study area. The map projection of the satellite images is set as Universal Transverse Mercator (UTM), Zone 46 N, with the World Geodetic System (WGS) 1984 datum. The pixel size of the images is 30 × 30 m.
The following five land cover types have been identified for this research (Table 1) [28]:

Low Land
Permanent and seasonal wetlands, low-lying areas, marshy land, rills and gully, swamps, mudflats, all cultivated areas including urban agriculture; crop fields and rice-paddies.

Fallow Land
Fallow land, earth and sand land in-fillings, construction sites, developed land, excavation sites, solid waste landfills, open space, bare soils.

Base Map Preparation
A supervised classification method using the “Fisher hard classifier” has been applied to prepare the base maps. The Fisher classifier performs well when there are very few areas of unknown classes [28], which is why it is selected. A mode filter is then applied to generalize the Fisher-classified land cover images. This kind of filtering helps to minimize isolated pixels. The generalized images are later reclassified to produce the final version of the land cover maps of three different years (Figure 3). This combination is adopted to generate the best possible classification results in this particular context.
After performing change detection analysis, it is found that “builtup area” is increasing while the “water body”, “low land” and “fallow land” cover types are decreasing gradually (Figure 4). This is the general trend of land cover change that will be used for simulating the future scenario.

Accuracy Assessment
The next stage of the image classification process is accuracy assessment. It is not typical to ground-truth every pixel of the classified image. Therefore, some reference pixels are generated first. A total of 250 reference pixels are generated for each classified image to perform the accuracy assessment. The detailed historical base maps (1988, 1999 and 2008) of the KCC area are collected from the “Survey of Bangladesh” for this purpose. The collected base maps have then been used to find the land cover types of the reference points. User's accuracy for category K is the percent of category K in the reference information, given that the map shows category K. Producer's accuracy for category K is the percent of category K in the map, given that the reference information shows category K [28]. The overall accuracy represents the percentage of correctly classified pixels [29]. The producer's and user's accuracies for all the years are found to range approximately from 76% to 95%, while the overall accuracies for 1989, 1999 and 2009 are 84.20%, 88.80% and 93.60%, respectively.
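The three accuracy measures defined above can be read directly off a confusion matrix of reference pixels. The following is a minimal sketch, assuming that rows hold the map categories and columns hold the reference categories; the function name `accuracy_summary` and the example counts are hypothetical and are not the data of this study.

```python
import numpy as np

def accuracy_summary(confusion):
    """Return user's, producer's and overall accuracy from a confusion matrix.

    Assumed layout: confusion[i, j] counts the reference pixels that the map
    labels as category i and the reference information labels as category j.
    """
    confusion = np.asarray(confusion, dtype=float)
    correct = np.diag(confusion)
    # User's accuracy: share of the pixels mapped as K that are K in the reference.
    users = correct / confusion.sum(axis=1)
    # Producer's accuracy: share of the reference pixels of K that are mapped as K.
    producers = correct / confusion.sum(axis=0)
    # Overall accuracy: share of all reference pixels that are correctly classified.
    overall = correct.sum() / confusion.sum()
    return users, producers, overall

# Hypothetical 3-category matrix (illustrative counts only).
example = [[40, 4, 1],
           [6, 50, 4],
           [4, 6, 35]]
users, producers, overall = accuracy_summary(example)
print(users, producers, overall)
```

A category that never appears in the map or in the reference sample would produce a division by zero here, so a real workflow would need to handle empty rows or columns explicitly.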

Simulating Land Cover Maps
In the next stage, three different models are implemented using the IDRISI Selva® software to simulate the land cover maps of Khulna for 2009 [28]. For this purpose, the base maps of 1989 and 1999 are used in all three cases. The first model is named the “Stochastic Markov Model (St_Markov)” [30], because it combines stochastic processes with Markov Chain analysis techniques [31,32]. The second model is termed the “Cellular Automata Markov Model (CA_Markov)” [30]. CA_Markov combines the concepts of Markov Chain [32], Cellular Automata [33], Multi-Criteria Evaluation [34] and Multi-Objective Land Allocation [30]. The third model is named the “Multi Layer Perceptron Markov Model (MLP_Markov)” [28]. MLP_Markov combines the concepts of Markov Chain [32], Artificial Neural Network [35] and the Feed-Forward concept of the Multi Layer Perceptron Neural Network [36]. The “St_Markov”, “CA_Markov” and “MLP_Markov” methods have been adopted from Ahmed and Ahmed (2012) [28]. The simulated land cover maps are shown in Figure 5.
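All three models share a Markov Chain component. As a rough illustration of that component only, the sketch below estimates a transition probability matrix by cross-tabulating two classified maps and then projects category totals one step ahead. It is not the IDRISI implementation and it omits the stochastic, cellular automata and neural network allocation steps; the function names and the assumption that categories are coded 0 to n_classes - 1 are hypothetical.

```python
import numpy as np

def transition_matrix(earlier, later, n_classes):
    """Row-normalized cross-tabulation of two maps (rows: earlier category)."""
    earlier = np.asarray(earlier, dtype=np.intp).ravel()
    later = np.asarray(later, dtype=np.intp).ravel()
    counts = np.zeros((n_classes, n_classes))
    # Count how many cells moved from each earlier category to each later one.
    np.add.at(counts, (earlier, later), 1)
    row_sums = counts.sum(axis=1, keepdims=True)
    # Rows for categories absent from the earlier map are left as zeros.
    return np.divide(counts, row_sums, out=np.zeros_like(counts),
                     where=row_sums > 0)

def project_quantities(later, probabilities):
    """Expected cell count per category one time step after `later`."""
    later = np.asarray(later, dtype=np.intp).ravel()
    counts = np.bincount(later, minlength=probabilities.shape[0])
    return counts @ probabilities

# Hypothetical toy maps with two categories (0 and 1).
map_t1 = np.array([0, 0, 0, 0, 1, 1])
map_t2 = np.array([0, 0, 1, 1, 1, 1])
P = transition_matrix(map_t1, map_t2, n_classes=2)
projected = project_quantities(map_t2, P)
print(P, projected)
```

Applying the matrix estimated from 1989 to 1999 to the 1999 totals is only meaningful here because the projection interval (1999 to 2009) has the same length as the calibration interval.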

Results and Discussion
Traditionally, model validation refers to comparing the simulated and reference maps [37]. Sometimes the simulated maps can give misleading results. In that case, it is necessary to validate the projected/simulated map against the base/reference map. In this section, the actual base map (2009) is compared with the simulated maps (St_Markov, CA_Markov and MLP_Markov) of year 2009. The main objective of model validation is to find out whether the simulation is giving any abrupt result or not. This justifies the modeling output in terms of reality.
For validating the simulated maps, two different approaches are adopted. The first is a pixel-based visual approach, which helps to reveal the spatial patterns at a glance but is subjective. The second is a statistical approach, which is important because it explains the scenario in a quantitative way. There is a general tendency to choose the wrong technique for model validation in remote sensing and GIS analysis. This article is intended to help researchers select the proper model validation technique.

Per Category Method
The per category comparison method performs a cell-by-cell comparison with respect to one (user selected) category. It simultaneously gives the user information about the occurrence of the selected category in both maps [37]. Figures 6-8 show the results of this cell-by-cell comparison for each land cover category. The outputs are depicted with four legend classes indicating the different states of comparison. The larger the area labelled “in both maps”, the better the simulation result. All the possible combinations (Base Map 2009 vs. St_Markov 2009, Base Map 2009 vs. CA_Markov 2009 and Base Map 2009 vs. MLP_Markov 2009) are taken into consideration. It is found that the simulated map of “MLP_Markov 2009” shows the best results for all the land cover categories in terms of the highest amount of the legend class “in both maps” (Figure 8). This kind of pixel-based per category map comparison is calculated from the “Contingency Table”, which details the cross-distribution of categories on the two maps. The table is expressed in number of cells (Tables 2-4). Three statistics are compared in each confusion matrix: overall accuracy, producer's accuracy and user's accuracy. However, this kind of map comparison method cannot separate “error due to quantity” from “error due to location” in order to partition the total error when comparing maps that show the same categorical variable [38,39].
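The four comparison states of the per category method can be sketched as follows. This is a minimal illustration with hypothetical names and toy arrays, assuming both maps are aligned rasters with the same category codes; the integer coding of the four states is an arbitrary choice made for this sketch.

```python
import numpy as np

def per_category_states(reference, simulated, category):
    """Code each cell by where `category` occurs.

    0 = in neither map, 1 = only in the reference map,
    2 = only in the simulated map, 3 = in both maps.
    """
    in_reference = np.asarray(reference) == category
    in_simulated = np.asarray(simulated) == category
    return in_reference.astype(np.uint8) + 2 * in_simulated.astype(np.uint8)

# Hypothetical 2 x 3 rasters with categories 1, 2 and 3.
reference = np.array([[1, 1, 2],
                      [2, 3, 3]])
simulated = np.array([[1, 2, 2],
                      [1, 3, 1]])
states = per_category_states(reference, simulated, category=1)
print(states)
```

Counting the cells coded 3 for each category gives the “in both maps” amount used above to rank the three simulations.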

Location and Quantity Accuracies Using Kappa Statistics
Kappa is a member of a family of indices that have the following desirable properties: (1) if classification is perfect, then Kappa = 1; (2) if observed proportion correct is greater than expected proportion correct due to chance, then Kappa > 0; (3) if observed proportion correct is equal to expected proportion correct due to chance, then Kappa = 0; and (4) if observed proportion correct is less than expected proportion correct due to chance, then Kappa < 0 [38,40].
However, Pontius (2000, 2002) argued that standard Kappa (Cohen's Kappa) offers almost no useful information because it confounds quantification error with location error [38,39]. Therefore, four kappa statistics are presented here (Table 5): the traditional kappa (Kstandard), a revised general kappa defined as kappa for no ability (Kno), and two more detailed kappa statistics that distinguish accuracies in quantity and location (Kquantity and Klocation). The Kno statistic is an improved general statistic over Kstandard, as it penalizes large quantity errors and rewards further correct location classifications, while Kquantity and Klocation are able to distinguish clearly between quantification error and location error, respectively [38]. The argument that standard Kappa does not give proper information has not yet been accepted universally by the scientific community. Land-use modelers use Kappa extensively as a simple index to evaluate the accuracy of base maps and for map comparison purposes [41,42]. Kappa therefore remains a very popular and well recognized map comparison index [43].
After analyzing Table 5, it can be concluded that “MLP_Markov” shows the highest values of the kappa coefficients among the three models. The assumption is that the higher the kappa values, the better the model.
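To make the kappa properties listed above concrete, the sketch below computes three of the four indices from a cross-tabulation matrix. The formulas are assumptions based on the commonly cited definitions (observed agreement compared against chance agreement from the marginals, against 1/J for J categories, and against the best agreement attainable with the given quantities); they are not reproduced from this paper, and Kquantity is omitted. The function name and example counts are hypothetical.

```python
import numpy as np

def kappa_indices(confusion):
    """Return (K_standard, K_no, K_location) for a square cross-tabulation.

    Assumed layout: rows are the comparison (simulated) map categories,
    columns are the reference map categories.
    """
    p = np.asarray(confusion, dtype=float)
    p = p / p.sum()                      # convert counts to proportions
    n_categories = p.shape[0]
    observed = np.trace(p)               # observed proportion correct
    row = p.sum(axis=1)
    col = p.sum(axis=0)
    chance = float(row @ col)            # expected agreement given the marginals
    best_location = float(np.minimum(row, col).sum())  # perfect allocation
    k_standard = (observed - chance) / (1.0 - chance)
    k_no = (observed - 1.0 / n_categories) / (1.0 - 1.0 / n_categories)
    k_location = (observed - chance) / (best_location - chance)
    return k_standard, k_no, k_location

# Hypothetical two-category cross-tabulation (illustrative counts only).
k_standard, k_no, k_location = kappa_indices([[60, 5],
                                              [15, 20]])
print(k_standard, k_no, k_location)
```

In this toy case the three indices differ even though they summarize the same matrix, which is the kind of divergence that motivates reporting them separately.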

Errors Due to Quantity and Allocation
For the practical applications in remote sensing, Pontius and Millones (2011) explained how these Kappa metrics are misleading for the purposes of accuracy assessment and map comparison [44].It is more helpful to summarize the cross-tabulation matrix in terms of quantity disagreement and allocation disagreement, as opposed to proportion correct or the various Kappa indices [44].
Chen and Pontius (2010) now recommend using the term “error due to allocation” rather than “error due to location”, in order to clarify its meaning [2,45]. Both error due to quantity and error due to allocation are measured in terms of the percent of the landscape, and the two types of error sum to the total error [2]. For a two-map comparison, error due to allocation measures how much less than optimal the match is in the spatial allocation of the changes, given the specification of the quantities of the changes in the observed and predicted change maps [4]. Pontius and Millones (2011) suggested that the two simple measures of quantity disagreement and allocation disagreement are much more useful for summarizing a cross-tabulation matrix than the various Kappa indices [44]. Therefore, a spreadsheet called “PontiusMatrix20.xlsx”, which provides a variety of statistical summaries of a cross-tabulation matrix, has been recommended [44]. It offers one comprehensive statistical analysis that simultaneously answers two important questions: (1) How well do a pair of maps agree in terms of the quantity of cells in each category? and (2) How well do a pair of maps agree in terms of the allocation of cells in each category? The statistics indicate how well the comparison map agrees with the reference map [4].
Results show that the values of disagreement components are found lowest while the values of agreement components are highest for MLP_Markov (Table 6).
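The split of total disagreement into quantity and allocation components can be sketched from a cross-tabulation matrix as follows. This is a minimal illustration assuming the usual formulation in which quantity disagreement is half the summed absolute difference between the category marginals and allocation disagreement is the remainder; it is not the PontiusMatrix spreadsheet itself, and the function name and example counts are hypothetical.

```python
import numpy as np

def disagreement_components(confusion):
    """Return (agreement, quantity disagreement, allocation disagreement).

    Assumed layout: rows are the comparison map, columns the reference map.
    All three values are proportions of the landscape and sum to 1.
    """
    p = np.asarray(confusion, dtype=float)
    p = p / p.sum()
    diagonal = np.diag(p)
    comparison = p.sum(axis=1)   # proportion of each category in the comparison map
    reference = p.sum(axis=0)    # proportion of each category in the reference map
    agreement = diagonal.sum()
    # Mismatch in how much of each category the two maps contain.
    quantity = 0.5 * np.abs(comparison - reference).sum()
    # Mismatch that could be removed by swapping cell locations.
    allocation = np.minimum(comparison - diagonal, reference - diagonal).sum()
    return agreement, quantity, allocation

# Hypothetical two-category cross-tabulation (illustrative counts only).
agreement, quantity, allocation = disagreement_components([[60, 5],
                                                           [15, 20]])
print(agreement, quantity, allocation)
```

Because the two disagreement components sum to the total error, a lower value of either one directly raises the agreement reported for a model.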
However, this method compares the simulation for 2009 only to the reference map for 2009, which is a flawed comparison because it fails to distinguish agreement due to persistence from agreement resulting from change. Therefore, it is important to perform a three-map comparison of reference 1999, reference 2009 and simulation 2009 [46].

Comparison of Three Maps
In this section, a method of comparing three maps (a reference map of time 1, a reference map of time 2 and a simulation/prediction map of time 2) has been implemented for model validation [46]. In this case, the base map of 1999, the base map of 2009 and the simulated maps of 2009 (St_Markov, CA_Markov and MLP_Markov) have been used. The three-map comparison for each modeling application specifies the amount of the prediction's accuracy that is attributable to land persistence versus land change [1].
Comparison between the reference map of time 1 and the reference map of time 2 characterizes the observed change in the maps, which reflects the dynamics of the landscape. Comparison between the reference map of time 1 and the prediction map of time 2 characterizes the model's predicted change, which reflects the behavior of the model. Comparison between the reference map of time 2 and the prediction map of time 2 characterizes the accuracy of the prediction, which is frequently a primary interest [1].
However, an additional validation technique, considering the overlay of all three maps (the three-map comparison), allows one to distinguish between the pixels that are correct due to persistence and the pixels that are correct due to change [1].
The three-map comparison method consists of two components of agreement and three components of disagreement. According to Pontius et al. (2011), the components of agreement are persistence simulated correctly and change simulated correctly; the components of disagreement are change simulated as persistence (the entries where reference t1 matches simulation t2 but does not match reference t2), persistence simulated as change (the entries where reference t1 matches reference t2 but does not match simulation t2) and change simulated as change to wrong category (the entries where all three maps disagree) [46].
Figure 9 shows the results from an overlay of the three maps (the base map of 1999, the base map of 2009 and the St_Markov/CA_Markov/MLP_Markov simulated maps of 2009). This figure gives a clear visual idea of the nature of the prediction errors. Results show that the MLP_Markov model has the lowest percentage of disagreement components (28.066%) and the highest percentage of agreement components (71.934%) (Table 7).
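The five components described above follow directly from overlaying the three maps cell by cell. The sketch below is a minimal illustration with hypothetical function and variable names and toy arrays; it returns each component as a proportion of the landscape.

```python
import numpy as np

def three_map_components(ref_t1, ref_t2, sim_t2):
    """Proportion of cells in each agreement/disagreement component."""
    ref_t1, ref_t2, sim_t2 = (np.asarray(m) for m in (ref_t1, ref_t2, sim_t2))
    observed_change = ref_t1 != ref_t2   # the landscape changed between t1 and t2
    hit = sim_t2 == ref_t2               # simulation matches reference t2
    kept = sim_t2 == ref_t1              # simulation matches reference t1
    masks = {
        "persistence simulated correctly": ~observed_change & hit,
        "change simulated correctly": observed_change & hit,
        "change simulated as persistence": observed_change & kept,
        "persistence simulated as change": ~observed_change & ~hit,
        "change simulated as change to wrong category":
            observed_change & ~hit & ~kept,
    }
    return {name: float(mask.mean()) for name, mask in masks.items()}

# Hypothetical six-cell example covering every component at least once.
ref_1999 = np.array([1, 1, 1, 1, 1, 2])
ref_2009 = np.array([1, 1, 2, 2, 2, 2])
sim_2009 = np.array([1, 2, 2, 1, 3, 2])
components = three_map_components(ref_1999, ref_2009, sim_2009)
print(components)
```

The first two entries sum to the total agreement and the last three to the total disagreement, which corresponds to the two percentages reported for each model.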

Fuzzy Set Theory
The aim of traditional pairwise pixel-by-pixel comparison is to identify areas of categorical disagreement between two maps by determining the pixels with a difference in theme [18]. Several authors have expressed the need for a better post-classification change detection or map similarity procedure because of the limitations of a pixel-by-pixel comparison [47,48]. First, the procedure is sensitive to the existence of mixed pixels. A pixel-by-pixel comparison of multi-temporal maps will interpret any misalignment of one or both of the maps as change [49]. Second, the comparison techniques will often produce results that are significantly different from the actual land use. This is due to their inability to account for the inaccuracies in the maps throughout the comparison operation [50]. The comparison method presented in this section was primarily developed for use in the calibration and validation process of cellular models for land-use dynamics [5]. The method is based on fuzzy set theory [51,52]. Several authors have addressed the potential of fuzzy set theory for geographical applications, and it has been used before to assess the accuracy of map representations and for map comparisons [53,54].
The flexibility of a fuzzy representation of spatial data offers potential for avoiding the problems of traditional comparison procedures [18]. First, misregistration and locational inaccuracies can be accounted for by fuzzifying the boundaries of the pixels or polygons of the input maps. Second, fuzzy set theory provides a method of dealing with and comparing maps containing a complex mixture of spatial information. A fuzzy map is more appropriate for representing a complex land use type. Therefore, the degrees and types of categorical differences between maps should be determined by a fuzzy post-classification comparison [18].
The main purpose of the fuzzy map comparison/fuzzy Kappa map comparison is to take into account that there are grades of similarity between pairs of cells in two maps. This method takes the neighborhood of a cell into account to express the similarity of that cell as a value between 0 (fully distinct) and 1 (fully identical) [55]. The resulting map is called the fuzzy similarity map (Figure 10). Figure 10 gives the results of the fuzzy cell-by-cell method (comparing each of the three different simulations with the base map of 2009). The fuzzy membership function is an exponential decay with a halving distance of two cells and a neighborhood with a four-cell radius. The fuzzy output maps have then been categorized into three levels of agreement: identical, medium similarity and low similarity (Figure 10). Both fuzzy Kappa and average similarity are found to be highest for the “MLP_Markov” model and lowest for the “St_Markov” model (Figure 10 and Table 8).
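The neighborhood-based similarity described above can be illustrated with a simplified sketch. It assumes a two-way comparison in which each cell's category in one map is looked up in the distance-weighted neighborhood of the other map and the smaller of the two resulting values is kept; the weight halves every `halving` cells and the neighborhood is circular with the given radius. This is not the software used in this study, it does not compute fuzzy Kappa, and the function names, the padding code -1 and the toy maps are hypothetical.

```python
import numpy as np

def fuzzy_membership(cat_map, categories, radius=4, halving=2.0):
    """Per-category membership: strongest distance-decayed presence nearby."""
    rows, cols = cat_map.shape
    # Pad with -1, assumed not to be a valid category code.
    padded = np.pad(cat_map, radius, mode="constant", constant_values=-1)
    member = np.zeros((len(categories), rows, cols))
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            dist = np.hypot(dy, dx)
            if dist > radius:
                continue
            weight = 2.0 ** (-dist / halving)   # exponential decay
            window = padded[radius + dy:radius + dy + rows,
                            radius + dx:radius + dx + cols]
            for k, cat in enumerate(categories):
                member[k] = np.maximum(member[k], weight * (window == cat))
    return member

def fuzzy_similarity(map_a, map_b, radius=4, halving=2.0):
    """Cell-wise similarity between 0 (fully distinct) and 1 (fully identical)."""
    map_a = np.asarray(map_a, dtype=np.int64)
    map_b = np.asarray(map_b, dtype=np.int64)
    categories = np.union1d(map_a, map_b)
    member_a = fuzzy_membership(map_a, categories, radius, halving)
    member_b = fuzzy_membership(map_b, categories, radius, halving)
    index_a = np.searchsorted(categories, map_a)
    index_b = np.searchsorted(categories, map_b)
    rows, cols = np.indices(map_a.shape)
    a_in_b = member_b[index_a, rows, cols]   # category of A found near the cell in B
    b_in_a = member_a[index_b, rows, cols]   # category of B found near the cell in A
    return np.minimum(a_in_b, b_in_a)

# Hypothetical 5 x 5 maps: one cell of category 1, shifted by one cell in map B.
map_a = np.zeros((5, 5), dtype=int)
map_b = np.zeros((5, 5), dtype=int)
map_a[2, 2] = 1
map_b[2, 3] = 1
similarity = fuzzy_similarity(map_a, map_b)
print(similarity.round(3))
```

In this toy case a crisp cell-by-cell comparison would score the two shifted cells as complete disagreement, whereas the fuzzy similarity gives them a partial score, which is how minor spatial errors are separated from major ones.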

Conclusions
At the beginning of this paper, a Fisher supervised classification method is applied to prepare the base maps of Khulna City with five land cover classes. After performing accuracy assessment and quantifying map errors, it is found that the errors in the maps are not much larger than the amount of land change between the two points in time (1989-1999 and 1999-2009). Then, consistent with the inherent characteristics of change, three different methods are implemented to simulate the land cover maps of Khulna City (2009). The methods are named the “Stochastic Markov (St_Markov)”, “Cellular Automata Markov (CA_Markov)” and “Multi Layer Perceptron Markov (MLP_Markov)” models.
Different model validation techniques, namely the per category method, kappa statistics, components of agreement and disagreement, three-map comparison and the fuzzy method, are then applied. A comparative analysis of the advantages and disadvantages of the validation techniques has also been presented. Fuzzy set theory is found best able to distinguish areas of minor spatial errors from major spatial errors. In all cases, it is found that “MLP_Markov” gives the best results among the three modeling techniques. In this way, it is possible to compare different models and choose the modeling technique that gives better results.
Based on the outcome of this paper, it is recommended that scientists use the Kappa, three-map comparison and fuzzy methods to compare predicted change with observed change and to validate predictive land change models.
Our hope might be realized if the error in the base maps is reduced to the point where it becomes smaller than the apparent change in land. This paper will help researchers decide whether the most important errors are in the model or in the data. Moreover, it is our belief that this kind of research has a high potential to help researchers working on different case studies learn about the available validation techniques and choose the right one.
We have organized this article so that it provides helpful information for other scientists whose goals are to validate a model's performance and to set an agenda for future research.

Future Research
For any kind of model validation or map comparison, the accuracy of the base maps is very important. However, maintaining the accuracy of the base maps is difficult because historical data are often unavailable and older maps are hard to verify. Moreover, there are different image classification methods (e.g., supervised, unsupervised, object-based, hybrid, etc.), which can give different results. Even the use of different filtering techniques (e.g., median, mode, mean, Gaussian), filter sizes, classifiers (e.g., hard, soft, segmentation) and reclassification methods can give varying results. The spatial and temporal resolution of the remotely sensed images can also have an impact on identifying training sites for signature development. All these factors can play an important role in assessing the accuracy of maps and in model validation. Future research can therefore be conducted incorporating all these relevant issues.
There are many available map comparison techniques. Each has its own advantages and disadvantages. Therefore, it is very important to distinguish which technique is suitable for a particular context or case study. This can be another dimension for future research. Finally, future research must address the spatial dependency between the maps to be compared.

Figure 2. Location of the study area (areas of Khulna City Corporation (KCC) and adjoining fringe areas) on Landsat satellite images.(Image source: US Geological Survey (USGS), 2012 and Shapefile source: Khulna City Corporation, 2012).

Figure 3. Land cover maps of the study area.

Figure 9. Maps of the components of agreement and disagreement.

Table 1. Land cover types.

Table 5. Summary of kappa statistics for the models on validation data (2009).

Table 6. Components of agreement and disagreement for model validation.

Table 7. Components of agreement and disagreement of three map comparison method.

Table 8. Agreements of fuzzy similarity maps for model validation.