Effects of Category Aggregation on Land Change Simulation Based on Corine Land Cover Data

Abstract: Several factors influence the performance of land change simulation models. One potentially important factor is land category aggregation, which reduces the number of categories while potentially also reducing the size of apparent land change in the data. Our article compares how four methods to aggregate Corine Land Cover categories influence the size of land changes in various spatial extents and consequently influence the performance of 114 Cellular Automata-Markov simulation model runs. We calculated the reference change during the calibration interval, the reference change during the validation interval and the simulation change during the validation interval, along with five metrics of simulation performance: the Figure of Merit and its four components, Misses, Hits, Wrong Hits and False Alarms. The Corine Standard Level 1 category aggregation reduced change more than any of the other aggregation methods. The model runs that used the Corine Standard Level 1 aggregation method tended to return smaller sizes of changing areas and lower values of Misses, Hits, Wrong Hits and False Alarms, where Hits are correctly simulated changes. The behavior-based aggregation method maintained the most change while using fewer categories than the other aggregation methods. We recommend an aggregation method that maintains the size of the reference change during the calibration and validation intervals while reducing the number of categories, so the model uses the largest size of change while using fewer than the original number of categories.


Introduction
Land change modeling is a popular research topic with a broad variety of approaches for understanding and simulating the temporal changes among land categories. Land change modeling supports decision-making concerning land management by giving insight into land-change processes and by providing future scenarios that simulate the future land mosaic and structure [1][2][3]. Within this variety of approaches, Cellular Automata (CA)-Markov modeling is a widely known technique for land use/land cover (LULC) change simulation that extrapolates changes in a LULC map based on previous states of LULC in the spatial extent.
Data derive from a wide variety of sources, of which Landsat is one of the most popular. Initially called ERTS, the Landsat mission was launched in the 1970s; its long history thus makes long-term land change analysis possible [4,5]. Another popular dataset is the imagery of the Sentinel satellites. The combined use of Sentinel's radar and optical data allows for temporal land monitoring, because Sentinel's frequent revisit time makes it possible to produce maps of land change during short time intervals [6,7]. Classification of land change requires procedures that depend on the data source, such as hyperspectral images [8,9] or Synthetic Aperture Radar data [10,11]. Researchers can also apply ready-to-use LULC datasets, like the Corine Land Cover database available in European countries. In the Corine Land Cover database, multiple datasets describe the LULC status in different years (1990, 2000, 2006, 2012 and 2018), making it possible to use the database as an input for either monitoring [12,13] or simulation and validation [14,15] of land changes.
Category aggregation is the procedure to merge categories to form fewer categories. Aggregation is an important factor in land change modelling because some LULC maps have too many categories to analyze coherently. The aggregation of LULC map categories affects whether a specific categorical transition is present or eliminated from the extent. Aggregation of two categories erases the change between those two categories. Pontius Jr and Malizia (2004) [16] gave five mathematical rules that dictate how aggregation can reduce the size of temporal change. Pontius Jr et al. (2018) found that smaller amounts of reference change during the validation interval are associated with lower accuracies of land change simulation models [17].
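The arithmetic behind this erasure can be illustrated with a toy cross-tabulation (hypothetical pixel counts, not CLC data): merging two categories moves their mutual transitions onto the diagonal, so total change shrinks by exactly the size of those transitions.

```python
import numpy as np

# Cross-tabulation of pixels from time 1 (rows) to time 2 (columns)
# for three hypothetical categories A, B, C.
T = np.array([[80, 10,  5],
              [ 8, 70,  2],
              [ 4,  6, 90]])

change_before = T.sum() - np.trace(T)  # all off-diagonal pixels: 35

# Aggregate B into A: add B's row and column to A's, then drop B.
T[0, :] += T[1, :]
T[:, 0] += T[:, 1]
T = np.delete(np.delete(T, 1, axis=0), 1, axis=1)

# The A-B change (10 + 8 = 18 pixels) became persistence,
# so total change shrinks by exactly that amount: 35 - 18 = 17.
change_after = T.sum() - np.trace(T)
```

This is the mechanism behind the rules of Pontius Jr and Malizia (2004): aggregation can only reduce, never increase, the apparent change.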
Hierarchical classification schemes organize detailed categories into broader categories at various levels of aggregation. One of them is Anderson's classification scheme [18], a nested hierarchical system of LULC categories in which a greater Anderson level has more categories. According to Guttenberg (2002) [19], the origin of multidimensional land use classification is related to urban planning that arose in the middle of the twentieth century. This classification approach provided an opportunity for project-specific classifications [20]. Many category schemes exist, but there is no uniformly used scheme because each scheme has its purpose and each project has its purpose [21]. For instance, Africover was a land cover category database created by FAO specifically for Africa [22]. There are examples of vegetation-specific classification systems as well [23][24][25]. These latter category schemes focused on capturing the character of various vegetation types, so they cannot be applied for general LULC classification in an area where non-vegetation LULC classes occur. The LULC classification schemes of Anderson [18] and the Corine Land Cover Program [26] had the purpose of characterizing land cover in general, because they had to characterize large areas with an extremely diversified LULC, e.g., covering a Pan-European scale in the case of Corine. Our research considers methods that aggregate any type of LULC class.
Our article examines (1) how four aggregation methods affect the size of changes in eight spatial extents and (2) how the simulation model performance varies with the aggregation methods. To the best of our knowledge, this problem has not been analyzed in the literature concerning land change simulation modeling. We analyzed the reference change during the calibration interval, the reference change during the validation interval and simulation change during the validation interval concerning the original categories and four aggregation methods. We also computed five metrics concerning the performance of a CA-Markov simulation model and evaluated the results with special attention to category aggregation. We draw conclusions concerning the application of various aggregation methods and their impact on simulation model performance.
The Materials and Methods section describes the dataset, aggregation methods, model features and statistical analysis. The Results section gives several figures to visualize the findings. The Discussion section states the practical importance of the results and provides insights concerning best practices in aggregating data before running a land change simulation model.

Dataset
The input data are the Corine Land Cover (CLC) dataset, which is a pan-European land cover database. The database covers 39 countries, comprising the European Environment Agency (EEA) member and cooperating countries, including the members of the European Union. The CLC program was initiated in 1985 and five datasets have been released so far. The datasets characterize LULC based on images taken approximately 1-2 years before the release dates. We used CLC datasets for the years 2000, 2006 and 2012. The time interval between 2000 and 2006 was the calibration interval and the time interval between 2006 and 2012 was the validation interval for all the models we ran. The CLC has its own nomenclature with a maximum of 44 categories, and the nomenclature is consistent over the databases from different years, hence making the years comparable and helping to monitor the changes over time. The nomenclature consists of 3 hierarchical levels, which are called standard levels. Standard level 1 has fewer categories than standard level 2, which has fewer categories than standard level 3. The CLC change maps were also produced based on the comparison of the consecutive years, and the CLC change layers had a smaller minimal mapping unit than the CLC layers, resulting in a finer land change dataset. CLC and CLC change layers are freely available from https://land.copernicus.eu/pan-european/corine-land-cover.

Study Area Selection According to Model Requirements
We analyze eight spatial extents that we selected in part based on the quantity of changing areas according to the CLC Change layers for 2000-2006 and 2006-2012. The eight extents were located across Europe and were named after the cities closest to their locations (Figure 1).
We selected areas that had large sizes of changes during at least one of the time intervals of the analysis to test the model in various cases. Further conditions for selection were that the extents must have a maximum of 20 categories at each date and must have the same categories in at least the first two dates (2000 and 2006), because the CA-Markov model cannot handle cases where a category at the end of the calibration interval does not exist at the beginning of the calibration interval. To each of the eight extents, we applied three zoom levels: large (L), medium (M) and small (S). The large zoom level is identical to the extent, the medium zoom level is a subset of the large zoom level, and the small zoom level is a subset of the medium zoom level. The 24 combinations of extent and zoom level were clipped from CLC's 100 m resolution raster layers. The 24 selected combinations had the following characteristics:
1. the areas had the same pixel number by zoom level (Large = 62,500 pixels; Medium = 15,625 pixels; Small = 2500 pixels);
2. the areas had the same pixel resolution for all levels (100 m);
3. the areas had the same categories in 2000 and 2006;
4. the areas experienced as much change as possible according to the CLC change layers.
Then, in each of the 24 combinations of extent and zoom level, we performed four category aggregation procedures. With 8 extents, 3 zoom levels and 5 category treatments (the original data plus four aggregation methods), we could have examined 120 model runs; however, in six cases no category fell below the change threshold of the threshold-based aggregation, so that aggregation was not applicable and we examined 114 models altogether.

CLC Standard Levels
The CLC database has a standard hierarchical nomenclature with 3 standard levels. The most detailed standard level is Level 3, which consists of 44 categories. The CLC dataset classifies land into these 44 categories based on various remotely sensed data over time, processed in accordance with a technical guideline [27]. Newer releases applied new remote sensing techniques [28]. Some processing methods varied over time, but all years have a uniform minimal mapping unit of 25 hectares, a thematic accuracy over 85% and a uniform nomenclature [26].
We used all three hierarchical levels of the CLC datasets. We clipped our eight extents from the Level 3 data and derived the Level 1 and Level 2 maps by aggregating the Level 3 categories according to the CLC hierarchy.
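Because the CLC nomenclature is hierarchical, a Level 3 code (three digits, e.g., 112 for Discontinuous urban fabric) aggregates to Level 2 or Level 1 simply by truncating trailing digits. A minimal sketch (the function name is ours, not from any CLC tool):

```python
def clc_aggregate(code: int, level: int) -> int:
    """Aggregate a 3-digit CLC Level 3 code to a coarser standard level
    by truncating trailing digits, e.g., 112 'Discontinuous urban
    fabric' -> 11 'Urban fabric' -> 1 'Artificial surfaces'."""
    assert level in (1, 2, 3)
    return code // 10 ** (3 - level)
```

For example, `clc_aggregate(243, 1)` returns 2, the Level 1 code for Agricultural areas.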

Behavior-Based Category Aggregation
The behavior-based category aggregation method originates from research published by Aldwaik et al. (2015) [29], in which the authors presented a Microsoft Excel Visual Basic for Applications macro for a special type of category aggregation. This method aggregates categories based on the contingency table of the change between two dates. The behavior-based aggregation maintains net change, which is the change attributable to quantity differences of the categories between the two dates. The authors produced a free computer program that performs the aggregation for any transition matrix [29]. We used this computer program to execute the aggregation for the 24 combinations of extents and zoom levels. The program aggregates classes step by step, two categories in each step, while monitoring the total and net changes at each step. The user can monitor whether the total and net change start to shrink at any step. We aggregated the classes dictated by the behavior-based aggregation method, stopping just before the total change started to decrease. In this way, the total change in the aggregated data equals the total change in the unaggregated data. We applied the algorithm for the calibration interval, then applied the same aggregation to the categories during the validation interval. We refer to this type of aggregation as behavior-based (BB) aggregation (Figures 2 and 3).
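A simplified sketch of the change-preserving idea (not the full algorithm of Aldwaik et al., which also groups categories by similar transition behavior): merge only pairs of categories that never transition with each other, so the total change is identical before and after every merge.

```python
import numpy as np

def total_change(T):
    # Total change is the sum of the off-diagonal entries.
    return T.sum() - np.trace(T)

def merge_pair(T, i, j):
    # Merge category j into category i: combine rows and columns,
    # then drop j. Any i<->j change becomes persistence.
    T = T.copy()
    T[i, :] += T[j, :]
    T[:, i] += T[:, j]
    return np.delete(np.delete(T, j, axis=0), j, axis=1)

def aggregate_preserving_change(T):
    # Repeatedly merge a pair of categories with zero mutual
    # transitions, so every merge leaves total change untouched;
    # stop when no such pair remains.
    merged = True
    while merged:
        merged = False
        n = len(T)
        for i in range(n):
            for j in range(i + 1, n):
                if T[i, j] + T[j, i] == 0:
                    T = merge_pair(T, i, j)
                    merged = True
                    break
            if merged:
                break
    return T
```

On a block-diagonal toy matrix, this collapses the non-interacting blocks while the off-diagonal total stays constant, illustrating why stopping before total change shrinks preserves the unaggregated change.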

Aggregation Based on Changes Observed in the Reference Intervals
The categories were aggregated based on the size of the change during the calibration and validation intervals. Using a common threshold, we aggregated into a new category called Other those categories that showed a total change of less than 0.1% of the actual extent in either the calibration or the validation interval. This procedure was applied to the CLC Level 3 data, so the unaggregated categories are identical to the CLC Level 3 categories. If no category fell below the threshold, then we did not perform any aggregation; this is why there are six combinations of extent and zoom level where this method performed no aggregation. We refer to this type of aggregation as threshold-based (TB) aggregation (Figures 2 and 3). Figure 2 shows how detailed categories are aggregated into fewer, broader categories for a specific combination of extent and zoom, according to the rules of each aggregation method. CLC Standard Level 3 consists of the 13 categories that occurred in the Borovany extent at Zoom Level L. The CLC Standard system is a hierarchical category scheme, so the categories of the more detailed CLC Standard Level 3 are aggregated into broader Standard Level 2 (L2) categories and then into broader Standard Level 1 (L1) categories.
The BB category aggregation method resulted in two of the original classes and three aggregated classes. The BB method considers the change characteristics during the calibration interval in the combination of study extent and zoom. BB may aggregate categories that are thematically diverse. For example, Figure 2 shows Discontinuous urban fabric aggregated with Transitional woodland-scrub. The BB algorithm maintains change by aggregating categories that do not transition with each other.
TB aggregation resulted in six original categories and one category called Other. This Other category is the aggregation of seven thematically diverse categories. These categories account for extremely small changes. Consequently, the land change model focuses on the categories that provide relatively larger changes in the landscape.
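The TB rule can be sketched as a mapping from each category to itself or to Other (a simplified single-interval version; the paper applies the 0.1% threshold over both the calibration and validation intervals):

```python
import numpy as np

def tb_mapping(T, labels, threshold=0.001):
    # Map each category to itself if its total change reaches the
    # threshold (as a share of the extent), otherwise to 'Other'.
    extent = T.sum()
    mapping = {}
    for k, lab in enumerate(labels):
        # Change involving category k: losses (its row) plus gains
        # (its column), excluding persistence on the diagonal.
        change_k = T[k, :].sum() + T[:, k].sum() - 2 * T[k, k]
        mapping[lab] = lab if change_k / extent >= threshold else "Other"
    return mapping
```

A category with zero change (here C) is sent to Other, while categories with measurable change keep their identity.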

CA-Markov Model
For the 114 cases (Table 1), we applied a Cellular Automata-Markov (CA-Markov) model in the Idrisi Selva software [30]. This model is based on two components. First, a Markov matrix characterizes the categorical transitions from 2000 to 2006 and then extrapolates the number of pixels of each transition from 2006 to 2012. The Markov matrix determines the number of pixels that transition from each category to every other category during the extrapolated time interval [31,32]. Second, the CA algorithm influences the spatial allocation of the extrapolated change from 2006 to 2012 based on the default settings of Idrisi's algorithm concerning neighboring pixels [33,34].

Table 1. A matrix of extents and aggregation methods concerning how we ran our models. Aggregation methods are named according to Section 2.3. L, M and S are abbreviations for the Large, Medium and Small zoom levels, referring to their area sizes. Positive (+) signs denote that the model was run for the relevant extent, zoom level and aggregation method according to their positions in the matrix. Negative (-) signs denote that the model was not run for that specific case.

All the models were run with an iteration number of 6 and a 5 × 5 contiguity filter. By using the same parameters, we ensured that varying model parameters did not influence the investigation of the factors playing an important role in model validation and performance.
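The Markov component described above can be sketched as follows, with an illustrative cross-tabulation (the pixel counts are hypothetical and Idrisi's implementation differs in detail):

```python
import numpy as np

# Cross-tabulation of pixel counts from 2000 (rows) to 2006 (columns)
# for three hypothetical categories (illustrative numbers, not CLC data).
counts = np.array([[900.0,  50.0,  50.0],
                   [ 30.0, 950.0,  20.0],
                   [ 10.0,  40.0, 950.0]])

# Markov transition probabilities: each row sums to 1.
P = counts / counts.sum(axis=1, keepdims=True)

# Category totals at the start of the simulated interval (2006)
# are the column sums of the calibration cross-tabulation.
totals_2006 = counts.sum(axis=0)

# Expected pixels in each transition during 2006-2012, assuming the
# calibration-interval probabilities persist (the Markov extrapolation).
expected = totals_2006[:, None] * P
totals_2012 = expected.sum(axis=0)
```

The CA component then decides *where* these extrapolated pixel counts are allocated, favoring cells whose neighbors already belong to the target category.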

Metrics Concerning Calibration and Validation Intervals
All the model runs derived from the 8 extents, 3 zoom levels and 5 aggregation methods. Before running the models, we measured the following:
1. the changing areas in the calibration interval as a percentage of the actual combination of the extent and zoom level;
2. the changing areas in the validation interval as a percentage of the actual combination of the extent and zoom level;
3. the number of categories in the reference time 1, reference time 2 and reference time 3 maps.
The figure of merit (FOM) is a metric used especially for simulation models, describing the match of predicted and observed change [17,35,36]. If the FOM is 0%, then there is no intersection of predicted and observed change; if the FOM is 100%, then there is a complete overlap between predicted and observed change. The FOM components provide a deeper insight into the similarity of changes [37]. The FOM components express all types of incorrectly and correctly predicted change as a percentage of the size of the combination of the extent and zoom level, as follows:
Misses = area of observed change predicted as persistence (incorrect);
Hits = area of observed change predicted as change to the correct category (correct);
Wrong Hits = area of observed change predicted as change to the wrong category (incorrect);
False Alarms = area of observed persistence predicted as change (incorrect) [17].
We calculated the FOM and its components (Misses, Hits, Wrong Hits and False Alarms) for all 114 model runs. These metrics were calculated in R software [38] with the 'lulcc' package [39].
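The definitions above are language-agnostic; a minimal NumPy sketch on toy per-pixel maps (illustrative values only, not our data) makes the bookkeeping explicit:

```python
import numpy as np

# Toy category maps: reference at times 1 and 2, plus the simulated
# map for time 2, one category code per pixel.
ref_t1 = np.array([1, 1, 2, 2, 3, 3])
ref_t2 = np.array([1, 2, 2, 3, 3, 1])
sim_t2 = np.array([1, 1, 3, 3, 3, 2])

obs_change = ref_t1 != ref_t2   # pixels that changed in reality
sim_change = ref_t1 != sim_t2   # pixels the model predicted to change

# Each component as a share of the study area.
misses       = np.mean(obs_change & ~sim_change)
hits         = np.mean(obs_change & sim_change & (ref_t2 == sim_t2))
wrong_hits   = np.mean(obs_change & sim_change & (ref_t2 != sim_t2))
false_alarms = np.mean(~obs_change & sim_change)

# FOM: hits as a share of the union of observed and predicted change.
fom = hits / (misses + hits + wrong_hits + false_alarms)
```

In this toy case each component covers one pixel of six, so FOM = 0.25; note that correctly simulated persistence appears in no component, which is why FOM focuses on change.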

Statistical Analysis
According to the Shapiro-Wilk test, the distribution of the metrics did not follow the normal distribution; therefore, we applied a robust two-way factorial ANOVA to reveal the effects of aggregation and the study sites on the specific measures of calibration, validation and simulation changes, FOM and its components. We applied bootstrapping with 599 repetitions and the median as the estimator. Our null hypothesis was that (i) the medians of the four aggregation methods and the original data were equal, (ii) the medians of the eight study sites were equal and (iii) there was no statistical interaction between these factors (i.e., the effects of aggregation do not depend on the study sites) [40,41]. When robust ANOVA test statistics were not available with the bootstrap design, we reported only the p-values. Results of the post hoc analysis were plotted in boxplot diagrams focusing on the aggregation methods; significant differences (p < 0.05) were marked with different letters in the upper section of the diagrams, i.e., groups that differ significantly from each other have different letters, while groups with indistinguishable medians share an identical letter above the boxplots [42].
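The bootstrap design can be sketched for a single pairwise comparison (a simplified two-group percentile bootstrap of the median difference, not the robust two-way factorial ANOVA actually applied):

```python
import numpy as np

def bootstrap_median_diff(x, y, n_boot=599, seed=0):
    # Percentile-bootstrap 95% confidence interval for the difference
    # in medians between two groups, echoing the 599-repetition design.
    rng = np.random.default_rng(seed)
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        xb = rng.choice(x, size=len(x), replace=True)
        yb = rng.choice(y, size=len(y), replace=True)
        diffs[b] = np.median(xb) - np.median(yb)
    return np.percentile(diffs, [2.5, 97.5])
```

If zero lies outside the returned interval, the two group medians are taken to differ at roughly the 5% level.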

Results
The number of categories differed by aggregation method: the L1 aggregation method had the fewest categories, and the BB and L2 aggregation methods generally had fewer categories than L3 and TB.
The L1, L2 and L3 groups had theoretical maximums of 5, 15 and 44 categories, respectively, because those are the numbers of categories at each level of the Corine data. Our data had maximums of 5, 12 and 18 categories, respectively. The cases aggregated by the BB and TB aggregation methods had a theoretical maximum of 44 categories, as they derived from L3 data, but the BB and TB aggregations were based on the sizes and types of change. The BB and TB aggregation methods had maximums of 9 and 13 categories, respectively (Figure 4).

The aggregation methods had a significant effect on the percent of change (Table 2), meaning we reject the hypothesis that all medians are identical. According to the post hoc analysis, the median change in the L1 group was significantly less than the medians of the other aggregation methods in the calibration, validation and simulation intervals (Figure 5).

Table 2. Factorial ANOVA results of the percent of changes in the calibration, validation and simulation intervals by aggregation methods (H0: medians of calibration intervals, validation intervals and simulation intervals are equal).

Figure 6 shows that the aggregation methods had an insignificant effect on the FOM median: L1 had a slightly wider range of FOM values, and TB had a slightly tighter range relative to the BB, L2 and L3 groups. However, Figure 6 also shows that the median of L1 was less than the medians of the other aggregation methods, and effect sizes indicated a larger effect regarding the magnitude of the differences between L1 and the other aggregation techniques (L1-BB: 0.30, L1-L2: 0.29, L1-L3: 0.31, L1-TB: 0.30). In comparison, the largest difference among the other aggregation techniques was between L2 and L3, with an effect size of 0.002, indicating a smaller effect.

The aggregation methods also had a significant effect on the FOM components (Table 3). Wrong Hits, False Alarms and Misses showed different types of disagreement between reference and simulated changes (Figure 7). All of these components were lower for L1 relative to the other aggregation methods. Hits are the agreement between reference and simulated changes; Hits were also lower for L1 relative to the other aggregation methods. Hits and Wrong Hits in the L1 group were significantly different from the L2, L3 and BB groups. False Alarms in the L1 group were significantly different from the L3 and BB groups. Misses in the L1 group were significantly different from the groups of all the other aggregation methods.

Effects of Changes in the Study Area
Based on the results of the calibration, validation and simulation interval changes, the L1 group showed a lower ratio of changes in the study area relative to the other groups, as Figure 4 depicts. The ranges of the ratio of changes for all the other aggregation methods, excluding L1, were similar and not significantly different from each other. This means that L1 aggregation eliminates more change than the other aggregation methods, while all the other aggregation methods show a similar ratio of changes in the study areas. L1 aggregation results in a lower amount of change in the study area, which makes sense because L1 has fewer categories. L1 aggregation rules are based on Corine's thematic hierarchy [46].
Corine's hierarchical structure dictates the L1 and L2 aggregations, regardless of the empirical patterns during any time interval. The LULC changes during the calibration interval guide the BB and TB aggregations. When detailed categories are aggregated, the model cannot simulate changes among those detailed categories during the validation interval, even when those detailed categories account for change during the validation interval.


Effects of FOM and FOM Components
Pontius Jr et al. (2008) [47] found that FOM tended to return larger values where reference maps had larger amounts of observed net change. In our case, the median FOM for the L1 group was not statistically significantly different from the other aggregation methods. However, the L1 group's minimum FOM was equal to its median FOM, which was 0. This highlights the limitations of hypothesis testing, which can be misleading, as several authors have concluded [48][49][50][51]. Here, the differences between L1 and the other aggregation methods were not significant, yet half of L1's FOM values were less than the lower quartiles of all the other aggregations.
L1 values were generally lower in terms of Wrong Hits, False Alarms, Misses and Hits, while all the other aggregation methods were similar to one another in these components. The L1 group was significantly different from the L2, L3 and BB groups in terms of Hits and Wrong Hits, from the L3 and BB groups in terms of False Alarms, and from all the other aggregation method groups in terms of Misses. These components are calculated from changing areas by definition [17] and are given as a percent of the study area. Consequently, when an aggregation eliminates change in the study area, as L1 does, higher values of any FOM component become less likely. While L1 had fewer changing areas, it had lower values of all FOM components relative to the other aggregation methods, either in a statistically significant way or based on visual interpretation. Hits are the correctly simulated changes in the simulation map, and the other three components quantify erroneously simulated changes. Since all the examined components are lower for the L1 group, L1 yields a lower ratio of erroneously predicted areas but, simultaneously, a lower ratio of correctly simulated changes; in other words, the simulation model explains less change for the L1 group. While the FOM alone is not enough to qualify model performance, as stated in Varga et al. (2019) [37], the four FOM components (Wrong Hits, False Alarms, Misses and Hits) return lower values in the L1 group, where lower ratios of observed changing areas were also present in both the calibration and validation intervals. However, we do not know of any papers focusing on the statistical relationship between the FOM components and observed changes.

Effects of Category Numbers
Although the aggregation methods, excluding L1, had similar characteristics concerning the FOM components and the ratios of change in the study areas, it is important to pay attention to the number of categories as well. Aldwaik et al. (2015) [29] stated that a larger number of map categories may cause difficulties in the interpretation of the analysis. The behavior-based algorithm decreases the number of categories while maintaining change. BB, L2 and L3 were not significantly different in terms of Hits, but the BB group generally had fewer categories. This indicates that while BB aggregation decreases the number of categories, it does not yield lower Hits values than aggregation methods with generally larger numbers of categories. Decreasing the number of categories may result in easier interpretation and a lower demand for computing resources when running a simulation model. Besides the lower number of categories, BB aggregation also has the advantage that its aggregation rules aim to maintain changes in the area. Although L1 aggregation also results in a low number of categories, L1 aggregation eliminates changes in the area, as described above. The TB aggregation method is also based on maintaining some of the changing categories, but with an arbitrary threshold of change in the area. TB aggregation could not maintain the most change possible, because the changes not meeting the arbitrary threshold were eliminated. Moreover, TB aggregation did not reduce the number of categories as much as the BB and L1 aggregations. Consequently, we recommend an aggregation method that maintains changes and correctly simulated changing areas in the study area while reducing the number of categories as much as possible. Based on the results, we recommend BB aggregation as the best practice among the applied aggregation methods, instead of applying CLC standard level aggregation.
CLC standard level aggregation matches a hierarchical system while eliminating changes in the study area, resulting in lower ratios of both correctly and incorrectly simulated changing areas and therefore a lower ratio of explained change. Knowing the effect of category aggregation on model performance allows modelers to control a factor that can otherwise give a less informative insight into model performance.

Conclusions
Our paper describes an analysis concerning 114 runs of a CA-Markov simulation model, which were based on LULC maps derived from Corine Land Cover data and were generated with four aggregation methods. We analyzed the effects of the aggregation methods on the changes in the study areas, the Figure of Merit (FOM) and FOM's components: Wrong Hits, False Alarms, Misses and Hits. We have five main conclusions. First, the L1 and BB aggregations produced the fewest categories. Second, BB aggregation maintained the largest sizes of changes. Third, L1 had generally lower sizes of changes in the calibration, validation and simulation intervals. Fourth, the L1 medians of change were considerably lower; specifically, half of the medians were zero. Fifth, L1 had generally lower values in terms of all FOM components. Based on the results and the rules of the various aggregation methods, we warn users that the Corine standard level aggregation rules can eliminate sizeable changes. We recommend that users apply aggregation methods that reduce the number of categories while maintaining changes and not reducing the correctly simulated changes in the area. In our analysis, the behavior-based aggregation method met these goals. We also recommend that users calculate the FOM and FOM's components to gain important insights concerning the interaction of simulation model performance and changes in the reference data.