A Novel k-Means Clustering Based Task Decomposition Method for Distributed Vector-Based CA Models

More and more vector-based cellular automata (VCA) models have been built to leverage parallel computing to model rapidly changing cities and urban regions. During parallel simulation, common task decomposition methods based on space partitioning, e.g., grid partitioning (GRID) and recursive binary space partitioning (BSP), do not work well given the heterogeneity of VCA parcel tasks. In this paper, to solve this problem, we propose a novel task decomposition method for distributed VCA models based on k-means clustering, named KCP. Firstly, the polygon dataset is converted into points based on centroids, which combines the size of two parcels and the outer distance. A low-cost recursive quad-partition is then applied to decide the initial cluster centers based on parcel density. Finally, neighbor parcels can be allocated into the same subdivision through k-means clustering. As a result, the proposed KCP method takes both the number of tasks and computing complexity into consideration to achieve a well-balanced local workload. A typical urban VCA growth model was designed to evaluate the proposed KCP method with traditional spatial partitioning methods, i.e., GRID and BSP. KCP had the shortest total simulation time when compared with GRID and BSP. During experimental urban growth simulations, the time spent on a single iteration was reduced by 15% with the BSP and by 25% with the GRID method. The total simulation time with a 120 m neighborhood buffer size was reduced by more than one hour to around three minutes with 32 cores.


Introduction
Vector-based cellular automata (VCA) models extend traditional raster-based cellular automata (raster-based CA) models by using irregular vector polygons to represent actual geographic features, e.g., parcels and blocks [1].This extension relieves the sensitivity of a CA model to spatial resolution and breaks the limits on uniform neighborhood definition [2,3].VCA models have been widely used to simulate urban dynamics, e.g., land-use and land-cover changes, urban growth [4][5][6], urban planning [7], etc.However, large-scale and highly detailed VCA/CA models usually require massive computing resources to obtain timely simulation results.Thus, researchers are turning to parallel computing to accelerate these time-consuming model simulations, e.g., programs like pRPL [8,9], pSLEUTH [10], CAMEL [11].
In a given VCA model, irregular parcel polygons evolve through a number of discrete time steps following a set of transition rules based on the states of neighboring parcels.Thus, a basic parallel VCA (pVCA) task is equivalent to the corresponding polygon parcel.The computation of one pVCA task includes neighbor search and parcel status transition, and the amount is generally proportional to the area of a parcel.
The task decomposition in pVCA models can be formulated as typical space partitioning problems.There is a growing collection of space partitioning algorithms, which can be grouped according to how partitioning is conducted and classified into two categories; flat grid partitioning and hierarchical tree partitioning.Flat grid-based partitioning algorithms use a rectangular grid to partition the target space, thus directly obtaining discrete regular subdivisions.Hierarchical tree-based algorithms on the other hand, divide the target space into two or more disjoint subsets recursively, eventually producing a binary space partitioning (BSP) tree, e.g., KD-tree.
However, both flat grid-based and hierarchical tree-based algorithms are not easily applicable to pVCA task decomposition because neither the equal subdivision area nor equal task number can guarantee an equal workload.Flat grid partitioning can produce equal-area subdivisions, but the irregular shape of pVCA parcels leads to different task numbers for each subdivision.Conversely, hierarchical tree partitioning can easily produce equal task numbers by recursive division, but the workload of each subdivision is still unbalanced because of the varying computing complexity of pVCA tasks.
In this paper, a novel pVCA task decomposition algorithm based on general k-means clustering named KCP is proposed to overcome these drawbacks.The KCP method formulates a customized k-means clustering to cluster tasks with higher geographical proximity.This method uses parcel centroid to represent parcel polygons and defines a proximity distance combining parcel size and outer distance.Through an iterative process, the generated subdivisions will contain neighboring parcels, while, at the same time, the technique will separate large-sized parcels using the centroid distance.In this way, the KCP decomposition can take both task numbers and their computation complexity into consideration to obtain better workload balance.Since the workload of each subdivision depends on the number of allocated tasks, the computing complexity and communication overhead depend on the amount of ghost parcels, the NSD-PA (normalized standard deviation of the area of parcels), NSD-PN (normalized standard deviation of the number of parcels), and the number of total ghost agents that are applied to indicate local workload balance and the communication overhead.
An additional VCA-based urban growth model was developed to evaluate the efficiency of the proposed KCP algorithm.Further, we parallelized the model based on a bulk synchronous parallel model and adopted the ghost agent strategy to reduce information exchange frequency.Two groups of experiments were designed, with four subdivision sizes, 4, 8, 16, and 32, to test their scalability and three buffer sizes, 120 m, 240 m, and 360 m, to test their effects on the communication overhead.In addition, we made a detailed study on the local workload and communication overhead separately using the NSD-PN, NSD-PA, and the number of total ghost agents.The experimental results show that KCP employs the least local computing time with acceptable communication overhead and achieves the best parallel performance, compared with two common decomposition methods, GRID and BSP.The proposed KCP method can be used to partition spatial tasks of large-scale detailed VCA models effectively, which can shorten the computing time and improve the efficiency of adopting a VCA model.
The rest of the paper is outlined as follows.Section 2 provides a detailed introduction to VCA models and an overview of existing task decomposition methods with spatial-partitioning. Section 3 explains the proposed KCP task decomposition method.In Section 4, a parallel VCA urban growth model is presented.In Section 5, we evaluate and compare the KCP performance to the GRID and BSP performance.In Section 6, we present a discussion and draw some conclusions.

VCA Models
VCA originates from the idea of using irregular polygons instead of regular cells to represent real features in simulation models [1,2].A number of researchers have implemented this idea with various spatial segregation strategies, e.g., resolution elements [3], Voronoi polygons [4], Delaunay triangles, and cadastral parcel polygons [5][6][7][8].Among these approaches, parcel polygons, rather than a mixture of regular cells, most closely correspond to features in the real world.Thus, irregular representation of VCA models provides the capability to model more detailed and complex behaviors and interaction between features.
In contrast to the Moore or Von Neumann neighborhood definitions of raster-based CA models, a VCA neighborhood can be defined by a spatial relationship; adjacency or buffer.In a neighborhood, two parcel polygons sharing points or edges are considered adjacent neighbors.A buffer neighborhood is defined by the distance between parcel polygons.All the polygons within a user-defined buffer distance are considered neighbors.Dahal et al. [9] made a detailed study of the neighborhood sensitivity in irregular CA models using nine types of neighborhoods and buffer distances.
The transition functions in VCA models are usually derived from GIS suitability analysis, e.g., Neighborhood, Accessibility, Suitability, and Zone status (NASZ) [10].However, each parcel polygon contains a different number of neighbors, and they are also different in size and shape.These heterogeneities make the computing cost of each VCA task fluctuate very sharply.
Compared to raster CA, VCA model simulations are, however, computer-intensive because of irregular spatial representation, complex transition rules, and massive input polygons [8,11].Following parallel CA models, parallel computing technology is required to support large scale detailed VCA model simulation.During parallelization, task decomposition is critical to fully utilize computing and storage resources.

Task Decomposition Methods in Parallel VCA Models
The spatial task decomposition method determines the workload balance and communication overhead influencing the model parallel computing efficiency.The spatial tasks in pVCA models are mutually dependent as the status of neighbors is involved in each polygon transition.Moreover, the shape and size of corresponding polygons in a pVCA task are very different, and this cannot guarantee that equal task numbers or equal subdivision areas produce equal local workloads.Thus, pVCA task decomposition should firstly address task dependence to minimize the communication overhead and then take both task computing complexity and task number into consideration to obtain a balanced workload.
In practice, space partitioning strategies are usually used to conduct a near optimal pVCA task decomposition.Space partitioning divides the target space into a set of non-overlapping subdivisions, and then VCA polygons can be allocated to different subdivisions by their spatial relationships.Space partitioning algorithms can be categorized into flat grid-based methods and hierarchical tree-based methods according to the data structure applied in the partitioning process.

Flat Grid-Based Methods
Flat grid-based methods are widely adopted in raster-based CA models [12][13][14][15].These methods usually exhibit three forms; column-wise, row-wise, and grid-wise.The established grid firstly partitions the target space into flat regular cells and directly allocates simulation tasks to the corresponding cells.Several parallel CA libraries support these partition methods, e.g., pRPL, pRPL2, and CAMEL.Figure 1 shows an example of a grid-wise partition applied in a raster-based CA(1a) and in a VCA model(1b).Since raster-based CA models contain equal-sized uniform cells, task decomposition with this type of space partitioning can easily achieve both a balanced workload and a minimum communication overhead.However, when it is applied to VCA models, these methods will lead to an ill-balanced workload because of irregular polygons and heterogeneous distribution; hierarchical tree-based methods such as BSP overcome this limitation by recursively bisecting tasks.

Hierarchical Tree-Based Methods
Figure 2 illustrates the hierarchical space partitioning process by a binary space partitioning tree.When applied in pVCA task decomposition, the BSP tree can produce subdivisions with an equal number of tasks.However, the BSP tree method does not consider the task computing complexity and still cannot guarantee workload balance.In addition, BSP lacks a strong guarantee on the communication overhead, which makes it degenerate when exploring neighborhood configurations.

Hierarchical Tree-Based Methods
Figure 2 illustrates the hierarchical space partitioning process by a binary space partitioning tree.When applied in pVCA task decomposition, the BSP tree can produce subdivisions with an equal number of tasks.However, the BSP tree method does not consider the task computing complexity and still cannot guarantee workload balance.In addition, BSP lacks a strong guarantee on the communication overhead, which makes it degenerate when exploring neighborhood configurations.

Hierarchical Tree-Based Methods
Figure 2 illustrates the hierarchical space partitioning process by a binary space partitioning tree.When applied in pVCA task decomposition, the BSP tree can produce subdivisions with an equal number of tasks.However, the BSP tree method does not consider the task computing complexity and still cannot guarantee workload balance.In addition, BSP lacks a strong guarantee on the communication overhead, which makes it degenerate when exploring neighborhood configurations.

k-Means Clustering Method
The k-means clustering method is one of the most widely used clustering algorithms because of its simplicity, efficiency, and empirical success [16].Given a set of samples (x1, x2, …, xn) in which each sample is a R d real vector, k-means clustering partitions n samples into k sets S = {S1, S2, …, Sk}, where (k ≤ n) and minimizes the within-cluster sum of squares (WCSS), i.e. the sum of the distance functions of each point within one cluster of the k center, as in Equation (1), where μi is the cluster center in Si.The kernel of k-means clustering is the user-defined distance function, which provides a quantitative measurement of proximity between features.The basic k-means clustering algorithm has been extended in many different ways, including soft membership in Fuzzy c-means [17,18], recursively hierarchical divisive bisecting of the k-means [19], automatically critical based k finding in X-means [20], etc. Readers can refer to [16] for a more detailed review.

The KCP Task Decomposition Method
In VCA models, both adjacent and buffer neighborhood definitions follow a common idea that closer features have a more important effect on the center feature.Inspired by this approach, the proposed KCP method formulates a new k-means clustering model to decompose VCA tasks.The KCP method consists of three phases; mapping, initial centers choosing, and clustering.The mapping phase converts polygon-based tasks into point types.The second phase chooses initial centers for each future cluster.The clustering phase carries out iterative clustering computation. (

1) Mapping phase
As the k-means clustering algorithm requires point input, a mapping is firstly conducted to convert VCA polygons into points representing the centroid of its minimum bounding box.Shown in Figure 3, the distance d between polygons consists of three parts: d1 and d3 represent the sizes of two polygons, and d2 is the actual outer spatial distance between parcels.By this distance definition, two bigger parcels are more likely to be separated.For example, parcel A and B are big parcels, and parcel C is small.Although A and B have shorter outer distances than A and C, nevertheless, A and C are more likely to be allocated into the same subdivision.Furthermore, A and B are more likely to

k-Means Clustering Method
The k-means clustering method is one of the most widely used clustering algorithms because of its simplicity, efficiency, and empirical success [16].Given a set of samples (x 1 , x 2 , . . ., x n ) in which each sample is a R d real vector, k-means clustering partitions n samples into k sets S = {S 1 , S 2 , . . ., S k }, where (k ≤ n) and minimizes the within-cluster sum of squares (WCSS), i.e., the sum of the distance functions of each point within one cluster of the k center, as in Equation (1), where µ i is the cluster center in S i .The kernel of k-means clustering is the user-defined distance function, which provides a quantitative measurement of proximity between features.The basic k-means clustering algorithm has been extended in many different ways, including soft membership in Fuzzy c-means [17,18], recursively hierarchical divisive bisecting of the k-means [19], automatically critical based k finding in X-means [20], etc. Readers can refer to [16] for a more detailed review.

The KCP Task Decomposition Method
In VCA models, both adjacent and buffer neighborhood definitions follow a common idea that closer features have a more important effect on the center feature.Inspired by this approach, the proposed KCP method formulates a new k-means clustering model to decompose VCA tasks.The KCP method consists of three phases; mapping, initial centers choosing, and clustering.The mapping phase converts polygon-based tasks into point types.The second phase chooses initial centers for each future cluster.The clustering phase carries out iterative clustering computation. (

1) Mapping phase
As the k-means clustering algorithm requires point input, a mapping is firstly conducted to convert VCA polygons into points representing the centroid of its minimum bounding box.Shown in Figure 3, the distance d between polygons consists of three parts: d 1 and d 3 represent the sizes of two polygons, and d 2 is the actual outer spatial distance between parcels.By this distance definition, two bigger parcels are more likely to be separated.For example, parcel A and B are big parcels, and parcel C is small.Although A and B have shorter outer distances than A and C, nevertheless,  Thus, this mapping strategy exhibits two advantages: (1) two close big polygons may end up with a large distance, and thus these two polygons are likely to be allocated into two different subdivisions and (2) though two dispersed small polygons have share a smaller distance, they are likely to be allocated into the same subdivisions.Since bigger polygons usually create more computation than smaller ones when geometry computation involved, these two traits can ensure a more balanced local workload among subdivisions. (

2) Choosing the k initial cluster centers
The total number of clusters, k, and initial centers are important for convergence speed and final clustering results.There are many customizations designed to find a cluster number or initial centers, such as canopy clustering and hierarchical clustering.As for task decomposition, k can be trivially defined as the number of processors.Usually initial cluster centers should be picked from the high-density areas of input points.In this paper, a simple recursive 4-way method was designed to determine the initial cluster centers.Shown in Figure 4, it partitions the whole space into 4 level sub grids and records the point number in each sub grid.For a given cluster number k, the geometric centers with the k highest values are chosen as the initial cluster centers.(3) Clustering phase Thus, this mapping strategy exhibits two advantages: (1) two close big polygons may end up with a large distance, and thus these two polygons are likely to be allocated into two different subdivisions and (2) though two dispersed small polygons have share a smaller distance, they are likely to be allocated into the same subdivisions.Since bigger polygons usually create more computation than smaller ones when geometry computation involved, these two traits can ensure a more balanced local workload among subdivisions.
(2) Choosing the k initial cluster centers The total number of clusters, k, and initial centers are important for convergence speed and final clustering results.There are many customizations designed to find a cluster number or initial centers, such as canopy clustering and hierarchical clustering.As for task decomposition, k can be trivially defined as the number of processors.Usually initial cluster centers should be picked from the high-density areas of input points.In this paper, a simple recursive 4-way method was designed to determine the initial cluster centers.Shown in Figure 4, it partitions the whole space into 4 level sub grids and records the point number in each sub grid.For a given cluster number k, the geometric centers with the k highest values are chosen as the initial cluster centers.Thus, this mapping strategy exhibits two advantages: (1) two close big polygons may end up with a large distance, and thus these two polygons are likely to be allocated into two different subdivisions and (2) though two dispersed small polygons have share a smaller distance, they are likely to be allocated into the same subdivisions.Since bigger polygons usually create more computation than smaller ones when geometry computation involved, these two traits can ensure a more balanced local workload among subdivisions. (

2) Choosing the k initial cluster centers
The total number of clusters, k, and initial centers are important for convergence speed and final clustering results.There are many customizations designed to find a cluster number or initial centers, such as canopy clustering and hierarchical clustering.As for task decomposition, k can be trivially defined as the number of processors.Usually initial cluster centers should be picked from the high-density areas of input points.In this paper, a simple recursive 4-way method was designed to determine the initial cluster centers.Shown in Figure 4, it partitions the whole space into 4 level sub grids and records the point number in each sub grid.For a given cluster number k, the geometric centers with the k highest values are chosen as the initial cluster centers.(3) Clustering phase In order to determine the partitioning level, let ∂ ≥ 1 indicate the selection freedom factor (SFF); then the least total sub grid numbers is n = ∂k and the least partition level is level = log n 4 .For example, k is 4, ∂ is 4, and the minimum number of sub grids is 16, which means the minimum partition level is 2.
(3) Clustering phase Each subdivision has a center at C j , which represents the geometric center of all the spatial tasks belonging to this subdivision.The goal of k-means clustering is to minimize the value J in Equation (2).
where pred i->j is a criterion function.If the feature X i belongs to the subdivision C j , the pred i->j is 1; otherwise it is 0. Repeat the following three steps until the cluster members stabilize.
(a) Generate a new clustering by assigning each spatial task to its closest cluster center according to Equation (3).
(b) Compute new cluster centers following Equation ( 4) (c) Check if the current clustering results are stable according to the criteria in Equation ( 5).If true, go to step (a) and repeat the assignment operation; otherwise the clustering is finished.

A Parallelized Urban Growth Model Based on VCA
In order to evaluate the KCP method, a parallel urban growth model was designed to study the performance of KCP on workload-balance and communication overhead.

The VCA Based Urban Growth Model
The design of the VCA model is illustrated in Figure 5.The model input contains a collection of data layers related to urban growth, including transportation, terrain, and parcels.The model output represents a future urban growth area.The whole urban growth simulation can be viewed as an iteration process; each iteration spans a period of time, e.g., a month, a quarter, or a year.During iteration, the new status of each polygon is derived according to the status and its neighbor information.
In this model, each parcel polygon has two statuses: undeveloped and developed.The initial parcels status of each polygon was extracted from the initial urban map by overlap analysis.The polygons covered by the initial urban area were assigned to developed parcels; otherwise they were set as undeveloped.A buffer neighborhood was included in the neighbor definition, and three buffer distances (120 m, 240 m, and 360 m) were applied to evaluate how the neighborhood size affected the communication overhead.
As shown in Figure 6, the transition status of a parcel contains two phases; calculating the transition probability to developed status and conducting a transition rule test.The NASZ scheme was used to calculate the transition probability of the parcels.This scheme contains three steps; accumulate the effect of neighbors, evaluate the suitability of a parcel on its own, and synthesize these two results.The derived transition probability combines its current conditions, the status of its neighbors, and a set of suitability criteria in Equation ( 6),   The derived transition probability combines its current conditions, the status of its neighbors, and a set of suitability criteria in Equation ( 6),  The derived transition probability combines its current conditions, the status of its neighbors, and a set of suitability criteria in Equation ( 6), where, C t i is the current condition of parcel i at time t − 1 and F g is the global factor indicating the external driving force, such as positive policy, the vitality of the city, etc.The second term cond(C i ) is a function that calculates the suitability of a parcel for development, based on a set of suitability criteria, including the distance to major roads, the slope, and area.The last accumulated term represents the accumulated effect of neighboring parcels on the center parcel, e.g., if there are more developed cells around it, it is more likely to be developed.
After computing transition probability, a transition rule test is applied to all parcels.Generally, there are two kinds of transition rule tests, i.e., threshold and stochastic tests.The threshold type uses a user-specified threshold value to determine if the transition probability of a parcel can pass the test.In contrast, the stochastic type of rule test applies a random number generator to determine if a parcel can make a transition.In this way, the stochastic type of test can introduce random factors to an urbanization process, which can simulate uncertainties such as policy planning, environment change, etc.A threshold transition rule was applied in our urban growth model parallelization testing design.

Research Area and Input Datasets
The research area covers a small city near San Bernardino, California, with area of around 207.823 km 2 .The major parcel dataset used in this case study was downloaded from the Internet [21] and contains 38820 parcels, as shown in Figure 7. a function that calculates the suitability of a parcel for development, based on a set of suitability criteria, including the distance to major roads, the slope, and area.The last accumulated term represents the accumulated effect of neighboring parcels on the center parcel, e.g. if there are more developed cells around it, it is more likely to be developed.After computing transition probability, a transition rule test is applied to all parcels.Generally, there are two kinds of transition rule tests, i.e. threshold and stochastic tests.The threshold type uses a user-specified threshold value to determine if the transition probability of a parcel can pass the test.In contrast, the stochastic type of rule test applies a random number generator to determine if a parcel can make a transition.In this way, the stochastic type of test can introduce random factors to an urbanization process, which can simulate uncertainties such as policy planning, environment change, etc.A threshold transition rule was applied in our urban growth model parallelization testing design.

Research Area and Input Datasets
The research area covers a small city near San Bernardino, California, with area of around 207.823 km 2 .The major parcel dataset used in this case study was downloaded from the Internet [21] and contains 38820 parcels, as shown in Figure 7.The detailed attributes of other relevant datasets are also listed in Table 1.The detailed attributes of other relevant datasets are also listed in Table 1.As described in Section 4.1, the kernel of an urban growth model is to evaluate the new status of each parcel iteratively, and each parcel is equal to an independent task unit in the whole simulation process.During parallelization, the task decomposition method, e.g., KCP, GRID, or BSP, divides all VCA tasks into discrete groups, and each processing element is in charge of one group simulation.
A bulk synchronous parallel strategy was used to coordinate the computation and communication during parallel simulation [22], consisting of local computing and synchronization in each iteration.In the local computing phase, the local status of each parcel was computed according to the transition function.The synchronization phase conducts neighboring information exchange between processors.The parallel VCA urban growth model is illustrated in Figure 8.

Parallelization of the VCA-Based Urban Growth Test Model
As described in Section 4.1, the kernel of an urban growth model is to evaluate the new status of each parcel iteratively, and each parcel is equal to an independent task unit in the whole simulation process.During parallelization, the task decomposition method, e.g.KCP, GRID, or BSP, divides all VCA tasks into discrete groups, and each processing element is in charge of one group simulation.
A bulk synchronous parallel strategy was used to coordinate the computation and communication during parallel simulation [22], consisting of local computing and synchronization in each iteration.In the local computing phase, the local status of each parcel was computed according to the transition function.The synchronization phase conducts neighboring information exchange between processors.The parallel VCA urban growth model is illustrated in Figure 8. Neighboring information exchanges result in additional communication overhead.In order to reduce the communication overhead, ghost parcels [23,24] were created to lower the status exchange frequency between cells.Figure 9 illustrates how ghost parcels reduce the communication overhead [14].To avoid requesting the status of remote parcels during each iteration, current processing nodes will cache selected parcels on the local process; these are called 'ghost parcels'.Each ghost parcel acts as a copy of an original remote parcel.Local process can query the status of a ghost parcel but does not have permission to update their status.Their status can be only updated during the synchronization phase when the original parcels are changed.Neighboring information exchanges result in additional communication overhead.In order to reduce the communication overhead, ghost parcels [23,24] were created to lower the status exchange frequency between cells.Figure 9 illustrates how ghost parcels reduce the communication overhead [14].To avoid requesting the status of remote parcels during each iteration, current processing nodes will cache selected parcels on the local process; these are called 'ghost parcels'.Each ghost parcel acts as a copy of an original remote parcel.Local process can query the status of a ghost parcel but does not have permission to update their status.Their status can be only updated during the synchronization phase when the original parcels are changed.The simulation was implemented in a parallel simulation framework, 4D-SAS [25].Its GIS-enabled functionalities provided essential support for geometric computing required by the VCA test model.The 4D-SAS framework accommodated the proposed KCP task decomposition method and the ghost-parcel strategy to reduce the communication overhead.

Case study and Experiments
In this paper, two groups of experiments were carried out with the designed parallel VCA urban growth model to evaluate the proposed KCP algorithm.The evaluations compared the proposed KCP method with two other task decomposition methods, GRID, and BSP.The first group used different numbers of divisions, i.e. 4, 8, 16, and 32, to test scalability.The second group was configured with different buffer distances, i.e. 120 m, 240 m, and 360 m, to inspect communication overhead sensitivity.The 4D-SAS simulation framework was deployed on a high-performance computing cluster, consisting of five machines configured as follows; one was used as a master node, three machines were used as simulation engines, and the last one was used for input/output dataset storage.All the servers were directly connected by a dedicated 1 Gbps Ethernet.The hardware and OS of these servers are listed in Table 2.

The Total Simulation Time
In order to evaluate the overall parallel efficiency of KCP, GRID, and BSP, the total simulation time of a 24-year simulation (24 iterations, each iteration represents a year) were recorded as in Table 3.The simulation was implemented in a parallel simulation framework, 4D-SAS [25].Its GIS-enabled functionalities provided essential support for geometric computing required by the VCA test model.The 4D-SAS framework accommodated the proposed KCP task decomposition method and the ghost-parcel strategy to reduce the communication overhead.

Case Study and Experiments
In this paper, two groups of experiments were carried out with the designed parallel VCA urban growth model to evaluate the proposed KCP algorithm.The evaluations compared the proposed KCP method with two other task decomposition methods, GRID, and BSP.The first group used different numbers of divisions, i.e., 4, 8, 16, and 32, to test scalability.The second group was configured with different buffer distances, i.e., 120 m, 240 m, and 360 m, to inspect communication overhead sensitivity.The 4D-SAS simulation framework was deployed on a high-performance computing cluster, consisting of five machines configured as follows; one was used as a master node, three machines were used as simulation engines, and the last one was used for input/output dataset storage.All the servers were directly connected by a dedicated 1 Gbps Ethernet.The hardware and OS of these servers are listed in Table 2.

The Total Simulation Time
In order to evaluate the overall parallel efficiency of KCP, GRID, and BSP, the total simulation time of a 24-year simulation (24 iterations, each iteration represents a year) were recorded as in Table 3.As seen in Table 3, KCP achieves the best parallel efficiency with the least simulation time; about 15% less than BSP and 25% less than GRID.The total simulation time for KCP was reduced from over one hour to about three minutes, using 32 cores with a 120 m neighborhood buffer.The simulation results after 24 iterations are shown in Figure 10.Compared to Figure 9, a large part of the parcels has become developed.Parcels near the initial urban area and with good transportation are more likely to be developed.Those areas with limitations, e.g., a protected zone or steep slope, were less likely to be developed.As seen in Table 3, KCP achieves the best parallel efficiency with the least simulation time; about 15% less than BSP and 25% less than GRID.The total simulation time for KCP was reduced from over one hour to about three minutes, using 32 cores with a 120 m neighborhood buffer.The simulation results after 24 iterations are shown in Figure 10.Compared to Figure 9, a large part of the parcels has become developed.Parcels near the initial urban area and with good transportation are more likely to be developed.Those areas with limitations, e.g. a protected zone or steep slope, were less likely to be developed.

Execution Time in a Single Iteration
In order to understand how the computation workload and communication overhead affect parallel efficiency, the time of local computing and synchronization were recorded separately with different subdivisions and buffer sizes.The results were recorded as shown in Tables 4 and 5.As the buffer size increased, both the local computing time and the synchronization time for all methods increased, which means the computing of each parcel increases as more neighbors become involved.As compared to GRID and BSP, the local computing time of KCP was shorter because both the number of tasks and the computing complexity were considered, based on the centroid distance in KCP.Considering synchronization, KCP takes less time than BSP and more time than GRID, as the communication overhead was lower than that of BSP and higher than that of GRID in terms of the total number of ghost agents.

Local Workload Comparison of Different Partition Methods
We assumed that large-sized parcels are surrounded by more neighboring parcels, involving more geometric computing.In this paper, the number of allocated parcels and their total area were examined to analyze workloads obtained from GRID, BSP, and KCP.
(1) The payload distribution of subdivision parcels Firstly, the normalized number of parcels nn i in each subdivision was calculated, and the corresponding normalized standard deviation of the number of parcels (NSD-PN) θ was computed according to Equation (7), where K is the subdivision size, i.e., 4, 8, 16, or 32, and N represents the total number of tasks.N i is the allocated number of tasks for subdivision i. Shown in Table 6 are the results from the BSP method that generates an equal number of parcels for each subdivision.Compared with GRID, KCP results in a better distribution of parcels.As the subdivision number increased, the NSD-PN θ value of GRID and KCP both decreased.(2) The subdivision parcel area distribution The same as the NSD-PN calculation, the first step is to compute normalized allocated area ns j of each subdivision, and then the normalized standard deviation of the resulting partition areas (NSD-PA) µ is computed according to Equation (8), where s i represents the area of parcel i, and ns j represents the ration of allocated parcels area for subdivision j.Shown in Table 7, the GRID method resulted in a well-balanced distribution of task areas among subdivisions, while the BSP method resulted in an unbalanced workload.In addition, the µ value decreased as the subdivision sizes increased.As shown in Tables 6 and 7, GRID generated the highest NSD-PA and the lowest NSD-PN, while the results produced by BSP were the reverse of the GRID task allocation results.The proposals by KCP were moderate in both aspects.The GRID method weights more spatial task complexity over number of spatial tasks.In contrast, the BSP method focuses on balancing the number of spatial tasks allocated among subdivisions.The KCP method makes a compromise between these two factors, which can achieve the optimum local workload balance.

Global Communication Overhead Comparison of Partition Methods
As described in Section 4.2, the status synchronization of ghost parcels is the major source of the communication overhead.The number of ghost parcels was used to measure the communication overhead among the three task decomposition methods.
Shown in Table 8, KCP leads to less ghost parcels than BSP but more than GRID.As the subdivision size increased, the ghost parcel number of KCP increased more slowly than that of GRID and BSP.In addition, for the same number of subdivisions, as the buffer size increased, KCP also increased more slowly than GRID and BSP.Therefore, KCP is less sensitive to subdivision and buffer size, which makes it suitable for exploring the effect of neighborhood factors.As shown by the experiment results, the KCP method achieved the most efficient parallel performance in comparison to GRID and BSP.However, due to the inherent empirical success of k-means, the KCP method cannot guarantee optimal task decomposition.Moreover, there is no control on the weighting between the local workload balance and the communication overhead, which is critical for the parallelization efficiency of a VCA model operating on a heterogeneous computing cluster.

Conclusions
Compared to raster-based CA models, task decomposition in VCA models is more complex because of heterogeneous parcel distributions and task computing complexity.Existing task decomposition methods employ space partitioning directly (e.g., GRID and BSP) to decompose simulation tasks and thus cannot guarantee a balanced workload and minimum communication.In this paper, a novel task decomposition method based on k-means clustering is proposed to partition tasks in VCA models.The KCP method adopts the centroid distance of parcel polygons as a measurement of proximity, which enables both the task numbers and their computing complexity to be taken into consideration.Illustrated by our experimental results, KCP obtains the optimum workload balance with an acceptable communication overhead, therefore achieving high parallel efficiency.KCP can be applied as an effective approach to do task decomposition in parallelization of large-scale detailed VCA, where the variety in computing units cannot be ignored.In the future, we will conduct a detailed study of KCP performance in urban growth models based on VCA involving geometric change.We will add weights to the local workload balance and communication overhead to make task decomposition more responsive to the characteristics of heterogeneous computing clusters.

Figure 2 .
Figure 2. Spatial tasks partitioning with a binary space partitioning (BSP) tree of (a) raster-based CA and (b) VCA.

Figure 2 .
Figure 2. Spatial tasks partitioning with a binary space partitioning (BSP) tree of (a) raster-based CA and (b) VCA.

A
and C are more likely to be allocated into the same subdivision.Furthermore, A and B are more likely to be allocated into different subdivisions because A and C have a smaller combination distance than A and B under this mapping strategy.ISPRS Int.J. Geo-Inf.2017, 6, 93 6 of 17 ISPRS Int.J. Geo-Inf.2017, 6, x; doi: FOR PEER REVIEW www.mdpi.com/journal/ijgibe allocated into different subdivisions because A and C have a smaller combination distance than A and B under this mapping strategy.

Figure 3 .
Figure 3.An illustration of distance definition between parcels.

Figure 3 .
Figure 3.An illustration of distance definition between parcels.
ISPRS Int.J. Geo-Inf.2017, 6, 93 6 of 17 ISPRS Int.J. Geo-Inf.2017, 6, x; doi: FOR PEER REVIEW www.mdpi.com/journal/ijgibe allocated into different subdivisions because A and C have a smaller combination distance than A and B under this mapping strategy.

Figure 3 .
Figure 3.An illustration of distance definition between parcels.

Figure 5 .
Figure 5.A typical urban growth model based on GIS suitability analysis.

Figure 6 .
Figure 6.The evolution flowchart of each parcel polygon.

i
is the current condition of parcel i at time 1 t− and Fg is the global factor indicating the external driving force, such as positive policy, the vitality of the city, etc.The second term cond(Ci) is

Figure 5 .Figure 5 .
Figure 5.A typical urban growth model based on GIS suitability analysis.

Figure 6 .
Figure 6.The evolution flowchart of each parcel polygon.
current condition of parcel i at time 1 t− and Fg is the global factor indicating the external driving force, such as positive policy, the vitality of the city, etc.The second term cond(Ci) is

Figure 6 .
Figure 6.The evolution flowchart of each parcel polygon.

Figure 7 .
Figure 7.The reasearch area and experimental parcel dataset.

Figure 7 .
Figure 7.The reasearch area and experimental parcel dataset.

Figure 8 .
Figure 8. Parallelization of the urban growth model based on a bulk synchronous parallel strategy.

Figure 8 .
Figure 8. Parallelization of the urban growth model based on a bulk synchronous parallel strategy.

Figure 9 .
Figure 9.The ghost agent strategy used in the parallelized VCA model.

Figure 9 .
Figure 9.The ghost agent strategy used in the parallelized VCA model.

Figure 10 .
Figure 10.Simulation results of the designed urban growth model.

Figure 10 .
Figure 10.Simulation results of the designed urban growth model.

Table 1 .
Basic information of the dataset used in the VCA-based urban growth model.

Table 1 .
Basic information of the dataset used in the VCA-based urban growth model.

Table 2 .
Detailed configuration of the cluster servers.

Table 2 .
Detailed configuration of the cluster servers.

Table 3 .
Total computing time for 24 iterations.

Table 3 .
Total computing time for 24 iterations.

Table 4 .
Average local computing time for an iteration.

Table 5 .
Average synchronization time for an iteration.

Table 6 .
Normalized standard deviation of the number of parcels (NSD-PN) of different patitioning methods.

Table 7 .
Normalized standard deviation of the resulting partition areas (NSD-PA) of different patitioning methods.

Table 8 .
Global communication overhead in the total ghost agent number.