A Parallel-Computing Approach for Vector Road-Network Matching Using GPU Architecture

The road-network matching method is an effective tool for map integration, fusion, and update. Due to the complexity of road networks in the real world, matching methods often contain a series of complicated processes to identify homonymous roads and deal with their intricate relationship. However, traditional road-network matching algorithms, which are mainly central processing unit (CPU)-based approaches, may have performance bottleneck problems when facing big data. We developed a particle-swarm optimization (PSO)-based parallel road-network matching method on graphics-processing unit (GPU). Based on the characteristics of the two main stages (similarity computation and matching-relationship identification), data-partition and task-partition strategies were utilized, respectively, to fully use GPU threads. Experiments were conducted on datasets with 14 different scales. Results indicate that the parallel PSO-based matching algorithm (PSOM) could correctly identify most matching relationships with an average accuracy of 84.44%, which was at the same level as the accuracy of a benchmark—the probability-relaxation-matching (PRM) method. The PSOM approach significantly reduced the road-network matching time in dealing with large amounts of data in comparison with the PRM method. This paper provides a common parallel algorithm framework for road-network matching algorithms and contributes to integration and update of large-scale road-networks.


Road-Network Matching
Road-network matching is a key technology of vector road-map integration, fusion, and update [1].The main task of road-network matching is building the corresponding relationship of road-object pairs that represent the same segment of real-world road in heterogeneous road maps [2].It has significant potential for the timely and cost-effective updating of road-network data and geographic-information science applications (e.g., vehicle-navigation products) [3].
The core of the road-network matching method is evaluating the similarity of two node/road features and determining the corresponding relationship of matching pairs.
(1) Similarity quantifies the similar degree of two features, providing the basis for determining the matching relationship of homonymous objects [1].The similarity-evaluation function varies with the matching unit.In earlier studies, the matching unit is usually the limited local context around nodes or road segments.Such similarities as the Hausdorff distance or the Fréchet distance of homonymous nodes or road segments and the topological structure of intersections are mainly considered in these approaches [4][5][6][7][8].Some researchers also integrated multi-factors (e.g., length, orientation, and number of topological connections) for road-segment matching [2,9].However, matching a unit at the node/road segment scale may bring about local optimization problems, limiting matching accuracy.Some studies pay more attention to the structural information of the road network and shift the matching unit to a larger scale, such as junction/segment clusters, for global optimization [3,10].It makes the similarity measures more comprehensive but more complicated.(2) When the similarity of matching units is evaluated, the matching relationship of the nodes/road segments can be determined.Most earlier methods rely on a sequential greedy strategy [1].This strategy intends to achieve the local optimum by comparing similarity with an experiential threshold [4,11,12].To overcome the limitation of the local optimum, several optimization strategies have been proposed to find a global optimal solution from all possible matching choices [8,13].In these strategies, the objective function maximizes total similarity among all matched feature pairs.Global optimal solutions have greatly promoted matching accuracy.However, increased computational complexity also degrades the performance of the matching method.
To date, most studies on road-network matching focus on evaluating the effectiveness of the matching model and promoting matching accuracy.Few studies have mentioned the performance of the matching method, and are listed in Table 1.The amount of data in most studies is small, fewer than 1000 nodes/segments.Time cost varies from seconds to hours.Even at the same data scale, the time cost is different.Reasons that affect the computational costs can be summarized in three points.(1) The scale of road-network data.Dealing with a larger scale of road-network data requires a higher time cost, because the number of objects that need to be computed increases.(2) The complexity of the matching unit and the similarity function.Compared with a simple matching unit (e.g., road segment or node), the similarity function of complex matching units (e.g., node/segment cluster) is always more complex in order to take more factors into careful consideration.Thus, complex matching units often need more computing time.(3) The number of iterations and convergence rate.Optimal solutions are usually determined iteratively in the previous literature.For instance, the probabilistic relaxation method iterates to find the stable matching probability matrix, and its computing time is closely related to convergence rate and the number of iterations [14].
From a data perspective, dividing the data into smaller units can offer a potential for accelerating matching efficiency by parallel computation [1].

Research Object
Numerous algorithms have been proposed to solve the road-network matching problem [1][2][3][4][5]8,13,14,19].The accuracy of road-networks matching is greatly promoted in previous studies.However, the performance of the matching method is less noticed.As the sizes of the databases increase, the performance issue becomes increasingly prominent.Parallel techniques are well-suited to accelerate matching computation.Dividing large areas into several sub-regions could be an effective way for faster computing with parallel computation [1].Graphics-processing units (GPU) that have hundreds, even thousands, of integrated cores in a single chip are perfect for solving problems that can be partitioned into independent and smaller parts.
Accordingly, to perform the road-network matching in an effective and efficient manner, a parallel matching strategy using GPU architecture is developed in this paper, and particle-swarm optimization (PSO) is introduced for the global optimization of the matching process.PSO is straightforwardly parallelizable and has shown excellent convergence rates, which can further reduce the computing time.It has the advantage of concurrently evaluating a model population, and particles can be evenly distributed to multiple threads/processes [20].The remainder of this article is organized as follows.Section 2 introduces the model construction of the particle-swarm optimization-based matching (PSOM), including the matching unit introduction, the mapping of matching process to particle flight process and the workflow of PSOM.Section 3 introduces the parallel strategies of similarity computation and matching identification in PSOM.Section 4 describes the accuracy evaluation indices.Experiments conducted on 14 datasets with different scales are presented in Section 5.1.Parameter settings of particle and thread are discussed in Section 5.2; algorithm accuracies are analyzed in Section 5.3; and Section 5.4 examines the speed-up ratio of PSOM under the GPU environment, comparing with another global optimization strategy, the probability-relaxation algorithm (PRM).The conclusions are drawn in Section 6.

Matching Unit
The node/road-segment-matching unit may result in local optimization problems.In this paper, the "stroke", a natural morphological extension of continuous road segments, is employed as the matching unit.Strokes can serve as reference to help locate homonymous road segments and handle a fuzzy matching relationship of road segments.A hierarchical stroke structure is also a global constraint that can avoid local matching errors caused by nonsystematic bias.

Model Construction of PSOM
PSO has the advantage of global optimization and parallelization.This section provides a detailed solution of how to use the PSO to solve road-network matching problems.
(1) Particle Swarm The first step of PSO is to generate a swarm composed of several models (position vector) in the model parameter space.The initial models can be defined by a given random distribution.Each model is represented by a particle that interacts with its neighborhood to find the global minimum of the misfit function.
The swarm is composed of m particles, which means m potential-candidate solutions in the solution space.The swarm in the nth iteration is denoted as , where P m (n) is the mth particle in the nth iteration.
Suppose the dimension of one particle is d, the ith particle can be represented as position vector where p ij (n)denotes the position of j-dimension for particle i.
Road-network matching is an optimization problem in d (the number of nodes in road network) dimensional space.In this paper, one particle corresponds to one solution space/model for the road-network matching.A particle's position in d-dimensional space is a solution composed of d matching pairs.
We denote the road network with fewer nodes as R a , and the road network with more nodes as R b .The nodes of R a are labeled as N a , and the nodes of R b are labeled as N b .The size of N a is viewed as the dimension d of the model, and each node in N a corresponds to one dimension of the model.Actually, the size of N a is used to define the size of position vector P i (n) that is denoted as dimension d.Then, the position of the ith dimension for particle i, p ij (n) is represented by a matching pair (n aj , n bx ), n aj ∈ N a , n bx ∈ N pb ∈ N b .n aj is the jth element in N a and n bx is any node from the potential candidate nodes set N pb, which also belongs to N b.
Specifically, suppose feature n aj in R a has k potential candidate features {n c1 ,n c2 , . . ., n ck } in R b , then all potential matching pairs for n aj constitute set PSet aj {(n aj , n c1 ), (n aj , n c2 ), . . ., (n aj , n ck )}.Each (n aj , n bx ) denotes one matching pair of R a and R b , and represents the position for one specific dimension.Thus, the set of (n aj , n bx ), which is composed of m matching pairs, constitutes a specific solution space, and corresponds to a specific position of the particle in the M-dimensional space.The matching pair (n aj , n bx ) updates during the flight process in terms of iterative function; therefore, the position of the particle and the model vector change.
(2) Particle Velocity Particle velocity controls how a particle moves in the model parameter space.The velocity vector represents the changing degree of the particle position.Velocity vector is adjusted according to its own personal best model and the global best model of the whole swarm.It is represented by is the velocity of particle i in the jth dimension on the nth iteration.Here, v ij (n) represents the updated extent of the jth matching pair, which will decide the next matching pair for n aj.
(3) Particle Fitness Value Particle fitness value is determined by the optimization objective, which is used to evaluate the optimization extent.The particle fitness value is the final optimized solution when the iteration process ends.In our problem, the optimization objective is the global similarity of R a and R b.Here, the fitness of one particle is represented by the sum of the similarity of each matching pair in a particle.The greater the fitness value is, the better the matching results are.
The fitness value of particle i in the nth iteration is represented as , in which f ij (n) is the fitness value for the jth dimension.The similarity-measure model Sim(n aj , n bx ) is used to calculate f ij (n) for jth matching pair (n aj , n bx ) in the particle (see Appendix A).Note that here f ij (n) is a universal framework that could fit different kinds of similarity functions.
Sim(n aj , n bx ) is the similarity between node n aj and n bx .f ij (n) quantitatively describes the similarity between node n aj and n bx .
(4) Individual Optimum Extremum Individual optimum extreme refers to the optimal solution in its multiple iterations for a single particle, which is also called the personal best model.The formula is as follows: where P cbest i (n) is the maximum PFV chosen from the initial state to the current iteration ci for particle i. P cbest i (n) is also represented as vector , where P cbest ij (n) is the value on the jth dimension.In this way, the local optimization of each matching pair is also achieved by updating the personal best model of each particle.Thus, the individual optimum extreme refers to a set of matching pairs with individual best fitness.
(5) Global Optimal Extremum Global extreme value refers to the optimal particle in the process of searching from the initial state to the current iteration for the whole particle swarm, which is also called the global best model.In each iteration, the particle with maximum fitness PFV is selected from the particle swarm.P gbest (n) is continuously updated in the iteration process.
The global optimal extremum on the nth iteration is represented as P gbest (n) = [P (n) is the value on the jth dimension.In this way, the flying process maximizes the total similarity of all matched feature pairs of R a and R b .When the iteration ends, P gbest (n) is the optimal matching result for R a and R b .
(6) Particle Flying The principle of particle flying is illustrated in Figure 1; the velocity of particle i, v ij (n + 1) is decided by three parts: (1) velocity in the last iteration, which inherits the previous velocity; (2) distance between the current position of particle i and the optimal position of particle i, which represents the cumulative experience of particle i; and (3) distance between the current position of particle i and the global optimal particle in the swarm, which represents the sharing experience from the society.In this way, other particles gradually fly to the state of global optimum.
The velocity and position of each particle are updated as follows: where v ij (n + 1) means the velocity of particle i on the jth dimension at iteration n + 1. w is an inertia weight, and c 1 and c 2 are two acceleration parameters that, respectively, control the cognition and social interactions of the particles.The velocity and position of each particle are updated as follows: where vij(n + 1) means the velocity of particle i on the jth dimension at iteration n + 1. w is an inertia weight, and c1 and c2 are two acceleration parameters that, respectively, control the cognition and social interactions of the particles.r1 and r2 are uniform random-number vectors drawn at iteration n, P cbest ij (n) is the personal best model of particle i on the jth dimension at iteration n, and P gbest j (n) is the global best model of the swarm on the jth dimension at iteration n.Through particle iteration, the optimal similarity matching-pair sequence is gradually formed.
Then, the next position of each particle is calculated by Formula ( 6): To identify the matching pair, we consider all the features in Ra as the anchor point; in other words, naj in the jth matching pair (naj, nbx) is fixed.The objective is to find the correct matching feature from Npb for each feature naj in Ra.Since each feature naj in Ra has the potential matching area in Rb, in our problem, the flying range of each dimension for one particle is limited.The model parameter space of the jth dimension is restricted to all potential matching pairs of naj, which is selected from set PSetaj.The flying position range for jth dimension should be restricted in PSetaj.
Usually, the change of velocity is a continuous process, while in the search of matching pairs, the change of the matching pair is a discrete process, so we need to map the continuous changing process into a discrete process.Meanwhile, to guarantee that the changing direction of velocity in each iteration process is in the right orientation, the similarity value is utilized in the personal best model of particle i and the global best model of the swarm to ensure that global similarity is increasing in the flying process.
The speed-range vector is defined to restrict the flying extreme.The jth dimension of this vector represents the flying range for the jth matching pair, Vrj ∈ [minSim(PSetaj)-maxSim(PSetaj), maxSim(PSetaj)-minSim(PSetaj)].The lower bound is defined by minSim(PSetaj)-maxSim(PSetaj), and the upper bound is defined by maxSim(PSetaj)-minSim(PSetaj), in which minSim(PSetaj) is the minimum value among the similarity value for k potential candidate pairs in PSetaj, and maxSim(PSetaj) is the maximum value among the similarity value for k potential candidate pairs in PSetaj.
For each dimension, k potential-candidate pairs in PSetaj are sorted in order of similarity value, and the corresponding relationship between the similarity value interval and matching pairs should be established for these k candidate pairs.In the iteration process, to get a new position for the jth dimension of particle i, a matching pair (naj, nbx) is found that has the closest value with xij from PSetaj..Then, Sim(naj, nbx) is assigned to xij.
Mapping between continuous velocity and discrete matching is achieved.Using the extent of similarity changing to update speed is more in line with the optimization objective of global matching.
Fitness value PFV of each particle changes synchronously when all matching pairs in the particle are changed in terms of iteration function.That is, the position of the entire particle swarm changes Then, the next position of each particle is calculated by Formula ( 6): To identify the matching pair, we consider all the features in R a as the anchor point; in other words, n aj in the jth matching pair (n aj , n bx ) is fixed.The objective is to find the correct matching feature from N pb for each feature n aj in R a .Since each feature n aj in R a has the potential matching area in R b , in our problem, the flying range of each dimension for one particle is limited.The model parameter space of the jth dimension is restricted to all potential matching pairs of n aj, which is selected from set PSet aj .The flying position range for jth dimension should be restricted in PSet aj.
Usually, the change of velocity is a continuous process, while in the search of matching pairs, the change of the matching pair is a discrete process, so we need to map the continuous changing process into a discrete process.Meanwhile, to guarantee that the changing direction of velocity in each iteration process is in the right orientation, the similarity value is utilized in the personal best model of particle i and the global best model of the swarm to ensure that global similarity is increasing in the flying process.
The speed-range vector is defined to restrict the flying extreme.The jth dimension of this vector represents the flying range for the jth matching pair, Vr j ∈ [minSim(PSet aj )-maxSim(PSet aj ), maxSim(PSet aj )-minSim(PSet aj )].The lower bound is defined by minSim(PSet aj )-maxSim(PSet aj ), and the upper bound is defined by maxSim(PSet aj )-minSim(PSet aj ), in which minSim(PSet aj ) is the minimum value among the similarity value for k potential candidate pairs in PSet aj , and maxSim(PSet aj ) is the maximum value among the similarity value for k potential candidate pairs in PSet aj .
For each dimension, k potential-candidate pairs in PSet aj are sorted in order of similarity value, and the corresponding relationship between the similarity value interval and matching pairs should be established for these k candidate pairs.In the iteration process, to get a new position for the jth dimension of particle i, a matching pair (n aj , n bx ) is found that has the closest value with x ij from PSet aj. .Then, Sim(n aj , n bx ) is assigned to x ij .
Mapping between continuous velocity and discrete matching is achieved.Using the extent of similarity changing to update speed is more in line with the optimization objective of global matching.
Fitness value PFV of each particle changes synchronously when all matching pairs in the particle are changed in terms of iteration function.That is, the position of the entire particle swarm changes with the change of velocity position of each dimension.In this way, the optimization direction of the particle swarm is in the direction of global similarity increasing in the process of iterations.

PSOM Workflow
The PSOM algorithm can quickly find optimal solutions, but it is easy to fall into a local optimum.Therefore, a tabu-search strategy is used to prevent prematurely falling into a local optimum.
Two optimal solutions should be recorded in the iterations.P gbest (n) records the global best model of the swarm at iteration n, and P hgbest (n) records the history global best model of the swarm in the process of n iterations.P gbest (n) is put into the tabu list to stay for √ N time, in which N is the number of particles.The particle in the tabu list will not be selected except for when it meets the contempt criterion that intends to prevent loss of the global optimal solution.The contempt criterion is valid when the PFV of the current particle is greater than P hgbest (n).
The PSOM workflow is shown in Figure 2, and the specific steps are as follows: (1) Similarity calculation.For all nodes in R a , calculate the similarity value of each matching pair in PSet aj using Equation ( 2), and sort PSet aj by the similarity value.
(2) Particle Initialization.Initialize particles in a random manner, and select P gbest (n), putting into the tabu list.(3) Particle Iteration.Calculate and update the velocity and position of each particle using Equations ( 5) and (6).Calculate the PFV for all particles at iteration n using Equation ( 1), and the maximum particle is selected to be P gbest (n).with the change of velocity position of each dimension.In this way, the optimization direction of the particle swarm is in the direction of global similarity increasing in the process of iterations.

PSOM Workflow
The PSOM algorithm can quickly find optimal solutions, but it is easy to fall into a local optimum.Therefore, a tabu-search strategy is used to prevent prematurely falling into a local optimum.Two optimal solutions should be recorded in the iterations.P gbest (n) records the global best model of the swarm at iteration n, and P hgbest (n) records the history global best model of the swarm in the process of n iterations.P gbest (n) is put into the tabu list to stay for N time, in which N is the number of particles.The particle in the tabu list will not be selected except for when it meets the contempt criterion that intends to prevent loss of the global optimal solution.The contempt criterion is valid when the PFV of the current particle is greater than P hgbest (n).
The PSOM workflow is shown in Figure 2, and the specific steps are as follows: (1) Similarity calculation.For all nodes in Ra, calculate the similarity value of each matching pair in PSetaj using Equation ( 2), and sort PSetaj by the similarity value.(2) Particle Initialization.Initialize particles in a random manner, and select P gbest (n), putting into the tabu list.(3) Particle Iteration.Calculate and update the velocity and position of each particle using Equations ( 5) and (6).Calculate the PFV for all particles at iteration n using Equation ( 1), and the maximum particle is selected to be P gbest (n).

PSOM Parallelization Strategy
PSOM is composed of a similarity computation process and matching-relationship identification process.Accordingly, the parallelization strategy is designed for these two stages, respectively (Figure 3).

PSOM Parallelization Strategy
PSOM is composed of a similarity computation process and matching-relationship identification process.Accordingly, the parallelization strategy is designed for these two stages, respectively (Figure 3).

Parallel-Similarity Computation
Similarity calculation is the precondition of matching-relationship identification.The time cost of similarity calculation is dependent on the data scale and the complexity of the similarity measure.

Parallel-Similarity Computation
Similarity calculation is the precondition of matching-relationship identification.The time cost of similarity calculation is dependent on the data scale and the complexity of the similarity measure.Transforming similarity calculation from a sequence way to a parallel way can significantly speed up the process, and is applicable for different similarity-measurement methods.
For each node n aj in the original map, the potential matching pairs are selected using a specified buffer threshold.Suppose each n aj has m matching candidates, stored in PSet aj {(n aj , n c1 ), (n aj , n c2 ), . . ., (n aj , n cm )}.The similarity for n nodes in the original map need n×m similarity operations.Similarity computation of matching pairs is an independent operation, so the n*m similarity operations can be distributed across multiple threads, and the whole similarity-calculation process can be parallelized.
There are three phases for the parallelization of similarity computing.The first phase is decomposing node-feature pairs into multiple subsets for threads.The size of subsets is determined by the total number of nodes in the original map and the number of available compute units.The second phase is a parallel execution of similarity operations.Each thread handles a subset, calculating the similarity of node-feature pairs and sorting PSet by the similarity value.The final phase is collecting the outputs derived from each thread and recompiling them into a large single similarity set, which is used for the matching-relationship identification process.
Additionally, due to the diversity of data structures in different data sources, the original data are reconstructed in the data preparation stage.The additional structure (i.e., stroke structure) and the matching candidates of each node are all prepared beforehand.They are allocated in the global memory for the usage of similarity calculation in the GPU procedure.

Parallel-Matching Identification
The flight behavior of each particle is independent and has intrinsic parallel attributes, which make the PSOM algorithm a kind of natural parallel optimization algorithm.
The matching-relationship identification of PSOM utilizes a decomposition parallel strategy.This strategy decomposes the whole particle swarm into subgroups corresponding with the compute units.Each node of the compute units deals with all the tasks including initialization, iterative information update, and information exchange.The best-location information is exchanged between nodes according to a specific strategy.Because PSOM is a fine-granularity parallel-computing model, the node number (i.e., the thread number of GPU) is set equal to the particle number in this paper.This can balance the computing load and maximize parallel the efficiency.
The PSOM algorithm can be divided into two stages, particle initialization and particle flight.Two GPU kernel functions are designed to parallel these two processes, respectively.The first kernel function k-particle-initial initializes the particles in parallel.The number of threads is determined according to the number of particles, and the curand function is used to generate random numbers that are used to set the initial position of particles in parallel.The initial potential matching pairs in one particle are randomly selected from the PSet aj of nodes in R a .Particles with different possible parameter models are initialized in parallel.After the initialization of all particles is completed, the comparison is made in the first thread to find the global optimal value and local optimal value.The second kernel function k-particle-fly is iteratively executed in parallel.The iteration number of k-particle-fly is a preset flying-time number, and the number of threads is still determined by the number of particles.The flying process is described in detailed in the pseudo-code (Algorithm 1).
In the flying process, communications are inevitable for visiting and updating the global optimum particle and the tabu list.The key point is to control synchronization access to the global memory.Thus, a shared region should be designed for the global optimum particle and the tabu list.The global memory is utilized to reserve the global optimum value.After the local optimum value in each particle is updated, the global optimum particle is calculated through communication between each single thread and the global memory.The global optimum particle is finally updated into the global memory.
To avoid the conflict of the global optimum updating and reducing communication times/cost, a periodically updated strategy is adopted.The update behavior is executed in the first thread when the specified iteration of all particles is completed.Hence, the global optimal value can be periodically updated to guide the following flying direction of each particle.This strategy not only fully exerts the independence of each particle, but also reduces the communication cost.
Besides, during iteration, many random values are needed to avoid frequent data exchange between CPU and GPU.The CUDA CURAND library on the GPU can be used to directly generate random numbers instead of transform time from CPU to GPU.

Algorithm 1: Parallel-Matching Relationship Identification
Input: Road-network data in a defined GPU data structure; similarity set layer.

Output:
Matching relationship between road-network R a and R b .
1.1: Number of threads is determined according to number of particles.1.2:The curand function is used to randomly set the initial position of each particle.1.3: Global optimal value and local optimal value are found in the first thread.2: While iteration condition not reached do kernel function k-particle-fly.
2.1: The velocity of each particle is calculated according to Equation ( 5) 2.2: The position of each particle is updated according to Equation ( 6) 2.3: The PFV of each particle is calculated.2.4: When all particles are done, the P cbest (n) and P hbest (n) of each particle and P gbest (n) of the whole particle swarm are updated.
2.5: If PFV of the current particle is greater than P hbest (n), then the P gbest (n) is assigned to P hbest (n).Otherwise, P gbest (n) is put into the tabu list.
2.6: The stay time of all particles in the tabu list is updated.3.If the iteration condition is reached then the best particle found in the global memory is transferred from the GPU to the CPU. 4. Return the matching pairs in the final global optimum particle P hgbest (n), the algorithm ends.

Model Evaluation Indices
Two evaluation indices were used to access the accuracy of the models: the matching accuracy (Equation ( 7)) and the accuracy growth rate (Equation ( 8)).
The matching accuracy (P) is determined by the right matching features (TP), the wrong matching features (FP), and the ambiguous features that cannot be matched artificially (AM).
The closer that P value is to 1, the more correct matching features the algorithm recognizes.The accuracy growth rate (AGR) is used for comparing the result accuracies between different algorithms.
where AGB (A,B) represents the accuracy growth rate of Algorithm A compared to Algorithm B. P A is the matching accuracy of Algorithm A, and P B is the matching accuracy of Algorithm B. A higher absolute value of AGR indicates a higher accuracy change.If AGR is greater than zero, Algorithm A performs more accurately than Algorithm B; otherwise, Algorithm B is better.The effectiveness of performance optimization was evaluated by comparing with a benchmark.To the best of our knowledge, the road-network matching study using parallel computing strategy has not been reported in previous articles.Therefore, it is not possible to compare our parallel method with a similar parallelized method.In recent years, the probability-relaxation-matching algorithm (PRM) is often employed in road-network matching studies [5,[13][14][15]19,21,22], thus it was adopted as the benchmark.The general probability-relaxation framework was utilized to implement the PRM in our experiment [23].In addition, the similarity-computation function used in PRM is totally the same with that used in PSOM.The ending condition for the iteration of the PRM was set based on the experience value (0.0005) [15].

Experimental Environment and Data
To evaluating the performance of our method, 14 pairs of road networks were utilized with the node scale ranging from 44 to 72,130 (see Table 2 and Appendix B).Areas 1-6, 8-9, and 12 are from the Wuhan, China; Area 7 is part of the Chinese highway network; Areas 10-11 belong to Foshan, China; and Areas 13-14 belong to Oakland, New Zealand.Each road-network pair comes from different data producers.The source map represents the road network to be matched, and the destination map represents the road network that is used to match the source map.The proposed parallel strategy of PSOM was implemented using C++ programming language with the structure of CUDA based on GPU (PSOM-GPU).Experiments were carried on a computer equipped with an NVIDIA GeForce GTX 960 GPU and an Inter(R) Pentium(R) D 3.00 GHz with 3.5 GB RAM.The software-development environment is composed of CUDA 8.0 SDK, Visual Studio 2010, and MapGIS 10 (ZONDY, Wuhan, Hubei, China).

Parameter Setting
Two parameters needed to be set in the experiments: the particle number and the thread number.The particle number used in PSO algorithms is not uniform in different studies, and is usually between 20 and 50 in sequential algorithms and between 32 and 2000 in parallel algorithms [24][25][26].To obtain an appropriate particle number, a set of particle numbers (i.e., 50, 100, 200, 300, 400, 1000, and 2000) were tested in ten experimental areas with an iteration number of 200.
The algorithm accuracies based on each different particle number were compared with the accuracy of particle number 50 which served as the baseline.As shown in Figure 4, AGR values change slightly when particle number is in the range of 100-300.However, as the particle number increases, most AGR values have a dramatic increase when the particle number is 400.After that, AGR curves remain relatively stable.Overall, the increase of particle number can promote the matching accuracy; however, more memory space and processing time are required.Take Area 7 (439 nodes) as an example, both the matching-relationship identification time and the total time cost increase as the particle number grows, and an especially dramatic increase appears when the particle number is larger than 400 (Figure 5).Thus, particle number is positively associated with accuracy, but negatively correlated with efficiency.Based on the above analysis, the particle number was set as 400 in this study, which requires less time while maintaining a good performance.
The thread number differed in the two parallel stages.
In the stage of similarity computation, more than one node-feature pair is assigned in one thread.Thus, different combinations of node-feature numbers (FN) (i.e., the number of node-feature pairs assigned to each single thread) and thread numbers (TN) were tested.FN was set from 5 to 25 with an interval of 5.The size of TN grows in multiplicative size, from 16 to 256.Four road-network sets (Area 10, Area 11, Area 12, and Area 14) were employed to illustrate the time consumption of similarity computation under different conditions (Figure 6).For Areas 10-12, the time cost increased with the growth of FN and TN.No matter what value FN takes, the similarity computation time cost with 16 threads was always the least among five different settings of thread numbers.Thus, for the first three sets, the lower FN and TN are set, the faster is the computation speed.However, this regularity does not exist for Area 14.The experimental result reveals that, when setting parameters of FN = 15 and TN = 32, the time cost is the least, which is the best solution for similarity computation process in this study.The thread number differed in the two parallel stages.
In the stage of similarity computation, more than one node-feature pair is assigned in one thread.Thus, different combinations of node-feature numbers (FN) (i.e., the number of node-feature pairs assigned to each single thread) and thread numbers (TN) were tested.FN was set from 5 to 25 with an interval of 5.The size of TN grows in multiplicative size, from 16 to 256.Four road-network sets (Area 10, Area 11, Area 12, and Area 14) were employed to illustrate the time consumption of similarity computation under different conditions (Figure 6).For Areas 10-12, the time cost increased with the growth of FN and TN.No matter what value FN takes, the similarity computation time cost with 16 threads was always the least among five different settings of thread numbers.Thus, for the first three sets, the lower FN and TN are set, the faster is the computation speed.However, this regularity does not exist for Area 14.The experimental result reveals that, when setting parameters of FN = 15 and TN = 32, the time cost is the least, which is the best solution for similarity computation process in this study.The thread number differed in the two parallel stages.
In the stage of similarity computation, more than one node-feature pair is assigned in one thread.Thus, different combinations of node-feature numbers (FN) (i.e., the number of node-feature pairs assigned to each single thread) and thread numbers (TN) were tested.FN was set from 5 to 25 with an interval of 5.The size of TN grows in multiplicative size, from 16 to 256.Four road-network sets (Area 10, Area 11, Area 12, and Area 14) were employed to illustrate the time consumption of similarity computation under different conditions (Figure 6).For Areas 10-12, the time cost increased with the growth of FN and TN.No matter what value FN takes, the similarity computation time cost with 16 threads was always the least among five different settings of thread numbers.Thus, for the first three sets, the lower FN and TN are set, the faster is the computation speed.However, this regularity does not exist for Area 14.The experimental result reveals that, when setting parameters of FN = 15 and TN = 32, the time cost is the least, which is the best solution for similarity computation process in this study.In the matching-relationship identification stage, the number of threads is equal to the total number of particles.Each thread performs the same operation simultaneously and synchronously.Since the particle number was set to 400, the thread number of matching-relationship identification was also set to 400 in our experiment.

Algorithm Accuracy
Figure 7 delineates the matching accuracies of PSOM-GPU and the benchmark PRM in different road networks.Figure 8 shows the accuracy growth rate (AGR(PSOM-GPU,PRM)) of PSOM-GPU compared with PRM.
The accuracies of PSOM-GPU are all greater than 75%, and more than half of the accuracies are greater than 85%, peaking at 97.73% in Area 1.The accuracies of PSOM-GPU do not change linearly when the number of nodes increases, but vary from region to region, following the same trend as the benchmark.As Figure 8 shows, the accuracy differences between PSOM-GPU and the benchmark are very small with an average AGR(PSOM-GPU,PRM) of −0.012.For the areas with negative AGR, the absolute values are less than 0.05, which is in an acceptable accuracy error range.PSOM-GPU can achieve the same matching accuracy as the benchmark PRM, and correctly identify most matching-relationship (84.44% on average).In addition, since memory space required by PRM in Area 14 exceeds the tolerance value of the hardware, the accuracies of PRM are not calculated in this area.However, PSOM-GPU can successfully perform the matching process, and obtain an accuracy of 74.65%.

Algorithm Accuracy
Figure 7 delineates the matching accuracies of PSOM-GPU and the benchmark PRM in different road networks.Figure 8 shows the accuracy growth rate (AGR (PSOM-GPU,PRM) ) of PSOM-GPU compared with PRM.

Algorithm Accuracy
Figure 7 delineates the matching accuracies of PSOM-GPU and the benchmark PRM in different road networks.Figure 8 shows the accuracy growth rate (AGR(PSOM-GPU,PRM)) of PSOM-GPU compared with PRM.
The accuracies of PSOM-GPU are all greater than 75%, and more than half of the accuracies are greater than 85%, peaking at 97.73% in Area 1.The accuracies of PSOM-GPU do not change linearly when the number of nodes increases, but vary from region to region, following the same trend as the benchmark.As Figure 8 shows, the accuracy differences between PSOM-GPU and the benchmark are very small with an average AGR(PSOM-GPU,PRM) of −0.012.For the areas with negative AGR, the absolute values are less than 0.05, which is in an acceptable accuracy error range.PSOM-GPU can achieve the same matching accuracy as the benchmark PRM, and correctly identify most matching-relationship (84.44% on average).In addition, since memory space required by PRM in Area 14 exceeds the tolerance value of the hardware, the accuracies of PRM are not calculated in this area.However, PSOM-GPU can successfully perform the matching process, and obtain an accuracy of 74.65%.The accuracies of PSOM-GPU are all greater than 75%, and more than half of the accuracies are greater than 85%, peaking at 97.73% in Area 1.The accuracies of PSOM-GPU do not change linearly when the number of nodes increases, but vary from region to region, following the same trend as the benchmark.As Figure 8 shows, the accuracy differences between PSOM-GPU and the benchmark are very small with an average AGR (PSOM-GPU,PRM) of −0.012.For the areas with negative AGR, the absolute values are less than 0.05, which is in an acceptable accuracy error range.PSOM-GPU can achieve the same matching accuracy as the benchmark PRM, and correctly identify most matching-relationship (84.44% on average).In addition, since memory space required by PRM in Area 14 exceeds the tolerance value of the hardware, the accuracies of PRM are not calculated in this area.However, PSOM-GPU can successfully perform the matching process, and obtain an accuracy of 74.65%.

Algorithm Performance
Figure 9 presents the specific time cost of the PSOM-GPU and the benchmark PRM.The time cost becomes larger with the increase in nodes.However, compared with the benchmark, the growth rate of time cost for PSOM-GPU is relatively flat.When the node number grows from hundreds to thousands, the time cost of the whole matching process based on PRM increases by 449 times from 141 ms in Area 1 to 63,312 ms in Area 10.The time cost for PSOM-GPU grows from 562 ms in Area 1 to only half of the benchmark with 31,407 ms in Area 10.The difference of time cost between the two methods is more obvious as the data scale increases.Besides, the PRM might not be able to get the road-network matching results with large amount of data (e.g., Area 14), while the PSOM-GPU performed successfully (e.g., spent 136,093 ms in Area 14).Overall, PSOM-GPU has a better performance than the benchmark in dealing with large-scale road-network matching.
ISPRS Int.J. Geo-Inf.2018, 7, x FOR PEER REVIEW 15 of 21 the performance improvement of this stage.It indicates that the parallel strategy still has potential to be improved to increase the operational intensity of matching-relationship identification.Note that there are some curve fluctuations in the stage of similarity computation in Figure 10.We can see the descending segments in the curve (speed-up-I) from Area 6 to Area 7 and from Area 9 to Area 10, even though the data scale increases slightly.It is the reason that the time cost of the similarity function is highly dependent on the complexity of road and stroke.For instance, the average point number of a road segment is 3.6 in Area 6 while 7.8 in Area 7, and the average point number of strokes is 33.7 in Area 6 while 71.6 in Area 7. Thus, the road complexity of Area 7 is greater than that of Area 6, which leads to the lower speed-up in Area 7 than in Area 6.In addition, the speedup ratio also differs greatly among the four different sub-functions in similarity-measure function.Thus, the speed-up of similarity computation not only depends on the data scale, but also relates with the complexity of the original road-network data and the corresponding function complexity.
From the global perspective, the speed-up ratio is on the rise, increasing from 0.25 in Area 1 to 0.82 in Area 8, and then continues rising to more than 1, reaching 4.67 in Area 13.That is, when the data amount is small (e.g., 519 nodes in Area 8), less calculation is required, and the time is mainly spent on data transform.When facing with a large-size problem (e.g., 13,815 nodes in Area 13), the proposed parallelization strategy can significantly reduce the road-network matching time.Besides, the curve of the overall speed-up ratio is close to that of the matching-relationship identification process.This can be explained by the time-occupation ratio of the two key processes in PSOM (Figure 12).The time-consumption ratio of similarity computation is 0.15 on average, while it is 0.85 for matching-relationship identification.The matching-relationship identification process is the main consumption of the whole algorithm, which restricts the global performance improvement.More attention will be paid in parallel-matching identification stage in the future work.
In general, the proposed PSO-based parallel strategy of road network matching has a good performance when dealing with large amounts of data.

Figure 9.
The time cost of PRM and PSOM-GPU for road network matching.PRM-I, -II and -III denote the similarity computation, the matching-relationship identification and the global process, respectively, using PRM.PSOM-GPU-I, -II and -III denote the same three aspects, respectively, using PSOM-GPU. .The time cost of PRM and PSOM-GPU for road network matching.PRM-I, -II and -III denote the similarity computation, the matching-relationship identification and the global process, respectively, using PRM.PSOM-GPU-I, -II and -III denote the same three aspects, respectively, using PSOM-GPU.
To further evaluate the performance of PSOM-GPU, speed-up radio was calculated from local and global aspects (Figure 10).The local aspect focused on the performance improvement in the similarity computation stage and the matching-relationship identification stage.The global aspect embodied the performance of the whole process.Recorded execution time included the execution time on all stages of the algorithm, and the time spent on exchanging data.For the similarity computation stage, the time of reading data from disk was recorded, and for the matching-relationship identification stage, the time of writing data to disk was recorded.Besides, the recorded execution time for PSOM-GPU also contained the transfer time between the host memory and the GPU device memory.Thus, it is a fair and holistic comparison of the performance of different algorithms.From the local perspective, the speed-up ratios of both stages (the similarity computation and the matching-relationship identification) show an increasing trend as the number of road network nodes grows.The speed-up ratio of the similarity computation is increased from 0.88 in Area 1 to 31.36 in Area 13, and that of the matching-relationship identification grows from 0.07 to 4.08.The acceleration result of the algorithm is more obvious with increasing volumes of data.However, the speed-up performance of the two stages are not the same.Especially in Area 13, the speed-up ratio of similarity computation is 31.36,much higher than that of matching-relationship identification (4.08).The acceleration performance of similarity computation is better than the matching-relationship identification.
The roofline model [27] was adopted to analyze the theoretical performance of the two key processes (Figure 11).It is an intuitive visual performance model, providing the upper bound of performance and showing inherent hardware limitations.According to the specifications of NVIDIA GeForce GTX 960 GPU, the theoretical peak operational intensity of the GPU (hereafter, referred to as I max ) is 20.5 FLOP per Byte.The operational intensity of similarity computation is around I max , reaching 22.46 FLOP per Byte in the average case, which means that the computing power of the GPU is fully utilized under similarity computation stage.The operational intensity of the matching-relationship identification stage is very low, only 2.5 FLOP per Byte.The memory bandwidth restricts the performance improvement of this stage.It indicates that the parallel strategy still has potential to be improved to increase the operational intensity of matching-relationship identification.
Note that there are some curve fluctuations in the stage of similarity computation in Figure 10.We can see the descending segments in the curve (speed-up-I) from Area 6 to Area 7 and from Area 9 to Area 10, even though the data scale increases slightly.It is the reason that the time cost of the similarity function is highly dependent on the complexity of road and stroke.For instance, the average point number of a road segment is 3.6 in Area 6 while 7.8 in Area 7, and the average point number of strokes is 33.7 in Area 6 while 71.6 in Area 7. Thus, the road complexity of Area 7 is greater than that of Area 6, which leads to the lower speed-up in Area 7 than in Area 6.In addition, the speed-up ratio also differs greatly among the four different sub-functions in similarity-measure function.Thus, the speed-up of similarity computation not only depends on the data scale, but also relates with the complexity of the original road-network data and the corresponding function complexity.From the global perspective, the speed-up ratio is on the rise, increasing from 0.25 in Area 1 to 0.82 in Area 8, and then continues rising to more than 1, reaching 4.67 in Area 13.That is, when the data amount is small (e.g., 519 nodes in Area 8), less calculation is required, and the time is mainly spent on data transform.When facing with a large-size problem (e.g., 13,815 nodes in Area 13), the proposed parallelization strategy can significantly reduce the road-network matching time.Besides, the curve of the overall speed-up ratio is close to that of the matching-relationship identification process.This can be explained by the time-occupation ratio of the two key processes in PSOM (Figure 12).The time-consumption ratio of similarity computation is 0.15 on average, while it is 0.85 for matching-relationship identification.The matching-relationship identification process is the main consumption of the whole algorithm, which restricts the global performance improvement.More attention will be paid in parallel-matching identification stage in the future work.In general, the proposed PSO-based parallel strategy of road network matching has a good performance when dealing with large amounts of data.

Conclusions
This study developed a parallel PSO-based road-network matching algorithm (PSOM) on graphics-processing units (GPU), which can be used to conquer the bottleneck problem of traditional road-network matching algorithms when facing big data.To fully use GPU threads, a data-partition and a task-partition strategy were proposed in similarity computation and matching-relationship identification stages, respectively.Fourteen pairs of road networks with different node scales were selected to verify the proposed PSOM method.Our results show that the road matching accuracy of the parallel GPU-accelerated PSOM approach achieved the same level as the benchmark PRM.When facing massive amounts of data, e.g. more than thousands of nodes, the proposed parallel PSOM method can significantly improve computational efficiency of the road-network matching process.Overall, the larger the data scale is, the more obvious the advantage of the PSOM-GPU approach in the speed-up of similarity computation and matching-relationship identification process is.According to these findings, the proposed parallel PSOM shows great potential for matching road-network in an effective and efficient manner.
This research also found that, although the parallel GPU-accelerated PSOM approach could promote the execution efficiency of both similarity computation and matching-relationship identification, the speed-up effects were different.It had better effect for the similarity computation stage.The memory bandwidth restricted the performance improvement of matching-relationship identification stage, which further affected the overall performance improvement.In the future, more work will be assigned for improving the speed-up performance of matching-relationship identification.

Figure 4 .
Figure 4. Relationship between accuracy growth rate and particle number.

Figure 5 .
Figure 5. Matching identification time and total time cost under different particle numbers using Area 7 (439 nodes) as an example.

Figure 4 .
Figure 4. Relationship between accuracy growth rate and particle number.

Figure 4 .
Figure 4. Relationship between accuracy growth rate and particle number.

Figure 5 .
Figure 5. Matching identification time and total time cost under different particle numbers using Area 7 (439 nodes) as an example.

Figure 5 .
Figure 5. Matching identification time and total time cost under different particle numbers using Area 7 (439 nodes) as an example.

Figure 6 .
Figure 6.Time consumption of similarity computation in Area 10, Area 11, Area 12, and Area 14 under different combinations of FN and TN.

Figure 6 .
Figure 6.Time consumption of similarity computation in Area 10, Area 11, Area 12, and Area 14 under different combinations of FN and TN.

Figure 6 .
Figure 6.Time consumption of similarity computation in Area 10, Area 11, Area 12, and Area 14 under different combinations of FN and TN.

21 Figure 10 .Figure 11 .
Figure 10.The speed-up ratio of PSOM-GPU compared with PRM.I, II and III denote the three aspects of similarity computation, matching-relationship identification and global process, respectively.

Figure 10 .
Figure 10.The speed-up ratio of PSOM-GPU compared with PRM.I, II and III denote the three aspects of similarity computation, matching-relationship identification and global process, respectively.

Figure 10 .Figure 11 .
Figure 10.The speed-up ratio of PSOM-GPU compared with PRM.I, II and III denote the three aspects of similarity computation, matching-relationship identification and global process, respectively.

Figure 11 .
Figure 11.Roofline model for NVIDIA GeForce GTX 960 GPU with two stages of PSOM: similarity computation stage and the matching identification stage.

21 Figure 10 .Figure 11 .
Figure 10.The speed-up ratio of PSOM-GPU compared with PRM.I, II and III denote the three aspects of similarity computation, matching-relationship identification and global process, respectively.

Table 1 .
Studies that mention the performance of road-network matching algorithms (listing data scale, time cost, accuracy, recall, and environment).
model of the swarm on the jth dimension at iteration n.Through particle iteration, the optimal similarity matching-pair sequence is gradually formed.
the PFV is greater than P hgbest (n), the maximum particle is assigned to P hgbest (n); otherwise, P gbest (n) is put into the tabu list if it is not contained.The global optimal model P gbest (n) and the personal best model P cbest i (n) are updated and recorded in each iteration.Continue with this step until the maximum iteration number is reached.(4)Output all the matching pairs in the final global optimum particle P hgbest (n).
Continue with this step until the maximum iteration number is reached.(4) Output all the matching pairs in the final global optimum particle P hgbest (n).
the PFV is greater than P hgbest (n), the maximum particle is assigned to P hgbest (n); otherwise, P gbest (n) is put into the tabu list if it is not contained.The global optimal model P gbest (n) and the personal best model P cbest i (n) are updated and recorded in each iteration.PFV ＞ P hbest (n) PFV ≤ P hbest (n)

Table 2 .
Fourteen pairs of road networks used for experiments.NS represents the node number of the source map, and ND represents the node number of the destination map.