Research on the Parallelization of the DBSCAN Clustering Algorithm for Spatial Data Mining Based on the Spark Platform

: Density-based spatial clustering of applications with noise (DBSCAN) is a density-based clustering algorithm that has the characteristics of being able to discover clusters of any shape, effectively distinguishing noise points and naturally supporting spatial databases. DBSCAN has been widely used in the ﬁeld of spatial data mining. This paper studies the parallelization design and realization of the DBSCAN algorithm based on the Spark platform, and solves the following problems that arise when computing macro data: the requirement of a great deal of calculation using the single-node algorithm; the low level of resource-utilization with the multi-node algorithm; the large time consumption; and the lack of instantaneity. The experimental results indicate that the proposed parallel algorithm design is able to achieve more stable speedup at an increased involved spatial data scale.


Introduction
Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is an algorithm proposed by Ester et al. for clustering analyses based on the density method in 1996 [1].It has features like the abilities to find clusters of any shape, identify noisy points effectively, and support spatial databases.It has been widely used in the field of data mining.However, in some large-scale spatial data mining studies [2], as the scale of data computation increases, the processing time of DBSCAN rises exponentially.Thus, the performance of this serial algorithm is unable to meet the needs of real-time application development.
To speed up the performance of serial DBSCAN processing, on one hand, some researchers began to improve and optimize the serial algorithm; on the other hand, the DBSCAN clustering algorithm has also been parallelized on computer clusters, graphics-processing units (GPUs), and the Hadoop platform to improve its efficiency.However, these approaches have the following disadvantages: (1) The traditional parallel-processing platform is expensive and has inferior scalability and fault-tolerance properties, which result in bottlenecks in data transfer.(2) When dealing with a multi-iterative clustering algorithm such as DBSCAN, Hadoop needs to read and write data frequently from/to the distributed file system.The efficiency of the system decreases as the amount of data increases.
To resolve the problem of long processing times associated with large-scale data processed using the serial DBSCAN algorithm, in this paper, the big data processing platform Spark was introduced to design and implement a parallel DBSCAN clustering algorithm.Spark, as a new generation of fast general-purpose engine for large-scale data processing, provides resilient distributed dataset (RDD) abstraction for data storage that eliminates the need for intermediate results to be sent to the distributed file systems, and hence it improves real-time data processing.Spark has features such as a large degree of scalability and high fault tolerance, which overcome the problems of the aforementioned parallel platforms.Based on the Spark platform, which was designed and optimized for memory usage and iterative processes, this paper studies parallelization strategies for the DBSCAN algorithm.Historically, the DBSCAN algorithm was designed and optimized on the single-node Spark platform and the Spark cluster platform based on different resource managers.Therefore, the research presented in this paper helps to improve the efficiency of the DBSCAN algorithm and generate new ideas for the parallelization of other types of spatial clustering algorithms, and to a certain extent, promotes the organic integration of the Spark platform with traditional spatial data mining technologies.
In addition, we utilized the parallel DBSCAN algorithm based on the Spark platform in a traffic congestion area discovery application.Compared with traditional processing methods, this approach can dramatically improve the efficiency of city congestion discovery.These experiments verify the high efficiency and practicability of the parallel algorithm proposed in this paper.
The paper is organized as follows.Section 2 presents related works, and conducts a systematic analysis.Section 3 gives a brief introduction to the DBSCAN algorithm, including its mathematical principles, and the implementation of serial DBSCAN algorithm.Section 4 concentrates on the implementation of the parallel DBSCAN algorithm and its optimization methods.The subsequent section focuses on the test results of the parallel algorithm on both the single node platform and the Spark cluster with different resource managers.The experimental results are systematically analyzed thereafter.Section 6 discusses the application of the parallel algorithm for congested area discovery in urban traffic.Finally, Section 7 draws some conclusions and discusses future research directions.

Related Works
To shorten the elapsed time taken for the serial DBSCAN processing, researchers began to improve and optimize some selected serial algorithms.Based on DBSCAN, Ankerst et al. proposed the ordering points to identify the clustering structure (OPTICS) algorithm to solve the problem of parameter selection that affects the algorithm.However, the OPTICS algorithm is complicated, and requires additional I/O operations.As a result, its performance is lower than that of DBSCAN [3].Chen et al. designed the spatial access algorithm to improve DBSCAN, where the complexity of the algorithm was reduced from O n 2 to O(nlgn), which to a certain extent improved the efficiency of the algorithm [4].Kryszkiewicz et al. proposed a method that uses the triangle inequality to reduce the neighboring search space.The search time was thus reduced, and the efficiency of the algorithm was improved [5].
Although optimization of the serial DBSCAN algorithm itself can improve its efficiency to some extent, with the geometric growth of the amount of data processed, this serial algorithm cannot meet real-world requirements.In recent years, due to its powerful computing capabilities, GPU-based computing, and the technology of cloud computing, etc. have enabled major breakthroughs in parallel computing performance [6,7].GPU-based computing has advantages in processing intensive matrix operations, and it has been widely used in some compute-intensive applications [8][9][10][11][12][13][14][15][16].Cloud computing has performance advantages in data processing, and the MapReduce model has been used widely on cloud computing platforms in many data-intensive applications [17][18][19][20].In order to obtain more efficient processing capabilities, researchers also began to parallelize the clustering algorithms to improve the performance.The parallelization of clustering algorithms at home and abroad are mainly based on high-performance computing cluster platforms, GPU platforms, and Hadoop platforms.Relevant details include: (1) Parallelization on high-performance cluster platforms.Xu et al. proposed a network-based fast parallel-DBSCAN (PDBSCAN) algorithm for use on cluster systems with master-slave architectures.First, the entire data area is divided into N disjointed data blocks, where each data block is processed by each computing node.After a given data block has been processed, the results are merged on the master node.This parallel algorithm requires extra communication time, where the inefficiencies related to combine the data with the main node data reduce the overall processing efficiency of the algorithm [21].Based on the data-partition DBSCAN algorithm, Erdem & Gündem used high-performance clusters to improve the efficiency of the algorithm [22].In these studies, parallel platforms based on high-performance clusters can help to improve the algorithm, but the conventional cluster platform has issues such as poor scalability, bad fault tolerance, etc.The shared architecture is also likely to cause bottlenecks in data transfer.(2) Parallelization on GPU platforms.Böhm et al. proposed a GPU-based CUDA-DClust parallel algorithm that calculates the distance from the central point to surrounding points using the multi-thread programming model to efficiently query the neighboring data.The results show that the parallel algorithm enhances the overall efficiency with the help of the GPU [23].However, CUDA-DClust needs to calculate the distance between many unnecessary objects and store them on the device memory.Andrade et al. used the compute unified device architecture (CUDA) on the GPU to accelerate the parallel algorithm [24].Compared with the serial algorithm, the performance was greatly improved.(3) Parallelization on Cloud-computing platforms.Dean et al. proposed to take advantage of MapReduce, a large-scale data processing programming model, to enhance the processing efficiency [25].Böse et al. designed data-mining algorithms based on the MapReduce programming framework, and they demonstrated the feasibility of adopting the MapReduce programming model in data mining [26].He et al. implemented the MR-DBSCAN algorithm with MapReduce, and achieved good scalability and speedup [27].Dai et al. divided the data by reducing the number of boundary points, and they used KD-Tree spatial indices to parallelize DBSCAN on Hadoop to improve the performance of the algorithm [28].Fu et al. studied the parallel DBSCAN clustering algorithm based on Hadoop.The experimental results demonstrated that the proposed algorithm could efficiently process large datasets on commodity hardware and have good scalability [29].In addition, some researchers studied the parallel implementation of K-Means algorithm on Hadoop to improve the efficiency [30,31].In these studies, the efficiency of the parallel algorithms were improved on Hadoop, but the results were less than ideal when processing massive spatial datasets [30][31][32].The reasons for this were: (1) the startup time in MapReduce increased the overall run time of the jobs; (2) the frequent I/O carried out by the Hadoop distributed file system (HDFS) for the intermediate results, due to its fault-tolerance mechanism; and (3) with an increase in the data size, the processing efficiency was significantly decreased [33].
As the Hadoop platform cannot meet the needs of some specific real-time massive data processing, the AMPlab team at the University of California developed the Spark platform, which is a distributed computing framework based on the Hadoop MapReduce framework [34].One of the major features of Spark is that it inherits the merits of Hadoop MapReduce while using a global cache mechanism so that intermediate results are stored in memory, requiring less frequent data I/O and effectively improving the processing speed [35].Lawson used Spark to develop financial data processing applications and implemented the parallel alternating direction method of the multipliers algorithm in Spark.The experimental results show that the parallel algorithm was able to process millions of rows of data in only 520.1 s, which was a significant improvement in efficiency [36].Lipka implemented a strategic iteration algorithm on Hadoop and Spark.The results showed that the processing speed with Spark was 71% higher than that with Hadoop.The more computing nodes Spark used, the less execution time the algorithm took [37].Wang et al. studied the parallelization of the K-means algorithm on the Spark platform.By comparing the performance of the parallel and serial algorithms on the Spark platform, they presented their parallel algorithm run on the Spark platform for massive data processing and demonstrated improved performance [38].Jiang et al. studied the parallel FP-Like algorithm in Spark.The parallel algorithm achieved better scalability and acceleration results, and helped to demonstrate the advantages of Spark vs. Hadoop compared with the parallel Apriori algorithm [39].Jin et al. used Spark in the field of land-use analysis.Compared to the traditional superposition method, their method significantly improved the efficiency of the analysis [40].Xie et al. studied different data index methods based on Spark.Their implementation effectively improved the index query, while supporting spatial data management [41].To date, there have not been many studies on the parallelization of the DBSCAN spatial clustering algorithm in data mining with Spark.
In summary, the advent of the Big Data era [42] has resulted in a large amount of noisy spatial data.Traditional standalone systems, distributed cluster platforms, and Hadoop-distributed platforms cannot meet the development requirements of spatial data mining.Therefore, based on Spark, this paper studies parallelization strategies for the DBSCAN algorithm.Furthermore, the DBSCAN algorithm can be applied in many fields, including urban planning [43], hotspot clustering [44], anomaly detection [45], etc.In particular, there are many applications in the transportation area that use the DBSCAN algorithm.For instance, Silva et al. used the DBSCAN algorithm to process the real traffic data of Fortaleza to find the congested areas of the city [46].Through a large amount of taxi trajectory data, one can quickly understand the dynamical distribution of vehicles and find congested areas in urban traffic.For example, Adiba et al. proposed an approach using the DBSCAN clustering method to identify the traffic congestion regions and their spatio-temporal distribution with the taxi trajectory data [47].Wang et al. used their improved C-DBSCAN algorithm to deal with GPS trajectory data to discover the candidate locations for bus stops near the Capital International Airport in Beijing in 2012 [48].Liu et al. divided the taxi trajectory data into time slices, finding the traffic-jam areas in some parts of Wuhan after using the DBSCAN algorithm to process the corresponding data [49].

DBSCAN Algorithm
In order to parallelize an algorithm, it is crucial to understand the working principles and implementation of the original serial algorithm, and carry out detailed analyses and tests first.After that, one can take different approaches to parallelize it and then adopt the parallel algorithm in applications.

Mathematical Principles of the DBSCAN Algorithm
Let the dataset to be processed be denoted as D, the algorithm's clustering radius, Eps, and the minimum number of objects in the neighborhood, MinPts.Then, the following are some basic concepts of the algorithm [1]: (1) Eps neighboring area: let p be the center of a sphere in the dataset D. For data within the radius Eps of the object's area, a collection of points contained in the sphere is (2) Density: at the position of data point p in the dataset D, the number of points, Num, contained in the neighborhood Eps is its density.
(3) Core point: at the position of data point p in the dataset D, if the density (Num) in the neighborhood Eps satisfies Num ≥ MinPts, it is called a core point.(4) Border point: at the position of data point p in the dataset D, if the density in the neighborhood Eps satisfies Num ≤ MinPts but it is inside the sphere, it is called a border point.(5) Noise point: all the objects other than the core and border points in D. (6) Direct density-reachable: given objects p, q ∈ D, where there is a core point and this is inside the Eps neighborhood of q, it is said that from p to q is direct density-reachable, i.e., q ∈ N Eps (p) , N Eps (p) ≥ MinPts , as shown in Figure 2. (7) Density-reachable: given objects p 1 , p 2 , p 3 , p 4 , . . . .p n ∈ D, where p 1 = q, p n = q, if p i+1 is directdensity-reachable from p i , then p is density-reachable from q.The definition is shown in Figure 3. (8) Density-connected: given objects p, q ∈ D, if there is a point o ∈ D that is density-reachable from p and q, then p and q are density-connected.The definition is shown in Figure 4.
Remote Sens. 2017, 9, 1301 5 of 32 (2) Density: at the position of data point p in the dataset D , the number of points, Num, contained in the neighborhood Eps is its density.
(3) Core point: at the position of data point p in the dataset D , if the density (Num) in the neighborhood Eps satisfies ≥ Num MinPts , it is called a core point.
(4) Border point: at the position of data point p in the dataset D , if the density in the neighborhood Eps satisfies ≤ Num MinPts but it is inside the sphere, it is called a border point.
(5) Noise point: all the objects other than the core and border points in D .(6) Direct density-reachable: given objects , D ∈ p q , where there is a core point and this is inside the Eps neighborhood of q , it is said that from p to q is direct density-reachable, i.e., ( ) inPts, as shown in Figure 2.
(8) Density-connected: given objects , D p q ∈ , if there is a point D o∈ that is density-reachable from p and q , then p and q are density-connected.The definition is shown in Figure 4.   , where there is a core point and this is inside the Eps neighborhood of q , it is said that from p to q is direct density-reachable, i.e., ( ) inPts, as shown in Figure 2.
(7) Density-reachable: given objects then p is density-reachable from q .The definition is shown in Figure 3.
(8) Density-connected: given objects , D p q ∈ , if there is a point D o∈ that is density-reachable from p and q , then p and q are density-connected.The definition is shown in Figure 4.   (2) Density: at the position of data point p in the dataset D , the number of points, Num, contained in the neighborhood Eps is its density.
(3) Core point: at the position of data point p in the dataset D , if the density (Num) in the neighborhood Eps satisfies ≥ Num MinPts , it is called a core point.
(4) Border point: at the position of data point p in the dataset D , if the density in the neighborhood Eps satisfies ≤ Num MinPts but it is inside the sphere, it is called a border point.
(5) Noise point: all the objects other than the core and border points in D .(6) Direct density-reachable: given objects , D ∈ p q , where there is a core point and this is inside the Eps neighborhood of q , it is said that from p to q is direct density-reachable, i.e., ( ) inPts, as shown in Figure 2.
(8) Density-connected: given objects , D p q ∈ , if there is a point D o∈ that is density-reachable from p and q , then p and q are density-connected.The definition is shown in Figure 4.   Based on the aforementioned definitions, the idea of the DBSCAN algorithm is as follows.The search can start from the neighboring Eps area of any interested data point.Given enough data points in the neighborhood ( MinPts ≥ ) the cluster will expand.Otherwise, a data point is temporarily marked as noise.This point can later be found in other Eps neighborhoods and marked as part of a cluster.If a data object in a cluster is marked as a core, its Eps neighborhood is also part of the cluster.Thus, all the points found in the neighborhood, as well as the core neighborhood, are added to the cluster.This process is repeated until density-connected clusters are completely found.Finally, new and untreated points are retrieved and processed to find deeper clusters or noise.After all the objects in dataset D are checked, the algorithm ends.

Processing Procedure of the DBSCAN Algorithm
Based on the previous description of the basic concepts and ideas of the DBSCAN algorithm, its processing flow can be summarized in the following steps.Here, supposing the spatial dataset D , given clustering radius Eps , the minimum number of neighboring objects MinPts, and the current collection of objects as C , meanwhile add all the points in the neighborhood to the collection of objects 1 N .(2) For the collection of objects, 1 N , if object q therein has not been checked, then mark q as 'checked', and then check its Eps neighborhood and calculate the number of objects in the neighborhood n.If n satisfies n MinPts ≥ , then these objects are added to the object collection.If q does not belong to any cluster, then add q to 1 C .
(3) Repeat step (2), and continue to check object set 1 N until it is empty.
(4) Repeat steps (1) to (3).When all the data objects are marked as 'checked', the algorithm ends.The detailed implementation steps can be summarized in Algorithm 1.Based on the aforementioned definitions, the idea of the DBSCAN algorithm is as follows.The search can start from the neighboring Eps area of any interested data point.Given enough data points in the neighborhood (≥MinPts) the cluster will expand.Otherwise, a data point is temporarily marked as noise.This point can later be found in other Eps neighborhoods and marked as part of a cluster.If a data object in a cluster is marked as a core, its Eps neighborhood is also part of the cluster.Thus, all the points found in the neighborhood, as well as the core neighborhood, are added to the cluster.This process is repeated until density-connected clusters are completely found.Finally, new and untreated points are retrieved and processed to find deeper clusters or noise.After all the objects in dataset D are checked, the algorithm ends.

Processing Procedure of the DBSCAN Algorithm
Based on the previous description of the basic concepts and ideas of the DBSCAN algorithm, its processing flow can be summarized in the following steps.Here, supposing the spatial dataset D, given clustering radius Eps, the minimum number of neighboring objects MinPts, and the current collection of objects as N 1 .
(1) All the data objects in dataset D are marked as unchecked.Starting from any unchecked data point p, mark it as 'checked', then check its Eps neighborhood and calculate the number of objects in the neighborhood m.If m satisfies m ≥ MinPts, then create a new cluster C 1 , and add p to C 1 , meanwhile add all the points in the neighborhood to the collection of objects N 1 .(2) For the collection of objects, N 1 , if object q therein has not been checked, then mark q as 'checked', and then check its Eps neighborhood and calculate the number of objects in the neighborhood n.
If n satisfies n ≥ MinPts, then these objects are added to the object collection.If q does not belong to any cluster, then add q to C 1 .(3) Repeat step (2), and continue to check object set N 1 until it is empty.(4) Repeat steps (1) to (3).When all the data objects are marked as 'checked', the algorithm ends.
The detailed implementation steps can be summarized in Algorithm 1.

Analyzing the Sequential DBSCAN Algorithm
Before designing the parallel algorithm, it is necessary to use some professional performanceanalysis tools to determine any hotspots of the algorithm.In this paper, the Intel ® VTune™ profiling tool was used to perform hotspot analysis on the serial algorithm.In the test, it was found that the scale of the dataset was closely related to the performance of the DBSCAN algorithm.For this reason, the dataset scale was changed to carry out the test.The test results are shown in Table 1.From Table 1, it can be seen that the serial algorithm has a hotspot that occupied the most time-consuming part of the algorithm, regardless of the changing scale of the datasets.The hotspot mainly arose due to the neighborhood query function (FindArrivalPoints ( )), which occupied more than 90% of the whole time consumption.

Parallel Design of the DBSCAN Algorithm
In this paper, the mesh and the secondary extended partition strategy was adopted [27] to parallelize the DBSCAN algorithm.After data partitioning using this strategy, the data was divided into local and border areas, and the data points were marked.All points in the boundary area were included in multiple local areas, thus providing the conditions for the later cluster-merging procedure.
Additionally, this strategy solved the problem whereby the same point is divided into different clustering classes during the clustering procedure.
The principle of the parallel algorithm adopted this strategy is described as follows (Figure 5): (1) The formatted dataset from the HDFS is read and cached into the memory.(2) Based on the computation of the RDD, Spark's memory capabilities, and the properties of directed acyclical graphs (DAG), the dataset is divided into computing nodes in order to perform local partition-clustering operations according to the above-mentioned strategy.(3) Each partition performs local clustering operations in parallel.(4) After the local data-clustering is complete, the clustering results are merged.
(5) Finally, the merged results are re-marked, and then the global clusters are generated.A flowchart of these procedures is shown in Figure 5, where it is found that the parallel algorithm has four important stages: (1)  The principle of the parallel algorithm adopted this strategy is described as follows (Figure 5): (1) The formatted dataset from the HDFS is read and cached into the memory.(2) Based on the computation of the RDD, Spark's memory capabilities, and the properties of directed acyclical graphs (DAG), the dataset is divided into computing nodes in order to perform local partition-clustering operations according to the above-mentioned strategy.

Data Partitioning Stage
In this stage, the method of data partition is very important for the implementation of the parallel algorithm.Figure 6 demonstrates the procedure of performing the data partitioning.The data are divided into several data slices with a given length and width according to the data range.Normally, 3Eps is a suitable number to be used in the division, and has been demonstrated to give the best performance, based on experience.In Figure 6, all the points (x, y) that fall into a given rectangle ( 3Eps , 3Eps ) belong to one data slice (e.g., S1), and points within this slice are assigned to the same node processed by the executor.

Data Partitioning Stage
In this stage, the method of data partition is very important for the implementation of the parallel algorithm.Figure 6 demonstrates the procedure of performing the data partitioning.The data are divided into several data slices with a given length and width according to the data range.Normally, 3Eps is a suitable number to be used in the division, and has been demonstrated to give the best performance, based on experience.In Figure 6, all the points (x, y) that fall into a given rectangle (3Eps, 3Eps) belong to one data slice (e.g., S1), and points within this slice are assigned to the same node processed by the executor.In Figure 6, i.e., the data partitioning procedure, allocating points that belong to the same cluster to different nodes is unavoidable; that is, the attribution problem of the boundary points resulting from the data division appears.However, these boundary points are essential to the following cluster-merging process.Therefore, the secondary slice expansion technique was adapted with the first data-partitioning operation.
In Figure 7, data slices 1 and 2 are taken as an example to explain the mechanism of the secondary slice extension method.Slices 1 and 2 have a common boundary point.In order to include the boundary point in its cluster, it is necessary to expand the boundary of both slices.Here, the border of each of the new slices is extended outwards by 0.1Eps .Then, the new slices both contain the boundary point, which is beneficial for the subsequent clustering process.In Figure 6, i.e., the data partitioning procedure, allocating points that belong to the same cluster to different nodes is unavoidable; that is, the attribution problem of the boundary points resulting from the data division appears.However, these boundary points are essential to the following cluster-merging process.Therefore, the secondary slice expansion technique was adapted with the first data-partitioning operation.
In Figure 7, data slices 1 and 2 are taken as an example to explain the mechanism of the secondary slice extension method.Slices 1 and 2 have a common boundary point.In order to include the boundary point in its cluster, it is necessary to expand the boundary of both slices.Here, the border of each of the new slices is extended outwards by 0.1Eps.Then, the new slices both contain the boundary point, which is beneficial for the subsequent clustering process.In Figure 6, i.e., the data partitioning procedure, allocating points that belong to the same cluster to different nodes is unavoidable; that is, the attribution problem of the boundary points resulting from the data division appears.However, these boundary points are essential to the following cluster-merging process.Therefore, the secondary slice expansion technique was adapted with the first data-partitioning operation.
In Figure 7, data slices 1 and 2 are taken as an example to explain the mechanism of the secondary slice extension method.Slices 1 and 2 have a common boundary point.In order to include the boundary point in its cluster, it is necessary to expand the boundary of both slices.Here, the border of each of the new slices is extended outwards by 0.1Eps .Then, the new slices both contain the boundary point, which is beneficial for the subsequent clustering process.An RDD conversion flowchart of the data partitioning stage is shown in Figure 8.
An RDD conversion flowchart of the data partitioning stage is shown in Figure 8. From Figure 8, the text file to be processed is taken out from HDFS and loaded into a RDD.The P in Figure 8 represents a point (i.e., the point's location information).Subsequently, the data to be processed is transformed into a key/value pair by the map operator (where, 0 represents the initial cluster class ID), then is stored by the cache operator to the memory.Finally, the text file is partitioned by the mapPartition operator, where each RDD is treated as a partition.4

.2.2. Local Clustering Stage
At this stage, any data slices that have been segmented in the final step of the first stage are then distributed to the executor via the task scheduler for local DBSCAN clustering calculations.The process is demonstrated in Figure 9.After clustering, the format of the obtained intermediate result is a Key/Value.The values that the Key/Value can take are longitude, latitude, ClusterId, IsLocalRegion, and IsCorePoint.longitude and latitude represent the coordinate position of the current data point, while ClusterId gives the cluster number of the current data point (the default value is 0).IsLocalRegion specifies whether the current From Figure 8, the text file to be processed is taken out from HDFS and loaded into a RDD.The P in Figure 8 represents a point (i.e., the point's location information).Subsequently, the data to be processed is transformed into a key/value pair by the map operator (where, 0 represents the initial cluster class ID), then is stored by the cache operator to the memory.Finally, the text file is partitioned by the mapPartition operator, where each RDD is treated as a partition.

Local Clustering Stage
At this stage, any data slices that have been segmented in the final step of the first stage are then distributed to the executor via the task scheduler for local DBSCAN clustering calculations.The process is demonstrated in Figure 9.
Remote Sens. 2017, 9, 1301 10 of 32 An RDD conversion flowchart of the data partitioning stage is shown in Figure 8. From Figure 8, the text file to be processed is taken out from HDFS and loaded into a RDD.The P in Figure 8 represents a point (i.e., the point's location information).Subsequently, the data to be processed is transformed into a key/value pair by the map operator (where, 0 represents the initial cluster class ID), then is stored by the cache operator to the memory.Finally, the text file is partitioned by the mapPartition operator, where each RDD is treated as a partition.4

.2.2. Local Clustering Stage
At this stage, any data slices that have been segmented in the final step of the first stage are then distributed to the executor via the task scheduler for local DBSCAN clustering calculations.The process is demonstrated in Figure 9.After clustering, the format of the obtained intermediate result is a Key/Value.The values that the Key/Value can take are longitude, latitude, ClusterId, IsLocalRegion, and IsCorePoint.longitude and latitude represent the coordinate position of the current data point, while ClusterId gives the cluster number of the current data point (the default value is 0).IsLocalRegion specifies whether the current data point is located in the area (True) or in the boundary area (False), and IsCorePoint indicates whether the current data is a core object (True) or a border object (False).An RDD conversion flow chart of the local DBSCAN clustering is shown in Figure 10.From Figure 10, via the mapPartion operator, the data are segmented into different data slices.Then, these data slices are assigned to compute nodes.The local DBSCAN clustering operation is performed on each slice by the executor called by the map operator.At this stage, the points on each slice are classified into their respective cluster classes.

Data Merger Stage
After each computing node finishes its data slice(s) calculation, the results are passed to the master node.The master node then processes each object according to the received key/value pairs.If IsLocalRegion is True, the object can output its ClusterId directly.For a data object with a cluster number of 0, it is necessary to determine the value of its IsLocalRegion parameter.If it is not in the boundary area, it is treated directly as a noise point.If it is in the boundary region, it is necessary to determine whether it is included in the field of other core points.For any data points with multiple cluster labels and which are also core points, it is necessary to indicate whether these clusters can be merged and renamed in the next stage.
As shown in Figure 11, after the clustering operation, cluster 1 and cluster 2 have been generated.P is a core point, i.e., it is a boundary point belonging to both clusters.Therefore, this means that these clusters are actually the same cluster, and can thus be merged into a single cluster.From Figure 10, via the mapPartion operator, the data are segmented into different data slices.Then, these data slices are assigned to compute nodes.The local DBSCAN clustering operation is performed on each slice by the executor called by the map operator.At this stage, the points on each slice are classified into their respective cluster classes.

Data Merger Stage
After each computing node finishes its data slice(s) calculation, the results are passed to the master node.The master node then processes each object according to the received key/value pairs.If IsLocalRegion is True, the object can output its ClusterId directly.For a data object with a cluster number of 0, it is necessary to determine the value of its IsLocalRegion parameter.If it is not in the boundary area, it is treated directly as a noise point.If it is in the boundary region, it is necessary to determine whether it is included in the field of other core points.For any data points with multiple cluster labels and which are also core points, it is necessary to indicate whether these clusters can be merged and renamed in the next stage.
As shown in Figure 11, after the clustering operation, cluster 1 and cluster 2 have been generated.P is a core point, i.e., it is a boundary point belonging to both clusters.Therefore, this means that these clusters are actually the same cluster, and can thus be merged into a single cluster.

Global Cluster Generation Stage
In this stage, the resulting clusters are renamed, i.e., each cluster will have only one global cluster number.Thus, the final clustering results are generated, as shown in Figure 12, and a flowchart of the RDD conversion operations for generating the global cluster is shown in Figure 13.
As can be seen in Figure 13, the clustering results distributed to each RDD cluster are merged into one RDD and are re-marked and re-named.As a result, each cluster has only one global cluster class number, and generates the final clustering result.Finally, the result is dumped to a text file and stored in HDFS.

Global Cluster Generation Stage
In this stage, the resulting clusters are renamed, i.e., each cluster will have only one global cluster number.Thus, the final clustering results are generated, as shown in Figure 12, and a flowchart of the RDD conversion operations for generating the global cluster is shown in Figure 13.
As can be seen in Figure 13, the clustering results distributed to each RDD cluster are merged into one RDD and are re-marked and re-named.As a result, each cluster has only one global cluster class number, and generates the final clustering result.Finally, the result is dumped to a text file and stored in HDFS.

Optimizing the Parallel DBSCAN Algorithm on the Single-Node Spark Platform
To implement the parallel algorithm on the Spark platform, it is necessary to select suitable optimization methods according to the algorithm's characteristics, data size, and cluster resources.Reasonable tuning methods can give full play to the computing capability of the Spark platform.The initial parallel DBSCAN algorithm was designed and implemented based only on the algorithm's characteristics, and it did not consider other optimization aspects.In the existing single-node parallel DBSCAN algorithm based on the Spark platform, there is a large amount of data transmissions, RDD object storage, and inter-node communication, which increases the running time of the algorithm to a certain extent.Considering these aspects, this section will optimize the parallel algorithm in terms of the aspects of data transmission, serialization, and resource parameter tuning based on the single-node parallel algorithm, whereby (1) the data transmission optimization method can improve the data transmission speed, and hence reduce the data transmission time; (2) the serialization optimization method can reduce the storage cost of the data and also reduce the overhead of the metadata; and (3) the resource parameter tuning can make full use of the platform computing resources to further improve the efficiency of the parallel algorithm.

Optimization Method 1: Optimization of Data Transmission
Spark uses the broadcast variable (Broadcast) to transmit a large amount of pre-processed data.Optimizing data transmission is needed to improve its overall speed.The introduction of Broadcast variables, i.e., read-only variables, are shared by multiple tasks in the Spark calculation process.Only one copy of the Broadcast variable needs to be saved on each node, and it does not need to carry out variable transferring for every task.This thereby reduces the transmission time, and improves the efficiency of the calculation.The data transmission of Broadcast in each iteration will be stored in the worker memory of the slave nodes.Generally, the memory will not be released in advance, except when the calculation node memory is insufficient.Therefore, when the memory size of each calculation task is greater than 20 kB, it is recommended that Broadcast be used to optimize algorithm performance on the Spark platform.The method for generating a Broadcast variable by calling the SparkContext method is as follows.Let us take the bordering area object variable (borderPoint) as an example, where borderPoint was reused several times in the program.Table 2 shows which variable used the pseudo codes in the program before the data transmission optimization, while the pseudo codes after the optimization are shown in Table 3.
Meanwhile, in the process of implementing the Spark-based parallel algorithm, when the collected data used by the collect operator is too large, the data is stored in the distributed file system.This optimization method can effectively reduce the data I/O overheads and relieve the storage pressure on the Driver.If the amount of collected data is small, the adjustment parameter spark.akka.frameSizealleviates the problem of AKKA buffer overflow caused by large-task distribution, because Spark finishes task distribution by passing messages between the Actor models in the AKKA library.Based on the Scala functional programming language, Spark uses the closure method to transmit data by referencing the function variables.Therefore, using the task to distribute the data will reduce the overall execution efficiency.Setting the parameters of Spark.akka.frameSizecan fix the maximum capacity of the Actor communication message (such as the output of the task) in AKKA used by the Spark platform, whose value defaults to 10 MB.When dealing with large-scale data, the output of the task may be greater than this value.Generally, it needs to be set to a higher value according to the size of the data.

Optimization Method 2: Serialization Optimizing
Serialization plays important roles in the development of distributed computing platforms.In the serializing approach, chained object data turns into data objects stored in a continuous space via a byte array; thus, the serialized object will be stored in a distributed file system with a continuous space form, and transmitted in the form of a data stream.This approach reduces the overheads related to metadata information both of the object itself and the basic data types; therefore, it can effectively reduce the data storage space, and reduce the overhead and stress of the garbage collection (GC).
Those data formats require a lot of time to serialize, and any objects occupying the object's space will slow down the entire application's efficiency.The serialization method is one of the most important performance-tuning methods in Spark, because the RDD object storage and data transfer between nodes is necessary for carrying out the serialization process.
The Spark platform includes both Kyro and Java serialization libraries.Compared with the Java serialization library, the Kyro serialization library allows for serialization to be completed quickly and compactly.Also, users can define customized serialization methods that have good scalability.To implement the parallel DBSCAN algorithm on the Spark platform, which has higher network transmission requirements, better performance can be achieved with the Kyro serialization library.This serialization method is also recommended for network transmission-intensive computing in the official Spark documentation.Therefore, the Kyro serialization library was used to optimize the parallel algorithm.
Spark automatically introduces Kyro serialization support for many of the commonly used Scala core classes, all of which are supported by Spark using the Twitter chill library.Based on the foregoing analysis, the optimized program codes adopting Kyro serialization are shown in Table 4.In a distributed computing environment like the Spark platform, reasonable configuration of resource parameters can make full use of the platform's computing resources and promote its parallel computing capabilities, thereby enhancing its performance.Table 5 gives the main resource parameters and their specific action descriptions in Spark.One thing to note is that each of the parameters corresponds to a part of the operating principle of the job execution.When the parameter of "num-executors" is set to 50-100, one should set "executor-memory" to 4 G-8 G, "executor-cores" to 2-4, the "driver-memory" generally to 1 G, and 500-1000 is recommended for "Spark.default.parallelism",so that the job requirements can be met.When Spark operations have more RDD persistence operations, the parameter "spark.storage.memoryFraction"can be increased.When shuffle class operations become relatively large, the parameter "spark.storage.memoryFraction"should be appropriately reduced.It is also recommended that the parameter "spark.shuffle.memoryFraction"be reduced when the Spark job RDD persistence operation is less and the Shuffle operation is more.

Implementation of the Parallel DBSCAN Algorithm on a Virtual Spark Cluster
When the user does not have enough physical machines to establish a Spark cluster at the beginning of the experiment, virtualization technology can offer an inexpensive way to construct a cluster to deploy enough virtual nodes for testing and rapid program deployment.Thus, it can manage the applications and facilitate interactive user applications quickly.Virtualization technology is a management technology that simulates real-world computing environments, including virtual computer hardware platforms, virtual storage devices, and virtual network resources.As a lightweight container of virtualization technology in recent years, Docker has the advantages of traditional virtualization, meanwhile, it also has characteristics such as file system isolation, resource isolation, network isolation, copy-on-write, and logging recodes.
In order to test the efficiency of the parallel DBSCAN algorithm in the Spark cluster environment, this study used Docker virtualization technology to build a virtualized Spark cluster platform, and tested the parallel DBSCAN algorithm based on this.The principle of the parallel DBSCAN algorithm based on a virtual Spark cluster is as follows.(1) Based on a high-performance physical Linux server, when using the Docker virtualization technology and the existing Spark platform mirrored on Docker Hub, one can dynamically extend the number of the constructed virtual Spark cluster nodes; (2) Then, the parallel DBSCAN algorithm can be tested with different nodes on the virtual Spark cluster; (3) Finally, the impact of the number of nodes with the parallel algorithm's acceleration ratio can be determined.A schematic diagram of the entire procedure is shown in Figure 14.

Implementation of the Parallel DBSCAN Algorithm on a Virtual Spark Cluster
When the user does not have enough physical machines to establish a Spark cluster at the beginning of the experiment, virtualization technology can offer an inexpensive way to construct a cluster to deploy enough virtual nodes for testing and rapid program deployment.Thus, it can manage the applications and facilitate interactive user applications quickly.Virtualization technology is a management technology that simulates real-world computing environments, including virtual computer hardware platforms, virtual storage devices, and virtual network resources.As a lightweight container of virtualization technology in recent years, Docker has the advantages of traditional virtualization, meanwhile, it also has characteristics such as file system isolation, resource isolation, network isolation, copy-on-write, and logging recodes.
In order to test the efficiency of the parallel DBSCAN algorithm in the Spark cluster environment, this study used Docker virtualization technology to build a virtualized Spark cluster platform, and tested the parallel DBSCAN algorithm based on this.The principle of the parallel DBSCAN algorithm based on a virtual Spark cluster is as follows.(1) Based on a high-performance physical Linux server, when using the Docker virtualization technology and the existing Spark platform mirrored on Docker Hub, one can dynamically extend the number of the constructed virtual Spark cluster nodes.(2) Then, the parallel DBSCAN algorithm can be tested with different nodes on the virtual Spark cluster.(3) Finally, the impact of the number of nodes with the parallel algorithm's acceleration ratio can be determined.A schematic diagram of the entire procedure is shown in Figure 14.

Implementation of the Parallel DBSCAN Algorithm on the Spark Cluster with Yarn
This part migrates the parallel DBSACN algorithm from the virtual cluster to a physical Spark cluster, and introduces the Yarn resource manager to deploy the algorithm.When deploying the Spark platform with Yarn, which only supports coarse-grained mode, the container starts on the Yarn resource manager, the system resources cannot be dynamically expanded, and the resources used are fixed.
To implement the parallel DBSCAN algorithm with the Yarn resource manager, Yarn will replace the cluster manager in the Spark cluster.The execution process of the parallel algorithm with Yarn is described as follows:

Implementation of the Parallel DBSCAN Algorithm on the Spark Cluster with Yarn
This part migrates the parallel DBSACN algorithm from the virtual cluster to a physical Spark cluster, and introduces the Yarn resource manager to deploy the algorithm.When deploying the Spark platform with Yarn, which only supports coarse-grained mode, the container starts on the Yarn resource manager, the system resources cannot be dynamically expanded, and the resources used are fixed.
To implement the parallel DBSCAN algorithm with the Yarn resource manager, Yarn will replace the cluster manager in the Spark cluster.The execution process of the parallel algorithm with Yarn is described as follows: (1) The Yarn client uploads the DBSCAN application packages to HDFS, and requests a job commit to the Resource Manager to run this task.(2) Resource Manager responds to the job and requests the Node Manager to create a Spark Application Master and then starts it.(3) Spark Application Master finds the resource files (the Jar packages, etc.) from the HDFS, and starts DAGscheduler and YARN cluster scheduler to process the initialization operations.Then, the Spark application master will apply the resources from the Resource Manager, and start the respective container through the Node Manager.(4) Each container contains multiple executors to perform the corresponding tasks simultaneously, and report their state to the Spark Application Master.
The executing process of the parallel algorithm implemented with this mode is shown in Figure 15.
Remote Sens. 2017, 9, 1301 18 of 32 (1) The Yarn client uploads the DBSCAN application packages to HDFS, and requests a job commit to the Resource Manager to run this task.(2) Resource Manager responds to the job and requests the Node Manager to create a Spark Application Master and then starts it.(3) Spark Application Master finds the resource files (the Jar packages, etc.) from the HDFS, and starts DAGscheduler and YARN cluster scheduler to process the initialization operations.Then, the Spark application master will apply the resources from the Resource Manager, and start the respective container through the Node Manager.(4) Each container contains multiple executors to perform the corresponding tasks simultaneously, and report their state to the Spark Application Master.
The executing process of the parallel algorithm implemented with this mode is shown in Figure 15.In this mode, the specified executor can be assigned to the parallel DBSCAN application via the parameter "-num-executors", and then each executor can be controlled by the parameters "-executormemory" and "-executor-cores" to set the size of the assigned memory and the number of occupied CPU cores.This mechanism is conducive for controlling the system resources that the user-submitted applications take up; thus, it can fully share the cluster resources, and improve the throughput of the Yarn manager.

Implementation of the Parallel DBSCAN Algorithm on a Spark Cluster with Mesos
This part introduces the Mesos resource manager to deploy the Spark platform.Unlike the Yarn resource manager, which adopts a process with an isolation mechanism, the Mesos resource manager uses a Linux container to isolate resources such as memory and CPU.Mesos implements resource sharing in a fine-grained manner, which is aimed at improving cluster utilization, and is highly suitable for the unified management and scheduling of the resources at the data center scale.However, Yarn is more convenient and effective for the management and scheduling of big data jobs.In this mode, the specified executor can be assigned to the parallel DBSCAN application via the parameter "-num-executors", and then each executor can be controlled by the parameters "-executor-memory" and "-executor-cores" to set the size of the assigned memory and the number of occupied CPU cores.This mechanism is conducive for controlling the system resources that the user-submitted applications take up; thus, it can fully share the cluster resources, and improve the throughput of the Yarn manager.

Implementation of the Parallel DBSCAN Algorithm on a Spark Cluster with Mesos
This part introduces the Mesos resource manager to deploy the Spark platform.Unlike the Yarn resource manager, which adopts a process with an isolation mechanism, the Mesos resource manager uses a Linux container to isolate resources such as memory and CPU.Mesos implements resource sharing in a fine-grained manner, which is aimed at improving cluster utilization, and is highly suitable for the unified management and scheduling of the resources at the data center scale.However, Yarn is more convenient and effective for the management and scheduling of big data jobs.Based on Mesos deployment, a Spark cluster needs to implement two classes to perform a calculation through the Mesos resource manager.One class is SparkScheduler, which is derived from the system class MesosScheduler.The other class is SparkExecutor, which comes from the base system class of executor.
To implement the parallel DBSCAN algorithm based on the Mesos resource manager, the Mesos master node will replace the cluster manager in the Spark cluster.The process of executing the parallel algorithm with the Mesos resource manager is described as follows: (1) After submitting a DBSCAN processing request through the client side, Spark will generate RDD and Map/Reduce functions, and then transform and generate the corresponding job information.Each job contains multiple tasks, and the job is submitted to the SparkScheduler.(2) The job sets are sent to the Mesos master node by the SparkScheduler.The Mesos master node performs the corresponding task scheduling to the Mesos slave nodes as the manager, these slave nodes execute the corresponding task in parallel with multiple executors and then return the results.The process of executing the parallel algorithm is shown in Figure 16.
Remote Sens. 2017, 9, 1301 19 of 32 Based on Mesos deployment, a Spark cluster needs to implement two classes to perform a calculation through the Mesos resource manager.One class is SparkScheduler, which is derived from the system class MesosScheduler.The other class is SparkExecutor, which comes from the base system class of executor.
To implement the parallel DBSCAN algorithm based on the Mesos resource manager, the Mesos master node will replace the cluster manager in the Spark cluster.The process of executing the parallel algorithm with the Mesos resource manager is described as follows: (1) After submitting a DBSCAN processing request through the client side, Spark will generate RDD and Map/Reduce functions, and then transform and generate the corresponding job information.Each job contains multiple tasks, and the job is submitted to the SparkScheduler.(2) The job sets are sent to the Mesos master node by the SparkScheduler.The Mesos master node performs the corresponding task scheduling to the Mesos slave nodes as the manager, these slave nodes execute the corresponding task in parallel with multiple executors and then return the results.The process of executing the parallel algorithm is shown in Figure 16.There are two supported modes, i.e., the coarse-grained and the fine-grained mode, which are used to deploy the parallel algorithm on Spark with Mesos resource management.With Mesos, spark.mesos.coarse is set to True to configure the coarse-grained scheduling mode statically.The feature of this mode is that the DBSCAN application has a separate and fixed memory allocation; thus, these free CPU resources can be used to prompt the resource utilization when the application occupying the machine cannot run a task.

Test and Analysis of the Proposed Parallel DBSCAN Algorithm
In this section, the serial and the corresponding parallel DBSCAN algorithms are used to process different GPS trajectory datasets on different computing platforms to evaluate the parallel algorithms' performance.

Configurations of the Experimental Platforms
In this paper, there are three types of experimental platform, i.e., a single-node platform, a virtual Spark cluster, and a physical Spark cluster.The hardware configuration and details of the equipped software are shown in Table 6.There are two supported modes, i.e., the coarse-grained and the fine-grained mode, which are used to deploy the parallel algorithm on Spark with Mesos resource management.With Mesos, spark.mesos.coarse is set to True to configure the coarse-grained scheduling mode statically.The feature of this mode is that the DBSCAN application has a separate and fixed memory allocation; thus, these free CPU resources can be used to prompt the resource utilization when the application occupying the machine cannot run a task.

Test and Analysis of the Proposed Parallel DBSCAN Algorithm
In this section, the serial and the corresponding parallel DBSCAN algorithms are used to process different GPS trajectory datasets on different computing platforms to evaluate the parallel algorithms' performance.

Configurations of the Experimental Platforms
In this paper, there are three types of experimental platform, i.e., a single-node platform, a virtual Spark cluster, and a physical Spark cluster.The hardware configuration and details of the equipped software are shown in Table 6.

Experimental Data
In this paper, all experimental data were obtained from the GPS trajectory datasets of Shenzhen in April 2011.The data sets are entirely in ASCII format, and contain vehicle license plate number, latitude and longitude information, vehicle status, speed, and direction of traffic.Before processing, we need to extract the latitude and longitude information from the existing data and form a new spatial dataset with different scales (in this study, different numbers of vehicle records) of 10 K, 20 K, 40 K, 80 K and 100 K.The hotspot detection processing step is rather time-consuming when the data scale reaches or exceeds 40 K.For real-world applications, the data scale needs to be further increased.The experimental data format and a description are shown in Figure 17.

Experimental Data
In this paper, all experimental data were obtained from the GPS trajectory datasets of Shenzhen in April 2011.The data sets are entirely in ASCII format, and contain vehicle license plate number, latitude and longitude information, vehicle status, speed, and direction of traffic.Before processing, we need to extract the latitude and longitude information from the existing data and form a new spatial dataset with different scales (in this study, different numbers of vehicle records) of 10 K, 20 K, 40 K, 80 K and 100 K.The hotspot detection processing step is rather time-consuming when the data scale reaches or exceeds 40 K.For real-world applications, the data scale needs to be further increased.The experimental data format and a description are shown in Figure 17.

Parallel Algorithm Evaluation Index
In this paper, two factors-speedup scalability is used to evaluate the parallel algorithms' performance.Among them, speedup is an important evaluation index, which can directly reflect the level of parallel efficiency [50].The definition of speedup is as shown in Equation (1): where S p means speedup, T l indicates the processing time of the serial algorithm, and T p represents the processing time of the parallel algorithm.When S p is larger, the performance of the parallel algorithm is higher.

Experiment Design
Based on five selected datasets with different data scales, the performance of the parallel DBSCAN algorithms was tested and compared with a serial algorithm with the fixed clustering parameters (Eps, MinPts) set to (15,90).To obtain the accurate running time of the serial and parallel algorithm, functions such as time.h,clock ( ) and currentTimeMillis ( ) etc. were used for the time statistics.In order to reduce experimental error, all running time values are expressed as the statistical average time over several experiments.The entire experiment consisted of the following five tests on the parallel DBSCAN algorithm: (1) Test the performance of the parallel DBSCAN algorithm based on a single-node Spark platform.
(2) Test the performance of the optimized parallel algorithm based on a single-node Spark platform.
(3) Test the performance of the parallel algorithm based on a virtual Spark cluster.(4) Obtain the performance of the parallel algorithm based on a physical Spark cluster.The tests were conducted employing both the Mesos and Yarn resources managing modes in turn.

Testing the Parallel Algorithm on a Single-Node Spark Platform
For a single-node Spark platform, the running time of the serial and parallel DBSCAN algorithms for the five different data scales are shown in Table 7.For the five datasets, the processing performance of the parallel algorithm based on the single-node Spark platform is significantly higher than that of the serial algorithm.The speedup values of the parallel algorithm are shown in Figure 18.From Figure 18, generally, it can be seen that the speedup of the parallel algorithm based on the single-node Spark platform accelerates gradually as the data scale increases.When the data scale is small, the execution time is shorter, and the acceleration ratio is not ideal.This phenomenon is also influenced by factors related to the Spark platform's internal data transmission and communication processes.When the data scale reaches 40 K, the acceleration ratio tends to be stable.Due to the limitation of CPU core number in the single-node Spark platform, the parallel algorithm can only start a small number of executors.When the spatial data scale reaches the 100 K scale, the obtained speedup reaches 8.41.
The aforementioned test results show that it is feasible to parallelize the DBSCAN algorithm on the Spark platform, which can improve its processing efficiency to some extent.Remote Sens. 2017, 9, 1301 22 of 32 From Figure 18, generally, it can be seen that the speedup of the parallel algorithm based on the single-node Spark platform accelerates gradually as the data scale increases.When the data scale is small, the execution time is shorter, and the acceleration ratio is not ideal.This phenomenon is also influenced by factors related to the Spark platform's internal data transmission and communication processes.When the data scale reaches 40 K, the acceleration ratio tends to be stable.Due to the limitation of CPU core number in the single-node Spark platform, the parallel algorithm can only start a small number of executors.When the spatial data scale reaches the 100 K scale, the obtained speedup reaches 8.41.
The aforementioned test results show that it is feasible to parallelize the DBSCAN algorithm on the Spark platform, which can improve its processing efficiency to some extent.

Testing the Optimized Parallel Algorithm on a Single-Node Spark Platform
The optimized parallel algorithm adopts three measures to improve its performance.The comparison of the parallel algorithm based on the same single-node Spark platform before and after the optimization are demonstrated in Table 8.From Table 8, it can be seen that the optimized parallel algorithm is more efficient than the original parallel algorithm.The acceleration ratios of five different data scales are shown in Figure 19, from which it can be seen that the obtained speedup values increase from 2.82, 4.82, 7.25, 8.33, and The optimized parallel algorithm adopts three measures to improve its performance.The comparison of the parallel algorithm based on the same single-node Spark platform before and after the optimization are demonstrated in Table 8.From Table 8, it can be seen that the optimized parallel algorithm is more efficient than the original parallel algorithm.The acceleration ratios of five different data scales are shown in Figure 19, from which it can be seen that the obtained speedup values increase from 2.82, 4.82, 7.25, 8.33, and 8.41 to 4.85, 9.05, 13.23, 13.70 and 13.90, respectively, for the different data scales.The performance improvement ratio of the optimized parallel algorithm is improved by 71.99%, 87.76%, 82.48%, 64.47%, 65.28% based on the single-node Spark platform.The test results show that the performance of the optimized algorithm is further accelerated, which verifies the effectiveness of the three optimization methods described.

Testing the Parallel Algorithm on a Virtual Spark Cluster Platform
In this section, Docker container virtualization technology is used to build a specified number of Spark virtual clusters dynamically based on a Linux server, and to test the efficiency of the parallel algorithm based on a virtualized Spark cluster.The elapsed time of the algorithm is shown in Table 9, from which it can be seen that the processing time of the parallel algorithm is not reduced with the number of virtual cluster nodes on the single-node Spark platform.The scalability of the parallel algorithm on a virtual cluster is unsatisfactory, because the acceleration ratio of the parallel algorithm does not linearly increase with the number of cluster nodes.
Combined with the Spark platform and Docker container's technical characteristics, the above results arise for the following reasons: (1) The parallel algorithm has intensive I/O operations, so it requires a higher network-transmission state; however, the virtual cluster built with a single-node server encounters transmission pressure.
(2) The performance of the virtual Spark cluster cannot achieve the same desired effects as a physical Spark cluster.
Although the performance of the parallel algorithm on the Docker-based virtual Spark cluster is poor, it presents a simulation of the Spark cluster environment that is able to test the implementation of parallel programs in cluster mode for the user.In the next section, a physical Spark cluster is used to test the parallel algorithm, replacing the virtual Docker Spark cluster.In this section, Docker container virtualization technology is used to build a specified number of Spark virtual clusters dynamically based on a Linux server, and to test the efficiency of the parallel algorithm based on a virtualized Spark cluster.The elapsed time of the algorithm is shown in Table 9, from which it can be seen that the processing time of the parallel algorithm is not reduced with the number of virtual cluster nodes on the single-node Spark platform.The scalability of the parallel algorithm on a virtual cluster is unsatisfactory, because the acceleration ratio of the parallel algorithm does not linearly increase with the number of cluster nodes.
Combined with the Spark platform and Docker container's technical characteristics, the above results arise for the following reasons: (1) The parallel algorithm has intensive I/O operations, so it requires a higher network-transmission state; however, the virtual cluster built with a single-node server encounters transmission pressure.(2) The performance of the virtual Spark cluster cannot achieve the same desired effects as a physical Spark cluster.
Although the performance of the parallel algorithm on the Docker-based virtual Spark cluster is poor, it presents a simulation of the Spark cluster environment that is able to test the implementation of parallel programs in cluster mode for the user.In the next section, a physical Spark cluster is used to test the parallel algorithm, replacing the virtual Docker Spark cluster.As can be seen from Figure 20, the acceleration curves for the obtained speedup for data scales of 10 K and 40 K in this mode tend to be gentle as the number of computing nodes increases.This is mainly because the data scale is relatively small, and hence their processing times are relatively short.With an increase in computing nodes, the expenses of data transmission, resource scheduling, heartbeat detection, etc. reduce the execution efficiency of the algorithm.It can be seen that the acceleration ratio achieved is basically consistent with a linear increase with processing data scales of 80 K and 100 K.However, a single-node cluster cannot process such large amounts data, which indicates that the parallel DBSCAN algorithm based on the Spark in Yarn mode is effective when the data amount is sufficiently large.This method can solve the performance bottleneck that arises when using the serial algorithm, and thus dramatically improves processing efficiency.As can be seen from Figure 20, the acceleration curves for the obtained speedup for data scales of 10 K and 40 K in this mode tend to be gentle as the number of computing nodes increases.This is mainly because the data scale is relatively small, and hence their processing times are relatively short.With an increase in computing nodes, the expenses of data transmission, resource scheduling, heartbeat detection, etc. reduce the execution efficiency of the algorithm.It can be seen that the acceleration ratio achieved is basically consistent with a linear increase with processing data scales of 80 K and 100 K.However, a single-node cluster cannot process such large amounts data, which indicates that the parallel DBSCAN algorithm based on the Spark in Yarn mode is effective when the data amount is sufficiently large.This method can solve the performance bottleneck that arises when using the serial algorithm, and thus dramatically improves processing efficiency.

Testing the Parallel Algorithm on a Spark Cluster with the Mesos Resources Manager
In this part, the same physical Spark cluster using the Mesos resource manager to deploy and test the parallel DBSCAN algorithm was tested.Meanwhile, the performance results achieved under these two modes were tested.The speedup changes for different data scales in this mode are recoded and compared in Figure 21.
In Figure 21, we find the same phenomenon as with the Yarn mode.For example, the acceleration curve tends to be flat with the data scales of between 10 K and 40 K, while better acceleration ratios are achieved as the data scale becomes larger.
To determine the performance difference of these two resources managers, experiments were performed using a data scale of 100 K for both of these modes.A comparison chart of their acceleration performance is shown in Figure 22.
From Figure 22, it can be seen that Yarn shows better performance, and is more suitable than Mesos for applications with more iterations.The reason for this is that the Yarn resource manager is able to deal with the dynamic distribution of resources according to the application requirements when dealing with algorithms with multiple iterations.As a result, Yarn can make full use of the computing capability of the Spark cluster platform.
indicates that the parallel DBSCAN algorithm based on the Spark in Yarn mode is effective when the data amount is sufficiently large.This method can solve the performance bottleneck that arises when using the serial algorithm, and thus dramatically improves processing efficiency.In this part, the same physical Spark cluster using the Mesos resource manager to deploy and test the parallel DBSCAN algorithm was tested.Meanwhile, the performance results achieved under these two modes were tested.The speedup changes for different data scales in this mode are recoded and compared in Figure 21  In Figure 21, we find the same phenomenon as with the Yarn mode.For example, the acceleration curve tends to be flat with the data scales of between 10 K and 40 K, while better acceleration ratios are achieved as the data scale becomes larger.
To determine the performance difference of these two resources managers, experiments were performed using a data scale of 100 K for both of these modes.A comparison chart of their acceleration performance is shown in Figure 22.From Figure 22, it can be seen that Yarn shows better performance, and is more suitable than Mesos for applications with more iterations.The reason for this is that the Yarn resource manager is able to deal with the dynamic distribution of resources according to the application requirements when dealing with algorithms with multiple iterations.As a result, Yarn can make full use of the computing capability of the Spark cluster platform.

Summary of City Traffic Congestion Area Discovery
Urban transportation is one of the main driving forces of urban development, and is responsible for connecting cities, regulating the passenger flow, and providing logistics for city development.
The traffic situation has a decisive impact on a city's system development, and the lives of traveling people [51].However, with the development of the urban economy, the numbers of private cars, taxis, and bus holdings have gradually increased, and the traffic infrastructure in some areas has been unable to meet the growing demand for vehicle operation and transport capacity.Thus, locating areas of traffic congestion, and subsequently targeting the planning of new urban roads to alleviate these problems are important hurdles that urgently need to be overcome in urban design and construction

Summary of City Traffic Congestion Area Discovery
Urban transportation is one of the main driving forces of urban development, and is responsible for connecting cities, regulating the passenger flow, and providing logistics for city development.
The traffic situation has a decisive impact on a city's system development, and the lives of traveling people [51].However, with the development of the urban economy, the numbers of private cars, taxis, and bus holdings have gradually increased, and the traffic infrastructure in some areas has been unable to meet the growing demand for vehicle operation and transport capacity.Thus, locating areas of traffic congestion, and subsequently targeting the planning of new urban roads to alleviate these problems are important hurdles that urgently need to be overcome in urban design and construction [52].
Spatial data contains rich spatial features, thematic features, temporal characteristics, and other semantic information.Therefore, by studying temporal and spatial trajectory spatial data, we can understand the rules of its movement and its corresponding changes.Taxi trajectory data can reflect the traffic situation in a city to a certain extent.Through a large volume of taxi trajectory data, one can quickly understand the dynamical distribution of vehicles and find congested areas in urban traffic.Locating these congested areas can help urban planners understand the problems in urban transportation, and hence adopt some measures to adjust, optimize and upgrade urban roads to meet the growing demand for traffic and transportation.The DBSCAN algorithm can be used in these applications.

Traffic Congestion Area Discovery Based on the Spark Platform
Although the serial DBSCAN algorithm has been successfully applied in the field of traffic-jam discovery, with the gradual increase in data size, it cannot meet the needs of some real-time applications.In order to speed up the process, this study applied the parallel DBSCAN algorithm to the problem of city traffic congestion discovery based on the Spark cluster platform.
First, the GPS trajectory data was cleaned to obtain a spatial data format suitable for data-clustering operations.Then, the spatial data was clustered on each computing node in the Spark cluster.After each computing node had finished a calculation, the clustering results were merged.Then, through certain data processing means, areas of traffic congestion were identified.Finally, based on comparison with the existing research results, the feasibility and practicality of the parallel DBSCAN algorithm were verified.The entire process is demonstrated in Figure 23.

Traffic Congestion Area Discovery Based on the Spark Platform
Although the serial DBSCAN algorithm has been successfully applied in the field of traffic-jam discovery, with the gradual increase in data size, it cannot meet the needs of some real-time applications.In order to speed up the process, this study applied the parallel DBSCAN algorithm to the problem of city traffic congestion discovery based on the Spark cluster platform.
First, the GPS trajectory data was cleaned to obtain a spatial data format suitable for dataclustering operations.Then, the spatial data was clustered on each computing node in the Spark cluster.After each computing node had finished a calculation, the clustering results were merged.Then, through certain data processing means, areas of traffic congestion were identified.Finally, based on comparison with the existing research results, the feasibility and practicality of the parallel DBSCAN algorithm were verified.The entire process is demonstrated in Figure 23.

Experimental Data and Platform Configuration
The experimental data used are taxi GPS data of Shenzhen City in the year of 2011.The data contains a total of 13,799 vehicles and a data size of 1331 MB.The acquired data are from April 18 to April 26, with an acquisition frequency of 1-5 s.The data format as shown in Table 10.Experiments were performed on this data set using the serial DBSCAN algorithm and the corresponding parallel algorithm based on the single-node platform and the Spark cluster platform.The hardware and software configuration of the test platforms were as described in Table 6.

Experimental Data and Platform Configuration
The experimental data used are taxi GPS data of Shenzhen City in the year of 2011.The data contains a total of 13,799 vehicles and a data size of 1331 MB.The acquired data are from April 18 to April 26, with an acquisition frequency of 1-5 s.The data format as shown in Table 10.Experiments were performed on this data set using the serial DBSCAN algorithm and the corresponding parallel algorithm based on the single-node platform and the Spark cluster platform.The hardware and software configuration of the test platforms were as described in Table 6.

Experiments and Analysis
After selecting the above GPS trajectory data, a pre-processing process was used to remove wild values, e.g., (1) GPS geographical location data that exceeded the geographic location of Shenzhen, (2) acquisition data obtained for time periods between 00:00 and 06:00, (3) repeated data points, etc.The GPS data format after pre-processing is demonstrated in Table 11.After the pre-processing process, the data was processed a second time.In this process, we simplified the data, extracting only the latitude and longitude information of the taxi from the pre-processed data.Then, the spatial data was included in the cluster operation using the DBSCAN algorithm following extraction of the latitude and longitude information of the taxi from the pre-processed data.The format of the processed GPS tracking data (here using the car YB000H6 as an example) is shown in Table 12, while a visualization of the taxi trajectory distribution before the clustering is shown in Figure 24.
Based on the above optimized latitude and longitude data, the parallel DBSCAN algorithm based on the Spark platform was used to cluster and obtain the areas most highly frequented by taxis within a fixed time period, i.e., the trajectory interest area.The processed results are shown in Table 13, and a visualization of the taxi trajectory distribution obtained from the parallel DBSCAN clustering algorithm is shown in Figure 25.In this figure, there are two generated clusters (color with black and gray).The two clustered areas show the areas more densely frequented by taxis.The light gray-colored areas represent noise; that is, it shows points that do not satisfy the DBSCAN algorithm cluster conditions.All taxi trajectory GPS data were processed in the same way with the parallel DBSCAN algorithm on the Spark platform.After each node in the Spark cluster completed the data processing, the clustering results were merged to produce the final clustering result.The final clustering result shows dense areas of taxi traffic that appear at specific time slots in the city.
The final results of the entire clustering process show that 14 clusters were generated; that is, the city has 14 densely congested traffic areas in the designated time period.By obtaining the GPS coordinates of the central points of these clusters, we can locate the congested regional centers and find the corresponding congested sections.The coordinates of the central points of these clusters and the corresponding distribution of the congested urban sections are shown in Table 14.The results in the table are consistent with existing research findings for the Shenzhen Traffic Jamming Area [53], which shows that the parallel algorithm has achieved correct results and verifies the practicality of the parallel DBSCAN algorithm based on the Spark platform.All taxi trajectory GPS data were processed in the same way with the parallel DBSCAN algorithm on the Spark platform.After each node in the Spark cluster completed the data processing, the clustering results were merged to produce the final clustering result.The final clustering result shows dense areas of taxi traffic that appear at specific time slots in the city.
The final results of the entire clustering process show that 14 clusters were generated; that is, the city has 14 densely congested traffic areas in the designated time period.By obtaining the GPS coordinates of the central points of these clusters, we can locate the congested regional centers and find the corresponding congested sections.The coordinates of the central points of these clusters and the corresponding distribution of the congested urban sections are shown in Table 14.The results in All taxi trajectory GPS data were processed in the same way with the parallel DBSCAN algorithm on the Spark platform.After each node in the Spark cluster completed the data processing, the clustering results were merged to produce the final clustering result.The final clustering result shows dense areas of taxi traffic that appear at specific time slots in the city.
The final results of the entire clustering process show that 14 clusters were generated; that is, the city has 14 densely congested traffic areas in the designated time period.By obtaining the GPS coordinates of the central points of these clusters, we can locate the congested regional centers and find the corresponding congested sections.The coordinates of the central points of these clusters and the corresponding distribution of the congested urban sections are shown in Table 14.The results in the table are consistent with existing research findings for the Shenzhen Traffic Jamming Area [53], which shows that the parallel algorithm has achieved correct results and verifies the practicality of the parallel DBSCAN algorithm based on the Spark platform.At the same time, performance experiments based on two data sets were performed.One had 4000 K trajectory records and a data volume of 1054 MB, while the other had 2000 K trajectory records and a data size of 500 MB.The processing time required for these data sets using the traditional serial algorithm and the parallel algorithm on Spark platform was tested, and the results are shown in Table 15.From Table 15, it can be seen that when the data scale was 2000 K, the elapsed time of the serial algorithm was 13 h, which cannot satisfy real-time requirements.Compared with the serial algorithm, the parallel algorithm processing time was 0.4 h.The obtained speedup ratio is 32, which indicates that the parallel algorithm can effectively improve processing efficiency.When the spatial data scale reached 4000 K, the single-node method failed; however, the parallel algorithm's processing time was 2.2 h.
In summary, compared with traditional processing methods, the parallel DBSCAN algorithm based on the Spark platform can dramatically improve the efficiency of city congestion discovery.These experiments verify the high efficiency and practicability of the parallel algorithm proposed in this paper.

Conclusions and Future Directions
To resolve the problem of long processing times associated with large-scale data processed with the serial DBSCAN algorithm, in this paper, the big data processing platform Spark was used to design and implement a parallel DBSCAN clustering algorithm.The experimental results show that the DBSCAN algorithm achieves a stable acceleration effect on the single-node Spark platform compared with the serial program.In addition, on the Spark cluster platform, the Spark-Yarn deployment method is more suitable for applications that have many iterations compared to the Spark-Mesos method.However, there is still a place for optimizing the currently implemented parallel algorithm.For example, we could explore the data partitioning and merge the optimization strategies, or introduce the Spark map calculation method, etc., in order to further speed up the processing speed of the algorithm in the future. p

Figure 5 .
Figure 5. Flow chart of the parallel DBSCAN algorithm.

Figure 6 .
Figure 6.The first phase of data partitioning.

Figure 7 .
Figure 7.The second phase of data partitioning.

Figure 6 .
Figure 6.The first phase of data partitioning.

Figure 6 .
Figure 6.The first phase of data partitioning.

Figure 7 .
Figure 7.The second phase of data partitioning.

Figure 7 .
Figure 7.The second phase of data partitioning.

Figure 8 .
Figure 8. RDD conversion flowchart of the data partitioning stage.

Figure 8 .
Figure 8. RDD conversion flowchart of the data partitioning stage.

Figure 8 .
Figure 8. RDD conversion flowchart of the data partitioning stage.

Figure 10 .
Figure 10.RDD conversion flowchart in the local clustering stage.

Figure 10 .
Figure 10.RDD conversion flowchart in the local clustering stage.

Figure 12 .
Figure 12.A global cluster generation stage.

Figure 13 .
Figure 13.Flowchart of the RDD conversion of the global cluster generation.

Figure 13 .
Figure 13.Flowchart of the RDD conversion of the global cluster generation.Figure 13.Flowchart of the RDD conversion of the global cluster generation.

Figure 13 .
Figure 13.Flowchart of the RDD conversion of the global cluster generation.Figure 13.Flowchart of the RDD conversion of the global cluster generation.

Figure 14 .
Figure 14.Parallel DBSCAN algorithm based on a Docker-based virtual Spark cluster.

Figure 14 .
Figure 14.Parallel DBSCAN algorithm based on a Docker-based virtual Spark cluster.

Figure 15 .
Figure 15.Parallel DBSCAN algorithm based on Yarn resource manager.

Figure 15 .
Figure 15.Parallel DBSCAN algorithm based on Yarn resource manager.

Figure 17 .
Figure 17.Description and data format of the GPS trajectory data.

Figure 17 .
Figure 17.Description and data format of the GPS trajectory data.

Figure 18 .
Figure 18.Speedup of the parallel DBSCAN algorithm based on a single-node Spark platform.

Figure 18 .
Figure 18.Speedup of the parallel DBSCAN algorithm based on a single-node Spark platform.

Figure 19 .
Figure 19.Comparison of achieved Speedup before and after the optimization.

Figure 19 .
Figure 19.Comparison of achieved Speedup before and after the optimization.

5. 5 . 4 . 32 Figure 20 .
Figure 20.Achieved speedup of the parallel algorithm with different data scales with the Yarn resources manger.

5. 5 . 5 .
Testing the Parallel Algorithm on a Spark Cluster with the Mesos Resources ManagerIn this part, the same physical Spark cluster using the Mesos resource manager to deploy and test the parallel DBSCAN algorithm was tested.Meanwhile, the performance results achieved under these two modes were tested.The speedup changes for different data scales in this mode are recoded and compared in Figure21.

Figure 20 .
Figure 20.Achieved speedup of the parallel algorithm with different data scales with the Yarn resources manger.

5. 5 . 5 .
Testing the Parallel Algorithm on a Spark Cluster with the Mesos Resources Manager

Figure 21 . 32 Figure 21 .
Figure 21.Achieved Speedup of the parallel algorithm at different data scales with the Mesos resource manager.

Figure 22 .
Figure 22.Performance comparison of Yarn and Mesos resources manager.

Figure 22 .
Figure 22.Performance comparison of Yarn and Mesos resources manager.

Figure 23 .
Figure 23.Entire process of the congested area discovery with GPS trajectory data.

Figure 23 .
Figure 23.Entire process of the congested area discovery with GPS trajectory data.

Figure 24 .
Figure 24.The trajectory distribution of the original data before clustering.

Figure 25 .
Figure 25.Visualization of the trajectory data distribution after the clustering operation.

Figure 24 .
Figure 24.The trajectory distribution of the original data before clustering.

Figure 24 .
Figure 24.The trajectory distribution of the original data before clustering.

Figure 25 .
Figure 25.Visualization of the trajectory data distribution after the clustering operation.

Figure 25 .
Figure 25.Visualization of the trajectory data distribution after the clustering operation.

Algorithm 1. DBSCAN algorithm implementation steps. Input Data: Data Set D to be processed Output Data: Cluster that satisfies the Clustering requirements Parameters: Clustering
Radius Eps, Minimal number of neighboring points MinPts
Sens. 2017, 9, 1301 11 of 32 data point is located in the area (True) or in the boundary area (False), and IsCorePoint indicates whether the current data is a core object (True) or a border object (False).An RDD conversion flow chart of the local DBSCAN clustering is shown in Figure10.

Table 2 .
Optimization Method One-the pseudo codes before data transmission optimization.

Table 3 .
Optimization Method One-the pseudo codes after the data transmission optimization.

Table 4 .
The pseudo codes for Optimization Method 2.

Table 5 .
Resource parameter tuning.Set the number of default tasks spark.storage.memoryFractionConfigure the amount of space used for caching in RDD, which defaults to 0.67 spark.shuffle.memoryFractionConfigure Executor memory scale in shuffle process, which defaults to 0.2

Table 7 .
Running time of serial and parallel DBSCAN algorithm based on a single-node Spark platform.

Table 8 .
Comparison of the elapsed time of the parallel algorithm and its optimized algorithm.

Table 8 .
Comparison of the elapsed time of the parallel algorithm and its optimized algorithm.

Table 9 .
Running time of serial and parallel DBSCAN algorithms based on Docker Spark platform (s).Testing the Parallel Algorithm on a Spark Cluster with the Yarn Resources Manager

Table 9 .
Running time of serial and parallel DBSCAN algorithms based on Docker Spark platform (s). .

Table 10 .
Experimental GPS data format (using one vehicle as an example).

Table 10 .
Experimental GPS data format (using one vehicle as an example).

Table 11 .
GPS data format after the pre-processing procedure.

Table 12 .
GPS data format after optimization processing.

Table 14 .
Central points of congested areas.

Table 15 .
Processing time comparison of the serial algorithm and the parallel algorithm with different data sets