An Adaptive Trajectory Clustering Method Based on Grid and Density in Mobile Pattern Analysis

Clustering analysis is one of the most important issues in trajectory data mining. Trajectory clustering can be widely applied in hotspot detection, mobile pattern analysis, urban transportation control, and hurricane prediction. To obtain good clustering performance, existing trajectory clustering approaches require one or more input parameters whose optimal values must be calibrated, which results in a heavy workload and high computational complexity. To realize adaptive parameter calibration and reduce the workload of trajectory clustering, an adaptive trajectory clustering approach based on grid and density (ATCGD) is proposed in this paper. The proposed ATCGD approach includes three parts: partition, mapping, and clustering. In the partition phase, ATCGD applies the average angular difference-based MDL (AD-MDL) partition method, which preserves partition accuracy while decreasing the number of segments produced by the partition. During the mapping procedure, the partitioned segments are mapped into the corresponding cells, and the mapping relationship between the segments and the cells is stored. In the clustering phase, adopting a DBSCAN-based method, the segments in the cells are clustered on the basis of the parameter values calibrated in the mapping procedure. Extensive experiments indicate that although the results of the adaptive parameter calibration are not optimal, in most cases the difference between the adaptive calibration and the optimal one is less than 5%, while the run time of clustering is reduced by about 95% compared with the TRACLUS algorithm.


Introduction
In recent years, with the rapid development of sensor technology and smart phones, GPS devices are widely applied to track moving objects, e.g., humans, vehicles, and animals, which can produce huge amounts of trajectory data every day. The trajectory data is the spatial-temporal data series from the moving objects with different timestamps. They contain a lot of information and help us understand the behaviors of the moving objects more directly. For example, zoologists can cluster the paths of animals to study the migration of animals [1]. Meteorologists explore the movement path of hurricanes through clustering and correlation analysis to improve the capabilities in disaster early warning and prevention [2]. Based on the clustering analysis of the movement patterns of vehicles, traffic managers can plan urban roads to mitigate the traffic jams [3,4]. For example, Yue et al. proposed the single-linkage clustering method to analyze taxi trajectory data to detect the time-dependent hot spots and movement patterns for urban traffic planning [5]. Moreover, a mobility-based clustering of vehicle trajectories was presented to detect hotspots and avoid the traffic jams [6].
Clustering analysis is one of the most important methods used in trajectory data mining. Trajectory clustering approaches can be applied in hotspot path analysis, mobility pattern analysis, and urban planning. At present, trajectory clustering approaches fall into two types [7]. The first type clusters the trajectory data based on the similarity of the full sequences; in other words, it takes the whole trajectory as the unit of clustering. These approaches work well for simple trajectories; however, they perform poorly on complex trajectories because they ignore the local detail of the sequences. The second type clusters the trajectory data based on the similarity of sub-sequences: the whole complex trajectory sequence is divided into several segments, and each segment is treated as one unit of clustering. Approaches of the second type are able to recognize the local features of complex trajectories.
Nonetheless, most available trajectory clustering algorithms depend on the calibration of one or multiple parameters. Meanwhile, the parameter values have a great influence on the effect of clustering. To reduce the complexity and workload of parameter calibration in trajectory clustering, a method called Adaptive Trajectory Clustering approach based on Grid and Density (ATCGD) is proposed in this paper. ATCGD firstly divides the trajectory data into multiple discrete segments through the average angular difference-based MDL (AD-MDL) algorithm. All of the discrete segments are mapped into the corresponding cells. Then, it calculates the average distance among the different segments in each cell, and the average number of the trajectory segments in each cell. Finally, adopting a DBSCAN-based approach, ATCGD carries out the adaptive parameter calibration based on the above data to realize effective and accurate trajectory clustering. As an illustration of the capabilities of the proposed method, we evaluate the performance of ATCGD approach on clustering quality and cost using two data sets from the random trajectories and hurricane trajectories in the Atlantic Ocean. The experimental results indicate that although the results of the adaptive parameter calibration are not optimal, in most cases, the difference between the adaptive calibration and the optimal one is less than 5%, while the run time of clustering can be reduced by about 95%.
The remainder of this paper is organized as follows: Section 2 discusses the related works and analyzes their drawbacks. The discrete trajectory partition algorithm, that is the average angular difference-based MDL (AD-MDL), is discussed in Section 3. Section 4 presents the proposed ATCGD approach, and the performance evaluations are given in Section 5. Discussion and conclusions are given in Section 6.

Trajectory Clustering Approaches
Trajectory data can be regarded as time sequence data. Trajectory clustering is an important part of clustering analysis. To study the trajectory clustering of mobile objects, Gaffney et al. presented the mixture regression model-based trajectory clustering algorithm [8]. Furthermore, considering the temporal feature of trajectories, the spatial distance of the mobile objects was expanded to the spatial-temporal distance of the trajectories [9]. The time-focused trajectories clustering of moving objects algorithm, TFCTMO, was proposed based on the spatial-temporal distance. To obtain the moving cluster in the spatial-temporal trajectory data, the filter-based spatial-temporal clustering algorithm was discussed [10]. The filter-based cluster algorithm first filtered the trajectory data in the different time-scale ranges, and then clustered the data in the spatial-scale range within the same timestamp. All of the above clustering algorithms are based on the similarity of the full sequences.
Lee et al. argued that clustering approaches based on the full sequences may perform poorly on complex trajectories because they ignore local partial similarity [11]. They put forward a partition-and-group framework and a clustering algorithm, TRACLUS, that divides the whole trajectory into several segments and clusters them with the DBSCAN method [12][13][14]. The TRACLUS algorithm can recognize the local partial similarity of trajectories; however, in order to obtain good clustering quality, TRACLUS requires a large amount of work to calibrate two parameters (the scanning range eps and the density minPts of each group). At the same time, the appropriate values of the two parameters are sensitive to the data set. In order to reduce the complexity and workload of parameter calibration, several parameter-adaptive clustering algorithms based on DBSCAN were put forward. For example, a self-adaptive density-based clustering algorithm (SA-DBSCAN) was presented in [15]. In the SA-DBSCAN approach, the distance of every object pair in the data set is calculated as the input for determining the two parameters eps and minPts. Although SA-DBSCAN can achieve good accuracy, it has a high computational complexity of $O(n^2)$. Furthermore, by integrating the Affinity Propagation (AP) clustering method with DBSCAN, an AP-based clustering algorithm (APSCAN) was presented to cluster objects without parameters [16]. However, the APSCAN algorithm still needs to compute the distance of every object pair and thus exhibits high complexity. To further realize adaptive parameter calibration, the GCMDDBSCAN clustering algorithm established grid cells based on the data, and then clustered the data based on optimal values of the parameters eps and minPts with the cell as a unit [17].
From the above analysis, all of the DBSCAN-based clustering algorithms can achieve the adaptive parameter clustering for the simple object data. Considering the spatial and temporal characteristics of trajectory data, which differs from that of the simple object data, the trajectory clustering algorithm should reduce the computation complexity of clustering algorithms, especially in large-scale vehicle trajectories from intelligent systems. Based on the analysis of the DBSCAN-based clustering algorithms with adaptive parameter calibration, an Adaptive Trajectory Clustering approach based on Grid and Density (ATCGD) is proposed in this paper. ATCGD firstly divides the trajectory data into discrete trajectory segments based on the MDL-based method. All of the segments are mapped into the corresponding cells. Then, it calculates the average distance among the different segments in each grid cell, and the average number of the trajectory segments in each cell. Finally, adopting the idea of the DBSCAN-based method, ATCGD carries out the adaptive parameter calibration based on the above data to realize effective and accurate trajectory clustering.
Li et al. found that the existing trajectory algorithms focused on the static data and cannot deal with the problem of the data dynamic growth [18], so an incremental clustering framework of the trajectory, TCMM, was presented. In the TCMM framework, the whole trajectory was divided into several sequences and micro-clusters were established and dynamically maintained. The K-means method was also applied to the trajectory clustering problem [19]. However, it needed to determine the value of K in advance and cannot deal with noisy data, which results in poor performance in actual applications. Furthermore, the space covered by the trajectories was divided into cells. The trajectory clustering based on cells was proposed to cluster the grids when each cell is an object [20]. The cells-based clustering algorithm can exhibit good processing performance, while it ignores the differences among the sequences and leads to the poor clustering accuracy.

Trajectory Partition Methods
The proposed ATCGD algorithm includes three parts: partition, mapping, and clustering, as shown in Figure 1. In the partition phase, ATCGD applies the average angular difference-based MDL (AD-MDL) partition method, which preserves partition accuracy while decreasing the number of segments produced by the partition. During the mapping procedure, the partitioned segments are mapped into the corresponding cells, and the mapping relationship between the segments and the cells is stored. In the clustering phase, adopting a DBSCAN-based method, the segments in the cells are clustered on the basis of the parameter values computed in the mapping procedure. The clustering results can be applied in hotspot path analysis, mobility pattern analysis, and urban planning.
In the field of trajectory partition, most trajectory partition approaches rely on trajectory compression algorithms. The classical one is the Douglas-Peucker (DP) algorithm [21], which detects unnecessary points by calculating the information loss. By introducing the concept of a "window", that is, a segment, into the information loss computation, the OPening Window algorithm (OPW) was proposed [22]. OPW compresses the trajectories iteratively with one "window" as the unit, instead of one whole trajectory as the unit, which greatly reduces the computation cost. Afterwards, taking the time dimension into consideration, the Top-Down Time Ratio (TD-TR) algorithm was presented [12], and the optimal upper bound of errors compression algorithm (SQUISH-E) was proposed [23]. These further improved the applicability of the compression algorithms to GPS trajectory data. Lee et al. put forward the trajectory partition algorithm based on the Minimum Description Length (MDL) [11], which can effectively compress data as well as ensure the accuracy of the compressed data. From the above analysis, it can be found that most compression algorithms try to obtain successive sequences of a trajectory, which means that all segments are connected end-to-end. However, continuity is unnecessary for the clustering of trajectory segments, and the accuracy of the compressed data can be improved by working with discrete segments of the trajectory. As shown in Figure 2a,b, $TS_{c-rep1}$ and $TS_{c-rep2}$, marked with the red line, are the continuous representative segments of the original trajectory data $TS_{original}$, and $TS_{d-rep1}$ and $TS_{d-rep2}$, marked with the green line, are the discrete representative segments. The dashed area represents the area difference between the representative segments and the original trajectory. It is obvious that the area difference between $TS_{c-rep}$ and $TS_{original}$ is greater than that between $TS_{d-rep}$ and $TS_{original}$.
In this paper, adopting the AD-MDL discrete trajectory partition method, the proposed ATCGD trajectory clustering approach maps all of the segments into the corresponding cells. Then, based on the idea of the DBSCAN-based method, the segments are clustered through the adaptive calibration of parameters using the mapping relationship. The experimental results illustrate that the ATCGD approach can improve the efficiency of clustering as well as ensure the accuracy.

Distance Measure Between the Trajectory Segments
Definition 1 (trajectory). In a given Euclidean space, a trajectory is composed of a series of trajectory points, expressed as $TR = \{P_1, P_2, \dots, P_n\}$, where the discrete trajectory points are sorted by timestamp, $P_i = (x_i, y_i)$ refers to trajectory point $i$, and $n$ represents the number of points in the trajectory.
Definition 2 (sub-trajectory segment). Two adjacent discrete trajectory points $P_i$ and $P_{i+1}$ are connected to form a trajectory segment $P_iP_{i+1}$, which is a sub-trajectory segment, denoted as $TS_i$.
A trajectory sequence consists of a series of discrete points. Two adjacent discrete points are connected to form a sub-trajectory segment. Due to the massive amount of trajectory data generated by mobile phones and other GPS equipment, trajectory data compression is an important task for sub-trajectory segment clustering. To reduce the workload of clustering all of the trajectory data, the trajectory $TR = \{P_1, P_2, \dots, P_n\}$ should first be partitioned into multiple sub-trajectory segments $TS = \{TS_1, TS_2, \dots, TS_{n-1}\}$ by adopting an appropriate compression algorithm.
Lee et al. proposed a method to calculate the distance between two sub-trajectory segments as the weighted sum of the horizontal distance, vertical distance, and angular distance [10]. That distance is suitable for trajectory clustering. The horizontal distance can effectively avoid the noisy data problem when the distance between two long trajectory segments is large. However, the angular distance may cause the problem of short trajectory segment priority, which means that the shorter the trajectory segment is, the smaller the angular distance is. To solve this problem, a new method to calculate the distance between different segments is presented in this paper. As shown in Figure 3, $TS_1$ is the shorter trajectory segment and $TS_2$ is the longer one. $l_{\perp 1}$ and $l_{\perp 2}$ are the minimum and maximum vertical distances from any point in $TS_1$ to the segment $TS_2$, respectively. $l_{\parallel 1}$ and $l_{\parallel 2}$ are the distances from the corresponding intersections to the endpoint, respectively. $d_{\perp}$ is the vertical distance between the two segments, calculated with $l_{\perp 1}$ and $l_{\perp 2}$; $d_{\parallel}$ is the horizontal distance between the two segments, calculated with $l_{\parallel 1}$ and $l_{\parallel 2}$; and $\theta$ is the angle between the two segments $TS_1$ and $TS_2$, as given by Equations (1) and (2). The distance between the two segments $TS_1$ and $TS_2$ is then computed as shown in Equation (3).
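The geometric quantities named above can be illustrated with a small sketch. It computes the perpendicular offsets, the parallel gaps, and the angle for two 2-D segments, and then combines them using the component definitions of the TRACLUS distance of Lee et al. (a Lehmer-mean vertical distance and a minimum horizontal distance). Those two combinations are an assumption used here as stand-ins; they are not necessarily the paper's own Equations (1) to (3). The function name `segment_distance_parts` is hypothetical.

```python
import math

def segment_distance_parts(short_seg, long_seg):
    """Return (d_perp, d_par, theta) for two 2-D segments.

    Each segment is ((x_start, y_start), (x_end, y_end)).
    d_perp and d_par follow TRACLUS-style definitions (an assumption here).
    """
    (ax, ay), (bx, by) = long_seg
    (px, py), (qx, qy) = short_seg
    dx, dy = bx - ax, by - ay          # direction of the longer segment
    ux, uy = qx - px, qy - py          # direction of the shorter segment
    long_len = math.hypot(dx, dy)
    short_len = math.hypot(ux, uy)
    if long_len == 0 or short_len == 0:
        raise ValueError("segments must have non-zero length")

    def signed_offset(x, y):
        # Signed perpendicular distance from (x, y) to the line through long_seg.
        return (dx * (y - ay) - dy * (x - ax)) / long_len

    def along(x, y):
        # Position of the perpendicular foot, measured from the start of long_seg.
        return ((x - ax) * dx + (y - ay) * dy) / long_len

    o1, o2 = signed_offset(px, py), signed_offset(qx, qy)
    l_perp_max = max(abs(o1), abs(o2))
    # If the short segment crosses the line, its minimum offset is zero.
    l_perp_min = 0.0 if o1 * o2 < 0 else min(abs(o1), abs(o2))

    def gap_to_nearest_endpoint(s):
        return min(abs(s), abs(long_len - s))

    l_par1 = gap_to_nearest_endpoint(along(px, py))
    l_par2 = gap_to_nearest_endpoint(along(qx, qy))

    cos_theta = (ux * dx + uy * dy) / (short_len * long_len)
    theta = math.acos(max(-1.0, min(1.0, cos_theta)))

    perp_sum = l_perp_min + l_perp_max
    d_perp = 0.0 if perp_sum == 0 else (l_perp_min ** 2 + l_perp_max ** 2) / perp_sum
    d_par = min(l_par1, l_par2)
    return d_perp, d_par, theta
```

For example, with the longer segment from (0, 0) to (10, 0) and the shorter one from (2, 1) to (4, 3), the perpendicular offsets are 1 and 3, the parallel gaps are 2 and 4, and the angle is π/4.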

Definition 3 (representative trajectory segments). Given a set of trajectory segments $TS = \{TS_1, TS_2, \dots, TS_n\}$, $TS$ can be represented with a trajectory segment $TS_{rep}$ as the representative trajectory segment.
According to the discussion in Section 2, Figure 2a,b shows that the area difference between $TS_{c-rep}$ and $TS_{original}$ is greater than that between $TS_{d-rep}$ and $TS_{original}$. In order to reduce the area difference between the set of partitioned segments and the original whole trajectory, the ATCGD approach uses discrete representative trajectory segments, instead of continuous ones, to replace the original whole trajectory. Figure 4 illustrates the discrete representative segments. As shown in Figure 4, $P_i$, $i = 1, \dots, 5$, denotes the trajectory points of the original trajectory, and $P_{mid}$ is the middle point of the original trajectory, with coordinates $(x_{mid}, y_{mid})$.
In Figure 4, $TS_{mid}$ is the trajectory line through the middle point $P_{mid}$. Suppose that $TS_i \cdot \theta$ represents the clockwise angle between the trajectory segment $TS_i$ and the horizontal line, where $0 \le TS_i \cdot \theta < \pi$. $TS_{mid} \cdot \theta$ is the clockwise angle between $TS_{mid}$ and the horizontal line and is calculated by Equation (4). Then, two perpendicular lines are drawn from the two endpoints $P_1$ and $P_5$ of the original trajectory to the line $TS_{mid}$, intersecting it at the points $P_s$ and $P_e$, respectively. The trajectory segment $P_sP_e$ is the representative trajectory segment of the original trajectory $\{P_1, P_2, P_3, P_4, P_5\}$, denoted as $TS_{rep}$. The coordinate values of the intersection $P_s$ can be calculated with Equation (5):
$$x_s = \frac{y_1 + \tan(\pi/2 - TS_{mid} \cdot \theta)\, x_1 - y_{mid} + \tan(TS_{mid} \cdot \theta)\, x_{mid}}{\tan(TS_{mid} \cdot \theta) + \tan(\pi/2 - TS_{mid} \cdot \theta)} \qquad (5)$$
In the same way, the coordinate values of the intersection $P_e$ can be calculated. It is obvious that the representative trajectory segments obtained by the above method are discrete and are not connected end-to-end.
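As a rough illustration of this construction, the sketch below drops perpendiculars from the first and last trajectory points onto the line through the middle point. It uses a vector projection, which is algebraically equivalent to Equation (5) for non-degenerate angles and avoids evaluating the tangent near π/2. The middle point and the line angle are passed in by the caller, because their defining expressions are not reproduced in this text. The line is assumed to have slope tan(θ), as Equation (5) implies, and the name `representative_segment` is hypothetical.

```python
import math

def representative_segment(points, p_mid, theta):
    """Project the first and last trajectory points onto the line through
    p_mid with direction angle theta; return the feet (P_s, P_e)."""
    ux, uy = math.cos(theta), math.sin(theta)   # unit direction of the line

    def foot(p):
        # Scalar position of p along the line, measured from p_mid.
        s = (p[0] - p_mid[0]) * ux + (p[1] - p_mid[1]) * uy
        return (p_mid[0] + s * ux, p_mid[1] + s * uy)

    return foot(points[0]), foot(points[-1])
```

With `p_mid = (0, 0)` and `theta = math.pi / 4`, a trajectory that starts at (0, 2) and ends at (4, 2) is represented by the segment from (1, 1) to (3, 3), which matches the value of $x_s$ given by Equation (5).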
From Figure 2, the area difference between $TS_{c-rep}$ and $TS_{original}$ is greater than that between $TS_{d-rep}$ and $TS_{original}$. Therefore, this discreteness does not adversely affect the clustering results; instead, it yields more accurate representative segments of the original trajectory.
To evaluate the accuracy of the representative trajectory segments, the cumulative distance difference between the discrete representative trajectory segment $TS_{rep}$ and the set of original continuous segments $TS = \{TS_1, TS_2, \dots, TS_n\}$ is introduced, denoted as $\phi$. Because the vertical distance is a major factor in the difference between the representative trajectory segment and the original ones, the vertical distance is adopted to compute the cumulative distance difference, as shown in Equation (6), where $n$ is the number of original segments. The smaller $\phi$ is, the more accurate the representative trajectory segment is. To verify the accuracy of the discrete representative trajectory segments, 1000 trajectories were randomly selected from the GeoLife data sets [24]. Let $\phi_{discrete}$ denote the cumulative distance difference between the discrete representative segment and the original trajectory segments, and let $\phi_{continuous}$ denote the cumulative distance difference between the continuous representative segment and the original ones. In the experiments, $\phi_{discrete} \le \phi_{continuous}$ held for 982 of the 1000 trajectories, while $\phi_{discrete} > \phi_{continuous}$ held for only 18. These results indicate that the discrete representative trajectory segment substitutes for the original trajectory more accurately.

Discrete Trajectory Partition Algorithm
From daily life experience, we know that the trajectories of people's or vehicles' movements are usually relatively smooth. That is to say, the angle between two adjacent trajectory segments changes very little. To quantify the variations of trajectories, the average angular difference $Avg_{angle-diff}$ is introduced. Given trajectory data $TR = \{P_1, P_2, \dots, P_m\}$, the average angular difference $Avg_{angle-diff}$ is calculated as shown in Equation (7).
Lee et al. put forward the trajectory partition algorithm based on the Minimum Description Length (MDL) to compress data [8]. MDL is derived from information theory and describes a given data set using fewer symbols than are needed to describe the data literally; in essence, it can be applied to data compression. In trajectory data compression, MDL can obtain a tradeoff between the number of sub-trajectory segments and the accuracy of the trajectory partition results, but it has a high computational complexity. In order to reduce the complexity of the trajectory partition, the average Angular Difference-based MDL (AD-MDL) is proposed to compress the trajectory data and partition the trajectories. AD-MDL consists of two phases: data filtering and trajectory partition.
In the data filtering phase, the obvious outliers are eliminated at minimum cost based on the average angular difference $Avg_{angle-diff}$, which reduces the computation workload during the trajectory partitioning. At first, the original trajectory data is partitioned into multiple continuous segments. During data filtering, the average angular difference $Avg_{angle-diff}$ is used as the filtering factor, with the filter threshold $\theta_{threshold}$. For each continuous sub-trajectory segment, if its average angular difference is greater than the threshold $\theta_{threshold}$, the starting point of that sub-trajectory segment is added to the set of candidate trajectory points $TR_c$. Otherwise, the starting point of the segment is considered an outlier and is not processed in the trajectory partition phase. Data filtering yields the set of candidate trajectory points $TR_c = \{Pc_1, Pc_2, \dots, Pc_n\}$. A GeoLife data set is used as an example to evaluate the data compression performance. The experimental results show that AD-MDL achieves a 39% compression rate when the threshold value is $\theta_{threshold} = \pi/64$. Thus, it can greatly reduce the computation overhead in the trajectory partition phase.
In the trajectory partition phase, the MDL method is still adopted to partition the compressed trajectories into discrete representative trajectories. In data compression, the overhead of MDL usually includes two parts: $L(H)$ and $L(D|H)$, where $H$ is the hypothesis and $D$ is the described data. $L(H)$ is the overhead of describing the hypothesis, and $L(D|H)$ is the overhead of describing $D$ under the hypothesis $H$. MDL aims to find the optimal $H$ to describe $D$, that is, the one that minimizes the sum of $L(H)$ and $L(D|H)$. For the trajectory partition, $H$ is the set of discrete representative trajectory segments, and $D$ is the original trajectory data. $L(H)$ represents the total length of all the discrete representative segments, and $L(D|H)$ represents the difference between the discrete representative segments and the original trajectory. Clearly, the more candidate points are selected, the more accurate the partition is: $L(H)$ becomes greater and $L(D|H)$ becomes smaller, which yields high accuracy at a high computation cost. Conversely, selecting fewer candidate points yields low overhead but poor accuracy. When the sum of $L(H)$ and $L(D|H)$ is minimal, the trajectory partition reaches the tradeoff between accuracy and computation cost. $L(H)$ and $L(D|H)$ are computed as in Equation (8), where $TS_{c_i-c_{i+1}}$ represents the discrete trajectory segment from the candidate point $Pc_i$ to $Pc_{i+1}$, $P_jP_{j+1}$ is an original trajectory segment within $TS_{c_i-c_{i+1}}$, and $len(TS_{c_i-c_{i+1}})$ is the length of the discrete trajectory segment from the point $Pc_i$ to $Pc_{i+1}$.
Obtaining the optimal trajectory partition requires computing the globally optimal solution that minimizes the sum of $L(H)$ and $L(D|H)$, which incurs a high computation overhead. To reduce the computation cost, we adopt a greedy solution that finds locally optimal results in place of the globally optimal ones.
According to the above discussion, the average Angular Difference-based MDL (AD-MDL) algorithm can be used to compress the trajectory data and create the discrete representative segments. The pseudo-code of the AD-MDL algorithm (Algorithm 1) is given below. The AD-MDL trajectory partition algorithm contains two phases: the first is data filtering and the second creates the discrete representative trajectory segments. In the data filtering phase, part of the trajectory points are selected as candidate points for the trajectory partition phase, based on the average angular difference. This reduces the number of trajectory points used to create the discrete representative segments and therefore the computation time in the second phase.
Algorithm 1. AD-MDL
Input: trajectory sequence TR = {P_1, P_2, ..., P_n}, and the threshold of the average angular difference θ_threshold
Output: the set of discrete representative trajectory segments D_TS
// data filter phase
1: index = 1; p_start = p_1; p_start is added into the set of candidate trajectory points TR_c
2: for j = 2 to n in the TR
3:     if Avg_angle-diff(index, j) > θ_threshold then
4:         p_j is added into the set TR_c
5:         index = j; j = j + 1;
6: p_end = p_n;
7: p_end is added into the set TR_c
// trajectory partition phase
8: index = 1;
9: for j = 2 to m in the TR_c
10:     if MDL(c_index, c_j) > L_D(c_index, c_j)
11:         TS_{c_index - c_j} is a discrete representative trajectory segment, and added into the set D_TS
12:         index = j; j = j + 1;
13: end for
14: return the set D_TS.
As shown in lines 1 to 7 of the AD-MDL algorithm, if the average angular difference is not greater than the threshold $\theta_{threshold}$, the scan simply advances to the next trajectory point. Otherwise, the current trajectory point is a characteristic point and is added into the set of candidate points $TR_c$. In the trajectory partition phase, in order to obtain clustering accuracy as well as low complexity, the MDL-based method is adopted to create the discrete representative segments. As shown in lines 8 to 14 of the AD-MDL algorithm, if $MDL(c_{index}, c_j) \le L_D(c_{index}, c_j)$, the trajectory points between $p_{c_{index}}$ and $p_{c_j}$ are non-characteristic points, and the successive point is included. If $MDL(c_{index}, c_j) > L_D(c_{index}, c_j)$, the trajectory points between $p_{c_{index}}$ and $p_{c_j}$ are characteristic points, and the corresponding segment $TS_{c_{index}-c_j}$ is added into the set of discrete representative segments $D_{TS}$. The AD-MDL algorithm traverses the trajectory points at most twice, so the computation complexity is $O(n)$, where $n$ is the total number of trajectory points.
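The data filtering phase (lines 1 to 7 of Algorithm 1) can be sketched as follows. Because Equation (7) is not reproduced in this text, the sketch assumes that the average angular difference is the mean absolute turning angle at the points scanned since the last candidate; that definition, and the names `turning_angle` and `filter_candidates`, are hypothetical. The MDL-based partition phase is omitted because its cost functions are likewise not reproduced here.

```python
import math

def turning_angle(prev, cur, nxt):
    """Absolute change of heading at `cur`, in radians, within [0, pi]."""
    h_in = math.atan2(cur[1] - prev[1], cur[0] - prev[0])
    h_out = math.atan2(nxt[1] - cur[1], nxt[0] - cur[0])
    # Wrap the heading difference into [-pi, pi) before taking its magnitude.
    diff = (h_out - h_in + math.pi) % (2 * math.pi) - math.pi
    return abs(diff)

def filter_candidates(points, theta_threshold):
    """Keep the first and last points plus every point at which the running
    average turning angle (assumed definition) exceeds the threshold."""
    if len(points) < 3:
        return list(points)
    candidates = [points[0]]
    total, count = 0.0, 0            # running sum since the last candidate
    for j in range(1, len(points) - 1):
        total += turning_angle(points[j - 1], points[j], points[j + 1])
        count += 1
        if total / count > theta_threshold:
            candidates.append(points[j])
            total, count = 0.0, 0    # restart the average at the new candidate
    candidates.append(points[-1])
    return candidates
```

Keeping a running sum means each point is visited once, in line with the linear complexity stated above. On a path that runs straight along the x-axis and then turns upward by a right angle, only the two endpoints and the corner survive the filter.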

Grid Partition
The AD-MDL algorithm yields the discrete representative trajectory segments. After the trajectory partitioning, the partitioned segments are mapped into the appropriate cells for the clustering method based on the grid and density, which is the task of the grid partition phase. Density-based trajectory clustering should follow the principle of growing the cluster size from small to large. Suppose that the average number of trajectory segments in each cell is represented as Num_avg. The value of Num_avg should be as small as possible, which means that each cell should contain as few trajectory segments as possible on average. However, to conduct the density-based trajectory clustering, the distances among the different trajectory segments must be computed for each cell, which results in a heavy computation overhead. Moreover, the experiments in Section 5.3 show that the minimum value of Num_avg does not yield the optimal clustering; across a large number of experiments, Num_avg = 2 gives the best clustering quality. During the grid partition and the mapping of the trajectory segments into the corresponding cells, every trajectory segment is traversed to identify all of its belonging cells and adjacent cells, as well as the density of every cell. These results are the inputs for the trajectory clustering.
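The mapping step can be sketched in Python as follows. This is only an illustration under several assumptions that the text does not specify: the area is taken to be the bounding box of all segments, it is divided into a square grid whose cell count is roughly n/Num_avg, and the cells crossed by a segment are approximated by sampling points along it. The function name `map_segments_to_grid` and the parameter `samples` are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Set, Tuple

Point = Tuple[float, float]
Segment = Tuple[Point, Point]
Cell = Tuple[int, int]


def map_segments_to_grid(
    segments: Sequence[Segment], num_avg: float = 2.0, samples: int = 32
) -> Tuple[int, List[Set[Cell]], Dict[Cell, Set[int]]]:
    """Return (cells per side, belonging cells of each segment,
    segment ids per cell). Assumes at least one segment."""
    n = len(segments)
    # Assumption: a square grid with about n / num_avg cells in total.
    side = max(1, math.ceil(math.sqrt(n / num_avg)))
    xs = [p[0] for seg in segments for p in seg]
    ys = [p[1] for seg in segments for p in seg]
    min_x, min_y = min(xs), min(ys)
    width = (max(xs) - min_x) / side or 1.0    # guard against zero extent
    height = (max(ys) - min_y) / side or 1.0

    def cell_of(x: float, y: float) -> Cell:
        # Points on the upper boundary are clamped into the last cell.
        return (min(side - 1, int((x - min_x) / width)),
                min(side - 1, int((y - min_y) / height)))

    belong: List[Set[Cell]] = []
    cell_segs: Dict[Cell, Set[int]] = defaultdict(set)
    for i, ((x0, y0), (x1, y1)) in enumerate(segments):
        # Approximation: sample the segment instead of exact traversal.
        cells = {
            cell_of(x0 + (x1 - x0) * k / samples, y0 + (y1 - y0) * k / samples)
            for k in range(samples + 1)
        }
        belong.append(cells)
        for c in cells:
            cell_segs[c].add(i)
    return side, belong, dict(cell_segs)
```

Here `len(cell_segs[c])` plays the role of the cell density cell_i.seg, and `belong[i]` plays the role of the belonging cells of segment i. Sampling can miss a cell that a segment only clips at a corner; an exact grid traversal would avoid that at the cost of more code.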

Trajectory Clustering Algorithm
The DBSCAN-based clustering approaches require two parameters, eps and minPts, to be calibrated. eps and minPts denote the radius of the neighbor cells and the density threshold of the trajectory segments, respectively. From Section 4.1, we obtain the average distance among the different segments in each cell and the average number of trajectory segments in each cell. Using the DBSCAN-based clustering approach, the ATCGD trajectory clustering approach adaptively calibrates eps and minPts from these data to realize effective and accurate trajectory clustering.

Definition 7 (neighborhood of trajectory segment):
Suppose there are two trajectory segments TS_x and TS_y in D_TS, that is TS_x ∈ D_TS and TS_y ∈ D_TS, where D_TS is the set of the discrete partitioned trajectory segments. The neighborhood of trajectory segment TS_x with eps, denoted as N_eps(TS_x), is defined as N_eps(TS_x) = {TS_y ∈ D_TS : dist(TS_x, TS_y) ≤ eps}, where eps is the radius of the neighbor cells. From Definition 7, all of the trajectory segments in D_TS whose distance from TS_x is not greater than eps form the neighborhood of TS_x with eps. The radius eps therefore determines the size of N_eps(TS_x) for the trajectory segment TS_x. Next, we will discuss the procedure of adaptive parameter calibration for eps.
It selects the cells with density greater than 1, that is cell_i.seg > 1, where i = 1, ..., n, and n is the number of cells. Suppose the number of the selected cells with cell_i.seg > 1 is M, cell_i.seg is the cell density of cell_i, and cell_i.TS_x denotes a trajectory segment TS_x passing through cell_i. The radius of the neighbor cells eps is computed from two quantities: EXP_eps(i), the expected value of eps for cell_i, and EXP_avg, the average expected value of eps for all of the cells. From the discussion in Section 4.1, we set Num_avg = 2 to obtain good clustering quality. Because the value of Num_avg is sufficiently small, the distances among the different trajectory segments in each cell are very short. The maximum distance among the trajectory segments in cell_i is selected as the expected value of eps of cell_i. The radius of the neighbor cells is the sum of the average expected value EXP_avg and the standard deviation of all cells' expected values. For any cell cell_i, its cell density cell_i.seg is constant. The computation complexity of eps is O(log n), where n is the number of the cells.
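Read literally, the calibration of eps described above can be sketched as follows. This is a hedged illustration, not the paper's reference implementation: the segment distance of Equation (3) is not reproduced in this section, so it is passed in as a callable `dist`; the average and the standard deviation are taken over the M selected cells, and the population standard deviation is assumed. The function name `calibrate_eps` is hypothetical.

```python
from itertools import combinations
from statistics import mean, pstdev
from typing import Callable, Dict, Hashable, Iterable


def calibrate_eps(
    cell_segments: Dict[Hashable, Iterable[int]],
    dist: Callable[[int, int], float],
) -> float:
    """eps = EXP_avg + standard deviation of EXP_eps(i) over the cells
    whose density is greater than 1."""
    expected = []
    for seg_ids in cell_segments.values():
        ids = sorted(seg_ids)
        if len(ids) > 1:  # only cells with cell_i.seg > 1 are selected
            # EXP_eps(i): maximum pairwise distance inside the cell
            expected.append(max(dist(a, b) for a, b in combinations(ids, 2)))
    if not expected:
        raise ValueError("no cell contains more than one segment")
    # Assumption: population standard deviation (pstdev), not sample.
    return mean(expected) + pstdev(expected)
```

For example, if two cells have maximum internal distances 1 and 5, then EXP_avg = 3, the standard deviation is 2, and the calibrated radius is eps = 5.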

Definition 8 (segment density).
Suppose there is one trajectory segment TS_x in D_TS. The density of TS_x is defined as the number of trajectory segments in its neighborhood, denoted as ρ(TS_x). That is, ρ(TS_x) = |N_eps(TS_x)|.

Definition 9 (core segment).
Suppose there is one trajectory segment TS_x in D_TS, and minPts is the density threshold of the trajectory segments. If ρ(TS_x) ≥ minPts, the trajectory segment TS_x is defined as a core segment of D_TS. Otherwise, TS_x is a non-core segment of D_TS. The set of core segments is denoted as D_core and the set of non-core segments is denoted as D_non−core.
In the ATCGD trajectory clustering approach, the threshold minPts is not fixed; it varies with the number of cells to which a trajectory segment belongs. In applications, if the density of the trajectory segment TS_x is not less than the statistical mean value, its density ρ(TS_x) is considered to meet the requirements of trajectory clustering. For the trajectory segment TS_x, the corresponding threshold is set to minPts = Num_avg × |Belong_Cell.TS_x|, where |Belong_Cell.TS_x| is the number of cells to which TS_x belongs and Num_avg is the average number of trajectory segments in each cell. One trajectory segment may pass through one or more cells, and one cell can be covered by one or more trajectory segments. Num_avg can therefore be further refined by considering the many-to-many relationship between |Belong_Cell.TS_i| and cell_i.seg for each segment and grid cell. The modified Num_avg is denoted as N_avg and is computed by Equation (10), where C_num is the number of the cells, and n is the total number of the trajectory segments.
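The per-segment threshold and the core-segment test of Definition 9 can be written down in a few lines of Python. This is a minimal sketch: the averaging factor is passed in as `n_avg` (Num_avg = 2 by default here, or the refined N_avg when it is available), and the function names `min_pts` and `is_core` are hypothetical.

```python
from typing import Collection, Hashable


def min_pts(belong_cells: Collection[Hashable], n_avg: float = 2.0) -> float:
    """minPts for one segment: n_avg times the number of its belonging cells."""
    return n_avg * len(belong_cells)


def is_core(density: int, belong_cells: Collection[Hashable],
            n_avg: float = 2.0) -> bool:
    """Definition 9: a segment is a core segment when rho(TS_x) >= minPts."""
    return density >= min_pts(belong_cells, n_avg)
```

A segment that crosses two cells therefore needs at least four segments in its eps-neighborhood to count as a core segment when the factor is 2, whereas a segment confined to one cell needs only two.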
Definition 10 (directly density-reachable). Suppose there are two trajectory segments TS_x and TS_y in D_TS, that is TS_x ∈ D_TS and TS_y ∈ D_TS. If TS_x ∈ D_core and TS_y ∈ N_eps(TS_x), TS_y is said to be directly density-reachable from TS_x. By Definition 10, no trajectory segment is directly density-reachable from a non-core segment.

Definition 11 (density-reachable).
Suppose there are m trajectory segments in D_TS, that is TS_1, TS_2, ..., TS_m ∈ D_TS, where m ≥ 2 and TS_1, TS_2, ..., TS_{m−1} ∈ D_core. If TS_i is directly density-reachable from TS_{i−1} for every i = 2, ..., m, then TS_m is density-reachable from TS_1.
The density-based trajectory clustering procedure includes three phases. The first phase maps the discrete trajectory segments into the cells. Suppose there are n discrete representative trajectory segments obtained with the discrete trajectory partition algorithm (AD-MDL). Num_avg is set to 2, so the area is divided into n/Num_avg cells. Then, the two parameters eps and minPts, which set the scanning radius of the cells and the density threshold for forming a cluster, respectively, are calibrated based on Equations (9) and (10).
The second phase executes the grid and density-based clustering with the DBSCAN-based method. It starts with an arbitrary trajectory segment TS_i that has not been visited. The neighborhood of TS_i is retrieved, and if its density ρ(TS_i) is not less than minPts, a cluster is started. Otherwise, the trajectory segment is labeled as noise. If the trajectory segment TS_i is found to be a dense part of a cluster, its neighborhood N_eps(TS_i) is also part of that cluster. All of the trajectory segments found within N_eps(TS_i) are added, as are their own neighborhoods when they are also dense. This process continues until the density-reachable cluster is completely found. Then, a new unvisited trajectory segment TS_j is retrieved and processed, leading to the discovery of a further cluster or noise. After the trajectory clustering, the set of candidate clusters, S_cluster, is created.
However, a candidate cluster C_i that is not dense cannot meet the application's requirement for clustering quality. The last phase therefore checks the cardinality of each cluster. For a candidate cluster C_i, if the number of trajectory segments in C_i is not greater than (Σ_{j=1}^{C_num} cell_j.seg) / C_num, where C_num is the number of the cells, C_i is not kept as a final cluster and is removed from the set of candidate clusters.
Based on the procedure of the density-based trajectory clustering, a trajectory segment that is neither a core segment nor directly density-reachable from a core segment is called a noise segment. A cluster should satisfy two properties: all trajectory segments within the cluster are mutually density-reachable; and if a trajectory segment is density-reachable from any segment of the cluster, it is part of the cluster as well.
The density-based trajectory clustering algorithm is expressed in pseudo-code as Algorithm 2, which includes three phases. From line 1 to line 2, the area is divided into the appropriate number of cells and the segments are mapped into the corresponding cells. Meanwhile, the adaptive parameter calibration for eps and minPts is executed. The complexity of the first phase is O(n). The clustering phase runs from line 3 to line 18; it adopts the DBSCAN-based method to cluster the discrete segments with the adaptively calibrated parameter values and produces the candidate clusters. The complexity of the clustering procedure is O(n log n). To further check the clustering results, the density of each cluster is examined. If the density of a cluster is not greater than the average density, the candidate cluster is removed, as shown from line 19 to line 22. As a whole, the complexity of the density-based trajectory clustering is O(n log n).
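The second and third phases can be sketched in Python as a DBSCAN-style expansion followed by the cardinality check. This is an illustrative sketch rather than a transcription of Algorithm 2: the neighborhood query and the per-segment threshold are passed in as callables, segments of a removed candidate cluster are relabeled as noise (an assumption, since the text only says the cluster is removed), and all names are hypothetical.

```python
from collections import Counter, deque
from typing import Callable, List, Set

NOISE = -1
UNVISITED = -2


def cluster_segments(
    n: int,
    neighbors: Callable[[int], Set[int]],   # N_eps(TS_i), includes i itself
    min_pts: Callable[[int], float],        # per-segment density threshold
    min_size: float,                        # average cell density
) -> List[int]:
    """Return one label per segment: a cluster id >= 0, or NOISE."""
    labels = [UNVISITED] * n
    cluster_id = 0
    for i in range(n):
        if labels[i] != UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts(i):         # not a core segment
            labels[i] = NOISE
            continue
        labels[i] = cluster_id              # start a new candidate cluster
        queue = deque(seeds)
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:          # border segment joins the cluster
                labels[j] = cluster_id
            if labels[j] != UNVISITED:
                continue
            labels[j] = cluster_id
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts(j):     # expand only through core segments
                queue.extend(nbrs)
        cluster_id += 1
    # Cardinality check: drop candidate clusters that are not dense enough.
    sizes = Counter(label for label in labels if label >= 0)
    return [label if label >= 0 and sizes[label] > min_size else NOISE
            for label in labels]
```

Passing `neighbors` as a callable keeps the sketch independent of how the eps-neighborhood is found, so a grid-restricted query over belonging and adjacent cells can be plugged in without changing the expansion loop.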

Experimental Setup
To evaluate the clustering performance of the proposed trajectory clustering approach, ATCGD, two data sets are introduced. One is a series of randomly generated trajectories (hereafter referred to as Random Trajectory, RT), as shown in Figure 6a,b. The other is hurricane trajectory data in the Atlantic Ocean provided by American Weather Information System Company, referred to as Hurricane Track (HT), as shown in Figure 6c. The RT data includes two patterns: RT1 and RT2. RT1 has about 100 trajectories and 2000 trajectory segments. Those trajectories can be clearly divided into four groups from top to bottom. RT2 has about 100 trajectories and 7000 trajectory segments, and is more complicated than RT1. The trajectories in RT2 are also divided into four groups. The trajectories in the RT1 and RT2 sets are similar to the trajectory data from vehicle movement; thus, RT1 and RT2 can represent a data set from a real application. The HT data set includes the hurricane track information about latitude, longitude, and the highest wind speed since 1851. The sampling frequency is once every 6 h. The experiments extract 100 hurricane trajectories with 2465 trajectory segments from 1940 onward, which include the latitude and longitude of the hurricane track.

Clustering Performance
To further evaluate the clustering quality of the proposed ATCGD approach, the metric QMeasure is introduced as the standard to evaluate the clustering effect [10]. QMeasure includes two parts: one is the sum of squared error (SSE) and the other is the penalty value of noise. The QMeasure can be calculated as follows:
$$QMeasure = \sum_{i=1}^{N_{cluster}} \left( \frac{1}{2|C_i|} \sum_{x \in C_i} \sum_{y \in C_i} dist(x, y)^2 \right) + \frac{1}{2|D_n|} \sum_{p \in D_n} \sum_{q \in D_n} dist(p, q)^2$$
where D_n is the noise set, N_cluster is the number of clusters of trajectory segments, C_i represents the i-th cluster of trajectory segments, |C_i| is the number of trajectory segments in the i-th cluster, and |D_n| is the number of noise trajectory segments. The first term is the sum of squared error (SSE), which reflects the distances between the different trajectory segments in each cluster.
The smaller the value of eps and the greater the value of minPts, the smaller the SSE. In applications, if appropriate values of the two parameters eps and minPts are calibrated, the clustering exhibits good quality. At the same time, the noise trajectory data are considered when calculating the value of QMeasure. The second term, \frac{1}{2|D_n|} \sum_{p \in D_n} \sum_{q \in D_n} dist(p, q)^2, is the sum of squared distances between the noise trajectory segments and serves as the penalty. Therefore, the value of QMeasure and the quality of the clustering are negatively correlated: the smaller the value of QMeasure, the higher the quality of the clustering.
Figure 7 shows the clustering results with the RT1, RT2 and HT data sets, respectively. As shown in Figure 7, the different clusters are represented with different colors. From Figure 7a, the proposed ATCGD approach clusters the trajectory data into four groups with high accuracy, which is in accordance with the expectation. Compared to the original trajectory data, some trajectory segments are recognized as noise. Figure 7b illustrates the clustering results with RT2. In contrast with RT1, the trajectories of RT2 exhibit apparent non-smoothness. This reveals that RT2 is more difficult to cluster than RT1, but the ATCGD approach can still cluster those trajectories into four different groups. Therefore, ATCGD can effectively be applied to vehicle trajectory data, which has high similarity to the RT data set. Figure 7c shows the HT clustering results. From Figure 6c, the trajectories in the HT data set are much more complicated than those in the RT data set. The ATCGD approach classifies the hurricane data into two clusters, which conforms to the expectation. It implies that the ATCGD approach can also provide effective clustering for complex trajectory data.
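The QMeasure metric described in this section, a per-cluster SSE term plus a noise penalty of the same form, can be sketched in Python as follows. The distance between trajectory segments is passed in as a callable because its definition (Equation (3)) lies outside this section; the helper name `half_mean_squared` and the function name `q_measure` are hypothetical.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def half_mean_squared(items: Sequence[T], dist: Callable[[T, T], float]) -> float:
    """1 / (2|S|) times the sum of squared pairwise distances over S x S."""
    if not items:
        return 0.0
    total = sum(dist(p, q) ** 2 for p in items for q in items)
    return total / (2 * len(items))


def q_measure(
    clusters: Sequence[Sequence[T]],
    noise: Sequence[T],
    dist: Callable[[T, T], float],
) -> float:
    """QMeasure = SSE summed over all clusters + penalty over the noise set."""
    sse = sum(half_mean_squared(c, dist) for c in clusters)
    return sse + half_mean_squared(noise, dist)
```

As a small check with one-dimensional stand-ins and absolute difference as the distance, the clusters [0, 2] and [10, 10] contribute an SSE of 2 and 0, the noise set [0, 4] contributes a penalty of 8, and the resulting QMeasure is 10.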

Comparison Analysis
To further quantify the accuracy of the ATCGD clustering approach, we compare the ATCGD approach with TRACLUS in terms of QMeasure. Because ATCGD and TRACLUS calculate the distance between trajectory segments slightly differently, the distance computation proposed in this paper, shown in Equation (3), is adopted to calculate QMeasure.
In the experiments, different numbers of trajectories from the RT and HT data sets are selected to evaluate the clustering quality. Specifically, 100, 200, 300, and 400 hurricane tracks since 1940 are randomly selected from the HT data set and denoted as HT-100, HT-200, HT-300, and HT-400, respectively. Meanwhile, we apply the parameter calibration method proposed in the TRACLUS algorithm to conduct the experiments twenty times and get 20 different combinations of the two parameters eps and minPts. The combination with the minimum QMeasure is taken as the result of the TRACLUS algorithm. The experimental results of the ATCGD and TRACLUS algorithms are listed in Table 1. From Table 1, it can be seen that the run time of the TRACLUS algorithm is much higher than that of the ATCGD method, and the difference in run times between the two algorithms grows as the data size increases. The reason is that the ATCGD approach uses the belonging cells and adjacent cells of a segment to determine the candidate set from which its eps-neighborhood is computed. This greatly improves the efficiency and reduces the execution time of the trajectory clustering. The computation complexity of the ATCGD approach is O(n log n), based on the analysis in Section 4.2. In contrast, without an index scheme, the computation complexity of the TRACLUS algorithm is up to O(n^2), where n is the number of trajectory points. On the other hand, in terms of the clustering quality metric QMeasure, the ATCGD approach does not differ much from the TRACLUS algorithm. In most cases, ATCGD obtains a slightly smaller QMeasure than TRACLUS; the exceptions are the HT-400 and RT2 data sets.
The reason is that the ATCGD approach adopts the adaptive parameter calibration method to obtain values close to the optimum, so it exhibits good clustering quality at a lower computation cost. The TRACLUS algorithm, in contrast, obtains near-optimal combinations of the two parameters eps and minPts only through a large number of parameter calibrations, which yields high accuracy at the cost of high computation complexity. If the combination of the two parameters is inappropriate, the TRACLUS algorithm obtains poor trajectory clustering quality.
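The run-time advantage attributed to the belonging and adjacent cells can be made concrete with a small Python sketch of a grid-restricted neighborhood query. It assumes that cells are indexed by integer grid coordinates, that "adjacent" means the eight surrounding cells, and that eps does not exceed the cell size, so that one ring of adjacent cells is enough; none of these details are fixed by the text. The names `adjacent_cells` and `neighborhood` are hypothetical.

```python
from typing import Callable, Dict, List, Set, Tuple

Cell = Tuple[int, int]


def adjacent_cells(cell: Cell) -> Set[Cell]:
    """The cell itself plus its eight surrounding cells (assumed adjacency)."""
    cx, cy = cell
    return {(cx + dx, cy + dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)}


def neighborhood(
    x: int,
    belong: List[Set[Cell]],            # belonging cells of each segment
    cell_segs: Dict[Cell, Set[int]],    # segment ids stored per cell
    dist: Callable[[int, int], float],
    eps: float,
) -> Set[int]:
    """N_eps(TS_x), computed only over segments in nearby cells."""
    candidates: Set[int] = set()
    for cell in belong[x]:
        for c in adjacent_cells(cell):
            candidates.update(cell_segs.get(c, ()))
    # Only the candidates are compared, not all n segments.
    return {y for y in candidates if dist(x, y) <= eps}
```

Because each query inspects only the segments stored in a handful of cells, the number of distance computations per segment stays roughly proportional to the local density rather than to the total number of segments.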

Parameter Sensitivity Analysis
To provide a quantitative analysis of the parameter Num_avg, the HT-100, HT-200, HT-300, and HT-400 data sets are used to compute the clustering quality metric QMeasure with different values of Num_avg. The experimental results are shown in Figure 8. When Num_avg = 2, the value of QMeasure is minimum for all of the data sets. When Num_avg < 2, the value of QMeasure decreases with the increase of Num_avg. On the contrary, when Num_avg > 2, the value of QMeasure increases with the increase of Num_avg. Based on the experimental results, the ATCGD approach obtains better trajectory clustering quality when it sets Num_avg to 2.
To verify the correctness of the parameter calibration, the two parameters eps and N_avg (minPts can be computed based on N_avg) are selected for the sensitivity analysis. The data sets are still HT-100, HT-200, HT-300, and HT-400. We compare the values of QMeasure obtained with different combinations of eps and N_avg against those obtained with the adaptively calibrated values of the two parameters, eps_a and N_avg_a. The value range of eps is [eps_a − 3, eps_a + 3] and the step is 1. The value range of N_avg is [N_avg_a − 0.6, N_avg_a + 0.6] and the step is 0.2. Figure 9 illustrates the distributions of QMeasure with different values of eps and N_avg in the data sets HT-100, HT-200, HT-300, and HT-400. As shown in Figure 9, the red points are the results of the adaptive parameter calibration for eps_a and N_avg_a; the green points are the results of the combinations with different values of eps and N_avg.
From Figure 9, it can be found that QMeasure varies over a large range across the different combinations of the two parameters' values, as when adopting the TRACLUS algorithm. In contrast, the ATCGD approach obtains a small value of QMeasure, because its parameter values come from the adaptive parameter calibration method.
On the other hand, the smaller the difference between the QMeasure obtained with the adaptive parameter calibration and that of the optimal combination, the higher the quality of trajectory clustering obtained by the ATCGD approach. Although the results of the adaptive parameter calibration are not optimal, in most cases the difference between the values of QMeasure with the adaptive calibration and with the optimal combination is less than 5%. This indicates that the adaptively calibrated parameters eps and N_avg achieve good clustering effects.
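The sweep used in this sensitivity analysis, and the relative gap to the best combination, can be sketched in Python as follows. The ranges and steps are those stated above; how QMeasure is evaluated for each combination is outside the sketch, and the function names `sweep_grid` and `relative_gap` are hypothetical.

```python
from typing import List, Sequence, Tuple


def sweep_grid(eps_a: float, n_avg_a: float) -> List[Tuple[float, float]]:
    """All (eps, N_avg) pairs with eps in [eps_a - 3, eps_a + 3] (step 1)
    and N_avg in [n_avg_a - 0.6, n_avg_a + 0.6] (step 0.2)."""
    eps_values = [eps_a + k for k in range(-3, 4)]
    # Integer multiples of the step avoid accumulating float error.
    n_avg_values = [round(n_avg_a + 0.2 * k, 10) for k in range(-3, 4)]
    return [(e, n) for e in eps_values for n in n_avg_values]


def relative_gap(q_adaptive: float, q_values: Sequence[float]) -> float:
    """(Q_adaptive - Q_best) / Q_best, where Q_best is the smallest QMeasure
    over the sweep. Assumes Q_best > 0."""
    best = min(q_values)
    return (q_adaptive - best) / best
```

The grid contains 7 × 7 = 49 combinations, including the adaptively calibrated pair itself. A relative gap below 0.05 corresponds to the "less than 5%" difference reported above.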

Conclusions
Clustering analysis is one of the most important issues in trajectory data mining. Trajectory clustering can be widely applied in hotspot detection, mobile pattern analysis, urban transportation control, hurricane prediction, etc. Many trajectory clustering algorithms have been proposed to obtain good clustering performance. Nonetheless, most available trajectory clustering algorithms depend on the calibration of one or more parameters, and the values of these parameters have a great influence on the clustering effect. To reduce the complexity and overhead of parameter calibration in trajectory clustering, an Adaptive Trajectory Clustering approach based on Grid and Density, ATCGD, was proposed in this paper. ATCGD first divides the trajectory data into multiple discrete segments through the proposed average angular difference-based MDL (AD-MDL) algorithm. All of the discrete segments are mapped into the corresponding cells. Then, it calculates the average distance among the different segments in each cell, and the average number of trajectory segments in each cell. Finally, adopting a DBSCAN-based approach, ATCGD carries out an adaptive parameter calibration based on these data to realize effective and accurate trajectory clustering. With two data sets, from random trajectories and from hurricane trajectories over the Atlantic Ocean, we evaluate the performance of the ATCGD approach in terms of clustering quality and cost. The experimental results indicate that although the results of the adaptive parameter calibration are not optimal, in most cases the difference between the adaptive calibration and the optimal is less than 5%, while the run time of clustering can be reduced by about 95%.