Semantic-Geographic Trajectory Pattern Mining Based on a New Similarity Measurement

Trajectory pattern mining is becoming increasingly popular because of the development of ubiquitous computing technology. Trajectory data contain abundant semantic and geographic information that reflects people’s movement patterns, i.e., who is performing a certain type of activity when and where. However, the variety and complexity of people’s movement activity and the large size of trajectory datasets make it difficult to mine valuable trajectory patterns. Moreover, most existing trajectory similarity measurements only consider a portion of the information contained in trajectory data. The patterns obtained cannot be interpreted well in terms of both semantic meaning and geographic distributions. As a result, these patterns cannot be used accurately for recommendation systems or other applications. This paper introduces a novel concept of the semantic-geographic pattern that considers both semantic and geographic meaning simultaneously. A flexible density-based clustering algorithm with a new trajectory similarity measurement called semantic intensity is used to mine these semantic-geographic patterns. Comparative experiments on check-in data from the Sina Weibo service demonstrate that semantic intensity can effectively measure both semantic and geographic similarities among trajectories. The resulting patterns are more accurate and easy to interpret.


Introduction and Motivations
Owing to advanced positioning technologies, a large amount of movement data can now be conveniently collected from daily activities. Some activities share the same movement pattern, which reflects similar lifestyles, habits, or behaviors. The study of movement data can reveal individual movement patterns, facilitate the understanding of the characteristics of human dynamics, and thus support recommendation, activity prediction, urban planning, and traffic monitoring alike. Therefore, movement pattern mining has become a hot topic when considering human movement activities [1,2].
Trajectories can be derived from movement data sampled from daily activities. They typically consist of a sequence of spatiotemporal points represented as (latitude, longitude) tagged with timestamps. Trajectory data can be divided into three main types based on sampling mode: time-frequency sampling data (e.g., animal migration, hurricane data), location-based sampling data (e.g., population migration data) and event-triggered sampling data (e.g., mobile phone call, check-in data). Whereas previous studies have mainly focused on processing the geometric and/or temporal properties of trajectory data, recent studies have gone one step further. They enrich a movement track with more semantic, application-oriented information by adding geographical context. Figure 1 shows an example in which trajectories 1 and 2 are geometrically similar according to their shapes and distance. In addition, trajectories 1 and 3 have the same sequence of actions and share the same semantic pattern (i.e., school, park, and restaurant). In fact, these are two different trajectory similarity measurements that can generate two types of movement patterns: a geographic pattern and a semantic pattern. However, these two types of patterns are not sufficiently accurate to incorporate into recommendation systems or other applications.
In Figure 1, two trajectories in the geographic pattern do not share similar semantic meanings because they pertain to different types of activities in their second and third steps. On the other hand, two trajectories in the semantic pattern are not geometrically similar; they pertain to the same type of activity in different places. Therefore, users who define these patterns are not actually similar. More accurate and valuable trajectory patterns that contain both semantic and geographic information are required. People who perform similar activities in nearby places share the same movement activity pattern, and they may have more common interests. Here, "similar activities" are determined by semantic similarity, and "nearby places" are determined by geographic similarity. It is thus necessary to combine these two similarities. However, it is not a trivial task to calculate geographic and semantic similarity simultaneously. Challenges are posed by the complexity of people's movement activity, the large size of trajectory datasets, different sampling strategies, and innate differences between these two types of dimensions. To the best of our knowledge, an appropriate method is still lacking.
To obtain more valuable trajectory patterns that share the same movement activities according to both semantic meaning and geographic distribution, we first define a novel trajectory pattern called a semantic-geographic pattern. Then, a flexible density-based clustering algorithm is used to mine this new type of pattern. The clustering algorithm involves a new similarity measurement called semantic intensity, which calculates similarities between trajectories by combining their geographic and semantic features. We demonstrate the effectiveness of our method through comparative trajectory pattern mining experiments on 4340 users over 14,729 check-in traces using four different similarity measurements.
The remainder of this paper is organized as follows. Section 2 provides a review of existing trajectory similarity measurements. Section 3 outlines the definitions of the proposed semantic-geographic pattern. Section 4 describes the framework and major steps of the new pattern mining method. We present the experimental results obtained for a real social media dataset in Section 5 and discuss our conclusions and future work in Section 6.

Related Work
There are three major steps to convert raw trajectory data into interesting movement patterns: trajectory representation, similarity calculation, and pattern mining [3,4]. Among them, the similarity calculation can be the most important step [5], because different similarity measurements generate quite different trajectory patterns. Early research on similarity analysis was based on spatial and/or temporal information in trajectories. For the spatial dimension, classical methods like average distance [6], Hausdorff distance [7], Fréchet distance [8], and Minkowski distance can be used for the similarity calculation. For the temporal dimension, several sequence matching approaches can be directly applied to trajectory similarity analysis, such as Edit Distance [9], Longest Common Subsequence [10], and Dynamic Time Warping [11]. More recently, a number of extensions were developed to deal with trajectory data more effectively. These extension methods can be divided into two categories: spatiotemporal-based and semantic-based.
(1) Spatiotemporal-based methods focus on both spatial and temporal features of trajectories. It is not easy to measure the spatiotemporal similarity of two moving objects, because their activities are updated asynchronously in both spatial and temporal dimensions. Giannotti et al. [12] defined the T-pattern in a collection of GPS trajectories. A T-pattern is a region-of-interest (ROI) sequence with temporal annotations, where each ROI is a rectangle corresponding to a trajectory density greater than a threshold. Lu et al. [13] proposed a transaction similarity measurement named LBS-Alignment to calculate the similarity of mobile users. By using the longest common sequence, the ratio of the common parts of the sequential motion patterns was taken as the similarity. Etienne et al. [14] defined an ordered sequence of spatial zones (called a "zone graph") to extract and filter trajectories following a similar itinerary, and then qualified spatiotemporal patterns such as the main routes and spatiotemporal channels through statistical computations. Buchin and Purves [8] used space-time prisms to model trajectories, and then computed the similarity of these prisms based on equal time and Fréchet distance. Lv et al. [15] addressed the problem of mining mobile users' long-term activity similarity. After transforming users' GPS trajectories into reference places using a three-layered hierarchical clustering algorithm, a bottom-up agglomerative clustering algorithm based on cosine coefficient similarity was used to group users' one-day activities. Each cluster contained a set of users' one-day activities, which represented a routine activity. Then, the similarity of users' routine activities was calculated based on the optimal matching sequence similarity of their reference place sets. Finally, users' similarities were calculated based on all of their routine activities, weighted by the number of times the user follows each routine activity. Dodge et al. [4] introduced a novel technique for spatiotemporal trajectory similarity that relied on trajectory segmentation based on the movement parameters (e.g., speed, acceleration, or direction). Each segment was assigned to a movement parameter class, which transforms a trajectory into a sequence of class labels. Then, a modified version of edit distance called normalized weighted edit distance was developed to measure the similarity between different sequences. Yuan and Raubal [16] also extended the traditional edit distance algorithm by incorporating the spatial distribution of cell towers, and then applied the newly developed spatiotemporal edit distance to compare the trajectories extracted from call detail records and conduct a hierarchical clustering analysis. Etienne et al. [17] defined a new spatiotemporal pattern called Trajectory Box Plot (TBP) to describe trajectories by a median trajectory, a 3D box and a 3D fence. The median trajectory depicts the typical movement of mobile objects. The box and the fences describe the spatial and temporal spreading around the central tendency. Etienne et al. then used visual analysis to highlight how the density of a trajectory cluster changes over time. The above methods measure the trajectories' similarity based on geometric and temporal information without considering semantic information. As a consequence, they tend to discover spatiotemporal patterns such as sequences of locations, which for some applications may not help the user extract more meaningful information. Moreover, similar spatiotemporal trajectories may not necessarily be semantically similar, because the activities implied by the nearby locations they pass through may be different.
(2) Semantic-based methods aim to obtain more meaning from trajectory information. A semantic trajectory fundamentally consists of a sequence of locations with semantic tags describing the corresponding landmarks [18]. According to the encoding and organization of semantic knowledge, there are mainly three kinds of semantic similarity approaches [19][20][21]: semantic or taxonomic relations (also called path distance measures) based, information content-based, and feature models (also called classic models) based. They all can be used for carrying the semantic meaning of trajectories beyond the low-level pure geographic positions. Alvares et al. [18] first identified the stops in the GPS trajectories of mobile users, and mapped these stops to semantic landmarks by using a background map. They then applied a sequential pattern-mining algorithm to extract the frequent place sequences (i.e., the semantic trajectory pattern) to represent the frequent semantic behaviors. Unfortunately, because of spatial discontinuities and the randomness of users' movement activity, such place-level sequential patterns can appear only when the support threshold is very low. Bogorny et al. [22] took both hierarchical geographic and semantic properties into consideration, and provided two different methods (IB-SMoT and CB-SMoT) to automatically integrate trajectory samples and geographic information in a higher abstraction level (i.e., stops and moves), where the user can define the important parts of trajectories. Ying et al. [5] proposed a novel similarity measurement called Semantic Trajectory Pattern Similarity to evaluate the similarity between two trajectories. First, a frequent sequence pattern mining algorithm was used to get users' Maximal Semantic Trajectory Patterns (MSTPs). Then, the similarity between two MSTPs was calculated by the MSTP-Similarity measurement, which was based on the longest common sequence. 
Finally, a sequential pattern in the form of a sequence of semantic labels (e.g., school to park) was obtained. Xiao et al. [23] proposed a method to estimate the similarity between user trajectories. Their method first modeled a user's GPS trajectories with a semantic location history (SLH) based on the semantic location hierarchy and users' stay points. Then, the similarities between different users' SLHs were calculated by using a maximal travel match algorithm that summarized the weighted similarity of semantic location sequences detected at each layer of the hierarchy. Although these approaches overcome the geographic constraint on user similarity measurements, the patterns obtained lack detailed geographic distribution information, and are difficult to display using geo-visualization tools. Moreover, the patterns are usually represented with a high-level abstraction of semantic landmarks (e.g., <school, park>, <school, hospital, restaurant>) [5,24], which may be difficult to interpret.
To obtain more valuable trajectory patterns that share the same movement activities according to both semantic meaning and geographic distribution, it is necessary to combine these two measurements and create a unified framework to calculate the similarity between trajectories. Recently, Ying et al. [25] provided a new definition of the geographic-temporal-semantic (GTS) pattern tree to model users' historical trajectories. They measured the GTS similarity between two users' trajectories as the sum of three weighted dimensional similarity scores in order to predict a mobile user's next location. However, the efficiency of GTS similarity in mining interesting trajectory patterns is uncertain. Buchin et al. [26] defined a context-aware similarity measure to integrate the various surrounding contexts of trajectories. The final similarity score was obtained by summing up all the weighted context distances. However, suitable prior knowledge is needed to set proper weights for each factor, and the calculation of the trajectory's context distance requires refined and classified land cover and land use data, which are difficult to obtain.

Problem Statement
This section presents and defines a new trajectory pattern called a semantic-geographic pattern to describe people's movement activity. Several related preliminaries are provided to help explain this pattern.

Semantic-Geographic Trajectory Pattern
Generally, people who conduct similar activities in nearby places may share common interests and the same movement activity pattern. This type of pattern involves both semantic and geographic similarity; therefore, we refer to it as a semantic-geographic pattern. In contrast to a purely geographic pattern, people with similar semantic-geographic patterns must conduct semantically similar activities. In contrast to a purely semantic pattern, people must be geographically close within a certain geographic scale when they are conducting these similar activities.

Preliminaries
Several strategies have been proposed to transform the original GPS trajectory into a semantic trajectory [23,25]. This paper does not focus on this preprocessing step. Instead, a special type of event-triggered trajectory called "check-in" data is used. Check-in data always contain geographic locations, and the semantic meaning of most check-in records can be easily obtained from a POI (Point Of Interest) database. We provide definitions of terms relevant to the use of this type of data. In addition, the examples below refer to user 1 and user 2 in Figure 2.

Definition 2. Geographic pattern. Users who have small distance values among their check-in traces are grouped together, and their common location pairs and check-in times are considered a geographic pattern (GPattern). Note that the check-in times for a user at a certain location are not always the same; therefore, a time duration is set for each check-in location.
Example 2. One geographic pattern between user 1 and user 2 in Figure 2 is as follows (where G_ij is the id of a grid, and POIs are considered nearby and grouped together when they are located in the same grid): GPattern(user 1, ...
Definition 5. Semantic-geographic trace. A user's semantic-geographic trace (SGTrace) consists of several sets of timestamped POI locations that are grouped by the semantic category of his/her check-in place. In the traces of Figure 2, P_ij is a series of POIs for one POI category C_i, and t_i is the time sequence in the corresponding POI series.
Definition 6. Semantic-geographic pattern. Users who have high semantic similarity in their semantic trace and small distances in their semantic-geographic trace are grouped together. Their user set and common SGTraces comprise a semantic-geographic pattern (SGPattern).
Example 6. One semantic-geographic pattern between user 1 and user 2 in Figure 2 is as follows, and its geographic distribution is the same as the geographic pattern: SGPattern(user 1, ...

Semantic-Geographic Pattern Mining Method
From the data mining perspective, clustering is one of the most powerful methods for trajectory pattern mining [27][28][29]. In this study, the semantic-geographic pattern mining process consists of two parts and four main steps, as shown in Figure 3. First, in the data preprocessing step, users' original check-in traces are transformed into semantic-geographic traces and semantic traces according to the definitions presented in Section 3. Then, in the pattern mining step, geographic similarity and semantic similarity are calculated for each type of trace. These two similarities are then combined into the semantic intensity similarity to obtain the final similarities. Finally, a density-based clustering algorithm is used to detect the combined semantic-geographic movement patterns.

Semantic-Geographic Combined Similarity Measurement
Similarity measurement is the key step of semantic-geographic pattern mining. In this context, a new concept called semantic intensity is introduced to calculate trajectory similarities based on both semantic and geographic features. Semantic intensity is a combination of semantic similarity and geographic similarity. It can be considered the average semantic similarity in a geographic unit distance. The calculation of this new semantic intensity measurement is described in the following subsections.

ISPRS Int. J. Geo-Inf. 2017, 6, 212

Semantic Similarity Calculation
Because trajectory data are transformed into semantic vectors according to Definition 3, the feature model-based method is suitable for measuring their similarity. Cosine similarity is a feature model-based method that is widely applied in the field of information retrieval [30] and text mining [31], and for this reason is chosen to calculate the semantic similarity between two users' semantic traces.
The semantic similarity of the users' semantic traces STrace 1 and STrace 2 is as follows:

$$\text{Semantic similarity}(STrace_1, STrace_2) = \frac{\sum_{i=1}^{n} W_{1i} W_{2i}}{\sqrt{\sum_{i=1}^{n} W_{1i}^{2}}\,\sqrt{\sum_{i=1}^{n} W_{2i}^{2}}} \tag{1}$$

where $n$ is the total number of POI category pairs (e.g., school → park) and $W_{ji}$ is the weight of POI category pair $i$ in trace $j$. In addition, the start time and end time in each POI category pair must overlap in order to calculate the two traces' inner product. Different weighting schemes might be used in this context; one that is effective and widely used is term frequency and inverse document frequency (TF-IDF), given here in its standard form:

$$W_{ji} = tf_{i,j} \times \log\frac{N}{df_i} \tag{2}$$

where $tf_{i,j}$ is the number of times term $i$ appears in a document $d_j$, $df_i$ is the number of documents in which term $i$ appears, and $N$ is the total number of documents. In this paper, a term corresponds to one POI category pair, and a document corresponds to a semantic trace. The TF factor indicates the importance of a POI category pair in the trace. The IDF factor is a global statistic that measures how widely a POI category pair is distributed over a collection. POI category pairs that appear often in a trace but in few other traces therefore carry significant weight. Unlike the weighting strategies of Buchin et al. [26], the TF-IDF weight is obtained directly from the original data. This weight carries more accurate semantic meaning, and does not require any prior knowledge. For detailed explanations of cosine similarity and TF-IDF weight please refer to [30][31][32].
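The weighting and similarity steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `tfidf_weights` and `cosine_similarity`, the sample traces, and the use of the natural logarithm are assumptions, and the requirement that the time spans of matching POI category pairs overlap is left out for brevity.

```python
import math
from collections import Counter

def tfidf_weights(traces):
    """Return one {pair: weight} dict per trace, with weight = tf * log(N / df)."""
    n_traces = len(traces)
    # df: number of traces in which each POI category pair appears at least once
    df = Counter(pair for trace in traces for pair in set(trace))
    weights = []
    for trace in traces:
        tf = Counter(trace)  # tf: occurrences of each pair within this trace
        weights.append({p: tf[p] * math.log(n_traces / df[p]) for p in tf})
    return weights

def cosine_similarity(w1, w2):
    """Cosine of the angle between two sparse weight vectors."""
    dot = sum(w * w2[p] for p, w in w1.items() if p in w2)
    norm1 = math.sqrt(sum(w * w for w in w1.values()))
    norm2 = math.sqrt(sum(w * w for w in w2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # a trace with only zero weights is similar to nothing
    return dot / (norm1 * norm2)

# Hypothetical semantic traces: each item is one POI category pair.
traces = [
    [("school", "park"), ("park", "restaurant"), ("school", "park")],
    [("school", "park"), ("park", "restaurant")],
    [("home", "office"), ("office", "gym")],
]
w = tfidf_weights(traces)
sim_12 = cosine_similarity(w[0], w[1])  # shared pairs, so close to 1
sim_13 = cosine_similarity(w[0], w[2])  # no shared pairs, so 0.0
```

Note that a pair occurring in every trace receives a weight of log(1) = 0 under this scheme, so it contributes nothing to the similarity.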

Geographic Similarity Calculation
Calculating the geographic similarity between two semantic-geographic traces is not a trivial task. Because trajectory data are collected over a long period, the places visited may not be the same for one person at a certain time. Thus, unlike in traditional trajectory distance measurements (such as DTW (Dynamic Time Warping), Fréchet, and edit distance), the distance between two polyline sets must be calculated at each time interval.
The Hausdorff distance is the maximum distance of a set to the nearest point in another set [33]. Because the computation of the Hausdorff distance can be extended to polygons and line segment sets, curves and surfaces, curve sets, etc., it is widely used for pattern matching and recognition [34,35]. The Hausdorff distance from point set A to point set B is formally defined as follows:

$$h(A, B) = \max_{a \in A} \min_{b \in B} \mathrm{dist}(a, b) \tag{3}$$

where $a$ and $b$ are points in sets A and B, respectively, and $\mathrm{dist}(a, b)$ is usually the Euclidean distance between these two points.
This Hausdorff distance is not symmetric, which means that most of the time $h(A, B)$ is not equal to $h(B, A)$. Therefore, the Hausdorff distance between A and B can be more generally defined as follows:

$$H(A, B) = \max\big(h(A, B),\, h(B, A)\big) \tag{4}$$

Two users' geographic similarity is calculated by the root mean square value of the Hausdorff distance at each POI category pair that is shared in their corresponding semantic-geographic traces. Therefore, only semantically related POI category pairs are used to calculate the geographic similarity, and the semantically unrelated POI category pairs are omitted. As shown in Figure 4, given Trace 1 and Trace 2 as one type of a POI category pair's polylines over the same period (t1 to t4) for user 1 and user 2, respectively, the Hausdorff distance is calculated as follows:

$$\text{Hausdorff distance}(Trace_1, Trace_2) = \max\big(H(Trace_1, Trace_2),\, H(Trace_2, Trace_1)\big) \tag{5}$$

where the two directed terms are H(Trace 1, Trace 2) = Max(Dist(xy, abcd), Dist(yz, abcd)) and H(Trace 2, Trace 1) = Max(Dist(ab, xyz), Dist(bc, xyz), Dist(cd, xyz)).
As shown in Figure 4a, the distance between one segment pair from two traces (e.g., Dist(yz, cd)) is the arithmetic mean of the corresponding distances of their start points and end points. Moreover, as shown in Figure 4a, the distance from one trace's segment to the other trace (e.g., Dist(xy, abcd)) is the minimum distance between xy and the segments of Trace 2 (supposing Dist(xy, ab) < Dist(xy, bc), then Dist(xy, ab) remains). Note that the duration of the segments in Trace 2 must overlap the duration of segment xy.
As the Hausdorff distance increases, geographic similarity decreases. The final geographic similarity between two check-in traces, Geographic similarity(CTrace 1, CTrace 2), is given by Formula (7), where n is the number of related POI category pairs, and r is a distance threshold value such that a Hausdorff distance greater than r results in a geographic similarity equal to the minimum value of 0. The upper limit of geographic similarity is 1, which occurs when the Hausdorff distance between two traces is 0.
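The distance computations described above can be illustrated with a short Python sketch. Several points are assumptions rather than the authors' code: the helper names (`directed`, `hausdorff`, `segments`, `segment_dist`, `geographic_similarity`) are hypothetical, coordinates are treated as planar values in kilometres, and the check that segment durations overlap is omitted. In particular, the closed form used in `geographic_similarity` is only one way to satisfy the stated properties (root mean square over POI category pairs, threshold r, range 0 to 1) and should not be read as the published formula.

```python
import math

def directed(items_a, items_b, dist):
    """Largest distance from an item in A to its nearest item in B."""
    return max(min(dist(a, b) for b in items_b) for a in items_a)

def hausdorff(items_a, items_b, dist=math.dist):
    """Symmetric Hausdorff distance: the larger of the two directed distances."""
    return max(directed(items_a, items_b, dist), directed(items_b, items_a, dist))

def segments(polyline):
    """Split a polyline [p0, p1, p2, ...] into segments [(p0, p1), (p1, p2), ...]."""
    return list(zip(polyline, polyline[1:]))

def segment_dist(s, t):
    """Arithmetic mean of the start-to-start and end-to-end point distances."""
    return (math.dist(s[0], t[0]) + math.dist(s[1], t[1])) / 2

def geographic_similarity(distances, r=1.0):
    """ASSUMED form: 1 minus the RMS of per-pair distances clipped to r and divided by r."""
    scaled = [min(d, r) / r for d in distances]
    return 1.0 - math.sqrt(sum(s * s for s in scaled) / len(scaled))

# Point sets: the directed distances differ, so the maximum of both is taken.
A = [(0, 0), (1, 0)]
B = [(0, 0), (4, 0)]
h_ab = directed(A, B, math.dist)   # 1.0
h_ba = directed(B, A, math.dist)   # 3.0
h_sym = hausdorff(A, B)            # 3.0

# Polylines x-y-z and a-b-c-d with hypothetical coordinates in km.
trace_1 = [(0, 0), (1, 0), (2, 0)]
trace_2 = [(0, 1), (1, 1), (2, 1), (3, 1)]
hd = hausdorff(segments(trace_1), segments(trace_2), dist=segment_dist)
geo_sim = geographic_similarity([hd], r=1.0)
```

In the polyline example the unmatched segment cd dominates the result, which is the behaviour a maximum-based distance is meant to have: one far-away segment is enough to make two traces geographically dissimilar.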

Semantic Intensity Similarity Calculation
Two users' semantic intensity is the semantic similarity within a geographical unit distance. Semantic intensity is calculated as a mixture of both semantic similarity and geographic similarity:

$$\text{Semantic intensity}(user_1, user_2) = \frac{\text{Semantic similarity}(STrace_1, STrace_2)}{2 - \text{Geographic similarity}(GTrace_1, GTrace_2)} \tag{8}$$

The divisor of Formula (8) consists of the geographic distance plus 1 (according to Formula (7), geographic similarity equals 1 minus the geographic distance), which avoids obtaining an infinite semantic intensity when the geographic distance equals 0. Additionally, this modification forces the semantic intensity to vary between 0 and 1. As semantic similarity and geographic similarity increase, semantic intensity increases as well. Semantic intensity is assigned the maximum value of 1 when semantic similarity is equal to 1 and geographic similarity is equal to 1. Because semantic intensity scales with both semantic similarity and geographic similarity, a suitable similarity threshold is needed to identify trajectories with high semantic intensity similarity. This issue is addressed in the next subsection.
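As a quick check of the behaviour described above, the following Python sketch evaluates semantic intensity at the boundary values. The function name `semantic_intensity` and the input validation are illustrative assumptions; the expression itself is semantic similarity divided by 2 minus geographic similarity, as stated in the text.

```python
def semantic_intensity(semantic_similarity, geographic_similarity):
    """Semantic similarity divided by (2 - geographic similarity); inputs in [0, 1]."""
    if not (0.0 <= semantic_similarity <= 1.0 and 0.0 <= geographic_similarity <= 1.0):
        raise ValueError("both similarities must lie in [0, 1]")
    return semantic_similarity / (2.0 - geographic_similarity)

best = semantic_intensity(1.0, 1.0)         # 1.0: identical semantics, same places
far_apart = semantic_intensity(1.0, 0.0)    # 0.5: identical semantics, distant places
half_match = semantic_intensity(0.5, 1.0)   # 0.5: half the semantics, same places
unrelated = semantic_intensity(0.0, 0.0)    # 0.0: nothing in common
```

Because the divisor stays between 1 and 2, geographic similarity can at most double the score, while semantic similarity scales it linearly.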

Density-Based Clustering
After transforming original trajectory data into feature vectors, generic clustering algorithms are then used to group them based on the similarity measurements mentioned above. However, as argued by Salvador and Chan [36], one important problem associated with these clustering algorithms entails determining the number of clusters. Moreover, a large amount of data corresponding to noise or outliers may adversely affect the efficiency and accuracy of these algorithms. Therefore, in this paper, a density-based spatial clustering of applications with noise (DBSCAN) algorithm [37] is used to overcome these drawbacks and effectively mine valuable trajectory patterns.
In DBSCAN, objects with many nearby neighbors are considered to form a high-density region. DBSCAN groups these objects together to form a cluster and marks objects that lie alone in low-density regions as outliers. DBSCAN requires only two parameters and is largely insensitive to the order of the objects in a dataset. One parameter is the neighborhood region's radius, epsilon (ε). The other is the minimum number of nearest objects (MinPts, or k) required to form a dense region. The performance of DBSCAN depends on the distance/similarity measure, because it defines the neighborhood region for objects and can strongly affect the extent of the final clusters. To obtain meaningful trajectory patterns with the similarity measures mentioned above, the two parameters (ε and k) are set based on the following three factors:
(1) The geographic distance between two trajectories calculated by Formula (5) should be no greater than 1 km. A large distance is indicative of a non-similar relation, regardless of how semantically similar the trajectories are. Moreover, the neighborhood region's radius of geographic similarity is also set to 1, as explained in Formula (7).
(2) The neighborhood region's radius of semantic intensity similarity between two trajectories should be no less than 0.5. According to Formulas (7) and (8), trajectories that have the minimum geographic similarity (a geographic distance equal to 1) are considered neighbors only if they have the maximum semantic similarity of 1. Furthermore, trajectories that have the maximum geographic similarity of 1 are considered neighbors only if their semantic similarity is at least 0.5. Thus, no matter how large the geographic similarity is, a semantic similarity less than 0.5 cannot generate high-density regions under the semantic intensity similarity measurement.
(3) A user is considered a dense object when the number of neighboring users in its neighborhood region is no less than a given parameter k ∈ [2,9].
A small k value will generate a large number of clusters that have very small cluster sizes. The number of clusters will decrease as k increases. However, a large k value tends to generate one special cluster that is much larger than the other clusters. Therefore, the final value of k has to be determined experimentally.
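The clustering step described above can be sketched in a few lines. This is a minimal, unoptimized DBSCAN variant written for illustration, not the authors' implementation. It assumes that neighborhoods are defined by a similarity threshold (two users are neighbors when their similarity is at least the threshold) and that a user does not count as its own neighbor when compared with k. The function and variable names are hypothetical.

```python
from typing import Callable, List, Optional

NOISE = -1


def dbscan_by_similarity(
    n: int,
    sim: Callable[[int, int], float],
    threshold: float,
    k: int,
) -> List[int]:
    """Cluster n users; returns one label per user, NOISE for outliers."""
    # Neighbors of i: every other user whose similarity reaches the threshold.
    neighbors = [
        [j for j in range(n) if j != i and sim(i, j) >= threshold]
        for i in range(n)
    ]
    labels: List[Optional[int]] = [None] * n
    cluster_id = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < k:        # not a dense (core) object
            labels[i] = NOISE
            continue
        labels[i] = cluster_id           # start a new cluster from a core object
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:       # former outlier becomes a border object
                labels[j] = cluster_id
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            if len(neighbors[j]) >= k:   # expand only through core objects
                queue.extend(neighbors[j])
        cluster_id += 1
    return [label if label is not None else NOISE for label in labels]


# Toy similarity matrix: users 0-2 are mutually similar, user 3 is isolated.
S = [
    [1.0, 0.8, 0.7, 0.1],
    [0.8, 1.0, 0.9, 0.2],
    [0.7, 0.9, 1.0, 0.0],
    [0.1, 0.2, 0.0, 1.0],
]
labels = dbscan_by_similarity(4, lambda i, j: S[i][j], threshold=0.5, k=2)
```

Precomputing all neighbor lists costs quadratic time in the number of users, which is acceptable for a sketch but would likely need a spatial index or blocking strategy at the scale of the experimental dataset.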

Experimental Analysis
This section validates the accuracy and efficiency of the proposed similarity measurement. We conducted experiments comparing four similarity measurements: semantic similarity (Sem), geographic similarity (Geo), semantic intensity similarity (SG), and GTS similarity (Ying et al. [25]). We then analyzed the mining results to demonstrate the advantages and disadvantages of each.

Data Collection and Preprocessing
The experimental data were extracted from the Chinese microblogging site Sina Weibo. We collected the check-in records of over 238,000 users in Beijing during the year 2013. The total data comprised three million records, with about 54,000 POIs. The data were preprocessed based on three rules:
(1) The 54,000 POIs were classified into five simple categories, namely educational institution (EI), hotel or restaurant (HR), indoor entertainment (IE), outdoor activity (OA), and transportation facility (TF), and one compound category, social institution or building (SB). The compound category includes industrial and commercial buildings, banks, hospitals, and government organizations. These six categories cover a user's daily movement activities, comprising working, studying, eating, etc.
(2) Each check-in trace record for a user had to be generated within a single day (from 0:00 to 23:59) and contain at least two POIs.
(3) Users who had checked in along the same trace more than twice and in at least two different months were selected as experimental users. Their traces were considered regular movement activities.
After these three criteria were applied, 14,729 check-in traces from 4340 users remained. Sample records for one user are listed in Table 1. XA, YA, XB, and YB are the coordinates of POI A and POI B; TA and TB are the check-in durations at POI A and POI B.

Experiment Design
The aim of the experiment was to compare the accuracy and efficiency of the trajectory pattern mining method based on four different trajectory similarity measurements.
Semantic similarity and semantic intensity similarity were measured using Formulas (1) and (8). The calculation of geographic similarity differed slightly from that described by Formula (7): the Hausdorff distance was calculated for two users' trajectories without considering the POI categories. The GTS similarity between two users' trajectories was based on the method of Ying et al. [25] and is given by Formula (9). In Formula (9), M_i and M_j are two segments of the trajectories of two users (U and V); m and n are the numbers of segments in each trajectory. When the geographic distance between two segments is less than 1 km, their GTS similarity is obtained as the sum of their semantic similarity and temporal similarity. Their semantic similarity is based on the cosine similarity measurement, and their temporal similarity is the proportion of the intersection of their stay time intervals to the union of their stay time intervals at a certain place. If two segments have a geographic distance greater than 1 km, they have no GTS similarity.
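The per-segment part of the GTS description can be illustrated with a short sketch. It covers only what the paragraph above states about a single segment pair (cosine-based semantic similarity plus interval intersection over union, gated by a 1 km distance). How the segment pairs are aggregated over m and n in Formula (9) is not reproduced here. The vector representation of a segment's semantics and all function names are assumptions made for illustration.

```python
import math
from typing import Sequence, Tuple

Interval = Tuple[float, float]  # (start, end) of a stay, e.g., in hours


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def temporal_similarity(a: Interval, b: Interval) -> float:
    """Length of the intersection of two stay intervals over the length of their union."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def segment_gts(
    dist_km: float,
    sem_u: Sequence[float],
    sem_v: Sequence[float],
    stay_u: Interval,
    stay_v: Interval,
    max_dist_km: float = 1.0,
) -> float:
    """Semantic plus temporal similarity when the segments are close, else 0."""
    if dist_km >= max_dist_km:
        return 0.0
    return cosine_similarity(sem_u, sem_v) + temporal_similarity(stay_u, stay_v)


close = segment_gts(0.5, [1, 0], [1, 0], (8.0, 10.0), (9.0, 11.0))
far = segment_gts(1.5, [1, 0], [1, 0], (8.0, 10.0), (9.0, 11.0))
```

Note that the actual geographic distance plays no role once it is below the cutoff: any two segments closer than 1 km receive the same score for the same semantic and temporal inputs.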
The setting of the similarity threshold for pattern mining relies on the three factors described in Section 4.2. The similarity thresholds for geographic and semantic intensity measurements are 1 and 0.5, respectively. The threshold for GTS similarity is also set to 0.5 for comparison with other measurements. Because the POI categories for each user's check-in records are highly similar, the semantic similarity threshold is set to 0.9.
Two factors are used to verify the efficiency of the four measurements. One is the total number of users in the five largest clusters (N5); a greater N5 indicates a better measurement. The other is the ratio of the number of users in the largest cluster to N5; an excessively large ratio indicates an inefficient measurement. Accuracy, on the other hand, is determined by analyzing the user set and POI set in each pattern.
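The two efficiency indicators can be computed directly from a list of cluster labels. The sketch below assumes one integer label per user, with -1 marking unclustered users. This labeling convention and the function name are illustrative assumptions.

```python
from collections import Counter
from typing import List, Tuple


def cluster_indicators(labels: List[int], top: int = 5, noise: int = -1) -> Tuple[int, float]:
    """Return (N5, ratio): users in the `top` largest clusters, and the
    largest cluster's share of that total."""
    sizes = sorted(
        Counter(label for label in labels if label != noise).values(),
        reverse=True,
    )
    top_sizes = sizes[:top]
    n5 = sum(top_sizes)
    ratio = top_sizes[0] / n5 if n5 else 0.0
    return n5, ratio


# Toy labels: clusters of sizes 6, 3 and 1, plus four unclustered users.
toy_labels = [0] * 6 + [1] * 3 + [2] * 1 + [-1] * 4
n5, ratio = cluster_indicators(toy_labels)
```

In this toy case N5 is 10 and the largest cluster holds 60% of those users; a ratio close to 1 would mean that a single cluster dominates the five largest patterns.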

Results Analysis
Running the DBSCAN algorithm with each of the four similarity measurements and the parameters mentioned above yields a large number of clusters. The number of clustered users is considered an important indicator for describing the cluster results. Usually, the measurement and cluster results improve with the number of clustered users. However, as a by-product of the DBSCAN algorithm, many clusters are very small. Ultimately, only the five largest clusters of each similarity measurement are chosen as the most meaningful patterns to be analyzed. In addition, the largest cluster may cover a very large portion of all clustered users as the parameter k increases, which greatly impairs the representativeness of the other clusters. Therefore, the proportion of the largest cluster's user number to that of the five largest clusters is also used to determine the efficiency of each similarity measurement. The two indicators for the cluster results of the four similarity measurements are shown in Figure 5.
Figure 5 shows that as the parameter k increases, the top cluster user ratio for semantic similarity jumps to a very high level when k is greater than 7. The clustered user number for geographic similarity reaches a maximum when k equals 6. On the other hand, semantic intensity similarity and GTS similarity generate much smaller clusters, and their total values change only slightly when k ranges from 2 to 6. Finally, the clustering parameter k is set to the same value of 6 to make better comparisons among the four similarity measurements. Table 2 shows the parameter settings and cluster results for the four similarity measurements.

Cluster Result Analysis by User Set
Based on the cluster results presented in Table 2, we perform an overlap analysis of the clustered user sets to demonstrate the differences among the four similarity measurements.
(1) As shown in Table 2 and Figure 6, semantic similarity generated thousands of clusters, most of which have very small user sizes. Furthermore, its clustered users covered 81.5% of the users in the whole dataset, which means that most of the users have high semantic similarities with at least 6 other users. Among those users, 33.7% are also geographically similar.
(2) Only approximately 32.4% of users' activities were conducted within 1 km of other activities. Among the geographically similar users, 85% were also semantically similar, which is only slightly greater than the share of semantically similar users in the whole dataset. These two ratios indicate that there is no apparent correlation between the two types of similar users.
(3) As shown in Figure 6a, the semantic intensity similarity was practically a small subset of the union of the two similarities described above. Over 82.3% of the users in that union were both semantically and geographically similar. They were not, however, similar to each other when the two similarities were combined. In fact, they were often conducting different types of activities in nearby locations; thus, they were not truly similar. On the other hand, approximately 12.7% of semantic intensity similar users were identified among non-semantic similar and/or non-geographic similar users. These users were selected because their check-in POI sets were geographically similar for some of the semantic POI categories. In other words, they were not always conducting similar activities in nearby areas, but they did conduct some particular types of activities in nearby areas.
(4) As shown in Figure 6b, GTS similarity exhibited the smallest cluster size. Figure 6c shows that half of the GTS-clustered users overlapped with semantic intensity-clustered users. A detailed comparison between these two similarities is performed in the next subsection.
The overlap analysis described above demonstrates that the semantic intensity measurement performs quite differently from the semantic similarity and geographic similarity measurements. Its clustering results contain both semantically and geographically similar users, indicating more valuable patterns. In addition, semantic intensity identified more clustered users than GTS similarity.
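An overlap analysis of this kind reduces to set operations on the clustered user IDs of each measurement. The following sketch uses made-up user IDs purely for illustration; none of the values correspond to the experimental data.

```python
from typing import Set


def overlap_share(a: Set[str], b: Set[str]) -> float:
    """Fraction of the users in `a` that also appear in `b`."""
    return len(a & b) / len(a) if a else 0.0


# Hypothetical clustered-user sets for three measurements.
sem_users = {"u1", "u2", "u3", "u4"}
geo_users = {"u3", "u4", "u5"}
sg_users = {"u4", "u6"}

# Share of semantically similar users who are also geographically similar.
sem_also_geo = overlap_share(sem_users, geo_users)

# Share of semantic-intensity users found outside the union of the other two.
sg_outside_union = 1.0 - overlap_share(sg_users, sem_users | geo_users)
```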

Pattern Result Analysis by POI Set
In this section, the POI sets in each cluster are used to analyze the semantic and geographic features of each pattern obtained by using different similarity measurements.
(2) The five largest geographic patterns are illustrated in Figure 8. They are located in different places in Beijing and have different area sizes. However, no detailed semantic rules can be determined from these geographic patterns. It is thus unclear what types of activities are conducted by each pattern's users. In fact, the overlap analysis in Figure 6 shows that most of the users are not performing similar activities in nearby places.
(3) The semantic intensity similarity measurement generated 32 small semantic-geographic patterns. The geographic distributions of the top five patterns are shown in Figure 9. In SG pattern 1, for example, people from northern and western areas frequently check in at one place: Zhongguancun. In fact, Zhongguancun is the most popular information technology center in Beijing. Many active Weibo users work or shop at Zhongguancun; they may share common interests that can be used for accurate user or business recommendations. In addition, the geographic distribution and time duration of pattern 1 can provide detailed information that city planners can use to make better decisions in this area.
(4) The GTS similarity measure generated only six GTS patterns, as can be seen in Figure 10. Except for pattern 1, all the patterns contain very few users. According to Formula (9), the calculation of GTS similarity does not consider the value of geographic similarity when users' trajectories are less than 1 km apart; all such trajectories are treated the same as those with high geographic similarity. As a result, users who have low geographic similarity (e.g., a geographic distance equal to 1 km) can still be clustered in GTS patterns. This is why pattern 1 and pattern 4 in Figure 10 both have larger areas than the corresponding clusters in the semantic-geographic patterns.
The cluster results suggest that, compared with the three other similarity measurements, semantic intensity is more effective in identifying various and valuable trajectory patterns. The pattern results reveal more accurate information pertaining to geographic distribution and semantic meaning. This information is important for interpreting patterns, and the patterns obtained can reveal more detailed movement activities among users.

Discussion and Conclusions
In this paper, we discuss several similarity measurements for trajectory data. We formally define the new concept of a semantic-geographic pattern and propose a novel similarity measurement called semantic intensity to calculate semantic and geographic similarity within a unified framework. We then use a flexible density-based clustering algorithm to mine these semantic-geographic patterns. Comparative trajectory pattern mining experiments conducted using four different similarity measurements show that semantic intensity can effectively measure both semantic and geographic similarity among trajectory data. The experimental results also show that the patterns obtained using semantic intensity contain more interesting information than those obtained using the other measurements. This information can be used to interpret patterns, clarify people's movement activities in more detail, and supply more accurate location and user recommendations.
Two factors enable semantic intensity to perform better than the other measurements. One is the proper combination of two different dimensions: semantic similarity and geographic similarity. The division strategy of Formula (8) not only limits semantic intensity to values between 0 and 1, but also avoids the task of setting weights for each dimension. Therefore, compared with the pure semantic similarity and geographic similarity measurements, semantic intensity can measure the two similarities simultaneously and is suitable for various trajectory data. The other factor is that semantic intensity only considers the segments of trajectories that have similar semantic meanings and shorter geographic distances. Because users' movement activities vary widely in the semantic, geographic, and temporal dimensions, it is difficult to obtain high similarity values between two users' whole trajectories. Thus, compared with GTS similarity, semantic intensity makes it easier to obtain high similarity values and to identify more common trajectory patterns.
Data quality problems are an inevitable issue that can affect the efficiency of semantic-geographic pattern mining tasks for all kinds of similarity measurements. In fact, the experimental data are limited by two main factors. First, check-in data can be considered a small and biased sample of the entire population's activities. The check-in frequency for one person is quite low, and the check-in locations cover only a small portion of the places he or she has visited. Several data preprocessing strategies are used to extract the representative places for each person. However, those strategies greatly reduce the number of experimental users and lead to a very small number of semantic-geographic pattern users. A recent study on mining human activity patterns (not trajectory patterns) from Twitter data also showed that only 2.72% of users had very similar activity patterns based on space-time and semantics, whereas the majority (87.14%) showed different activity patterns (i.e., similar spatiotemporal patterns and different semantic patterns, similar semantic patterns and different spatiotemporal patterns, or different in both) [38]. Therefore, it is difficult to select a suitable sample dataset for accuracy verification. Second, the experimental users cover only a small portion of the population, and their representativeness can hardly be investigated because of privacy limitations. This limitation makes it difficult to identify the types of people in one trajectory pattern. For example, we can draw conclusions about characteristics shared by users in the semantic-geographic patterns described above. However, it is very hard to verify whether the users are actually IT employees. Thus, the conclusions can only be drawn directly from the dataset. Other trajectory data also have shortcomings that must be addressed when conducting semantic-geographic pattern mining.
In addition to the similarity measurement and data quality problems, several challenges remain with respect to this new pattern mining task. Given the diversity of people's movement behavior, setting fixed parameters in the clustering process is questionable. A more flexible parameter-setting strategy and more stable clustering algorithms that can identify more accurate and valuable patterns are needed.