According to China’s 2019 telecommunications statistics, the number of mobile phone users reached 1.6 billion by the end of 2019 [1]. Owing to the development of positioning techniques and the widespread use of smart devices, personal trajectory data has become an important resource for understanding individual and group behaviors, and trajectory data mining has become a hot topic in many research fields [2]. For instance, Elragal et al. [3] and Shingo Enami et al. [4] applied related technologies to vehicle management. Tian Qin et al. [5] proposed a method to mine people’s spatio-temporal routines from mobile phone data. Huan et al. [6] explored social behaviors using mobile sensor data. Chen et al. [7] made disease predictions based on mobile big data. Xudong Liu et al. [8] used taxi trajectory data to identify urban functional regions in Chengdu. In addition, trajectory data analysis has been applied in practical applications such as nearby-friend recommendation based on location-based services (LBS) [9] and route navigation in map applications.
Discovering accompanying or group behavior patterns is an important branch of mobile trajectory data mining. Such a pattern is defined as two or more moving objects that travel together for a period of time. Its discovery provides significant support to many related fields, such as the control of key personnel, tourism development, accident investigation, and group tracking, and it has been applied in important application scenarios. Tang et al. [10] proposed loose companion discovery for military object monitoring, which describes the case in which several members temporarily leave the group and return within a short time. Meiling Zhu et al. [11] proposed a novel algorithm to find the platoon companion pattern over a special type of spatio-temporal data stream. Zhu et al. used Hainan tourist data to find group movement patterns and classify tourists [12]. Thus, mining and analyzing accompanying behavior patterns is necessary for related applications and academic fields.
Since mobile devices generate massive amounts of data, one major challenge in accompanying pattern mining is that high-performance algorithms are needed to process the data in limited time. Another major challenge comes from the optimization of the traveling companion discovery algorithm. This algorithm derives from the Clustering-and-Intersection method [13], which defines companion candidates to describe similar companions in each time snapshot. Tang et al. [14] optimized the Clustering-and-Intersection algorithm into a smart-and-closed algorithm by incorporating the buddy structure to improve the effectiveness of the method. Meanwhile, some studies [12] use a similar approach to discover traveling companions or other behavior patterns. However, methods based on time-snapshot slicing can easily omit candidates, especially when each time period is extremely short. Because mobile trajectory data is sparse, clustering it with an unsuitable time segmentation is risky: some trajectory data that could be clustered may fail to cluster and may even be filtered out as noise. Therefore, approaches based on time-segmented slicing may not always be successful.
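The omission problem can be illustrated with a toy example. The sketch below is not taken from any of the cited methods; the function names, the 60-second snapshot width, and the 10-second gap are assumptions chosen only for illustration, and both records are assumed to come from the same place. Two users are observed 4 seconds apart, but a fixed snapshot boundary separates them.

```python
from itertools import combinations


def pairs_by_snapshot(records, width):
    """Return user pairs that fall into the same fixed-width time snapshot."""
    buckets = {}
    for user, t in records:
        # Integer division assigns each record to exactly one snapshot.
        buckets.setdefault(t // width, set()).add(user)
    return {
        pair
        for users in buckets.values()
        for pair in combinations(sorted(users), 2)
    }


def pairs_by_time_gap(records, max_gap):
    """Return user pairs with at least one pair of records within max_gap seconds."""
    return {
        tuple(sorted((u1, u2)))
        for (u1, t1), (u2, t2) in combinations(records, 2)
        if u1 != u2 and abs(t1 - t2) <= max_gap
    }


# Hypothetical sparse records: (user, timestamp in seconds).
records = [("A", 58), ("B", 62)]
snapshot_pairs = pairs_by_snapshot(records, width=60)
gap_pairs = pairs_by_time_gap(records, max_gap=10)
```

With these inputs the snapshot-based grouping finds no pair, because the two records land in snapshots 0 and 1, while the gap-based comparison keeps the pair.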
In this paper, we propose a new companion discovery method based on a clustering algorithm and an association analysis algorithm to solve the above problems. In contrast to methods or models based on time-snapshot slicing, this method captures closeness in location and closeness in time in the moving-user data from a holistic perspective. In addition, more attention is paid to the potential correlation between users. For example, if A and B are a pair of accompanying partners, they are more likely to spend time together in a small region, which can be expressed as B appearing when A appears, or A appearing when B appears.
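The idea that "B appears when A appears" can be read as the confidence of an association rule. The following is a minimal sketch under that assumption; it counts co-occurrence directly rather than running FP-growth, and the function name and the toy clusters are hypothetical.

```python
from collections import Counter
from itertools import combinations


def pair_confidence(clusters):
    """Compute conf(a -> b) = count(a and b together) / count(a) over clusters.

    `clusters` is an iterable of sets of user ids, one set per
    spatio-temporal cluster (an assumed input format).
    """
    user_count = Counter()
    pair_count = Counter()
    for members in clusters:
        members = set(members)
        user_count.update(members)
        pair_count.update(combinations(sorted(members), 2))

    rules = {}
    for (a, b), together in pair_count.items():
        rules[(a, b)] = together / user_count[a]  # B appears when A appears
        rules[(b, a)] = together / user_count[b]  # A appears when B appears
    return rules


# Hypothetical clusters: A and B usually appear together, C rarely joins them.
clusters = [{"A", "B"}, {"A", "B", "C"}, {"A", "B"}, {"C"}, {"A"}]
rules = pair_confidence(clusters)
```

Here A occurs in four clusters and B in three, always together with A, so the rule from B to A is stronger than the rule from A to B; a pair could be treated as a candidate when both directions are high.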
The proposed algorithm is an extension and optimization of our previous work [16]. On this basis, we improve the algorithm and propose a five-stage framework. First, Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) [17] is used to mine similar moving users within a certain geographic area and time span. Then, a classical association analysis algorithm, Frequent Pattern growth (FP-growth), is used to predict the internal association relationships among similar users; it makes full use of the high similarity of the clustered data to find potential accompanying patterns. The following stage applies a filtering strategy to remove certain pseudo-companion scenarios and obtain the recommended traveling companions. The last stage merges the resulting data into groups.
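The final merging step can be pictured as grouping companion pairs that share members. The text here does not specify how the merge is performed, so the sketch below simply assumes that groups are the connected components of the pair graph and uses a small union-find; `merge_pairs` and the sample pairs are hypothetical.

```python
def merge_pairs(pairs):
    """Merge companion pairs into groups (connected components, assumed rule)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for user in list(parent):
        groups.setdefault(find(user), set()).add(user)
    # Sort members and groups so the output is deterministic.
    return sorted(sorted(group) for group in groups.values())


# Hypothetical filtered pairs produced by the earlier stages.
pairs = [("u0", "u3"), ("u3", "u4"), ("u4", "u30"), ("u7", "u9")]
groups = merge_pairs(pairs)
```

A stricter rule, for example requiring every pair inside a group to be a companion pair, would produce smaller groups; the connected-component rule is only the simplest choice.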
The main contributions proposed in this paper are as follows:
A framework of traveling companion discovery named GroupSeeker is proposed. Through a five-stage processing flow, GroupSeeker can find potential traveling companions in a huge amount of trajectory data with high performance and accuracy.
Parameter-setting strategies are embedded in GroupSeeker. The primary stages can determine their parameters according to the characteristics of the datasets, which makes the framework much more practical and applicable.
A novel spatio-temporal clustering method is used to deal with trajectory data over long-term time slices and to solve the problem of omitted companion candidates caused by improper short-term time segmentation in previous work.
Experimental results on real-world and simulated datasets show that the time cost of GroupSeeker is at a desirable level. Trajectory data covering twenty-four hours can be processed within one and a half hours, which means GroupSeeker can be used in round-the-clock monitoring jobs.
The remainder of the paper is organized as follows. Section 2 introduces the related work; Section 3 gives the problem definition and the methodology, including the framework and methods; Section 4 presents the experimental results; Section 5 concludes this paper and discusses future work.
4. Experiment and Results
All the algorithms are implemented in Python 3.8.2 in PyCharm and are run on computers with an Intel Core i7-8550U CPU (1.80 GHz), 16.0 GB RAM, and Windows 10.
4.1. Data Sets
Various sample sets are extracted from two real-world data sets according to different criteria. The data sets are as follows:
D1 (Traveling Users Dataset): This dataset, provided by a communication provider in China, was collected from real users in a certain region of China between 16 November 2014 and 18 November 2014. The locations come from cell sites, each of which is connected to many phones. The raw spatial trajectory data mainly includes latitude and longitude coordinates, timestamps, and user information. Before we received the dataset, the provider had anonymized the personally sensitive information and re-adjusted the coordinate information for privacy protection.
D2 (Geolife Trajectory): This dataset was collected in the Geolife project (Microsoft Research Asia) from 182 users between April 2007 and August 2012. A GPS trajectory in this dataset is represented by a sequence of time-stamped points containing latitude, longitude, and altitude information. Most of the tracks are in a dense representation, e.g., every 1–5 s or every 5–10 m per point. An overview of this data set is shown in Figure 5.
It is essential to choose suitable data sets. After preprocessing, D1 and D2 are each divided into many subsets according to the number of records. For instance, we split the Geolife data set into 19 subsets of about 800,000 records each. From these subsets, we choose 5 subsets from D1 and 5 from D2, which are shown in Table 4. Notably, a simulated data set called Sim1 is generated from a subset of D2. Sim1 is regarded as a subset of D3, the combined real-world and simulated source.
Although Sim1 is small, it contains two simulated companion users that we added for a particular user, which allows the effectiveness of the algorithm to be verified quickly. The two simulated users are generated from the trajectory of a real user, in particular from a simple description of its state changes, the most basic of which is the change of direction in the two dimensions of latitude and longitude. By recording a state-change matrix, the basic state changes can be learned and hence the traveling companion’s behavior can be simulated.
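One way to picture this generation step is sketched below. It is only an assumption about how such a simulation could work, not the procedure used to build Sim1: the state changes are stored as per-step direction signs and step sizes, and a companion is produced by replaying them from a slightly shifted start with small noise. The names `state_changes` and `simulate_companion` and the `offset` and `jitter` values are hypothetical.

```python
import random


def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def state_changes(track):
    """Record, per step, the direction and size of the latitude/longitude change."""
    steps = []
    for (lat0, lon0), (lat1, lon1) in zip(track, track[1:]):
        d_lat, d_lon = lat1 - lat0, lon1 - lon0
        steps.append((sign(d_lat), sign(d_lon), abs(d_lat), abs(d_lon)))
    return steps


def simulate_companion(track, offset=1e-4, jitter=1e-5, seed=0):
    """Replay the recorded state changes from a shifted start with small noise."""
    rng = random.Random(seed)
    lat, lon = track[0][0] + offset, track[0][1] + offset
    companion = [(lat, lon)]
    for s_lat, s_lon, m_lat, m_lon in state_changes(track):
        lat += s_lat * m_lat + rng.uniform(-jitter, jitter)
        lon += s_lon * m_lon + rng.uniform(-jitter, jitter)
        companion.append((lat, lon))
    return companion


# Hypothetical real-user track as (latitude, longitude) points.
track = [(30.0, 104.0), (30.001, 104.0), (30.001, 103.999), (30.002, 103.998)]
companion = simulate_companion(track)
```

Because the noise accumulates over steps, a long track would drift away from the real user; a practical generator would need to bound that drift.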
Except for Sim1 and Set5, the subsets have similar data sizes and similar numbers of records. Ten sample subsets (Set1–Set5 and Geo1–Geo5) are randomly selected from D1 and D2, respectively, in order to compare the impact of the sparseness and density of real-world data sets on the algorithm results. Set5, whose size is about half that of the other nine sample sets, is used to show the effect of data size on the method. Note that at the scale of 800,000 records, our experimental environment is close to its memory limit.
4.2. Pseudo-Companion Scenarios Filtering Display
Some typical intermediate results are visualized in Figure 6 and Figure 7, which show the situations that need to be filtered out. To simplify the display, we select one type of rule set, which covers the two types of scenarios in the sparse data set.
In the sub-figures of Figure 6, there is brief contact between two users; however, for one of the two users, the records representing the contact do not account for a large proportion of that user’s total records. Hence, such pairs are filtered out by the rule sets. The case in Figure 6b can be regarded as a no-contact scenario because the users have few records showing close contact. Finally, Figure 6d is a partial enlargement of Figure 6c; the close-contact records between the two users still account for too small a proportion, so they are not considered real companions. The sub-figures in Figure 7 show cases that satisfy the filtering rules. Among them, Figure 7a is the result for Sim1, which includes three users who move together in a small area. In addition, Figure 7d is a partial enlargement of another panel of Figure 7.
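The proportion-based rule described above can be sketched as follows. The threshold value and the function names are assumptions for illustration only; the rule sets actually used are not reproduced here.

```python
def contact_ratio(contact_records, total_records):
    """Share of a user's records that represent close contact with the other user."""
    if total_records == 0:
        return 0.0
    return contact_records / total_records


def is_pseudo_companion(contact_a, total_a, contact_b, total_b, min_ratio=0.5):
    """Flag a pair when, for either user, contact records are too small a share.

    `min_ratio` is a hypothetical threshold, not a value from the paper.
    """
    lowest = min(
        contact_ratio(contact_a, total_a),
        contact_ratio(contact_b, total_b),
    )
    return lowest < min_ratio


# Brief contact: only 5 of user A's 100 records are near user B.
brief = is_pseudo_companion(contact_a=5, total_a=100, contact_b=5, total_b=8)
# Sustained contact for both users.
sustained = is_pseudo_companion(contact_a=90, total_a=100, contact_b=85, total_b=100)
```

Taking the minimum of the two ratios reflects the statement that a pair is filtered when the contact is a small share for even one of the two users.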
4.3. The Results of Traveling Companion Discovery and Validation
4.3.1. Measuring Time Overhead
Table 6 highlights the time overhead of the 10 data subsets in the framework, from Stage II to Stage V. Stage II clearly has the largest time overhead in all 10 data subsets, and its overhead differs greatly between D1 and D2. The time overhead of Stage III is affected by the scale of the data subset. In Stage V, when the result of the previous stage contains no multiple targets, the time overhead is 0. In addition, we use the average number of records per user to indicate the sparseness of each data subset. By this measure, D1 is sparser than D2. Thus, the parameter settings should not be too strict for D1; otherwise, it would be difficult to discover TC. In practice, the filter rule used for these results filters brief contact in D1. Since this rule set is not used for D2, which is densely sampled, the time cost of Stage IV is 0 for D2. Finally, the time overhead of Group Merging is negligible compared with the other stages at this scale of data. Therefore, the overhead of Stage V is not shown in Table 6. The related parameter settings are shown in Table 7. Distances are measured in meters, and time thresholds are measured in seconds.
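The sparseness indicator mentioned above, the average number of records per user, is straightforward to compute. The sketch below assumes records are `(user, timestamp)` tuples; the function name and the sample data are hypothetical.

```python
from collections import Counter


def average_records_per_user(records):
    """Average number of records per user; lower values indicate sparser data."""
    counts = Counter(user for user, _timestamp in records)
    if not counts:
        return 0.0
    return sum(counts.values()) / len(counts)


# Hypothetical samples: cell-site style (sparse) versus GPS style (dense).
sparse = [("a", 0), ("a", 600), ("b", 30)]
dense = [("a", t) for t in range(0, 50, 5)] + [("b", t) for t in range(0, 50, 5)]

sparse_avg = average_records_per_user(sparse)
dense_avg = average_records_per_user(dense)
```

Comparing this value across subsets covering a similar time span gives a quick signal of whether looser thresholds, as suggested for D1, are likely to be needed.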
4.3.2. Significant Result Analysis
The number of TC in each data subset is shown in Table 5. Although some subsets produce few or no results, this matches real-world data scenarios in which no accompanying pattern exists. In the following, some special and meaningful TC are presented by visualizing the experimental results. For instance, u0, u3, u4, and u30 are recommended from Geo3 as a TC. Over the long-term period of Geo3, all of them move through the road network in this geographic area within a close period of time. Therefore, their trajectories, shown in Figure 8, are very similar, and the coverage rate among them is high. The main difference, shown in Figure 8a, is that a small part of the trajectories differs, which may result from a short-term separation or from data loss caused by differences in the positioning signal. On the other hand, u0 and u3 have identical records over a long period. We further checked the undivided D2 dataset to verify this and found that their records were identical from 0:52 on 30 March 2009 to 2:58 on 5 July 2009. It is therefore reasonable to guess that this is a case of one individual carrying two mobile devices. Such findings could support the management of special objects, such as focusing on individuals or groups with sensitive behaviors. The sample data is shown in Table 8.
5. Conclusions and Discussion
At present, mobile positioning devices such as navigation devices, smart wearable devices, and smart infrastructure are increasingly popular in daily life. LBS has become an important element that most people now rely on. Locatable devices and LBS together provide the conditions for generating massive amounts of mobile trajectory data. The traveling companion discovery algorithm is widely used as an important method for discovering accompanying behavior patterns. However, given the current information explosion and the diversity of sampling methods, the applicability and efficiency of the method need to be improved as much as possible.
Thus, as a basic supporting technology for many trajectory data mining applications, this paper proposes GroupSeeker, an applicable framework for discovering traveling companions in vast spatio-temporal data. The framework consists of a five-stage processing flow, and the core algorithms lie in three stages: Spatio-temporal Clustering, Companion Voting, and Pseudo-companion Filtering. GroupSeeker avoids the problem of useful clusters being treated as noise because of poor time segmentation. In addition, considering the different sparseness of data sources, parameter-setting strategies are proposed to improve the reliability of the framework and reduce the learning cost. Moreover, a set of imperfect but effective methods for filtering confusing scenarios is proposed. In practice, the parameters of GroupSeeker can be set according to the purpose of mining and the specific scenario. Finally, the framework is evaluated on several real-world datasets with different sparsity and data sizes. The experimental results show practical efficiency and stability.
In the future, more attention can be given to how to effectively extract the features of pseudo-companion scenarios. It is also necessary to further reduce the number of parameters in the framework and to simplify the parameter-setting strategies. In addition, if the entire framework can be combined with a high-performance parallel and distributed computing solution to reduce the time overhead of the clustering stage, the efficiency of the whole framework will improve further. Moreover, we plan to use a large amount of labeled accompanying trajectory data, combined with machine learning methods, to conduct more detailed rule formulation and algorithm design for the Pseudo-companion Filtering stage.