GroupSeeker: An Applicable Framework for Travel Companion Discovery from Vast Trajectory Data

The popularity of location-enabled mobile devices and Location Based Services (LBS) generates massive spatio-temporal data every day. Because behavior patterns are closely related to movement trajectories, trajectory data mining has been applied in numerous fields to find behavior patterns. Among these applications, discovering traveling companions is one of the most fundamental techniques. This paper proposes a flexible framework named GroupSeeker for discovering traveling companions in vast real-world trajectory data. With real-world data sources, it is important to avoid the companion candidate omitting problem that arises in time-snapshot-slicing-based methods. These methods do not work well with sparse real-world data, where the sparsity is caused by equipment sampling failure or manual intervention. In this paper, a 5-stage framework comprising Data Preprocessing, Spatio-temporal Clustering, Candidate Voting, Pseudo-companion Filtering, and Group Merging is proposed to discover traveling companions. The framework works well even when the time span covers several days. Experimental results on two real-world data sources, which offer a large number of data subsets with different scales and sampling frequencies, show the effectiveness and robustness of this framework. Moreover, the proposed framework is more efficient when discovering satisfactory companions over a long-term period.


Introduction
According to statistics on China's telecommunications business in 2019, the number of mobile phone users reached 1.6 billion by the end of 2019 [1]. Due to the development of positioning techniques and the widespread use of smart devices, personal trajectory data has become an important resource for understanding personal or group behaviors, and trajectory data mining has become a hot topic in many research fields [2]. For instance, Elragal et al. [3] and Shingo Enami et al. [4] used related technologies in vehicle management. Tian Qin et al. [5] proposed a method to mine the spatio-temporal routines of people based on mobile phone data. Huan et al. [6] explored social behaviors using mobile sensor data. Chen et al. [7] made disease predictions based on mobile big data. Xudong Liu et al. [8] used taxi trajectory data to identify urban functional regions in Chengdu. In addition, trajectory data analysis has been applied in practical applications, such as nearby-friend recommendation based on location-based services (LBS) [9] and route navigation in map applications.
Discovering accompanying or group behavior patterns is an important branch of mobile trajectory data mining. Such a pattern is defined as two or more moving objects that travel together for a period of time. Discovering it provides significant support to a large number of related fields, such as control of key personnel, tourism development, accident investigation, and group tracking.
It has been applied in important application scenarios. Tang et al. [10] proposed loose companion discovery for military object monitoring, which describes the case where several members temporarily leave the group and return within a short time. Meiling Zhu et al. [11] proposed a novel algorithm to find the Platoon companion pattern over a special type of spatio-temporal data stream. Zhu et al. [12] used Hainan tourist data to find group movement patterns and classify tourists. Thus, mining and analyzing accompanying behavior patterns is necessary for related applications and academic fields.
Since mobile devices can generate massive amounts of data, one major challenge for accompanying pattern mining is that high-performance algorithms are needed to process massive data in limited time. Another major challenge comes from the optimization of the traveling companion discovery algorithm. The Traveling Companion Discovering Algorithm is derived from the Clustering-and-Intersection method [13], which defines companion candidates to describe similar companions in each time snapshot. Tang et al. [14] optimized the Clustering-and-Intersection algorithm into a smart-and-closed algorithm by incorporating the buddy structure to improve the effectiveness of the method. Meanwhile, some studies [12,15] use a similar approach to discover Traveling Companions or other behavior patterns. However, the time-snapshot-slicing-based method easily causes a candidate omitting problem, especially when the time slices are extremely short. Because mobile trajectory data is sparse, clustering it with an unsuitable time segmentation is risky. Concretely, some trajectory data that could be clustered may fail to be clustered and may even be filtered out as noise. Therefore, approaches based on time-segmented slicing may not always be completely successful.
In this paper, we propose a new companion discovery method based on the clustering algorithm and association analysis algorithm to solve the above problems. In contrast to the time-snapshot-slicing-based methods or models, this method finds the closeness in the location and the closeness in time reflected in the moving-user data from a holistic perspective. In addition, more focus is given to the potential correlation between users. For example, if A and B are a pair of accompanying partners, they are more likely to spend time together in a small region, which can be defined as that B appears when A appears or A appears when B appears.
The proposed algorithm is an extension and optimization of our previous work [16]. On this basis, we improve the algorithm and propose a 5-stage framework. First, Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) [17] is used to mine similar moving users within a certain geographic area and time span. Then, a classical association analysis algorithm, Frequent Pattern growth (FP-growth), is used to predict the internal association relationships among similar users, which makes full use of the high similarity of the clustered data to find potential accompanying patterns. The following stage applies a filtering strategy that handles certain pseudo-companion scenarios to obtain the recommended travel companions. The last stage merges the resulting data into groups.
The main contributions proposed in this paper are as follows:
• A framework of traveling companion discovery named GroupSeeker is proposed. Through a five-stage processing flow, GroupSeeker can find potential traveling companions in a huge amount of trajectory data with high performance and accuracy.
• Parameter Setting Strategies are inherently embedded into GroupSeeker. Primary stages can determine their parameters according to the characteristic of datasets, which makes the framework much more practical and applicable.
• A novel Spatio-temporal clustering method is used to deal with trajectory data of long-term time slices and solve the omitting problem of companion candidates caused by improper short-term time segmentation in previous work.
• Experimental results on real-world and simulated datasets show the time cost of GroupSeeker is at a desirable level. Trajectory data for twenty-four hours can be processed within one and a half hours, which means GroupSeeker can be used in all-weather monitoring jobs.
The remainder of the paper is organized as follows. Section 2 introduces the related work; Section 3 gives the problem definition and the methodology, including the framework and methods; Section 4 presents the experimental results; Section 5 concludes the paper and discusses future work.

Related Work
In this section, the related work based on two main categories is introduced, i.e., the trajectory clustering and the companion pattern discovery.

Trajectory Clustering
For the clustering of similar trajectories based on the time dimension, Agrawal et al. proposed the trajectory similarity measurement based on Euclidean distance [18] in 1993. Faloutsos C et al. and Chan KP et al. used discrete Fourier transform and discrete wavelet transform respectively to preprocess the trajectory similarity measurement based on Euclidean distance [19,20]. Elnekave S et al. improved the expression of MBR by proposing MBB (Minimum Boundary Box) smooth trajectory to deal with the influence of noise better [21].
For clustering based on the similarity of trajectories, similarity mining over entire trajectory features relaxes the requirement in the time dimension and only requires the chronological order among the trajectory-recording points; thus, general DTWD is used for such clustering [22]. For local clustering within a single trajectory, Lee et al. presented a framework that first partitions trajectories into sub-trajectories according to the minimum description length principle and then groups them using density clustering [23]. In addition, several density-based clustering methods have been proposed, such as DBSCAN, DENCLUE, and OPTICS. DBSCAN [24] is a widely used spatial location clustering algorithm. It does not need the number of clusters to be determined in advance and can find clusters of arbitrary shapes [25]. In 1999, OPTICS was proposed by Ankerst M. et al. Instead of producing the clusters of a data set explicitly, it creates an augmented ordering of the database representing its density-based clustering structure [26]. Hinneburg, A and Gabriel, HH proposed DENCLUE 2.0 to address a disadvantage of DENCLUE 1.0 [27], namely that making small steps at first could prevent convergence to the maximum [28]. In 2017, McInnes et al. proposed a hierarchical density-based clustering algorithm and released a related codebase as a Python package [17]. Yuqing Yang et al. proposed a trajectory clustering algorithm to extract trajectory Stays based on density analysis in spatial-temporal trajectory data and achieved higher clustering accuracy on real-world data sets [29].
For clustering of the trajectory points, Gao Y et al. proposed a constrained k-nearest neighbor queries among trajectories [30]. A Subtrajectory Clustering algorithm based on the Fréchet Distance using GPU was proposed by Gudmundsson J et al. [31] to take advantage of continuous Fréchet Distance as the measurement of similarity among trajectory curves which has obvious performance advantages. Similarly, Deng Z et al. proposed a modified OPTICS algorithm, called Tra-OPTICS, to cluster trajectory. Besides, a GPU-based version is proposed to optimize performance, called G-Tra-OPTICS, which is based on the STR-tree as the indexing structure [32]. Yuan, G et al. summarized these important techniques of trajectory clustering [33].
For semantic trajectory clustering, Xiao X et al. proposed a method for finding similar users using category-based history [34]. Ying JC et al. proposed semantic trajectory clustering based on the location prediction to recommend the user to the next dimension [35]. Liu S et al. presented an approach to achieve recognition of hot spots among trajectories [36]. Andrienko et al. presented generic techniques and visualization guidelines to support movement data analysis, using the trajectory clustering on a real air traffic data-set [37]. Olive, X., and Morio, J. applied the trajectory clustering in the air traffic management and validated the effectiveness of the proposed method on a real-world trajectory set [38].

Companion Pattern Discovery
Through analyzing the behavioral patterns of mobile-object groups, accident investigation and group tracking based on the spatio-temporal environment can be realized. In a series of research outputs, representative trajectory patterns were defined, mainly including flock [39], convoy [40], swarms [41] and gathering [42]. In 2016, Zhenzhen Wang et al. presented a literature review to summarize the existing travel behavior studies that applied mobile phone data and have discussed the potential of mobile phone data in advancing travel behavior research [43].
Gudmundsson et al. [39] defined the flocking model, which optimizes early research on population patterns by predefining the regional shape and population size. The convoy model defined by Jeung et al. [40] realized trajectory mining with arbitrary shapes based on density clustering, avoiding predefined spatial thresholds; the model requires a certain number of moving objects to be density-connected over k durations. A further optimization based on the first two models was the Swarm model defined by Li et al. [41]. In their method, the time is not required to be continuous when the moving objects move together for a certain period of time. Zheng et al. [42] defined the gathering pattern, which models group events in trajectories, such as celebrations, parades, and protests. In addition, effective index structures and fast pattern detection based on bit vectors were proposed to improve mining efficiency. Fan Chen et al. proposed a method for detecting group interactions for groups of varying numbers of objects [44]. Zhang et al. [45] used a spatio-temporal graph to retrieve gatherings. The researchers presented CUTis [46] (Clustering Trajectory data stream), a processing algorithm for an incremental trajectory data stream. A method for identifying the group movement pattern through mobile phone call detail records (CDRs) based on similarity to discover tourist groups was proposed by Zhu et al. [12]. An algorithm for finding gradual moving object cluster patterns among trajectory streams was proposed by Yujie Zhang et al. [15]. To discover accompanying vehicles in intelligent transportation systems (ITS), a typical application in software engineering technology, Meiling Zhu et al. [47] proposed a method for discovering Traveling Companions through an Automatic Number Plate Recognition (ANPR) data stream, using frequent sequence mining with time constraints. Zhang et al. [15] used a sliding window to mine cluster patterns in trajectory data.
Moreover, the correlation analysis algorithm is used in the trajectory analysis and pattern discovery. Xia Dawen et al. [48] proposed a method using a parallel frequent pattern growth algorithm based on map-reduce to analyze trajectory big data. Hu et al. [49] used OPTICS clustering and association. Based on frequent item-set, Al-badwi et al. [50] proposed a breadth-first and depth-first hybrid distributed approach with Frequent itemset mining (HD-FIM) on Spark to increase the efficiency of discovering companion vehicles.
Regarding methods for discovering traveling companions, Puntheeranurak et al. [46] proposed a micro-group-based clustering algorithm to reduce the computational cost and conducted experiments on real taxi trajectory data and synthetic data. Nevertheless, their approach can hardly avoid the Companion Candidate Omitting Problem, and the scale of their testing samples is smaller than in our work. Besides, Xinning Zhu et al. [12] proposed a threshold-based method and safe semi-supervised support vector machines (S4VMs) to calculate the similarity vectors of tourists and detect their transportation mode in order to find the group movement pattern through CDRs. However, that research and its framework are mainly used in special applications such as tourism, so the motivation of their work differs from ours. In contrast, our research is closer to a study of the underlying framework in the field of traveling companion discovery.

Materials and Methods
In this section, the problems are illustrated to describe the situation for our methods and problem definitions are presented to facilitate subsequent descriptions. Finally, a framework is proposed, including five stages to discover traveling companion.

Companion Candidate Omitting Problem
Traveling Companion is a set of moving objects that move together as a group for a period of time. In terms of spatio-temporal trajectory data, traveling companions are formalized as a set of moving objects, whose spatial positions are density-connected in a cluster within a short-term time span. Previous studies divided continuous time into time snapshots in order to discover traveling companions from spatio-temporal trajectory data and checked each time snapshot for candidate partners. However, since real-time spatio-temporal trajectory data is not always uniformly sampled in the time dimension or the geographic dimension, such time-division operations may lead to the problem of omitting candidates. We will describe the above issues in detail, and give the definitions used in the following work.
After preprocessing the real-world non-intensively sampled trajectory data, two data characteristics are found:
• Signals of real-world positioning data may be blocked during acquisition and transmission. Blocking occurs because users can actively turn off devices or terminate the location service, and because the transmission of location information may be interfered with or blocked by the surrounding environment.
• Due to differences in sampling methods and losses in data transmission, trajectory data will be sparse or partially lost during data collection.
Because of the above characteristics, when a conventional accompanying-pattern discovery algorithm applies time-segment slicing to a highly sparse trajectory data set, the recording points at the edges of the formed clusters are likely to be filtered out as noise due to the inappropriate duration of the time slices. Figure 1 shows an example of companion candidate omitting. There are three adjacent time snapshots, i.e., s1, s2 and s3. One or more clusters can be obtained in each snapshot after clustering, along with several unclustered points such as A, B and C. Because the time segmentation boundary happens to fall between their sampling times, samples A and B are divided into different time snapshots even though they are actually closely related. If s1, s2 and s3 are merged into one long snapshot, a cluster including points A and B will be found during clustering, and this cluster will be a potential companion candidate for the following processes. In fact, A and B are traveling buddies, whereas C is a real noise point. This is the classic companion candidate omitting problem caused by inappropriate time snapshot boundaries. The problem arises more frequently as trajectory data becomes sparser.
The probability of this problem is related to the length of the time segment. For example, if the raw data contains the trajectory records of a region within one day, researchers would hope to avoid the omitting problem as much as possible. There are two choices: slice the 24 h into 5-min segments, or process the whole 24 h directly. Short-term slicing increases the likelihood of introducing the problem. Conversely, if a method can take one day or several hours of trajectory data as input, this likelihood is greatly reduced. Clearly, frequent short-time slicing can easily introduce the Companion Candidate Omitting Problem, resulting in non-noisy records being filtered out.
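The effect of the slice length can be illustrated with a minimal Python sketch. The records, the helper names `snapshot_index` and `group_by_snapshot`, and the 5-minute and 24-hour slice lengths are assumptions chosen for illustration; they mirror the A, B, C example above rather than reproduce any data from the experiments.

```python
from collections import defaultdict

def snapshot_index(timestamp_s, slice_len_s):
    """Return the index of the time snapshot that contains the timestamp."""
    return int(timestamp_s // slice_len_s)

def group_by_snapshot(records, slice_len_s):
    """Group (user, timestamp) records by the snapshot they fall into."""
    snapshots = defaultdict(list)
    for user, ts in records:
        snapshots[snapshot_index(ts, slice_len_s)].append(user)
    return dict(snapshots)

# Hypothetical records: A and B are sampled 10 s apart, C much later.
records = [("A", 295), ("B", 305), ("C", 800)]

short_slices = group_by_snapshot(records, slice_len_s=300)      # 5-minute snapshots
long_slice = group_by_snapshot(records, slice_len_s=24 * 3600)  # one 24-hour span
```

With 5-minute snapshots, A and B land in different snapshots although they are only 10 s apart, so no per-snapshot clustering can group them. With a single long span they are processed together, and separating the true noise point C is left to the clustering step.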
From a holistic perspective, we take records in a longer-term time span as mining targets and focus on closeness in both the geographical and the temporal dimension. A trajectory clustering algorithm is used to mine the similar spatio-temporal features of these records among users. Meanwhile, frequent accompanying is regarded as the key criterion for discovering the Associated Traveling Companion Candidate (ATCC). Then the characteristics of the accompanying scenarios are combined to confirm the accuracy of the accompanying situation. This greatly improves the robustness of the method to the degree of data density.
$\ldots, c_m\}$ is a set of companion candidates clustered by location information, where $m$ represents the number of clusters. This paper uses a density-based clustering algorithm, for which some parameters need to be defined: $\delta_s$ is the size threshold of clustering, and $\varepsilon$ is the distance threshold. The default distance formula of several clustering algorithms is the Euclidean distance, which provides certain efficiency advantages. However, to facilitate parameter setting and improve the accuracy of the trajectory data mining results, the distance formula here may be replaced by one that better meets the needs of the scene. A locational potential candidate set is a cluster set satisfying the conditions w.r.t. $\delta_s$ and $\varepsilon$.
• Definition 4 (Time and Location Potential Candidate (TLPC)): On the basis of the location potential candidates, the records in each candidate cluster are further clustered by time. The collection of objects in the resulting clusters is regarded as the Time and Location Potential Candidate. Here, $\delta_s^t$ is defined as the minimum cluster size. Because HDBSCAN is used, which weakens the other (distance) parameter, no distance threshold is defined here.
The key-value pairs of the dictionary $W$ correspond to the frequent items and their supports. $m$ is a frequent item whose support is not less than the minimum support. The key of an association rule is a frequent item $M$ whose confidence is not less than the minimum confidence.
• Definition 6 (Pseudo-companion Scenarios): Pseudo-companion scenarios are scenarios that already show potentially associated companionship while some important features do not fully conform to the accompanying pattern.
• Definition 7 (Tolerance Strategy): When performing trajectory data mining on a sparse data set, some parameters cannot be set strictly. Otherwise, it will be difficult to find research objects that meet the relevant conditions. For this reason, a Tolerance Strategy needs to be considered when discovering moving objects.
• Definition 8 (Traveling Companion (TC)): $Q = \{q_1, q_2, \ldots, q_n\}$ is a set of traveling companions, where a traveling companion group $q_i$ is a group for which, within the time period $S$, the number of records satisfying the potential accompanying situation is greater than the frequency threshold $\delta_f$ and the proportion of such records is greater than the percentage threshold $\delta_r$.
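The two conditions of Definition 8 can be sketched as a small predicate. This is only an illustration under stated assumptions: the function name `is_traveling_companion` is hypothetical, and the denominator of the proportion is assumed to be the total number of records of the group within the period S, which the definition does not state explicitly.

```python
def is_traveling_companion(n_accompanying, n_total, delta_f, delta_r):
    """Check the two thresholds of Definition 8 for one candidate group.

    n_accompanying: records in period S that satisfy the accompanying situation
    n_total: all records of the group in period S (assumed denominator)
    delta_f: frequency threshold, delta_r: percentage threshold
    """
    if n_total <= 0:
        return False
    ratio = n_accompanying / n_total
    # Both conditions are strict inequalities, as in the definition.
    return n_accompanying > delta_f and ratio > delta_r
```

For example, a group with 12 accompanying records out of 20 passes thresholds of 10 and 0.5, while the same 12 records out of 40 fail the percentage condition.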

Framework
Raw trajectory data generated from different sensing sources has different data formats and positional accuracies. A flexible framework named GroupSeeker is proposed to discover traveling companions from such diverse trajectory data. The framework consists primarily of a five-stage processing flow: Data Preprocessing, Spatio-temporal Clustering, Candidate Voting, Pseudo-companion Filtering and Group Merging. In addition, a series of parameter-setting strategies covering the whole processing flow is proposed to deal with different scenarios. The entire process is shown in Figure 2. Different sampling methods give trajectory data different characteristics; this paper focuses on two sampling methods, GPS and CDRs, whose data is easy to collect and available at huge scale. Data Preprocessing removes unnecessary fields from the raw trajectory data and filters noise and redundant data in the remaining fields. The entire trajectory data is then split into many sub-trajectory data sets to reduce the computational overhead. In Stage II, Spatio-temporal Clustering, trajectory data is clustered in the spatial dimension to discover the Location Potential Candidate (LPC). By clustering these LPC in the temporal dimension, the Time and Location Potential Candidate (TLPC) is discovered. The Candidate Voting stage then focuses on the accompanying frequency between each pair of users in TLPC to discover the Associated Traveling Companion Candidate (ATCC). Stage IV, Pseudo-companion Filtering, offers rules to filter confusing pseudo-companions out of ATCC. The last stage, Group Merging, merges the companion sets that share moving objects into accompanying groups with multiple objects.
Because the framework is semi-supervised, the parameter-setting strategies guide the methods in Stage II, Stage III and Stage IV toward relatively appropriate parameters.
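The Group Merging stage can be sketched with a union-find structure. This is one possible reading, not the authors' implementation: it assumes that companion sets arrive as user pairs and that merging is transitive (if A accompanies B and B accompanies C, all three form one group). The function name `merge_groups` is hypothetical.

```python
def merge_groups(pairs):
    """Merge companion pairs that share a moving object into groups."""
    parent = {}

    def find(x):
        # Locate the representative of x, shortening the path on the way.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # union the two groups

    groups = {}
    for member in parent:
        groups.setdefault(find(member), set()).add(member)
    # Sort members and groups so the output is deterministic.
    return sorted(sorted(g) for g in groups.values())

merged = merge_groups([("u1", "u2"), ("u2", "u3"), ("u4", "u5")])
```

If transitive merging is too permissive for a given data source, a stricter rule (for example, requiring every pair inside a group to be a voted companion pair) would need a different algorithm.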

Data Preprocessing
Stage I, Trajectory Data Preprocessing, aims to standardize the raw trajectory data, regardless of the type of data source. A set of standardized sample data is shown in Table 1. The raw data includes many redundant fields, such as acquisition-action number, base-station number, cell number, operator code, administrative-area code, and altitude, so several preprocessing methods are needed in this stage, including noise filtering (trajectory cleaning), trajectory segmentation, and map-matching. First, the redundant fields are discarded and the remaining fields are given uniform names, such as Time, User ID, Longitude, Latitude and an index number. When cleaning the trajectories, obvious noise points are filtered out, such as records whose value has the wrong data type for a field, records whose longitude or latitude falls outside the known geographic area, and records whose timestamp does not match the actual sampling time. In addition, median filtering is used to deal with single noise points and Kalman filtering is used to deal with continuous noise points. Applying a stay point detection method to the filtered data reveals the stay points in these trajectories, which can guide further pattern discovery. To reduce the computing scale of trajectory clustering and to mine the behavior patterns among sub-trajectory segments as fully as possible, trajectory segmentation is executed to divide the whole trajectory record set into several sub-trajectories. We split a sparse trajectory data set into 18 sub-datasets and split the Geolife trajectory data set into 19 sub-datasets according to the number of records. A part of the trajectory data is selected for map-matching to briefly verify the reliability of the trajectory data.
In addition, the filtered data largely conforms to the map and shows no large drift.
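Two of the cleaning steps described above, the geographic range check and median filtering of single noise points, can be sketched as follows. The record layout `(time, user_id, longitude, latitude)`, the bounding box, the window size of 3, and the function names are assumptions for illustration; Kalman filtering for continuous noise is not covered by this sketch.

```python
from statistics import median

def in_bounds(record, lon_range, lat_range):
    """Keep a record only if its coordinates fall inside the study area."""
    _, _, lon, lat = record
    return lon_range[0] <= lon <= lon_range[1] and lat_range[0] <= lat <= lat_range[1]

def median_filter(values, window=3):
    """Smooth a sequence with a sliding median; edges use a shrunken window."""
    half = window // 2
    return [median(values[max(0, i - half): i + half + 1]) for i in range(len(values))]

# Hypothetical records of one user: (time, user_id, longitude, latitude).
records = [
    (0, "u1", 116.30, 39.90),
    (60, "u1", 116.31, 39.91),
    (120, "u1", 116.32, 39.99),   # single spike inside the study area
    (180, "u1", 116.33, 39.92),
    (240, "u1", 416.34, 39.93),   # impossible longitude, removed by range check
]

kept = [r for r in records if in_bounds(r, (115.0, 118.0), (39.0, 41.0))]
smoothed_lat = median_filter([r[3] for r in kept])
```

The range check drops the record with the impossible longitude, and the median filter replaces the isolated latitude spike with a neighboring value while leaving the surrounding samples essentially unchanged.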

Spatio-Temporal Clustering
To find representative sub-trajectories or common behavioural tendencies across different moving users, trajectory clustering plays an important role by grouping similar trajectories. Generally, a feature vector is used to represent a trajectory, and the similarity of two trajectories can be measured by calculating the distance between their feature vectors. In previous companion discovery methods, the input of the clustering algorithm is the data in one time segment. Because complete data is difficult to collect and the data is sparse, clustering trajectory data with an unsuitable time segmentation is risky. Concretely, some trajectory data that could be clustered may fail to be clustered and may even be filtered out as noise. Therefore, approaches based on time-segmented slicing may not always be completely successful. Thus, a Spatio-temporal clustering over the location and time dimensions is proposed to solve these problems, which cause traveling companions to be omitted. Figure 3 shows the clustering process in detail. In this process, HDBSCAN is used to discover the Location Potential Candidate (LPC) and the Time and Location Potential Candidate (TLPC). LPC captures similarity in location attributes. On this basis, TLPC additionally requires similarity in the time dimension. Figure 4 shows an example of the specific process for combining data fields. Figure 4a presents a set of data samples in which the fields User ID, Latitude, Longitude and Time remain after preprocessing and a Fill field is added to these records as LPC. To discover LPC, two parameters limit the minimum cluster size and the neighborhood-distance threshold, which lets HDBSCAN produce steady and effective results and filter out noise that cannot be clustered. A Fill field is added to the LPC collection to raise the dimensionality to the required level. The value of the Fill field is set to 1 to simplify the calculation.
In each cluster of LPC, HDBSCAN is executed once to find TLPC, whose records share similar time-and-location characteristics, and to filter out some noise records. Figure 4b illustrates this process visually. In Algorithm 1, steps 4-8 show this stage at the algorithm level. Notably, the number of these filtered records could influence the promotion of satisfying records. For different research purposes, they are valued differently.
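The two-level structure of this stage (cluster by location, then cluster each location cluster by time) can be sketched in plain Python. To keep the example self-contained, HDBSCAN is replaced here by a much simpler stand-in that forms connected components of points within a distance threshold and drops components below a minimum size; it is not the algorithm used in the framework. The temporal threshold `eps_t` exists only because this stand-in needs one, and the record layout `(time, user_id, longitude, latitude)` and all function names are assumptions.

```python
import math

def threshold_clusters(items, dist, eps, min_size):
    """Connected components of the 'distance <= eps' graph; small ones are noise.

    A simple stand-in for HDBSCAN, used only to show the two-stage flow.
    """
    unvisited = list(range(len(items)))
    clusters = []
    while unvisited:
        seed = unvisited.pop(0)
        component, frontier = [seed], [seed]
        while frontier:
            i = frontier.pop()
            near = [j for j in unvisited if dist(items[i], items[j]) <= eps]
            for j in near:
                unvisited.remove(j)
            component.extend(near)
            frontier.extend(near)
        if len(component) >= min_size:
            clusters.append([items[k] for k in sorted(component)])
    return clusters

def spatial_dist(a, b):
    """Euclidean distance on (lon, lat); a geodesic distance could replace it."""
    return math.hypot(a[2] - b[2], a[3] - b[3])

def temporal_dist(a, b):
    """Absolute difference of timestamps in seconds."""
    return abs(a[0] - b[0])

def spatio_temporal_clusters(records, eps, delta_s, eps_t, delta_ts):
    """Cluster by location (LPC), then by time inside each LPC (TLPC)."""
    tlpc = []
    for lpc in threshold_clusters(records, spatial_dist, eps, delta_s):
        tlpc.extend(threshold_clusters(lpc, temporal_dist, eps_t, delta_ts))
    return tlpc

# Hypothetical records: (time_s, user_id, longitude, latitude).
records = [
    (100, "A", 116.3000, 39.9000),
    (130, "B", 116.3010, 39.9010),
    (5000, "C", 116.3005, 39.9005),  # same place, much later
    (120, "D", 117.0000, 40.5000),   # same time, far away
]
tlpc = spatio_temporal_clusters(records, eps=0.01, delta_s=2, eps_t=300, delta_ts=2)
tlpc_users = [[r[1] for r in cluster] for cluster in tlpc]
```

In this toy input, D is removed in the location step and C in the time step, leaving A and B as the only TLPC. In practice a density-based library implementation would take the place of `threshold_clusters`.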

Algorithm 1: Spatio-temporal Clustering and Companion Voting Algorithm.
Input: Trajectory record set $R$, a time period $S$; distance threshold $\varepsilon$ and size threshold $\delta_s$ for location clustering; size threshold $\delta_s^t$ for time clustering; support threshold min_sup and confidence threshold min_conf
Output: Associated users frequent-itemset set $M$
1 location potential candidate set $C_l = \Phi$;
2 location and time potential candidate set $C_l^t = \Phi$;
3 User ID set $U_{ID} = \Phi$;
4 $C_l$ ← cluster $R$ with $\varepsilon$ and $\delta_s$ during $S$;
5 foreach cluster $c_i \in C_l$ do
6     $C_l^t$ ← cluster $c_i$ with $\delta_s^t$;
7     foreach cluster $c_j^t \in C_l^t$ do
8         add the user IDs of $c_j^t$ to $U_{ID}$;
9 associated companion frequent-itemset set $M = \Phi$;
10 associated rule dictionary $W = \Phi$;
11 frequent itemset set $F = \Phi$;
12 using the FP-growth algorithm, $F$ ← find frequent itemsets in $U_{ID}$ with min_sup;
13 $W$ ← get $(f_i, con_i)$ for each item $f_i$ in $F$ whose confidence $con_i \ge$ min_conf;
14 for $(f_i, con_i) \in W$ do
15     $M$ ← get the user IDs $u_i$ from $f_i$;
16 return $M$;

Companion Voting
Mining frequent patterns aims to discover frequent temporal modes from extensive trajectory data, which can reveal common regularities or frequent paths in public trajectories. In this method, location, time and semantic information can be combined to mine the characteristics of moving objects.
Stage III uses the FP-growth algorithm, a tree-based method built on frequent items, to discover the Associated Traveling Companion Candidate (ATCC). A technique is used to shorten the time this algorithm spends searching through the suffix tree. Specifically, because FP-growth searches from frequent single-item sets up to frequent n-item sets, setting the length of the suffix to 2 focuses the search on the potential accompanying situation between two users and greatly reduces the convergence time. To further mine users with accompanying patterns, the occurrence of associations between users is worth noting. A collection containing the user set of every TLPC is taken as the target in which to find ATCC. The Total Proportion of Accompanying Frequency (TPAF) between two users in this collection is calculated by Equation (1); it is compared with a parameter and used to vote for ATCC. The Mutual Promotion of Accompanying Frequency (MPAF) is calculated using Equation (2) to judge the occurrence of the accompanying pattern between two users. It is likewise compared with a threshold to decide whether to vote for these two users as ATCC. In Algorithm 1, steps 12-15 show how to discover ATCC using FP-growth. Figure 3 shows the Companion Voting process in detail, including building the FP-tree and calculating TPAF and MPAF to obtain ATCC.
For two users X and Y whose records are to be analyzed, the corresponding TPAF is given by Equation (1) and the MPAF is obtained from Equation (2).
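Because the search is restricted to itemsets of length 2, the voting step of Algorithm 1 (steps 12-15) can be approximated by direct pair counting. The sketch below is a simplified stand-in for FP-growth, not the authors' implementation, and it uses the standard support and confidence measures of association analysis rather than the TPAF and MPAF equations. Requiring the confidence to hold in both directions is an assumption of this sketch, and the function name `vote_pairs` is hypothetical.

```python
from collections import Counter
from itertools import combinations

def vote_pairs(transactions, min_sup, min_conf):
    """Return user pairs whose support and confidence reach the thresholds.

    transactions: one collection of user IDs per TLPC cluster.
    Support is the fraction of transactions containing both users.
    Confidence is taken as the weaker of the two directions X->Y and Y->X.
    """
    n = len(transactions)
    single = Counter()
    pair = Counter()
    for users in transactions:
        unique = sorted(set(users))
        single.update(unique)
        pair.update(combinations(unique, 2))

    voted = {}
    for (x, y), count in pair.items():
        support = count / n
        conf = min(count / single[x], count / single[y])
        if support >= min_sup and conf >= min_conf:
            voted[(x, y)] = (support, conf)
    return voted

# Hypothetical TLPC user sets.
transactions = [{"A", "B"}, {"A", "B", "C"}, {"A", "B"}, {"C", "D"}]
voted = vote_pairs(transactions, min_sup=0.5, min_conf=0.8)
```

Here only the pair (A, B) is voted: it appears in three of four transactions, and each of the two users appears only together with the other. For large user sets, an FP-growth implementation avoids enumerating every pair in every transaction.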

Pseudo-Companion Filtering
Pseudo-companion Filtering provides rules for filtering out confusing pseudo-companion scenarios. These scenarios arise in the intermediate results of the previous stages when discovering Traveling Companion Patterns from the ATCC. Figure 3 shows the Pseudo-companion Filtering process in detail; the process requires the Filter Rules Set for Different Data Sources (DTS).
Due to the diversity of trajectory data, there is no generally accepted method for evaluating its sparseness. Based on the analysis of the experimental data, this article measures sparseness as the average number of records per user per unit time. The larger this value, the denser the data set; the smaller, the sparser. In general, the positioning method and the sampling scheme give a quick indication of the sparseness of a data source. Depending on the sparseness of the data source, different data types correspond to different rule sets at this stage, which affects the time cost of the stage but improves the accuracy of the results.
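As an illustration, the sparseness measure described above can be sketched in a few lines of Python. The function name `sparseness`, the choice of one hour as the time unit, and the use of the overall time span of the data set as the observation window are assumptions of this sketch, not details given in the text.

```python
from collections import defaultdict
from datetime import datetime


def sparseness(records, unit_seconds=3600.0):
    """Average number of records per user per time unit.

    records: iterable of (user_id, timestamp) with datetime timestamps.
    The observation window is the span between the earliest and latest
    timestamp in the whole data set (an assumption of this sketch).
    """
    per_user = defaultdict(int)
    times = []
    for user_id, ts in records:
        per_user[user_id] += 1
        times.append(ts)
    if not times:
        return 0.0
    span = (max(times) - min(times)).total_seconds()
    units = max(span / unit_seconds, 1.0)  # treat very short windows as one unit
    mean_records = sum(per_user.values()) / len(per_user)
    return mean_records / units


# Hypothetical records: two users observed over a two-hour window.
sample = [
    ("u0", datetime(2009, 3, 30, 0, 0)),
    ("u0", datetime(2009, 3, 30, 1, 0)),
    ("u0", datetime(2009, 3, 30, 2, 0)),
    ("u1", datetime(2009, 3, 30, 0, 0)),
    ("u1", datetime(2009, 3, 30, 2, 0)),
]
value = sparseness(sample)  # 2.5 records per user over 2 hours
```

A larger value indicates a denser data set, so the same function can be applied to two sources to decide which rule set is more appropriate.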
The sparseness of the data affects the judgment of such pseudo-companions. Therefore, it is necessary to distinguish between the pseudo-companion scenarios of different data sources. By visualizing the intermediate results of the experiments and examining the mobile data sources, we briefly distinguished the pseudo-companion scenarios for the two data types. To prevent these scenarios from interfering with the real accompanying results, the necessary filtering rules are proposed. Table 2 shows these scenarios, their descriptions, and the corresponding rules.
For different data sources, the factors that distinguish pseudo-companion scenarios differ. For example, over a long-term period, users from the ATCC may have few accompanying records in sparse sample sets. Such a pair may still be judged to be accompanying, but it may equally be identified as a short-term encounter.
In Algorithm 2, steps 1-20 show this stage. Among them, steps 1-2 are short pseudo codes of this stage. Steps 4-20 clearly show one of the rule sets, which is a scenario of filtering brief contact in the intermediate result set from the sparse data source.

Brief Contact
There is a small amount of close contact in the total record of A or B within a small area.
The number of pseudo-accompanying records is small and the total number of records is relatively small. If either is less than the relative threshold, the two objects are filtered.

Traveling Users Data
No-contact
There is almost no close contact in the total record of A or B within a small area.
Pseudo-accompanying cases account for only a small proportion of the records, but the distance between the central geographic locations of the two objects is within the signal range of a base station.

Brief Encounter
A and B have frequent contacts within a small area during a short-term period.
The time span of the accompanying records is short. The directions of these moving objects change after these records. There is no accompanying record over a long-term period.
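To make the brief-contact rule more concrete, the following Python sketch shows one possible reading of it. The function names, the parameters `min_records` and `min_ratio`, and the exact form of the test are hypothetical; the rule set in Algorithm 2 may differ in its details.

```python
def is_brief_contact(n_contact, total_a, total_b, min_records, min_ratio):
    """Return True when a candidate pair should be filtered as brief contact.

    n_contact:   number of records in which A and B are in close contact.
    total_a/b:   total number of records of A and of B.
    min_records: hypothetical lower bound on a user's total record count.
    min_ratio:   hypothetical lower bound on the share of contact records.
    """
    if total_a == 0 or total_b == 0:
        return True
    # Too few records overall: the pair cannot be judged reliably.
    if min(total_a, total_b) < min_records:
        return True
    # Contact records must be a sufficient share of EACH user's records.
    return min(n_contact / total_a, n_contact / total_b) < min_ratio


def filter_candidates(candidates, min_records, min_ratio):
    """Keep only pairs that are not classified as brief contact.

    candidates: dict mapping (user_a, user_b) to (n_contact, total_a, total_b).
    """
    return {
        pair: stats
        for pair, stats in candidates.items()
        if not is_brief_contact(*stats, min_records, min_ratio)
    }


# Hypothetical intermediate results from the voting stage.
atcc = {
    ("u0", "u3"): (40, 50, 60),    # most records are shared
    ("u1", "u2"): (3, 100, 120),   # only a handful of shared records
    ("u5", "u6"): (4, 5, 60),      # u5 has very few records overall
}
kept = filter_candidates(atcc, min_records=10, min_ratio=0.3)
```

Raising or lowering the two thresholds is where the strictness of the rule set would be tuned for a sparse or dense data source.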

Group Merging
The purpose of Stage V is to discover Traveling Companions that include multiple users rather than only a pair of two users. A set containing multiple users is regarded as a group. Figure 3 shows the Group Merging process in detail. It is necessary to identify whether a group accompanying situation exists and, if so, to merge the users involved. In Stage III, which discovers Associated Traveling Companion Candidates (ATCC), a trick is used to reduce the computational overhead and the convergence time, which restricts the scenarios considered to pairs of users. However, the actual number of accompanying users may be larger, as with tourist groups, participants in group activities, or a family of three out shopping. From the designer's perspective, the final stage therefore needs to further mine the multiple-user situations that may exist among the traveling companion candidates discovered previously. If users pass Pseudo-companion Filtering and are retained, they are merged using their common subsets. For instance, for the set {{u 0 , u 3 }, {u 3 , u 4 }}, because the two sub-items contain the common subset {u 3 }, we merge the two sub-items and remove the remaining proper subsets. Finally, the set becomes {{u 0 , u 3 , u 4 }}. In Algorithm 2, steps 19-25 show this process in the last stage.
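The merging step can be sketched in Python as follows. This is a minimal illustration rather than the procedure of Algorithm 2: it assumes that merging is transitive, so any chain of pairs with shared users ends up in one group. The function name `merge_groups` is hypothetical.

```python
def merge_groups(pairs):
    """Merge user sets that share at least one user.

    pairs: iterable of sets (or tuples) of user IDs that passed filtering.
    Returns a list of pairwise disjoint sets.
    """
    groups = []
    for pair in pairs:
        merged = set(pair)
        remaining = []
        for group in groups:
            if group & merged:
                merged |= group          # absorb every overlapping group
            else:
                remaining.append(group)  # keep unrelated groups unchanged
        remaining.append(merged)
        groups = remaining
    return groups


# The example from the text: {u0, u3} and {u3, u4} share u3.
merged = merge_groups([{"u0", "u3"}, {"u3", "u4"}])
```

Because the groups kept between iterations are always disjoint, a single pass over the existing groups per new pair is enough, and the absorbed pairs disappear as separate entries.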

Parameter Setting Strategy
Since many factors must be considered when discovering traveling companion patterns from real-world data samples, parameter settings are involved in three significant stages of this framework (Spatio-temporal Clustering, Companion Voting, and Pseudo-companion Filtering). Some of these algorithms are clearly semi-supervised. We have reduced the number of parameters and simplified their use as much as possible when implementing the important algorithms; for example, we use HDBSCAN rather than DBSCAN as the clustering algorithm. Nevertheless, it remains necessary to optimize the existing parameters and to establish a set of strategies. A suitable parameter strategy boosts the effectiveness and efficiency of the method and reduces its learning cost. All parameter notations are listed in Table 3.

• General Strategy: The general strategy is explained here in order to highlight the tolerance strategy. First, the haversine formula calculates the distance between two points from their latitudes and longitudes. Many clustering algorithms include a parameter called "metric", which can be set to "haversine". Secondly, for discovering Traveling Companions, the minimum cluster size should be larger than 3 to reduce the number of clusters. Moreover, for the support-and-confidence setting, Table 4 shows a preliminary correspondence between participation and confidence level. We want to guarantee a higher confidence level, so the default confidence value in this study is 0.6. For the support level, we focus on the frequency with which the target objects appear together and do not necessarily require a ratio. Finally, keeping the distance threshold parameters consistent is important for the consistency of the results on a data set. For instance, ε and δ d are set to the same value in consideration of the sampling accuracy at different stages. If the application requires stricter filtering, the latter parameter should be set smaller.

• Tolerance Strategy: Compared with the strictness of the general strategy, the tolerance strategy provides good support for data sets from some special data sources, such as CDRs. It is difficult to give a clear value range for some parameters across various data sets, and the proposed tolerance strategy guides users to relax some parameter settings according to the purpose of mining. The motivation for this strategy is that, for data samples with higher sparseness, strict threshold constraints are bound to make the result set very small.
In fact, this strategy is motivated mainly by the practicality of the results. In this field, the sparseness of trajectory data has always been a major challenge. At the same time, it is difficult for some specific data sources to collect the data of all users in a specific geographic area over a long period of time. The resulting sparseness of real-world data is therefore reasonable and unavoidable. For this reason, researchers should make full use of every recorded piece of information (except obvious noise). Specifically, in some important scenarios, such as mining the behavior patterns of specific groups and individuals to discover the traveling companion pattern, various factors sometimes disturb the collection process so that the data become sparse. In this case, the tolerance strategy can better prevent some records from being strictly filtered out, which makes it more likely that other related moving objects are found. In our study, it is important to consider tolerance for δ f and δ r in data source D1. These two parameters can be set to larger values to limit the confusing scenarios, such as when only a small number of records are related and most of the records are far apart, or when the number of records of an object is so small that it should be filtered out.
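The haversine distance mentioned in the general strategy can be written out in plain Python as a minimal sketch. The function name `haversine_m` and the mean Earth radius of 6,371 km are choices made for this illustration.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in meters (assumed value)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


# One degree of longitude along the equator.
d = haversine_m(0.0, 0.0, 0.0, 1.0)
```

When a library offers "haversine" as a built-in metric, it is worth checking the expected units. In scikit-learn, for example, the haversine metric takes coordinates in radians as (latitude, longitude) and returns an angular distance in radians, so a distance threshold given in meters would have to be divided by the Earth radius before being passed as a parameter.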

Experiment and Results
All the algorithms are implemented in Python 3.8.2 using PyCharm and are run on computers with an Intel Core i7-8550U CPU at 1.80 GHz, 16.0 GB of RAM, and Windows 10.

Data Sets
Based on the two real-world data sets, various sample sets are extracted according to different criteria. The criteria are as follows:
• The Sampling Frequency
It is essential to choose suitable data sets. After data preprocessing, D1 and D2 are divided into many subsets according to the number of records. For instance, we split the Geolife data set into 19 subsets of 800,000 records each. From these subsets, we choose 5 subsets from D1 and 5 from D2, which are shown in Table 4. Notably, a simulated data set called Sim1 is generated based on a subset from D2. Sim1 is regarded as a subset of the combined real-world and simulated source D3.
Although Sim1 is small, it contains two simulated companion users that we added for a particular user, which makes it possible to verify the effectiveness of the algorithm quickly. The two simulated trajectories are generated from the trajectory of a real user, in particular from a simple description of its state changes, the most basic of which is the change of direction in the two dimensions of latitude and longitude. By recording a state-change matrix, the basic state changes can be learned and the traveling companions' behavior can then be simulated.
Except for Sim1 and Set5, the subsets have similar data sizes and similar numbers of records. Ten sample subsets (Set1-Set5 and Geo1-Geo5) are randomly selected from D1 and D2, respectively, to compare the impact of sparse and dense real-world data on the algorithm results. Set5, whose size is about half that of the remaining nine sample sets, is used to show the effect of data size on the method. At a scale of 800,000 records, our experimental environment comes close to its memory limit.
Figures 6 and 7 show intermediate results, some of which are situations that need to be filtered out. For ease of display, we select the rule set of one data type, namely the two scenario types in the sparse data set. In the sub-figures of Figure 6, there is brief contact between two users, but for one of the two users the records representing the contact do not make up a large proportion of the total number of records. Hence, they are filtered by the rule set. The users in Figure 6b can be regarded as a no-contact scenario because they have few records showing close contact. Figure 6d is a partial enlargement of Figure 6c; the close-contact records between the two users still account for too small a share, so they are not considered real companions satisfying the proportion of records. The sub-figures in Figure 7 show cases that satisfy the filtering rules. Among them, Figure 7a is the result for Sim1, which includes three users who move together in a small area. Figure 7d is a partial enlargement of Figure 7c.
Table 5 highlights the time overhead of the 10 data subsets in the framework from Stage II to Stage V. Stage II is evidently the most time-consuming stage for all 10 data subsets, and its overhead differs greatly between D1 and D2. The time overhead in Stage III is affected by the scale of the data subset. In Stage V, when the previous stage yields no multiple targets, the time overhead is 0. In addition, we use the average number of records per user to show the sparseness of each data subset. D1 is clearly sparser than D2, so the parameter settings should not be too strict for D1; otherwise, it would be difficult to discover TC. In practice, the filter rule applied to D1 filters brief contact. Since this rule set is not used for D2, which is densely sampled, the time cost of Stage IV is 0 for D2. Finally, the time overhead of Group Merging is so much shorter than that of the other stages that it is negligible at this scale of data set. Therefore, the overhead of Stage V is not shown in Table 6. The related parameter settings are shown in Table 7. Distances are measured in meters, and time thresholds are measured in seconds.

Significant Result Analysis
The number of TC in each data subset is shown in Table 5. Some subsets produce few or no results, which matches real-world data scenarios that contain no accompanying pattern. In the following, some special and meaningful TC are presented by visualizing the experimental results. For instance, u0, u3, u4, and u30 are recommended from Geo3 as a TC. During the long-term period covered by Geo3, all of them move through the road network of this geographic area within a close period of time. Therefore, their trajectories, which are shown in Figure 8, are very similar and overlap to a high degree. The main difference, shown in Figure 8a, is that a small part of the trajectories differs, which may result from a short-term separation or from a certain amount of data loss caused by differences in the positioning signal. On the other hand, u0 and u3 have identical records over a long-term period. We further checked the undivided D2 data set to verify this situation and found that their records are identical from 0:52 on 30 March 2009 to 2:58 on 5 July 2009. It is therefore reasonable to guess that this is an individual carrying two mobile devices, a finding that could support the management of special objects, such as focusing on individuals or groups with sensitive behaviors. The sample data is shown in Table 8.

Conclusions and Discussion
At present, mobile positioning devices such as navigation devices, smart wearable devices, and smart infrastructure are increasingly popular in daily life. LBS has become an important element that is indispensable for most people. Locatable devices and LBS provide sufficient conditions for generating a massive amount of mobile trajectory data. Traveling companion discovery is widely used as an important method for finding accompanying behavior patterns. However, given the current information explosion and the diversity of sampling methods, the applicability and efficiency of the method need to be improved as much as possible.
Thus, as a basic supporting technology for many trajectory data mining applications, this paper proposes an applicable framework, GroupSeeker, to discover traveling companions in vast spatio-temporal data. The framework consists of a five-stage processing flow, and the core algorithms lie in three stages: Spatio-temporal Clustering, Companion Voting, and Pseudo-companion Filtering. GroupSeeker avoids the problem that useful clusters are treated as noise because of poor time segmentation. Besides, considering the different sparseness of data sources, parameter setting strategies are proposed to improve the reliability of the framework and reduce its learning cost. Moreover, a set of imperfect but effective methods for filtering confusing scenarios is proposed. In practice, the parameters of GroupSeeker can be set according to the purpose of mining and the specific scenario. Finally, the framework is evaluated on several real-world data sets with different sparsity and data sizes. The experimental results show its practical efficiency and stability.
In the future, more attention can be given to how to effectively extract the features of pseudo-companion scenarios. Besides, the framework needs to further reduce the number of parameters and simplify the parameter-setting strategies. In addition, if the entire framework is upgraded with a high-performance parallel and distributed computing solution to reduce the time overhead of the Clustering Stage, the efficiency of the whole framework will improve further. Moreover, in future work we plan to use a large amount of labeled accompanying trajectory data combined with machine learning methods to develop more detailed rules and algorithms for the Pseudo-companion Filtering stage.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: