In this section, we first introduce the data set used in this research. We then present the problem and framework of group pattern mining, and describe the proposed similarity-based group movement pattern mining method in detail.

#### 3.2. Data Preprocessing

In this paper, we focus on analyzing the travel behaviors of group tourists and individual tourists in Hainan province. First, we need to extract the trajectories of the tourists. Although tourist identification is non-trivial, we simplify the problem by assuming that tourists are those users whose mobile phone’s home location is a province other than Hainan. CDR data contain a field that indicates which province the user belongs to, so we first remove the users for whom this field indicates Hainan and regard the remaining users as tourists. We also discard the trajectory data of users with fewer than 100 records per month, since such sparse data may not contribute to the analysis.

When a mobile phone is in the overlapping area between adjacent cells, it may switch between the two cells even though the user’s location has not changed. This phenomenon, called the ping-pong effect, leads to abnormal trajectories in the data. To eliminate such noise, we follow [43] to detect and remove oscillation records. After that, we identify stay points from the raw trajectories.

**Definition 1.** (Stay Point): A stay point $sp$ represents a location characterized by a sequence of consecutive points in the raw trajectory data which is limited by both temporal and spatial constraints.

For a given trajectory $T=\{{p}_{1},{p}_{2},...,{p}_{k}\}$, a stay point $sp$ is defined as the centroid of a sub-trajectory ${T}_{sub}=\{{p}_{i},...,{p}_{j}\}$, $1\le i\le j\le k$, which satisfies two conditions: the distance between any two points in ${T}_{sub}$ is less than a threshold ${d}_{th}$, and the time interval between ${p}_{i}$ and ${p}_{j}$ is greater than a threshold ${\tau}_{th}$.

A stay point $sp$ is generated from ${T}_{sub}$ and can be denoted as $s{p}_{sub}=\{u,x,y,t,du\}$, where u is the id of the user, x and y are the longitude and latitude of the centroid of ${T}_{sub}$, t is the timestamp of the first point of ${T}_{sub}$ i.e., ${p}_{i}$, and $du$ is the time interval between ${p}_{i}$ and ${p}_{j}$ in ${T}_{sub}$, which indicates how long a user stays in this region.

The points which do not satisfy the conditions in Definition 1 are called pass-by points. After stay points are identified from the raw data, the trajectory of a user becomes a sequence of pass-by points separated by stay points.
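As an illustration of Definition 1, the following is a minimal sketch of stay point extraction, not the implementation used in this work. It assumes a time-ordered trajectory, great-circle (haversine) distance in metres, and a centroid computed as the arithmetic mean of longitude and latitude; the names `Point`, `haversine_m` and `stay_points` are hypothetical.

```python
from dataclasses import dataclass
from math import radians, sin, cos, asin, sqrt


@dataclass
class Point:
    x: float  # longitude in degrees
    y: float  # latitude in degrees
    t: float  # timestamp in seconds


def haversine_m(a, b):
    """Great-circle distance in metres between two points (assumed metric)."""
    lon1, lat1, lon2, lat2 = map(radians, (a.x, a.y, b.x, b.y))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371000.0 * asin(sqrt(h))


def stay_points(user, traj, d_th, tau_th):
    """Return stay points as (u, x, y, t, du) tuples for a time-ordered trajectory."""
    result = []
    i, k = 0, len(traj)
    while i < k:
        j = i
        # Grow the window while the next point stays within d_th of every point in it.
        while j + 1 < k and all(haversine_m(traj[j + 1], p) < d_th for p in traj[i:j + 1]):
            j += 1
        du = traj[j].t - traj[i].t
        if du > tau_th:
            window = traj[i:j + 1]
            cx = sum(p.x for p in window) / len(window)
            cy = sum(p.y for p in window) / len(window)
            result.append((user, cx, cy, traj[i].t, du))
            i = j + 1  # continue after the stay region
        else:
            i += 1  # p_i is treated as a pass-by point
    return result
```

The pairwise distance check follows the wording of Definition 1 directly; for long windows, comparing only against the first point of the window is a common cheaper approximation.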

#### 3.3. Candidate Groups Filtering

After data preprocessing, we can calculate the similarity between tourists to identify tourist groups whose members are highly similar to one another. A group of tourists traveling together must have some features in common, such as trajectory and accommodation. Our algorithm aims to discover tourist groups whose members behave similarly to each other. However, with a large number of tourists, computing the similarity between all pairs of tourists is prohibitively expensive in both time and memory. To solve this problem, before similarity measurement we apply a frequent itemset mining method to filter out the many trajectories of tourists who cannot belong to any tourist group. This step yields candidate groups whose members appeared together in at least K snapshots of the trajectories. Consequently, similarity needs to be computed only among tourists within a candidate group rather than among all tourists, which greatly reduces the computational cost.

Since the stay points of the members of one tourist group should be close to each other in space most of the time, the frequent itemset mining method lets us remove the trajectories of tourists who do not satisfy this condition and who may be individual tourists.

At first, we divide a trajectory into a sequence of snapshots by time. Let $TI$ be the interval of snapshots, $Ts$ be the time span of the considered group movement patterns. Snapshot set S is a sequence of snapshots $\{{s}_{1},{s}_{2},...,{s}_{i},...,{s}_{M}\}$, $M=Ts/TI,i=1,...,M$. Each element ${s}_{i}$ can be expressed as a set of stay points in trajectory, i.e., ${s}_{i}=\{s{p}_{i,1},s{p}_{i,2},...,s{p}_{i,j},...,s{p}_{i,n}|\phantom{\rule{4pt}{0ex}}s{p}_{i,j}.t\in [{t}_{i},{t}_{i}+TI]\}$, where ${t}_{i}$ is the start time of the i-th snapshot, $s{p}_{i,j}$ is the j-th stay point whose timestamp is within the interval of the snapshot i and n is the total number of stay points in the snapshot.

Because CDR data are sparse and trajectories are sampled at inconsistent rates, the snapshot interval $TI$ needs to be long enough that trajectory points whose timestamps differ by no more than $TI$ can fall into the same snapshot.
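The division into snapshots can be sketched as follows. This is a minimal illustration under stated assumptions: stay points are dictionaries with a timestamp field `t`, the start time `t0` of the first snapshot is given, and half-open intervals $[{t}_{i},{t}_{i}+TI)$ are used so that each stay point lands in exactly one snapshot (the definition above uses a closed interval). The function name `build_snapshots` is hypothetical.

```python
def build_snapshots(stay_points, t0, TI, Ts):
    """Assign stay points to M = Ts / TI consecutive snapshots of length TI."""
    M = int(Ts // TI)
    snapshots = [[] for _ in range(M)]
    for sp in stay_points:
        idx = int((sp["t"] - t0) // TI)  # index of the snapshot containing sp.t
        if 0 <= idx < M:                 # ignore points outside the time span Ts
            snapshots[idx].append(sp)
    return snapshots
```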

**Definition 2.** (Collection): A collection c is a group of stay points in a snapshot within a distance threshold ${d}_{c}$. The snapshot ${s}_{i}$ can be expressed in collections as ${s}_{i}=\{{c}_{i,1},{c}_{i,2},...,{c}_{i,j},...,{c}_{i,m}\}$, where ${c}_{i,j}$ is the j-th collection in ${s}_{i}$ and m is the total number of collections in ${s}_{i}$.

For example, as shown in Figure 2, ${s}_{1}$ to ${s}_{7}$ denote seven consecutive snapshots, each of which contains the stay points whose timestamps fall within the interval of that snapshot. The objects in a snapshot that lie within a distance threshold of each other belong to one collection, and a snapshot may contain several collections. In ${s}_{5}$, for instance, there are two collections, ${c}_{5,1}=\{O1,O3\}$ and ${c}_{5,2}=\{O2,O4\}$. In this work, we apply a density-based clustering method (DBSCAN) to each snapshot to obtain the collections of objects: objects that are close enough to each other are clustered into one collection.
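To make the collection step concrete, the sketch below clusters the objects of one snapshot with a small pure-Python DBSCAN. It is illustrative only: the coordinates are assumed to be planar (already projected), the neighbourhood radius `eps` plays the role of ${d}_{c}$, and the minimum cluster density `min_pts=2` is an assumed setting, not a value reported here. The names `dbscan` and `collections_of` are hypothetical.

```python
from math import hypot


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    n = len(points)
    labels = [None] * n

    def neighbours(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j in range(n)
                if hypot(points[i][0] - points[j][0], points[i][1] - points[j][1]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins the cluster, not expanded
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb = neighbours(j)
            if len(nb) >= min_pts:  # core point: keep expanding the cluster
                queue.extend(nb)
    return labels


def collections_of(snapshot, eps, min_pts=2):
    """snapshot maps object id -> (x, y); returns a list of sets of object ids."""
    ids = list(snapshot)
    labels = dbscan([snapshot[o] for o in ids], eps, min_pts)
    groups = {}
    for obj, label in zip(ids, labels):
        if label != -1:
            groups.setdefault(label, set()).add(obj)
    return list(groups.values())
```

With a snapshot shaped like ${s}_{5}$, two nearby pairs and one isolated object give two collections, and the isolated object is left out as noise.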

By manually observing the trajectories of some groups, we find that even when a group of tourists are traveling companions, they do not stay together all the time. In some situations, some members of the group may leave and not come back during the next several snapshots. A strict continuity requirement would therefore cause real groups to be missed. The problem thus becomes how to discover the members of a group who appear in the same collection for at least K possibly non-consecutive snapshots. We formulate this as a market basket analysis problem.

Market basket analysis is a modelling technique that mines the associations between different items. For example, people who buy bread may also buy butter, so bread and butter often occur together on receipts. Such problems are formulated as mining frequent itemsets from transaction records. In our study, tourists and collections are viewed as items and transactions, respectively. We aim to find the tourist groups (itemsets) which occur frequently in the collections and satisfy the support threshold. We therefore adopt FP-growth, an efficient method proposed by Han et al. [44], to mine the complete set of frequent itemsets.

With the minimum number of snapshots K and the minimum size M, the FP-growth algorithm finds groups containing at least M members that traveled together for at least K possibly non-consecutive snapshots. A candidate group ${g}_{i}$ is denoted as ${g}_{i}=\{size,object1|object2|...|objectN,frequency\}$, where $frequency$ is the number of times that ${g}_{i}$ occurred in collections. However, the itemsets obtained from FP-growth are not necessarily closed frequent itemsets, which increases the subsequent computation. To solve this problem, we define a filtering rule: for candidate groups ${g}_{i}$ and ${g}_{j}$, if ${g}_{i}\subset {g}_{j}$ and the support of ${g}_{i}$ equals the support of ${g}_{j}$, then ${g}_{i}$ is removed from the result set. (The support of a subset can never be lower than that of its superset, so equality is the only case in which the smaller group is redundant.) After this step, the candidate groups in the result set are guaranteed to be closed.
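The mining and filtering steps can be sketched as below. For brevity, this sketch replaces FP-growth with brute-force enumeration of item combinations, which yields the same frequent itemsets on small inputs but does not scale; it is a stand-in, not the method used here. The closedness filter uses the standard definition: a group is closed if no proper superset has the same support. Function names are hypothetical.

```python
from itertools import combinations
from collections import Counter


def frequent_groups(transactions, min_size, min_support):
    """Count every itemset of at least min_size items; keep those meeting min_support.

    transactions: iterable of sets of tourist ids (one set per collection).
    Brute-force stand-in for FP-growth, suitable only for small collections.
    """
    counts = Counter()
    for transaction in transactions:
        items = sorted(transaction)
        for r in range(min_size, len(items) + 1):
            for combo in combinations(items, r):
                counts[frozenset(combo)] += 1
    return {g: c for g, c in counts.items() if c >= min_support}


def closed_only(groups):
    """Drop every group that has a proper superset with the same support."""
    return {g: c for g, c in groups.items()
            if not any(g < h and c == ch for h, ch in groups.items())}
```

For example, if $\{a,b,c\}$ occurs in two collections and $\{a,c\}$ also occurs in exactly those two, $\{a,c\}$ carries no extra information and is removed, whereas $\{a,b\}$ with a higher support is kept.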

#### 3.4. Similarity Measurement

After filtering out the trajectories of tourists who cannot belong to a tourist group, we obtain a number of closed candidate groups. To discover real tourist groups, we propose a similarity measurement that takes four features into account. In most cases, the members of the same tourist group follow the same travel routes, stay in the same hotel, come from the same province, and spend the same amount of time in Hainan. We therefore use four features to measure the similarity of tourists: the trajectory, the accommodation, the attribution (the home location of the mobile phone) and the number of days stayed in Hainan. The similarity of tourists a and b is defined as a vector:

$$Sim(a,b)=(Tsim(a,b),Asim(a,b),Nsim(a,b))$$

where $Tsim(a,b)$ is the trajectory similarity of a and b, $Asim(a,b)$ is the accommodation similarity and $Nsim(a,b)$ is the similarity of the other two features.

#### 3.4.1. Trajectory Similarity

In this part, we perform trajectory similarity measurement for each pair of tourists in the same candidate group.

Because tourists in one group may not stay together all the time and the points making up CDR trajectories are sparse, their travel routes and the places they visit may differ within a local area.

Figure 4 illustrates the trajectories of two tourists belonging to the same group. Although they traveled together in Haikou city, their trajectories are not similar to each other in some areas. There are 9 points in their trajectories at which they stayed together. The trajectory of tourist a (green line) has more sampling points than that of tourist b from ${m}_{1}$ to ${m}_{4}$, which causes the trajectories to differ in these areas. Another problem visible in Figure 4 is that the two tourists took different travel routes between ${m}_{5}$ and ${m}_{6}$. In such cases, existing trajectory similarity algorithms such as LCSS, DTW and ED are not suitable. We therefore design a trajectory similarity measurement method for CDR trajectories, which is shown in Algorithm 1.

**Algorithm 1:** Trajectory Similarity

**Input**: $trajectory\phantom{\rule{4pt}{0ex}}{T}_{a},trajectory\phantom{\rule{4pt}{0ex}}{T}_{b},{\delta}_{t},{\delta}_{d}$
**Output**: $Tsim(a,b)$

The core idea of the algorithm is to divide each trajectory into sub-trajectories at the matching points of the two trajectories, and then to measure the similarity of the two trajectories based on the distance between the centroids of each pair of sub-trajectories. For the two trajectories in Figure 4, the trajectory similarity calculated by the proposed algorithm is 0.693, compared with 0.455 for LCSS, 0.102 for ED and 0.183 for normalized DTW. These much lower scores for two tourists known to be in the same group indicate that the existing algorithms are not suitable in this situation.

**Definition 3.** (Matching Point): Given trajectories ${T}_{a}$ and ${T}_{b}$, let $s{p}_{i}$ and $s{p}_{j}$ be stay points of ${T}_{a}$ and ${T}_{b}$ respectively, and let ${\epsilon}_{t}$ be a time threshold. $s{p}_{i}$ and $s{p}_{j}$ are called matching points if:

- (1) $|s{p}_{i}.t-s{p}_{j}.t|<{\epsilon}_{t}$
- (2) $dis(s{p}_{i},s{p}_{j})<{d}_{th}$

where $dis(\cdot,\cdot)$ is the distance between the two points.

The algorithm consists of two functions. The function TraSimilarity (Line 1–20) finds the matching points and then divides the trajectories into sub-trajectories at the matching points. ${M}_{a}$ and ${M}_{b}$ are the sets of matching points in ${T}_{a}$ and ${T}_{b}$. The function Measure (Line 21–29) judges whether two sub-trajectories are similar by calculating the distance between their centroids. First, we find the matching points of the two trajectories (Line 5). To avoid one point being matched with two different points in the other trajectory, we choose the first matching point in the other trajectory as the start of the sub-trajectory (Line 6–7). To measure the similarity of sub-trajectories with different sampling rates, we merge the stay points on each sub-trajectory into a centroid and then use the distance between the two centroids to estimate the similarity of the sub-trajectories (Line 25–27). In this way, the issue caused by different sampling rates is mitigated. When the distance between the centroids of the sub-trajectories is within ${\epsilon}_{d}$, the two tourists are considered traveling companions on that sub-trajectory. Finally, the trajectory similarity of tourists a and b is given by:

$$Tsim(a,b)=\frac{w}{v}$$

where v is the number of sub-trajectories in the entire trajectories and w is the number of sub-trajectories on which tourists a and b are considered traveling companions.
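The procedure described above can be sketched as follows. Since the pseudocode of Algorithm 1 is not reproduced in this section, this is only one possible reading of the description, not the authors' implementation. It assumes stay points given as `(x, y, t)` tuples in planar coordinates and sorted by time, sub-trajectories that start at a matching point and run up to (but exclude) the next one, and that points before the first matching point are ignored. All function names are hypothetical.

```python
from math import hypot


def centroid(points):
    """Arithmetic mean of the (x, y) coordinates of a non-empty list of stay points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def matching_points(Ta, Tb, eps_t, d_th):
    """Pair each stay point of Ta with the first unused stay point of Tb that
    satisfies both conditions of Definition 3; returns index pairs (i, j)."""
    matches, next_j = [], 0
    for i, (xa, ya, ta) in enumerate(Ta):
        for j in range(next_j, len(Tb)):
            xb, yb, tb = Tb[j]
            if abs(ta - tb) < eps_t and hypot(xa - xb, ya - yb) < d_th:
                matches.append((i, j))
                next_j = j + 1  # a point of Tb is matched at most once
                break
    return matches


def tsim(Ta, Tb, eps_t, d_th, eps_d):
    """Fraction w / v of sub-trajectory pairs whose centroids are within eps_d."""
    matches = matching_points(Ta, Tb, eps_t, d_th)
    if not matches:
        return 0.0
    bounds = matches + [(len(Ta), len(Tb))]
    w = 0
    for (i0, j0), (i1, j1) in zip(bounds, bounds[1:]):
        ca, cb = centroid(Ta[i0:i1]), centroid(Tb[j0:j1])
        if hypot(ca[0] - cb[0], ca[1] - cb[1]) < eps_d:
            w += 1
    return w / len(matches)  # v = number of sub-trajectory pairs
```

Because each sub-trajectory is reduced to a single centroid, a tourist with many sampled points and a companion with few points along the same stretch can still be judged similar, which is the motivation given above for this design.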

#### 3.4.2. Accommodation Similarity

When measuring the similarity of tourists, we consider not only their trajectories but also the places where they stayed at night. Group movement patterns of tourists have distinct characteristics compared with other kinds of group movement patterns. For example, in the peak season, thousands of tourists crowd into famous scenic areas in Hainan during the same period, so the daytime trajectories of different tourist groups overlap. It is therefore hard to distinguish different tourist groups by trajectory data alone. We identify the places where tourists stayed at night to measure their similarity in accommodation. Generally speaking, a group of tourists will stay in the same place (a hotel or a residence) at night, which makes accommodation an important feature for measuring the similarity of the tourists in a group.

The first step is to identify the lodgings where tourists stayed each night. We define 21:00 to 9:00 as $Hometime$, on the assumption that tourists spend most of this period in their lodgings. In the algorithm, we find the stay point with the longest duration during $Hometime$ on each night and identify it as the tourist’s lodging for that night. Suppose the length of stay of tourists a and b in Hainan is z nights, calculated from the timestamp difference between the first and the last records of each tourist. The lodgings of a tourist in Hainan are then denoted as a sequence of lodgings with dates attached, e.g., ${H}_{a}=\{{h}_{a1},{h}_{a2},...,{h}_{az}\}$ for tourist a. We define the accommodation similarity of tourists a and b as

$$Asim(a,b)=\frac{samelod}{z}$$

where $samelod$ denotes the number of nights, out of the z nights, on which a and b stayed at the same lodging. Since we cannot identify which specific hotel a tourist stays in when multiple hotels lie within the coverage area of the same base station, accommodation similarity is only one part of the similarity measurement.
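A minimal sketch of the lodging identification and the accommodation similarity follows. It makes several assumptions that are not specified above: a lodging is identified by the id of its base station cell, a stay point counts toward $Hometime$ if its start time falls between 21:00 and 9:00, and a night is labeled by the date on which it began. The names `night_of`, `lodgings` and `asim` are hypothetical.

```python
from datetime import datetime, timedelta


def night_of(ts):
    """Return the date on which the night containing ts began, or None if ts
    lies outside Hometime (21:00 to 9:00)."""
    if ts.hour >= 21:
        return ts.date()
    if ts.hour < 9:
        return (ts - timedelta(days=1)).date()
    return None


def lodgings(stay_points):
    """stay_points: iterable of (cell_id, start: datetime, duration_seconds).
    Returns {night: cell_id}, keeping the longest Hometime stay for each night."""
    best = {}
    for cell, start, du in stay_points:
        night = night_of(start)
        if night is not None and (night not in best or du > best[night][1]):
            best[night] = (cell, du)
    return {night: cell for night, (cell, _) in best.items()}


def asim(H_a, H_b, z):
    """Share of the z nights on which both tourists have the same lodging."""
    if z <= 0:
        return 0.0
    samelod = sum(1 for night, cell in H_a.items() if H_b.get(night) == cell)
    return samelod / z
```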

#### 3.4.3. The Similarity of other Features

Besides similarity measurements mentioned above, there are also other features related to travel behaviors that can be used to measure similarity. In this part, we extract the other two features which can help us identify the relationship between tourists in the groups, then combine the two features.

The first feature is the mobile phones’ home locations of the tourists, which can help to discover groups from a certain province. In CDRs, each record has a field to indicate the mobile phones’ home locations, for example, “301” represents Guangdong Province, “302” represents Shandong Province and so on. Utilizing this field we can easily discover tourists from the same province. Two tourists traveling together within the same tourist group are more likely to be from the same province.

The second feature is the number of days the tourists spent in Hainan, which is also important for distinguishing different tourist groups. The arrival and departure times of the members of a group are usually consistent, so the number of days they spend in Hainan is usually the same. We use the timestamp difference between the first and the last records of each tourist to calculate the maximum number of continuous days in Hainan as his/her second feature.

For tourists a and b, a feature is considered matched when it has the same value for both tourists. We measure the similarity of the two features of tourists a and b by the following equation:

$$Nsim(a,b)=\frac{matfeas}{allfeas}$$

where $allfeas$ is the total number of features, which is 2 here, and $matfeas$ is the number of matched features of a and b.
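As a small illustration, the feature-matching ratio can be computed as below. The feature names `home` and `days` and the function name `nsim` are hypothetical; each tourist is assumed to be described by a dictionary holding the home-location code and the number of days spent in Hainan.

```python
def nsim(features_a, features_b):
    """Ratio of matched features (matfeas) to all features (allfeas)."""
    allfeas = len(features_a)
    if allfeas == 0:
        return 0.0
    matfeas = sum(1 for name, value in features_a.items()
                  if name in features_b and features_b[name] == value)
    return matfeas / allfeas
```

For two tourists who share the home-location code "301" but stayed 5 and 4 days respectively, one of the two features matches.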

#### 3.5. Identify Group Tourists

After obtaining the similarity of tourist a and b which is denoted as $Sim(a,b)=(Tsim,Asim,Nsim)$, we need to judge whether a and b are a pair of traveling companions or not by the similarity vector.

We use two different methods to determine which pairs of tourists are traveling companions. The first is a threshold-based method, which filters out the tourists who have low similarity with the others in a candidate group. We define $totalsim={w}_{1}*Tsim+{w}_{2}*Asim+{w}_{3}*Nsim$, where ${w}_{1},{w}_{2},{w}_{3}$ are the weights of the three components of the similarity vector, with ${w}_{1}+{w}_{2}+{w}_{3}=1$ and ${w}_{1},{w}_{2},{w}_{3}\in [0,1]$. If $totalsim>{\epsilon}_{s}$, the two tourists are identified as traveling companions; otherwise they are not. Because travel behaviors are complex, it is not easy to choose proper values for ${w}_{1},{w}_{2},{w}_{3},{\epsilon}_{s}$; in our work we analyze four sets of weights with different thresholds.

In the second method, we apply safe semi-supervised support vector machines (S4VMs), a semi-supervised learning algorithm proposed in [45], to identify the traveling companions after manually labeling some pairs of tourists. The algorithm uses unlabeled data to improve classification performance when labeled data are limited.
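The threshold-based decision can be sketched as follows. The weights and threshold in the usage example are arbitrary placeholders chosen for illustration, not the settings analyzed in this work, and the function names are hypothetical.

```python
def total_sim(sim, weights):
    """Weighted sum of the similarity vector (Tsim, Asim, Nsim)."""
    if abs(sum(weights) - 1.0) > 1e-9 or any(not 0 <= w <= 1 for w in weights):
        raise ValueError("weights must lie in [0, 1] and sum to 1")
    return sum(w * s for w, s in zip(weights, sim))


def is_companion(sim, weights, eps_s):
    """True if the weighted similarity exceeds the threshold eps_s."""
    return total_sim(sim, weights) > eps_s
```

For instance, with the similarity vector (0.5, 1.0, 0.5) and placeholder weights (0.5, 0.25, 0.25), the total similarity is 0.625, so the pair passes a threshold of 0.6 but not one of 0.7.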

The traveling companions within the same candidate group are identified as a real group. For example, if in a candidate group $\{a,b,c,d\}$ we find two pairs of traveling companions, namely $(a,b)$ and $(b,c)$, then we consider the real group to be $\{a,b,c\}$.
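One way to realize this merging step is to treat companion pairs as edges and take the connected components within each candidate group, as in the sketch below. This transitive merging reproduces the example above, but it is an assumption about the general case: for instance, two disjoint pairs in one candidate group would yield two separate groups here. The function name `real_groups` is hypothetical.

```python
def real_groups(candidate, companion_pairs):
    """Merge companion pairs inside one candidate group into real groups
    using a union-find structure; members with no companion are dropped."""
    parent = {m: m for m in candidate}

    def find(m):
        # Follow parent links to the root, compressing the path on the way.
        while parent[m] != m:
            parent[m] = parent[parent[m]]
            m = parent[m]
        return m

    for a, b in companion_pairs:
        parent[find(a)] = find(b)

    components = {}
    for m in candidate:
        components.setdefault(find(m), set()).add(m)
    return [group for group in components.values() if len(group) > 1]
```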